[Compute Engine] Some Methods and Code for Big Data Processing

spark

1 Running totals over history

1.1 Spark SQL: accumulate historical data with a window function

sqlContext.sql(
"""
  select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration
  from userlogs_date
""").show

1.2 Use the over function provided by Column, passing in a window spec

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._   // for sum()
import sqlContext.implicits._             // for the $"" column syntax

val first_2_now_window = Window.partitionBy("pcode").orderBy("event_date")
df_userlogs_date.select(
     $"pcode",
     $"event_date",
     sum($"duration").over(first_2_now_window).as("sum_duration")
).show
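
Both snippets above assume that the userlogs_date table and the df_userlogs_date DataFrame already exist. A minimal sketch for building such test data (the column names and sample values below are assumptions inferred from the queries) could be:

import sqlContext.implicits._

// assumed schema: pcode (string), event_date (string), duration (long)
val df_userlogs_date = Seq(
  ("p1", "2017-01-01", 10L),
  ("p1", "2017-01-02", 20L),
  ("p1", "2017-01-03", 30L),
  ("p2", "2017-01-01", 5L)
).toDF("pcode", "event_date", "duration")

df_userlogs_date.registerTempTable("userlogs_date")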

1.3 Running totals over a range

  • 1.3.1 Accumulate the entire history:
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue,0)
Window.partitionBy("pcode").orderBy("event_date")

The four forms above are exactly equivalent.

  • 1.3.2 Accumulate from N days back up to the current row, e.g. N=3
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and current row) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3,0) 
  • 1.3.3 Accumulate N days before and M days after, e.g. N=3, M=5
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and 5 following ) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3,5)
  • 1.3.4 Accumulate all rows in the partition (a runnable sketch follows this list)
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following ) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue,Long.MaxValue)
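
To see these frame specifications in action, here is a minimal sketch (reusing the assumed df_userlogs_date test data built above) that applies the rolling window from 1.3.2:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import sqlContext.implicits._

// sum over the 3 preceding rows plus the current row, per pcode
val last_3_window = Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3, 0)

df_userlogs_date
  .withColumn("sum_duration_4rows", sum($"duration").over(last_3_window))
  .show()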

2 Totals across all groups

2.1 Spark SQL: add ALL-level totals with rollup

sqlContext.sql(
"""
  select pcode,event_date,sum(duration) as sum_duration
  from userlogs_date_1
  group by pcode,event_date with rollup
  order by pcode,event_date
""").show()


3 Rows to columns -> pivot

val userlogs_date_all = sqlContext.sql("select dcode, pcode,event_date,sum(duration) as duration from userlogs group by dcode, pcode,event_date ")
userlogs_date_all.registerTempTable("userlogs_date_all")
val dates = userlogs_date_all.select($"event_date").map(row => row.getAs[String]("event_date")).distinct().collect().toList
userlogs_date_all.groupBy($"dcode", $"pcode").pivot("event_date", dates).sum("duration").na.fill(0).show
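
For a self-contained illustration of pivot, a sketch on a small inline DataFrame (columns assumed to mirror the query above):

import sqlContext.implicits._

val sample = Seq(
  ("d1", "p1", "2017-01-01", 10L),
  ("d1", "p1", "2017-01-02", 20L),
  ("d2", "p2", "2017-01-01", 5L)
).toDF("dcode", "pcode", "event_date", "duration")

// one output column per distinct event_date; missing cells filled with 0
val pivot_dates = sample.select("event_date").distinct().collect().map(_.getString(0)).toSeq
sample.groupBy("dcode", "pcode").pivot("event_date", pivot_dates).sum("duration").na.fill(0).show()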

4 Handling null values

4.1 Fill nulls in specified columns

 val res2=data1.na.fill(value="wangxiao111",cols=Array("gender","yearsmarried") )

 val res3=data1.na.fill(Map("gender"->"wangxiao222","yearsmarried"->"wangxiao567") )

4.2 Drop rows with fewer than 10 non-null, non-NaN values in the given columns

data1.na.drop(10, Array("gender", "yearsmarried"))
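
A few other common variants of the na functions, sketched against the same data1 DataFrame (the numeric column age below is assumed purely for illustration):

// drop rows containing any null/NaN in any column
data1.na.drop()

// drop rows only when all of the listed columns are null/NaN
data1.na.drop("all", Array("gender", "yearsmarried"))

// fill a numeric column with a constant (assumes a numeric column named age)
data1.na.fill(0.0, Array("age"))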

5 Scala: custom UDFs

5.1 Using UDFs, method one

// required imports
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{udf, col, struct}

val separator = ","
def mergeCols(merge_row: Row): String = {
  merge_row.toSeq.foldLeft("")(_ + separator + _).substring(1)
}
// wrap the function as a UDF
val mergeColsUDF = udf(mergeCols _)
df.select($"col4", $"col5", mergeColsUDF(struct($"col1", $"col2", $"col3")).as("value")).show()

Or:

import spark.implicits._
// a toy UDF that tags a record based on two columns
def findHentai(sex: String, dressing: String): String = {
  if (sex == "boy" && dressing == "裙子") "变态" else "正常"
}
spark.udf.register("find_hentai", findHentai _)
spark.sql(s"select id,sex,dressing,find_hentai(sex,dressing) as tag from boys_and_girls")

5.2 Using UDFs, method two

// use callUDF
import org.apache.spark.sql.functions.callUDF

def period_classification(a: Int): String = {
  var period_cla = "morning_period"
  if (a >= 6 && a < 12) { period_cla = "morning_period" }
  else if (a >= 12 && a < 13) { period_cla = "noon_period" }
  else if (a >= 13 && a < 20) { period_cla = "afternoon_period" }
  else if (a >= 20 || a < 6) { period_cla = "night_period" }
  period_cla
}
import spark.implicits._
spark.udf.register("period_classification", period_classification _)

val calllogo_2 = calllogo.withColumn("call_period", callUDF("period_classification", $"inivrdt_period"))

Or:

 val df2 = df1.select(callUDF("findNewWord", $"text") as "combineString",$"save_time" as "case_date").where("combineString is not null")

5.3 Custom UDAFs

  • Besides UDFs, which process data one row at a time, another common need is the UDAF, a user-defined aggregate function that aggregates across many rows. In RDD terms it is the difference between the map and reduce operators.
    To write a custom UDAF, extend org.apache.spark.sql.expressions.UserDefinedAggregateFunction and implement the 8 methods of that interface.

As an example, let's write a mode (most-frequent-value) aggregate function. Aggregate statistics are often thrown off by a single dirty record. When aggregating app logs that carry an id and an ip, one id should in principle map to one ip; but if one record among many has a wrong or empty ip, the group by yields two rows, and aggregating the ip with max or min is not really reasonable either. Instead we can take a vote and keep the majority value, so each device aggregates to a single row.

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._


/**
  * Custom aggregate function: mode (the most frequent value in a column)
  */
class UDAFGetMode extends UserDefinedAggregateFunction{
  override def inputSchema: StructType = {
    StructType(StructField("inputStr",StringType,true):: Nil)
  }

  override def bufferSchema: StructType = {
    StructType(StructField("bufferMap",MapType(keyType = StringType,valueType = IntegerType),true):: Nil)
  }

  override def dataType: DataType = StringType

  override def deterministic: Boolean = false

  // initialize the buffer map
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = scala.collection.immutable.Map[String,Int]()
  }

  // if the key already exists, increment its count; otherwise insert it with count 1
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val key = input.getAs[String](0)
    val immap = buffer.getAs[scala.collection.immutable.Map[String,Int]](0)
    val bufferMap = scala.collection.mutable.Map[String,Int](immap.toSeq: _*)
    val ret = if (bufferMap.contains(key)){
      val new_value = bufferMap.get(key).get + 1
      bufferMap.put(key,new_value)
      bufferMap
    }else{
      bufferMap.put(key,1)
      bufferMap
    }
    buffer.update(0,ret)
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    // merge the two maps, summing the counts of matching keys
    //http://www.cnblogs.com/tugeler/p/5134862.html
    buffer1.update(0,( buffer1.getAs[scala.collection.immutable.Map[String,Int]](0) /: buffer2.getAs[scala.collection.immutable.Map[String,Int]](0) ) { case (map, (k,v)) => map + ( k -> (v + map.getOrElse(k, 0)) ) })
  }

  override def evaluate(buffer: Row): Any = {
    // return the key with the largest count
    var max_value = 0
    var max_key = ""
    buffer.getAs[scala.collection.immutable.Map[String,Int]](0).foreach{
      x =>
        val key = x._1
        val value = x._2
        if (value > max_value) {
          max_value = value
          max_key = key
        }
    }
    max_key
  }
}

  • Using the UDAF
spark.udf.register("get_mode", new com.meitu.utils.spark.udaf.UDAFGetMode)

import spark.implicits._

val df = Seq(
  (1, "10.10.1.1", "start"),
  (1, "10.10.1.1", "search"),
  (2, "123.123.123.1", "search"),
  (1, "10.10.1.0", "stop"),
  (2, "123.123.123.1", "start")
).toDF("id", "ip", "action")

df.createOrReplaceTempView("tb")

spark.sql(s"select id,get_mode(ip) as u_ip,count(*) as cnt from tb group by id").show()

5.4 hive udf/udaf

  • UDFs/UDAFs written in Java for Hive also work in Spark SQL: just put the hive-udf jar on the classpath and register the function as shown below.
spark.sql("CREATE TEMPORARY FUNCTION UDAFAll AS 'com.meitu.utils.hive.udaf.UDAFAll' ")

spark.sql(s"select id,UDAFAll(action) from tb group by id").show()

pyspark

1 UDF

1.1 Using pandas_udf

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pandas_udf example").getOrCreate()

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

# grouped-map pandas UDF: normalize v within each id group
@F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())

df.groupby("id").apply(normalize).show()

1.2 udf

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python UDF example").getOrCreate() 

df = spark.read.json("temperatures.json")
df.createOrReplaceTempView("citytemps")

# Register the UDF with our SparkSession 
spark.udf.register("CTOF", lambda degreesCelsius: ((degreesCelsius * 9.0 / 5.0) + 32.0))

spark.sql("SELECT city, CTOF(avgLow) AS avgLowF, CTOF(avgHigh) AS avgHighF FROM citytemps").show()

1.3 udaf

As of Spark 2.0, UDAFs can only be defined in Scala or Java.
A UDAF must extend the class UserDefinedAggregateFunction, so it has to be written in Scala or Java and then called from PySpark.

1.4 hive udf/udaf

sqlContext.sql("CREATE TEMPORARY FUNCTION UDAFAll AS 'com.meitu.utils.hive.udaf.UDAFAll'")

sqlContext.sql(s"select id,UDAFAll(action) from tb group by id").show()

1.5 Scala UDAF from Python

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Scala UDAF from Python example").getOrCreate()

df = spark.read.json("inventory.json")
df.createOrReplaceTempView("inventory")

spark.sparkContext._jvm.com.cloudera.fce.curtis.sparkudfexamples.scalaudaffrompython.ScalaUDAFFromPythonExample.registerUdf()

spark.sql("SELECT Make, SUMPRODUCT(RetailValue,Stock) as InventoryValuePerMake FROM inventory GROUP BY Make").show()