Spark: Hands-On Notes

Alien_lily

Tags: spark

Article link: https://blog.csdn.net/Alien_lily/article/details/82021974

Get the current date

  import java.text.SimpleDateFormat
  import java.util.Date

  // Returns today's date as a yyyy-MM-dd string.
  def getNowDate(): String = {
    val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
    dateFormat.format(new Date())
  }

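Get an earlier date

  import java.text.SimpleDateFormat
  import java.util.{Calendar, Date}

  // Returns the date `days` days before today as a yyyy-MM-dd string.
  def getPreday(days: Int): String = {
    val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
    val cal = Calendar.getInstance()
    cal.setTime(new Date())
    cal.add(Calendar.DATE, -days)
    dateFormat.format(cal.getTime)
  }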
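
As an alternative sketch, the same two helpers can be written with java.time, whose formatters are immutable and thread-safe, unlike SimpleDateFormat. The names isoDay, todayString and daysAgoString, and the optional today parameter (added so the result can be checked against a fixed date), are hypothetical and not part of the original notes.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical shared formatter; DateTimeFormatter is safe to reuse across threads.
val isoDay: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")

// Formats the given date (today by default) as yyyy-MM-dd.
def todayString(today: LocalDate = LocalDate.now()): String =
  today.format(isoDay)

// Formats the date `days` days before `today` as yyyy-MM-dd.
def daysAgoString(days: Int, today: LocalDate = LocalDate.now()): String =
  today.minusDays(days.toLong).format(isoDay)
```

Calling daysAgoString(1) with no second argument gives yesterday's date relative to the system clock.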

Date conversion: turn 2018-07-03 into 20180703

start_date.toString.replace("-", "")
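
The replace call accepts any string, whether or not it is a valid date. If validation matters, one option is to parse and re-format with java.time, which throws on malformed input. This is only a sketch, and compactDate is a hypothetical name.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Parses an ISO yyyy-MM-dd string and re-emits it as yyyyMMdd.
// LocalDate.parse throws DateTimeParseException on malformed input.
def compactDate(isoDate: String): String =
  LocalDate.parse(isoDate).format(DateTimeFormatter.BASIC_ISO_DATE)
```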

Cast a column to another data type

import org.apache.spark.sql.functions._
result = result.withColumn("MAX_DEC", col("MAX_DEC").cast("int"))

Conditional logic

var black_radius = 20
var geohash_prec = ""
if (black_radius == 20) {
  geohash_prec = "geohash_8"
} else {
  geohash_prec = "geohash_7"
}
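
Because if/else is an expression in Scala, the same choice can be bound directly to a val with no mutable placeholder. A minimal sketch, where geohashPrecFor is a hypothetical helper name:

```scala
// Picks the geohash precision column name for a given radius.
// Only the value 20 maps to "geohash_8", as in the snippet above.
def geohashPrecFor(blackRadius: Int): String =
  if (blackRadius == 20) "geohash_8" else "geohash_7"

val blackRadius = 20
val geohashPrec = geohashPrecFor(blackRadius)
```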

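Rank areas whose total number of black-spot events is at least 20, from largest to smallest (the snippet only does the ranking; the >= 20 filter is not shown)

import org.apache.spark.sql.expressions.Window
val w6 = Window.partitionBy("corp_id", "line_id", "stop_dec_time", "max_dec", "date_range", "BLACK_RADIUS").orderBy(col("BLACK_NUM").desc_nulls_last)
val DANGER_R = result.withColumn("order", row_number().over(w6))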

String concatenation
Example result: 20180703-20180705

high_time_period = high_time_period.withColumn("HIGH_PERIODS", concat(col("begin_time"), lit("-"), col("end_time")))

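Label rows by level: the top 3 ranked rows are "高" (high), the bottom 3 are "低" (low), and all other rows are "中" (medium)
Sort one column from largest to smallest and label each row's severity by rank (the original note mentions selecting the top 20 rows; that selection is not shown here), for example:

val w = Window.partitionBy("col1", "col2", "col3", "col4", "col5", "col6").orderBy(col("NUM").desc_nulls_last) // sort descending, nulls last
var DANGER_R = result.withColumn("order", row_number().over(w))
// Largest rank in each group, needed to identify the bottom 3 rows
val max_order = DANGER_R.groupBy("col1", "col2", "col3", "col4", "col5", "col6").agg(max("order").as("max_order"))
DANGER_R = DANGER_R.join(max_order, Seq("col1", "col2", "col3", "col4", "col5", "col6"), "left")
// "高" = high, "低" = low, "中" = medium; the first matching when clause wins
DANGER_R = DANGER_R.withColumn("DANGER_R", when(col("order") <= 3, "高").when(col("order") >= col("max_order") - 2, "低").otherwise("中"))
DANGER_R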
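
The labelling rule can be checked without a cluster. The sketch below applies the same rule to a plain Scala sequence standing in for a single partition. labelByRank is a hypothetical helper, not Spark API. Like the Spark expression, it tests the "高" condition first, so a group with fewer than six rows gets "高" for any rank that would qualify for both labels.

```scala
// Hypothetical helper: ranks values in descending order (1-based, like
// row_number) and labels the top 3 "高", the bottom 3 "低", the rest "中".
def labelByRank(nums: Seq[Int]): Seq[(Int, Int, String)] = {
  val sorted = nums.sorted(Ordering[Int].reverse)
  val maxOrder = sorted.length
  sorted.zipWithIndex.map { case (num, idx) =>
    val order = idx + 1
    val label =
      if (order <= 3) "高"                  // checked first, as in the Spark version
      else if (order >= maxOrder - 2) "低"  // last three ranks
      else "中"
    (num, order, label)
  }
}
```

Each result tuple holds the value, its rank, and its label, which corresponds to the NUM, order, and DANGER_R columns.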

Join on a condition

df = df.join(time_period, col("rcrd_time2") < col("END_TIME") && col("rcrd_time2") >= col("BEGIN_TIME"), "left").drop("rcrd_time2")
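
The join condition is a half-open interval test: BEGIN_TIME is inclusive and END_TIME is exclusive. The sketch below shows the same test on plain Scala values. The periods sequence and matchPeriod are hypothetical stand-ins for the time_period table and the left join, and the zero-padded HH:mm format is an assumption.

```scala
// Hypothetical stand-in for time_period: (BEGIN_TIME, END_TIME) as HH:mm strings.
val periods = Seq(("07:00", "09:00"), ("17:00", "19:00"))

// Left-join semantics for one value: Some(period) on a match, None otherwise.
// Zero-padded HH:mm strings compare correctly as plain strings.
def matchPeriod(t: String): Option[(String, String)] =
  periods.find { case (begin, end) => t >= begin && t < end }
```

A value equal to END_TIME does not match its period, so adjacent periods never both claim the same instant.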

Extract a substring

df = df.withColumn("rcrd_time2", substring(col("rcrd_time"), 12, 5)) // 5 characters starting at position 12 (positions are 1-based)
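
Spark's substring takes a 1-based start position and a length, whereas Scala's own string methods are 0-based. The sketch below mimics the Spark behaviour for positive positions only, using a hypothetical helper sparkStyleSubstring. The sample value assumes rcrd_time is in yyyy-MM-dd HH:mm:ss form, which the original notes do not state.

```scala
// Hypothetical helper mimicking Spark's substring(str, pos, len) for pos >= 1:
// pos is 1-based and len is a character count.
def sparkStyleSubstring(s: String, pos: Int, len: Int): String = {
  require(pos >= 1, "this sketch only models 1-based positive positions")
  s.slice(pos - 1, pos - 1 + len)
}

// Assumed sample value in yyyy-MM-dd HH:mm:ss form.
val rcrdTime = "2018-07-03 08:15:30"
val hourMinute = sparkStyleSubstring(rcrdTime, 12, 5) // the HH:mm part
```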

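Spark: convert a timestamp to a date string with from_unixtime, and a date to a timestamp with unix_timestamp

sc_data = sc_data.withColumn("Time", unix_timestamp(col("end_time")) - 1) // end_time in epoch seconds, minus one second
sc_data = sc_data.withColumn("start_time", from_unixtime(col("Time"))).drop("Time") // add a start_time column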
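
Taken together, the two lines set start_time to one second before end_time. The sketch below shows the same arithmetic on a single value with java.time. It assumes the yyyy-MM-dd HH:mm:ss format that both Spark functions use by default, and it ignores time zones (Spark converts through the session time zone, so results can differ around daylight-saving changes). fmt and oneSecondBefore are hypothetical names.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Assumed input/output format, matching the Spark default pattern.
val fmt: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Hypothetical helper: the instant one second before endTime, in the same format.
def oneSecondBefore(endTime: String): String =
  LocalDateTime.parse(endTime, fmt).minusSeconds(1).format(fmt)
```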
