Scala Spark DataFrame: modifying a column with a UDF's return value

I have a Spark DataFrame which has a timestamp field, and I want to convert it to the long datatype. I used a UDF, and the standalone code works fine, but when I plug it into generic logic where any timestamp will need to be converted, I am not able to get it working. The issue is: how can I assign the return value from the UDF back to the DataFrame column?

Below is the code snippet:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test3").getOrCreate()

val sqlContext = spark.sqlContext

val df2 = sqlContext.jsonRDD(spark.sparkContext.parallelize(Array(
  """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""",
  """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""",
  """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}"""
)))

// UDF that converts a timestamp to epoch milliseconds
val convertTimeStamp = udf { (manTs: java.sql.Timestamp) =>
  manTs.getTime
}

df2.withColumn("manufacture_ts", convertTimeStamp(df2("manufacture_ts"))).show

+-----+----------+-----+--------------+-----+----+
|blank|   comment| make|manufacture_ts|model|year|
+-----+----------+-----+--------------+-----+----+
|     |No Comment|Tesla| 1508126400000|    S|2012|
|     |   Get one| Ford| 1508126400000| E350|1997|
|     |          |Chevy| 1508126400000| Volt|2015|
+-----+----------+-----+--------------+-----+----+
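
Calling withColumn with the name of an existing column, as above, replaces that column with the values the UDF returns; that is how the UDF's return value gets assigned back to the DataFrame column.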

Now I want to invoke this generically from a DataFrame, so it is called on every column and converts those of type timestamp to long:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object Test4 extends App {

  val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test").getOrCreate()

  import spark.implicits._

  val long: Long = "1508299200000".toLong

  val data = Seq(Row("10000020_LUX_OTC", long, "2020-02-14"))

  val schema = List(StructField("rowkey", StringType, true),
    StructField("order_receipt_dt", LongType, true),
    StructField("maturity_dt", StringType, true))

  val dataDF = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))

  // fold over the schema, passing each column through DataTypeUtil.transformLong
  val modifiedDf2 = schema.foldLeft(dataDF) { case (newDF, StructField(name, dataType, flag, metadata)) =>
    newDF.withColumn(name, DataTypeUtil.transformLong(newDF, name, dataType.typeName))
  }

  modifiedDf2.show
}

object DataTypeUtil {

  import org.apache.spark.sql.{Column, DataFrame}
  import org.apache.spark.sql.functions._

  // UDF that converts a timestamp to epoch milliseconds
  val convertTimeStamp = udf { (manTs: java.sql.Timestamp) =>
    manTs.getTime
  }

  // returns a converted Column for timestamp fields, the original Column otherwise
  def transformLong(dataFrame: DataFrame, name: String, fieldType: String): Column = {
    fieldType.toLowerCase match {
      case "timestamp" => convertTimeStamp(dataFrame(name))
      case _           => dataFrame.col(name)
    }
  }
}
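
Note that the sample data in Test4 has no column of type timestamp, so transformLong returns every column unchanged. Below is a minimal sketch of the same fold applied to a frame that does contain a timestamp column, assuming the DataTypeUtil object above (the column name event_ts is hypothetical):

import spark.implicits._

// hypothetical data: event_ts is a TimestampType column, so the fold rewrites it
val withTs = Seq(("10000020_LUX_OTC", java.sql.Timestamp.valueOf("2017-10-16 00:00:00")))
  .toDF("rowkey", "event_ts")

val converted = withTs.schema.fields.foldLeft(withTs) { (df, f) =>
  df.withColumn(f.name, DataTypeUtil.transformLong(df, f.name, f.dataType.typeName))
}

converted.show() // event_ts now holds epoch milliseconds (LongType)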

Solution

Maybe your UDF crashed because the timestamp is null. You can:

use unix_timestamp instead of the UDF, or make your UDF null-safe (see the sketch below), and

only apply it on the fields which actually need to be converted.
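
A minimal sketch of the null-safe variant, relying on Spark's handling of Option return types in Scala UDFs (None becomes a SQL null):

import org.apache.spark.sql.functions.udf

// null-safe variant of convertTimeStamp: a null input yields null
// instead of a NullPointerException
val convertTimeStampSafe = udf { (manTs: java.sql.Timestamp) =>
  Option(manTs).map(_.getTime)
}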

Given the data:

import java.sql.Timestamp
import java.time.LocalDateTime

import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType

val df = Seq(
  (1L, Timestamp.valueOf(LocalDateTime.now()), Timestamp.valueOf(LocalDateTime.now()))
).toDF("id", "ts1", "ts2")

you can do:

// keep only the timestamp columns, then fold over their names,
// replacing each with its unix timestamp (seconds since the epoch)
val newDF = df.schema.fields.filter(_.dataType == TimestampType).map(_.name)
  .foldLeft(df)((df, field) => df.withColumn(field, unix_timestamp(col(field))))

newDF.show()

which gives:

+---+----------+----------+

| id| ts1| ts2|

+---+----------+----------+

| 1|1589109282|1589109282|

+---+----------+----------+
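
One caveat: unix_timestamp returns seconds since the epoch, while Timestamp.getTime in the original UDF returns milliseconds (compare 1589109282 here with 1508126400000 earlier). If you need milliseconds, you can scale the result, e.g.:

// multiply by 1000 to get epoch milliseconds; note unix_timestamp has
// already truncated sub-second precision
val msDF = df.schema.fields.filter(_.dataType == TimestampType).map(_.name)
  .foldLeft(df)((df, field) => df.withColumn(field, unix_timestamp(col(field)) * 1000))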
