Intro
A DataFrame has a column stored as a String in the format "yyyyMMdd", and we need to convert it to a "timestamp". There are many ways to do this (UDFs and so on); here is a relatively simple one.
Constructing the data
import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.sql.expressions.Window
val df = Seq(
  ("A1", 25, 1, 0.64, 0.36, "20200101"),
  ("A1", 26, 1, 0.34, 0.66, "20200102"),
  ("B1", 27, 0, 0.55, 0.45, "20200103"),
  ("C1", 30, 0, 0.14, 0.86, "20200104")
).toDF("id", "age", "label", "pro0", "pro1", "dateStr")
df.printSchema()
df.show()
Initializing Scala interpreter ...
Spark Web UI available at http://DESKTOP-LAO32FQ:4043
SparkContext available as 'sc' (version = 2.4.4, master = local[*], app id = local-1583251961417)
SparkSession available as 'spark'
root
|-- id: string (nullable = true)
|-- age: integer (nullable = false)
|-- label: integer (nullable = false)
|-- pro0: double (nullable = false)
|-- pro1: double (nullable = false)
|-- dateStr: string (nullable = true)
+---+---+-----+----+----+--------+
| id|age|label|pro0|pro1| dateStr|
+---+---+-----+----+----+--------+
| A1| 25| 1|0.64|0.36|20200101|
| A1| 26| 1|0.34|0.66|20200102|
| B1| 27| 0|0.55|0.45|20200103|
| C1| 30| 0|0.14|0.86|20200104|
+---+---+-----+----+----+--------+
import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.sql.expressions.Window
df: org.apache.spark.sql.DataFrame = [id: string, age: int ... 4 more fields]
Column type conversion
`unix_timestamp` parses the string with the given pattern into seconds since the epoch, and the `cast` turns that into a timestamp. After the conversion, the hour, minute, and second are all 0.
df.withColumn("date", unix_timestamp(col("dateStr"), "yyyyMMdd").cast("timestamp")).show()
+---+---+-----+----+----+--------+-------------------+
| id|age|label|pro0|pro1| dateStr| date|
+---+---+-----+----+----+--------+-------------------+
| A1| 25| 1|0.64|0.36|20200101|2020-01-01 00:00:00|
| A1| 26| 1|0.34|0.66|20200102|2020-01-02 00:00:00|
| B1| 27| 0|0.55|0.45|20200103|2020-01-03 00:00:00|
| C1| 30| 0|0.14|0.86|20200104|2020-01-04 00:00:00|
+---+---+-----+----+----+--------+-------------------+
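Since Spark 2.2, `to_timestamp` and `to_date` accept a format pattern directly, so the same conversion can be written without the `unix_timestamp`/`cast` round trip. A minimal sketch against the `df` built above (column names `ts`, `d`, and `back` are illustrative):

```scala
df.withColumn("ts", to_timestamp(col("dateStr"), "yyyyMMdd"))   // TimestampType, time-of-day is 00:00:00
  .withColumn("d", to_date(col("dateStr"), "yyyyMMdd"))         // DateType, no time-of-day at all
  .withColumn("back", date_format(col("ts"), "yyyyMMdd"))       // format the timestamp back into the original string
  .show()
```

If a row's string does not match the pattern, these functions return null rather than throwing, which matches the behavior of `unix_timestamp`. Prefer `to_date` when only the calendar day matters, since it avoids carrying a meaningless 00:00:00 component.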
2020-03-04, Qixia District, Nanjing