Several ways to add columns to a Dataset
First, create a DataFrame to work with:
scala> spark.version
res0: String = 2.2.0.cloudera1
scala> val df = spark.createDataset(Seq(("key1", 23, 1.0), ("key1", 10, 2.0))).toDF("id", "rsrp", "rsrq")
df: org.apache.spark.sql.DataFrame = [id: string, rsrp: int ... 1 more field]
scala> df.show
+----+----+----+
| id|rsrp|rsrq|
+----+----+----+
|key1| 23| 1.0|
|key1| 10| 2.0|
+----+----+----+
scala> df.printSchema
root
|-- id: string (nullable = true)
|-- rsrp: integer (nullable = false)
|-- rsrq: double (nullable = false)
Method 1: add a constant (literal) column with lit()
The literal can be a string, an integer, and so on.
scala> df.withColumn("sinurl", lit(12)).show
+----+----+----+------+
| id|rsrp|rsrq|sinurl|
+----+----+----+------+
|key1| 23| 1.0| 12|
|key1| 10| 2.0| 12|
+----+----+----+------+
scala> df.withColumn("type", lit("mr")).show
+----+----+----+----+
| id|rsrp|rsrq|type|
+----+----+----+----+
|key1| 23| 1.0| mr|
|key1| 10| 2.0| mr|
+----+----+----+----+
Note:
lit() is a built-in Spark function; it requires import org.apache.spark.sql.functions._
Since 1.3.0
def lit(literal: Any): Column
Creates a Column of literal value. The passed in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created to represent the literal value.
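lit() can also be combined with cast(), for example to add a column of typed nulls; a minimal sketch (the column name "note" is illustrative):
df.withColumn("note", lit(null).cast("string"))  // adds a string column whose values are all null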
Method 2: derive a new column by transforming an existing column
scala> df.withColumn("rsrp2", $"rsrp"*2).show
+----+----+----+-----+
| id|rsrp|rsrq|rsrp2|
+----+----+----+-----+
|key1| 23| 1.0| 46|
|key1| 10| 2.0| 20|
+----+----+----+-----+
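A transformation can also combine several existing columns; a minimal sketch using the same DataFrame (the comment gives the values produced for the two rows above):
df.withColumn("rsrp_plus_rsrq", $"rsrp" + $"rsrq")  // 24.0 and 12.0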
Method 3: add columns with select() and UDFs
The Java snippet below registers two UDFs and applies them via withColumn() and callUDF(); the Scala examples that follow add columns with select().
import static org.apache.spark.sql.functions.col;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
...
private final SimpleDateFormat srcSdf = new SimpleDateFormat("yyyy-MM-dd HH:00:00");
private final SimpleDateFormat destSdf = new SimpleDateFormat("yyyy-MM-dd 00:00:00");
public Dataset<Row> handler(Dataset<Row> esDataset) {
    // Truncate a "yyyy-MM-dd HH:00:00" string to the start of its day.
    UDF1<String, String> date_format = new UDF1<String, String>() {
        private static final long serialVersionUID = 1L;
        public String call(final String value) throws Exception {
            Date date = srcSdf.parse(value);
            return destSdf.format(date);
        }
    };
    sparkSession.udf().register("date_format_func", date_format, DataTypes.StringType);

    // Truncate a timestamp to the start of its day and return it as a long
    // (assumes the "timestamp" column holds epoch milliseconds).
    UDF1<Long, Long> to_long = new UDF1<Long, Long>() {
        private static final long serialVersionUID = 1L;
        public Long call(final Long value) throws Exception {
            Date date = new Date(value);
            return destSdf.parse(destSdf.format(date)).getTime();
        }
    };
    sparkSession.udf().register("to_long_func", to_long, DataTypes.LongType);

    esDataset = esDataset.withColumn("scan_start_time", functions.callUDF("date_format_func", col("scan_start_time")));
    esDataset = esDataset.withColumn("scan_stop_time", functions.callUDF("date_format_func", col("scan_stop_time")));
    esDataset = esDataset.withColumn("timestamp", functions.callUDF("to_long_func", col("timestamp")));
    return esDataset;
}
...
In Scala, the same can be done with select() and an inline UDF:
scala> import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.DataTypes
scala> df.select(col("*"),
| udf{
| (e:Int) =>
| if (e == 23) {
| 1
| } else {
| 2
| }
| }.apply(df("rsrp")).cast(DataTypes.DoubleType).as("rsrp_udf")
| ).show
+----+----+----+--------+
| id|rsrp|rsrq|rsrp_udf|
+----+----+----+--------+
|key1| 23| 1.0| 1.0|
|key1| 10| 2.0| 2.0|
+----+----+----+--------+
scala> df.select(col("*"),
| when(df("rsrp") > 10, lit(">10")).when(df("rsrp") === 10, "=10").otherwise("<10").as("rsrp_compare10")
| ).show
+----+----+----+--------------+
| id|rsrp|rsrq|rsrp_compare10|
+----+----+----+--------------+
|key1| 23| 1.0| >10|
|key1| 10| 2.0| =10|
+----+----+----+--------------+
Method 4: nest a UDF inside a case when expression
df.withColumn("r",
when($"rsrp".isNull, lit(null))
.otherwise(udf1($"rsrp"))
.cast(DataTypes.IntegerType)
)
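A self-contained sketch of the same pattern, with udf1 defined as a hypothetical UDF (here it simply doubles the value; replace it with real logic):
import org.apache.spark.sql.functions.{lit, udf, when}
import org.apache.spark.sql.types.DataTypes

val udf1 = udf { (rsrp: Int) => rsrp * 2 }  // hypothetical UDF standing in for whatever udf1 really does
df.withColumn("r",
  when($"rsrp".isNull, lit(null))            // keep null when the source column is null
    .otherwise(udf1($"rsrp"))                // otherwise apply the UDF
    .cast(DataTypes.IntegerType)
)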
Method 5: use expr()
scala> df.withColumn("rsrp4", expr("rsrp * 4")).show
+----+----+----+-----+
| id|rsrp|rsrq|rsrp4|
+----+----+----+-----+
|key1| 23| 1.0| 92|
|key1| 10| 2.0| 40|
+----+----+----+-----+
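expr() parses any Spark SQL expression string, and selectExpr() is a shortcut for select + expr; a minimal sketch (the new column names are illustrative):
df.withColumn("level", expr("case when rsrp > 10 then 'high' else 'low' end"))
df.selectExpr("*", "rsrp * 4 as rsrp4")  // same result as the withColumn + expr example above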
Dropping columns from a Dataset
scala> df.drop("rsrp").show
+----+----+
| id|rsrq|
+----+----+
|key1| 1.0|
|key1| 2.0|
+----+----+
scala> df.drop("rsrp","rsrq").show
+----+
| id|
+----+
|key1|
|key1|
+----+
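drop() also accepts a Column reference instead of a name; a minimal sketch:
df.drop(df("rsrp"))  // same effect as df.drop("rsrp")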
Replacing null values in a Dataset
First, there is a test file on HDFS at /user/spark/test.csv:
[spark@master ~]$ hadoop fs -text /user/spark/test.csv
key1,key2,key3,key4,key5
aaa,1,2,t1,4
bbb,5,3,t2,8
ccc,2,2,,7
,7,3,t1,
bbb,1,5,t3,0
,4,,t1,8
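The nulls in a DataFrame loaded from this file can then be replaced with the na functions; a minimal sketch, assuming the file is read with a header row (the fill values below are illustrative):
val csv = spark.read
  .option("header", "true")
  .option("nullValue", "")   // treat empty fields as null
  .csv("/user/spark/test.csv")

// fill nulls per column, or drop rows that contain any null
csv.na.fill(Map("key1" -> "unknown", "key4" -> "t0", "key5" -> "0")).show
csv.na.drop().show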