Spark: Ways to Add Columns to a Dataset, Drop Columns, and Replace null Values

Several ways to add columns to a Dataset

First, create a DataFrame:

scala> spark.version
res0: String = 2.2.0.cloudera1

scala> val df = spark.createDataset(Seq(("key1", 23, 1.0), ("key1", 10, 2.0))).toDF("id", "rsrp", "rsrq")
df: org.apache.spark.sql.DataFrame = [id: string, rsrp: int ... 1 more field]

scala> df.show
+----+----+----+
|  id|rsrp|rsrq|
+----+----+----+
|key1|  23| 1.0|
|key1|  10| 2.0|
+----+----+----+

scala> df.printSchema
root
 |-- id: string (nullable = true)
 |-- rsrp: integer (nullable = false)
 |-- rsrq: double (nullable = false)

Method 1: add a constant (fixed value) column with lit()

The literal can be a string, an integer, or another supported type.

scala> df.withColumn("sinurl", lit(12)).show 
+----+----+----+------+
|  id|rsrp|rsrq|sinurl|
+----+----+----+------+
|key1|  23| 1.0|    12|
|key1|  10| 2.0|    12|
+----+----+----+------+

scala> df.withColumn("type", lit("mr")).show 
+----+----+----+----+
|  id|rsrp|rsrq|type|
+----+----+----+----+
|key1|  23| 1.0|  mr|
|key1|  10| 2.0|  mr|
+----+----+----+----+

Note:

lit() is a built-in Spark function; it requires import org.apache.spark.sql.functions._

Since 1.3.0
def lit(literal: Any): Column
Creates a Column of literal value. The passed in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created to represent the literal value.
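
A minimal sketch (using the same df as above): lit() can also be combined with cast() to add a typed null column alongside ordinary constants.

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.DataTypes

// constant double column plus a typed null string column
df.withColumn("ratio", lit(0.5))
  .withColumn("remark", lit(null).cast(DataTypes.StringType))
  .show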

Method 2: derive a new column from an existing column

scala> df.withColumn("rsrp2", $"rsrp"*2).show 
+----+----+----+-----+
|  id|rsrp|rsrq|rsrp2|
+----+----+----+-----+
|key1|  23| 1.0|   46|
|key1|  10| 2.0|   20|
+----+----+----+-----+
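
The expression can also combine several existing columns at once; a minimal sketch on the same df:

// derive columns from more than one existing column
df.withColumn("sum", $"rsrp" + $"rsrq")
  .withColumn("rsrp_half", $"rsrp" / 2)
  .show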

Method 3: add columns with the select function

Java:

import static org.apache.spark.sql.functions.col;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
...
    private final SimpleDateFormat srcSdf = new SimpleDateFormat("yyyy-MM-dd HH:00:00");
    private final SimpleDateFormat destSdf = new SimpleDateFormat("yyyy-MM-dd 00:00:00");
    
    public Dataset<Row> handler(Dataset<Row> esDataset){
        // UDF that reformats "yyyy-MM-dd HH:00:00" strings to day granularity
        UDF1<String, String> date_fomat = new UDF1<String, String>() {
            private static final long serialVersionUID = 1L;

            public String call(final String value) throws Exception {
                Date date = srcSdf.parse(value);
                return destSdf.format(date);
            }
        };
        sparkSession.udf().register("date_fomat_func", date_fomat, DataTypes.StringType);

        // UDF that truncates an epoch-millisecond timestamp to the start of its day
        UDF1<Long, Long> to_long = new UDF1<Long, Long>() {
            private static final long serialVersionUID = 1L;

            public Long call(final Long value) throws Exception {
                // interpret the input as epoch milliseconds, then truncate to "yyyy-MM-dd 00:00:00"
                Date date = new Date(value);
                return destSdf.parse(destSdf.format(date)).getTime();
            }
            }
        };
        sparkSession.udf().register("to_long_func", to_long, DataTypes.LongType);

        esDataset=esDataset.withColumn("scan_start_time", functions.callUDF("date_fomat_func", col("scan_start_time")));
        esDataset=esDataset.withColumn("scan_stop_time", functions.callUDF("date_fomat_func", col("scan_stop_time")));
        esDataset=esDataset.withColumn("timestamp", functions.callUDF("to_long_func", col("timestamp")));
        
        return esDataset;
    }
...

Scala:

scala> import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.DataTypes
scala> df.select(col("*"), 
     |     udf{
     |         (e:Int) =>
     |             if(e == 23) {
     |                 1
     |             } else {
     |                 2
     |             }
     |     }.apply(df("rsrp")).cast(DataTypes.DoubleType).as("rsrp_udf")
     | ).show
+----+----+----+--------+
|  id|rsrp|rsrq|rsrp_udf|
+----+----+----+--------+
|key1|  23| 1.0|     1.0|
|key1|  10| 2.0|     2.0|
+----+----+----+--------+

scala> df.select(col("*"),
     |     when(df("rsrp") > 10, lit(">10")).when(df("rsrp") === 10, "=10").otherwise("<10").as("rsrp_compare10")
     | ).show
+----+----+----+--------------+
|  id|rsrp|rsrq|rsrp_compare10|
+----+----+----+--------------+
|key1|  23| 1.0|           >10|
|key1|  10| 2.0|           =10|
+----+----+----+--------------+

Method 4: nest a UDF inside case when

Here udf1 stands for a UDF defined elsewhere; when rsrp is null the result stays null, otherwise the UDF is applied and the result is cast to an integer:

df.withColumn("r",
   when($"rsrp".isNull, lit(null))
       .otherwise(udf1($"rsrp"))
       .cast(DataTypes.IntegerType)
)
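
A self-contained sketch of the same pattern; the UDF name udf1 and its logic here are only illustrative:

import org.apache.spark.sql.functions.{udf, when, lit}
import org.apache.spark.sql.types.DataTypes

// illustrative UDF: bucket rsrp into 1 (strong) or 0 (weak)
val udf1 = udf { (rsrp: Int) => if (rsrp >= 20) 1 else 0 }

df.withColumn("r",
  when($"rsrp".isNull, lit(null))
    .otherwise(udf1($"rsrp"))
    .cast(DataTypes.IntegerType)
).show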

Method 5: use the expr() function

scala> df.withColumn("rsrp4", expr("rsrp * 4")).show
+----+----+----+-----+
|  id|rsrp|rsrq|rsrp4|
+----+----+----+-----+
|key1|  23| 1.0|   92|
|key1|  10| 2.0|   40|
+----+----+----+-----+
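
expr() accepts any Spark SQL expression string, and selectExpr is the select-side shortcut for the same idea; a minimal sketch:

// SQL expression strings as new columns
df.withColumn("level", expr("case when rsrp > 10 then 'high' else 'low' end")).show
df.selectExpr("*", "rsrp + rsrq as total").show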

Dropping columns from a Dataset

scala> df.drop("rsrp").show
+----+----+
|  id|rsrq|
+----+----+
|key1| 1.0|
|key1| 2.0|
+----+----+


scala> df.drop("rsrp","rsrq").show
+----+
|  id|
+----+
|key1|
|key1|
+----+
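
drop also accepts a Column reference, and dropping a column that does not exist is simply a no-op; a quick sketch:

// drop by Column object; unknown column names are ignored
df.drop(df("rsrq")).show
df.drop("no_such_col").show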

Replacing null values in a Dataset

First, place a test file at the HDFS path /user/spark/test.csv:

[spark@master ~]$ hadoop fs -text /user/spark/test.csv
key1,key2,key3,key4,key5
aaa,1,2,t1,4
bbb,5,3,t2,8
ccc,2,2,,7
,7,3,t1,
bbb,1,5,t3,0
,4,,t1,8
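
The empty fields above become null when the file is loaded, and they can be filled with the na functions on a Dataset. A minimal sketch (the fill values chosen here are only illustrative):

// read the CSV with a header row; empty fields are loaded as null
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/user/spark/test.csv")

// replace null strings and null numbers with defaults
csvDf.na.fill("unknown", Seq("key1", "key4"))
     .na.fill(0L, Seq("key3", "key5"))
     .show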
Reposted from: https://www.cnblogs.com/yy3b2007com/p/9872492.html
