Several ways to add columns to a Dataset
First, create a DataFrame to work with:
scala> spark.version
res0: String = 2.2.0.cloudera1
scala> val df = spark.createDataset(Seq(("key1", 23, 1.0), ("key1", 10, 2.0))).toDF("id", "rsrp", "rsrq")
df: org.apache.spark.sql.DataFrame = [id: string, rsrp: int ... 1 more field]
scala> df.show
+----+----+----+
| id|rsrp|rsrq|
+----+----+----+
|key1| 23| 1.0|
|key1| 10| 2.0|
+----+----+----+
scala> df.printSchema
root
|-- id: string (nullable = true)
|-- rsrp: integer (nullable = false)
|-- rsrq: double (nullable = false)
Method 1: add a constant (literal) column with lit()
The literal can be a string, an integer, and so on.
scala> df.withColumn("sinurl", lit(12)).show
+----+----+----+------+
| id|rsrp|rsrq|sinurl|
+----+----+----+------+
|key1| 23| 1.0| 12|
|key1| 10| 2.0| 12|
+----+----+----+------+
scala> df.withColumn("type", lit("mr")).show
+----+----+----+----+
| id|rsrp|rsrq|type|
+----+----+----+----+
|key1| 23| 1.0| mr|
|key1| 10| 2.0| mr|
+----+----+----+----+
Note:
lit() is a built-in Spark function; it requires import org.apache.spark.sql.functions._
Since 1.3.0
def lit(literal: Any): Column
Creates a Column of literal value. The passed in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created to represent the literal value.
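lit() can also be combined with cast(), for example to add a column of typed nulls; a minimal sketch (the column name "note" is illustrative):
df.withColumn("note", lit(null).cast("string"))  // adds a string column whose values are all null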
Method 2: derive a new column by transforming an existing column
scala> df.withColumn("rsrp2", $"rsrp"*2).show
+----+----+----+-----+
| id|rsrp|rsrq|rsrp2|
+----+----+----+-----+
|key1| 23| 1.0| 46|
|key1| 10| 2.0| 20|
+----+----+----+-----+
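A transformation can also combine several existing columns; a minimal sketch using the same DataFrame (the comment gives the values produced for the two rows above):
df.withColumn("rsrp_plus_rsrq", $"rsrp" + $"rsrq")  // 24.0 and 12.0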
Method 3: add columns with select() and UDFs
The Java snippet below registers two UDFs and applies them via withColumn() and callUDF(); the Scala examples that follow add columns with select().
import static org.apache.spark.sql.functions.col;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
...
private final SimpleDateFormat srcSdf = new SimpleDateFormat("yyyy-MM-dd HH:00:00");
private final SimpleDateFormat destSdf = new SimpleDateFormat("yyyy-MM-dd 00:00:00");
public Dataset<Row> handler(Dataset<Row> esDataset) {
    // Truncate a "yyyy-MM-dd HH:00:00" string to the start of its day.
    UDF1<String, String> date_format = new UDF1<String, String>() {
        private static final long serialVersionUID = 1L;
        public String call(final String value) throws Exception {
            Date date = srcSdf.parse(value);
            return destSdf.format(date);
        }
    };
    sparkSession.udf().register("date_format_func", date_format, DataTypes.StringType);

    // Truncate a timestamp to the start of its day and return it as a long
    // (assumes the "timestamp" column holds epoch milliseconds).
    UDF1<Long, Long> to_long = new UDF1<Long, Long>() {
        private static final long serialVersionUID = 1L;
        public Long call(final Long value) throws Exception {
            Date date = new Date(value);
            return destSdf.parse(destSdf.format(date)).getTime();
        }
    };
    sparkSession.udf().register("to_long_func", to_long, DataTypes.LongType);

    esDataset = esDataset.withColumn("scan_start_time", functions.callUDF("date_format_func", col("scan_start_time")));
    esDataset = esDataset.withColumn("scan_stop_time", functions.callUDF("date_format_func", col("scan_stop_time")));
    esDataset = esDataset.withColumn("timestamp", functions.callUDF("to_long_func", col("timestamp")));
    return esDataset;
}
...
In Scala, the same can be done with select() and an inline UDF:
scala> import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.DataTypes
scala> df.select(col("*"),
| udf{
| (e:Int) =>
| if (e == 23) {
| 1
| } else {
| 2
| }
| }.apply(df("rsrp")).cast(DataTypes.DoubleType).as("rsrp_udf")
| ).show
+----+----+----+--------+
| id|rsrp|rsrq|rsrp_udf|
+----+----+----+--------+
|key1| 23| 1.0| 1.0|
|key1| 10| 2.0| 2.0|
+----+----+----+--------+
scala> df.select(col("*"),
| when(df("rsrp") > 10, lit(">10")).when(df("rsrp") === 10, "=10").otherwise("<10").as("rsrp_compare10")
| ).show
+----+----+----+--------------+
| id|rsrp|rsrq|rsrp_compare10|
+----+----+----+--------------+
|key1| 23| 1.0| >10|
|key1| 10| 2.0| =10|
+----+----+----+--------------+
Method 4: nest a UDF inside a case when expression
df.withColumn("r",
when($"rsrp".isNull, lit(null))
.otherwise(udf1($"rsrp"))
.cast(DataTypes.IntegerType)
)
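A self-contained sketch of the same pattern, with udf1 defined as a hypothetical UDF (here it simply doubles the value; replace it with real logic):
import org.apache.spark.sql.functions.{lit, udf, when}
import org.apache.spark.sql.types.DataTypes

val udf1 = udf { (rsrp: Int) => rsrp * 2 }  // hypothetical UDF standing in for whatever udf1 really does
df.withColumn("r",
  when($"rsrp".isNull, lit(null))            // keep null when the source column is null
    .otherwise(udf1($"rsrp"))                // otherwise apply the UDF
    .cast(DataTypes.IntegerType)
)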
Method 5: use expr()
scala> df.withColumn("rsrp4", expr("rsrp * 4")).show
+----+----+----+-----+
| id|rsrp|rsrq|rsrp4|
+----+----+----+-----+
|key1| 23| 1.0| 92|
|key1| 10| 2.0| 40|
+----+----+----+-----+
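expr() parses any Spark SQL expression string, and selectExpr() is a shortcut for select + expr; a minimal sketch (the new column names are illustrative):
df.withColumn("level", expr("case when rsrp > 10 then 'high' else 'low' end"))
df.selectExpr("*", "rsrp * 4 as rsrp4")  // same result as the withColumn + expr example above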
Dropping columns from a Dataset
scala> df.drop("rsrp").show
+----+----+
| id|rsrq|
+----+----+
|key1| 1.0|
|key1| 2.0|
+----+----+
scala> df.drop("rsrp","rsrq").show
+----+
| id|
+----+
|key1|
|key1|
+----+
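drop() also accepts a Column reference instead of a name; a minimal sketch:
df.drop(df("rsrp"))  // same effect as df.drop("rsrp")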
Replacing null values in a Dataset
First, there is a test file on HDFS at /user/spark/test.csv:
[spark@master ~]$ hadoop fs -text /user/spark/test.csv
key1,key2,key3,key4,key5
aaa,1,2,t1,4
bbb,5,3,t2,8
ccc,2,2,,7
,7,3,t1,
bbb,1,5,t3,0
,4,,t1,8
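The nulls in a DataFrame loaded from this file can then be replaced with the na functions; a minimal sketch, assuming the file is read with a header row (the fill values below are illustrative):
val csv = spark.read
  .option("header", "true")
  .option("nullValue", "")   // treat empty fields as null
  .csv("/user/spark/test.csv")

// fill nulls per column, or drop rows that contain any null
csv.na.fill(Map("key1" -> "unknown", "key4" -> "t0", "key5" -> "0")).show
csv.na.drop().show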