2021.5.20 Spark SQL DataFrame API notes (work in progress)

This post walks through joining data with Spark DataFrames, including the join operation and handling columns of mismatched types, plus the explode function, inserting data of different types, working with map columns, and conditional expressions on DataFrames.

When I actually had to use the DataFrame API I realized I had forgotten most of it, so I put together the commonly used calls; the code from work can't be shared, though.
Below are some test cases instead:

Recording a friend's environment parameters and version numbers here for later reference; last time I helped download dependencies I couldn't find my own environment configuration, emmm:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <hive-common.version>3.0.0-cdh6.3.2</hive-common.version>
    <hive-client.version>3.0.0-cdh6.3.2</hive-client.version>
    <hadoop-hdfs.version>3.0.0-cdh6.3.2</hadoop-hdfs.version>
    <spark.version>2.4.0-cdh6.3.2</spark.version>
</properties>

Never mind, I still have spark-shell. Rough going:

First up: joining tables.

The join method:
The Hive database is accessed through Spark, and the DataFrames are produced via Spark SQL.
1. My exam database contains a table named ex_exam_record.

//switch to the exam database
scala> spark.sql("use exam")
res38: org.apache.spark.sql.DataFrame = []
//list the tables in the exam database
scala> spark.sql("show tables").show
+--------+----------------+-----------+
|database|       tableName|isTemporary|
+--------+----------------+-----------+
|    exam| ex_exam_anlysis|      false|
|    exam|ex_exam_question|      false|
|    exam|  ex_exam_record|      false|
+--------+----------------+-----------+
//inspect the table schema
scala> spark.sql("desc ex_exam_record").show
+-----------+---------+-------+
|   col_name|data_type|comment|
+-----------+---------+-------+
|   topic_id|   string|   null|
| student_id|   string|   null|
|question_id|   string|   null|
|      score|    float|   null|
+-----------+---------+-------+
//query 10 rows from the table, as shown below
scala> spark.sql("select * from ex_exam_record limit 10").show
+--------+-------------+-----------+-----+
|topic_id|   student_id|question_id|score|
+--------+-------------+-----------+-----+
|34434412|8195023659593|       8080|  1.0|
|34434459|8195023659592|       3762|  0.0|
| 3443449|8195023659591|       1736|  0.0|
|34434473|8195023659591|       7982|  0.0|
|34434444|8195023659595|       4753|  0.0|
|34434425|8195023659597|       7130|  1.0|
|34434497|8195023659594|       3365|  1.0|
|34434473|8195023659597|       7680|  1.0|
|34434417|8195023659593|       3727|  0.0|
|34434454|8195023659598|       6108|  1.0|
+--------+-------------+-----------+-----+

2. Turn this table into a DataFrame named recordDf.

//build recordDf
scala> val recordDf =  spark.sql("select * from ex_exam_record")
recordDf: org.apache.spark.sql.DataFrame = [topic_id: string, student_id: string ... 2 more fields]
//show the first 10 rows
scala> recordDf.show(10)
+--------+-------------+-----------+-----+
|topic_id|   student_id|question_id|score|
+--------+-------------+-----------+-----+
|34434412|8195023659593|       8080|  1.0|
|34434459|8195023659592|       3762|  0.0|
| 3443449|8195023659591|       1736|  0.0|
|34434473|8195023659591|       7982|  0.0|
|34434444|8195023659595|       4753|  0.0|
|34434425|8195023659597|       7130|  1.0|
|34434497|8195023659594|       3365|  1.0|
|34434473|8195023659597|       7680|  1.0|
|34434417|8195023659593|       3727|  0.0|
|34434454|8195023659598|       6108|  1.0|
+--------+-------------+-----------+-----+
only showing top 10 rows

3. To demonstrate the join I need a table that shares a field, so create a new one, say test1.
The topic_id in the record table is a string;
in the new table we will make topic_id an int.

scala> spark.sql("create table test1 ( topic_id int comment '主题id', teacher_id int comment '教师id' )")
res46: org.apache.spark.sql.DataFrame = []

scala> spark.sql("desc test1")
res47: org.apache.spark.sql.DataFrame = [col_name: string, data_type: string ... 1 more field]

scala> spark.sql("desc test1").show()
+----------+---------+-------+
|  col_name|data_type|comment|
+----------+---------+-------+
|  topic_id|      int| 主题id|
|teacher_id|      int| 教师id|
+----------+---------+-------+

4. Insert some data into the table; the original table contains ids such as 34434412 and 34434459.

scala> spark.sql("insert into table test1 values (34434412,1),(34434459,2)")
res50: org.apache.spark.sql.DataFrame = []
scala> spark.sql("select * from test1").show
+--------+----------+
|topic_id|teacher_id|
+--------+----------+
|34434412|         1|
|34434459|         2|
+--------+----------+

5. The two rows were inserted successfully. Now try the join operation; note that topic_id currently has a different type on each side.

//first turn test1 into a DataFrame
scala> val test1Df = spark.sql("select * from test1")
test1Df: org.apache.spark.sql.DataFrame = [topic_id: int, teacher_id: int]
scala> test1Df.show()
+--------+----------+
|topic_id|teacher_id|
+--------+----------+
|34434412|         1|
|34434459|         2|
+--------+----------+
//then try joining recordDf and test1Df
scala> test1Df.join(recordDf,recordDf("topic_id")===test1Df("topic_id"),"left").show(5)
+--------+----------+--------+-------------+-----------+-----+
|topic_id|teacher_id|topic_id|   student_id|question_id|score|
+--------+----------+--------+-------------+-----------+-----+
|34434412|         1|34434412|8195023659593|       8080|  1.0|
|34434412|         1|34434412|8195023659591|       8775|  0.0|
|34434412|         1|34434412|8195023659594|       2439|  0.5|
|34434412|         1|34434412|8195023659593|       9496|  1.0|
|34434412|         1|34434412|8195023659598|       7854|  0.0|
+--------+----------+--------+-------------+-----------+-----+
only showing top 5 rows
scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id"),"left").show(5)
+--------+-------------+-----------+-----+--------+----------+
|topic_id|   student_id|question_id|score|topic_id|teacher_id|
+--------+-------------+-----------+-----+--------+----------+
|34434412|8195023659593|       8080|  1.0|34434412|         1|
|34434459|8195023659592|       3762|  0.0|34434459|         2|
| 3443449|8195023659591|       1736|  0.0|    null|      null|
|34434473|8195023659591|       7982|  0.0|    null|      null|
|34434444|8195023659595|       4753|  0.0|    null|      null|
+--------+-------------+-----------+-----+--------+----------+
only showing top 5 rows
//the join worked even though the two types differ; check the DataFrame schemas
scala> recordDf.schema
res60: org.apache.spark.sql.types.StructType = StructType(StructField(topic_id,StringType,true), StructField(student_id,StringType,true), StructField(question_id,StringType,true), StructField(score,FloatType,true))
scala> test1Df.schema
res61: org.apache.spark.sql.types.StructType = StructType(StructField(topic_id,IntegerType,true), StructField(teacher_id,IntegerType,true))

6. So when joining, if the column types differ Spark attempts an automatic type conversion. Here is how to cast a column type manually.

scala> val newRecordDf = recordDf.withColumn("new_topic_id",col("topic_id") cast "Int")
scala> newRecordDf.schema
res64: org.apache.spark.sql.types.StructType = StructType(
StructField(topic_id,StringType,true), 
StructField(student_id,StringType,true), 
StructField(question_id,StringType,true), 
StructField(score,FloatType,true), 
StructField(new_topic_id,IntegerType,true)
)
//Conclusion: a new column new_topic_id of type IntegerType was added;
//            the original topic_id column is still there and its type is unchanged
scala> val newRecordDf = recordDf.withColumn("topic_id",col("topic_id") cast "Int")
scala> newRecordDf.schema
res65: org.apache.spark.sql.types.StructType = StructType(
StructField(topic_id,IntegerType,true), 
StructField(student_id,StringType,true), 
StructField(question_id,StringType,true), 
StructField(score,FloatType,true)
)
//Conclusion: if the new column has the same name as an existing one, that column is overwritten in place and its type converted
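
For completeness, a minimal sketch (not from the original session) that casts the join key up front and joins on a shared column name, which also avoids the duplicated topic_id column in the result:

// cast the string key to int so the two sides match explicitly
val castedRecordDf = recordDf.withColumn("topic_id", col("topic_id").cast("int"))
// joining on Seq("topic_id") keeps a single topic_id column in the output
castedRecordDf.join(test1Df, Seq("topic_id"), "left").show(5)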

Update 5.23:
How to use explode:
explode works on columns of array or map type. Calling it on a string column (for example topic_id in ex_exam_record) fails with:

org.apache.spark.sql.AnalysisException: 
cannot resolve 'explode(exam.ex_exam_record.`topic_id`)' due to data type mismatch: 
input to function explode should be array or map type, not string;
To get a map column to play with, alter the table DDL:
ALTER TABLE c_employee ADD COLUMNS (work string); -- syntax for adding a column
spark.sql("ALTER TABLE test1 ADD COLUMNS (mapCol map<string,string>) ")

A typical usage from real code, turning each element of an array/map column into its own row:
.withColumn("region_country", explode(col("region_countries")))
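
A small self-contained sketch (the DataFrame here is built in memory, not from the Hive table): explode on a map column produces one row per entry, with key and value columns. spark-shell already imports spark.implicits._ and org.apache.spark.sql.functions._; in an application you would add those imports yourself.

val mapDf = Seq(
  ("34434412", Map("cn" -> "China", "us" -> "USA"))
).toDF("topic_id", "region_countries")

// each map entry becomes its own row, with key/value columns
mapDf.select(col("topic_id"), explode(col("region_countries"))).show(false)
// topic_id | key | value
// 34434412 | cn  | China
// 34434412 | us  | USA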

Question 1:
How do you insert data into a map column using INSERT INTO?
And without INSERT INTO, how else can data be loaded? One option is to load a text file into the table's HDFS directory, taking care with Hive's default delimiters \001 \002 \003.
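
For the INSERT question, one approach that should work (a sketch, not verified on this cluster) is INSERT ... SELECT with the built-in map() function, assuming the mapCol column added by the ALTER TABLE above:

// the id and teacher_id values here are made up for illustration
spark.sql("insert into table test1 select 34434498, 3, map('k1', 'v1', 'k2', 'v2')")
spark.sql("select * from test1 where teacher_id = 3").show(false)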

Add a column and assign the new column a value. Note the difference between the two calls below: lit($"topic_id") is passed an existing Column, so the new column copies its values, whereas lit("topic_id") creates a constant string literal:

scala> test1Df.withColumn("topic_id1",lit($"topic_id")).show
+--------+----------+---------+
|topic_id|teacher_id|topic_id1|
+--------+----------+---------+
|34434412|         1| 34434412|
|34434459|         2| 34434459|
+--------+----------+---------+
scala> test1Df.withColumn("topic_id1",lit("topic_id")).show
+--------+----------+---------+
|topic_id|teacher_id|topic_id1|
+--------+----------+---------+
|34434412|         1| topic_id|
|34434459|         2| topic_id|
+--------+----------+---------+

The DataFrame equivalent of CASE WHEN (a chained variant is sketched after the output below):

scala> recordDf.show(10)
+--------+-------------+-----------+-----+
|topic_id|   student_id|question_id|score|
+--------+-------------+-----------+-----+
|34434412|8195023659593|       8080|  1.0|
|34434459|8195023659592|       3762|  0.0|
| 3443449|8195023659591|       1736|  0.0|
|34434473|8195023659591|       7982|  0.0|
|34434444|8195023659595|       4753|  0.0|
|34434425|8195023659597|       7130|  1.0|
|34434497|8195023659594|       3365|  1.0|
|34434473|8195023659597|       7680|  1.0|
|34434417|8195023659593|       3727|  0.0|
|34434454|8195023659598|       6108|  1.0|
+--------+-------------+-----------+-----+
only showing top 10 rows

scala> recordDf.withColumn("grade",when(col("score")>0.0,lit("优秀")).otherwise(col("score"))).show(10)
+--------+-------------+-----------+-----+-----+
|topic_id|   student_id|question_id|score|grade|
+--------+-------------+-----------+-----+-----+
|34434412|8195023659593|       8080|  1.0| 优秀|
|34434459|8195023659592|       3762|  0.0|  0.0|
| 3443449|8195023659591|       1736|  0.0|  0.0|
|34434473|8195023659591|       7982|  0.0|  0.0|
|34434444|8195023659595|       4753|  0.0|  0.0|
|34434425|8195023659597|       7130|  1.0| 优秀|
|34434497|8195023659594|       3365|  1.0| 优秀|
|34434473|8195023659597|       7680|  1.0| 优秀|
|34434417|8195023659593|       3727|  0.0|  0.0|
|34434454|8195023659598|       6108|  1.0| 优秀|
+--------+-------------+-----------+-----+-----+
only showing top 10 rows
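
when calls can also be chained, which corresponds to a multi-branch CASE WHEN. A quick sketch (the grade labels here are made up):

recordDf.withColumn("grade",
  when(col("score") >= 1.0, lit("excellent"))
    .when(col("score") >= 0.5, lit("pass"))
    .otherwise(lit("fail"))
).show(5)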

To summarize, several ways of referencing a column (all return a Column and can be mixed, as the sketch after the list shows):

col("score")
$"score"
recordDf("score")
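
A quick sketch mixing the three forms in one select (col comes from org.apache.spark.sql.functions and $ from spark.implicits._, both auto-imported in spark-shell):

recordDf.select(col("topic_id"), $"student_id", recordDf("score")).show(2)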

The different join types:
The one to watch is left_anti, which keeps only the left-table rows that fail to match (a left_semi sketch follows the examples).

scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id"),"left").show(5)
+--------+-------------+-----------+-----+--------+----------+
|topic_id|   student_id|question_id|score|topic_id|teacher_id|
+--------+-------------+-----------+-----+--------+----------+
|34434412|8195023659593|       8080|  1.0|34434412|         1|
|34434459|8195023659592|       3762|  0.0|34434459|         2|
| 3443449|8195023659591|       1736|  0.0|    null|      null|
|34434473|8195023659591|       7982|  0.0|    null|      null|
|34434444|8195023659595|       4753|  0.0|    null|      null|
+--------+-------------+-----------+-----+--------+----------+
only showing top 5 rows


scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id"),"left_anti").show(5)
+--------+-------------+-----------+-----+
|topic_id|   student_id|question_id|score|
+--------+-------------+-----------+-----+
| 3443449|8195023659591|       1736|  0.0|
|34434473|8195023659591|       7982|  0.0|
|34434444|8195023659595|       4753|  0.0|
|34434425|8195023659597|       7130|  1.0|
|34434497|8195023659594|       3365|  1.0|
+--------+-------------+-----------+-----+
only showing top 5 rows


scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id")).show(5)
+--------+-------------+-----------+-----+--------+----------+
|topic_id|   student_id|question_id|score|topic_id|teacher_id|
+--------+-------------+-----------+-----+--------+----------+
|34434412|8195023659593|       8080|  1.0|34434412|         1|
|34434459|8195023659592|       3762|  0.0|34434459|         2|
|34434459|8195023659591|       5657|  0.5|34434459|         2|
|34434412|8195023659591|       8775|  0.0|34434412|         1|
|34434459|8195023659596|       8248|  0.5|34434459|         2|
+--------+-------------+-----------+-----+--------+----------+
only showing top 5 rows

scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id"),"right").show(5)
+--------+-------------+-----------+-----+--------+----------+
|topic_id|   student_id|question_id|score|topic_id|teacher_id|
+--------+-------------+-----------+-----+--------+----------+
|34434412|8195023659593|       8080|  1.0|34434412|         1|
|34434412|8195023659591|       8775|  0.0|34434412|         1|
|34434412|8195023659594|       2439|  0.5|34434412|         1|
|34434412|8195023659593|       9496|  1.0|34434412|         1|
|34434412|8195023659598|       7854|  0.0|34434412|         1|
+--------+-------------+-----------+-----+--------+----------+
only showing top 5 rows
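
For reference, a sketch of left_semi, the complement of left_anti: it keeps only the left-table rows that do match, without bringing in any right-hand columns.

recordDf.join(test1Df, recordDf("topic_id") === test1Df("topic_id"), "left_semi").show(5)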

The :_* syntax shows up a lot: it expands all elements of a collection as varargs (a variant with Column objects is sketched after the examples below).

val selectedFields = "student_id,question_id,score"
recordDf.selectExpr(selectedFields.split(","): _*).show(5)
+-------------+-----------+-----+
|   student_id|question_id|score|
+-------------+-----------+-----+
|8195023659593|       8080|  1.0|
|8195023659592|       3762|  0.0|
|8195023659591|       1736|  0.0|
|8195023659591|       7982|  0.0|
|8195023659595|       4753|  0.0|
+-------------+-----------+-----+
only showing top 5 rows
Selects every element of the array (all field names passed as varargs).
scala> recordDf.selectExpr(selectedFields.split(",")(0) ).show(5)
+-------------+
|   student_id|
+-------------+
|8195023659593|
|8195023659592|
|8195023659591|
|8195023659591|
|8195023659595|
+-------------+
only showing top 5 rows
Selects only the element at index 0 of the array (a single field name).
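
The same varargs trick works with select and Column objects; a small sketch reusing selectedFields from above:

// build an Array[Column] from the comma-separated string, then expand it with :_*
val colObjects = selectedFields.split(",").map(col)
recordDf.select(colObjects: _*).show(5)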
scala> recordDf.select("topic_id","student_id").show(2)
+--------+-------------+
|topic_id|   student_id|
+--------+-------------+
|34434412|8195023659593|
|34434459|8195023659592|
+--------+-------------+
only showing top 2 rows

scala> recordDf.select($"topic_id",$"student_id").show(2)
+--------+-------------+
|topic_id|   student_id|
+--------+-------------+
|34434412|8195023659593|
|34434459|8195023659592|
+--------+-------------+
only showing top 2 rows


scala> val aaa = recordDf("topic_id")
aaa: org.apache.spark.sql.Column = topic_id

scala> recordDf.select(aaa,$"student_id").show(2)
+--------+-------------+
|topic_id|   student_id|
+--------+-------------+
|34434412|8195023659593|
|34434459|8195023659592|
+--------+-------------+
only showing top 2 rows

scala> val aaa=$"topic_id"
aaa: org.apache.spark.sql.ColumnName = topic_id
scala> val aaa="topic_id"
aaa: String = topic_id
