Creating DataFrames
1. The spark-shell approach
1) Environment
spark-shell creates the SparkContext (sc) and SQLContext objects for you at startup.
2) Code:
Launch spark-shell:
./spark-shell --master spark://hdp-1:7077 --executor-memory 500m --total-executor-cores 1
//Create a dataset and parallelize it
val seq = Seq(("1","xiaoming",15),("2","xiaohong",20),("3","xiaoben",10))
REPL output: seq: Seq[(String, String, Int)] = List((1,xiaoming,15), (2,xiaohong,20), (3,xiaoben,10))
val rdd1 = sc.parallelize(seq)
REPL output: rdd1: org.apache.spark.rdd.RDD[(String, String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:26
Convert the RDD to a DataFrame (a DataFrame stores both the data and its schema).
//Tuple fields map to columns: _1: string, _2: string, _3: int
rdd1.toDF
//When toDF is called with no arguments, the default column names are an underscore plus the field position, counting from 1 (_1, _2, _3).
val df = rdd1.toDF("id","name","age")
REPL output: df: org.apache.spark.sql.DataFrame = [id: string, name: string ... 1 more field]
In the no-argument case, _1 is the column name and String is that column's data type.
//View the data with the show operator; show is an action.
View the data:
df.show
Result:
+---+--------+---+
| id| name|age|
+---+--------+---+
| 1|xiaoming| 15|
| 2|xiaohong| 20|
| 3| xiaoben| 10|
+---+--------+---+
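As an alternative to naming columns in toDF, a case class can supply both the column names and types; a minimal sketch (the Person class name and this variant are illustrative, not from the original notes):

```scala
// Case class fields become the DataFrame's column names and types
case class Person(id: String, name: String, age: Int)

val personRDD = sc.parallelize(Seq(Person("1", "xiaoming", 15), Person("2", "xiaohong", 20)))
// toDF on an RDD of case-class instances infers the schema from the fields
val personDF = personRDD.toDF  // columns: id, name, age
personDF.show
```

With this form there is no need to pass column names to toDF, since they are read from the case class.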
DSL-style syntax
1. Queries:
1.1 Select the name column
df.select("name").show
Result:
+--------+
| name|
+--------+
|xiaoming|
|xiaohong|
| xiaoben|
+--------+
1.2 Select name and age
df.select("name","age").show
Result:
+--------+---+
| name|age|
+--------+---+
|xiaoming| 15|
|xiaohong| 20|
| xiaoben| 10|
+--------+---+
//Conditional filtering
//The filter condition can be passed as a string expression.
1.3 Select name and age where age > 10
df.select("name","age").filter("age > 10").show
Result:
+--------+---+
| name|age|
+--------+---+
|xiaoming| 15|
|xiaohong| 20|
+--------+---+
//2. Alternatively, pass a Column built with col("columnName"); col comes from org.apache.spark.sql.functions.
df.select("name","age").filter(col("age") > 10).show
Result:
+--------+---+
| name|age|
+--------+---+
|xiaoming| 15|
|xiaohong| 20|
+--------+---+
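A third equivalent form uses the $"column" shorthand, which also builds a Column object (spark-shell imports the needed SQL implicits automatically; in a compiled program they come from the SQLContext's implicits):

```scala
// $"age" is shorthand for a Column, just like col("age")
df.select("name", "age").filter($"age" > 10).show
```

All three filters produce the same two-row result (xiaoming and xiaohong).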
//3. Group by a column and count
df.groupBy("age").count().show()
Result:
+---+-----+
|age|count|
+---+-----+
| 20| 1|
| 15| 1|
| 10| 1|
+---+-----+
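Besides count, groupBy-style aggregation functions such as avg and max can be applied; a hedged sketch (these functions live in org.apache.spark.sql.functions, which spark-shell imports automatically; the choice of columns here is illustrative):

```scala
// Overall average and maximum of the age column, computed as a one-row DataFrame
df.agg(avg("age"), max("age")).show
```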
//4. Print the DataFrame's schema
df.printSchema
Result:
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- age: integer (nullable = false)
Note: age is nullable = false because it came from a Scala Int, a primitive that cannot be null, while the String columns remain nullable.
SQL-style syntax:
1. Register the DataFrame as a (temporary) table.
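The registration step can be sketched as follows (t_person is an assumed table name; registerTempTable is the Spark 1.x call matching the SQLContext used in these notes, and was superseded by createOrReplaceTempView in Spark 2.x):

```scala
// Register the DataFrame so it can be queried with SQL
df.registerTempTable("t_person")

// Query the temporary table through the SQLContext
val result = sqlContext.sql("SELECT name, age FROM t_person WHERE age > 10")
result.show
```

The temporary table lives only for the lifetime of the current SQLContext; it is not written to any external store.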