1. SparkSession
The entry point to all Spark SQL functionality is the SparkSession class, created via SparkSession.builder():
SparkSession session = SparkSession.builder()
        .appName("Java Spark SQL basic example")
        .config("spark.some.config.option", "some-value")
        .getOrCreate();
2. Creating DataFrames
Create a DataFrame from JSON data:
// Create the SparkSession
SparkSession spark = SparkSession.builder()
        .master("local")
        .appName("Java Spark SQL basic example")
        .config("spark.some.config.option", "some-value")
        .getOrCreate();
// Create a DataFrame
Dataset<Row> df = spark.read().json("src/main/resources/people.json");
df.show();
Output:
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
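For reference, people.json is the sample file from the official Spark distribution (examples/src/main/resources/people.json); assuming the standard copy, it holds one JSON object per line:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}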
3. Untyped Dataset Operations
Print the schema in tree format:
df.printSchema();
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
df.select("name").show();
+-------+
| name|
+-------+
|Michael|
| Andy|
| Justin|
+-------+
// col() comes from: import static org.apache.spark.sql.functions.col;
df.select(col("name"), col("age").plus(1)).show();
+-------+---------+
| name|(age + 1)|
+-------+---------+
|Michael| null|
| Andy| 31|
| Justin| 20|
+-------+---------+
df.filter(col("age").gt(21)).show();
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
df.groupBy("gae").count().show();
+----+-----+
| age|count|
+----+-----+
| 19| 1|
|null| 1|
| 30| 1|
+----+-----+
4. Running SQL Queries Programmatically
df.createOrReplaceTempView("people");
Dataset<Row> sql = spark.sql("SELECT * FROM people");
sql.show();
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
5. Global Temporary Views
df.createGlobalTempView("people");
spark.sql("SELECT * FROM global_temp.people").show();
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
// Global temporary views are cross-session; they are tied to the system database global_temp
spark.newSession().sql("SELECT * FROM global_temp.people").show();
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
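By contrast, an ordinary temporary view such as the one created with createOrReplaceTempView in section 4 is session-scoped. A minimal sketch of the difference (assuming only that temp view, not the global one, exists):
// A session-scoped view is not visible from a new session;
// this fails with an AnalysisException (table or view 'people' not found)
spark.newSession().sql("SELECT * FROM people").show();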
6. Creating Datasets
Datasets are similar to RDDs, but instead of Java or Kryo serialization they use a specialized Encoder to serialize objects for processing and for transfer over the network. Because these encoders are generated dynamically, Spark can perform many operations such as filtering, sorting, and hashing without deserializing the bytes back into objects.
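The examples in this and the following sections use a People JavaBean. The original post does not show its definition; a minimal sketch matching the setters used below (field types are an assumption) would be:
import java.io.Serializable;

// A plain JavaBean: Encoders.bean relies on the getter/setter pairs
public class People implements Serializable {
    private String name;
    private long age;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public long getAge() { return age; }
    public void setAge(long age) { this.age = age; }
}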
SparkSession spark = SparkSession.builder().master("local")
.appName("Java Spark SQL basic example")
.config("spark.some.config.option","some-value").getOrCreate();
People people = new People();
people.setName("Andy");
people.setAge(32);
// Create an encoder for the JavaBean
Encoder<People> bean = Encoders.bean(People.class);
Dataset<People> dataset = spark.createDataset(Collections.singletonList(people), bean);
dataset.show();
+---+----+
|age|name|
+---+----+
| 32|Andy|
+---+----+
// The Encoders class provides encoders for most common types
Encoder<Integer> integerEncoder = Encoders.INT();
Dataset<Integer> primitiveDS = spark.createDataset(Arrays.asList(1, 2, 3), integerEncoder);
Dataset<Integer> transformedDS = primitiveDS.map(
        (MapFunction<Integer, Integer>) value -> value + 1, integerEncoder);
transformedDS.collect(); // returns [2, 3, 4]
// Convert a DataFrame to a Dataset<People> by supplying the bean encoder
Dataset<People> df = spark.read().json("src/main/resources/people.json").as(bean);
df.show();
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
7. Interoperating with RDDs
Spark SQL supports two methods for converting existing RDDs into Datasets: the first uses reflection to infer the schema, and the second constructs the schema through a programmatic interface (see section 8). The reflection-based approach is shown here:
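The input file people.txt is the one shipped with the official Spark examples (examples/src/main/resources/people.txt); assuming the standard copy, it contains:
Michael, 29
Andy, 30
Justin, 19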
SparkSession spark = SparkSession.builder().master("local")
.appName("Java Spark SQL basic example")
.config("spark.some.config.option","some-value").getOrCreate();
// Create an RDD of People objects from a text file
JavaRDD<People> peopleRDD = spark.read().textFile("src/main/resources/people.txt").javaRDD().map(line->{
String[] parts = line.split(", ");
People people = new People();
people.setName(parts[0]);
people.setAge(Integer.parseInt(parts[1].trim()));
return people;
});
// Apply the schema inferred from the bean class to the RDD to obtain a DataFrame
Dataset<Row> peopleDF = spark.createDataFrame(peopleRDD, People.class);
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people");
// Run a SQL query
Dataset<Row> teenagersDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");
// Access a Row's columns by index
Encoder<String> stringEncoder = Encoders.STRING();
Dataset<String> teenagerNamesByIndexDF = teenagersDF.map(
        (MapFunction<Row, String>) row -> "Name:" + row.getString(0), stringEncoder);
teenagerNamesByIndexDF.show();
+-----------+
| value|
+-----------+
|Name:Justin|
+-----------+
// Access a Row's columns by field name
Dataset<String> teenagerNamesByFieldDF = teenagersDF.map(
        (MapFunction<Row, String>) row -> "Name:" + row.getAs("name"), stringEncoder);
teenagerNamesByFieldDF.show();
+-----------+
| value|
+-----------+
|Name:Justin|
+-----------+
8. Programmatically Specifying the Schema
- Create an RDD of Rows from the original RDD
- Create the schema, a StructType of StructFields matching the structure of the Rows from the previous step
- Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession
This amounts to building the DataFrame yourself from the raw data:
SparkSession spark = SparkSession.builder().master("local")
.appName("Java Spark SQL basic example")
.config("spark.some.config.option","some-value").getOrCreate();
JavaRDD<String> peopleRDD = spark.sparkContext().textFile("src/main/resources/people.txt", 1).toJavaRDD();
String schemaString = "name age";
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
    StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
    fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);
// Convert lines of the RDD to Rows
JavaRDD<Row> rowRDD = peopleRDD.map((Function<String, Row>) record -> {
String[] attributes = record.split(",");
return RowFactory.create(attributes[0], attributes[1].trim());
});
Dataset<Row> peopleDataFrame = spark.createDataFrame(rowRDD, schema);
// Create a temporary view
peopleDataFrame.createOrReplaceTempView("people");
// Run a SQL query
Dataset<Row> result = spark.sql("SELECT * FROM people");
Dataset<String> name = result.map((MapFunction<Row, String>) row -> "Name:" + row.get(0), Encoders.STRING());
name.show();
+------------+
| value|
+------------+
|Name:Michael|
| Name:Andy|
| Name:Justin|
+------------+
9. Scalar Functions
Scalar functions operate on a single row at a time and return a single value for that row.
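A minimal sketch using the df loaded from people.json earlier and Spark's built-in scalar functions (the column choice is illustrative):
import static org.apache.spark.sql.functions.*;

// upper() produces exactly one output value per input row
df.select(col("name"), upper(col("name"))).show();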
10. Aggregate Functions
Aggregate functions perform a calculation over a group of rows and return a single value, e.g. count(), avg(), max().
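A minimal sketch, again assuming the df loaded from people.json above:
import static org.apache.spark.sql.functions.*;

// Each aggregate collapses all rows into a single value;
// with the sample data: count = 3, avg = 24.5, max = 30
df.agg(count("name"), avg("age"), max("age")).show();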
Summary
This post walked through the introductory Spark SQL examples; they all come from the official documentation, and the sample data files ship with the official Spark distribution.