WeChat Official Account: 大数据开发运维架构 (Big Data Development, Operations & Architecture)
Background review:
Spark supports two ways to convert an RDD into a DataFrame:
1. Reflection: define the schema in a separate class and let Spark infer the DataFrame structure from it. This approach is simple, but it has a historical limitation: before Scala 2.11, a case class supported at most 22 fields, so beyond that you had to write your own class implementing the Product trait.
2. Programmatic construction: build a StructType yourself through the programmatic API and use it to convert the RDD into the corresponding DataFrame. This is slightly more work; the official guide lists roughly three steps:
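For comparison, here is a minimal sketch of the reflection approach in Java, where a JavaBean plays the role of the Scala case class (the `Student` class and field names are illustrative, not from the original article):

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class ReflectionExample {
    // JavaBean whose getters/setters define the schema via reflection
    public static class Student implements Serializable {
        private String name;
        private int score;
        public Student() {}
        public Student(String name, int score) { this.name = name; this.score = score; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getScore() { return score; }
        public void setScore(int score) { this.score = score; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("ReflectionExample");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        JavaRDD<Student> rdd = sc.parallelize(Arrays.asList(
                new Student("ljs1", 85), new Student("ljs2", 99)));

        // Spark infers the StructType from the bean class
        Dataset<Row> df = sqlContext.createDataFrame(rdd, Student.class);
        df.show();
        sc.stop();
    }
}
```

Note that Java bean reflection does not suffer from the 22-field case-class limit, but the bean must be public, serializable, and have getters/setters for every column.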
![172c3f2f4667722c7aeb683e5a4dfc81.png](https://i-blog.csdnimg.cn/blog_migrate/da9fccc55f524b214b7b8c1b5d2c30c3.jpeg)
Roughly translated, the steps are:
1. Create an RDD of Rows (a JavaRDD<Row>) from the original RDD.
2. Define a StructType that matches the layout of those Rows.
3. Apply the StructType to the RDD via createDataFrame to obtain the DataFrame.
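The three steps above can be sketched as a minimal, self-contained program (class and column names here are illustrative; the full worked example follows below):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class StructTypeExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("StructTypeExample");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Step 1: an RDD of Rows
        JavaRDD<Row> rows = sc.parallelize(Arrays.asList(
                RowFactory.create("ljs1", 85),
                RowFactory.create("ljs2", 99)));

        // Step 2: a StructType matching the Row layout
        List<StructField> fields = Arrays.asList(
                DataTypes.createStructField("name", DataTypes.StringType, true),
                DataTypes.createStructField("score", DataTypes.IntegerType, true));
        StructType schema = DataTypes.createStructType(fields);

        // Step 3: apply the schema
        Dataset<Row> df = sqlContext.createDataFrame(rows, schema);
        df.show();
        sc.stop();
    }
}
```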
Data preparation:
The first JSON file, student.json (Spark's JSON reader expects one object per line):
{"name":"ljs1","score":85}
{"name":"ljs2","score":99}
{"name":"ljs3","score":74}
The second JSON dataset is built directly in the code (the studentJsons list below), so no extra file is needed.
Code example:
```java
package com.unicom.ljs.spark220.study;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.List;

/**
 * @author: Created By lujisen
 * @company ChinaUnicom Software JiNan
 * @date: 2020-01-28 21:08
 * @version: v1.0
 * @description: com.unicom.ljs.spark220.study
 */
public class JoinJsonData {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setMaster("local[*]").setAppName("JoinJsonData");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        SQLContext sqlContext = new SQLContext(sc);

        // 1. Read the first JSON file and register it as a temp table
        Dataset<Row> studentDS = sqlContext.read().json("D:\\data\\ML\\spark1\\student.json");
        studentDS.registerTempTable("student_score");
        Dataset<Row> studentNameScoreDS =
                sqlContext.sql("select name,score from student_score where score > 82");

        List<String> studentNameList = studentNameScoreDS.javaRDD()
                .map(new Function<Row, String>() {
                    @Override
                    public String call(Row row) {
                        return row.getString(0);
                    }
                }).collect();
        System.out.println(studentNameList.toString());

        // 2. Build the second JSON dataset directly in code
        List<String> studentJsons = new ArrayList<>();
        studentJsons.add("{\"name\":\"ljs1\",\"age\":18}");
        studentJsons.add("{\"name\":\"ljs2\",\"age\":17}");
        studentJsons.add("{\"name\":\"ljs3\",\"age\":19}");
        JavaRDD<String> studentInfos = sc.parallelize(studentJsons);
        Dataset<Row> studentAgeDS = sqlContext.read().json(studentInfos);
        studentAgeDS.show();
        studentAgeDS.registerTempTable("student_age");

        // 3. Query ages for the names that passed the score filter
        String sql2 = "select name,age from student_age where name in (";
        for (int i = 0; i < studentNameList.size(); i++) {
            sql2 += "'" + studentNameList.get(i) + "'";
            if (i < studentNameList.size() - 1) {
                sql2 += ",";
            }
        }
        sql2 += ")";
        Dataset<Row> studentNameAgeDS = sqlContext.sql(sql2);

        // 4. Join the two datasets on name: (name, (score, age))
        JavaPairRDD<String, Tuple2<Integer, Integer>> studentNameScoreAge =
                studentNameScoreDS.toJavaRDD().mapToPair(new PairFunction<Row, String, Integer>() {
                    @Override
                    public Tuple2<String, Integer> call(Row row) throws Exception {
                        return new Tuple2<>(row.getString(0), (int) row.getLong(1));
                    }
                }).join(studentNameAgeDS.toJavaRDD().mapToPair(new PairFunction<Row, String, Integer>() {
                    @Override
                    public Tuple2<String, Integer> call(Row row) throws Exception {
                        return new Tuple2<>(row.getString(0), (int) row.getLong(1));
                    }
                }));

        JavaRDD<Row> studentNameScoreAgeRow = studentNameScoreAge.map(
                new Function<Tuple2<String, Tuple2<Integer, Integer>>, Row>() {
                    @Override
                    public Row call(Tuple2<String, Tuple2<Integer, Integer>> v1) throws Exception {
                        return RowFactory.create(v1._1, v1._2._1, v1._2._2);
                    }
                });

        // 5. Define the schema with StructType and create the final DataFrame
        List<StructField> structFields = new ArrayList<>();
        structFields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
        structFields.add(DataTypes.createStructField("score", DataTypes.IntegerType, true));
        structFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
        StructType structType = DataTypes.createStructType(structFields);

        Dataset<Row> dataFrame = sqlContext.createDataFrame(studentNameScoreAgeRow, structType);
        dataFrame.show();
        dataFrame.write().format("json").mode(SaveMode.Append)
                .save("D:\\data\\ML\\spark1\\studentNameScoreAge");
    }
}
```