1: Creating a DataFrame dynamically
Sample data for testing
李三 男 15
李四 女 16
王五 人妖 17
赵六 神 18
Code ********************
Step 1: Create the required objects
final SparkConf conf = new SparkConf();
conf.setMaster("local");
conf.setAppName("test");
final JavaSparkContext jsc = new JavaSparkContext(conf);
final SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
Step 2: The core code
final JavaRDD<String> aRdd = jsc.textFile("D:/a.txt");
final JavaRDD<Row> rowRdd = aRdd.map(new Function<String, Row>()
{
@Override
public Row call(String v1) throws Exception
{
// Split the line once instead of three times
final String[] parts = v1.split(" ");
return RowFactory.create(
parts[0],
parts[1],
Integer.parseInt(parts[2]));
}
});
final List<StructField> list = Arrays.asList(
DataTypes.createStructField("name", DataTypes.StringType, true),
DataTypes.createStructField("sex", DataTypes.StringType, true),
DataTypes.createStructField("age", DataTypes.IntegerType, true)
);
final StructType schema = DataTypes.createStructType(list);
final Dataset<Row> dataFrame = spark.createDataFrame(rowRdd, schema);
dataFrame.show();
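The parsing step inside `map()` above can be sketched in plain Java, independent of Spark. `ParseDemo` and `parseLine` are illustrative names, not part of the Spark API; the sample line comes from the test data above.

```java
public class ParseDemo {
    // Mirrors the Row-building logic: split a space-separated line
    // into (name, sex, age), converting age to an int.
    static Object[] parseLine(String line) {
        String[] parts = line.split(" ");
        return new Object[] { parts[0], parts[1], Integer.parseInt(parts[2]) };
    }

    public static void main(String[] args) {
        Object[] row = parseLine("李三 男 15");
        System.out.println(row[0] + "," + row[1] + "," + row[2]);
    }
}
```

Splitting once and indexing into the result avoids re-splitting the same line three times per record.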
Notes on the dynamic-creation mechanism: ***********************
1. The field order in each Row must match the order of the StructFields in the dynamically created Schema
2. The columns of the resulting DataFrame are not sorted by ASCII order (they keep the schema order)
Step 3: Extension code — register the DataFrame as a 'table' and run SQL
dataFrame.createTempView("staff");
final Dataset<Row> sql = spark.sql("select * from staff where age>16");
sql.show();
sql.printSchema();
sql.foreach(new ForeachFunction<Row>()
{
@Override
public void call(Row row) throws Exception
{
System.out.println(row.get(0));
System.out.println(row.get(1));
System.out.println(row.get(2));
}
});
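The `where age > 16` filter above can be sanity-checked in plain Java against the sample data. `FilterDemo` is an illustrative name; the stream pipeline below stands in for what Spark SQL does over the registered view.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilterDemo {
    public static void main(String[] args) {
        // The four sample rows from the test data
        List<String> lines = Arrays.asList(
                "李三 男 15", "李四 女 16", "王五 人妖 17", "赵六 神 18");
        // Equivalent of: select name from staff where age > 16
        List<String> names = lines.stream()
                .map(l -> l.split(" "))
                .filter(p -> Integer.parseInt(p[2]) > 16)
                .map(p -> p[0])
                .collect(Collectors.toList());
        System.out.println(names);
    }
}
```

Only the two rows with age 17 and 18 pass the filter, matching what `sql.show()` prints.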
2: Creating a DataFrame via reflection
Sample data for testing
a b c
a b c
a a a
d e f
a b c
Create a helper class to serve as the reflection bean.
public class Word implements Serializable
{
private String word;
public String getWord() { return word; }
public void setWord(String word) { this.word = word; }
}
Code ********************
Step 1: Create the required objects
final SparkConf conf = new SparkConf();
conf.setMaster("local");
conf.setAppName("test");
final JavaSparkContext jsc = new JavaSparkContext(conf);
final SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
Step 2: The core code
final JavaRDD<String> rdd = jsc.textFile("D:/月考.txt");
final JavaRDD<String> flat_rdd = rdd.flatMap(new FlatMapFunction<String, String>()
{
@Override
public Iterator<String> call(String s) throws Exception
{
// Split once and emit the first three tokens as separate records
final String[] parts = s.split(" ");
return Arrays.asList(parts[0], parts[1], parts[2]).iterator();
}
});
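The flatMap step above turns each line into three separate word records. The same one-to-many expansion can be sketched with plain Java streams (`FlatMapDemo` is an illustrative name):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FlatMapDemo {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b c", "a a a");
        // Each input line expands into three word records,
        // analogous to the Spark flatMap above.
        List<String> words = lines.stream()
                .flatMap(l -> Arrays.stream(l.split(" ")))
                .collect(Collectors.toList());
        System.out.println(words);
    }
}
```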
-------------------------------------------------------------
final JavaRDD<Word> mapRdd = flat_rdd.map(new Function<String, Word>() {
// serialVersionUID belongs on the class, not inside call()
static final long serialVersionUID = 5863629841241187531L;

@Override
public Word call(String v1) throws Exception {
final Word word = new Word();
word.setWord(v1);
return word;
}
});
**Final step**
final Dataset<Row> dataFrame = spark.createDataFrame(mapRdd, Word.class);
Notes on the reflection-based mechanism: ***********************
1. The bean class must implement the Serializable interface
2. The columns of the resulting DataFrame are sorted by ASCII (field-name) order
3. The bean class must be declared public
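Point 2 can be illustrated without Spark: sorting the bean's field names lexicographically predicts the column order Spark produces when it infers the schema by reflection. This is only a simulation of that behavior, not a Spark call; for a bean with fields name, sex, and age:

```java
import java.util.Arrays;

public class ColumnOrderDemo {
    public static void main(String[] args) {
        // Field names of a hypothetical bean with name/sex/age properties
        String[] fields = { "name", "sex", "age" };
        // Lexicographic (ASCII) sort — the order reflection-based
        // schema inference yields for the DataFrame's columns
        Arrays.sort(fields);
        System.out.println(Arrays.toString(fields));
    }
}
```

So a schema built this way shows `age` first even though it was declared last — unlike the dynamic-schema approach in section 1, which preserves the declared order.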