Definition
type DataFrame = org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
Creating a DataFrame
Creating with the read methods
Call the read method to obtain a DataFrameReader object, then call the reader method that matches the file you want to load.
val df: DataFrame = spark.read.json("./data/emp.json")
Can read load the data directly as a Dataset? No: a Dataset needs a concrete type parameter, so you would first have to define a class (for example an Emp case class) that maps the file being read, and then convert the DataFrame with as, as sketched below.
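A minimal sketch of that pattern. The Emp class and its fields (name, salary) are assumptions for illustration, not the actual schema of emp.json:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical mapping class; the field names must match the JSON keys
case class Emp(name: String, salary: Double)

object ReadDatasetDemo extends App {
  val spark: SparkSession = SparkSession.builder()
    .appName("readdemo")
    .master("local")
    .getOrCreate()

  import spark.implicits._                      // supplies Encoder[Emp]

  val df: DataFrame = spark.read.json("./data/emp.json")
  val ds: Dataset[Emp] = df.as[Emp]             // DataFrame -> Dataset[Emp]
  ds.show()
}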
Creating with a JavaBean + reflection
Summary of steps
1. Create a Java class (the bean).
2. Create a List whose elements are instances of that class.
3. Call createDataFrame to build the DataFrame.
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameDemo extends App {
  val spark: SparkSession = SparkSession.builder()
    .appName("dfdemo")
    .master("local")
    .getOrCreate()

  // Build a list of Student beans
  val students = List(new Student(1001, "zs", "f", 23),
    new Student(1002, "ls", "m", 13),
    new Student(1004, "lq", "f", 29),
    new Student(1003, "kk", "m", 20))

  // Implicit conversions between Scala and Java collections
  // (createDataFrame expects a java.util.List here)
  import scala.collection.JavaConversions._

  // Infer the schema from the JavaBean via reflection and apply it
  private val frame: DataFrame = spark.createDataFrame(students, classOf[Student])
  frame.show()
}
The JavaBean class
import java.util.Objects;

/**
 * JavaBean conventions:
 * 1. Private fields
 * 2. Public getters and setters
 * 3. At least two constructors: a no-arg one and an all-args one
 * 4. Override equals/hashCode/toString
 */
public class Student {
    private int id;
    private String name;
    private String gender;
    private int age;

    public Student() {}

    public Student(int id, String name, String gender, int age) {
        this.id = id;
        this.name = name;
        this.gender = gender;
        this.age = age;
    }

    public int getId() {
        return id;
    }

    public String getName() {
        return name;
    }

    public String getGender() {
        return gender;
    }

    public int getAge() {
        return age;
    }

    public void setId(int id) {
        this.id = id;
    }

    public void setName(String name) {
        this.name = name;
    }

    public void setGender(String gender) {
        this.gender = gender;
    }

    public void setAge(int age) {
        this.age = age;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        Student student = (Student) o;
        return id == student.id &&
                age == student.age &&
                Objects.equals(name, student.name) &&
                Objects.equals(gender, student.gender);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, name, gender, age);
    }

    @Override
    public String toString() {
        return "Student{" +
                "id=" + id +
                ", name='" + name + '\'' +
                ", gender='" + gender + '\'' +
                ", age=" + age +
                '}';
    }
}
Creating with the programmatic (dynamic) approach
/**
 * Row: one record (row) of the table; essentially a generic, untyped object
 * StructType: the metadata of the whole table, i.e. a collection of StructFields
 * StructField: the metadata of a single field/column (mainly the column name, type, and nullability)
 */
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object DataFrameDemo1 extends App {
  val spark: SparkSession = SparkSession.builder()
    .appName("SparkSQL")
    .master("local")
    .getOrCreate()

  // One Row per record of the table
  private val value: RDD[Row] = spark.sparkContext.parallelize(List(
    Row(1, "李伟", 1, 180.0),
    Row(2, "汪松伟", 2, 179.0),
    Row(3, "常洪浩", 1, 183.0),
    Row(4, "麻宁娜", 0, 168.0)
  ))

  // The schema: name, type and nullability of every column
  private val structType = StructType(List(
    StructField("id", DataTypes.IntegerType, nullable = false),
    StructField("name", DataTypes.StringType, nullable = false),
    StructField("gender", DataTypes.IntegerType, nullable = false),
    StructField("height", DataTypes.DoubleType, nullable = false)
  ))

  val frame: DataFrame = spark.createDataFrame(value, structType)
  frame.show()
}
Converting a DataFrame to an RDD
Building on the example above, add the following code:
private val rdd: RDD[Row] = frame.rdd
rdd.foreach(println)
The output is as follows:
[1,李伟,1,180.0]
[2,汪松伟,2,179.0]
[3,常洪浩,1,183.0]
[4,麻宁娜,0,168.0]
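Each element of the resulting RDD is a Row, so values have to be pulled out by position or by column name. A small follow-up sketch, still inside the same object and reusing the schema defined above:

// Extract typed values: by column name with getAs[T], or by index with getDouble
private val namesAndHeights: RDD[(String, Double)] = rdd.map { row =>
  (row.getAs[String]("name"), row.getDouble(3))
}
namesAndHeights.foreach(println)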
Converting a DataFrame to a Dataset
The as method does the conversion: supply a type parameter (usually a case class whose fields match the schema) together with an implicit Encoder for it.
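A minimal sketch building on the programmatic example above. The Person case class is my assumption, chosen to match the StructType columns (id, name, gender, height); it should be defined at the top level, and spark.implicits._ supplies the required Encoder:

import org.apache.spark.sql.Dataset

// Defined outside the object, matching the StructType column names and types
case class Person(id: Int, name: String, gender: Int, height: Double)

// Inside DataFrameDemo1, after frame has been created:
import spark.implicits._                          // supplies Encoder[Person]
private val ds: Dataset[Person] = frame.as[Person]
ds.show()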