API:
Mock data:
// constructed in code:
var seq = Seq(
Person("1", "one"),
Person("2", "two"),
Person("3", "three"),
Person("4", "four", "sensitive"),
Person("5", "five", "sensitive"),
Person("6", "size", "sensitive")
)
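The Person class itself is not defined in these notes. A minimal sketch of what the snippets below assume (the field name tpe, its default value, and the local-mode SparkSession are guesses; name is a var because the Dataset example mutates it):

// assumed definition: a defaulted third field lets both the 2-arg and 3-arg calls above compile
case class Person(id: String, var name: String, tpe: String = "normal")

// assumed session setup; spark.implicits._ supplies the encoders for createDataset
// and the $"..." column syntax used further down
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("api-demo").master("local[*]").getOrCreate()
import spark.implicits._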
Data computation:
RDD:
var rdd1 = spark.sparkContext.parallelize(seq)
println("rdd1.getNumPartitions", rdd1.getNumPartitions)
rdd1.take(10).foreach(println)
println("rdd1 take() end")

// print the transformed RDD (rdd2; its definition is omitted here)
rdd2.take(10).foreach(println)

DF (DataFrame):
var df1 = spark.createDataFrame(seq)
df1.show()
println("df1 show() end")

// Row-based mapPartitions needs an explicit Encoder[Row]
import org.apache.spark.sql.{Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
implicit var rowEncoder: Encoder[Row] = ExpressionEncoder()
var df3 = df1.repartition($"tpe").mapPartitions(rows => {
  val list = new scala.collection.mutable.ArrayBuffer[Row]()
  rows.foreach(row => {
    list += Row(row(0), row(1) + " changed by df", row(2))
  })
  list.iterator
})
df3.explain()
df3.show()

DS (Dataset):
var ds1 = spark.createDataset(seq)
ds1.show()
ds1.explain()
println("ds1.rdd.getNumPartitions", ds1.rdd.getNumPartitions)
println("ds1 show() end")

// the typed mapPartitions works on Person objects directly, no Row or extra encoder needed
import scala.collection.mutable.ListBuffer
var ds3 = ds1.repartition(10).mapPartitions(people => {
  val list = new ListBuffer[Person]()
  people.foreach(p => {
    p.name = "changed name"
    list += p
  })
  list.iterator
})
ds3.explain()
ds3.show()
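For reference, the three representations convert into one another with standard calls (rdd1/df1/ds1 are the values created above; spark.implicits._ must be in scope):

// DataFrame/Dataset expose their underlying RDD directly
val rddFromDf = df1.rdd        // RDD[Row]
val rddFromDs = ds1.rdd        // RDD[Person]

// Dataset <-> DataFrame
val dfFromDs = ds1.toDF()      // Dataset[Row]
val dsFromDf = df1.as[Person]  // Dataset[Person]

// RDD of case-class instances -> DataFrame / Dataset
val dfFromRdd = rdd1.toDF()
val dsFromRdd = rdd1.toDS()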
Input/Output
Distributed input and output go through the read/write objects of a DS/DF.
Officially supported sources: files and JDBC, see Data Sources - Spark 3.5.1 Documentation.
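A minimal sketch of the built-in file and JDBC paths (the output path, JDBC URL, table name, and credentials are placeholders):

// file sources: Parquet shown here; csv/json/orc follow the same pattern
ds1.write.mode("overwrite").parquet("/tmp/person_parquet")
val fromParquet = spark.read.parquet("/tmp/person_parquet")

// JDBC source
val fromJdbc = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/demo")
  .option("dbtable", "person")
  .option("user", "root")
  .option("password", "secret")
  .load()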
Third-party connectors also cover Redis, Excel, and others, for example:
<dependency>
<groupId>com.redislabs</groupId>
<artifactId>spark-redis_2.12</artifactId>
<version>3.1.0</version>
</dependency>
<dependency>
<groupId>com.crealytics</groupId>
<artifactId>spark-excel_2.11</artifactId>
<version>0.13.7</version>
</dependency>
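With those artifacts on the classpath (the Scala suffix should match your build; spark-excel_2.11 predates Spark 3.x, so a _2.12 build would be needed there), both connectors plug into the same read/write API. A sketch based on their documented data source names; the Redis table name, key column, and Excel path are placeholders:

// spark-redis: host/port come from the SparkSession config (spark.redis.host, spark.redis.port)
ds1.write
  .format("org.apache.spark.sql.redis")
  .option("table", "person")
  .option("key.column", "id")
  .mode("overwrite")
  .save()

// spark-excel: read an .xlsx file into a DataFrame
val fromExcel = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .load("/tmp/person.xlsx")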