RDD
Creating RDDs
- Read a file: sc.textFile
- Parallelize a collection: sc.parallelize
- Other ways
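The two common creation paths above can be sketched as follows (the file path and the collection contents are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CreateRDDExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("create-rdd").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1. Read a text file: one RDD element per line
    val lines = sc.textFile("examples/src/main/resources/people.txt")

    // 2. Parallelize an in-memory collection (numSlices controls partitioning)
    val nums = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)

    println(nums.count())  // 5
    sc.stop()
  }
}
```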
RDD Operations
- Transformation
- union
- intersection
- distinct
- groupByKey
- reduceByKey
- sortByKey
- join leftOuterJoin rightOuterJoin
- aggregate
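A sketch of the transformations listed above, assuming an existing SparkContext `sc` (e.g. in spark-shell); `collect()` is added only to materialize results for inspection:

```scala
val a = sc.parallelize(Seq(1, 2, 3, 3))
val b = sc.parallelize(Seq(3, 4, 5))

a.union(b).collect()         // 1, 2, 3, 3, 3, 4, 5 (keeps duplicates)
a.intersection(b).collect()  // 3
a.distinct().collect()       // 1, 2, 3 (order not guaranteed)

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.groupByKey().collect()        // ("a", [1, 3]), ("b", [2])
pairs.reduceByKey(_ + _).collect()  // ("a", 4), ("b", 2) -- combines map-side, usually preferred over groupByKey
pairs.sortByKey().collect()         // ("a", 1), ("a", 3), ("b", 2)

val other = sc.parallelize(Seq(("a", "x")))
pairs.join(other).collect()           // ("a", (1, "x")), ("a", (3, "x"))
pairs.leftOuterJoin(other).collect()  // "b" is kept as ("b", (2, None))
pairs.rightOuterJoin(other).collect() // only keys present in `other` survive

// aggregate(zeroValue)(withinPartitionOp, acrossPartitionOp)
a.aggregate(0)(_ + _, _ + _)  // 9
```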
- Action
- reduce
- count
- first
- take
- takeSample
- takeOrdered
- saveAsTextFile
- countByKey
- foreach
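The actions above can be sketched the same way, again assuming an existing SparkContext `sc`; the output directory is a placeholder:

```scala
val nums = sc.parallelize(Seq(3, 1, 2))

nums.reduce(_ + _)   // 6
nums.count()         // 3
nums.first()         // first element of the first partition
nums.take(2)         // first two elements
nums.takeSample(withReplacement = false, num = 2)  // 2 random elements
nums.takeOrdered(2)  // Array(1, 2): the two smallest
nums.saveAsTextFile("/tmp/nums-out")  // placeholder output directory

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
pairs.countByKey()   // Map(a -> 2, b -> 1)

nums.foreach(println)  // note: runs on the executors, not the driver
```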
DataFrame
DataSet
DataFrame to RDD
Difference | RDD | DataFrame | DataSet |
---|---|---|---|
Spark SQL support | Not supported | Supported | Supported |
Element type | RDD[T] | Dataset[Row] (untyped rows) | Dataset[T] (strongly typed) |
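The "Dataset[Row]" entry is literal: since Spark 2.x, DataFrame is defined as a type alias in the org.apache.spark.sql package object:

```scala
// Spark's own definition (org.apache.spark.sql package object, Spark 2.x+)
type DataFrame = Dataset[Row]
```

So every DataFrame operation is really a Dataset operation over untyped Row elements, while Dataset[T] keeps the element type at compile time.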
Conversions between them
From \ To | RDD | DataFrame | DataSet |
---|---|---|---|
RDD | - | case class Person(name: String, age: String); val rdd = sc.textFile(""); val df = rdd.map(_.split(",")).map(line => Person(line(0), line(1))).toDF | case class Person(name: String, age: String); val rdd = sc.textFile(""); val ds = rdd.map(_.split(",")).map(line => Person(line(0), line(1))).toDS |
DataFrame | val rdd1 = testDF.rdd | - | val testDS = testDF.as[Coltest] |
DataSet | val rdd2 = testDS.rdd | val testDF = testDS.toDF | - |
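The table's snippets can be combined into one runnable sketch (the people.txt path and the field types are assumptions). Note that `import spark.implicits._` is required for toDF, toDS, and as[T], and the case class must be defined outside the method that uses it:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

object Conversions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("conversions").master("local[*]").getOrCreate()
    import spark.implicits._  // enables toDF, toDS and as[T]

    // RDD -> DataFrame / DataSet
    val rdd = spark.sparkContext.textFile("people.txt")  // assumed lines like "Alice,30"
    val people = rdd.map(_.split(",")).map(f => Person(f(0), f(1).trim.toLong))
    val df = people.toDF()        // DataFrame = Dataset[Row]
    val ds = people.toDS()        // Dataset[Person]

    // DataFrame -> RDD / DataSet
    val rowRdd  = df.rdd          // RDD[Row]: compile-time column types are lost
    val typedDs = df.as[Person]   // back to a typed Dataset

    // DataSet -> RDD / DataFrame
    val personRdd = ds.rdd        // RDD[Person]: element type is kept
    val backToDf  = ds.toDF()

    spark.stop()
  }
}
```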
Data Types (MLlib)
- local vector (dense, sparse)
- labeled point
- LabeledPoint to LIBSVM format
- local matrix
- distributed matrix
- Row matrix
- IndexedRowMatrix
- CoordinateMatrix
- BlockMatrix
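These RDD-based MLlib types can be sketched as follows, assuming an existing SparkContext `sc`; the LIBSVM output path is a placeholder:

```scala
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Local vectors: dense vs. sparse representation of (1.0, 0.0, 3.0)
val dense  = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))  // size, indices, values

// Labeled point: a feature vector plus a label, for supervised learning
val pos = LabeledPoint(1.0, dense)

// LabeledPoint -> LIBSVM file
val points = sc.parallelize(Seq(pos))
MLUtils.saveAsLibSVMFile(points, "/tmp/libsvm-out")  // placeholder path

// Local matrix: 3x2, values stored in column-major order
val m = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

// Distributed matrix: a RowMatrix backed by an RDD of vectors
val rowMat = new RowMatrix(sc.parallelize(Seq(dense, sparse)))
```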
Input File Formats
json, parquet, jdbc, orc, libsvm, csv, text
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
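The same DataFrameReader handles the other formats listed; a sketch with placeholder paths and a hypothetical JDBC connection:

```scala
val parquetDF = spark.read.parquet("data.parquet")
val csvDF     = spark.read.format("csv").option("header", "true").load("data.csv")
val orcDF     = spark.read.orc("data.orc")
val textDF    = spark.read.text("data.txt")   // one string column named "value"
val libsvmDF  = spark.read.format("libsvm").load("data.libsvm")
val jdbcDF    = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost/test")  // hypothetical connection
  .option("dbtable", "people")
  .load()
```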