Spark Basics
- 0. Jobs, stages, and tasks
- One RDD partition corresponds to one task
- Each action encountered produces one job
- Each shuffle encountered splits off a new stage
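A minimal sketch of how these rules show up in code (the path, output directory, and partition count are illustrative; `sc` is an existing SparkContext):
val rdd = sc.textFile("data.txt", 4)   // e.g. 4 partitions => 4 tasks in the first stage
val counts = rdd
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)                  // shuffle here: a stage boundary inside each job

counts.count()                         // action #1 => job #1
counts.saveAsTextFile("out")           // action #2 => job #2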
- 1. Creating RDDs
- Load data from a local file or from HDFS
val rdd = sc.textFile("path/to/file")
- Create manually
val rdd = sc.parallelize(Array(1,2,3,4,5))
val rdd = sc.parallelize(List(1,2,3,4,5))
- Convert from an existing RDD
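For context, a minimal setup sketch showing where `sc` comes from (the app name, master, and paths are placeholders; in spark-shell, `sc` is provided automatically):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("RddBasics").setMaster("local[*]")
val sc = new SparkContext(conf)

val fromFile  = sc.textFile("hdfs://localhost:9000/user/data.txt")  // HDFS or a local path
val fromArray = sc.parallelize(Array(1, 2, 3, 4, 5))
val fromList  = sc.parallelize(List(1, 2, 3, 4, 5))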
- 2. RDD operations
- Transformations
- Transformations are lazy: they are recorded but not executed immediately
- filter(line => line.contains("Spark"))
- map(line => (line,1))
- flatMap(line => line.split(" "))
- reduceByKey((x,y) => x+y)
- groupByKey().map(x => (x._1, x._2.sum))
- sortByKey(false) / sortBy(_._2, false): sort by key / sort by a custom field (false means descending)
- mapValues
- join
- …
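Combining several of the transformations above into a word-count pipeline (the path is a placeholder; `sc` is an existing SparkContext):
val lines = sc.textFile("README.md")                  // placeholder path
val sparkLines = lines.filter(line => line.contains("Spark"))
val counts = lines
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((x, y) => x + y)
val byCount = counts.sortBy(_._2, ascending = false)  // most frequent words first
// Nothing has run yet: all of the above are lazy transformations.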
- Actions
- Actions actually trigger the computation
- count
- first
- collect
- take
- reduce
- foreach: no return value (runs on the executors)
- …
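A quick tour of the actions above (results shown in comments):
val nums = sc.parallelize(List(1, 2, 3, 4, 5))
nums.count()           // 5
nums.first()           // 1
nums.take(3)           // Array(1, 2, 3)
nums.reduce(_ + _)     // 15
nums.collect()         // Array(1, 2, 3, 4, 5): pulls everything to the driver
nums.foreach(println)  // runs on the executors, returns Unit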
- Persistence
- cache (in-memory persistence, equivalent to persist(StorageLevel.MEMORY_ONLY))
- persist (takes an explicit StorageLevel)
- unpersist (release the persisted data)
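A sketch of the persistence calls (the path is a placeholder, and MEMORY_AND_DISK is just one StorageLevel choice):
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("app.log")               // placeholder path
val errors = logs.filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)    // cache() would mean MEMORY_ONLY
errors.count()                                  // first action materializes the cache
errors.take(10)                                 // reuses the cached partitions
errors.unpersist()                              // release the cached blocks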
- RDD partitioning
- Increases the program's parallelism and reduces communication overhead
- repartition (reshuffles into the requested number of partitions; always a full shuffle)
- coalesce (reduces the partition count; avoids a full shuffle by default)
- Custom partitioning (a usage sketch follows the code below)
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Route each (key, value) pair to a partition by the last digit of the key.
class MyPartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int = {
    key.toString.toInt % numPartitions
  }
}
object TestPartitioner {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestPartitioner")  // an app name is required
    val sc = new SparkContext(conf)
    val data = sc.parallelize(1 to 10, 5)
    data.map((_, 1))                          // partitionBy needs a (key, value) RDD
      .partitionBy(new MyPartitioner(10))
      .map(_._1)
      .saveAsTextFile(".....")                // "....." is a placeholder output path
  }
}
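A short usage sketch of the partitioning APIs above (sizes and counts are illustrative):
val data = sc.parallelize(1 to 100, 5)
data.getNumPartitions                      // 5
val more  = data.repartition(10)           // full shuffle into 10 partitions
val fewer = data.coalesce(2)               // merges down to 2 partitions, no full shuffle
val byLastDigit = data.map((_, 1)).partitionBy(new MyPartitioner(10))
byLastDigit.getNumPartitions               // 10: each key lands in partition key % 10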
- 3. Converting an RDD to a DataFrame
- Reflection-based approach
1. Import the implicit conversions
import spark.implicits._
2. Define a case class (declaring the column names and their types)
case class Person(name: String, age: Int)
3. Convert
rdd.map(line => Person(line(0), line(1).trim.toInt)).toDF()
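Putting the three steps together, a runnable sketch assuming a text file of "name,age" lines (the file name and contents are illustrative):
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)   // keep top-level in compiled applications

val spark = SparkSession.builder().appName("RddToDF").getOrCreate()
import spark.implicits._

val peopleDF = spark.sparkContext
  .textFile("people.txt")                   // e.g. lines like "Alice,29"
  .map(_.split(","))
  .map(attrs => Person(attrs(0), attrs(1).trim.toInt))
  .toDF()
peopleDF.show()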
- Programmatic approach
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
1. Build the schema (the table header)
val fields = Array(StructField("name", StringType, true), StructField("age", IntegerType, true))
val schema = StructType(fields)
2. Create the Row records
val rowRDD = rdd.map(line => Row(line(0),line(1).trim.toInt))
3. Convert
val peopleDF = spark.createDataFrame(rowRDD,schema)
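And the programmatic version end to end (again assuming an existing SparkSession `spark` and a "people.txt" of "name,age" lines):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

val rowRDD = spark.sparkContext
  .textFile("people.txt")
  .map(_.split(","))
  .map(attrs => Row(attrs(0), attrs(1).trim.toInt))

val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.printSchema()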