Spark Core
The five key properties of an RDD:
- A list of partitions: the RDD is made up of a series of partitions
- A function for computing each split: the computation is applied to each partition
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
One partition corresponds to one task.
scala> val L = Array(1,2,3,4,5)
L: Array[Int] = Array(1, 2, 3, 4, 5)
scala> sc.parallelize(L,3).collect
res7: Array[Int] = Array(1, 2, 3, 4, 5)
scala> sc.parallelize(L,2).collect
res8: Array[Int] = Array(1, 2, 3, 4, 5)
In the Spark UI, the corresponding jobs will show 3 tasks and 2 tasks respectively.
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.
/**
 * In spark-shell on the server there is no need to create a SparkContext yourself; sc is already defined automatically.
 * Any parameter you want to pass into Spark must use a key that starts with "spark."
 * setAppName and setMaster already add the "spark." prefix under the hood.
 */
package spark

import org.apache.spark.{SparkConf, SparkContext}

object RDDApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("App01").setMaster("local")
    val sc = new SparkContext(conf)
    println(sc.appName)
    println(sc.master)
    sc.stop()
  }
}
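For example, a custom parameter can be passed through SparkConf.set as long as the key carries the "spark." prefix (the key and value below are only an illustration):
val conf = new SparkConf()
  .setAppName("App01")
  .setMaster("local")
  .set("spark.executor.memory", "1g")  // custom parameters must use a "spark."-prefixed key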
Creating an RDD:
1. sc.parallelize
scala> val L = Array(1,2,3,4,5)
L: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val rdd = sc.parallelize(L)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:26
scala> rdd.reduce(_+_)
res1: Int = 15
2. External Datasets
- sc.textFile(path) returns a MapPartitionsRDD
scala> sc.textFile("/home/wzj/data/test.txt")
res0: org.apache.spark.rdd.RDD[String] = /home/wzj/data/test.txt MapPartitionsRDD[1] at textFile at <console>:25
scala> sc.textFile("/home/wzj/data/test.txt").collect
res1: Array[String] = Array(study,study,study, no,no, i)
- sc.wholeTextFiles(path) returns (path, content) pairs
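A rough sketch of wholeTextFiles, assuming the directory contains the test.txt used above (result shown as a comment, not captured shell output):
scala> sc.wholeTextFiles("/home/wzj/data").collect
// roughly Array((file:/home/wzj/data/test.txt, "study,study,study\nno,no\ni"), ...) -- each element is (file path, full file content)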
3. transformations/actions: operators compute on an RDD and return a new RDD
Transformations are lazy: they do not execute immediately; a job is only submitted when the first action is encountered
scala> val a = sc.parallelize(List(1,2,3,4,5))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24
scala> val b = a.map(_*2)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[9] at map at <console>:25
scala> b.collect
res5: Array[Int] = Array(2, 4, 6, 8, 10)
Transformation operators:
1. map/mapPartitions
map operates on each element
mapPartitions operates on each partition (it is handed the whole partition as an iterator)
scala> b.mapPartitions(partition => partition.map(_*2)).collect
res6: Array[Int] = Array(4, 8, 12, 16, 20)
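For comparison, the same doubling written with map on the same RDD b (result shown as a comment):
scala> b.map(_ * 2).collect
// Array(4, 8, 12, 16, 20) -- same result; mapPartitions simply processes one whole partition at a time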
2. filter: keeps only the elements that satisfy a predicate
rdd.filter(predicate)
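A small filter sketch (nums is just an example RDD; result shown as a comment):
scala> val nums = sc.parallelize(1 to 10)
scala> nums.filter(_ % 2 == 0).collect
// Array(2, 4, 6, 8, 10)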
3. zip: pairs the elements of two RDDs one-to-one
rdd1.zip(rdd2)
The two RDDs must have the same number of partitions and the same number of elements
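A minimal zip sketch, where both RDDs are created with matching partition and element counts (result shown as a comment):
scala> val x = sc.parallelize(Array(1, 2, 3), 3)
scala> val y = sc.parallelize(Array("a", "b", "c"), 3)
scala> x.zip(y).collect
// Array((1,a), (2,b), (3,c)); zip fails if the partition or element counts differ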
4. sortBy/sortByKey
scala> val rdd1 = sc.parallelize(Array(("qwe",20),("asd",15),("zxc",35)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd1.sortBy(x =>x._2).collect()
res0: Array[(String, Int)] = Array((asd,15), (qwe,20), (zxc,35))
scala> rdd1.map(x =>(x._2,x._1)).sortByKey().map(x=>(x._2,x._1)).collect()
res2: Array[(String, Int)] = Array((asd,15), (qwe,20), (zxc,35))
scala> rdd1.map(x =>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).collect()
res3: Array[(String, Int)] = Array((zxc,35), (qwe,20), (asd,15))
5. reduceByKey/groupByKey
groupByKey shuffles far more data than reduceByKey; reduceByKey performs local (map-side) aggregation first, much like the Combiner in MapReduce, so prefer reduceByKey
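rdd2 used in the examples below is not defined above; a definition consistent with the outputs shown would be:
scala> val rdd2 = sc.parallelize(Array(("a", 1), ("a", 99), ("b", 2), ("c", 23)))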
scala> rdd2.groupByKey()
res4: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[13] at groupByKey at <console>:26
scala> rdd2.groupByKey().collect
res5: Array[(String, Iterable[Int])] = Array((a,CompactBuffer(1, 99)), (b,CompactBuffer(2)), (c,CompactBuffer(23)))
scala> rdd2.groupByKey().map(x=>(x._1,x._2.sum)).collect
res7: Array[(String, Int)] = Array((a,100), (b,2), (c,23))
scala> rdd2.reduceByKey(_+_).collect
res9: Array[(String, Int)] = Array((a,100), (b,2), (c,23))
scala> rdd2.reduceByKey((x,y)=>x+y).collect
res11: Array[(String, Int)] = Array((a,100), (b,2), (c,23))
6. join/leftOuterJoin/rightOuterJoin/fullOuterJoin
rdd1.join(rdd2)
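A quick sketch of the join family on two small pair RDDs (names are illustrative; results shown as comments, element order may vary after the shuffle):
scala> val r1 = sc.parallelize(Array(("a", 1), ("b", 2), ("c", 3)))
scala> val r2 = sc.parallelize(Array(("a", "x"), ("b", "y"), ("d", "z")))
scala> r1.join(r2).collect
// Array((a,(1,x)), (b,(2,y)))  -- only keys present in both RDDs
scala> r1.leftOuterJoin(r2).collect
// Array((a,(1,Some(x))), (b,(2,Some(y))), (c,(3,None)))  -- keeps every key from r1
scala> r1.fullOuterJoin(r2).collect
// Array((a,(Some(1),Some(x))), (b,(Some(2),Some(y))), (c,(Some(3),None)), (d,(None,Some(z))))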
Action operators:
1. collect: loads all the data into the driver's memory, so it is only suitable for small test datasets; on large data it will cause an OOM
2. first: returns the first element
3. take(num): returns the first num elements
4. count: returns the number of elements
5. top(num): returns the largest num elements (in descending order)
6. takeOrdered(num): returns the smallest num elements (in ascending order)
7. reduce: aggregates all elements with a binary function
8. countByKey: counts how many elements there are for each key
9. lookup: returns all values for a given key, e.g. rdd.lookup("key")
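A quick sketch of several of these actions on a small example RDD (results shown as comments, assuming the data below):
scala> val nums = sc.parallelize(Array(5, 1, 4, 2, 3))
scala> nums.first                 // 5
scala> nums.take(2)               // Array(5, 1)
scala> nums.count                 // 5
scala> nums.top(2)                // Array(5, 4)
scala> nums.takeOrdered(2)        // Array(1, 2)
scala> nums.reduce(_ + _)         // 15
scala> sc.parallelize(Array(("a", 1), ("a", 2), ("b", 3))).countByKey  // Map(a -> 2, b -> 1)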
Writing data to the local filesystem
import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("app001").setMaster("local")
    val sc = new SparkContext(conf)
    saveTextFile(sc)
    sc.stop()
  }

  def saveTextFile(sc: SparkContext): Unit = {
    val rdd1 = sc.parallelize(Array(("qwe", 10), ("asd", 123), ("zxc", 34)))
    // Output directory relative to the working directory; it must not already exist
    val out = "out"
    rdd1.saveAsTextFile(out)
  }
}
Writing data to HDFS
import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    // Run the job as the HDFS user "wzj"
    System.setProperty("HADOOP_USER_NAME", "wzj")
    val conf = new SparkConf().setAppName("app001").setMaster("local")
    val sc = new SparkContext(conf)
    saveTextFile(sc)
    sc.stop()
  }

  def saveTextFile(sc: SparkContext): Unit = {
    val rdd1 = sc.parallelize(Array(("qwe", 10), ("asd", 123), ("zxc", 34)))
    // Point the Hadoop client at the remote NameNode and resolve DataNodes by hostname
    sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://hadoop001:9000")
    sc.hadoopConfiguration.set("dfs.client.use.datanode.hostname", "true")
    val out = "hdfs://hadoop001:9000/test"
    rdd1.saveAsTextFile(out)
  }
}