Spark RDD (Resilient Distributed Dataset) Operations

1 Create intRDD and convert it to an Array

val intRDD = sc.parallelize(List(3,1,2,5,5))

intRDD.collect()

2 Create stringRDD and convert it to an Array
val stringRDD = sc.parallelize(List("Apple","Orange","Banana","Grape","Apple"))
stringRDD.collect()

3 map on numbers
(1) Named function (entered line by line in spark-shell; the opening lines were missing here, so the name addOne is a reconstruction)
def addOne(x: Int): Int = {
  return (x + 1)
}
intRDD.map(addOne).collect()

(2) Anonymous function
intRDD.map(x => x + 1).collect()
(3) Anonymous function with placeholder parameter
intRDD.map(_ + 1).collect()

4 map on strings
stringRDD.map(x=>"fruit:" + x).collect()

5 filter on numbers
intRDD.filter(x => x < 3).collect()
intRDD.filter(_ < 3).collect()

6 filter on strings
stringRDD.filter(x => x.contains("ra")).collect()

7 distinct
intRDD.distinct().collect()
stringRDD.distinct().collect()

8 randomSplit
val sRDD = intRDD.randomSplit(Array(0.4,0.6))
sRDD.size
sRDD(0).collect()
sRDD(1).collect()

9 groupBy
val gRDD = intRDD.groupBy(x => {if (x % 2 == 0) "even" else "odd"}).collect()
gRDD(0)
gRDD(1)
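The same groupBy semantics can be checked on a plain Scala collection, with no Spark needed; this local sketch uses the list from intRDD above.

```scala
// groupBy on a local List mirrors RDD.groupBy: each element is routed
// to the group named by the classifier function, keeping source order.
val nums = List(3, 1, 2, 5, 5)
val grouped = nums.groupBy(x => if (x % 2 == 0) "even" else "odd")
// grouped("even") == List(2); grouped("odd") == List(3, 1, 5, 5)
```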

10 RDD transformations
val intRDD1 = sc.parallelize(List(3,1,2,5,5))
val intRDD2 = sc.parallelize(List(5,6))
val intRDD3 = sc.parallelize(List(2,7))
(1) union
intRDD1.union(intRDD2).union(intRDD3).collect()
(intRDD1 ++ intRDD2 ++ intRDD3).collect()
(2) intersection
intRDD1.intersection(intRDD2).collect()
(3) subtract (set difference)
intRDD1.subtract(intRDD2).collect()
(4) cartesian (Cartesian product)
intRDD1.cartesian(intRDD2).collect()
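cartesian pairs every element of one RDD with every element of the other; the same semantics on plain Scala collections, using the values of intRDD2 and intRDD3 as local stand-ins:

```scala
// Cartesian product of two local lists: every element of the first
// collection is paired with every element of the second.
val left  = List(5, 6)   // values of intRDD2
val right = List(2, 7)   // values of intRDD3
val pairs = for (a <- left; b <- right) yield (a, b)
// pairs == List((5,2), (5,7), (6,2), (6,7))
```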

11 Basic RDD actions
(1) Reading elements
intRDD.first()
intRDD.take(2)
intRDD.takeOrdered(3)
intRDD.takeOrdered(3)(Ordering[Int].reverse)
(2) Statistics
intRDD.stats()
intRDD.min()
intRDD.max()
intRDD.stdev()
intRDD.count()
intRDD.sum()
intRDD.mean()
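The numbers these actions report can be reproduced by hand on a local list; note that stdev() is the population standard deviation (it divides by n, not n-1).

```scala
// Statistics of List(3,1,2,5,5) computed locally, matching intRDD.stats().
val data  = List(3.0, 1.0, 2.0, 5.0, 5.0)
val count = data.size        // 5
val sum   = data.sum         // 16.0
val mean  = sum / count      // 3.2
// Population standard deviation: sqrt of the mean squared deviation.
val stdev = math.sqrt(data.map(x => math.pow(x - mean, 2)).sum / count)  // 1.6
```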

12 Basic key-value transformations
val kvRDD1 = sc.parallelize(List((3,4),(3,6),(5,6),(1,2)))

//view the keys
kvRDD1.keys.collect()

//view the values
kvRDD1.values.collect()

//pairs whose key is less than 5
kvRDD1.filter{case(key,value) => key < 5}.collect()

//pairs whose value is less than 5
kvRDD1.filter{case(key,value) => value < 5}.collect()

//mapValues: square each value
kvRDD1.mapValues(x => x*x).collect()

//sort by key; ascending (true) is the default
kvRDD1.sortByKey(true).collect()
kvRDD1.sortByKey().collect()
kvRDD1.sortByKey(false).collect()

//sum the values of identical keys
kvRDD1.reduceByKey((x,y)=>x+y).collect()

//shorthand form of the same reduceByKey
kvRDD1.reduceByKey(_+_).collect()
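reduceByKey's semantics can be sketched on a local collection: group the pairs by key, then fold each group's values with the supplied function.

```scala
// Local equivalent of kvRDD1.reduceByKey(_+_): group by key, sum values.
val pairs = List((3, 4), (3, 6), (5, 6), (1, 2))
val reduced = pairs.groupBy(_._1).map { case (k, kvs) =>
  (k, kvs.map(_._2).reduce(_ + _))
}
// reduced == Map(3 -> 10, 5 -> 6, 1 -> 2)
```

On a real cluster reduceByKey is preferred over groupBy-then-reduce because it combines values on each partition before shuffling.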
13 Key-value transformations across multiple RDDs
val kvRDD1 = sc.parallelize(List((3,4),(3,6),(5,6),(1,2)))
val kvRDD2 = sc.parallelize(List((3,8)))

//join: records with matching keys are combined, then printed
kvRDD1.join(kvRDD2).foreach(println)

//left outer join
kvRDD1.leftOuterJoin(kvRDD2).foreach(println)

//right outer join
kvRDD1.rightOuterJoin(kvRDD2).foreach(println)

//subtract: set difference on key-value pairs
kvRDD1.subtract(kvRDD2).collect()
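Inner-join semantics, sketched locally: only keys present on both sides survive, and every value pairing for a shared key is emitted.

```scala
// Local equivalent of kvRDD1.join(kvRDD2).
val kv1 = List((3, 4), (3, 6), (5, 6), (1, 2))
val kv2 = List((3, 8))
val joined = for ((k1, v1) <- kv1; (k2, v2) <- kv2 if k1 == k2)
  yield (k1, (v1, v2))
// joined == List((3, (4, 8)), (3, (6, 8)))
```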
14 Key-value actions
kvRDD1.first()
kvRDD1.take(2)
val kvFirst = kvRDD1.first
kvFirst._1
kvFirst._2

//count the records per key
kvRDD1.countByKey()

//collect the RDD as a local Map
val KV=kvRDD1.collectAsMap()
KV(3)
KV(1)

//all values whose key is 3
kvRDD1.lookup(3)
kvRDD1.lookup(5)
val kvFruit = sc.parallelize(List((1,"apple"),(2,"orange"),(3,"banana"),(4,"grape")))
val fruitMap = kvFruit.collectAsMap()
val fruitIds = sc.parallelize(List(2,4,1,3))
val fruitNames = fruitIds.map(x => fruitMap(x)).collect
15 Broadcast variables
val kvFruit = sc.parallelize(List((1,"apple"),(2,"orange"),(3,"banana"),(4,"grape")))
val fruitMap = kvFruit.collectAsMap()
val bcFruitMap = sc.broadcast(fruitMap)
val fruitIds = sc.parallelize(List(2,4,1,3))
val fruitNames = fruitIds.map(x => bcFruitMap.value(x)).collect
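broadcast ships one read-only copy of fruitMap to each executor instead of serializing it into every task; the per-element lookup itself is an ordinary Map apply, as this local sketch shows.

```scala
// Local stand-in for bcFruitMap.value: translate ids through a lookup table.
val fruitMap = Map(1 -> "apple", 2 -> "orange", 3 -> "banana", 4 -> "grape")
val fruitIds = List(2, 4, 1, 3)
val fruitNames = fruitIds.map(id => fruitMap(id))
// fruitNames == List("orange", "grape", "apple", "banana")
```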
16 accumulator
val intRDD = sc.parallelize(List(3,1,2,5,5))
val total = sc.accumulator(0.0)
val num = sc.accumulator(0)
intRDD.foreach(i => {
  total += i
  num += 1})
println("total=" + total.value + ", num=" + num.value)
val avg=total.value / num.value
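The accumulator loop above keeps a running sum and count across tasks; locally, the same average falls out of a single fold over the data.

```scala
// Local equivalent of the total/num accumulators: one fold, then divide.
val nums = List(3, 1, 2, 5, 5)
val (total, count) = nums.foldLeft((0.0, 0)) {
  case ((t, n), x) => (t + x, n + 1)
}
val avg = total / count   // 16.0 / 5 = 3.2
```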
17 RDD persistence
(1) Create an example RDD
val intRddMemory = sc.parallelize(List(3,1,2,5,5))
intRddMemory.persist()
intRddMemory.unpersist()
(2) Set the storage level
import org.apache.spark.storage.StorageLevel
val intRddMemoryAndDisk = sc.parallelize(List(3,1,2,5,5))
intRddMemoryAndDisk.persist(StorageLevel.MEMORY_AND_DISK)

intRddMemoryAndDisk.unpersist()

18 Build WordCount with Spark (this step assumes the WordCount setup from earlier is on the machine)
:quit

ls

cd workspace/

ls

rm -R WordCount/

mkdir -p ~/workspace/WordCount/data

cd ~/workspace/WordCount/data

gedit test.txt

Apple Apple Orange

Banana Grape Grape

val textFile = sc.textFile("file:/home/zwxq/workspace/WordCount/data/test.txt")

val stringRDD=textFile.flatMap(line=>line.split(" "))

stringRDD.collect

val countsRDD=stringRDD.map(word=>(word,1)).reduceByKey(_+_)

countsRDD.saveAsTextFile("file:/home/zwxq/workspace/WordCount/data/output")
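The flatMap / map / reduceByKey pipeline above can be replayed on the test.txt contents as local data: split lines into words, pair each word with 1, and sum the counts per word.

```scala
// Local word count over the same two lines written to test.txt.
val lines = List("Apple Apple Orange", "Banana Grape Grape")
val counts = lines
  .flatMap(_.split(" "))          // one entry per word
  .map(word => (word, 1))         // (word, 1) pairs
  .groupBy(_._1)                  // group pairs by word
  .map { case (w, ws) => (w, ws.map(_._2).sum) }  // sum counts per word
// counts == Map("Apple" -> 2, "Orange" -> 1, "Banana" -> 1, "Grape" -> 2)
```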