Running Spark computations on a Hadoop node


Spark RDD (Resilient Distributed Dataset) operations

The following runs on the master of a multi-node Hadoop cluster.

Launch the Spark shell:    spark-shell --master local[*]
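The shell pre-creates a SparkContext named sc, which every example below uses; a quick sanity check:

sc.master   // should print local[*]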

1 Create intRDD and convert it to an Array

val intRDD = sc.parallelize(List(3,1,2,5,5))

intRDD.collect()
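collect() pulls the RDD's elements back to the driver as a local Array; with the list above it should return:

// expected: Array(3, 1, 2, 5, 5)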

 

2 Create stringRDD and convert it to an Array
val stringRDD = sc.parallelize(List("Apple","Orange","Banana","Grape","Apple"))
stringRDD.collect()

 

3 map on numbers
(1) Named function (entered line by line in the shell)
def addone(x: Int): Int = {
  x + 1
}
intRDD.map(addone).collect()

(2) Anonymous function
intRDD.map(x => x + 1).collect()
(3) Anonymous function with placeholder syntax
intRDD.map(_ + 1).collect() 
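All three map variants are equivalent; for the intRDD defined above, each should return Array(4, 2, 3, 6, 6).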

 

4 map on strings
stringRDD.map(x=>"fruit:" + x).collect() 

 

5 filter on numbers
intRDD.filter(x => x < 3).collect()
intRDD.filter(_ < 3).collect() 
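Both filter forms should return Array(1, 2), since filter keeps the original element order.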

 

6 filter on strings
stringRDD.filter(x => x.contains("ra")).collect()

 

7 distinct
intRDD.distinct().collect()
stringRDD.distinct().collect() 

 

8 randomSplit
val sRDD = intRDD.randomSplit(Array(0.4,0.6))
sRDD.size
sRDD(0).collect()
sRDD(1).collect()
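On an RDD this small the 0.4/0.6 weights are only approximate, so the split sizes can vary from run to run; for a reproducible split, randomSplit also accepts a seed:

val sRDD = intRDD.randomSplit(Array(0.4, 0.6), seed = 42L)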


9 groupBy
val gRDD = intRDD.groupBy(x => if (x % 2 == 0) "even" else "odd").collect()
gRDD(0)
gRDD(1) 
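gRDD is an Array[(String, Iterable[Int])]; it typically prints as (even,CompactBuffer(2)) and (odd,CompactBuffer(3, 1, 5, 5)), though the order of the two groups is not guaranteed.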

 

10 Transformations across multiple RDDs
val intRDD1 = sc.parallelize(List(3,1,2,5,5))
val intRDD2 = sc.parallelize(List(5,6))
val intRDD3 = sc.parallelize(List(2,7))
(1) union (set union)
intRDD1.union(intRDD2).union(intRDD3).collect()
(intRDD1++ intRDD2++ intRDD3).collect()
(2) intersection
intRDD1.intersection(intRDD2).collect()
(3) subtract (set difference)
intRDD1.subtract(intRDD2).collect()
(4) cartesian (Cartesian product)
intRDD1.cartesian(intRDD2).collect()
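cartesian pairs every element of intRDD1 with every element of intRDD2, so the result should have 5 × 2 = 10 tuples: (3,5), (3,6), (1,5), (1,6), (2,5), (2,6), (5,5), (5,6), (5,5), (5,6) (order may vary).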


11 Basic RDD actions
(1) Reading elements
intRDD.first()
intRDD.take(2)
intRDD.takeOrdered(3)
intRDD.takeOrdered(3)(Ordering[Int].reverse)
(2) Statistics
intRDD.stats()
intRDD.min()
intRDD.max()
intRDD.stdev()
intRDD.count()
intRDD.sum()
intRDD.mean() 
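For intRDD = (3, 1, 2, 5, 5) these should give: count 5, sum 16, mean 3.2, max 5, min 1, and stdev 1.6 (stdev() is the population standard deviation; sampleStdev() gives the sample version).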
12 Basic Key-Value RDD transformations
val kvRDD1 = sc.parallelize(List((3,4),(3,6),(5,6),(1,2)))

// list the keys
kvRDD1.keys.collect()

// list the values
kvRDD1.values.collect()

// pairs whose key is less than 5
kvRDD1.filter{case(key,value) => key < 5}.collect()

// pairs whose value is less than 5
kvRDD1.filter{case(key,value) => value < 5}.collect()

// map over the values
kvRDD1.mapValues(x => x*x).collect()

// sort by key; ascending (true) is the default
kvRDD1.sortByKey(true).collect()
kvRDD1.sortByKey().collect()
kvRDD1.sortByKey(false).collect()

// sum the values of identical keys
kvRDD1.reduceByKey((x,y)=>x+y).collect()

// the same reduceByKey with placeholder syntax
kvRDD1.reduceByKey(_+_).collect()
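Both reduceByKey calls should return Array((1,2), (3,10), (5,6)) (key order may vary): the two values under key 3 are summed to 10.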

13 Transformations across multiple Key-Value RDDs
val kvRDD1 = sc.parallelize(List((3,4),(3,6),(5,6),(1,2)))
val kvRDD2 = sc.parallelize(List((3,8)))

// join: pairs whose keys match are combined, then printed
kvRDD1.join(kvRDD2).foreach(println)

// left outer join
kvRDD1.leftOuterJoin(kvRDD2).foreach(println)

// right outer join
kvRDD1.rightOuterJoin(kvRDD2).foreach(println)

// set difference of the (key, value) pairs
kvRDD1.subtract(kvRDD2).collect()
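join emits one tuple per matching key pair, e.g. (3,(4,8)) and (3,(6,8)); leftOuterJoin wraps the right side in an Option, e.g. (1,(2,None)) and (3,(4,Some(8))). Note that subtract compares whole (key, value) pairs, so (3,8) removes nothing here; to drop entries by key alone, subtractByKey is available:

kvRDD1.subtractByKey(kvRDD2).collect()   // expected: Array((1,2), (5,6))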

14 Key-Value actions
kvRDD1.first()
kvRDD1.take(2)
val kvFirst = kvRDD1.first
kvFirst._1
kvFirst._2

// count the occurrences of each key
kvRDD1.countByKey()

// collect the pairs into a local Map
val KV=kvRDD1.collectAsMap()
KV(3)
KV(1)

// all values whose key is 3 (then 5)
kvRDD1.lookup(3)
kvRDD1.lookup(5)
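countByKey() should return Map(3 -> 2, 5 -> 1, 1 -> 1), and lookup(3) the values Seq(4, 6). One caveat with collectAsMap(): a Map keeps only a single value per key, so KV(3) returns just one of 4 and 6.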

15 Broadcast variables (shared read-only values)
(1) Without a broadcast variable
val kvFruit = sc.parallelize(List((1, "apple"), (2, "orange"), (3, "banana"), (4, "grape")))
val fruitMap = kvFruit.collectAsMap()
val fruitIds = sc.parallelize(List(2,4,1,3))
val fruitNames = fruitIds.map(x => fruitMap(x)).collect
(2) With a broadcast variable
val kvFruit = sc.parallelize(List((1, "apple"), (2, "orange"), (3, "banana"), (4, "grape")))
val fruitMap = kvFruit.collectAsMap()
val bcFruitMap = sc.broadcast(fruitMap)
val fruitIds = sc.parallelize(List(2,4,1,3))
val fruitNames = fruitIds.map(x => bcFruitMap.value(x)).collect
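Why this helps: without the broadcast, fruitMap is serialized into every task; sc.broadcast ships it to each executor once and tasks read the cached copy through .value. When it is no longer needed it can be released:

bcFruitMap.unpersist()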

16 Accumulators
val intRDD = sc.parallelize(List(3,1,2,5,5))
val total = sc.accumulator(0.0)
val num = sc.accumulator(0)
intRDD.foreach(i=>{
total += i
num += 1})
println("total=" + total.value + ", num=" + num.value)
val avg=total.value / num.value
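A side note, assuming Spark 2.x or later: sc.accumulator is deprecated there in favor of the typed accumulator API, which the same example would use like this:

val total2 = sc.doubleAccumulator("total")   // named accumulators also appear in the web UI
val num2 = sc.longAccumulator("num")
intRDD.foreach { i => total2.add(i); num2.add(1) }
println("total=" + total2.value + ", num=" + num2.value)   // total=16.0, num=5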

17 RDD persistence
(1) Create an example RDD
val intRddMemory = sc.parallelize(List(3,1,2,5,5))
intRddMemory.persist()
intRddMemory.unpersist()
(2) Set the storage level
import org.apache.spark.storage.StorageLevel
val intRddMemoryAndDisk = sc.parallelize(List(3,1,2,5,5))
intRddMemoryAndDisk.persist(StorageLevel.MEMORY_AND_DISK)

intRddMemoryAndDisk.unpersist()
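persist() only marks the RDD; it is actually cached the first time an action computes it. The level in effect can be inspected with:

intRddMemoryAndDisk.getStorageLevel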


18 Building WordCount with Spark (assumes the machine still has the WordCount directory from earlier)
Exit the Spark shell and prepare the data directory from the Linux shell:

:quit

ls
cd workspace/
ls
rm -R WordCount/
mkdir -p ~/workspace/WordCount/data
cd ~/workspace/WordCount/data
gedit test.txt

 

Apple Apple Orange

Banana Grape Grape

 

 

Restart the Spark shell, then read the text file:

val textFile = sc.textFile("file:/home/zwxq/workspace/WordCount/data/test.txt")

val stringRDD=textFile.flatMap(line=>line.split(" "))

stringRDD.collect

 

 

val countsRDD=stringRDD.map(word=>(word,1)).reduceByKey(_+_)

 countsRDD.saveAsTextFile("file:/home/zwxq/workspace/WordCount/data/output")
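To check the result, print the counts directly, or read back the part-files Spark wrote to the output directory:

countsRDD.collect().foreach(println)   // expected: (Apple,2), (Banana,1), (Grape,2), (Orange,1) in some order
sc.textFile("file:/home/zwxq/workspace/WordCount/data/output").collect().foreach(println)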


