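Spark Study Notes (1): Basic Functions
Preface:
I am still far from grasping the essence of Spark; the basic functions and commands below are only a first taste. I hope to become fluent with this tool over time.
Author: Leige_Smart
Language: Scala
#Basic functions and commands
#Contents: scala> rdd.foreach(println)
#String contents (a few strings I typed at random)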
leige;ddf;dfe;efefe;sdcd;
dfe;eff;
fsdfe;fe;frgr;dcdc;
eff;leige;dfe;
efefe;dcdc;
@Commands and results
Input: val rddlength=rdd.map(s=>s.length).collect
Output: rddlength: Array[Int] = Array(25, 19, 8, 14, 11, 0)
Function: computes the character count of every line in rdd and collects the counts into an Array
_.reduce()
Input: val rddtotallength=rddlength.reduce((a,b)=>a+b)
Output: rddtotallength: Int = 77
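Function: sums the numbers in the Array (rddlength is already a local Array after collect, so this is Scala's Array.reduce rather than an RDD action)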
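The same total can also be computed without collecting first, by calling reduce on the RDD itself. The sketch below is meant to be pasted into spark-shell (or run as a Scala script with Spark on the classpath). The six input strings are an assumption reconstructed from the lengths in the Output above, and the app name is arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("notes-length"))

// Assumed input: six lines, the last one empty.
val lines = sc.parallelize(Seq(
  "leige;ddf;dfe;efefe;sdcd;", "fsdfe;fe;frgr;dcdc;", "dfe;eff;",
  "eff;leige;dfe;", "efefe;dcdc;", ""))

val lengths = lines.map(_.length)   // transformation: still an RDD[Int], nothing runs yet
val total = lengths.reduce(_ + _)   // action: executed by Spark, returns a plain Int
val collected = lengths.collect()   // action: copies the lengths to the driver as an Array
```

Reducing on the RDD avoids shipping every element to the driver, which matters once the data no longer fits comfortably in driver memory.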
_.map(_XX)
Input: val rdd = sc.parallelize(List(1,2,3,4,5,6))
Input: val mapRdd = rdd.map(_*2) // typical functional-programming style
Input: mapRdd.collect()
Output: Array(2, 4, 6, 8, 10, 12)
Function: swaps the two elements of each tuple
Input: val l1=sc.parallelize(List(('a',1),('a',2),('b',3)))
Input: val l2=l1.map(x=>(x._2,x._1))
Input: l2.collect
Output: Array[(Int, Char)] = Array((1,a), (2,a), (3,b))
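As a small variation on the two map examples above, Scala's Tuple2 already has a swap method, so the lambda can be shortened. This is only a sketch for spark-shell; the variable names and the app name are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("notes-map"))

// map applies a function to every element and yields exactly one output per input.
val doubled = sc.parallelize(1 to 6).map(_ * 2).collect()

// Tuple2.swap is equivalent to x => (x._2, x._1).
val pairs = sc.parallelize(List(('a', 1), ('a', 2), ('b', 3)))
val swapped = pairs.map(_.swap).collect()
```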
_.filter(_XX)
Input: val filterRdd = mapRdd.filter(_ > 5)
Input: filterRdd.collect()
Output: Array(6,8,10,12)
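The map and filter steps above can be chained into one pipeline. A minimal sketch for spark-shell follows; the predicate name isLarge is hypothetical and only there to show that any Int => Boolean function can be passed to filter.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("notes-filter"))

val nums = sc.parallelize(List(1, 2, 3, 4, 5, 6))

// Hypothetical named predicate; filter keeps the elements for which it returns true.
val isLarge = (n: Int) => n > 5

// Both steps are lazy transformations; collect is the action that triggers them.
val result = nums.map(_ * 2).filter(isLarge).collect()
```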
_.count
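Function: counts the number of elements in the RDD (the number of lines when the RDD was read from a text file)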
_.cache
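Function: keeps the RDD's contents in memory so that later operations on it run much faster; the data is actually stored the first time an action computes the RDD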
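A rough sketch of how count and cache interact, assuming a spark-shell session. The sample words are made up, and on data this small no speed-up will be visible; the point is only the order of the calls.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Reuses the shell's context if one exists, otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("notes-cache"))

val words = sc.parallelize(Seq("leige", "ddf", "dfe", "dfe")).map(_.toUpperCase)

words.cache()                        // only marks the RDD; nothing is computed yet
val n1 = words.count()               // first action: computes the RDD and stores its partitions
val n2 = words.count()               // later actions can reuse the stored partitions
val level = words.getStorageLevel    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
words.unpersist()                    // frees the cached partitions when no longer needed
```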
_.flatMap(_.split(";"))
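Function: splits each line on ';' and flattens the pieces into a single RDD of strings, which removes the ';' separators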
Input: val rdd2=rdd.flatMap(_.split(";"))
Input: rdd2.collect
Output: Array[String] = Array(leige, ddf, dfe, efefe, sdcd, fsdfe, fe, frgr, dcdc, dfe, eff, eff, leige, dfe, efefe, dcdc, "")
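To see what the "flat" in flatMap contributes, it helps to run map and flatMap with the same split function side by side. The sketch below assumes spark-shell and uses a shortened, made-up input. It also shows where the trailing "" in the Output above comes from: split drops trailing empty pieces, but an empty line still yields one empty string.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("notes-flatmap"))

// Assumed sample: two data lines and one empty line.
val lines = sc.parallelize(Seq("leige;ddf;", "dfe;eff;", ""))

val nested = lines.map(_.split(";")).collect()     // one Array[String] per input line
val flat = lines.flatMap(_.split(";")).collect()   // all pieces in one flat Array[String]

// Dropping the empty string produced by the empty line.
val nonEmpty = lines.flatMap(_.split(";")).filter(_.nonEmpty).collect()
```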
_.flatMap(_.split(";").map((_,1)))
Function: turns every element into a (word, 1) tuple
Input: val rdd3=rdd.flatMap(_.split(";").map((_,1)))
Input: rdd3.collect
Output: Array[(String, Int)] = Array((leige,1), (ddf,1), (dfe,1), (efefe,1), (sdcd,1), (fsdfe,1), (fe,1), (frgr,1), (dcdc,1), (dfe,1), (eff,1), (eff,1), (leige,1), (dfe,1), (efefe,1), (dcdc,1), ("",1))
_.flatMap(_.split(";").map((_,1))).reduceByKey(_+_)
_.flatMap(_.split(";").map((_,1))).reduceByKey((a,b)=>a+b)
Function: sums the counts of all tuples that share a key (the two forms above are equivalent)
Output: Array[(String, Int)] = Array((sdcd,1), (dcdc,2), (fsdfe,1), ("",1), (ddf,1), (leige,2), (efefe,2), (frgr,1), (fe,1), (eff,2), (dfe,3))
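Put together, the flatMap, map and reduceByKey steps form the classic word count. The following is a minimal sketch for spark-shell under assumed input (three made-up lines); the filter that drops empty strings is an addition not present in the notes above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("notes-wordcount"))

val lines = sc.parallelize(Seq("leige;ddf;dfe;", "dfe;eff;", "eff;leige;dfe;"))

val counts = lines
  .flatMap(_.split(";"))   // one element per word
  .filter(_.nonEmpty)      // added step: ignore empty strings
  .map((_, 1))             // (word, 1) pairs
  .reduceByKey(_ + _)      // sum the 1s for each distinct word

// collectAsMap brings the result to the driver as a Map, which sidesteps ordering questions.
val byWord = counts.collectAsMap()
```

The order of the pairs returned by collect after reduceByKey depends on partitioning, which is why the Output above is not in input order.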
rdd1 join rdd2
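Function: joins two pair RDDs by key. For every key present in both, each value from rdd1 is paired with each value from rdd2 (a Cartesian product within a key, not across the whole lists)
Input: val rdd1 = sc.parallelize(List(('a',1),('a', 2), ('b', 3)))
Input: val rdd2 = sc.parallelize(List(('a',4),('b', 5)))
Input: val result_join = rdd1 join rdd2
Input: result_join.collect
Output: Array((a,(1,4)), (a,(2,4)), (b,(3,5))) // element order may vary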
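The difference between join and a full Cartesian product can be checked directly, since RDDs also offer a cartesian method. This sketch assumes spark-shell; the extra ('c', 9) pair is made up to show that keys without a match are dropped by join.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("notes-join"))

val left = sc.parallelize(List(('a', 1), ('a', 2), ('b', 3), ('c', 9)))
val right = sc.parallelize(List(('a', 4), ('b', 5)))

// Inner join on the key; sorted locally because join output order is not guaranteed.
val joined = left.join(right).collect().sorted

// A true Cartesian product ignores keys and pairs every element with every element.
val productSize = left.cartesian(right).count()
```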
rdd1 union rdd2
Function: concatenates two RDDs into one; duplicates are kept
Input: val rdd1 = sc.parallelize(List(('a',1),('a', 2)))
Input: val rdd2 = sc.parallelize(List(('b',1),('b', 2)))
Input: val result_union = rdd1 union rdd2
Input: result_union.collect
Output: Array((a,1), (a,2), (b,1), (b,2))
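Because union simply concatenates, an element present in both inputs appears twice. A small sketch for spark-shell, with made-up data in which ('a', 2) occurs in both RDDs; distinct is one way to remove the repeat.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("notes-union"))

val a = sc.parallelize(List(('a', 1), ('a', 2)))
val b = sc.parallelize(List(('a', 2), ('b', 1)))

val all = a.union(b).collect()                        // elements of a followed by elements of b
val unique = a.union(b).distinct().collect().sorted   // duplicates removed, sorted locally
```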
_.lookup('x')
Function: returns all values stored under key x as a Seq
Input: val rdd=sc.parallelize(List(('a',1),('a',2),('b',1),('b',2)))
Input: val values=rdd.lookup('a')
Input: values.foreach(println)
Output: 1 2
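One detail worth checking is that lookup is an action: its result is an ordinary local Seq, not another RDD. A minimal sketch for spark-shell follows; the key 'z' is made up to show the behaviour for a missing key.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("notes-lookup"))

val pairs = sc.parallelize(List(('a', 1), ('a', 2), ('b', 1), ('b', 2)))

val forA: Seq[Int] = pairs.lookup('a')   // local collection holding the values for 'a'
val forZ: Seq[Int] = pairs.lookup('z')   // a key that does not occur gives an empty Seq
```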
_.sortByKey()/_.sortByKey(false)
Function: sorts by key, ascending by default / descending when false is passed
Input: val l=sc.parallelize(List(('a',2),('b',4),('a',3),('b',1)))
Input: val l2=l.sortByKey()
Input: l2.collect
Output: Array[(Char, Int)] = Array((a,2), (a,3), (b,4), (b,1))
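For comparison, here is a sketch of both sort directions plus sorting by value with sortBy, assuming spark-shell. Only the keys are inspected for sortByKey, since the relative order of pairs with equal keys should not be relied upon.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("notes-sort"))

val pairs = sc.parallelize(List(('a', 2), ('b', 4), ('a', 3), ('b', 1)))

val ascKeys = pairs.sortByKey().keys.collect()                    // ascending is the default
val descKeys = pairs.sortByKey(ascending = false).keys.collect()  // descending

// sortBy accepts any key function; here the pairs are ordered by their value.
val byValue = pairs.sortBy(_._2).collect()
```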