Contents
1. RDD creation:
1. Loading a file (textFile(url)):
scala> val lines = sc.textFile("file:///usr/local/spark/mycode/rdd/word.txt")
2. From an existing collection in the program (parallelize):
scala> val array = Array(1,2,3,4,5)
scala> val rdd = sc.parallelize(array)
2. RDD operations:
1. Transformation (map(func)):
map(func) applies func to every element of the source RDD and returns a new RDD of the results.
scala> val data = Array(1,2,3,4,5)
scala> val rdd1 = sc.parallelize(data)
scala> val rdd2 = rdd1.map(x => x+10)
2. Transformation (filter(func)):
Keeps only the elements that satisfy the function func.
scala> val lines = sc.textFile("file:///usr/local/spark/mycode/rdd/word.txt")
scala> val linesWithSpark=lines.filter(line => line.contains("Spark"))
3. Transformation (flatMap(func)):
1. First split each line on spaces (producing an array per line).
2. Then flatten the arrays, so that each individual word becomes its own element.
scala> val lines = sc.textFile("file:///usr/local/spark/mycode/rdd/word.txt")
scala> val words=lines.flatMap(line => line.split(" "))
4. Transformation (groupByKey()):
Groups pairs by key, collecting all values with the same key together. (For example, if "is" occurs three times, its group holds three separate 1s.)
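A minimal plain-Scala sketch of the grouping semantics; standard collections stand in for the RDD (on a pair RDD the call would simply be pairRDD.groupByKey()), and the sample pairs are made up:

```scala
// groupByKey gathers every value that shares a key into one collection.
val pairs = Seq(("is", 1), ("is", 1), ("is", 1), ("spark", 1))
val grouped = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }
// "is" keeps three separate 1s: nothing is summed yet
```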
5. Transformation (reduceByKey(func)):
Merges all the values of each key using func, e.g. turning the per-key 1s into a count.
(reduceByKey combines the values of each key into a single result, whereas groupByKey merely collects them into a list.)
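A plain-Scala sketch of the reducing semantics (collections in place of an RDD; sample pairs assumed):

```scala
// reduceByKey merges all values of a key with the given function;
// (a, b) => a + b turns the three 1s for "is" into the count 3.
val pairs = Seq(("is", 1), ("is", 1), ("is", 1), ("spark", 1))
val reduced = pairs.groupBy(_._1).map { case (k, vs) =>
  (k, vs.map(_._2).reduce((a, b) => a + b))
}
// reduced holds ("is", 3), where groupByKey would keep ("is", Seq(1, 1, 1))
```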
6. Manually specifying the number of partitions when creating an RDD (sc.textFile(path, partitionNum)):
scala> val array = Array(1,2,3,4,5)
scala> val rdd = sc.parallelize(array,2) // use two partitions
7. Example (word count):
scala> val lines = sc.       // if one line is too long, press Enter after the dot
     | textFile("file:///usr/local/spark/mycode/wordcount/word.txt")  // and continue typing on the next line
scala> val wordCount = lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
// Memorize this command; the instructor said it is sure to be on the exam, written by hand. Explained on the next page. You must understand (a, b) => a + b.
scala> wordCount.collect()
scala> wordCount.foreach(println)
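The same word-count pipeline can be traced on plain Scala collections (the two sample lines are made up, and groupBy plus reduce stand in for reduceByKey):

```scala
// Word count, step by step, mirroring the RDD pipeline above.
val lines = Seq("spark is fast", "spark is good")
val wordCount = lines
  .flatMap(_.split(" "))    // 1. split every line into words
  .map(word => (word, 1))   // 2. pair each word with the count 1
  .groupBy(_._1)            // 3. bring equal keys together (the shuffle)
  .map { case (w, ps) => (w, ps.map(_._2).reduce((a, b) => a + b)) } // 4. sum the 1s
// wordCount("spark") == 2, wordCount("is") == 2, wordCount("fast") == 1
```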
3. Creating pair (key-value) RDDs:
1. Loading from a file (map):
scala> val lines = sc.textFile("file:///usr/local/spark/mycode/pairrdd/word.txt")
scala> val pairRDD = lines.flatMap(line => line.split(" ")).map(word => (word,1)) // Memorize this command; the instructor said it is sure to be on the exam. Make sure you understand how word => (word, 1) works.
scala> pairRDD.foreach(println)
2. reduceByKey(func):
scala> pairRDD.reduceByKey((a,b)=>a+b).foreach(println)
3. values:
scala> pairRDD.values
4. sortByKey():
scala> pairRDD.sortByKey()
scala> val d1 = sc.parallelize(Array(("c",8),("b",25),("c",17),("a",42),("b",4),("d",9),("e",17),("c",2),("f",29),("g",21),("b",9)))
scala> d1.reduceByKey(_+_).sortByKey(false).collect
res2: Array[(String, Int)] = Array((g,21),(f,29),(e,17),(d,9),(c,27),(b,38),(a,42))
scala> val d2 = sc.parallelize(Array(("c",8),("b",25),("c",17),("a",42),("b",4),("d",9),("e",17),("c",2),("f",29),("g",21),("b",9)))
scala> d2.reduceByKey(_+_).sortBy(_._2,false).collect
res4: Array[(String, Int)] = Array((a,42),(b,38),(f,29),(c,27),(g,21),(e,17),(d,9))
Note: sortByKey(false) sorts in descending order by key, while sortBy(_._2, false) sorts in descending order by value.
5. mapValues(func):
scala> pairRDD.mapValues(x => x+1)
res2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at mapValues at <console>:34
scala> pairRDD.mapValues(x => x+1).foreach(println)
6. join:
join performs an inner join: given two input datasets of types (K, V1) and (K, V2), only the keys present in both datasets are emitted, yielding a dataset of type (K, (V1, V2)).
scala> val pairRDD1 = sc.parallelize(Array(("spark",1),("spark",2),("hadoop",3),("hadoop",5)))
pairRDD1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[24] at parallelize at <console>:27
scala> val pairRDD2 = sc.parallelize(Array(("spark","fast")))
pairRDD2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[25] at parallelize at <console>:27
scala> pairRDD1.join(pairRDD2)
res9: org.apache.spark.rdd.RDD[(String, (Int, String))] = MapPartitionsRDD[28] at join at <console>:32
scala> pairRDD1.join(pairRDD2).foreach(println)
(spark,(1,fast))
(spark,(2,fast))
7. Combined example (per-key average):
scala> val rdd = sc.parallelize(Array(("spark",2),("hadoop",6),("hadoop",4),("spark",6)))
scala> rdd.mapValues(x => (x,1)).reduceByKey((x,y) => (x._1+y._1,x._2 + y._2)).mapValues(x => (x._1 / x._2)).collect()
The instructor said the placeholder "_" is guaranteed to be on the exam.
res22: Array[(String, Int)] = Array((spark,4), (hadoop,5))
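The same per-key average can be traced step by step on plain Scala collections. Each `_` stands for the next argument in order: `_ + _` means (a, b) => a + b, and `_._2` reads the second field of a tuple (groupBy plus reduce stand in for reduceByKey here):

```scala
val data = Seq(("spark", 2), ("hadoop", 6), ("hadoop", 4), ("spark", 6))
// Step 1: pair each value with a count of 1, e.g. ("spark", (2, 1))
val withCounts = data.map { case (k, v) => (k, (v, 1)) }
// Step 2: per key, sum the values and sum the counts, as reduceByKey would
val sums = withCounts.groupBy(_._1).map { case (k, vs) =>
  (k, vs.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2)))
}
// Step 3: divide each sum by its count to get the per-key average
val avg = sums.map { case (k, (sum, n)) => (k, sum / n) }
// avg: Map(spark -> 4, hadoop -> 5)
```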
4. Glossary (terms and spellings)
DAG: Directed Acyclic Graph
REPL: Read-Eval-Print Loop, an interactive interpreter
RDD: Resilient Distributed Dataset
Lineage: the dependency chain between RDDs
Fork: splitting a stream
Join: recombining streams
HDFS: Hadoop Distributed File System
SparkContext: the Spark application context
AST: Abstract Syntax Tree
Catalyst: Spark SQL's functional relational query-optimization framework
JDBC: Java Database Connectivity
Data Warehouse
OLAP: On-Line Analytical Processing
Yahoo! S4: Simple Scalable Streaming System, an open-source stream-computing platform
TCP: Transmission Control Protocol
DStream: Discretized Stream
TF-IDF: term frequency-inverse document frequency, a common weighting technique in information retrieval and data mining
TF: Term Frequency
IDF: Inverse Document Frequency
One-Hot Encoding
Cross-validation (CrossValidator)
Train-validation split (TrainValidationSplit)