1. Word count
The HDFS file /user/root/mapreduce/wordcount/input/wc.input has the following contents:
hadoop hive
hive hadoop
hbase sqoop
hbase sqoop
hadoop hive
Start spark-shell:
bin/spark-shell
Read wc.input as an RDD:
val rdd = sc.textFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/root/mapreduce/wordcount/input/wc.input")
Count the occurrences of each word:
val wordCount = rdd.flatMap(line => line.split(" ")).map(x => (x, 1)).reduceByKey((x, y) => (x + y))
Print the result:
wordCount.collect().foreach(println)
(hive,3)
(sqoop,2)
(hadoop,3)
(hbase,2)
The result can also be saved to HDFS:
wordCount.saveAsTextFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/root/mapreduce/wordcount/sparkOutput")
Read back the file contents saved on HDFS:
bin/hdfs dfs -text /user/root/mapreduce/wordcount/sparkOutput/part-00000
(hive,3)
(sqoop,2)
bin/hdfs dfs -text /user/root/mapreduce/wordcount/sparkOutput/part-00001
(hadoop,3)
(hbase,2)
Fetch the HDFS files to the local machine: read the two partition files part-00000 and part-00001 and merge them into a new file, wc-sparkOutput, in the local current directory:
bin/hdfs dfs -getmerge /user/root/mapreduce/wordcount/sparkOutput/part-00000 /user/root/mapreduce/wordcount/sparkOutput/part-00001 wc-sparkOutput
The wc-sparkOutput file now contains part-00000 and part-00001 merged:
(hive,3)
(sqoop,2)
(hadoop,3)
(hbase,2)
When creating the RDD, the number of partitions can be specified via the second parameter of textFile():
val rdd = sc.textFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/root/mapreduce/wordcount/input/wc.input", 1)
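The flatMap → map → reduceByKey pipeline above can be mirrored on plain Scala collections, which is handy for checking the logic without a cluster. This is only a local sketch, not Spark code: groupBy plus a per-key sum stands in for the shuffle that reduceByKey performs, and the input is the wc.input contents shown above.

```scala
// Input: the five lines of wc.input, inlined for a local test.
val lines = List("hadoop hive", "hive hadoop", "hbase sqoop",
                 "hbase sqoop", "hadoop hive")

val wordCount = lines
  .flatMap(_.split(" "))               // split each line into words
  .map(w => (w, 1))                    // emit (word, 1) pairs
  .groupBy(_._1)                       // local stand-in for reduceByKey's shuffle
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) }  // sum the 1s per word

wordCount.foreach(println)
```

The result is a Map with the same counts as the RDD version: hadoop and hive appear 3 times, hbase and sqoop 2 times.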
2. Word count, sorted by key
sortByKey([ascending], [numTasks])
sortByKey has two optional parameters: ascending sorts by key in ascending order when true and descending when false (the default is ascending); numTasks sets the number of tasks used for the sort.
Default (ascending):
wordCount.sortByKey().collect().foreach(println)
(hadoop,3)
(hbase,2)
(hive,3)
(sqoop,2)
true sorts ascending:
wordCount.sortByKey(true).collect().foreach(println)
(hadoop,3)
(hbase,2)
(hive,3)
(sqoop,2)
false sorts descending:
wordCount.sortByKey(false).collect().foreach(println)
(sqoop,2)
(hive,3)
(hbase,2)
(hadoop,3)
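What sortByKey does to these pairs can be illustrated with sortBy on a plain Scala list (a local sketch, not the RDD API): sortBy(_._1) corresponds to sortByKey() / sortByKey(true), and a reversed Ordering corresponds to sortByKey(false).

```scala
// The word counts from above, as a local list.
val counts = List(("hive", 3), ("sqoop", 2), ("hadoop", 3), ("hbase", 2))

val asc  = counts.sortBy(_._1)                            // like sortByKey() / sortByKey(true)
val desc = counts.sortBy(_._1)(Ordering[String].reverse)  // like sortByKey(false)
```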
3. Word count, sorted by value
To sort by value in descending order:
First swap key and value, sort by the (new) key in descending order, then swap back:
wordCount.map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1)).collect().foreach(println)
(hive,3)
(hadoop,3)
(sqoop,2)
(hbase,2)
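The swap-sort-swap trick works the same way on plain Scala collections; a local sketch using the word counts above:

```scala
val counts = List(("hive", 3), ("sqoop", 2), ("hadoop", 3), ("hbase", 2))

// swap -> sort by the (now leading) count, descending -> swap back
val byValueDesc = counts
  .map { case (k, v) => (v, k) }
  .sortBy(_._1)(Ordering[Int].reverse)
  .map { case (v, k) => (k, v) }
```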
4. Word count, taking the top N words
Use top or take. Note that top(n) does not keep the preceding sort: it selects the n largest elements under the implicit Ordering of the element type, and for (String, Int) that ordering compares the word first. Calling take(3) on the value-sorted RDD instead would return the three highest counts in count order.
val topWord = wordCount.map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1)).top(3)
topWord: Array[(String, Int)] = Array((sqoop,2), (hive,3), (hbase,2))
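The order of topWord above shows this: it is ordered by word descending, not by count. A plain-Scala sketch of the tuple Ordering reproduces exactly the result shown:

```scala
// The value-sorted pairs that top(3) receives.
val pairs = List(("hive", 3), ("hadoop", 3), ("sqoop", 2), ("hbase", 2))

// top(3) on an RDD behaves like sorting descending by the element's natural
// Ordering and taking 3; for (String, Int) that compares the String first.
val top3 = pairs.sorted(Ordering[(String, Int)].reverse).take(3)
```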
5. Word count, Group Top Key
Requirement: group by key, then take the top N values in each group.
This is similar to secondary sort in MapReduce. Step analysis:
1) Group by the first field; the key "aa" has the values:
(aa, list(78, 80, 69, 97))
2) Sort the second field within each group; in descending order this gives:
(aa, list(97,80, 78, 69))
3) Take the first N values of each group; with N = 3:
(aa, list(97,80, 78))
HDFS has a file /user/root/spark/grouptop/input/score.input in which each line is a key-value pair:
bin/hdfs dfs -text spark/grouptop/input/score.input
aa 78
bb 98
aa 80
cc 98
aa 69
cc 87
bb 97
cc 86
aa 97
bb 78
bb 34
cc 85
bb 92
cc 72
bb 32
bb 23
Create the RDD:
val rdd = sc.textFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/root/spark/grouptop/input/score.input")
Step 1: split each line on spaces, producing an Array:
rdd.map(line => line.split(" ")).collect()
Array(aa, 78)
Step 2: use index 0 of each Array as the key and index 1 as the value, forming a pair RDD:
rdd.map(line => line.split(" ")).map(x => (x(0), x(1))).collect()
(aa, 78)
Step 3: apply groupByKey to the pair RDD:
rdd.map(line => line.split(" ")).map(x => (x(0), x(1))).groupByKey().collect()
(aa,CompactBuffer(78, 80, 69, 97))
Step 4: convert (aa,CompactBuffer(78, 80, 69, 97)) to a list.
CompactBuffer is an Iterable; an Iterable can be converted to a List, and List has a sorted method for sorting:
rdd.map(line => line.split(" ")).map(x => (x(0), x(1))).groupByKey().map(
x => {
val xx = x._1
val yy = x._2
yy.toList
}
).collect()
List(78, 80, 69, 97)
Step 5: sort the list elements by calling sorted, which returns ascending order. (Note that the values here are still strings, so the sort is lexicographic; it gives the right order in this example only because every score has two digits. Converting with toInt would be safer in general.)
rdd.map(line => line.split(" ")).map(x => (x(0), x(1))).groupByKey().map(
x => {
val xx = x._1
val yy = x._2
yy.toList.sorted
}
).collect()
List(69, 78, 80, 97)
Step 6: reverse returns the list in the opposite order, turning the ascending list into a descending one:
rdd.map(line => line.split(" ")).map(x => (x(0), x(1))).groupByKey().map(
x => {
val xx = x._1
val yy = x._2
yy.toList.sorted.reverse
}
).collect()
List(97, 80, 78, 69)
Step 7: take the first 3 values of each list; List has take(n), which returns the first n elements:
rdd.map(line => line.split(" ")).map(x => (x(0), x(1))).groupByKey().map(
x => {
val xx = x._1
val yy = x._2
yy.toList.sorted.reverse.take(3)
}
).collect()
List(97, 80, 78)
Step 8: return key-value tuples:
rdd.map(line => line.split(" ")).map(x => (x(0), x(1))).groupByKey().map(
x => {
val xx = x._1
val yy = x._2
(xx, yy.toList.sorted.reverse.take(3))
}
).collect()
(aa,List(97, 80, 78))
Putting the above together as groupTopKeyRdd and saving the result to HDFS:
val groupTopKeyRdd = rdd.map(line => line.split(" ")).map(x => (x(0), x(1))).groupByKey().map(
x => {
val xx = x._1
val yy = x._2
(xx, yy.toList.sorted.reverse.take(3))
}
)
groupTopKeyRdd.saveAsTextFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/root/spark/grouptop/output")
Read the HDFS output file:
bin/hdfs dfs -text /user/root/spark/grouptop/output/part-00000
(aa,List(97, 80, 78))
(bb,List(98, 97, 92))
(cc,List(98, 87, 86))
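The whole Group Top Key pipeline can be checked locally with plain Scala collections (a sketch of the RDD chain above, with score.input inlined). One difference worth noting: this sketch converts the scores with toInt so the sort is numeric, whereas the RDD version sorts the values as strings, which happens to give the same order here because every score has two digits.

```scala
// score.input, inlined for a local test.
val lines = List("aa 78", "bb 98", "aa 80", "cc 98", "aa 69", "cc 87",
                 "bb 97", "cc 86", "aa 97", "bb 78", "bb 34", "cc 85",
                 "bb 92", "cc 72", "bb 32", "bb 23")

val groupTop = lines
  .map(_.split(" "))
  .map(a => (a(0), a(1).toInt))        // toInt: sort numerically, not lexicographically
  .groupBy(_._1)                       // local stand-in for groupByKey
  .map { case (k, pairs) =>
    (k, pairs.map(_._2).sorted.reverse.take(3))  // descending, top 3 per key
  }

groupTop.foreach(println)
```

The result matches the HDFS output above: (aa,List(97, 80, 78)), (bb,List(98, 97, 92)), (cc,List(98, 87, 86)).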