Counting Word Occurrences
# Start spark-shell
cd /opt/apache_hadoop/spark-2.2.1
bin/spark-shell
Read the data
val path = "/word/word.txt"
Create an RDD
val rdd = sc.textFile(path)
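A note on the path: sc.textFile resolves a bare path against the cluster's default filesystem, which in a setup like this is usually HDFS. To make the source explicit, the URI scheme can be spelled out; the paths below are hypothetical examples, not taken from the original walkthrough:
// read from HDFS explicitly (hypothetical path)
val hdfsRdd = sc.textFile("hdfs:///word/word.txt")
// read from the local filesystem instead (hypothetical path)
val localRdd = sc.textFile("file:///opt/word/word.txt")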
Read each line of text and split it into an array of words
val rdd1 = rdd.flatMap(line => line.split("\t"))
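Each line is split on tab characters here. If the data happened to mix tabs and spaces, splitting on a whitespace regex would be a more forgiving variant of the same step (a sketch under that assumption, not something word.txt requires):
// split on any run of whitespace and drop empty tokens
val rdd1 = rdd.flatMap(line => line.split("\\s+").filter(_.nonEmpty))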
Inspect the result
rdd1.collect()
Output:
res1: Array[String] = Array(java, python, hadoop, scala, mysql, hdfs, hdfs, mapreduce,
yarn, hadoop, hadoop, scala, hive, hive, sqoop, hbase, kafka, hadoop, hbase, hadoop,
hive, flume, redis, redis, java, python, scala, sqoop, spark, spark, scala, zookeeper,
flume, hadoop, hdfs, hive)
Map each word in the array to a (word, 1) pair, recording a count of 1 per occurrence
val rdd2 = rdd1.map(word => (word,1))
Inspect the result
rdd2.collect()
Output:
res2: Array[(String, Int)] = Array((java,1), (python,1), (hadoop,1), (scala,1),
(mysql,1), (hdfs,1), (hdfs,1), (mapreduce,1), (yarn,1), (hadoop,1), (hadoop,1),
(scala,1), (hive,1), (hive,1), (sqoop,1), (hbase,1), (kafka,1), (hadoop,1), (hbase,1),
(hadoop,1), (hive,1), (flume,1), (redis,1), (redis,1), (java,1), (python,1), (scala,1),
(sqoop,1), (spark,1), (spark,1), (scala,1), (zookeeper,1), (flume,1), (hadoop,1),
(hdfs,1), (hive,1))
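collect() pulls the entire RDD back to the driver, which is fine for a small demo file but risky on real data. A lighter way to peek at a few elements is take:
// inspect only the first five pairs instead of materializing everything
rdd2.take(5)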