Start by downloading a Spark binary distribution, the build for Hadoop 2.7:
https://www.apache.org/dyn/closer.lua/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz
Then upload the package to the server and extract it:
tar -zxvf spark-2.3.3-bin-hadoop2.7.tgz
Then rename the extracted directory so the environment variables are easier to set up:
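For example (a sketch; the install directory /home/hadoop/app and the shorter name spark are my assumptions, not from the original):
cd /home/hadoop/app
mv spark-2.3.3-bin-hadoop2.7 spark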
Next, configure the environment variables:
vi /etc/profile (this is the file where I usually configure environment variables)
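A minimal sketch of the entries to append (the SPARK_HOME path assumes the rename above; adjust it to your actual install path):
export SPARK_HOME=/home/hadoop/app/spark
export PATH=$SPARK_HOME/bin:$PATH
After saving, run source /etc/profile so the changes take effect in the current shell.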
Now test Spark. Go into Spark's bin directory and run: spark-shell --master local[2]
(local[2] runs Spark locally with two worker threads.)
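Once the shell is up, a quick sanity check you can paste at the scala> prompt (just an illustration, not from the original):
sc.parallelize(1 to 100).sum    // should print 5050.0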
Now test WordCount.
Spark WordCount:
// Read a local text file (this sample uses /home/hadoop/data/wc.txt, with words separated by commas)
val file = spark.sparkContext.textFile("file:///home/hadoop/data/wc.txt")
// Split each line on ",", map every word to (word, 1), then sum the counts per word
val wordCounts = file.flatMap(line => line.split(",")).map(word => (word, 1)).reduceByKey(_ + _)
// Bring the results back to the driver and print them
wordCounts.collect
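If you want the output ordered by frequency, a small extension (my addition, not in the original):
wordCounts.sortBy(_._2, ascending = false).collect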
WordCount in Hive:
select word, count(1) from hive_wordcount lateral view explode(split(context,' ')) wc as word group by word;
lateral view explode(): split() breaks each row's context field into an array on the given delimiter, and lateral view explode() turns each array element into its own row, so the words can then be grouped and counted.
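For completeness, a sketch of the table the query above assumes (a single string column context holding space-separated words; the data path is hypothetical):
create table hive_wordcount(context string);
load data local inpath '/home/hadoop/data/hive_wc.txt' into table hive_wordcount;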