Spark installation and configuration: https://blog.csdn.net/seTaire/article/details/90263765
Run a simple word count example to verify that the installation and configuration succeeded.
1. First, on the host run sh start.sh to start the three containers; start.sh is given at https://blog.csdn.net/seTaire/article/details/90263765
2. Then enter the master container and run the following commands to format the NameNode and start the cluster (note that start-all.sh simply calls start-dfs.sh and start-yarn.sh, so running it after the two is redundant):
hdfs namenode -format
yes | start-dfs.sh
start-yarn.sh
3. Create a home directory on HDFS, then upload 1.txt:
hdfs dfs -mkdir /home
hdfs dfs -put 1.txt /home
Contents of 1.txt:
tu
tu wei
tu wei feng
4. Write the word count script, test.py:
from pyspark import SparkConf, SparkContext
from operator import add

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)

# Read the input from HDFS (fs.defaultFS resolves the absolute path).
lines = sc.textFile("/home/1.txt")
print("#" * 20)

# flatMap: split each line on spaces and flatten into a single RDD of words.
words = lines.flatMap(lambda line: line.split(' '))
print(words.take(3))

# Map each word to a (word, 1) pair, then sum the counts per word.
wc = words.map(lambda x: (x, 1))
counts = wc.reduceByKey(add)
print(counts.take(3))

# Save the result to HDFS; a relative path lands under /user/root/.
counts.saveAsTextFile("wcres")
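For intuition, the flatMap / map / reduceByKey pipeline above can be modeled in plain Python without Spark. This is only a sketch of the semantics (the variable names mirror the script; the input is the 1.txt content from step 3):

```python
from operator import add
from functools import reduce
from itertools import groupby

lines = ["tu", "tu wei", "tu wei feng"]

# flatMap: split each line and flatten everything into one list of words.
words = [w for line in lines for w in line.split(' ')]

# map: pair each word with an initial count of 1.
pairs = [(w, 1) for w in words]

# reduceByKey: group the pairs by key, then fold each group's counts with add.
counts = {
    key: reduce(add, (v for _, v in group))
    for key, group in groupby(sorted(pairs), key=lambda p: p[0])
}
print(counts)  # {'feng': 1, 'tu': 3, 'wei': 2}
```

The main difference in real Spark is that the words are partitioned across executors, so reduceByKey shuffles pairs with the same key to the same partition before summing.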
5. Submit test.py to Spark (since test.py calls setMaster("local"), the job runs in local mode; a master set in code takes precedence over any --master flag passed to spark-submit):
spark-submit test.py
6. View the result:
hdfs dfs -cat /user/root/wcres/part-00000
Contents of part-00000:
(u'wei', 2)
(u'tu', 3)
(u'feng', 1)
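Each line of part-00000 is the repr of a Python tuple (the u'' prefix comes from Python 2 unicode strings), so the saved output can be parsed back into a dict with ast.literal_eval. A small sketch, assuming the three lines shown above:

```python
import ast

# Lines as they appear in part-00000 (Python 2 tuple reprs).
raw = [
    "(u'wei', 2)",
    "(u'tu', 3)",
    "(u'feng', 1)",
]

# literal_eval safely parses each tuple repr back into a (word, count) pair;
# the u'' prefix is still accepted as a string literal in Python 3.
counts = dict(ast.literal_eval(line) for line in raw)
print(counts)  # {'wei': 2, 'tu': 3, 'feng': 1}
```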
Note: if you hit errors when running inside Docker containers, 90% of the time it is a mistake in the hdfs-site.xml configuration (in my setup both NameNodes are configured on master). On a virtual machine or physical host, the likely cause is a firewall that was not turned off. In any case, check the actual error message.