A Simple Spark RDD Example in Python
References:
Spark Downloads
Spark Quick Start
RDD Programming Guide
Installing the Spark Python API
Download the spark-2.3.0-bin-hadoop2.7.tgz package from http://spark.apache.org/downloads.html.
Upload it to your Linux server; here it is placed under /root.
Extract it:
tar -zxvf spark-2.3.0-bin-hadoop2.7.tgz
mv spark-2.3.0-bin-hadoop2.7 spark
- Configure environment variables:
vim /etc/profile
export SPARK=/root/spark
export PATH=$PATH:$JAVA_HOME/bin:$GOROOT/bin:$SPARK/bin
# then apply the changes:
source /etc/profile
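If you would rather not edit /etc/profile system-wide, the same variables can also be set from Python before launching anything Spark-related — a minimal sketch, assuming Spark was unpacked to /root/spark (the path is this tutorial's choice; adjust it for your server):

```python
import os

# Assumed install location from the steps above -- adjust to your server.
spark_home = "/root/spark"

# Mirror the /etc/profile settings for the current process only.
os.environ["SPARK_HOME"] = spark_home
os.environ["PATH"] = spark_home + "/bin" + os.pathsep + os.environ.get("PATH", "")
```

This only affects the current process and its children, so it is handy for testing before committing the change to /etc/profile.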
- Data
Contents of the data.txt file:
# more data/data.txt
1234 5678
90 123
123 hao qwe
123 973
123 akjf
456 kjalfkdf
dksfjlk
456 898-0
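To reproduce the example from scratch, the sample file can be written out with a short Python snippet (the exact whitespace of the original file may differ slightly — the empty-word count in the log output below suggests it contains a stray extra space):

```python
import os

# Create the data directory next to mywordcount.py
os.makedirs("data", exist_ok=True)

sample = """1234 5678
90 123
123 hao qwe
123 973
123 akjf
456 kjalfkdf
dksfjlk
456 898-0
"""

with open("data/data.txt", "w") as f:
    f.write(sample)
```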
- Python file
mywordcount.py:
"""
mywordcount.py
"""
from operator import add
from pyspark.sql import SparkSession

logFile = "./data/data.txt"

# Build (or reuse) a SparkSession
spark = SparkSession\
    .builder\
    .appName("PythonWordCount")\
    .getOrCreate()

# Read the file as a DataFrame and extract the text column as an RDD of lines
lines = spark.read.text(logFile).rdd.map(lambda r: r[0])

# Split each line into words, pair each word with 1, then sum the counts per word
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)

# Bring the (word, count) pairs back to the driver and print them
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))

spark.stop()
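To see what each RDD operation computes, here is the same pipeline written in plain Python over the sample lines — a sketch for illustration only (Spark performs these steps in parallel across partitions):

```python
from collections import defaultdict

# Stand-in for the RDD of lines read from data.txt
lines = [
    "1234 5678",
    "90 123",
    "123 hao qwe",
    "123 973",
    "123 akjf",
    "456 kjalfkdf",
    "dksfjlk",
    "456 898-0",
]

# flatMap: split every line on ' ' and flatten into a single word list
words = [w for line in lines for w in line.split(' ')]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey(add): sum the counts for each distinct word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(counts["123"], counts["456"])  # 4 2
```

The resulting counts (123 appears 4 times, 456 twice) match the Spark output shown below.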
- Execution command
spark-submit --master local[4] mywordcount.py
- Execution result:
......
2018-06-12 11:25:53 INFO DAGScheduler:54 - Job 0 finished: collect at /root/hao/spark/mywordcount.py:18, took 1.032415 s
1234: 1
: 1
973: 1
dksfjlk: 1
hao: 1
qwe: 1
kjalfkdf: 1
456: 2
898-0: 1
123: 4
akjf: 1
5678: 1
90: 1
2018-06-12 11:25:53 INFO AbstractConnector:318 - Stopped Spark@bb90840{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
......
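Note that collect() returns the (word, count) pairs in no particular order, which is why the words above appear unsorted. For deterministic output you can sort on the driver — a sketch using a few hypothetical pairs of the kind collect() returns:

```python
# Hypothetical pairs as returned by counts.collect()
output = [("456", 2), ("123", 4), ("hao", 1), ("5678", 1)]

# Sort by descending count, breaking ties alphabetically
for word, count in sorted(output, key=lambda wc: (-wc[1], wc[0])):
    print("%s: %i" % (word, count))  # 123 first (count 4), then 456 (count 2)
```

For large results, the sort can also be pushed into the cluster with counts.sortBy(lambda wc: -wc[1]) before collecting.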