Write the Python code (wordcount.py)
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("PythonWordCount").getOrCreate()
    # Read the file as a DataFrame of rows and keep the single text column of each row.
    lines = spark.read.text('hdfs:///user/asmp/flume_test/word.txt').rdd.map(lambda r: r[0])
    # Split each line on spaces, pair every word with 1, then sum the 1s per word.
    counts = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
    # Bring the (word, count) pairs back to the driver and print them.
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))
    spark.stop()
Dataset (word.txt)
hello word1 word2 word2
hello2 word1 word2 word2
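To check the counting logic without a cluster, the same split-and-sum steps can be sketched in plain Python. This is only a local stand-in, not the Spark job itself: the helper name count_words is hypothetical, and the two sample lines are copied from word.txt above.

```python
from collections import Counter

# The two sample lines from word.txt, hard-coded here for a local check.
SAMPLE_LINES = [
    "hello word1 word2 word2",
    "hello2 word1 word2 word2",
]


def count_words(lines):
    """Local stand-in for flatMap -> map -> reduceByKey (hypothetical helper, no Spark)."""
    counts = Counter()
    for line in lines:
        # Same delimiter as the Spark job: a single space.
        for word in line.split(' '):
            counts[word] += 1
    return dict(counts)


if __name__ == "__main__":
    # Sort by word so the printed order is stable.
    for word, count in sorted(count_words(SAMPLE_LINES).items()):
        print("%s: %i" % (word, count))
```

For this dataset the sketch gives hello: 1, hello2: 1, word1: 2, word2: 4. The Spark job should report the same counts, though collect() does not guarantee any particular order of the pairs.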
Run on the cluster: shell> spark-submit ./wordcount.py
Output: