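First, read data from a socket, then use Spark Streaming to count the words in the input.
1. Start a listener on the port with the command below (if the command fails, install nc first)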
nc -lk 9999
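If nc cannot be installed, a small Python script could stand in for it. The following is only a sketch under assumptions: the helper names open_server and serve_lines are hypothetical, it serves a single client and then stops (unlike nc -lk, which keeps listening), and it sends UTF-8 text terminated by newlines, which is the format socketTextStream reads.

```python
import socket


def open_server(host="localhost", port=9999):
    """Create a listening TCP socket (hypothetical helper, not part of Spark)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts on the same port
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def serve_lines(server, lines):
    """Wait for one client, then send every item of `lines` as one text line."""
    conn, _ = server.accept()
    with conn:
        for line in lines:
            # Normalize so that each item ends with exactly one newline
            conn.sendall((line.rstrip("\n") + "\n").encode("utf-8"))
    server.close()
```

To feed typed input to the stream, a script could call serve_lines(open_server(), sys.stdin) after importing sys; the call blocks until the Spark job connects.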
2. Write the code for sparkstreaming.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# Create a local StreamingContext with two worker threads and a batch interval of 1 second
# At least 2 cores are needed, because one of them is used to read the data
sc = SparkContext("local[2]", "NetworkWordCount")
# Read the data stream in 1-second batches
ssc = StreamingContext(sc, 1)
# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)
# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
ssc.start() # Start the computation
ssc.awaitTermination() # Wait for the computation to terminate
Once per second, this code reads the data that arrived on port 9999 during that interval and computes the word count of that data.
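To make the per-batch behaviour concrete, here is a minimal sketch in plain Python, without Spark, of what the flatMap, map and reduceByKey steps compute for one batch. The function name count_batch and the sample batches are made up for illustration. Each batch is counted on its own, so the totals do not accumulate across batches.

```python
def count_batch(lines):
    """Count words in the lines of a single batch (illustrative, not Spark code)."""
    # flatMap: split every line into words
    words = [word for line in lines for word in line.split(" ")]
    # map: pair every word with a count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: add up the counts that share the same word
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts


# Two hypothetical 1-second batches of input lines
batches = [["hello world", "hello spark"], ["hello again"]]
for batch in batches:
    print(count_batch(batch))
# prints {'hello': 2, 'world': 1, 'spark': 1}
# then   {'hello': 1, 'again': 1}
```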
3. Run the code above with spark-submit --master local sparkstreaming.py
When you type data into the window from step 1, the word counts appear in the window where Spark is running.