1. WordCount in one line of code, saving the result
Contents of hello.txt:
sc.textFile("/opt/bigdatas/hello.txt").flatMap(lambda line: line.split("\t")).map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y).saveAsTextFile("/opt/bigdatas/result/wc/001")
Output:
2. WordCount implemented in Python code
Development steps:
Split each line of the text into individual words: flatMap()
Turn each word into a pair (word, 1): map()
Sum the counts of identical words to get the final result: reduceByKey()
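The three steps above can be sketched in plain Python (no Spark needed), using a hypothetical two-line, tab-separated input:

```python
# Hypothetical sample input: two lines of tab-separated words.
lines = ["hello\tworld", "hello\tspark"]

# flatMap(): split every line into words and flatten into one list
words = [w for line in lines for w in line.split("\t")]
# -> ['hello', 'world', 'hello', 'spark']

# map(): turn each word into a (word, 1) pair
pairs = [(w, 1) for w in words]
# -> [('hello', 1), ('world', 1), ('hello', 1), ('spark', 1)]

# reduceByKey(): sum the counts of identical words
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts)  # {'hello': 2, 'world': 1, 'spark': 1}
```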
import sys
from pyspark import SparkContext
from pyspark import SparkConf

"""
Word count
"""
if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Usage: wordcount <input>", file=sys.stderr)
        sys.exit(-1)

    conf = SparkConf().setMaster("local[2]").setAppName("spark03")
    sc = SparkContext(conf=conf)

    def printResult():
        counts = (sc.textFile(sys.argv[1])
                  .flatMap(lambda line: line.split("\t"))
                  .map(lambda x: (x, 1))
                  .reduceByKey(lambda a, b: a + b))
        output = counts.collect()
        for (word, count) in output:
            print("%s : %i" % (word, count))

    printResult()
    sc.stop()
Testing:
1. A single file:
./spark-submit --master local[2] --name wordcount01 /opt/script/wordcount01.py file:///opt/bigdatas/hello.txt
2. Multiple files in a directory
Copy several hello.txt files into the wc directory:
./spark-submit --master local[2] --name wordcount01 /opt/script/wordcount01.py file:///opt/bigdatas/wc
3. Glob-pattern matching
Match only the .txt files under the wc directory:
./spark-submit --master local[2] --name wordcount01 /opt/script/wordcount01.py file:///opt/bigdatas/wc/*.txt
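The *.txt pattern in the command above selects only files whose names end in .txt, while other files in the directory are ignored. A small sketch of that selection with Python's glob module, using a hypothetical directory with mixed file types:

```python
import glob
import os
import tempfile

# Hypothetical wc directory containing two .txt files and one .md file.
wc = tempfile.mkdtemp()
for name in ["hello1.txt", "hello2.txt", "notes.md"]:
    open(os.path.join(wc, name), "w").close()

# The *.txt glob matches only the two text files.
matched = sorted(os.path.basename(p)
                 for p in glob.glob(os.path.join(wc, "*.txt")))
print(matched)  # ['hello1.txt', 'hello2.txt']
```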
3. WordCount implemented in Python code, with the result saved
import sys
from pyspark import SparkContext
from pyspark import SparkConf

"""
Word count
"""
if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("Usage: wordcount <input> <output>", file=sys.stderr)
        sys.exit(-1)

    conf = SparkConf().setMaster("local[2]").setAppName("spark03")
    sc = SparkContext(conf=conf)

    def saveFile():
        (sc.textFile(sys.argv[1])
         .flatMap(lambda line: line.split("\t"))
         .map(lambda x: (x, 1))
         .reduceByKey(lambda a, b: a + b)
         .saveAsTextFile(sys.argv[2]))

    saveFile()
    sc.stop()
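Note that saveAsTextFile() writes a directory, not a single file: Spark creates the <output> path, emits one part-0000x file per partition (each line holding the repr of one (word, count) tuple), plus a _SUCCESS marker on success. A local simulation of that layout, with hypothetical counts:

```python
import os
import tempfile

# Simulated output directory (in real Spark this is the <output> argument).
out = tempfile.mkdtemp()
counts = {"hello": 2, "world": 1}  # hypothetical word counts

# One part file per partition; here we simulate a single partition.
with open(os.path.join(out, "part-00000"), "w") as f:
    for item in counts.items():
        f.write(str(item) + "\n")

# Spark adds an empty _SUCCESS marker when the job completes.
open(os.path.join(out, "_SUCCESS"), "w").close()

print(sorted(os.listdir(out)))  # ['_SUCCESS', 'part-00000']
```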
Test with a single file and with multiple files, as in the previous section: