Data preview
//There are 4 files with identical content
[hadoop@hadoop000 wordcount]$ ls
hello.txt hello - 副本 (2).txt hello - 副本 (3).txt hello - 副本.txt
//File contents
[hadoop@hadoop000 wordcount]$ cat hello.txt
hello spark
hello flink
hello hadoop
Python code
import sys

from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("usage: wordcount <input>", file=sys.stderr)
        sys.exit(-1)

    conf = SparkConf()
    sc = SparkContext(conf=conf)

    counts = sc.textFile(sys.argv[1]) \
        .flatMap(lambda line: line.split("\t")) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .map(lambda x: (x[1], x[0])) \
        .sortByKey(False) \
        .map(lambda x: (x[1], x[0]))  # swap to (count, word), sort descending, swap back

    output = counts.collect()
    for (word, count) in output:
        print("%s:%i" % (word, count))

    sc.stop()
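Before submitting to a cluster, the transformation chain can be sanity-checked in plain Python (a local sketch only, no Spark involved; the lines are assumed tab-delimited to match the `split("\t")` in the job, and the four identical files are simulated by repeating the three lines four times):

```python
from collections import Counter

# Simulate 4 identical files, each with 3 tab-delimited lines
lines = ["hello\tspark", "hello\tflink", "hello\thadoop"] * 4

# flatMap: split each line into words
words = [w for line in lines for w in line.split("\t")]

# map + reduceByKey: count occurrences of each word
counts = Counter(words)

# mirror the swap / sortByKey(False) / swap-back: sort by count, descending
output = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
for word, count in output:
    print("%s:%i" % (word, count))
```

With 4 files × 3 lines each, "hello" appears 12 times and each of the other words 4 times, matching the run results below.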
Submit and run
./spark-submit --master local[2] \
--name wordcount \
/home/hadoop/script/wordcount.py \
file:/home/hadoop/data/pyspark_data/wordcount/*
Notes:
1. To read multiple files, append /* after the directory path
2. To read local files, prefix the path with file:, otherwise Spark defaults to reading from HDFS
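For contrast with note 2, a sketch of the same submission reading from HDFS instead of the local filesystem (the namenode address below is an assumption; adjust it to your cluster):

```shell
# Sketch: same job, input read from HDFS instead of file:
# hdfs://hadoop000:8020 is an assumed namenode host/port -- not from the original post
./spark-submit --master local[2] \
  --name wordcount \
  /home/hadoop/script/wordcount.py \
  hdfs://hadoop000:8020/pyspark_data/wordcount/*
```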
Run results
//Key lines from the log output
//Spark version
20/08/19 17:10:20 INFO SparkContext: Running Spark version 2.4.3
//Name of the submitted application
20/08/19 17:10:20 INFO SparkContext: Submitted application: wordcount
//Spark UI port
20/08/19 17:10:20 INFO Utils: Successfully started service 'SparkUI' on port 4040.
//4 input files in total
20/08/19 17:10:21 INFO FileInputFormat: Total input paths to process : 4
//4 tasks in total
20/08/19 17:40:30 INFO TaskSchedulerImpl: Adding task set 1.0 with 4 tasks
//Final output
hello:12
hadoop:4
spark:4
flink:4