PySpark word count (written in Python) with the input file on HDFS

小懒胖熊

Article link: https://blog.csdn.net/weixin_41895381/article/details/89485675

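Before writing the code, here is what the methods used in it do:
**SparkContext:** plays the leading role while a Spark application runs. It handles the interaction between the program and the Spark cluster, including requesting cluster resources and creating RDDs, accumulators, and broadcast variables.
**sc.textFile(path):** reads the contents of all files under path, treating each line of a file as one record. Each line becomes one element of the resulting RDD, much like one item in a list, so within a partition the lines can be traversed with a loop of the form for i in data.
**map(func):** passes each element to the function func and returns the results as a new dataset.
**flatMap(func):** similar to map(), but each input element can be mapped to zero or more output results.
**reduceByKey(func):** when applied to a dataset of (K, V) key-value pairs, returns a new dataset of (K, V) pairs in which the values belonging to each key are aggregated with the function func.
**saveAsTextFile:** writes one output file per task that runs, which in practice means one part file per partition of the RDD.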
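To make the difference between these operations concrete, here is a minimal pure-Python sketch that mimics them on a small in-memory list. It does not use Spark at all: the sample `lines` and the helper `reduce_by_key` are hypothetical stand-ins introduced only for illustration, and the real RDD methods run distributed across partitions.

```python
from itertools import chain

# Hypothetical sample input: two "lines" of a text file
lines = ["hello spark", "hello hdfs"]

# map-like: exactly one output element per input element (here, one list per line)
mapped = [line.split(" ") for line in lines]

# flatMap-like: each input yields zero or more outputs, flattened into one sequence
words = list(chain.from_iterable(line.split(" ") for line in lines))

# Turn every word into a (word, 1) key-value pair
pairs = [(w, 1) for w in words]


def reduce_by_key(pairs, func):
    """Local stand-in for reduceByKey: merge the values that share a key using func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


counts = reduce_by_key(pairs, lambda x, y: x + y)

print(mapped)  # [['hello', 'spark'], ['hello', 'hdfs']]
print(words)   # ['hello', 'spark', 'hello', 'hdfs']
print(counts)  # {'hello': 2, 'spark': 1, 'hdfs': 1}
```

The nested result of the map-like step is why word count uses flatMap: the counting step needs a flat sequence of words, not one list per line.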

The code is as follows:

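import subprocess
from pyspark import SparkContext
input_path = 'hdfs://master:9000/hello.txt'
output_path = 'hdfs://master:9000/out1'
sc = SparkContext('local', 'WordCount')
# Read the file
test_file = sc.textFile(input_path)
# Split each line into words
words = test_file.flatMap(lambda line: line.split(' '))
# Convert to key-value pairs and count
counts = words.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
# Print the results
counts.foreach(print)
# Delete the output directory if it already exists. The path is on HDFS, so
# os.path.exists and shutil.rmtree (local filesystem only) cannot see it.
# Assumption: the `hdfs` command-line client is available on the PATH.
subprocess.run(['hdfs', 'dfs', '-rm', '-r', '-f', output_path], check=True)
# Write the counts to the output directory
counts.saveAsTextFile(output_path)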
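One detail of the splitting step is worth checking locally before running the job. The sketch below uses plain Python, with a made-up sample `line`, to compare `split(' ')` as used in the code with the argument-free `split()`. Whether the stricter behaviour matters depends on how clean the input file is.

```python
from collections import Counter

# Hypothetical sample line: two spaces in the middle and one trailing space
line = "hello  spark hello "

# Splitting on a single space keeps empty strings for repeated or trailing spaces
by_space = line.split(" ")

# Splitting with no argument collapses any run of whitespace and drops empty strings
by_whitespace = line.split()

print(by_space)       # ['hello', '', 'spark', 'hello', '']
print(by_whitespace)  # ['hello', 'spark', 'hello']

# With split(' '), the empty string would be counted as if it were a word
print(Counter(by_space)[""])  # 2
```

If the input may contain repeated spaces or tabs, using `line.split()` in the flatMap lambda avoids counting empty strings as words.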

Alternatively, the whole word count can be written as a single chained expression:

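import subprocess
from pyspark import SparkContext
input_path = 'hdfs://master:9000/hello.txt'
output_path = 'hdfs://master:9000/out1'
sc = SparkContext('local', 'WordCount')
# Read, split into words, map to (word, 1) pairs, and sum the counts per word
counts = sc.textFile(input_path).flatMap(lambda line: line.split(' ')).map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
counts.foreach(print)
# Delete the output directory if it already exists. The path is on HDFS, so
# os.path.exists and shutil.rmtree (local filesystem only) cannot see it.
# Assumption: the `hdfs` command-line client is available on the PATH.
subprocess.run(['hdfs', 'dfs', '-rm', '-r', '-f', output_path], check=True)
# Write the counts to the output directory
counts.saveAsTextFile(output_path)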
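Once the job has finished, the part files in the output directory can be read back. The sketch below assumes each line is the Python string form of a (word, count) tuple, which is how PySpark typically writes non-string elements. The `sample_lines` are invented for illustration and are not output from the job above.

```python
import ast

# Hypothetical lines, in the form they would be expected to take in a part-00000 file
sample_lines = ["('hello', 2)", "('spark', 1)"]


def parse_line(line):
    """Turn a line such as "('hello', 2)" back into a (word, count) tuple."""
    word, count = ast.literal_eval(line)
    return word, count


result = dict(parse_line(line) for line in sample_lines)
print(result)  # {'hello': 2, 'spark': 1}
```

`ast.literal_eval` is used instead of `eval` because it only accepts Python literals, which is safer when the file contents are not fully trusted.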