PySpark Word Count

醉糊涂仙

Article link: https://blog.csdn.net/u010916338/article/details/106132757

Contents

1. Shell mode
- 1.1 Shell local mode
- 1.2 Shell cluster mode
2. Cluster mode

1. Shell mode

1.1 Shell local mode

pyspark  # start the PySpark shell in local mode

# Input data
data = ["hello", "world", "hello", "world"]

# Convert the Python collection into a Spark RDD and count each word
rdd = sc.parallelize(data)
res_rdd = rdd.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Collect the RDD back into a Python list and print it
res_rdd_coll = res_rdd.collect()
for line in res_rdd_coll:
    print(line)    # indent this line with the Tab key when typing it in the shell
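
To make explicit what map and reduceByKey compute in the snippet above, here is a plain-Python sketch that needs no Spark at all. The helper name reduce_by_key is hypothetical and only mimics the final result; it says nothing about how Spark distributes the work or in what order collect() returns the pairs.

```python
def reduce_by_key(pairs, func):
    """Merge the values of pairs that share a key (hypothetical stand-in for RDD.reduceByKey)."""
    merged = {}
    for key, value in pairs:
        # First occurrence of a key stores the value; later ones are combined with func
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


data = ["hello", "world", "hello", "world"]

# Equivalent of rdd.map(lambda word: (word, 1))
pairs = [(word, 1) for word in data]

# Equivalent of .reduceByKey(lambda a, b: a + b)
counts = reduce_by_key(pairs, lambda a, b: a + b)

for item in sorted(counts.items()):
    print(item)
# prints ('hello', 2) then ('world', 2)
```

Each word is first turned into a (word, 1) pair, and the pairs with the same word are then added together, which is why both words end up with a count of 2.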

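Note:
A shell started in local mode does not show up in the master web UI at masterip:8080.
It can only be monitored on port 4040 of the machine running the shell (ip:4040).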
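
One way to check from inside the shell which mode is active and where its monitoring page lives is to read two attributes of the SparkContext. The sketch below is only an illustration: describe_context is a hypothetical helper name, and it assumes the object passed in exposes the master and uiWebUrl attributes the way PySpark's SparkContext does.

```python
def describe_context(sc):
    """Summarize where a SparkContext runs (hypothetical helper, not part of PySpark)."""
    master = sc.master  # e.g. "local[*]" in local mode, "spark://big07:7077" on the cluster
    return {
        "master": master,
        "ui": sc.uiWebUrl,                    # address of this application's own UI (port 4040 by default)
        "local": master.startswith("local"),  # True when the shell was started in local mode
    }

# In the pyspark shell, where sc already exists:
#     describe_context(sc)
```

If "local" comes back True, the application is not registered with the standalone master, which matches the note above about the 8080 page.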

1.2 Shell cluster mode

pyspark --master spark://big07:7077  # start the PySpark shell against the cluster master


2. Cluster mode

spark-submit --master spark://big07:7077 test1.py  # submit the script test1.py, listed below, to the cluster

from pyspark import SparkConf, SparkContext

# Configure the application: master URL and application name
conf = SparkConf()
conf.setMaster("spark://big07:7077")
conf.setAppName("test application")

sc = SparkContext(conf=conf)

# Input data
data = ["hello", "world", "hello", "world"]

# Convert the Python collection into a Spark RDD and count each word
rdd = sc.parallelize(data)
res_rdd = rdd.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Collect the RDD back into a Python list and print it
res_rdd_coll = res_rdd.collect()
for line in res_rdd_coll:
    print(line)

# SparkContext has no close() method; stop() shuts the context down
sc.stop()

