PySpark

I. Spark

Spark is a unified analytics engine for large-scale data processing, i.e., a distributed computing framework.

PySpark is Spark's support for the Python language: a third-party Python library developed by the Spark project itself.

1. Installing PySpark

Option 1: pip install pyspark
Option 2 (Tsinghua mirror): pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyspark
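
To verify the installation (a quick sanity check, assuming pip installed into the interpreter you are running), import the library and print its version:

# Confirm PySpark can be imported and show the installed version
import pyspark
print(pyspark.__version__)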

2. Using PySpark

To process data with the PySpark library, you first need to build an execution-environment entry object:

from pyspark import SparkConf, SparkContext

# Create a SparkConf object
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
# Create a SparkContext based on the SparkConf object
sc = SparkContext(conf=conf)
# Print the Spark version
print(sc.version)
# Stop Spark
sc.stop()

2.1 Data Input

2.1.1 Python objects

RDD objects

PySpark supports many kinds of data input. Whatever the source, the input ends up as an RDD object (Resilient Distributed Datasets), and computations on RDDs also return RDD objects.

from pyspark import SparkConf, SparkContext

# Create a SparkConf object
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
# Create a SparkContext based on the SparkConf object
sc = SparkContext(conf=conf)

# Convert Python container objects into RDDs
rdd1 = sc.parallelize([1, 2, 3, 4, 5, 6])
rdd2 = sc.parallelize((1, 2, 3, 4, 5, 6))
rdd3 = sc.parallelize("abcdefg")
rdd4 = sc.parallelize({1, 2, 3, 4, 5, 6})
rdd5 = sc.parallelize({"key1": "value1", "key2": "value2"})

# To inspect the contents of an RDD, call its collect() method
print(rdd1.collect())
print(rdd2.collect())
print(rdd3.collect())
print(rdd4.collect())
print(rdd5.collect())

sc.stop()
#----------- Output ---------------
[1, 2, 3, 4, 5, 6]
[1, 2, 3, 4, 5, 6]
['a', 'b', 'c', 'd', 'e', 'f', 'g']
[1, 2, 3, 4, 5, 6]
['key1', 'key2']
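
As the output shows, a string is split into individual characters and a dict keeps only its keys. If you also need the values, one workaround (a minimal sketch, not from the original; assumes it runs with the same sc before sc.stop() is called) is to pass the dict's key-value pairs via dict.items():

# Turn the dict into a list of (key, value) tuples before parallelizing
rdd6 = sc.parallelize(list({"key1": "value1", "key2": "value2"}.items()))
print(rdd6.collect())  # [('key1', 'value1'), ('key2', 'value2')]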
2.1.2 Files
from pyspark import SparkConf, SparkContext

# Create a SparkConf object
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
# Create a SparkContext based on the SparkConf object
sc = SparkContext(conf=conf)

# Read a text file into an RDD; each line of the file becomes one element
rdd = sc.textFile("F:/study/code/资料/hello.txt")
print(rdd.collect())

sc.stop()

2.3 Data Computation

2.3.1 map

Processes the RDD's data one element at a time (the processing logic is the function passed to the map operator) and returns a new RDD.

from pyspark import SparkConf, SparkContext
# Tell PySpark which Python interpreter to use
import os
os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"

# Create a SparkConf object
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
# Create a SparkContext based on the SparkConf object
sc = SparkContext(conf=conf)

# Prepare an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7])


# Use map to multiply every element by 10
def func(data):
    return data * 10

rdd2 = rdd.map(func)
print(rdd2.collect())
#----------- Output ---------------
[10, 20, 30, 40, 50, 60, 70]
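
Because the processing logic here is a single expression, it is usually written as a lambda; map calls can also be chained, since each one returns a new RDD (a small illustrative sketch reusing the rdd defined above, not part of the original code):

# Lambda form of the same operation, chained with a second map: multiply by 10, then add 5
rdd3 = rdd.map(lambda x: x * 10).map(lambda x: x + 5)
print(rdd3.collect())  # [15, 25, 35, 45, 55, 65, 75]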
2.3.2 flatMap

Purpose: performs a map over the RDD and then flattens (un-nests) the result.

from pyspark import SparkConf, SparkContext
import os
os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# Prepare an RDD
rdd = sc.parallelize(["itheima itcast 666", "itheima itheima itcast", "python itheima"])

# Extract every individual word from the strings in the RDD
rdd2 = rdd.flatMap(lambda x: x.split(" "))
print(rdd2.collect())
#----------- Output ---------------
['itheima', 'itcast', '666', 'itheima', 'itheima', 'itcast', 'python', 'itheima']
2.3.3 reduceByKey

For key-value (K-V) RDDs: automatically groups the data by key, then aggregates the values within each group according to the aggregation function you supply.

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# Prepare an RDD
rdd = sc.parallelize([('男', 99), ('男', 88), ('女', 99), ('女', 66)])
# Group by key and sum the values
rdd2 = rdd.reduceByKey(lambda a, b: a + b)
print(rdd2.collect())
#----------- Output ---------------
[('男', 187), ('女', 165)]
2.3.4 Example: word count

Use PySpark to count the words in a file.

hello.txt

itheima itheima itcast itheima
spark python spark python itheima
itheima itcast itcast itheima python
python python spark pyspark pyspark
itheima python pyspark itcast spark

example.py

# 1. Build the PySpark execution entry point
from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# 2. Read the data file
rdd = sc.textFile("F:/study/code/资料/hello.txt")

# 3. Extract all words
word_rdd = rdd.flatMap(lambda x: x.split(" "))

# 4. Convert each word into a (word, 1) tuple: the word is the key, 1 is the value
word_with_one_rdd = word_rdd.map(lambda word: (word, 1))

# 5. Group by key and sum
result_rdd = word_with_one_rdd.reduceByKey(lambda a, b: a + b)
print(result_rdd.collect())
#----------- Output ---------------
[('itcast', 4), ('python', 6), ('itheima', 7), ('spark', 4), ('pyspark', 3)]
2.3.5 filter

Purpose: filters the data, keeping only the elements you want.

rdd.filter(func)
# func: (T) -> bool — takes one argument; the return value must be a boolean

Task: given an RDD, remove the odd numbers.

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# Prepare an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = rdd.filter(lambda num: num % 2 == 0)
print(rdd2.collect())
#----------- Output ---------------
[2, 4]
2.3.6 distinct

Purpose: removes duplicate elements from the RDD and returns a new RDD.

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# Prepare an RDD
rdd = sc.parallelize([1, 2, 5, 6, 9, 8, 5, 6, 3, 2, 1, 4, 7, 5, 6, 8])

# Deduplicate the RDD
rdd2 = rdd.distinct()
print(rdd2.collect())
#----------- Output ---------------
[8, 1, 9, 2, 3, 4, 5, 6, 7]
2.3.7 sortBy

Purpose: sorts the RDD's data; the sort key can be specified.

rdd.sortBy(func, ascending=False, numPartitions=1)
# func: (T) -> U — tells sortBy which part of each element to sort by, e.g. lambda x: x[1] sorts by the second field
# ascending: True for ascending order, False for descending
# numPartitions: how many partitions to use for the sort
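
The key function and the ascending flag are easiest to see on a tiny example (a minimal sketch, not from the original; assumes an existing SparkContext sc):

# Sort simple pairs by their first field, in ascending order
rdd = sc.parallelize([("b", 2), ("a", 3), ("c", 1)])
print(rdd.sortBy(lambda x: x[0], ascending=True, numPartitions=1).collect())
# [('a', 3), ('b', 2), ('c', 1)]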

analys.sortBy.py

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# 2. Read the data file
rdd = sc.textFile("F:/study/code/资料/hello.txt")

# 3. Extract all words
word_rdd = rdd.flatMap(lambda x: x.split(" "))

# 4. Convert each word into a (word, 1) tuple: the word is the key, 1 is the value
word_with_one_rdd = word_rdd.map(lambda word: (word, 1))

# 5. Group by key and sum
result_rdd = word_with_one_rdd.reduceByKey(lambda a, b: a + b)

# 6. Sort the results by count, descending
final_rdd = result_rdd.sortBy(lambda x: x[1], ascending=False, numPartitions=1)
print(final_rdd.collect())
#----------- Output ---------------
[('itheima', 7), ('python', 6), ('itcast', 4), ('spark', 4), ('pyspark', 3)]
2.3.8 Example: sales data analysis

test.txt

{"id":1,"timestamp":"2019-05-08T01:03.00Z","category":"平板电脑","areaName":"北京","money":"1450"}|{"id":2,"timestamp":"2019-05-08T01:01.00Z","category":"手机","areaName":"北京","money":"1450"}|{"id":3,"timestamp":"2019-05-08T01:03.00Z","category":"手机","areaName":"北京","money":"8412"}
{"id":4,"timestamp":"2019-05-08T05:01.00Z","category":"电脑","areaName":"上海","money":"1513"}|{"id":5,"timestamp":"2019-05-08T01:03.00Z","category":"家电","areaName":"北京","money":"1550"}|{"id":6,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"杭州","money":"1550"}
{"id":7,"timestamp":"2019-05-08T01:03.00Z","category":"电脑","areaName":"北京","money":"5611"}|{"id":8,"timestamp":"2019-05-08T03:01.00Z","category":"家电","areaName":"北京","money":"4410"}|{"id":9,"timestamp":"2019-05-08T01:03.00Z","category":"家具","areaName":"郑州","money":"1120"}
{"id":10,"timestamp":"2019-05-08T01:01.00Z","category":"家具","areaName":"北京","money":"6661"}|{"id":11,"timestamp":"2019-05-08T05:03.00Z","category":"家具","areaName":"杭州","money":"1230"}|{"id":12,"timestamp":"2019-05-08T01:01.00Z","category":"书籍","areaName":"北京","money":"5550"}
{"id":13,"timestamp":"2019-05-08T01:03.00Z","category":"书籍","areaName":"北京","money":"5550"}|{"id":14,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"北京","money":"1261"}|{"id":15,"timestamp":"2019-05-08T03:03.00Z","category":"电脑","areaName":"杭州","money":"6660"}
{"id":16,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"天津","money":"6660"}|{"id":17,"timestamp":"2019-05-08T01:03.00Z","category":"书籍","areaName":"北京","money":"9000"}|{"id":18,"timestamp":"2019-05-08T05:01.00Z","category":"书籍","areaName":"北京","money":"1230"}
{"id":19,"timestamp":"2019-05-08T01:03.00Z","category":"电脑","areaName":"杭州","money":"5551"}|{"id":20,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"北京","money":"2450"}
{"id":21,"timestamp":"2019-05-08T01:03.00Z","category":"食品","areaName":"北京","money":"5520"}|{"id":22,"timestamp":"2019-05-08T01:01.00Z","category":"食品","areaName":"北京","money":"6650"}
{"id":23,"timestamp":"2019-05-08T01:03.00Z","category":"服饰","areaName":"杭州","money":"1240"}|{"id":24,"timestamp":"2019-05-08T01:01.00Z","category":"食品","areaName":"天津","money":"5600"}
{"id":25,"timestamp":"2019-05-08T01:03.00Z","category":"食品","areaName":"北京","money":"7801"}|{"id":26,"timestamp":"2019-05-08T01:01.00Z","category":"服饰","areaName":"北京","money":"9000"}
{"id":27,"timestamp":"2019-05-08T01:03.00Z","category":"服饰","areaName":"杭州","money":"5600"}|{"id":28,"timestamp":"2019-05-08T01:01.00Z","category":"食品","areaName":"北京","money":"8000"}|{"id":29,"timestamp":"2019-05-08T02:03.00Z","category":"服饰","areaName":"杭州","money":"7000"}
from pyspark import SparkConf, SparkContext
import os
import json

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# TODO Requirement 1: rank cities by sales revenue
# 1.1 Read the file into an RDD
rdd = sc.textFile("F:/study/code/资料/test.txt")
# 1.2 Split out the individual JSON strings
json_str_rdd = rdd.flatMap(lambda x: x.split("|"))
# 1.3 Convert each JSON string into a dict
dict_rdd = json_str_rdd.map(lambda x: json.loads(x))
# 1.4 Extract the city name and the sales amount
city_with_money = dict_rdd.map(lambda x: (x['areaName'], int(x['money'])))
# 1.5 Group by city and sum the sales amounts
city_result_rdd = city_with_money.reduceByKey(lambda a, b: a + b)
# 1.6 Sort by sales amount, descending
result1_rdd = city_result_rdd.sortBy(lambda x: x[1], ascending=False, numPartitions=1)
print("Result of requirement 1:", result1_rdd.collect())

# TODO Requirement 2: which product categories are on sale across all cities
# 2.1 Extract all distinct product categories
category_rdd = dict_rdd.map(lambda x: x['category']).distinct()
print(category_rdd.collect())
# TODO Requirement 3: which product categories are on sale in Beijing
# 3.1 Keep only the records from Beijing
beijing_data_rdd = dict_rdd.filter(lambda x: x['areaName'] == '北京')
# 3.2 Extract all distinct product categories
result3_rdd = beijing_data_rdd.map(lambda x: x['category']).distinct()
print("Result of requirement 3:", result3_rdd.collect())
#----------- Output ---------------
Result of requirement 1: [('北京', 91556), ('杭州', 28831), ('天津', 12260), ('上海', 1513), ('郑州', 1120)]
['平板电脑', '家电', '书籍', '手机', '电脑', '家具', '食品', '服饰']
Result of requirement 3: ['平板电脑', '家电', '书籍', '手机', '电脑', '家具', '食品', '服饰']

2.4 Data Output

2.4.1 Output as Python objects

1. The collect operator: returns the RDD as a Python list

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])

# 1. collect: return the RDD as a Python list
rdd_list: list = rdd.collect()
print(rdd_list)
print(type(rdd_list))
#----------- Output ---------------
[1, 2, 3, 4, 5]
<class 'list'>

2. The reduce operator: aggregates the RDD's data according to the supplied logic

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])

num = rdd.reduce(lambda a, b: a + b)
print(num)
#----------- Output ---------------
15

3. The take operator: returns the first N elements of the RDD, assembled into a list

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])

take_list = rdd.take(3)
print(take_list)
#----------- Output ---------------
[1, 2, 3]

4. The count operator: counts how many elements the RDD contains; the return value is a number

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])

num_count = rdd.count()
print(num_count)
#----------- Output ---------------
5
2.4.2 Output to files

1. The saveAsTextFile operator

Writes the RDD's data to text files. This depends on the Hadoop framework, which must be set up first:

① Download Hadoop

② Make the script depend on Hadoop:

os.environ['HADOOP_HOME'] = "D:/hadoop-3.0.0"

③ Download winutils.exe and place it in Hadoop's bin directory

④ Download hadoop.dll and place it in C:\Windows\System32

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
os.environ['HADOOP_HOME'] = "D:/hadoop-3.0.0"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# Prepare the RDDs
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = sc.parallelize([("Hello", 3), ("Spark", 5), ("Hi", 7)])
rdd3 = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Write each RDD out to text files
rdd1.saveAsTextFile("F:/out1")
rdd2.saveAsTextFile("F:/out2")
rdd3.saveAsTextFile("F:/out3")

After writing with saveAsTextFile, the output path is a folder containing multiple files, because this operator partitions the output by default; the number of partitions depends on the number of CPU cores.

  • Set the RDD to a single partition (a complete sketch follows this list):

    • Set it when creating the conf:
    conf.set("spark.default.parallelism", "1")

    • Set it when creating the RDD:
    rdd1 = sc.parallelize([1, 2, 3, 4, 5], numSlices=1)

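A complete end-to-end sketch of the second approach (assuming the same interpreter and Hadoop paths as above; the output directory F:/out_single is hypothetical):

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
os.environ['HADOOP_HOME'] = "D:/hadoop-3.0.0"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# numSlices=1 keeps the RDD in a single partition, so only one part file is written
rdd = sc.parallelize([1, 2, 3, 4, 5], numSlices=1)
rdd.saveAsTextFile("F:/out_single")  # hypothetical output directory

sc.stop()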
2.4.3 Comprehensive example: search log analysis

1. Test data: search_log.txt

2. Code:

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
conf.set("spark.default.parallelism", "1")
sc = SparkContext(conf=conf)

file_rdd = sc.textFile("F:/study/code/资料/search_log.txt")
# TODO Requirement 1: top 3 busiest search hours (hour granularity)
# 1.1 Extract every timestamp and convert it to its hour
# 1.2 Convert to (hour, 1) tuples
# 1.3 Group by key and aggregate the values
# 1.4 Sort (descending)
# 1.5 Take the top 3
result1 = file_rdd.map(lambda x: x.split("\t")) \
    .map(lambda x: x[0][:2]) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda x: x[1], ascending=False, numPartitions=1) \
    .take(3)
print("Result of requirement 1:", result1)

# TODO Requirement 2: top 3 most popular search keywords
# 2.1 Extract all search keywords
# 2.2 Convert to (keyword, 1) tuples
# 2.3 Group and aggregate
# 2.4 Sort
# 2.5 Take the top 3
result2 = file_rdd.map(lambda x: (x.split("\t")[2], 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda x: x[1], ascending=False, numPartitions=1) \
    .take(3)
print("Result of requirement 2:", result2)

# TODO Requirement 3: find the hour in which the keyword "黑马程序员" is searched the most
# 3.1 Filter the records, keeping only those whose keyword is "黑马程序员"
# 3.2 Convert to (hour, 1) tuples
# 3.3 Group by key and aggregate the values
# 3.4 Sort (descending)
# 3.5 Take the top 1
result3 = file_rdd.map(lambda x: x.split("\t")) \
    .filter(lambda x: x[2] == "黑马程序员") \
    .map(lambda x: (x[0][:2], 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda x: x[1], ascending=False, numPartitions=1) \
    .take(1)
print("Result of requirement 3:", result3)

# TODO Requirement 4: convert the data to JSON format and write it to files
# 4.1 Convert to an RDD of JSON-style records
# 4.2 Write to files
file_rdd.map(lambda x: x.split("\t")).map(
    lambda x: {"time": x[0],
               "userId": x[1],
               "keyWord": x[2],
               "rank1": x[3],
               "rank2": x[4],
               "url": x[5]}).saveAsTextFile("D:/output_json")
#----------- Output ---------------
Result of requirement 1: [('20', 3479), ('23', 3087), ('21', 2989)]
Result of requirement 2: [('scala', 2310), ('hadoop', 2268), ('博学谷', 2002)]
Result of requirement 3: [('22', 245)]
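
Note that saveAsTextFile writes each element with str(), so the dicts produced in requirement 4 are saved in Python's repr form (single quotes) rather than strict JSON. If proper JSON lines are needed, one option (a sketch, not part of the original code; the output directory is hypothetical) is to serialize each dict with json.dumps before saving:

import json

# Serialize each record to a JSON string so the saved lines are valid JSON
file_rdd.map(lambda x: x.split("\t")) \
    .map(lambda x: json.dumps({"time": x[0], "userId": x[1], "keyWord": x[2],
                               "rank1": x[3], "rank2": x[4], "url": x[5]},
                              ensure_ascii=False)) \
    .saveAsTextFile("D:/output_json_lines")  # hypothetical output directory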