PySpark

I. Spark

Spark is a unified analytics engine for large-scale data processing, i.e. a distributed computing framework.

PySpark is the Python support for Spark: a third-party Python library developed by the Spark project itself.

1. Installing PySpark

Option 1: pip install pyspark
Option 2 (Tsinghua mirror): pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyspark
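
To confirm the install succeeded, a quick check is to import the library and print its version (a minimal sketch; the version shown depends on what pip resolved):

# Sanity check: the package should be importable and report a version
import pyspark
print(pyspark.__version__)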

2. Using PySpark

To process data with PySpark, you first need to build an execution-environment entry object, the SparkContext.

from pyspark import SparkConf, SparkContext

# Create a SparkConf object
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
# Build a SparkContext object from the SparkConf object
sc = SparkContext(conf=conf)
# Print the Spark version
print(sc.version)
# Stop Spark
sc.stop()

2.1 Data Input

2.1.1 Python objects

RDD objects

PySpark supports many kinds of data input; once loaded, every input becomes an RDD (Resilient Distributed Datasets) object, and RDD computations likewise return RDD objects.

from pyspark import SparkConf, SparkContext

# Create a SparkConf object
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
# Build a SparkContext object from the SparkConf object
sc = SparkContext(conf=conf)

# Convert Python container objects (list, tuple, str, set, dict) into RDDs
rdd1 = sc.parallelize([1, 2, 3, 4, 5, 6])
rdd2 = sc.parallelize((1, 2, 3, 4, 5, 6))
rdd3 = sc.parallelize("abcdefg")
rdd4 = sc.parallelize({1, 2, 3, 4, 5, 6})
rdd5 = sc.parallelize({"key1": "value1", "key2": "value2"})

# To inspect an RDD's contents, call its collect() method
print(rdd1.collect())
print(rdd2.collect())
print(rdd3.collect())
print(rdd4.collect())
print(rdd5.collect())

sc.stop()
#----------- Output ---------------
[1, 2, 3, 4, 5, 6]
[1, 2, 3, 4, 5, 6]
['a', 'b', 'c', 'd', 'e', 'f', 'g']
[1, 2, 3, 4, 5, 6]
['key1', 'key2']
2.1.2 Files
from pyspark import SparkConf, SparkContext

# Create a SparkConf object
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
# Build a SparkContext object from the SparkConf object
sc = SparkContext(conf=conf)

# Read a text file into an RDD (one element per line of the file)
rdd = sc.textFile("F:/study/code/资料/hello.txt")
print(rdd.collect())
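
textFile can also be pointed at a directory or a glob pattern, and it accepts an optional minimum partition count. A minimal sketch, assuming the hypothetical pattern below matches existing files:

# Read every .txt file matching the pattern, asking for at least 3 partitions
rdd_dir = sc.textFile("F:/study/code/资料/*.txt", minPartitions=3)
print(rdd_dir.count())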

2.3 Data Computation

2.3.1 The map operator

map processes the RDD's elements one by one (the processing logic comes from the function passed to the map operator) and returns a new RDD.

from pyspark import SparkConf, SparkContext
# Tell PySpark where the Python interpreter is
import os
os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"

# Create a SparkConf object
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
# Build a SparkContext object from the SparkConf object
sc = SparkContext(conf=conf)

# Prepare an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7])


# Use map to multiply every element by 10
def func(data):
    return data * 10

rdd2 = rdd.map(func)
print(rdd2.collect())
#----------- Output ---------------
[10, 20, 30, 40, 50, 60, 70]
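
Because map returns a new RDD, a lambda works just as well as a named function and calls can be chained. A minimal sketch using the same rdd as above:

# Same multiply-by-10, written as a lambda, followed by a second map that adds 1
rdd3 = rdd.map(lambda x: x * 10).map(lambda x: x + 1)
print(rdd3.collect())  # [11, 21, 31, 41, 51, 61, 71]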
2.3.2 flatMap

Purpose: performs a map over the RDD and then flattens (removes the nesting from) the results.

from pyspark import SparkConf, SparkContext
import os
os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# Prepare an RDD
rdd = sc.parallelize(["itheima itcast 666","itheima itheima itcast","python itheima"])

# Extract every word from each string in the RDD
rdd2 = rdd.flatMap(lambda x: x.split(" "))
print(rdd2.collect())
#----------- Output ---------------
['itheima', 'itcast', '666', 'itheima', 'itheima', 'itcast', 'python', 'itheima']
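
For contrast, a plain map with the same split keeps one inner list per input string; flatMap is what removes that nesting. A minimal sketch using the same rdd as above:

# map keeps the nesting: one list of words per original string
nested_rdd = rdd.map(lambda x: x.split(" "))
print(nested_rdd.collect())  # [['itheima', 'itcast', '666'], ['itheima', 'itheima', 'itcast'], ['python', 'itheima']]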
2.3.3 reduceByKey

For key-value (K-V) RDDs: automatically groups the data by key, then aggregates the values within each group according to the aggregation function you provide.

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# Prepare an RDD
rdd = sc.parallelize([('男', 99), ('男', 88), ('女', 99), ('女', 66)])
# Group by key and sum the values
rdd2 = rdd.reduceByKey(lambda a, b: a + b)
print(rdd2.collect())
#----------- Output ---------------
[('男', 187), ('女', 165)]
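
Note that the supplied function only ever receives two values at a time and is applied repeatedly within each group, so it should be associative. Anything more elaborate than a plain sum, such as a per-key average, needs the intermediate state carried explicitly. A minimal sketch using the same rdd as above:

# Per-key average: reduce (sum, count) pairs, then divide at the end
avg_rdd = rdd.map(lambda kv: (kv[0], (kv[1], 1))) \
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
    .map(lambda kv: (kv[0], kv[1][0] / kv[1][1]))
print(avg_rdd.collect())  # [('男', 93.5), ('女', 82.5)]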
2.3.4 Example: word count

Count the words in a text file with PySpark.

hello.txt

itheima itheima itcast itheima
spark python spark python itheima
itheima itcast itcast itheima python
python python spark pyspark pyspark
itheima python pyspark itcast spark

example.py

# 1. Build the PySpark entry object
from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# 2. Read the data file
rdd = sc.textFile("F:/study/code/资料/hello.txt")

# 3. Extract all the words
word_rdd = rdd.flatMap(lambda x: x.split(" "))

# 4. Convert each word into a (word, 1) tuple
word_with_one_rdd = word_rdd.map(lambda word: (word, 1))

# 5. Group by key and sum
result_rdd = word_with_one_rdd.reduceByKey(lambda a, b: a + b)
print(result_rdd.collect())
#----------- Output ---------------
[('itcast', 4), ('python', 6), ('itheima', 7), ('spark', 4), ('pyspark', 3)]
2.3.5 filter

Purpose: keep only the data you want (elements for which the function returns True).

rdd.filter(func)
# func: (T) -> bool. Takes one element and must return a Boolean; elements that return False are dropped

Task: build an RDD and drop the odd numbers.

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# Prepare an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
# Keep only the even numbers
rdd2 = rdd.filter(lambda num: num % 2 == 0)
print(rdd2.collect())
#----------- Output ---------------
[2, 4]
2.3.6 distinct

Purpose: deduplicate the RDD's data and return a new RDD.

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# Prepare an RDD
rdd = sc.parallelize([1, 2, 5, 6, 9, 8, 5, 6, 3, 2, 1, 4, 7, 5, 6, 8])

# Deduplicate the RDD
rdd2 = rdd.distinct()
print(rdd2.collect())
#----------- Output ---------------
[8, 1, 9, 2, 3, 4, 5, 6, 7]
2.3.7 sortBy

Purpose: sort the RDD's data by a key you specify.

rdd.sortBy(func, ascending=False, numPartitions=1)
# func: (T) -> U. Tells sortBy which part of each element to sort on,
#   e.g. lambda x: x[1] sorts by the second field of each element
# ascending: True for ascending order, False for descending
# numPartitions: how many partitions to use for the sort

analys.sortBy.py

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# 2. Read the data file
rdd = sc.textFile("F:/study/code/资料/hello.txt")

# 3. Extract all the words
word_rdd = rdd.flatMap(lambda x: x.split(" "))

# 4. Convert each word into a (word, 1) tuple
word_with_one_rdd = word_rdd.map(lambda word: (word, 1))

# 5. Group by key and sum
result_rdd = word_with_one_rdd.reduceByKey(lambda a, b: a + b)

# 6. Sort the result by count (descending)
final_RDD = result_rdd.sortBy(lambda x: x[1], ascending=False, numPartitions=1)
print(final_RDD.collect())
#----------- Output ---------------
[('itheima', 7), ('python', 6), ('itcast', 4), ('spark', 4), ('pyspark', 3)]
2.3.8 Example: sales data analysis

test.txt

{"id":1,"timestamp":"2019-05-08T01:03.00Z","category":"平板电脑","areaName":"北京","money":"1450"}|{"id":2,"timestamp":"2019-05-08T01:01.00Z","category":"手机","areaName":"北京","money":"1450"}|{"id":3,"timestamp":"2019-05-08T01:03.00Z","category":"手机","areaName":"北京","money":"8412"}
{"id":4,"timestamp":"2019-05-08T05:01.00Z","category":"电脑","areaName":"上海","money":"1513"}|{"id":5,"timestamp":"2019-05-08T01:03.00Z","category":"家电","areaName":"北京","money":"1550"}|{"id":6,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"杭州","money":"1550"}
{"id":7,"timestamp":"2019-05-08T01:03.00Z","category":"电脑","areaName":"北京","money":"5611"}|{"id":8,"timestamp":"2019-05-08T03:01.00Z","category":"家电","areaName":"北京","money":"4410"}|{"id":9,"timestamp":"2019-05-08T01:03.00Z","category":"家具","areaName":"郑州","money":"1120"}
{"id":10,"timestamp":"2019-05-08T01:01.00Z","category":"家具","areaName":"北京","money":"6661"}|{"id":11,"timestamp":"2019-05-08T05:03.00Z","category":"家具","areaName":"杭州","money":"1230"}|{"id":12,"timestamp":"2019-05-08T01:01.00Z","category":"书籍","areaName":"北京","money":"5550"}
{"id":13,"timestamp":"2019-05-08T01:03.00Z","category":"书籍","areaName":"北京","money":"5550"}|{"id":14,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"北京","money":"1261"}|{"id":15,"timestamp":"2019-05-08T03:03.00Z","category":"电脑","areaName":"杭州","money":"6660"}
{"id":16,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"天津","money":"6660"}|{"id":17,"timestamp":"2019-05-08T01:03.00Z","category":"书籍","areaName":"北京","money":"9000"}|{"id":18,"timestamp":"2019-05-08T05:01.00Z","category":"书籍","areaName":"北京","money":"1230"}
{"id":19,"timestamp":"2019-05-08T01:03.00Z","category":"电脑","areaName":"杭州","money":"5551"}|{"id":20,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"北京","money":"2450"}
{"id":21,"timestamp":"2019-05-08T01:03.00Z","category":"食品","areaName":"北京","money":"5520"}|{"id":22,"timestamp":"2019-05-08T01:01.00Z","category":"食品","areaName":"北京","money":"6650"}
{"id":23,"timestamp":"2019-05-08T01:03.00Z","category":"服饰","areaName":"杭州","money":"1240"}|{"id":24,"timestamp":"2019-05-08T01:01.00Z","category":"食品","areaName":"天津","money":"5600"}
{"id":25,"timestamp":"2019-05-08T01:03.00Z","category":"食品","areaName":"北京","money":"7801"}|{"id":26,"timestamp":"2019-05-08T01:01.00Z","category":"服饰","areaName":"北京","money":"9000"}
{"id":27,"timestamp":"2019-05-08T01:03.00Z","category":"服饰","areaName":"杭州","money":"5600"}|{"id":28,"timestamp":"2019-05-08T01:01.00Z","category":"食品","areaName":"北京","money":"8000"}|{"id":29,"timestamp":"2019-05-08T02:03.00Z","category":"服饰","areaName":"杭州","money":"7000"}
from pyspark import SparkConf, SparkContext
import os
import json

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# TODO Requirement 1: rank cities by total sales
# 1.1 Read the file into an RDD
rdd = sc.textFile("F:/study/code/资料/test.txt")
# 1.2 Split each line into its individual JSON strings
json_str_rdd = rdd.flatMap(lambda x: x.split("|"))
# 1.3 Parse each JSON string into a dict
dict_rdd = json_str_rdd.map(lambda x: json.loads(x))
# 1.4 Extract (city, sales) pairs
city_with_money = dict_rdd.map(lambda x: (x['areaName'], int(x['money'])))
# 1.5 Group by city and sum the sales
city_result_rdd = city_with_money.reduceByKey(lambda a, b: a + b)
# 1.6 Sort by sales (descending)
result1_rdd = city_result_rdd.sortBy(lambda x: x[1], ascending=False, numPartitions=1)
print("Result of requirement 1:", result1_rdd.collect())

# TODO Requirement 2: which product categories are sold across all cities
# 2.1 Extract all distinct product categories
category_rdd = dict_rdd.map(lambda x: x['category']).distinct()
print(category_rdd.collect())
# TODO Requirement 3: which product categories are sold in Beijing
# 3.1 Keep only the Beijing records
beijing_data_rdd = dict_rdd.filter(lambda x: x['areaName'] == '北京')
# 3.2 Extract all distinct product categories
result3_rdd = beijing_data_rdd.map(lambda x: x['category']).distinct()
print("Result of requirement 3:", result3_rdd.collect())
#----------- Output ---------------
Result of requirement 1: [('北京', 91556), ('杭州', 28831), ('天津', 12260), ('上海', 1513), ('郑州', 1120)]
['平板电脑', '家电', '书籍', '手机', '电脑', '家具', '食品', '服饰']
Result of requirement 3: ['平板电脑', '家电', '书籍', '手机', '电脑', '家具', '食品', '服饰']

2.4 Data Output

2.4.1 Output as Python objects

1. The collect operator: returns the RDD's contents as a Python list

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])

# 1. collect: return the RDD as a list
rdd_list: list = rdd.collect()
print(rdd_list)
print(type(rdd_list))
#----------- Output ---------------
[1, 2, 3, 4, 5]
<class 'list'>

2. The reduce operator: aggregates the RDD's elements using the function you pass in

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Aggregate: sum all the elements
num = rdd.reduce(lambda a, b: a + b)
print(num)
#----------- Output ---------------
15

3. The take operator: returns the first N elements of the RDD as a list

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Take the first 3 elements
take_list = rdd.take(3)
print(take_list)
#----------- Output ---------------
[1, 2, 3]

4. The count operator: returns the number of elements in the RDD as an integer

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Count how many elements the RDD contains
num_count = rdd.count()
print(num_count)
#----------- Output ---------------
5
2.4.2 Output to files

1. The saveAsTextFile operator

Writes the RDD's data to text files; this depends on the Hadoop framework (the steps below are for Windows):

① Download Hadoop

② Point the script at the Hadoop installation

os.environ['HADOOP_HOME'] = "D:/hadoop-3.0.0"

③ Download winutils.exe and place it in Hadoop's bin directory

④ Download hadoop.dll and place it in C:\Windows\System32

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
os.environ['HADOOP_HOME'] = "D:/hadoop-3.0.0"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
sc = SparkContext(conf=conf)

# Prepare some RDDs
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = sc.parallelize([("Hello", 3), ("Spark", 5), ("Hi", 7)])
rdd3 = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Write each RDD to a directory of text files
rdd1.saveAsTextFile("F:/out1")
rdd2.saveAsTextFile("F:/out2")
rdd3.saveAsTextFile("F:/out3")

After saveAsTextFile runs, the output path is a directory containing several files: the operator writes one file per partition, and by default the number of partitions matches the number of CPU cores.

  • To change the RDD's partition count to 1 (see the sketch after this list)

    • Set it on the conf when creating the context
    conf.set("spark.default.parallelism", "1")
    
    • Or set it when creating the RDD
    rdd1 = sc.parallelize([1, 2, 3, 4, 5], numSlices=1)
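
Putting both options together, a minimal sketch (the F:/out_single output directory is a hypothetical path and must not already exist):

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
os.environ['HADOOP_HOME'] = "D:/hadoop-3.0.0"

# Ask for a single partition both globally and for this RDD
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
conf.set("spark.default.parallelism", "1")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5], numSlices=1)
print(rdd.getNumPartitions())  # 1
rdd.saveAsTextFile("F:/out_single")  # writes a single part file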
    
2.4.3 Comprehensive example

1. Test data: search_log.txt

2. Code:

from pyspark import SparkConf, SparkContext
import os

os.environ["PYSPARK_PYTHON"] = "D:/python3.10.11/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")
conf.set("spark.default.parallelism", "1")
sc = SparkContext(conf=conf)

file_rdd = sc.textFile("F:/study/code/资料/search_log.txt")
# TODO Requirement 1: top 3 busiest search hours
# 1.1 Extract the timestamp from each row and keep only the hour
# 1.2 Convert to (hour, 1) tuples
# 1.3 Group by key and aggregate the values
# 1.4 Sort descending
# 1.5 Take the top 3
result1 = file_rdd.map(lambda x: x.split("\t")) \
    .map(lambda x: x[0][:2]) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda x: x[1], ascending=False, numPartitions=1) \
    .take(3)
print("需求1的结果:", result1)

# TODO Requirement 2: top 3 most popular search keywords
# 2.1 Extract all the search keywords
# 2.2 Convert to (keyword, 1) tuples
# 2.3 Group and aggregate
# 2.4 Sort descending
# 2.5 Take the top 3
result2 = file_rdd.map(lambda x: (x.split("\t")[2], 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda x: x[1], ascending=False, numPartitions=1) \
    .take(3)
print("需求2的结果:", result2)

# TODO Requirement 3: find the hour in which the keyword '黑马程序员' is searched most often
# 3.1 Filter the rows, keeping only those whose keyword is '黑马程序员'
# 3.2 Convert to (hour, 1) tuples
# 3.3 Group by key and aggregate the values
# 3.4 Sort descending
# 3.5 Take the top 1
result3 = file_rdd.map(lambda x: x.split("\t")) \
    .filter(lambda x: x[2] == "黑马程序员") \
    .map(lambda x: (x[0][:2], 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda x: x[1], ascending=False, numPartitions=1) \
    .take(1)
print("需求3的结果:", result3)

# TODO Requirement 4: convert each row to a JSON-style dict and write to a file
# 4.1 Convert each row into a dict
# 4.2 Write to a file
file_rdd.map(lambda x: x.split("\t")).map(
    lambda x: {"time": x[0],
               "userId": x[1],
               "keyWord": x[2],
               "rank1": x[3],
               "rank2": x[4],
               "url": x[5]}).saveAsTextFile("D:/output_json")
#----------- Output ---------------
Result of requirement 1: [('20', 3479), ('23', 3087), ('21', 2989)]
Result of requirement 2: [('scala', 2310), ('hadoop', 2268), ('博学谷', 2002)]
Result of requirement 3: [('22', 245)]
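
Note that requirement 4 writes each dict via its Python repr, which is not strict JSON (single quotes rather than double quotes). If real JSON lines are wanted, one option is to serialize each dict with json.dumps before saving; a minimal sketch reusing file_rdd from above, with D:/output_json_strict as a hypothetical output directory:

import json

# Serialize each row dict so every output line is valid JSON
file_rdd.map(lambda x: x.split("\t")) \
    .map(lambda x: {"time": x[0], "userId": x[1], "keyWord": x[2],
                    "rank1": x[3], "rank2": x[4], "url": x[5]}) \
    .map(lambda d: json.dumps(d, ensure_ascii=False)) \
    .saveAsTextFile("D:/output_json_strict")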