Why Use PySpark
Spark is a big-data analytics engine that excels at batch processing and supports parallelizing some machine learning algorithms. Spark provides APIs for Java, Scala, Python, and R; PySpark, the Python API, is especially well suited to algorithm engineers and data scientists analyzing and modeling data. Python has the most complete ecosystem of algorithm libraries, is easy to write, and is the most common language in algorithm-related roles. Compared with Java and Scala, PySpark saves a great deal of programming time, and a person's time is always more valuable than a computer's.
Installation
Installing Scala
To install on macOS, open a terminal and enter:
brew install scala
Installing Spark
After Scala is installed, install Spark from the command line:
brew install apache-spark
Once the installation finishes, add the export PATH configuration to the end of ~/.bash_profile. Taking my installed version, Spark 3.2.1, as an example:
export SPARK_HOME=/usr/local/Cellar/apache-spark/3.2.1/libexec
export PATH="$SPARK_HOME/bin:$PATH"
After saving the configuration, run source ~/.bash_profile (or open a new terminal) so the changes take effect, then enter spark-shell; if everything is working, Spark's interactive shell will appear.
Installing PySpark
Enter in the terminal:
pip install pyspark
After the installation completes, enter pyspark in the terminal; if everything is working, the PySpark interactive shell will appear.
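The interactive shell builds a SparkSession for you and exposes it as the variable spark, which the examples below rely on. In a standalone script run with python or spark-submit, you create that session yourself. A minimal sketch, assuming nothing beyond a working pyspark install (the app name 'demo' is just an illustrative choice):

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; 'demo' is an arbitrary example app name
spark = SparkSession.builder.appName('demo').getOrCreate()

# ...use spark.read.csv(), spark.read.json(), etc., exactly as in the shell examples below...

spark.stop()  # release resources when the script is done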
Basic File Reading
Reading a CSV File
In the PySpark interactive shell, you can use the PySpark API directly for interactive programming. First, try opening a CSV file from the MovieLens 1M dataset, assuming the file path is "/Users/data/movielens_1m/movies.csv":
>>> df = spark.read.csv('/Users/data/movielens_1m/movies.csv', header=True)
>>> df.printSchema()
root
|-- movieId: string (nullable = true)
|-- title: string (nullable = true)
|-- genres: string (nullable = true)
>>> df.show(5)
+-------+--------------------+--------------------+
|movieId| title| genres|
+-------+--------------------+--------------------+
| 1| Toy Story (1995)|Animation|Childre...|
| 2| Jumanji (1995)|Adventure|Childre...|
| 3|Grumpier Old Men ...| Comedy|Romance|
| 4|Waiting to Exhale...| Comedy|Drama|
| 5|Father of the Bri...| Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows
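Note that with only header=True, every column, including movieId, is read back as a string. To have Spark infer column types from the data, pass inferSchema=True as well; this costs one extra pass over the file. A small sketch using the same example path:

>>> df = spark.read.csv('/Users/data/movielens_1m/movies.csv', header=True, inferSchema=True)
>>> df.printSchema()  # movieId should now come back as a numeric type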
Reading a JSON File
PySpark handles JSON files in much the same way. Read the review data from the Yelp dataset, with the file path "/Users/data/archive/yelp_academic_dataset_review.json":
>>> df = spark.read.json('/Users/data/archive/yelp_academic_dataset_review.json')
>>> df.printSchema()
root
|-- business_id: string (nullable = true)
|-- cool: long (nullable = true)
|-- date: string (nullable = true)
|-- funny: long (nullable = true)
|-- review_id: string (nullable = true)
|-- stars: double (nullable = true)
|-- text: string (nullable = true)
|-- useful: long (nullable = true)
|-- user_id: string (nullable = true)
>>> df.show(5)
+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
| business_id|cool| date|funny| review_id|stars| text|useful| user_id|
+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
|buF9druCkbuXLX526...| 1|2014-10-11 03:34:02| 1|lWC-xP3rd6obsecCY...| 4.0|Apparently Prides...| 3|ak0TdVmGKo4pwqdJS...|
|RA4V8pr014UyUbDvI...| 0|2015-07-03 20:38:25| 0|8bFej1QE5LXp4O05q...| 4.0|This store is pre...| 1|YoVfDbnISlW0f7abN...|
|_sS2LBIGNT5NQb6PD...| 0|2013-05-28 20:38:06| 0|NDhkzczKjLshODbqD...| 5.0|I called WVM on t...| 0|eC5evKn1TWDyHCyQA...|
|0AzLzHfOJgL7ROwhd...| 1|2010-01-08 02:29:15| 1|T5fAqjjFooT4V0OeZ...| 2.0|I've stayed at ma...| 1|SFQ1jcnGguO0LYWnb...|
|8zehGz9jnxPqXtOc7...| 0|2011-07-28 18:05:01| 0|sjm_uUcQVxab_EeLC...| 4.0|The food is alway...| 0|0kA0PAJ8QFMeveQWH...|
+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
only showing top 5 rows
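With nine columns, show() truncates each field heavily. It is often easier to select just the columns of interest before displaying; a small sketch against the same df (column names taken from the schema above, and the truncation width of 50 is just an illustrative choice):

>>> df.select('review_id', 'stars', 'text').show(5, truncate=50)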
Then use Spark's filter() function to select the rows whose stars value equals 3:
>>> df.filter(df.stars == 3.0).show(5)
+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
| business_id|cool| date|funny| review_id|stars| text|useful| user_id|
+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
|-_GnwXmzC3DXsHR9n...| 0|2012-11-06 07:09:57| 0|GgWFjRHhelaiUgR2-...| 3.0|3.5 stars! I got ...| 0|pRPT3vqhqpU7kHgmK...|
|IdXHHEUH4ebcxdRxC...| 1|2009-10-13 22:20:10| 4|oNNTEpc2PmB4w_vy9...| 3.0|"A Bit Embarrasse...| 7|s4NgvdIfBH3UQdccW...|
|p2BkIrOuIsxGqtV0l...| 0|2014-12-17 19:34:52| 0|ADPWjsySIpmuOSL07...| 3.0|Just had Yalla fo...| 0|_soZ9DRjCF7Op7Us8...|
|VPqWLp9kMiZEbctCe...| 0|2018-09-25 03:22:50| 0|P320Yt8vFD3yjI34h...| 3.0|Overall is good, ...| 0|IMfkbLVt_GJfD7zJ9...|
|jdAHMkNHejuvOk9vE...| 0|2017-02-09 06:23:25| 0|ljjT3RaKYLWZOwdWB...| 3.0|Bummer that the p...| 1|njEa-gaTTxMueydgu...|
+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
only showing top 5 rows
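filter() accepts full column expressions, so conditions can be combined with & and |. A small sketch against the same df (the thresholds are just illustrative) that counts reviews rated at least 4 stars and voted useful at least once:

>>> from pyspark.sql.functions import col
>>> df.filter((col('stars') >= 4.0) & (col('useful') >= 1)).count()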