Setting up a PySpark development environment with PyCharm
Spark installation
- Download Spark
Download page: http://spark.apache.org/downloads.html
This guide uses the pre-built binaries from the official site. If you need to compile Spark yourself, follow the official build instructions at http://spark.apache.org/docs/latest/building-spark.html
- Verify that Spark is installed correctly
```
(spark_demo) shylin ~/Desktop/work/spark_demo cd ~/Downloads/spark-2.4.0-bin-hadoop2.7/bin
(spark_demo) shylin ~/Downloads/spark-2.4.0-bin-hadoop2.7/bin ./pyspark
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
2019-05-24 10:55:19 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018 19:50:54)
SparkSession available as 'spark'.
>>>
```
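Once the shell is up, a quick computation confirms that the session really works; `spark` is the ready-made SparkSession the shell creates for you, as the banner notes:

```python
# Typed at the >>> prompt: build a 100-row DataFrame and count it,
# which exercises the driver and the local executors end to end.
spark.range(100).count()  # expected output: 100
```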
Setting up the PySpark development environment
- Open PyCharm, create a new project, and create a fresh virtual environment for it.
- Configure the project's environment variables once, so they don't have to be entered again every time you create a new .py file; the setup is shown in the screenshot below.
SPARK_HOME /Users/shylin/Downloads/spark-2.4.0-bin-hadoop2.7 (the directory where the Spark download was unpacked)
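A quick way to confirm that PyCharm is actually passing the variable through to your scripts is to print it from a throwaway file; a minimal check:

```python
import os

# Prints None if the run configuration is not exporting SPARK_HOME
print(os.environ.get('SPARK_HOME'))
```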
- Add the two zip packages from the Spark distribution (pyspark.zip and the py4j archive, py4j-0.10.7-src.zip for Spark 2.4.0) to the project. They are located under: /Users/shylin/Downloads/spark-2.4.0-bin-hadoop2.7/python/lib
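If you would rather not click through the interpreter settings, the same effect can be achieved in code by putting the two archives on sys.path before importing pyspark. A sketch, assuming the paths above (the py4j version string varies by Spark release):

```python
import os
import sys

SPARK_HOME = '/Users/shylin/Downloads/spark-2.4.0-bin-hadoop2.7'
os.environ.setdefault('SPARK_HOME', SPARK_HOME)

# Both archives ship under $SPARK_HOME/python/lib
sys.path.insert(0, os.path.join(SPARK_HOME, 'python', 'lib', 'pyspark.zip'))
sys.path.insert(0, os.path.join(SPARK_HOME, 'python', 'lib', 'py4j-0.10.7-src.zip'))

from pyspark.sql import SparkSession  # should now import cleanly
```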
- Run a word count
```python
# word_count.py
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.appName("test").getOrCreate()
    sc = spark.sparkContext
    # file:// marks the path as a local file; by default Spark looks it up on HDFS
    counts = sc.textFile('file:///Users/shylin/Desktop/work/spark_demo/test.txt') \
        .flatMap(lambda line: line.split(" ")) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(lambda a, b: a + b)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))
    sc.stop()
```
- Execution result
```
/Users/shylin/.virtualenvs/spark_demo/bin/python /Users/shylin/Desktop/work/spark_demo/word_count.py
2019-05-24 11:23:05 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[Stage 0:> (0 + 2) / 2]/Users/shylin/Downloads/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py:60: UserWarning: Please install psutil to have better support with spilling
/Users/shylin/Downloads/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py:60: UserWarning: Please install psutil to have better support with spilling
world: 2
python: 3
hadoop: 1
hello: 4
spark: 1

Process finished with exit code 0
```
The output above contains a warning; running pip install psutil in the project's virtual environment makes it go away.
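The same word count can also be written against the DataFrame API, which is the more idiomatic style on Spark 2.x; a minimal sketch over the same input file:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("test_df").getOrCreate()

# Each line of the file becomes one row with a single 'value' column
lines = spark.read.text('file:///Users/shylin/Desktop/work/spark_demo/test.txt')

# Split each line on spaces and explode into one word per row, then count
words = lines.select(explode(split(lines.value, ' ')).alias('word'))
words.groupBy('word').count().show()

spark.stop()
```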
- Submitting a Spark job on a server
```
sudo -uhive spark-submit --master local[4] word_count.py
```
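Here local[4] tells Spark to run locally with 4 worker threads. During development the master can also be pinned in code rather than on the command line; a sketch (a master set this way takes precedence over the --master flag, so it is usually removed before submitting to a real cluster):

```python
from pyspark.sql import SparkSession

# Run locally with 4 worker threads; drop .master(...) when the
# master should come from spark-submit instead.
spark = (SparkSession.builder
         .master('local[4]')
         .appName('word_count')
         .getOrCreate())
```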
Shylin