Setting up a pyspark environment for Python (Mac OS)
1 Prerequisites
A Mac running Mac OS with the PyCharm IDE installed.
2 Installing a local Python environment
A local Python environment can be installed in either of two ways:
- from the official Python installer package
- from the Anaconda distribution
- https://www.anaconda.com/products/individual
After downloading, run the installer and tick the option that adds Anaconda to your environment variables. Then open a new terminal (or source ~/.bash_profile) so the change takes effect; the installer writes a conda initialization block into ~/.bash_profile:
vim ~/.bash_profile

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/Users/shufang/opt/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/Users/shufang/opt/anaconda3/etc/profile.d/conda.sh" ]; then
        . "/Users/shufang/opt/anaconda3/etc/profile.d/conda.sh"
    else
        export PATH="/Users/shufang/opt/anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
2.1 Verifying that Python installed successfully
(base) shufangdeMacBook-Pro:~ shufang$ python
Python 3.7.6 (default, Jan 8 2020, 13:42:34)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
2.2 Checking which Python is on the PATH
(base) shufangdeMacBook-Pro:~ shufang$ which python
/Users/shufang/opt/anaconda3/bin/python
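The same check can be made from inside Python itself; this minimal snippet simply prints the path of the running interpreter, which should match the which python output above:

import sys

# path of the interpreter currently executing this code
print(sys.executable)  # expect /Users/shufang/opt/anaconda3/bin/python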
2.3 The Python environment is now installed
3 Configuring the local Python environment in PyCharm
In PyCharm, open Preferences => Project => Python Interpreter and point the project at the interpreter found above (/Users/shufang/opt/anaconda3/bin/python).
4 Bringing in the pyspark environment
There are two ways to bring in pyspark:
1. Install via pip
pip install pyspark[==2.4.5] # the bracketed version specifier is optional; use it to pin a specific version
2. Install offline
4.1 Method 1: installing with pip
pip install pyspark==2.4.5
## by default, pip installs every package into /Users/shufang/opt/anaconda3/lib/python3.7/site-packages
pip install py4j # the Python-Java gateway that pyspark uses to talk to the JVM
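A quick sanity check (a minimal sketch; the expected version string assumes you pinned pyspark==2.4.5 as above) confirms that both packages import from this environment:

import pyspark
import py4j

# both imports should succeed without error
print(pyspark.__version__)  # expect 2.4.5 if the version was pinned
print(py4j.__version__)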
4.2 Method 2: installing offline
First download the matching pre-built Spark binary package, then extract it to a directory of your choice:
# extract into the target directory
tar -zxvf spark-2.4.5-bin-hadoop2.7.tgz -C /Users/shufang/program_files/
# rename the extracted directory
cd /Users/shufang/program_files/
mv spark-2.4.5-bin-hadoop2.7 spark
Copy the pyspark directory into /Users/shufang/opt/anaconda3/lib/python3.7/site-packages:
cd /Users/shufang/program_files/spark/python/
ls -alh  # the listing should include the pyspark directory
# move it into the interpreter's site-packages directory
mv pyspark /Users/shufang/opt/anaconda3/lib/python3.7/site-packages
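To verify the copy, check where Python resolves the package from (a minimal sketch; it assumes py4j is already installed, since pyspark imports it, and that the Anaconda layout above is in use):

import pyspark

# should print a path under .../anaconda3/lib/python3.7/site-packages/pyspark
print(pyspark.__file__)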
4.3 Adding environment variables (SPARK_HOME, PYTHON_HOME)
Add the environment variables to the PyCharm run configuration:
# menu bar: Run => Edit Configurations => Environment variables
PYTHONUNBUFFERED=1;
SPARK_HOME=/Users/shufang/program_files/spark;
PYTHON_HOME=/Users/shufang/program_files/spark/python
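To confirm that the run configuration actually passes these variables to your script, print them from Python (a minimal check using only the variable names defined above):

import os

# both should print the paths set in the run configuration
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYTHON_HOME"))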
4.4 The pyspark environment is now configured
You can now write and run a simple Spark word-count program locally.
from pyspark import SparkConf, SparkContext
# create the SparkConf and SparkContext
conf = SparkConf().setMaster("local").setAppName("shufang-wordcount")
sc = SparkContext(conf=conf)
# the input data
data = ["hello", "world", "hello", "word", "count", "count", "hello"]
# turn the local collection into a Spark RDD and run the word count
rdd = sc.parallelize(data)
resultRdd = rdd.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
# collect the RDD back to the driver and print each pair
resultColl = resultRdd.collect()
for line in resultColl:
    print(line)
# stop the SparkContext
sc.stop()
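With this input, hello occurs three times, count twice, and world and word once each, so the collected pairs should be (the ordering may vary between runs):

('hello', 3)
('world', 1)
('word', 1)
('count', 2)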