1. Preparation
Java 1.8
spark-2.4.5-bin-hadoop2.7
Download page:
https://archive.apache.org/dist/spark/spark-2.4.5/
Direct download:
https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
hadoop-2.7.1
Download page:
https://archive.apache.org/dist/hadoop/core/hadoop-2.7.1/
Direct download:
https://archive.apache.org/dist/hadoop/core/hadoop-2.7.1/hadoop-2.7.1.tar.gz
winutils
Download:
https://github.com/duanjz/winutils
Download the project from GitHub, then copy the two files winutils.exe and winutils.pdb from its hadoop-2.7.3/bin folder into the bin folder of hadoop-2.7.1.
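A quick way to confirm the copy worked is to check that both files are present. The sketch below is only an illustration: the helper name missing_winutils_files is made up for this example, and the D:\hadoop-2.7.1 location is an assumption that should be changed to wherever hadoop-2.7.1 was unpacked.

```python
import os

# File names taken from the step above.
REQUIRED_FILES = ("winutils.exe", "winutils.pdb")


def missing_winutils_files(hadoop_home):
    """Return the required files that are absent from <hadoop_home>/bin."""
    bin_dir = os.path.join(hadoop_home, "bin")
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(bin_dir, name))]


if __name__ == "__main__":
    # Assumed install location; adjust to your own machine.
    hadoop_home = r"D:\hadoop-2.7.1"
    missing = missing_winutils_files(hadoop_home)
    print("missing:", missing if missing else "nothing")
```

An empty result means both files are in place.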
2. Environment variables
Java environment variable
Spark environment variable
SPARK_HOME D:\spark-2.4.5-bin-hadoop2.7
Hadoop environment variable
HADOOP_HOME D:\hadoop-2.7.1
Path entries
%HADOOP_HOME%\bin
%SPARK_HOME%\sbin
%SPARK_HOME%\bin
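After setting the variables, open a new terminal so the changes are picked up. The settings can then be checked with a small script. This is a sketch, not an official tool: check_spark_env is a hypothetical helper, it only checks the Spark and Hadoop entries listed above, and it assumes PATH entries are stored in expanded form and separated by semicolons, as they are on Windows.

```python
import ntpath
import os

# Variables and PATH entries listed in the section above.
REQUIRED_VARS = ("SPARK_HOME", "HADOOP_HOME")
REQUIRED_PATH_ENTRIES = (
    ("HADOOP_HOME", "bin"),
    ("SPARK_HOME", "sbin"),
    ("SPARK_HOME", "bin"),
)


def _norm(path):
    # Windows paths are case-insensitive; also drop any trailing backslash.
    return ntpath.normcase(ntpath.normpath(path))


def check_spark_env(env):
    """Return a list of problems found in the environment mapping `env`."""
    problems = [var + " is not set" for var in REQUIRED_VARS if not env.get(var)]
    path_entries = {_norm(p) for p in env.get("PATH", "").split(";") if p}
    for var, sub in REQUIRED_PATH_ENTRIES:
        home = env.get(var)
        if home and _norm(ntpath.join(home, sub)) not in path_entries:
            problems.append("%" + var + "%\\" + sub + " is missing from PATH")
    return problems


if __name__ == "__main__":
    for problem in check_spark_env(dict(os.environ)) or ["no problems found"]:
        print(problem)
```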
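3. Installing pyspark in PyCharm
File -> Settings -> Project: <your project name> -> Python Interpreter
When installing the pyspark package, tick Specify version and select the version that matches your Spark installation (2.4.5 in this setup).
After a successful install, pyspark is listed among the interpreter's packages.
4. WordCount in Python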
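The version match can also be checked from code. The sketch below assumes the Spark directory still carries the name of the release archive (for example spark-2.4.5-bin-hadoop2.7) and that SPARK_HOME is set. The helper spark_version_from_home is invented for this example, while pyspark.__version__ is the version string that the installed package exposes.

```python
import os
import re


def spark_version_from_home(spark_home):
    """Extract '2.4.5' from a directory name such as 'spark-2.4.5-bin-hadoop2.7'."""
    # Accept both Windows and POSIX separators and ignore a trailing one.
    name = os.path.basename(spark_home.replace("\\", "/").rstrip("/"))
    match = re.match(r"spark-(\d+(?:\.\d+)*)", name)
    return match.group(1) if match else None


if __name__ == "__main__":
    try:
        import pyspark
    except ImportError:
        print("pyspark is not installed in this interpreter")
    else:
        expected = spark_version_from_home(os.environ.get("SPARK_HOME", ""))
        print("pyspark:", pyspark.__version__, "| SPARK_HOME:", expected)
```

If the two printed versions differ, reinstall pyspark with the matching version.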
spark.py
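# -*- coding: GBK -*-
from pyspark import SparkContext

# Run Spark locally with a single worker thread; 'test' is the application name.
sc = SparkContext('local', 'test')
# Read word.txt relative to the current working directory, one record per line.
textFile = sc.textFile("./word.txt")
# Split each line on spaces so that every word becomes its own record.
words = textFile.flatMap(lambda line: line.split(" "))
# Pair each word with 1, then add up the 1s for each distinct word.
wordCount = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
# Print every (word, count) pair; in local mode the output goes to the console.
wordCount.foreach(print)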
word.txt must be in the same directory from which the Python code is run.
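To sanity-check the Spark output on a small file, the same count can be reproduced in plain Python. This is a rough equivalent rather than part of the original script: word_count is a hypothetical helper, and reading word.txt as UTF-8 is an assumption about how the file was saved.

```python
import os
from collections import Counter


def word_count(lines):
    """Count words like the Spark job does: split each line on single spaces."""
    counts = Counter()
    for line in lines:
        # split(" ") keeps empty strings for repeated spaces, same as the job.
        counts.update(line.split(" "))
    return dict(counts)


if __name__ == "__main__":
    if os.path.isfile("word.txt"):
        # The encoding is an assumption; change it if the file was saved differently.
        with open("word.txt", encoding="utf-8") as f:
            print(word_count(f.read().splitlines()))
    else:
        print("word.txt not found in", os.getcwd())
```

Because both versions split on a single space, consecutive spaces and blank lines produce an empty-string "word" in either result.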