1. Downloading the tools
1. Download and install PyCharm and Python. [Note: choose the Python version according to the official Spark documentation; a version that is too new may not be supported by some modules.]
2. Download Spark and Hadoop. [Note: make sure the Spark and Hadoop versions match each other.]
The version I downloaded is spark-1.6.0-bin-hadoop2.6, i.e. Spark 1.6.0 prebuilt for Hadoop 2.6, together with the matching Hadoop release.
2. Configuring the environment (Python, Hadoop, and Spark)
2.1. Configure the Python environment; for details, see my earlier post on setting up and using a Python development environment.
2.2. Configure the Spark environment
- Unzip the downloaded archive to an appropriate path, such as D:\spark-1.6.0-bin-hadoop2.6.
- Add a system environment variable SPARK_HOME = D:\spark-1.6.0-bin-hadoop2.6.
- Append %SPARK_HOME%\bin to the PATH environment variable (a quick way to verify this from Python is shown below).
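
To confirm that the variables are actually visible, open a fresh command prompt (existing shells do not pick up newly added system variables) and run a minimal sanity-check sketch like the following; it is nothing Spark-specific, just a readout of the process environment:

import os

# Should print the value configured above; None means the variable
# is not visible to this process (try a fresh terminal).
print(os.environ.get("SPARK_HOME"))

# Should list the PATH entries that point into the Spark install.
print([p for p in os.environ.get("PATH", "").split(os.pathsep) if "spark" in p.lower()])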
2.3. Configure the Hadoop environment (same procedure as for Spark: set HADOOP_HOME and add %HADOOP_HOME%\bin to PATH)
Note: if the Hadoop environment variables are not set, starting Spark from Python with the pyspark command fails with the following error:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:363)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
    at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:104)
    at org.apache.hadoop.security.Groups.<init>(Groups.java:86)
    at org.apache.hadoop.security.Groups.<init>(Groups.java:66)
    at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:280)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:271)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:248)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:763)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:748)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:621)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2136)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2136)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2136)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:322)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:748)
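
The message shows the root cause: Hadoop looks for its Windows shim winutils.exe under %HADOOP_HOME%\bin, and with HADOOP_HOME unset the path degenerates to null\bin\winutils.exe. Besides setting the system variable as described in 2.3, you can also set it from the driver script before the SparkContext is created, since the JVM is only launched at that point. A minimal sketch, assuming a hypothetical install path D:\hadoop-2.6.0 that actually contains bin\winutils.exe:

import os

# Hypothetical path; adjust it, and make sure winutils.exe exists in %HADOOP_HOME%\bin.
os.environ["HADOOP_HOME"] = r"D:\hadoop-2.6.0"

from pyspark import SparkContext
sc = SparkContext('local')  # the JVM started here inherits HADOOP_HOME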
2.4. Verify the setup
Open a command prompt and run the pyspark command. If it starts successfully, the environment variables are configured correctly.
Some readers have both Python 2 and Python 3 installed on the same system and have renamed the launcher commands to tell them apart; on my machine, for example, python3 (not python) is the command that starts Python 3.
In that case, starting Spark with the pyspark command fails, because the launcher scripts invoke python.
To fix this, edit the pyspark2 launcher script (pyspark2.cmd on Windows) in the spark/bin directory and change python to python3.
Spark then starts normally. If you are wondering how to keep Python 2 and Python 3 side by side on one system, see my separate post on that topic.
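
Alternatively, Spark honors the PYSPARK_PYTHON environment variable, so you can point it at the right interpreter without editing the launcher script: set it in the Windows environment when launching via pyspark, or from the driver script. A minimal sketch, assuming python3 is on PATH:

import os

# Assumption: "python3" is the command that starts Python 3 on this machine.
# Must be set before the SparkContext is created.
os.environ["PYSPARK_PYTHON"] = "python3"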
3. Configuring PyCharm
1. Open PyCharm and create a new project or import an existing one.
2. Open Run -> Edit Configurations to add the environment variables.
In the Environment variables field, create PYTHONPATH and SPARK_HOME.
Configure the paths as follows; both can be found under the Spark installation directory:
PYTHONPATH: the python directory under the Spark install path, i.e. ..\spark-1.6.0-bin-hadoop2.6\python
SPARK_HOME: ..\spark-1.6.0-bin-hadoop2.6\
4. Adding modules
Open File -> Settings -> Project -> Project Structure and click Add Content Root in the upper right corner.
Navigate to the Spark install directory -> python -> lib and add py4j.zip and pyspark.zip (depending on the Spark release, the py4j archive may carry a version suffix such as py4j-0.9-src.zip). A programmatic alternative to these PyCharm settings is sketched below.
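
If you prefer not to depend on PyCharm-specific configuration, the settings from sections 3 and 4 can also be done at the top of the script itself. A minimal sketch, assuming the install path from section 2.2; adjust the paths and the py4j archive name to whatever your Spark release ships:

import os
import sys

SPARK_HOME = r"D:\spark-1.6.0-bin-hadoop2.6"  # assumption: path from section 2.2
os.environ.setdefault("SPARK_HOME", SPARK_HOME)

# Make the pyspark package and its bundled py4j importable,
# mirroring the PYTHONPATH / content-root settings above.
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.9-src.zip"))  # name varies by release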
Finally, here is a simple piece of Spark example code that I downloaded from the web:
from pyspark import SparkContext

sc = SparkContext('local')

# Two "documents", each represented as a list of words.
doc = sc.parallelize([['a', 'b', 'c'], ['b', 'd', 'd']])

# Build a vocabulary: every distinct word gets an integer index.
words = doc.flatMap(lambda d: d).distinct().collect()
word_dict = {w: i for i, w in enumerate(words)}

# Ship the vocabulary to the workers as a read-only broadcast variable.
word_dict_b = sc.broadcast(word_dict)

def wordCountPerDoc(d):
    # Count word occurrences in one document, keyed by vocabulary index.
    counts = {}  # renamed from "dict" to avoid shadowing the builtin
    wd = word_dict_b.value
    for w in d:
        if wd[w] in counts:  # dict.has_key() no longer exists in Python 3
            counts[wd[w]] += 1
        else:
            counts[wd[w]] = 1
    return counts

print(doc.map(wordCountPerDoc).collect())
print("successful!")