1. Environment
Win7, PyCharm, Python 2.7, spark-2.3.1-bin-hadoop2.7
2. WordCount program
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2018/12/1 9:10 AM
# @Author : JingYZ
from pyspark import SparkConf, SparkContext
import os
import time

if __name__ == '__main__':
    os.environ['SPARK_HOME'] = 'F://spark-2.3.1-bin-hadoop2.7'
    os.environ['HADOOP_HOME'] = 'F://Hadoop2.7//hadoop-2.7.5//bin//winuntil'
    # Note: this points at the install directory, not a python executable;
    # see the error in section 3 and the fix in section 4
    os.environ['PYSPARK_PYTHON'] = 'D://Python27'
    # Spark conf
    sparkConf = SparkConf().setAppName('word_count').setMaster('local[2]')
    # Create the SparkContext
    sc = SparkContext(conf=sparkConf)
    # Set the log level
    sc.setLogLevel('WARN')
    # Two ways to create an RDD:
    #   1. parallelize a local collection
    #   2. read from an external file system (HDFS, local)
    datas = ['hadoop spark', 'spark hive spark sql', 'spark hadoop sql sql spark']
    rdd = sc.parallelize(datas)  # create the RDD by parallelizing a local collection
    # rdd = sc.textFile("a.txt")
    # Test: print the first element of the RDD and the number of elements
    print rdd.first(), rdd.count()
    # Keep the driver alive so the Spark web UI stays reachable
    time.sleep(10000)
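Incidentally, the script above only prints the first element and the element count; a minimal sketch of the word-count logic itself (my own addition, assuming the same sc and datas defined above) would look like this:

# Word-count sketch, reusing the sc and datas from the script above
counts = (sc.parallelize(datas)
            .flatMap(lambda line: line.split(' '))   # split each line into words
            .map(lambda word: (word, 1))             # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))        # sum the counts per word
print counts.collect()  # e.g. [('spark', 5), ('hadoop', 2), ('sql', 3), ('hive', 1)]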
3. Error
2019-01-14 11:47:41 ERROR TaskSetManager:70 - Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "C:/Users/Administrator/Desktop/spark_new/word_count.py", line 26, in <module>
print rdd.first(), rdd.count()
File "D:\Python27\lib\site-packages\pyspark\rdd.py", line 1393, in first
rs = self.take(1)
File "D:\Python27\lib\site-packages\pyspark\rdd.py", line 1375, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
File "D:\Python27\lib\site-packages\pyspark\context.py", line 1013, in runJob
sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
File "D:\Python27\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "D:\Python27\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: <exception str() failed>
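Note that <exception str() failed> swallows the underlying Java exception, so the traceback is unhelpful on its own. A quick sanity check (my own addition, assuming the script from section 2) is to confirm that PYSPARK_PYTHON actually names an executable file:

import os
# Does PYSPARK_PYTHON point at a real executable file?
py = os.environ.get('PYSPARK_PYTHON', '(not set)')
print py, os.path.isfile(py)  # False here: 'D://Python27' is a directory, not python.exe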
4. Solution
The Py4JJavaError above is Spark failing to launch the Python worker process, because the interpreter it was told to use does not exist at the configured path. If Python 2 and Python 3 coexist on your machine and you previously renamed python.exe in the Python installation folder to python2.exe, either of the following fixes it:
1. Rename it back to python.exe, then re-select the interpreter in PyCharm under Settings -> Project -> Project Interpreter and apply it to all projects.
2. Point PYSPARK_PYTHON directly at the renamed executable by adding
os.environ['PYSPARK_PYTHON'] = 'D://Python27//python2.exe'
Be sure to change this to your own Python path.
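A slightly more portable variant (my own suggestion, not from the original post) is to reuse whatever interpreter is already running the driver script, so the setting keeps working even if python.exe gets renamed again. It must be set before the SparkContext is created:

import os, sys
# Reuse the driver's own interpreter for the Spark workers;
# sys.executable is always the full path of the running python.exe
os.environ['PYSPARK_PYTHON'] = sys.executable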