OS: Windows 7 x64
Spark version: spark-1.3.0-bin-hadoop2.4
I wrote a local Spark script named "SimpleApp.py" with the following contents:
""SimpleApp.py"""
from pyspark import SparkContext
logFile = "D:/ProgramFiles/spark-1.3.0-bin-hadoop2.4/README.md"  # the README.md file in the Spark home directory (where Spark was unpacked)
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
Then, from the Spark directory, run:
bin\spark-submit --master local[4] D:\Files\Python\SimpleApp.py
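For reference, the Spark job above simply counts lines containing the letters 'a' and 'b'. A minimal pure-Python sketch of the same computation (no Spark; a made-up sample file stands in for README.md, so the counts below are illustrative only):

```python
# Pure-Python equivalent of the Spark job: count lines containing 'a' / 'b'.
# Illustration only -- the real script reads Spark's README.md instead.
import os
import tempfile

sample = "apache spark\nbig data\nhello world\n"

# Write a small sample file to stand in for README.md.
with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as f:
    f.write(sample)
    path = f.name

with open(path) as f:
    lines = f.read().splitlines()

num_as = sum(1 for s in lines if 'a' in s)  # lines containing 'a'
num_bs = sum(1 for s in lines if 'b' in s)  # lines containing 'b'
print("Lines with a: %i, lines with b: %i" % (num_as, num_bs))  # → 2 and 1

os.remove(path)
```

Spark's `filter(...).count()` does the same thing, just distributed over partitions of the RDD.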
This fails. The tail end of the error output looks like this:
But that is not actually the main error; scrolling up a bit reveals this message:
This is where things actually go wrong. It is not a problem with the code, but with the "without Hadoop" builds of Spark. Below I quote JamCon's answer on Stack Overflow to the question
submit .py script on Spark without Hadoop installation
The good news is you're not doing anything wrong, and your code will run after the error is mitigated.
Despite the statement that Spark will run on Windows without Hadoop, it still looks for some Hadoop components. The bug has a JIRA ticket (SPARK-2356), and a patch is available. As of Spark 1.3.1, the patch hasn't been committed to the main branch yet.
Fortunately, there's a fairly easy workaround.
1. Create a bin directory for winutils under your Spark installation directory. In my case, Spark is installed in D:\Languages\Spark, so I created the following path: D:\Languages\Spark\winutils\bin

2. Download winutils.exe from Hortonworks and put it into the bin directory created in the first step. Download link for Win64: http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe

3. Create a "HADOOP_HOME" environment variable that points to the winutils directory (not the bin subdirectory). You can do this in a couple of ways:

   a. Establish a permanent environment variable via Control Panel -> System -> Advanced System Settings -> Advanced Tab -> Environment Variables. You can create either a user variable or a system variable with the following parameters: Variable Name = HADOOP_HOME, Variable Value = D:\Languages\Spark\winutils\

   b. Set a temporary environment variable inside your command shell before executing your script:

      set HADOOP_HOME=D:\Languages\Spark\winutils

4. Run your code. It should work without error now.
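As an alternative to step 3, the variable can also be set from inside the script itself, before the SparkContext is created; the launched JVM subprocess inherits the Python process's environment. A sketch (the winutils path is the one assumed in the steps above and must match your own installation):

```python
import os

# Point HADOOP_HOME at the winutils directory (NOT its bin subdirectory)
# BEFORE any Spark code runs; the JVM reads it at startup.
os.environ["HADOOP_HOME"] = r"D:\Languages\Spark\winutils"

# Only then create the context:
# from pyspark import SparkContext
# sc = SparkContext("local", "Simple App")
```

This keeps the fix local to one script, whereas the environment-variable approaches in step 3 fix every Spark job on the machine.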