Yesterday I spent an afternoon installing Spark and wiring the PySpark shell into Jupyter Notebook, then worked through some examples from *Learning Spark* (published in Chinese as 《Spark快速大数据分析》) to get a first taste of what Spark can do.
My environment: Windows 7, Spark 1.6, Anaconda 3, Python 3. The code is as follows:
# `sc` here is the SparkContext that the PySpark shell created automatically at startup
lines = sc.textFile("D://Program Files//spark//spark-1.6.0-bin-hadoop2.6//README.md")
print("Number of lines:", lines.count())

from pyspark import SparkContext
logFile = "D://Program Files//spark//spark-1.6.0-bin-hadoop2.6//README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")  # this tries to create a second SparkContext
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
pythonLines = lines.filter(lambda line: "Python" in line)
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
The result is as follows, ending in a ValueError:
Number of lines: 95
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-70ecab39b7ea> in <module>()
5
6 logFile = "D://Program Files//spark//spark-1.6.0-bin-hadoop2.6//README.md" # Should be some file on your system
----> 7 sc = SparkContext("local", "Simple App")
8 logData = sc.textFile(logFile).cache()
9
D:\spark\spark-1.6.0-bin-hadoop2.6\python\pyspark\context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
110 """
111 self._callsite = first_spark_call() or CallSite(None, None, None)
--> 112 SparkContext._ensure_initialized(self, gateway=gateway)
113 try:
114 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
D:\spark\spark-1.6.0-bin-hadoop2.6\python\pyspark\context.py in _ensure_initialized(cls, instance, gateway)
259 " created by %s at %s:%s "
260 % (currentAppName, currentMaster,
--> 261 callsite.function, callsite.file, callsite.linenum))
262 else:
263 SparkContext._active_spark_context = instance
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by <module> at D:\Program Files\Anaconda3\lib\site-packages\IPython\utils\py3compat.py:186
I googled the problem and eventually found the answer on Stack Overflow. The message "ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by <module> at D:\Program Files\Anaconda3\lib\site-packages\IPython\utils\py3compat.py:186" says it all: only one SparkContext (sc) can be active at a time. Because the PySpark shell had already created one (the PySparkShell app) when the notebook session started, constructing a second one raises this error. The fix is therefore to shut down the existing SparkContext before creating a new one, and that is exactly what sc.stop() does.
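As an aside, if you are not sure whether a context already exists in the current session, you can guard the shutdown so the cell stays safe to re-run. This is a minimal sketch of my own, not from the Stack Overflow answer; the NameError guard covers the case where `sc` was never defined:

from pyspark import SparkContext

try:
    sc.stop()          # stop the shell's pre-created context if there is one
except NameError:      # `sc` was never defined, so there is nothing to stop
    pass

sc = SparkContext("local", "Simple App")  # now safe: no other active context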
Let's modify the code accordingly and run it again to see the result:
lines = sc.textFile("D://Program Files//spark//spark-1.6.0-bin-hadoop2.6//README.md")
print("Number of lines:", lines.count())
sc.stop()  # stop the existing SparkContext before creating a new one

from pyspark import SparkContext
logFile = "D://Program Files//spark//spark-1.6.0-bin-hadoop2.6//README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
pythonLines = logData.filter(lambda line: "Python" in line)  # use logData: `lines` belongs to the stopped context
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
The result is as follows:
Number of lines: 95
Lines with a: 58, lines with b: 26
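One follow-up note: stopping and recreating the context works, but PySpark also exposes a SparkContext.getOrCreate classmethod that returns the already-running context instead of raising. I have not verified that the 1.6 build used here ships it, so treat this as an optional variant rather than the canonical fix:

from pyspark import SparkContext

# Returns the existing active context (here, the shell's PySparkShell app)
# if there is one, and only creates a fresh context otherwise.
sc = SparkContext.getOrCreate()
logData = sc.textFile("D://Program Files//spark//spark-1.6.0-bin-hadoop2.6//README.md").cache()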
And with that, the problem was solved. I'm writing it down here so I know what to do the next time I run into it.