1. Configuring PyCharm to develop Spark programs
Create a file named pyspark.pth under C:\Anaconda\Lib\site-packages whose content is D:\hadoop_spark\spark-2.0.2-bin-hadoop2.7\python
i.e. the python directory under the Spark installation, which contains Spark's Python API.
This simply treats pyspark as an ordinary Python package; no other configuration is done.
In addition, the py4j package was installed for Python. (It is in fact required: pyspark uses py4j as its bridge to the JVM, though the Spark distribution also bundles a copy under python/lib.)
The following script verifies the setup; it should run cleanly inside PyCharm.
from pyspark import SparkContext
import numpy  # imported only to confirm numpy is available in this environment

sc = SparkContext("local", "Simple App")

# Two "documents", each a list of words.
doc = sc.parallelize([['a', 'b', 'c'], ['b', 'd', 'd']])

# Collect the distinct words and assign each an integer id.
words = doc.flatMap(lambda d: d).distinct().collect()
word_dict = {w: i for i, w in enumerate(words)}

# Broadcast the mapping so every executor gets one read-only copy.
word_dict_b = sc.broadcast(word_dict)

def wordCountPerDoc(d):
    # Count occurrences of each word id within a single document.
    counts = {}
    wd = word_dict_b.value
    for w in d:
        counts[wd[w]] = counts.get(wd[w], 0) + 1
    return counts

print(doc.map(wordCountPerDoc).collect())
print("successful!")
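The logic above can be sanity-checked without Spark at all. A plain-Python sketch of the same id-assignment and per-document counting, using the example documents from the script (note that Spark's distinct() gives no ordering guarantee, so the ids assumed here come from first-seen order):

```python
# Plain-Python sketch of what the Spark job above computes:
# map each distinct word to an integer id, then count ids per document.
docs = [['a', 'b', 'c'], ['b', 'd', 'd']]

# distinct words in first-seen order
words = list(dict.fromkeys(w for d in docs for w in d))
word_dict = {w: i for i, w in enumerate(words)}

def word_count_per_doc(d):
    counts = {}
    for w in d:
        counts[word_dict[w]] = counts.get(word_dict[w], 0) + 1
    return counts

print([word_count_per_doc(d) for d in docs])
# -> [{0: 1, 1: 1, 2: 1}, {1: 1, 3: 2}]
```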
2. "No module named numpy" error when running MLlib
The cause: this project uses the Anaconda Python environment, but Spark's workers default to the locally installed system Python, which did not have numpy. Installing numpy into that interpreter made the error go away.
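An alternative to installing packages into the system Python is to point the workers at the same interpreter the driver uses. A minimal sketch, using the standard PYSPARK_PYTHON environment variable (it must be set before the SparkContext or SparkSession is created):

```python
import os
import sys

# Tell Spark to launch workers with the same interpreter the driver is
# running under (here, the Anaconda python), so its numpy is visible.
os.environ["PYSPARK_PYTHON"] = sys.executable
```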
3. A first ML model-building script
from pyspark.ml.clustering import KMeans
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Sample data shipped with the Spark distribution, in libsvm format.
# Use a raw string so the backslashes in the Windows path are not
# treated as escape sequences.
dataset = spark.read.format("libsvm").load(r"D:\spark\spark-2.0.2-bin-hadoop2.6\data\mllib\sample_kmeans_data.txt")

# Fit k-means with k=2 and a fixed seed for reproducibility.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# transform() appends a "prediction" column with each point's cluster id.
predictions = model.transform(dataset)
print(predictions.collect())
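What transform() adds is essentially a nearest-center assignment per row. A plain-NumPy sketch of that one step, with made-up centers and points (not the sample_kmeans_data values):

```python
import numpy as np

# Nearest-center assignment, the core of the "prediction" column
# that KMeansModel.transform() produces. Centers and points are
# illustrative values, not a trained model.
centers = np.array([[0.0, 0.0], [9.0, 9.0]])
points = np.array([[0.1, 0.1], [8.8, 9.2], [0.2, -0.1]])

# Distance from every point to every center, then argmin per point.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
predictions = dists.argmin(axis=1)
print(predictions)  # -> [0 1 0]
```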