1. Ways to use pyspark
1.1. Jupyter / Python shell
from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS

# Run locally on 2 cores with 3 GB of executor memory.
# If the driver host/IP needs to be pinned, chain e.g.
# .set('spark.driver.host', 'txy').set('spark.local.ip', 'txy')
conf = SparkConf().setMaster('local[2]').set("spark.executor.memory", "3g")
sc = SparkContext.getOrCreate(conf)
lines = sc.textFile("D:/ML/python-design/ml-10M100K/ratings.dat")
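A quick way to verify the setup is to trigger an action on the RDD built above (a minimal sketch; the printed values depend on your copy of ratings.dat):

print(lines.count())   # number of rating lines in the file
print(lines.first())   # a 'userId::movieId::rating::timestamp' string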
1.2. pyspark shell
PS D:\ML\python-design\ALS-spark-NCG> pyspark
>>> lines = spark.read.text("ratings.dat").rdd
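The pyspark shell creates a SparkSession named spark (and a SparkContext named sc) on startup, which is why no explicit setup is needed here:

>>> spark           # pre-built SparkSession
>>> sc              # pre-built SparkContext
>>> lines.first()   # a Row with a single 'value' field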
1.3. Jupyter with IPython * [unverified]
%pyspark
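Note that %pyspark is typically a Zeppelin interpreter directive rather than a Jupyter magic. A more common way to use pyspark from a plain Jupyter kernel is the findspark package; a sketch, assuming Spark is installed locally and SPARK_HOME is set (unverified here, like the rest of this subsection):

import findspark
findspark.init()   # locates Spark via SPARK_HOME and adds it to sys.path

from pyspark import SparkConf, SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster('local[2]'))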
2. Differences between "Python" pyspark and "Spark" pyspark
In the snippets below, "Python" refers to the RDD-based API (pyspark.mllib) and "Spark" to the DataFrame-based API (pyspark.ml). You can explore the differences in an IDE such as IDEA, since it lets you jump into the linked source code.
2.1. Reading the file
Python
lines = sc.textFile("D:/ML/python-design/ml-10M100K/ratings.dat")
Spark
lines = spark.read.text("ratings.dat").rdd
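The two calls return different element types, which explains the extra .value in the next subsection: sc.textFile yields an RDD of plain strings, while spark.read.text(...).rdd yields an RDD of Row objects with a single value column. A quick check (a minimal sketch using the paths from above):

type(sc.textFile("D:/ML/python-design/ml-10M100K/ratings.dat").first())  # str
type(spark.read.text("ratings.dat").rdd.first())                         # pyspark.sql.Row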
2.2. Splitting strings
Python
parts = lines.map(lambda row: row.split("::"))
Spark
parts = lines.map(lambda row: row.value.split("::"))
2.3. Creating a DataFrame / splitting the dataset
Prerequisite (both variants build on this; note that Python 3 has no long type, so int is used for the timestamp):

from pyspark.sql import Row

ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
                                     rating=float(p[2]), timestamp=int(p[3])))
Python
(training, test) = ratingsRDD.randomSplit([0.8, 0.2])
Spark
ratings = spark.createDataFrame(ratingsRDD)
(training, test) = ratings.randomSplit([0.8, 0.2])
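Both RDD.randomSplit and DataFrame.randomSplit accept an optional seed, which makes the 80/20 split reproducible across runs; for example:

(training, test) = ratings.randomSplit([0.8, 0.2], seed=42)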
2.4. Model initialization and training
Python
# mllib's ALS.train expects an RDD of (user, product, rating) triples;
# mapping explicitly avoids depending on the Rows' field ordering.
model = ALS.train(training.map(lambda r: (r.userId, r.movieId, r.rating)),
                  rank=50, iterations=10, lambda_=0.01)
Spark
# This ALS comes from the DataFrame-based API:
# from pyspark.ml.recommendation import ALS
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId",
          ratingCol="rating", coldStartStrategy="drop")
model = als.fit(training)
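Neither variant above checks the resulting model. A hedged evaluation sketch for the DataFrame-based model, following the pattern of the official Spark ALS example (RegressionEvaluator is from pyspark.ml.evaluation):

from pyspark.ml.evaluation import RegressionEvaluator

# coldStartStrategy="drop" above removes NaN predictions for users/items
# unseen during training, so the metric is computed on valid rows only.
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol="rating",
                                predictionCol="prediction")
print("RMSE = %f" % evaluator.evaluate(predictions))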
Reposted from: https://my.oschina.net/datadev/blog/1926736