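Here is the solution I found. Instead of running a separate regression on each group of data, build a sparse matrix that has separate columns for each group: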
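from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Build the feature vector for one row: columns 0..num_groups-1 hold x for
# each group, columns num_groups..2*num_groups-1 hold the intercept terms.
def groupid_to_feature(group_id, x, num_groups):
    intercept_id = num_groups + group_id - 1
    # Need a vector containing x and a '1' for the intercept term
    return SparseVector(num_groups * 2, {group_id - 1: x, intercept_id: 1.0})

# Each row is (group_id, x, y); y becomes the label.
# Note: on Spark 2.0+, DataFrame has no .map, so use df.rdd.map(...) there.
labelled = df.map(lambda line: LabeledPoint(line[2],
                  groupid_to_feature(line[0], line[1], 3)))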
labelled.take(5)
# [LabeledPoint(2.0, (6,[0,3],[0.0,1.0])),
# LabeledPoint(1.0, (6,[0,3],[1.0,1.0])),
# LabeledPoint(0.0, (6,[0,3],[2.0,1.0])),
# LabeledPoint(0.0, (6,[1,4],[0.0,1.0])),
# LabeledPoint(0.5, (6,[1,4],[1.0,1.0]))]
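To make the column layout concrete, here is a minimal pure-Python sketch of the same encoding, expanded to a dense row (no Spark needed). The helper name `dense_row` is made up for illustration and is not part of PySpark, and the input values are only examples.

```python
def dense_row(group_id, x, num_groups):
    """Dense equivalent of groupid_to_feature: slope columns first, then intercepts."""
    row = [0.0] * (num_groups * 2)
    row[group_id - 1] = x                  # slope column for this group
    row[num_groups + group_id - 1] = 1.0   # intercept column for this group
    return row

# One example row per group, with 3 groups in total (illustrative inputs)
for group_id, x in [(1, 2.0), (2, 1.0), (3, 1.0)]:
    print(group_id, dense_row(group_id, x, 3))
```

Each row is non-zero only in the two columns that belong to its own group, which is what keeps the groups from influencing one another in the fit.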
Then use Spark's LinearRegressionWithSGD to run the regression:
from pyspark.mllib.regression import LinearRegressionModel, LinearRegressionWithSGD
lrm = LinearRegressionWithSGD.train(labelled, iterations=5000, intercept=False)
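The weights of this regression hold the coefficient and the intercept for each group_id: the first three entries are the slopes for groups 1 to 3, and the last three are the matching intercepts, i.e.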
lrm.weights
# DenseVector([-1.0, 0.5, 1.0014, 2.0, 0.0, 0.9946])
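As a rough illustration of how to read that layout, the sketch below predicts y for a given group from a flat weight list. The function name `predict` is hypothetical, and the weights are simply copied from the `lrm.weights` output above rather than produced by running Spark.

```python
def predict(weights, group_id, x, num_groups):
    """Predict y = a*x + b using the slope and intercept stored for group_id."""
    slope = weights[group_id - 1]                   # first num_groups entries
    intercept = weights[num_groups + group_id - 1]  # last num_groups entries
    return slope * x + intercept

# Values copied from the lrm.weights output shown above
weights = [-1.0, 0.5, 1.0014, 2.0, 0.0, 0.9946]

print(predict(weights, 1, 1.0, 3))  # 1.0
print(predict(weights, 2, 1.0, 3))  # 0.5
```

Both results match the labels of the corresponding rows in the `labelled.take(5)` output.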
Or reshape them into a DataFrame that gives a (slope) and b (intercept) for each group:
import pandas as pd
pd.DataFrame(lrm.weights.toArray().reshape(2, 3).transpose(), columns=['a', 'b'], index=[1, 2, 3])
#           a             b
# 1 -0.999990  1.999986e+00
# 2  0.500000  5.270592e-11
# 3  1.001398  9.946426e-01
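Because every row is non-zero only in its own group's two columns, the squared-error objective splits into one independent term per group, so (with no regularization) the joint fit should agree with fitting each group on its own. A minimal sketch of that sanity check, using closed-form least squares in plain Python: the helper `ols_line` is my own name, and the points are only the rows visible in the `labelled.take(5)` output, so group 2 is represented by two points and group 3 is omitted.

```python
def ols_line(points):
    """Closed-form simple linear regression; returns (slope a, intercept b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# (x, y) pairs read off the labelled.take(5) output
group1 = [(0.0, 2.0), (1.0, 1.0), (2.0, 0.0)]
group2 = [(0.0, 0.0), (1.0, 0.5)]

print(ols_line(group1))  # (-1.0, 2.0)
print(ols_line(group2))  # (0.5, 0.0)
```

These agree with rows 1 and 2 of the table, up to the small numerical error left by SGD.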