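I. Dimensionality Reduction Models
The DataFrame-based MLlib API provides one dimensionality reduction model, principal component analysis (PCA). It lives in pyspark.ml.feature and is typically used as a feature preprocessing step.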
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
dfdata = spark.createDataFrame(data, ["features"])
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(dfdata)
dfresult = model.transform(dfdata).select("pcaFeatures")
dfresult.show(truncate=False)
+-----------------------------------------------------------+
|pcaFeatures |
+-----------------------------------------------------------+
|[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
|[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
|[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
+-----------------------------------------------------------+
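II. Model Tuning
Model tuning is also commonly called model selection or hyperparameter tuning.
MLlib supports hyperparameter tuning by grid search; the relevant classes are in the pyspark.ml.tuning module.
Grid search can be run in two modes:
- Cross-validation
- Hold-out
Cross-validation mode uses K-fold cross-validation: the data is randomly split into K equal folds, each fold serves once as the validation set while the remaining folds form the training set, and the hyperparameters are chosen by the average result over the K validation sets. It is computationally expensive, but the result is more reliable.
Hold-out mode randomly splits the data once into a training set and a validation set and chooses the hyperparameters from that single validation result. The result is less reliable than cross-validation, but the computational cost is lower.
- For large datasets, hold-out is usually the practical choice.
- For small datasets, cross-validation should be preferred.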
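The two splitting strategies described above can be sketched in plain Python, without Spark. This is only an illustration of the idea, not how CrossValidator or TrainValidationSplit are implemented; the helper names `k_fold_splits` and `hold_out_split` are hypothetical.

```python
import random

def k_fold_splits(n_rows, k, seed=0):
    """Yield (train_idx, valid_idx) pairs; each fold is the validation set once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # K folds of nearly equal size
    for i in range(k):
        valid = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, valid

def hold_out_split(n_rows, train_ratio, seed=0):
    """Return a single (train_idx, valid_idx) split."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    cut = int(n_rows * train_ratio)
    return idx[:cut], idx[cut:]

# 12 rows, 5 folds: five train/validation rounds per hyperparameter setting
splits = list(k_fold_splits(12, 5))
# 12 rows, 80/20 hold-out: one train/validation round per setting
train_idx, valid_idx = hold_out_split(12, 0.8)
print(len(splits), [len(valid) for _, valid in splits])
print(len(train_idx), len(valid_idx))
```

With K folds every candidate setting is trained and scored K times, which is where the extra cost of cross-validation comes from.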
1. Cross-validation mode
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Prepare the data
dfdata = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0),
(4, "b spark who", 1.0),
(5, "g d a y", 0.0),
(6, "spark fly", 1.0),
(7, "was mapreduce", 0.0),
(8, "e spark program", 1.0),
(9, "a e c l", 0.0),
(10, "spark compile", 1.0),
(11, "hadoop software", 0.0)
], ["id", "text", "label"])
# Build a pipeline with three stages: tokenizer, hashingTF, lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# Treat the whole pipeline as a single Estimator and tune its hyperparameters together
# Build the grid: 3 candidate values for hashingTF.numFeatures and 2 for lr.regParam,
# so the grid has 3*2=6 points to search
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
# Create a 5-fold cross-validation tuner
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=5)
# fit returns a CrossValidatorModel that wraps the best model found
cvModel = crossval.fit(dfdata)
# Prepare the data to predict on
test = spark.createDataFrame([
(4, "spark i j k"),
(5, "l m n"),
(6, "mapreduce spark"),
(7, "apache hadoop")
], ["id", "text"])
# Predict with the best model
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)
Row(id=4, text='spark i j k', probability=DenseVector([0.2661, 0.7339]), prediction=1.0)
Row(id=5, text='l m n', probability=DenseVector([0.9209, 0.0791]), prediction=0.0)
Row(id=6, text='mapreduce spark', probability=DenseVector([0.4429, 0.5571]), prediction=1.0)
Row(id=7, text='apache hadoop', probability=DenseVector([0.8584, 0.1416]), prediction=0.0)
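The search space built by ParamGridBuilder is the Cartesian product of the candidate lists. A plain-Python sketch of the same enumeration shows where the 6 grid points come from and how many training runs 5-fold cross-validation implies. The dictionary keys are illustrative labels only, not Spark Param objects.

```python
from itertools import product

# Candidate values, mirroring the grid used in the example above
candidates = {
    "numFeatures": [10, 100, 1000],
    "regParam": [0.1, 0.01],
}

names = list(candidates)
# Every combination of one value per hyperparameter
grid = [dict(zip(names, combo)) for combo in product(*candidates.values())]

num_folds = 5
# Each grid point is trained once per fold during the search
fits_during_search = len(grid) * num_folds

for point in grid:
    print(point)
print(len(grid), fits_during_search)
```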
2. Hold-out mode
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
# Prepare the data
dfdata = spark.read.format("libsvm")\
    .load("data/sample_linear_regression_data.txt")
dftrain, dftest = dfdata.randomSplit([0.9, 0.1], seed=12345)
lr = LinearRegression(maxIter=10)
# Build the grid that defines the hyperparameter search space
paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1, 0.01])\
    .addGrid(lr.fitIntercept, [False, True])\
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
    .build()
# Create the hold-out tuner
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data is used for training, 20% for validation
                           trainRatio=0.8)
# fit returns the model trained with the best hyperparameters
model = tvs.fit(dftrain)
# Predict with the model
model.transform(dftest)\
    .select("features", "label", "prediction")\
    .show()
+--------------------+--------------------+--------------------+
| features| label| prediction|
+--------------------+--------------------+--------------------+
|(10,[0,1,2,3,4,5,...| -17.026492264209548| -1.6265106840933026|
|(10,[0,1,2,3,4,5,...| -16.71909683360509|-0.01129960392982...|
|(10,[0,1,2,3,4,5,...| -15.375857723312297| 0.9008270143746643|
|(10,[0,1,2,3,4,5,...| -13.772441561702871| 3.435609049373433|
|(10,[0,1,2,3,4,5,...| -13.039928064104615| 0.3670260850771136|
|(10,[0,1,2,3,4,5,...| -9.42898793151394| -3.26399994121536|
|(10,[0,1,2,3,4,5,...| -9.2679651250406| -0.1762581278405398|
|(10,[0,1,2,3,4,5,...| -9.173693798406978| -0.2824541263038875|
|(10,[0,1,2,3,4,5,...| -7.1500991588127265| 3.087239142258043|
|(10,[0,1,2,3,4,5,...| -6.930603551528371| 0.12389571117374062|
|(10,[0,1,2,3,4,5,...| -6.456944198081549| -0.7275144195427645|
|(10,[0,1,2,3,4,5,...| -3.2843694575334834| -0.9048235164747517|
|(10,[0,1,2,3,4,5,...| -1.99891354174786| 0.9588887587748192|
|(10,[0,1,2,3,4,5,...| -0.4683784136986876| 0.6261083785799368|
|(10,[0,1,2,3,4,5,...|-0.44652227528840105| 0.19068393875752507|
|(10,[0,1,2,3,4,5,...| 0.10157453780074743| -0.9062122256799047|
|(10,[0,1,2,3,4,5,...| 0.2105613019270259| 1.225604620956131|
|(10,[0,1,2,3,4,5,...| 2.1214592666251364| 0.2854396644518767|
|(10,[0,1,2,3,4,5,...| 2.8497179990245116| 1.3569268250561075|
|(10,[0,1,2,3,4,5,...| 3.980473021620311| 2.5359695420417965|
+--------------------+--------------------+--------------------+
only showing top 20 rows
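RegressionEvaluator scores each candidate with root mean squared error (RMSE) unless another metricName is set, and a lower score is better. The computation on a validation set can be sketched in plain Python. The label and prediction values below are made up for illustration and are not taken from the output above.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical validation labels and model predictions
labels = [3.0, -0.5, 2.0, 7.0]
predictions = [2.5, 0.0, 2.0, 8.0]

score = rmse(labels, predictions)
print(score)
```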
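III. Utilities
The pyspark.ml.linalg module provides linear algebra vector and matrix objects.
The pyspark.ml.stat module provides statistical functions such as the chi-square test and correlation analysis.
1. Vectors and matrices
pyspark.ml.linalg supports the DenseVector, SparseVector, DenseMatrix, and SparseMatrix classes.
Vectors and matrices can also be created with the factory methods provided by Vectors and Matrices.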
from pyspark.ml.linalg import DenseVector, SparseVector
# Dense vector
dense_vec = DenseVector([1, 0, 0, 2.0, 0])
print("dense_vec: ", dense_vec)
print("dense_vec.numNonzeros: ", dense_vec.numNonzeros())
# Sparse vector
# Arguments: size, indices of the non-zero entries, values of the non-zero entries
sparse_vec = SparseVector(5, [0,3],[1.0,2.0])
print("sparse_vec: ", sparse_vec)
dense_vec: [1.0,0.0,0.0,2.0,0.0]
dense_vec.numNonzeros: 2
sparse_vec: (5,[0,3],[1.0,2.0])
dense_vec.toArray()
array([1., 0., 0., 2., 0.])
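The relationship between the sparse and dense forms can be sketched in plain Python. The helper `sparse_to_dense` is hypothetical and only mirrors the (size, indices, values) layout used above.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Same data as sparse_vec above
dense = sparse_to_dense(5, [0, 3], [1.0, 2.0])
num_nonzeros = sum(1 for v in dense if v != 0.0)
print(dense, num_nonzeros)
```

Both forms describe the same vector; the sparse form only stores the two non-zero entries.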
from pyspark.ml.linalg import DenseMatrix, SparseMatrix
# Dense matrix
# Arguments: number of rows, number of columns, values in column-major order, isTransposed (default False)
dense_matrix = DenseMatrix(3, 2, [1, 3, 5, 2, 4, 6])
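# Sparse matrix
# Arguments: number of rows, number of columns, column pointers (the offset where each
# column starts in the next two arrays, plus a final entry equal to the number of stored values),
# row indices, non-zero values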
sparse_matrix = SparseMatrix(3, 3, [0, 2, 3, 6],
[0, 2, 1, 0, 1, 2], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print("sparse_matrix.toArray(): \n", sparse_matrix.toArray())
sparse_matrix.toArray():
[[1. 0. 4.]
[0. 3. 5.]
[2. 0. 6.]]
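The three arrays passed to SparseMatrix follow the compressed sparse column (CSC) layout. A plain-Python sketch of how they decode into the dense result above may make the column pointers easier to read. The helper `csc_to_dense` is hypothetical.

```python
def csc_to_dense(num_rows, num_cols, col_ptrs, row_indices, values):
    """Decode a CSC-encoded matrix into a list of rows."""
    dense = [[0.0] * num_cols for _ in range(num_rows)]
    for col in range(num_cols):
        # Stored entries of this column sit in positions col_ptrs[col] .. col_ptrs[col + 1] - 1
        for k in range(col_ptrs[col], col_ptrs[col + 1]):
            dense[row_indices[k]][col] = values[k]
    return dense

# Same arguments as sparse_matrix above
dense = csc_to_dense(3, 3, [0, 2, 3, 6],
                     [0, 2, 1, 0, 1, 2],
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
for row in dense:
    print(row)
```

Reading the pointers [0, 2, 3, 6]: column 0 owns stored entries 0 and 1, column 1 owns entry 2, and column 2 owns entries 3 to 5.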
from pyspark.ml.linalg import Vectors,Matrices
# Factory methods
vec = Vectors.zeros(3)
matrix = Matrices.dense(2,2,[1,2,3,5])
print(matrix)
DenseMatrix([[1., 3.],
[2., 5.]])
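The printed matrix shows that dense values are laid out column by column. A plain-Python sketch of that column-major indexing, using a hypothetical helper `column_major_to_rows`:

```python
def column_major_to_rows(num_rows, num_cols, values):
    """Arrange a flat column-major value list into a list of rows."""
    return [[values[c * num_rows + r] for c in range(num_cols)]
            for r in range(num_rows)]

# Same values as Matrices.dense(2, 2, [1, 2, 3, 5]) above
small = column_major_to_rows(2, 2, [1, 2, 3, 5])
# Same values as DenseMatrix(3, 2, [1, 3, 5, 2, 4, 6]) above
tall = column_major_to_rows(3, 2, [1, 3, 5, 2, 4, 6])
print(small)
print(tall)
```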
2. Statistics
# Correlation analysis
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
data = [(Vectors.sparse(4, [(0, 1.0), (3, -2.0)]),),
(Vectors.dense([4.0, 5.0, 0.0, 3.0]),),
(Vectors.dense([6.0, 7.0, 0.0, 8.0]),),
(Vectors.sparse(4, [(0, 9.0), (3, 1.0)]),)]
df = spark.createDataFrame(data, ["features"])
r1 = Correlation.corr(df, "features").head()
print("Pearson correlation matrix:\n" + str(r1[0]))
r2 = Correlation.corr(df, "features", "spearman").head()
print("Spearman correlation matrix:\n" + str(r2[0]))
Pearson correlation matrix:
DenseMatrix([[1. , 0.05564149, nan, 0.40047142],
[0.05564149, 1. , nan, 0.91359586],
[ nan, nan, 1. , nan],
[0.40047142, 0.91359586, nan, 1. ]])
Spearman correlation matrix:
DenseMatrix([[1. , 0.10540926, nan, 0.4 ],
[0.10540926, 1. , nan, 0.9486833 ],
[ nan, nan, 1. , nan],
[0.4 , 0.9486833 , nan, 1. ]])
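To see where individual entries of the Pearson matrix come from, the coefficient can be recomputed by hand for a pair of columns. The sketch below is plain Python with a hypothetical helper `pearson`; it is not Spark's implementation. It also shows why the third column yields nan: every value in it is zero, so its variance is zero and the coefficient is undefined.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two sequences; nan when either has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else float("nan")

# The four feature vectors from the example above, written out densely
rows = [
    [1.0, 0.0, 0.0, -2.0],
    [4.0, 5.0, 0.0, 3.0],
    [6.0, 7.0, 0.0, 8.0],
    [9.0, 0.0, 0.0, 1.0],
]
columns = list(zip(*rows))

r_03 = pearson(columns[0], columns[3])  # first vs fourth feature
r_13 = pearson(columns[1], columns[3])  # second vs fourth feature
r_02 = pearson(columns[0], columns[2])  # third feature is constant
print(r_03, r_13, r_02)
```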
# Chi-square test
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest
data = [(0.0, Vectors.dense(0.5, 10.0)),
(0.0, Vectors.dense(1.5, 20.0)),
(1.0, Vectors.dense(1.5, 30.0)),
(0.0, Vectors.dense(3.5, 30.0)),
(0.0, Vectors.dense(3.5, 40.0)),
(1.0, Vectors.dense(3.5, 40.0))]
df = spark.createDataFrame(data, ["label", "features"])
r = ChiSquareTest.test(df, "features", "label").head()
print("pValues: " + str(r.pValues))
print("degreesOfFreedom: " + str(r.degreesOfFreedom))
print("statistics: " + str(r.statistics))
pValues: [0.6872892787909721,0.6822703303362126]
degreesOfFreedom: [2, 3]
statistics: [0.75,1.5]
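The statistics and degrees of freedom above can be reproduced by building the contingency table of each feature against the label. The following is a plain-Python sketch of Pearson's chi-square test of independence with a hypothetical helper `chi_square`; it is not Spark's implementation. The p-value is computed only for the first feature, using the fact that for 2 degrees of freedom the chi-square survival function reduces to exp(-x/2).

```python
import math
from collections import Counter

def chi_square(feature_values, labels):
    """Return (statistic, degrees of freedom) for a feature/label contingency table."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    statistic = 0.0
    for f, f_total in feature_totals.items():
        for l, l_total in label_totals.items():
            expected = f_total * l_total / n  # count expected under independence
            statistic += (observed[(f, l)] - expected) ** 2 / expected
    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof

# Same data as the example above, split into columns
labels = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
feature_0 = [0.5, 1.5, 1.5, 3.5, 3.5, 3.5]
feature_1 = [10.0, 20.0, 30.0, 30.0, 40.0, 40.0]

stat_0, dof_0 = chi_square(feature_0, labels)
stat_1, dof_1 = chi_square(feature_1, labels)
p_0 = math.exp(-stat_0 / 2)  # valid only because dof_0 == 2
print(stat_0, dof_0, p_0)
print(stat_1, dof_1)
```

Each distinct feature value is treated as a category, which is why the first feature (3 distinct values) has 2 degrees of freedom and the second (4 distinct values) has 3.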