大数据手册(Spark)--Spark机器学习

Spark 专为在内存中运行的快速交互式计算而设计,使机器学习可以快速运行。

Quick Start

MLlib(DataFrame-based) 是Spark的机器学习(ML)库。它的目标是让实用的机器学习变得可扩展且易用。它提供以下高层API:

  • ML算法:常见的学习算法,如分类、回归、聚类和协同过滤
  • 特征:特征提取、转换、特征缩放和选择
  • 管道:构建、评估和调整ML管道的工具
  • 持久性:保存和加载算法、模型和管道
  • 实用函数:线性代数、统计、数据处理等。

从Spark 2.0开始,spark.mllib软件包中基于RDD的API已进入维护模式。Spark的主要机器学习API现在是spark.ml包中基于DataFrame的API。

MLlib 与 sklearn 类似,对机器学习算法的API做了标准化。下面以 iris 数据集为例,简述建模流程:

Step 1: 加载数据集

# Load the dataset
>>> iris = spark.read.csv("file:///iris.csv", inferSchema="true", header=True)
>>> iris.show(5)                                                                  
+---------------+--------------+---------------+--------------+-------+
|SepalLength(cm)|SepalWidth(cm)|PetalLength(cm)|PetalWidth(cm)|Species|
+---------------+--------------+---------------+--------------+-------+
|            5.1|           3.5|            1.4|           0.2| setosa|
|            4.9|           3.0|            1.4|           0.2| setosa|
|            4.7|           3.2|            1.3|           0.2| setosa|
|            4.6|           3.1|            1.5|           0.2| setosa|
|            5.0|           3.6|            1.4|           0.2| setosa|
+---------------+--------------+---------------+--------------+-------+

Step 2: 数据准备

标签索引化:将类别型标签数值化(可选)

from pyspark.ml.feature import StringIndexer

# Convert the categorical labels in the target column to numerical values
indexer = StringIndexer(
    inputCol="Species", 
    outputCol="label"
)

创建特征向量:将所有的特征整合到单一列(估计器必须)

from pyspark.ml.feature import VectorAssembler

# Assemble the feature columns into a single vector column
assembler = VectorAssembler(
    inputCols=["SepalLength(cm)", "SepalWidth(cm)", "PetalLength(cm)", "PetalWidth(cm)"], 
    outputCol="features"
)

拆分成训练集和测试集

# Split data into training and testing sets
train, test = iris.randomSplit([0.8, 0.2], seed=42)

Step 3: 创建估计器

from pyspark.ml.classification import LogisticRegression

# Create a LogisticRegression instance. This instance is an Estimator.
classifier = LogisticRegression(
    maxIter=10, 
    regParam=0.01, 
    featuresCol="features",
    labelCol='label'
)

Step 4: 创建管道拟合模型

from pyspark.ml import Pipeline

# Assemble all the steps (indexing, assembling, and model building) into a pipeline.
pipeline = Pipeline(stages=[indexer, assembler, classifier])
model = pipeline.fit(train)

在Pipeline上调用.fit()方法会返回用于预测的PipelineModel对象。训练好的逻辑回归模型(lrModel)位于管道的对应阶段,可以提取出来查看模型参数。

lrModel = model.stages[2]
print(lrModel.coefficientMatrix)
print(lrModel.interceptVector)

Step 5: 模型预测

# perform predictions
predictions = model.transform(test)

# save predictions
predictions.write.mode('overwrite').saveAsTable("predictions", partitionBy=None)

将之前创建的测试集传递给.transform()方法即可获得预测结果。预测输出新增了几列:rawPrediction是原始预测值,probability是为每个类别计算出的概率,prediction是最终判定的类别。
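
可以直接查看这几列的内容,例如(沿用上文得到的 predictions):

# Inspect the prediction-related columns
predictions.select("probability", "prediction", "label").show(5, truncate=False)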

Step 6: 模型评估

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Evaluate the model performance
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", 
    metricName="accuracy"
)

accuracy = evaluator.evaluate(predictions)
print(f"Test Accuracy: {accuracy:.2f}")

Step 7: 模型持久化

# Save model
pipelinePath="./pipeline"
model.save(pipelinePath)

# Load the model
from pyspark.ml import PipelineModel
loaded_model = PipelineModel.load(pipelinePath)

ML Pipelines

ML 抽象类

ML管道在DataFrame之上提供了一套统一的高级API,以便更容易将多个算法组合到单个管道或工作流中。其中最核心的抽象是Transformer、Estimator和Pipeline,相关概念如下(示例见列表后的代码):

  • DataFrame:使用Spark SQL的DataFrame作为ML的数据集,可以容纳各种数据类型。
  • Transformer:实现了一个transform()方法,通常通过将一个或多个新列附加到输入DataFrame来生成新的DataFrame。比如一个模型就是一个转换器,它可以把一个带有特征列的DataFrame转换为一个附加了预测列的新DataFrame。
  • Estimator:学习算法或在训练数据上进行训练的方法的抽象。估计器实现了一个fit()方法,在Pipeline中通常用来对一个DataFrame进行训练并生成一个Transformer。比如随机森林算法就是一个Estimator,调用fit()方法在特征数据上训练即可得到一个随机森林模型。
  • Pipeline:管道的概念用来表示从转换到评估(具有一系列不同阶段)的端到端的过程,这个过程可以对输入的一些原始数据(以DataFrame形式)执行必要的数据加工(转换),最后评估统计模型,返回PipelineModel。
  • Parameter:所有Transformer和Estimator使用统一的API来指定参数。
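
下面给出一个最小示意,展示Estimator、Transformer和Param之间的关系(假设 df 是一个已经包含 features 向量列和数值 label 列的 DataFrame):

from pyspark.ml.classification import LogisticRegression

# LogisticRegression 是一个 Estimator
lr = LogisticRegression(maxIter=10)

# 查看该 Estimator 的全部 Param 及说明
print(lr.explainParams())

# fit() 在训练数据上训练,返回一个 Transformer(LogisticRegressionModel)
# 注意:df 为假设存在的 DataFrame,需包含 features/label 列
model = lr.fit(df)

# transform() 追加预测相关列并返回新的 DataFrame
model.transform(df).select("label", "prediction").show(5)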

Pipeline

from pyspark.ml import Pipeline
Pipeline(stages=[stage1,stage2,stage3,...])

在Pipeline对象上执行.fit()方法时,所有阶段按照stages参数中指定的顺序执行。stages参数是转换器和评估器对象的列表。

from pyspark.ml import Pipeline

# Configure an ML pipeline, which consists of three stages.
pipeline = Pipeline(stages=[indexer, assembler, classifier])

# Fit the pipeline to training dataset.
model = pipeline.fit(train)

# Make predictions on the training dataset.
prediction = model.transform(train)

数据预处理

import pyspark.ml.feature as ft

特征向量化

pyspark.ml.feature
  • VectorAssembler:特征向量化(Transformer)
  • VectorSlicer:向量特征切片提取(Transformer)

VectorAssembler 特征向量化,将多个给定列(包括向量)组合成单个向量列。常用于生成评估器的 featuresCol参数。

>>> from pyspark.ml.feature import VectorAssembler

>>> assembler = VectorAssembler(
...     inputCols=["SepalLength(cm)", "SepalWidth(cm)", "PetalLength(cm)", "PetalWidth(cm)"],
...     outputCol="features"
... )
>>> 
>>> iris = assembler.transform(iris)
>>> iris.select("features", "Species").show(5, truncate=False)
+-----------------+-------+
|features         |Species|
+-----------------+-------+
|[5.1,3.5,1.4,0.2]|setosa |
|[4.9,3.0,1.4,0.2]|setosa |
|[4.7,3.2,1.3,0.2]|setosa |
|[4.6,3.1,1.5,0.2]|setosa |
|[5.0,3.6,1.4,0.2]|setosa |
+-----------------+-------+
only showing top 5 rows

VectorSlicer是一个Transformer,它接受一个特征向量,并输出一个具有原始特征子数组的新特征向量。可以使用整数索引和字符串名称作为参数。

>>> from pyspark.ml.feature import VectorSlicer

>>> slicer = VectorSlicer(inputCol="features", outputCol="selectedFeatures", indices=[1, 2])
>>> output = slicer.transform(iris)
>>> output.select("features", "selectedFeatures").show(5)
+-----------------+----------------+
|         features|selectedFeatures|
+-----------------+----------------+
|[5.1,3.5,1.4,0.2]|       [3.5,1.4]|
|[4.9,3.0,1.4,0.2]|       [3.0,1.4]|
|[4.7,3.2,1.3,0.2]|       [3.2,1.3]|
|[4.6,3.1,1.5,0.2]|       [3.1,1.5]|
|[5.0,3.6,1.4,0.2]|       [3.6,1.4]|
+-----------------+----------------+
only showing top 5 rows

特征提取

特征提取用于把原始数据(如文本、图像)转换成机器学习算法支持的特征格式。

pyspark.ml.feature
  • CountVectorizer:Estimator,从文档集合中提取词汇表并生成CountVectorizerModel
  • HashingTF:Transformer,接受词项(term)集合,并将其转换为固定长度的特征向量
  • IDF:Estimator,计算给定文档集合的逆文档频率并生成IDFModel
  • Word2Vec:Estimator,接受代表文档的单词序列,训练得到Word2VecModel
  • StopWordsRemover:Transformer,从输入中过滤掉停用词
  • NGram:Transformer,将字符串数组转换为n-gram
  • Tokenizer:Transformer,将文本转为小写并按空格切分为单词序列
  • FeatureHasher:Transformer,将一组分类或数值特征投射到指定维度的特征向量中(通常远小于原始特征空间)
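
表中的 Tokenizer、HashingTF 和 IDF 经常组合起来计算 TF-IDF 文本特征,下面是一个简单示意(示例语料为虚构):

from pyspark.ml.feature import Tokenizer, HashingTF, IDF

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
], ["label", "sentence"])

# 分词:转为小写并按空格切分
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

# 词频(TF):将单词序列映射为固定长度的频次向量
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

# 逆文档频率(IDF):IDF 是 Estimator,先 fit 得到 IDFModel 再 transform
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show(truncate=False)

Word2Vec 的用法如下: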
>>> from pyspark.ml.feature import Word2Vec
>>> 
>>> # Input data: Each row is a bag of words from a sentence or document.
>>> documentDF = spark.createDataFrame([
...     ("Hi I heard about Spark".split(" "), ),
...     ("I wish Java could use case classes".split(" "), ),
...     ("Logistic regression models are neat".split(" "), )
... ], ["text"])
>>> 
>>> # Learn a mapping from words to Vectors.
>>> word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
>>> model = word2Vec.fit(documentDF)
>>> 
>>> result = model.transform(documentDF)
>>> for row in result.collect():
...     text, vector = row
...     print("Text: [%s] => \nVector: %s\n" % (", ".join(text), str(vector)))
... 
Text: [Hi, I, heard, about, Spark] => 
Vector: [0.012264367192983627,-0.06442034244537354,-0.007622340321540833]

Text: [I, wish, Java, could, use, case, classes] => 
Vector: [0.05160687722465289,0.025969027541577816,0.02736483487699713]

Text: [Logistic, regression, models, are, neat] => 
Vector: [-0.06564115285873413,0.02060299552977085,-0.08455150425434113]

标准化/归一化

pyspark.ml.feature
  • StandardScaler(withMean, withStd, …):Estimator,z-score标准化
  • Normalizer(p, inputCol, outputCol):Transformer,使用p范数将每行数据缩放为单位范数(默认为L2)
  • MaxAbsScaler(inputCol, outputCol):Estimator,按最大绝对值将数据缩放到[-1, 1]范围内
  • MinMaxScaler(min, max, inputCol, outputCol):Estimator,将数据缩放到[0, 1]范围内(默认)
  • RobustScaler(lower, upper, …):Estimator,根据分位数缩放数据

需要先将连续变量合并成向量

>>> from pyspark.ml.feature import Normalizer
>>> 
>>> # Normalize each Vector using $L^1$ norm.
>>> normalizer = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)
>>> l1NormData = normalizer.transform(iris).select("normFeatures")
>>> print("Normalized using L^1 norm")
Normalized using L^1 norm
>>> l1NormData.show(5)
+--------------------+
|        normFeatures|
+--------------------+
|[0.5,0.3431372549...|
|[0.51578947368421...|
|[0.5,0.3404255319...|
|[0.48936170212765...|
|[0.49019607843137...|
+--------------------+
only showing top 5 rows
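
表中的其他缩放器都是Estimator,需要先fit再transform。下面以上文已包含features列的iris数据为例,给出MinMaxScaler的一个简单示意:

from pyspark.ml.feature import MinMaxScaler

scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

# fit 时统计每个特征的最小值/最大值,得到 MinMaxScalerModel
scalerModel = scaler.fit(iris)

# 将每个特征缩放到 [0, 1] 区间
scaledData = scalerModel.transform(iris)
scaledData.select("features", "scaledFeatures").show(5, truncate=False)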

分类特征编码

pyspark.ml.feature
  • StringIndexer:将字符串特征编码为索引列,可以同时编码多列
  • IndexToString:与StringIndexer对应,将标签索引列映射回原始字符串标签列
  • VectorIndexer:对向量特征列中的分类特征进行索引化
  • OneHotEncoder:One-hot编码,为每个输入列返回一个编码后的输出向量列
  • ElementwiseProduct:元素乘积(Hadamard积)

StringIndexer转换器可以把字符串型特征编码为数值索引,使得无法直接处理类别型特征的算法可以使用这些特征,并能提高决策树等机器学习算法的效率。

>>> from pyspark.ml.feature import StringIndexer
>>> 
>>> df = spark.createDataFrame(
...     [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
...     ["id", "category"])
>>> 
>>> indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
>>> indexed = indexer.fit(df).transform(df)
# Transformed string column 'category' to indexed column 'categoryIndex'.
# StringIndexer will store labels in output column metadata.
>>> indexed.show()
+---+--------+-------------+
| id|category|categoryIndex|
+---+--------+-------------+
|  0|       a|          0.0|
|  1|       b|          2.0|
|  2|       c|          1.0|
|  3|       a|          0.0|
|  4|       a|          0.0|
|  5|       c|          1.0|
+---+--------+-------------+

索引从0开始,按标签出现频率从高到低分配:出现频率最高的标签索引为0。

与StringIndexer相对应,IndexToString的作用是把特征索引列重新映射回原有的字符标签。

>>> from pyspark.ml.feature import IndexToString

>>> converter = IndexToString(inputCol="categoryIndex", outputCol="originalCategory")
>>> converted = converter.transform(indexed)

# Transformed indexed column 'categoryIndex' back to original string column 'originalCategory' using labels in metadata
>>> converted.select("id", "categoryIndex", "originalCategory").show()
+---+-------------+----------------+
| id|categoryIndex|originalCategory|
+---+-------------+----------------+
|  0|          0.0|               a|
|  1|          2.0|               b|
|  2|          1.0|               c|
|  3|          0.0|               a|
|  4|          0.0|               a|
|  5|          1.0|               c|
+---+-------------+----------------+

之前介绍的 StringIndexer 分别对单个特征进行转换,如果所有特征已经合并到特征向量features中,又想对其中某些单个分量进行处理时,ML包提供了VectorIndexer转化器来执行向量索引化。

VectorIndexer根据每个特征不同取值的数量来识别类别型特征:maxCategories参数提供一个阈值,不同取值数不超过该阈值的特征会被视为类别型并被索引化。

>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import VectorIndexer

>>> df = spark.createDataFrame([
...     (Vectors.dense([-1.0, 0.0]),),
...     (Vectors.dense([0.0, 1.0]),), 
...     (Vectors.dense([0.0, 2.0]),)], 
...     ["features"])
>>> 
>>> indexer = VectorIndexer(inputCol="features", outputCol="indexed", maxCategories=2)
>>> indexerModel = indexer.fit(df)
>>> 
>>> categoricalFeatures = indexerModel.categoryMaps
>>> print("Chose %d categorical features: %s" %
...       (len(categoricalFeatures), ", ".join(str(k) for k in categoricalFeatures.keys())))
Chose 1 categorical features: 0
>>> 
>>> # Create new column "indexed" with categorical values transformed to indices 
>>> indexedData = indexerModel.transform(df)
>>> indexedData.show()
+----------+---------+
|  features|  indexed|
+----------+---------+
|[-1.0,0.0]|[1.0,0.0]|
| [0.0,1.0]|[0.0,1.0]|
| [0.0,2.0]|[0.0,2.0]|
+----------+---------+

OneHotEncoder对离散特征进行独热编码。但该方法不接受StringType列,只能处理数值类型,因此需要先对字符串特征做索引化。

>>> from pyspark.ml.feature import OneHotEncoder
>>> 
>>> df = spark.createDataFrame([
...     (0.0, 1.0),
...     (1.0, 0.0),
...     (2.0, 1.0),
...     (0.0, 2.0),
...     (0.0, 1.0),
...     (2.0, 0.0)
... ], ["categoryIndex1", "categoryIndex2"])
>>> 
>>> encoder = OneHotEncoder(inputCols=["categoryIndex1", "categoryIndex2"],
...                         outputCols=["categoryVec1", "categoryVec2"])
>>> model = encoder.fit(df)
>>> encoded = model.transform(df)
>>> encoded.show()
+--------------+--------------+-------------+-------------+
|categoryIndex1|categoryIndex2| categoryVec1| categoryVec2|
+--------------+--------------+-------------+-------------+
|           0.0|           1.0|(2,[0],[1.0])|(2,[1],[1.0])|
|           1.0|           0.0|(2,[1],[1.0])|(2,[0],[1.0])|
|           2.0|           1.0|    (2,[],[])|(2,[1],[1.0])|
|           0.0|           2.0|(2,[0],[1.0])|    (2,[],[])|
|           0.0|           1.0|(2,[0],[1.0])|(2,[1],[1.0])|
|           2.0|           0.0|    (2,[],[])|(2,[0],[1.0])|
+--------------+--------------+-------------+-------------+

ElementwiseProduct 输出每个输入向量与提供的“权重”向量的Hadamard积(即元素乘积)。换句话说,它用标量乘数缩放数据集的每一列。
$$\begin{pmatrix} v_1 \\ \vdots \\ v_N \end{pmatrix} \circ \begin{pmatrix} w_1 \\ \vdots \\ w_N \end{pmatrix} = \begin{pmatrix} v_1 w_1 \\ \vdots \\ v_N w_N \end{pmatrix}$$

>>> from pyspark.ml.feature import ElementwiseProduct
>>> from pyspark.ml.linalg import Vectors
>>> 
>>> # Create some vector data; also works for sparse vectors
>>> data = [(Vectors.dense([1.0, 2.0, 3.0]),), (Vectors.dense([4.0, 5.0, 6.0]),)]
>>> df = spark.createDataFrame(data, ["vector"])
>>> transformer = ElementwiseProduct(scalingVec=Vectors.dense([0.0, 1.0, 2.0]),
...                                  inputCol="vector", outputCol="transformedVector")
>>> # Batch transform the vectors to create new column:
>>> transformer.transform(df).show()
+-------------+-----------------+
|       vector|transformedVector|
+-------------+-----------------+
|[1.0,2.0,3.0]|    [0.0,2.0,6.0]|
|[4.0,5.0,6.0]|   [0.0,5.0,12.0]|
+-------------+-----------------+

连续特征离散化

pyspark.ml.feature
  • Binarizer(threshold, inputCol, …):按给定阈值将连续特征二值化
  • Bucketizer(splits, inputCol, outputCol, …):根据阈值列表(splits)将连续变量离散化
  • QuantileDiscretizer(numBuckets, …):根据数据的近似分位数自动分桶,桶数由numBuckets指定

>>> from pyspark.ml.feature import Bucketizer
>>> values = [(0.1, 0.0), (0.4, 1.0), (1.2, 1.3), (1.5, float("nan")),
...     (float("nan"), 1.0), (float("nan"), 0.0)]
>>> df = spark.createDataFrame(values, ["values1", "values2"])
>>> bucketizer = Bucketizer(
...     splitsArray=[
...         [-float("inf"), 0.5, 1.4, float("inf")], 
...         [-float("inf"), 0.5, float("inf")]
...     ],
...     inputCols=["values1", "values2"], 
...     outputCols=["buckets1", "buckets2"]
... )
>>> bucketed = bucketizer.setHandleInvalid("keep").transform(df)
>>> bucketed.show(truncate=False)
+-------+-------+--------+--------+
|values1|values2|buckets1|buckets2|
+-------+-------+--------+--------+
|0.1    |0.0    |0.0     |0.0     |
|0.4    |1.0    |0.0     |1.0     |
|1.2    |1.3    |1.0     |1.0     |
|1.5    |NaN    |2.0     |2.0     |
|NaN    |1.0    |3.0     |1.0     |
|NaN    |0.0    |3.0     |0.0     |
+-------+-------+--------+--------+

splits:将连续特征离散化的阈值列表。对于n+1个阈值,则有n个桶。由阈值 x, y 定义的桶在 [x, y) 范围内持有值,但最后一个桶除外,它也包括y。必须显式提供-inf和inf以涵盖所有Double值。否则,指定边界之外的值将被视为错误。
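
与Bucketizer不同,QuantileDiscretizer不需要手动给出splits,而是通过numBuckets按近似分位数自动划分,下面是一个简单示意(示例数据为虚构):

from pyspark.ml.feature import QuantileDiscretizer

data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)]
df = spark.createDataFrame(data, ["id", "hour"])

# QuantileDiscretizer 是 Estimator,fit 时根据数据的近似分位数计算分桶边界
discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result")
result = discretizer.fit(df).transform(df)
result.show()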

特征构造

pyspark.ml.feature
  • PolynomialExpansion(degree, inputCol, …):多项式特征扩展
  • PCA(k, inputCol, outputCol):使用主成分分析进行数据降维
  • DCT(inverse, inputCol, outputCol):离散余弦变换(Discrete Cosine Transform),将时域中长度为N的实值序列变换到频域
  • Interaction(inputCols, outputCol):特征交互。接受向量列或数值列,生成一个向量列,其中包含从每个输入列各取一个值的所有组合的乘积
  • RFormula(formula, featuresCol, …):根据R模型公式对数据集执行相应的转换

>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import PCA

>>> data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
...         (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
...         (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
>>> df = spark.createDataFrame(data, ["features"])

>>> pca = PCA(k=3, inputCol="features", outputCol="pca_features")
>>> model = pca.fit(df)

>>> model.explainedVariance
DenseVector([0.7944, 0.2056, 0.0])

>>> result = model.transform(df).select("pca_features")
>>> result.show(truncate=False)
+------------------------------------------------------------+
|pca_features                                                |
+------------------------------------------------------------+
|[1.6485728230883814,-4.0132827005162985,-1.0091435193998504]|
|[-4.645104331781533,-1.1167972663619048,-1.0091435193998501]|
|[-6.428880535676488,-5.337951427775359,-1.009143519399851]  |
+------------------------------------------------------------+
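
表中的RFormula可以用一条R风格公式同时完成标签准备和特征向量化(字符串列会自动做独热编码),下面是一个简单示意(示例数据为虚构):

from pyspark.ml.feature import RFormula

dataset = spark.createDataFrame([
    (7, "US", 18, 1.0),
    (8, "CA", 12, 0.0),
    (9, "NZ", 15, 0.0)
], ["id", "country", "hour", "clicked"])

# clicked 作为标签,country 和 hour 作为特征
formula = RFormula(
    formula="clicked ~ country + hour",
    featuresCol="features",
    labelCol="label")

output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()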

缺失值插补

Imputer(strategy, missingValue, ...)

使用缺失值所在列的平均值、中位数或众数来完成缺失值的插补。输入列应为数值类型。目前Imputer不支持分类特征,对包含分类特征的列可能产生不正确的结果。

>>> from pyspark.ml.feature import Imputer
>>> 
>>> df = spark.createDataFrame([
...     (1.0, float("nan")),
...     (2.0, float("nan")),
...     (float("nan"), 3.0),
...     (4.0, 4.0),
...     (5.0, 5.0)
... ], ["a", "b"])
>>> 
>>> imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
>>> model = imputer.fit(df)
>>> 
>>> model.transform(df).show()
+---+---+-----+-----+
|  a|  b|out_a|out_b|
+---+---+-----+-----+
|1.0|NaN|  1.0|  4.0|
|2.0|NaN|  2.0|  4.0|
|NaN|3.0|  3.0|  3.0|
|4.0|4.0|  4.0|  4.0|
|5.0|5.0|  5.0|  5.0|
+---+---+-----+-----+

特征选择

pyspark.ml.feature
  • ChiSqSelector(numTopFeatures, …):基于卡方检验,选择用于预测分类标签的分类特征
  • VarianceThresholdSelector(featuresCol, …):删除所有低方差特征
  • UnivariateFeatureSelector(featuresCol, …):单变量特征选择

UnivariateFeatureSelector在具有分类/连续特征的分类/回归任务上进行特征选择。Spark根据指定的featureType和labelType参数选择要使用的评分函数:

  • featureType=categorical,labelType=categorical:卡方检验 chi-squared (chi2)
  • featureType=continuous,labelType=categorical:方差分析 ANOVA F-test (f_classif)
  • featureType=continuous,labelType=continuous:F-value (f_regression)

它支持五种选择模式:

  • numTopFeatures 选择评分最高的固定数量的特征。
  • percentile 选择评分最高的固定百分比的特征。
  • fpr选择p值低于阈值的所有特征,从而控制假阳性选择率。
  • fdr使用Benjamini-Hochberg程序来选择错误发现率低于阈值的所有特征。
  • fwe选择p值低于阈值的所有特征。阈值按1/numFeatures缩放,从而控制family-wise的错误率。
>>> from pyspark.ml.feature import UnivariateFeatureSelector
>>> from pyspark.ml.linalg import Vectors
>>> 
>>> df = spark.createDataFrame([
...     (1, Vectors.dense([1.7, 4.4, 7.6, 5.8, 9.6, 2.3]), 3.0,),
...     (2, Vectors.dense([8.8, 7.3, 5.7, 7.3, 2.2, 4.1]), 2.0,),
...     (3, Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5]), 3.0,),
...     (4, Vectors.dense([3.7, 9.2, 6.1, 4.1, 7.5, 3.8]), 2.0,),
...     (5, Vectors.dense([8.9, 5.2, 7.8, 8.3, 5.2, 3.0]), 4.0,),
...     (6, Vectors.dense([7.9, 8.5, 9.2, 4.0, 9.4, 2.1]), 4.0,)], 
...     ["id", "features", "label"])
>>> 
>>> selector = UnivariateFeatureSelector(
...     featuresCol="features", 
...     outputCol="selectedFeatures",
...     labelCol="label", 
...     selectionMode="numTopFeatures")
>>> selector.setFeatureType("continuous").setLabelType("categorical").setSelectionThreshold(1)
>>> # UnivariateFeatureSelector output with top 1 features selected using f_classif
>>> result = selector.fit(df).transform(df)
>>> result.show()
+---+--------------------+-----+----------------+
| id|            features|label|selectedFeatures|
+---+--------------------+-----+----------------+
|  1|[1.7,4.4,7.6,5.8,...|  3.0|           [2.3]|
|  2|[8.8,7.3,5.7,7.3,...|  2.0|           [4.1]|
|  3|[1.2,9.5,2.5,3.1,...|  3.0|           [2.5]|
|  4|[3.7,9.2,6.1,4.1,...|  2.0|           [3.8]|
|  5|[8.9,5.2,7.8,8.3,...|  4.0|           [3.0]|
|  6|[7.9,8.5,9.2,4.0,...|  4.0|           [2.1]|
+---+--------------------+-----+----------------+

SQL转换器

SQLTransformer实现了由SQL语句定义的转换。目前,只支持如下SQL语法:

SELECT ... FROM __THIS__ ...

其中__THIS__ 表示输入数据集的底层表。

>>> from pyspark.ml.feature import SQLTransformer
>>> 
>>> df = spark.createDataFrame([
...     (0, 1.0, 3.0),
...     (2, 2.0, 5.0)
... ], ["id", "v1", "v2"])
>>> sqlTrans = SQLTransformer(
...     statement="SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
>>> sqlTrans.transform(df).show()
+---+---+---+---+----+
| id| v1| v2| v3|  v4|
+---+---+---+---+----+
|  0|1.0|3.0|4.0| 3.0|
|  2|2.0|5.0|7.0|10.0|
+---+---+---+---+----+

机器学习常用算法

分类和回归

pyspark.ml.classification
  • LogisticRegression:逻辑回归
  • DecisionTreeClassifier:决策树
  • RandomForestClassifier:随机森林
  • GBTClassifier:梯度提升树(GBT)
  • LinearSVC:线性支持向量机
  • NaiveBayes:朴素贝叶斯
  • FMClassifier:因子分解机(FM)分类器
  • MultilayerPerceptronClassifier:多层感知机
  • OneVsRest:将多分类问题简化为多个二分类问题

pyspark.ml.regression
  • LinearRegression:线性回归
  • GeneralizedLinearRegression:广义线性回归
  • DecisionTreeRegressor:决策树回归
  • RandomForestRegressor:随机森林回归
  • GBTRegressor:梯度提升树(GBT)回归
  • AFTSurvivalRegression:AFT生存回归
  • IsotonicRegression:保序回归
  • FMRegressor:因子分解机(FM)回归

>>> from pyspark.ml import Pipeline
>>> from pyspark.ml.classification import DecisionTreeClassifier
>>> from pyspark.ml.feature import StringIndexer, VectorAssembler
>>> from pyspark.ml.evaluation import MulticlassClassificationEvaluator

>>> # Load the dataset.
>>> data = spark.read.csv("file:///iris.csv", inferSchema="true", header=True)

>>> # Index labels, adding metadata to the label column.
>>> # Fit on whole dataset to include all labels in index.
>>> labelIndexer = StringIndexer(inputCol="Species", outputCol="label")

>>> # Assemble the feature columns into a single vector column
>>> assembler = VectorAssembler(
...     inputCols=["SepalLength(cm)", "SepalWidth(cm)", "PetalLength(cm)", "PetalWidth(cm)"], 
...     outputCol="features"
... )

>>> # Split the data into training and test sets (30% held out for testing)
>>> trainingData, testData = data.randomSplit([0.7, 0.3])
>>> # Train a DecisionTree model.

>>> dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
>>> # Chain indexers and tree in a Pipeline
>>> pipeline = Pipeline(stages=[labelIndexer, assembler, dt])
>>> # Train model.  This also runs the indexers.
>>> model = pipeline.fit(trainingData)

>>> # Make predictions.
>>> predictions = model.transform(testData)
>>> # Select example rows to display.
>>> predictions.select("prediction", "label").show(5)
+----------+-----+
|prediction|label|
+----------+-----+
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
+----------+-----+
only showing top 5 rows

>>> # Select (prediction, true label) and compute test error
>>> evaluator = MulticlassClassificationEvaluator(
...     labelCol="label", predictionCol="prediction", metricName="accuracy")
>>> accuracy = evaluator.evaluate(predictions)
>>> print("Test Error = %g " % (1.0 - accuracy))
Test Error = 0.0425532 

>>> treeModel = model.stages[2]
>>> # summary only
>>> print(treeModel)
DecisionTreeClassificationModel: uid=DecisionTreeClassifier_912bad7cd9f2, depth=5, numNodes=15, numClasses=3, numFeatures=4

聚类

pyspark.ml.clustering
  • KMeans:k-means聚类
  • LDA:隐含狄利克雷分布(Latent Dirichlet Allocation)主题模型
  • BisectingKMeans:二分k-means(层次聚类)
  • GaussianMixture:高斯混合模型聚类
  • PowerIterationClustering:幂迭代聚类(PIC)

>>> from pyspark.ml.clustering import KMeans
>>> from pyspark.ml.evaluation import ClusteringEvaluator
>>> from pyspark.ml.feature import VectorAssembler

# Loads data.
>>> data = spark.read.csv("file:///iris.txt", inferSchema="true", header=True)

>>> # Assemble the feature columns into a single vector column
>>> data = VectorAssembler(
...     inputCols=["SepalLength(cm)", "SepalWidth(cm)", "PetalLength(cm)", "PetalWidth(cm)"], 
...     outputCol="features"
... ).transform(data)

# Trains a k-means model.
>>> kmeans = KMeans().setK(3).setSeed(1)
>>> model = kmeans.fit(data)

# Make predictions
>>> predictions = model.transform(data)
>>> evaluator = ClusteringEvaluator()
>>> silhouette = evaluator.evaluate(predictions)
>>> print("Silhouette with squared euclidean distance = " + str(silhouette))
Silhouette with squared euclidean distance = 0.7342113066202739
>>> centers = model.clusterCenters()
>>> print("Cluster Centers: ")
Cluster Centers: 
>>> for center in centers:
...     print(center)
... 
[6.85384615 3.07692308 5.71538462 2.05384615]
[5.006 3.418 1.464 0.244]
[5.88360656 2.74098361 4.38852459 1.43442623]

协同过滤

协同过滤通常用于推荐系统。

pyspark.ml.recommendation
  • ALS:交替最小二乘(ALS)矩阵分解

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row

lines = spark.read.text("data/mllib/als/sample_movielens_ratings.txt").rdd
parts = lines.map(lambda row: row.value.split("::"))
ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
                                     rating=float(p[2]), timestamp=int(p[3])))
ratings = spark.createDataFrame(ratingsRDD)
(training, test) = ratings.randomSplit([0.8, 0.2])

# Build the recommendation model using ALS on the training data
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(training)

# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

# Generate top 10 movie recommendations for each user
userRecs = model.recommendForAllUsers(10)
# Generate top 10 user recommendations for each movie
movieRecs = model.recommendForAllItems(10)

# Generate top 10 movie recommendations for a specified set of users
users = ratings.select(als.getUserCol()).distinct().limit(3)
userSubsetRecs = model.recommendForUserSubset(users, 10)
# Generate top 10 user recommendations for a specified set of movies
movies = ratings.select(als.getItemCol()).distinct().limit(3)
movieSubSetRecs = model.recommendForItemSubset(movies, 10)

频繁模式挖掘

pyspark.ml.fpm
  • FPGrowth:并行FP-growth算法,用于挖掘频繁项集
  • PrefixSpan:并行PrefixSpan算法,用于挖掘频繁序列模式
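
下面是FPGrowth的一个简单示意(示例数据为虚构):

from pyspark.ml.fpm import FPGrowth

df = spark.createDataFrame([
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 2])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)

# 频繁项集
model.freqItemsets.show()

# 关联规则
model.associationRules.show()

# 根据关联规则对输入事务给出推荐(预测后项)
model.transform(df).show()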

模型选择和评估

ML支持使用CrossValidator和TrainValidationSplit进行模型评估和选择。主要需要以下参数:

  • Estimator:要调整的算法或Pipeline
  • ParamMap:可供选择的参数网格
  • Evaluator:衡量模型表现的评估器

模型评估

pyspark.ml.evaluation
  • RegressionEvaluator:回归模型评估(rmse, mse, r2, mae, var)
  • BinaryClassificationEvaluator:二分类模型评估(areaUnderROC, areaUnderPR)
  • MulticlassClassificationEvaluator:多分类模型评估(f1, accuracy, weightedPrecision, weightedRecall, logLoss, …)
  • MultilabelClassificationEvaluator:多标签分类模型评估(precisionByLabel, recallByLabel, f1MeasureByLabel, …)
  • ClusteringEvaluator:聚类模型评估(silhouette)
  • RankingEvaluator:排序模型评估(meanAveragePrecision, ndcgAtK, …)

>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.evaluation import BinaryClassificationEvaluator

>>> scoreAndLabels = [
...     (Vectors.dense([0.9, 0.1]), 0.0), 
...     (Vectors.dense([0.9, 0.1]), 1.0), 
...     (Vectors.dense([0.6, 0.4]), 0.0), 
...     (Vectors.dense([0.4, 0.6]), 1.0)
... ]
>>> dataset = spark.createDataFrame(scoreAndLabels, ["raw", "label"])
>>> 
>>> evaluator = BinaryClassificationEvaluator()
>>> evaluator.setRawPredictionCol("raw")
BinaryClassificationEvaluator_13c5fd3055fb
>>> evaluator.evaluate(dataset)
0.625
>>> 
>>> evaluator.evaluate(dataset, {evaluator.metricName: "areaUnderPR"})
0.75

超参数调优

pyspark.ml.tuning
  • ParamGridBuilder:构建参数网格
  • CrossValidator:K折交叉验证
  • TrainValidationSplit:单次训练/验证集拆分

Spark ML 的超参数调优主要借助CrossValidator或TrainValidationSplit来实现,工作方式如下:

  1. 将输入数据拆分为单独的训练集和测试集。
  2. 使用ParamGridBuilder构建参数网格,根据给定评估指标,循环遍历定义的参数值列表,估计各个单独的模型,从而选定一个最佳 ParamMap。
  3. 最终使用最佳ParamMap和整个数据集重新拟合Estimator。

默认情况下,参数网格中的各组参数按串行方式依次评估;可以通过设置parallelism参数改为并行评估。

>>> from pyspark.ml import Pipeline
>>> from pyspark.ml.feature import StringIndexer, VectorAssembler
>>> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
>>> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
>>> from pyspark.ml.classification import LogisticRegression

# Prepare training and test data.
>>> iris = spark.read.csv("file:///iris.csv", inferSchema="true", header=True)

# Convert the categorical labels in the target column to numerical values
>>> indexer = StringIndexer(
...     inputCol="Species", 
...     outputCol="label"
... )

>>> # Assemble the feature columns into a single vector column
>>> assembler = VectorAssembler(
...     inputCols=["SepalLength(cm)", "SepalWidth(cm)", "PetalLength(cm)", "PetalWidth(cm)"], 
...     outputCol="features"
... )

>>> train, test = iris.randomSplit([0.9, 0.1], seed=42)

>>> lr = LogisticRegression(maxIter=100)

# Assemble all the steps (indexing, assembling, and model building) into a pipeline.
>>> pipeline = Pipeline(stages=[indexer, assembler, lr])

# We use a ParamGridBuilder to construct a grid of parameters to search over.
# CrossValidator will try all combinations of values and determine best model using
# the evaluator.
>>> paramGrid = ParamGridBuilder()\
...     .addGrid(lr.regParam, [0.1, 0.01]) \
...     .addGrid(lr.fitIntercept, [False, True])\
...     .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
...     .build()

# In this case the estimator is the pipeline (indexing, assembling, and logistic regression).
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
>>> crossval = CrossValidator(estimator=pipeline,
...                           estimatorParamMaps=paramGrid,
...                           evaluator=MulticlassClassificationEvaluator(),
...                           numFolds=3)

# Run cross-validation, and choose the best set of parameters.
>>> cvModel = crossval.fit(train)

# Make predictions on test data. model is the model with combination of parameters
# that performed best.
>>> cvModel.transform(test)\
...     .select("features", "label", "prediction")\
...     .show(5)
+-----------------+-----+----------+
|         features|label|prediction|
+-----------------+-----+----------+
|[4.8,3.4,1.6,0.2]|  1.0|       1.0|
|[4.9,3.1,1.5,0.1]|  1.0|       1.0|
|[5.4,3.4,1.5,0.4]|  1.0|       1.0|
|[5.1,3.4,1.5,0.2]|  1.0|       1.0|
|[5.1,3.8,1.6,0.2]|  1.0|       1.0|
+-----------------+-----+----------+
only showing top 5 rows
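
TrainValidationSplit只对数据做一次拆分(由trainRatio指定训练比例),开销比交叉验证更小。下面是一个简单示意,沿用上文定义的pipeline、paramGrid和train/test:

from pyspark.ml.tuning import TrainValidationSplit

tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)

# Run train-validation split, and choose the best set of parameters.
tvsModel = tvs.fit(train)
tvsModel.transform(test).select("features", "label", "prediction").show(5)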

附录: ROC

PySpark 中并没有在评估模块中直接提供ROC曲线的提取方式,我们可以从以下三个渠道获得:

  • MLlib (DataFrame-based)中的部分分类模型,可以通过summary属性/方法获得
  • Spark 在Scala API中提供了提取ROC曲线的方式,因此我们需要从Scala模块中借用
  • 自定义ROC提取函数

支持summary属性/方法的模型如下:

# Returns the ROC curve, which is a Dataframe having two fields (FPR, TPR) 
# with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.

roc_df = LinearSVCModel.summary.roc
roc_df = LogisticRegressionModel.summary.roc
roc_df = RandomForestClassificationModel.summary.roc
roc_df = MultilayerPerceptronClassificationModel.summary.roc
roc_df = FMClassificationModel.summary.roc

scala中的提取代码如下:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Instantiate metrics object
val metrics = new BinaryClassificationMetrics(predictionAndLabels)

// Precision-Recall Curve
val PRC = metrics.pr

// AUPRC
val auPRC = metrics.areaUnderPR
println(s"Area under precision-recall curve = $auPRC")

// ROC Curve
val roc = metrics.roc

// AUROC
val auROC = metrics.areaUnderROC
println(s"Area under ROC = $auROC")

在 PySpark 中定义一个子类借用scala接口:

from pyspark.mllib.evaluation import BinaryClassificationMetrics

class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)
    
    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets
        # considering there may be one probability per datapoint (at most)
        # The Scala version takes a numBins parameter,
        # but it doesn't seem possible to pass this from Python to Java
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2,
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points
    
    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        points = self._to_list(rdd)
        return zip(*points)  # return tuple(fpr, tpr)

定义子类后具体使用如下:

def extractProbability(row, labelCol='label', probabilityCol='probability'):
    # BinaryClassificationMetrics 需要 (score, label) 形式的二元组,
    # 这里取正类(索引 1)的预测概率作为 score
    return (float(row[probabilityCol][1]), float(row[labelCol]))

pred_rdd = predictions.rdd.map(extractProbability)
fpr, tpr = CurveMetrics(pred_rdd).get_curve('roc')

参考sklearn自定义函数提取ROC:

from pyspark.sql import Window, functions as fn

def roc_curve_on_spark(dataset, labelCol='label', probabilityCol='probability'):
    """
    Returns the receiver operating characteristic (ROC) curve,
        which is a Dataframe having two fields (FPR, TPR) with
        (0.0, 0.0) prepended and (1.0, 1.0) appended to it.
    """
    
    roc = dataset.select(labelCol, probabilityCol)
    
    # window functions
    window = Window.orderBy(fn.desc(probabilityCol))
    
    # accumulate the true positives with decreasing threshold
    roc = roc.withColumn('tps', fn.sum(roc[labelCol]).over(window))
    # accumulate the false positives with decreasing threshold
    roc = roc.withColumn('fps', fn.sum(fn.lit(1) - roc[labelCol]).over(window))

    # The total number of positive / negative samples
    numPositive = roc.tail(1)[0]['tps']
    numNegative = roc.tail(1)[0]['fps']

    roc = roc.withColumn('tpr', roc['tps'] / fn.lit(numPositive))
    roc = roc.withColumn('fpr', roc['fps'] / fn.lit(numNegative))
    
    roc = roc.dropDuplicates(subset=[probabilityCol]).select('fpr', 'tpr')

    # Add an extra threshold position
    # to make sure that the curve starts at (0, 0)
    start_row = spark.createDataFrame([(0.0, 0.0)], schema=roc.schema)
    roc = start_row.unionAll(roc)
    
    return roc
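
使用方式示意如下(假设predictions来自某个二分类模型,probability列为概率向量):

from pyspark.ml.functions import vector_to_array

# 先把正类(索引 1)的概率取出来作为标量列,再传入上面的函数
binary_preds = predictions.select(
    predictions["label"],
    vector_to_array("probability")[1].alias("probability"))

roc = roc_curve_on_spark(binary_preds)
roc_pdf = roc.toPandas()   # 收集到本地后即可用 matplotlib 等工具绘制ROC曲线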

实用工具

数理统计

pyspark.ml.stat 统计模块
  • Correlation.corr(dataset, column, method):计算相关系数矩阵,目前支持Pearson和Spearman相关系数
  • ChiSquareTest.test(dataset, featuresCol, labelCol):卡方独立性检验
  • Summarizer:描述统计,可用指标包括列最大值、最小值、均值、总和、方差、标准差和非零个数,以及总计数

>>> from pyspark.ml.linalg import DenseMatrix, Vectors
>>> from pyspark.ml.stat import Correlation, ChiSquareTest, Summarizer
>>> dataset = [[0, Vectors.dense([1, 0, 0, -2])],
...            [0, Vectors.dense([4, 5, 0, 3])],
...            [1, Vectors.dense([6, 7, 0, 8])],
...            [1, Vectors.dense([9, 0, 0, 1])]]
>>> dataset = spark.createDataFrame(dataset, ['label', 'features'])

# Compute the correlation matrix with specified method using dataset.
>>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
>>> print(str(pearsonCorr).replace('nan', 'NaN'))
DenseMatrix([[ 1.        ,  0.0556...,         NaN,  0.4004...],
             [ 0.0556...,  1.        ,         NaN,  0.9135...],
             [        NaN,         NaN,  1.        ,         NaN],
             [ 0.4004...,  0.9135...,         NaN,  1.        ]])

# Perform a Pearson’s independence test using dataset.
>>> chiSqResult = ChiSquareTest.test(dataset, 'features', 'label').collect()[0]
>>> print("pValues: " + str(chiSqResult.pValues))
pValues: [0.2614641299491107,0.3678794411714422,1.0,0.2614641299491107]
>>> print("degreesOfFreedom: " + str(chiSqResult.degreesOfFreedom))
degreesOfFreedom: [3, 2, 0, 3]
>>> print("statistics: " + str(chiSqResult.statistics))
statistics: [4.0,2.0,0.0,4.0]

# create summarizer for multiple metrics "mean" and "count"
>>> summarizer = Summarizer.metrics("mean", "count")
>>> dataset.select(summarizer.summary(dataset.features)).show(truncate=False)
+--------------------------------+
|aggregate_metrics(features, 1.0)|
+--------------------------------+
|{[5.0,3.0,0.0,2.5], 4}          |
+--------------------------------+
>>> dataset.select(Summarizer.mean(dataset.features)).show(truncate=False)
+-----------------+
|mean(features)   |
+-----------------+
|[5.0,3.0,0.0,2.5]|
+-----------------+

MLlib (RDD-based)

MLlib (RDD-based) 是基于RDD的原始机器学习API。从Spark 2.0开始,基于DataFrame的spark.ml(ML)成为主要的机器学习库。

MLlib的抽象类

  • Vector:向量(mllib.linalg.Vectors),支持dense和sparse(稠密向量和稀疏向量)。区别在于前者会把每一个数值都存储下来,后者只存储非零数值以节约空间。
  • LabeledPoint:(mllib.regression)表示带标签的数据点,包含一个特征向量与一个标签。注意,标签要通过StringIndexer转化成浮点型的。
  • Matrix:(pyspark.mllib.linalg),支持dense和sparse(稠密矩阵和稀疏矩阵)。
  • 各种Model类:每个Model都是训练算法的结果,一般都有一个predict方法可以用来对新的数据点或者数据点组成的RDD应用该模型进行预测

一般来说,大多数算法直接操作由Vector、LabeledPoint或Rating组成的RDD,通常我们从外部读取数据后需要进行转化操作来构建RDD,例如下面的示意。
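
手工构造由LabeledPoint组成的RDD(仅作示意):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# 稠密向量与稀疏向量(稀疏向量只记录非零位置及取值)
dense = Vectors.dense([1.0, 0.0, 3.0])
sparse = Vectors.sparse(3, {0: 1.0, 2: 3.0})

# LabeledPoint:标签(浮点数)+ 特征向量
points = sc.parallelize([
    LabeledPoint(1.0, dense),
    LabeledPoint(0.0, sparse),
])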

from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")