Spark Machine Learning: Classification and Regression

This page covers algorithms for classification and regression. It also includes sections discussing specific classes of algorithms, such as linear methods, trees, and ensembles.

Table of Contents

Classification
-----------Logistic regression
-------------------Binomial logistic regression
-------------------Multinomial logistic regression
-----------Decision tree classifier
-----------Random forest classifier
-----------Gradient-boosted tree classifier
-----------Multilayer perceptron classifier
-----------One-vs-Rest classifier (a.k.a. One-vs-All)
-----------Naive Bayes
Regression
-----------Linear regression
-----------Generalized linear regression
-------------------Available families
-----------Decision tree regression
-----------Random forest regression
-----------Gradient-boosted tree regression
-----------Survival regression
-----------Isotonic regression
-------------------Examples
Linear methods
Decision trees
-----------Inputs and Outputs
-------------------Input Columns
-------------------Output Columns
Tree Ensembles
Random Forests
-----------Inputs and Outputs
-------------------Input Columns
-------------------Output Columns (Predictions)
Gradient-Boosted Trees (GBTs)
-----------Inputs and Outputs
-------------------Input Columns
-------------------Output Columns (Predictions)

1 Classification

1.1 Logistic regression

Logistic regression is a popular method for predicting a categorical response. It is a special case of generalized linear models that predicts the probability of the outcomes. In spark.ml, logistic regression can be used to predict a binary outcome by using binomial logistic regression, or it can be used to predict a multiclass outcome by using multinomial logistic regression. Use the family parameter to select between these two algorithms, or leave it unset and Spark will infer the correct variant.

Multinomial logistic regression can be used for binary classification by setting the family parameter to "multinomial". It will produce two sets of coefficients and two intercepts.

      Note: When fitting a LogisticRegressionModel without an intercept on a dataset with a constant nonzero column, Spark MLlib outputs zero coefficients for the constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.
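
A minimal sketch of this behavior on hypothetical toy data (the constant first feature and the expected zero coefficient simply restate the note above; nothing here is specific to the rest of this page):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

# Toy dataset in which the first feature is a constant nonzero value (1.0).
toy = spark.createDataFrame([
    (0.0, Vectors.dense(1.0, 0.3)),
    (1.0, Vectors.dense(1.0, 1.5)),
    (0.0, Vectors.dense(1.0, 0.5)),
    (1.0, Vectors.dense(1.0, 2.0))], ["label", "features"])

# Fit without an intercept; per the note above, the constant column should
# receive a zero coefficient (matching R glmnet, unlike LIBSVM).
constLr = LogisticRegression(maxIter=10, regParam=0.1, fitIntercept=False)
print("Coefficients: " + str(constLr.fit(toy).coefficients))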

1.1.1 Binomial logistic regression

For more background and details on the implementation of binomial logistic regression, refer to the documentation of logistic regression in spark.mllib.


The following example shows how to train binomial and multinomial logistic regression models for binary classification with elastic net regularization. elasticNetParam corresponds to α and regParam corresponds to λ.

from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

# We can also use the multinomial family for binary classification
mlr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial")

# Fit the model
mlrModel = mlr.fit(training)

# Print the coefficients and intercepts for logistic regression with multinomial family
print("Multinomial coefficients: " + str(mlrModel.coefficientMatrix))
print("Multinomial intercepts: " + str(mlrModel.interceptVector))

Find the full example code at "examples/src/main/python/ml/logistic_regression_with_elastic_net.py" in the Spark repo.

The spark.ml implementation of logistic regression also supports extracting a summary of the model over the training set. Note that the predictions and metrics which are stored as a DataFrame in BinaryLogisticRegressionSummary are annotated @transient and hence only available on the driver.

LogisticRegressionTrainingSummary provides a summary for a LogisticRegressionModel. Currently, only binary classification is supported. Support for multiclass model summaries will be added in the future.

Continuing the earlier example:

from pyspark.ml.classification import LogisticRegression

# Extract the summary from the returned LogisticRegressionModel instance trained
# in the earlier example
trainingSummary = lrModel.summary

# Obtain the objective per iteration
objectiveHistory = trainingSummary.objectiveHistory
print("objectiveHistory:")
for objective in objectiveHistory:
    print(objective)

# Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
trainingSummary.roc.show()
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))

# Set the model threshold to maximize F-Measure
fMeasure = trainingSummary.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']) \
    .select('threshold').head()['threshold']
lr.setThreshold(bestThreshold)
1.1.2 Multinomial logistic regression

Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression, the algorithm produces K sets of coefficients, or a matrix of dimension K×J, where K is the number of outcome classes and J is the number of features. If the algorithm is fit with an intercept term, then a length-K vector of intercepts is available.

The multinomial coefficients are available as coefficientMatrix and the intercepts are available as interceptVector.

The coefficients and intercept methods on a logistic regression model trained with the multinomial family are not supported. Use coefficientMatrix and interceptVector instead.

The conditional probability of the outcome classes k ∈ 1, 2, …, K is modeled using the softmax function:


$$P(Y=k \mid \mathbf{X}, \boldsymbol{\beta}_k, \beta_{0k}) = \frac{e^{\boldsymbol{\beta}_k \cdot \mathbf{X} + \beta_{0k}}}{\sum_{k'=0}^{K-1} e^{\boldsymbol{\beta}_{k'} \cdot \mathbf{X} + \beta_{0k'}}}$$


We minimize the weighted negative log-likelihood, using a multinomial response model, with an elastic-net penalty to control for overfitting:


$$\min_{\boldsymbol{\beta}, \beta_0} \left[ -\sum_{i=1}^{L} w_i \log P(Y = y_i \mid \mathbf{x}_i) \right] + \lambda \left[ \frac{1}{2}(1 - \alpha) \|\boldsymbol{\beta}\|_2^2 + \alpha \|\boldsymbol{\beta}\|_1 \right]$$


The following example shows how to train a multinomial logistic regression model with elastic net regularization.

from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
    .read \
    .format("libsvm") \
    .load("data/mllib/sample_multiclass_classification_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print the coefficients and intercept for multinomial logistic regression
print("Coefficients: \n" + str(lrModel.coefficientMatrix))
print("Intercept: " + str(lrModel.interceptVector))

1.2 Decision tree classifier

Decision trees are a popular family of classification and regression methods. More information about the spark.ml implementation can be found in the section on decision trees.


The following example loads a dataset in LibSVM format, splits it into training and test sets, trains on the first dataset, and then evaluates on the held-out test set. We use two feature transformers to prepare the data; these help index categories for the label and for categorical features, adding metadata to the DataFrame which the decision tree algorithm can recognize.

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load the data stored in LIBSVM format as a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))

treeModel = model.stages[2]
# summary only
print(treeModel)

1.3 Random forest classifier

Random forests are a popular family of classification and regression methods. More information about the spark.ml implementation can be found in the section on random forests.

Example

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

rfModel = model.stages[2]
print(rfModel)  # summary only
1.4 Gradient-boosted tree classifier

Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees. More information about the spark.ml implementation can be found in the section on GBTs.

Example

from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GBT model.
gbt = GBTClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=10)

# Chain indexers and GBT in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, gbt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

gbtModel = model.stages[2]
print(gbtModel)  # summary only
1.5 Multilayer perceptron classifier

The multilayer perceptron classifier (MLPC) is a classifier based on the feedforward artificial neural network. MLPC consists of multiple layers of nodes. Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs by a linear combination of the inputs with the node's weights w and bias b and by applying an activation function. For MLPC with K+1 layers this can be written in matrix form as follows:

$$y(\mathbf{x}) = f_K(\ldots f_2(\mathbf{w}_2^T f_1(\mathbf{w}_1^T \mathbf{x} + b_1) + b_2) \ldots + b_K)$$

Nodes in intermediate layers use the sigmoid (logistic) function:

$$f(z_i) = \frac{1}{1 + e^{-z_i}}$$

Nodes in the output layer use the softmax function:

$$f(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}$$

The number of nodes N in the output layer corresponds to the number of classes.
MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as the optimization routine.

Example

from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load training data
data = spark.read.format("libsvm")\
    .load("data/mllib/sample_multiclass_classification_data.txt")

# Split the data into train and test
splits = data.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]

# specify layers for the neural network:
# input layer of size 4 (features), two intermediate of size 5 and 4
# and output of size 3 (classes)
layers = [4, 5, 4, 3]

# create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)

# train the model
model = trainer.fit(train)

# compute accuracy on the test set
result = model.transform(test)
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
1.6 One-vs-Rest classifier (a.k.a. One-vs-All)

OneVsRest is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently. It is also known as "One-vs-All".
OneVsRest is implemented as an Estimator. For the base classifier it takes instances of Classifier and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes. Predictions are made by evaluating each binary classifier, and the index of the most confident classifier is output as the label.


The example below demonstrates how to load the Iris dataset, parse it as a DataFrame, and perform multiclass classification using OneVsRest. The test error is computed to measure the algorithm's accuracy.

from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# load data file.
inputData = spark.read.format("libsvm") \
    .load("data/mllib/sample_multiclass_classification_data.txt")

# generate the train/test split.
(train, test) = inputData.randomSplit([0.8, 0.2])

# instantiate the base classifier.
lr = LogisticRegression(maxIter=10, tol=1E-6, fitIntercept=True)

# instantiate the One Vs Rest Classifier.
ovr = OneVsRest(classifier=lr)

# train the multiclass model.
ovrModel = ovr.fit(train)

# score the model on test data.
predictions = ovrModel.transform(test)

# obtain evaluator.
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

# compute the classification error on test data.
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
1.7 Naive Bayes

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. The spark.ml implementation currently supports both multinomial naive Bayes and Bernoulli naive Bayes. For more information, see the Naive Bayes section in MLlib.

Example

from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load training data
data = spark.read.format("libsvm") \
    .load("data/mllib/sample_libsvm_data.txt")

# Split the data into train and test
splits = data.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]

# create the trainer and set its parameters
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
model = nb.fit(train)

# select example rows to display.
predictions = model.transform(test)
predictions.show()

# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))
2 Regression

2.1 Linear regression

The interface for working with linear regression models and model summaries is similar to the logistic regression case.
      Note: When fitting a LinearRegressionModel without an intercept on a dataset with a constant nonzero column by the "l-bfgs" solver, Spark MLlib outputs zero coefficients for the constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.


The following example demonstrates training an elastic net regularized linear regression model and extracting model summary statistics.

from pyspark.ml.regression import LinearRegression

# Load training data
training = spark.read.format("libsvm")\
    .load("data/mllib/sample_linear_regression_data.txt")

lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))

# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
2.2 Generalized linear regression

Contrasted with linear regression, where the output is assumed to follow a Gaussian distribution, generalized linear models (GLMs) are specifications of linear models where the response variable Yi follows some distribution from the exponential family of distributions. Spark's GeneralizedLinearRegression interface allows flexible specification of GLMs which can be used for various types of prediction problems including linear regression, Poisson regression, logistic regression, and others. Currently in spark.ml, only a subset of the exponential family distributions is supported (Gaussian, Binomial, Poisson, and Gamma).

Note: Spark currently only supports up to 4096 features through its GeneralizedLinearRegression interface, and will throw an exception if this constraint is exceeded. See the advanced section for more details. Still, for linear and logistic regression, models with an increased number of features can be trained using the LinearRegression and LogisticRegression estimators.

from pyspark.ml.regression import GeneralizedLinearRegression

# Load training data
dataset = spark.read.format("libsvm")\
    .load("data/mllib/sample_linear_regression_data.txt")

glr = GeneralizedLinearRegression(family="gaussian", link="identity", maxIter=10, regParam=0.3)

# Fit the model
model = glr.fit(dataset)

# Print the coefficients and intercept for generalized linear regression model
print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))

# Summarize the model over the training set and print out some metrics
summary = model.summary
print("Coefficient Standard Errors: " + str(summary.coefficientStandardErrors))
print("T Values: " + str(summary.tValues))
print("P Values: " + str(summary.pValues))
print("Dispersion: " + str(summary.dispersion))
print("Null Deviance: " + str(summary.nullDeviance))
print("Residual Degree Of Freedom Null: " + str(summary.residualDegreeOfFreedomNull))
print("Deviance: " + str(summary.deviance))
print("Residual Degree Of Freedom: " + str(summary.residualDegreeOfFreedom))
print("AIC: " + str(summary.aic))
print("Deviance Residuals: ")
summary.residuals().show()
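
Other family/link combinations are specified in the same way. A minimal sketch (hypothetical; Poisson regression additionally assumes non-negative count labels, which the dataset above does not have):

# Poisson regression with the log link; e.g. family="binomial" with link="logit" works analogously.
poissonGlr = GeneralizedLinearRegression(family="poisson", link="log", maxIter=10, regParam=0.3)
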
2.3 Decision tree regression

from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Load the data stored in LIBSVM format as a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeRegressor(featuresCol="indexedFeatures")

# Chain indexer and tree in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, dt])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

treeModel = model.stages[1]
# summary only
print(treeModel)
2.4 Random forest regression

from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestRegressor(featuresCol="indexedFeatures")

# Chain indexer and forest in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, rf])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

rfModel = model.stages[1]
print(rfModel)  # summary only
2.5 Gradient-boosted tree regression

from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GBT model.
gbt = GBTRegressor(featuresCol="indexedFeatures", maxIter=10)

# Chain indexer and GBT in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, gbt])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

gbtModel = model.stages[1]
print(gbtModel)  # summary only
2.6 Survival regression

In spark.ml, we implement the Accelerated Failure Time (AFT) model, a parametric survival regression model for censored data. It describes a model for the log of the survival time, so it is often called a log-linear model for survival analysis. Different from a Proportional Hazards model designed for the same purpose, the AFT model is easier to parallelize because each instance contributes to the objective function independently.


from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors

training = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"])
quantileProbabilities = [0.3, 0.6]
aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
                            quantilesCol="quantiles")

model = aft.fit(training)

# Print the coefficients, intercept and scale parameter for AFT survival regression
print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))
print("Scale: " + str(model.scale))
model.transform(training).show(truncate=False)

2.7 Isotonic regression

from pyspark.ml.regression import IsotonicRegression

# Loads data.
dataset = spark.read.format("libsvm")\
    .load("data/mllib/sample_isotonic_regression_libsvm_data.txt")

# Trains an isotonic regression model.
model = IsotonicRegression().fit(dataset)
print("Boundaries in increasing order: %s\n" % str(model.boundaries))
print("Predictions associated with the boundaries: %s\n" % str(model.predictions))

# Makes predictions.
model.transform(dataset).show()

