Spark and Remote Sensing Image Processing

I read a paper (doi: 10.11809/bqzbgcxb2018.08.023).

Remote sensing image classification methods fall roughly into three categories: methods based on hand-crafted feature descriptors, methods based on machine learning, and methods based on deep learning; each generation of methods has been driven by advances in the underlying technology. Several hand-crafted feature descriptors:

1) Color histogram. It gives a simple description of the color distribution of an image, i.e. the proportion each color occupies, is cheap to compute, and has good invariance to translation and rotation. However, it is sensitive to illumination changes and quantization error, and it conveys no positional information.
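
As a rough illustration (my own sketch, not taken from the paper), a normalized color histogram can be computed with nothing more than NumPy; the bin count and the 8-bit RGB assumption here are arbitrary choices:

import numpy as np

def color_histogram(image, bins=8):
    """Normalized per-channel histogram of an 8-bit RGB image (H x W x 3 array)."""
    counts = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    hist = np.concatenate(counts).astype(float)
    return hist / hist.sum()  # proportions per color bin; position information is lost

# toy usage on a random "image"
img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(color_histogram(img).shape)  # (24,) = 3 channels x 8 bins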

2) Texture features. They describe the surface properties of an image or an image region and can identify images with distinctive textures, but when the image resolution or the illumination changes, the texture may shift considerably and hurt classification.

3) Scale-Invariant Feature Transform (SIFT). A local feature descriptor that characterizes the region around each keypoint using gradient information. The features SIFT extracts are local, invariant to scale and rotation, and reasonably stable under changes in brightness, viewpoint, and affine transformations as well as noise. When there are not too many keypoints the algorithm is relatively fast, which makes it suitable for fast, accurate matching against massive feature databases. However, SIFT uses only gray-level information and cannot exploit an image's color content, and when target images have similar shapes its classification error rate is high.
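
Assuming opencv-python (4.4 or later, where SIFT is included) is available, extracting SIFT keypoints and descriptors looks roughly like this; the input file name is a placeholder:

import cv2  # assumes opencv-python >= 4.4

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), None if descriptors is None else descriptors.shape)  # each descriptor has 128 dimensions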

Several machine-learning methods for remote sensing image classification:

1) Support vector machine (SVM). A supervised learning method that introduces a kernel function to solve an optimization problem in a high-dimensional feature space and find the optimal separating hyperplane, handling complex classification problems. In practice SVM is stable and easy to use, but it performs relatively poorly on multi-class problems, and there is no solid theory for choosing the right kernel function.

2) Decision tree. A decision tree is a classification method based on inductive reasoning. Rules are defined over the image's spectral, color, spatial, and other information; starting from the root node, the values of each kind of information are compared to produce new branches, and the rules are updated to grow a new tree until the classification requirement is met; the final leaf node gives the classification result. In recent years random forest models and CART trees, both built on the decision tree algorithm, have been applied to remote sensing image classification. Decision trees are easy to understand, easy to operate, and can handle multi-output problems, but they generalize poorly and do not perform well on high-dimensional data.

3) Principal component analysis (PCA). A simple, unsupervised model that can learn invariant features usable for multi-class image classification. However, it is linear and cannot produce more abstract representations.

4) k-means clustering. A vector quantization method that partitions the data into k clusters, grouping similar objects into the same cluster. It is easy to understand, has low complexity, can process massive amounts of data in a short time, and clusters reasonably well. Its drawbacks are sensitivity to noise and outliers, the need to fix k before the algorithm runs even though there is no clear theory for choosing k, and the fact that its result is not necessarily the global optimum.
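
Since the rest of this post uses Spark MLlib, here is a minimal k-means sketch with pyspark.ml; the data path and k=3 are placeholders, and k still has to be chosen by hand:

from pyspark.ml.clustering import KMeans
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmeans-sketch").master("local[*]").getOrCreate()
data = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")  # placeholder path

kmeans = KMeans(k=3, seed=1)   # k must be fixed before the algorithm runs
model = kmeans.fit(data)
print(model.clusterCenters())  # one center per cluster
spark.stop()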

Deep-learning-based remote sensing image classification methods

1) Autoencoder. An unsupervised learning algorithm used mainly for dimensionality reduction or feature extraction. Its weakness is poor generalization: when the test samples and the training samples do not come from the same distribution, classification quality suffers.

2) Convolutional neural network (CNN). A network modeled on the mechanisms of the human visual cortex. Its main drawback is that it needs a large training set to learn the parameters of every layer, and as the number of layers grows it is prone to local optima and overfitting.

3) Deep belief network (DBN). A DBN is trained by training each RBM layer separately, which speeds up training, greatly improves the ability to handle complex classification problems, and avoids the local-optimum issues that arise when a deep neural network is trained directly. In several remote sensing classification experiments DBNs reached over 80% accuracy. Their drawbacks are that the model cannot explicitly define the optimal decision boundary between classes, so classification accuracy may be lower than that of discriminative models; in addition, DBNs require the input data to be translation invariant, and poor parameter choices can make learning converge to a local optimum.

4) Transfer learning. Transfer learning takes a model trained on one classification problem and adapts and fine-tunes it so that it can be applied to another classification problem.

Machine learning code examples

Binomial logistic regression

# coding: utf8
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

if __name__ == "__main__":
    # Create or get the SparkSession object
    spark = SparkSession.builder \
        .appName("binomial") \
        .master("local[*]") \
        .getOrCreate()

    # Load the training data
    training = spark.read.format("libsvm").load("hdfs://node1:8020/user/hadoop/sample_libsvm_data.txt")

    # Create the logistic regression estimator
    lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

    # Fit the binomial logistic regression model
    lrModel = lr.fit(training)

    # Print the coefficients and intercept of the binomial model
    print("Coefficients: " + str(lrModel.coefficients))
    print("Intercept: " + str(lrModel.intercept))

    # Create a multinomial logistic regression estimator
    mlr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial")

    # Fit the multinomial logistic regression model
    mlrModel = mlr.fit(training)

    # Print the coefficients and intercepts of the multinomial model
    print("Multinomial coefficients: " + str(mlrModel.coefficientMatrix))
    print("Multinomial intercepts: " + str(mlrModel.interceptVector))

In this data, the label, feature, and value fields mean the following:

  • Label: the class (or class identifier) of each sample. In a binary classification problem the label is usually 0 or 1; in a multi-class problem the label is the index of one of the classes. For example, a label of 1 means the sample belongs to a particular class, and a label of 0 means it does not.

  • Feature: an attribute that describes the sample. Here the features are image features extracted in some way, such as pixel values, color histograms, or gradients. A feature is referred to by its index; each index corresponds to one feature of the sample.

  • Value: the concrete numerical value of a feature, i.e. the sample's value on that feature. It can be a real number or an integer, depending on the type and representation of the feature.


Coefficients: (692,[272,300,323,350,351,378,379,405,406,407,428,433,434,435,455,456,461,462,483,484,489,490,496,511,512,517,539,540,568],[-7.520689871384157e-05,-8.11577314684704e-05,3.814692771846389e-05,0.0003776490540424341,0.0003405148366194407,0.0005514455157343111,0.00040853861160969167,0.00041974673327494573,0.0008119171358670032,0.0005027708372668752,-2.392926040660149e-05,0.0005745048020902299,0.000903754642680371,7.818229700243959e-05,-2.17875519529124e-05,-3.402165821789581e-05,0.0004966517360637634,0.0008190557828370371,-8.017982139522661e-05,-2.743169403783574e-05,0.00048108322262389896,0.00048408017626778744,-8.926472920010679e-06,-0.0003414881233042728,-8.950592574121448e-05,0.0004864546911689218,-8.478698005186158e-05,-0.0004234783215831764,-7.296535777631296e-05])
Intercept: -0.5991460286401442
Multinomial coefficients: 2 X 692 CSRMatrix
(0,272) 0.0001
(0,300) 0.0001
(0,350) -0.0002
(0,351) -0.0001
(0,378) -0.0003
(0,379) -0.0002
(0,405) -0.0002
(0,406) -0.0004
(0,407) -0.0002
(0,433) -0.0003
(0,434) -0.0005
(0,435) -0.0001
(0,456) 0.0
(0,461) -0.0002
(0,462) -0.0004
(0,483) 0.0001
..
..
Multinomial intercepts: [0.2750587585718083,-0.2750587585718083]

Process finished with exit code 0

  1. Coefficients:

    • (692, [...], [...]): this is the model's coefficient vector in sparse form. The vector has length 692, covering every feature the model uses; only some of the coefficients are non-zero, and the non-zero coefficients correspond to the features that significantly influence the classification.
    • [...], [...]: the indices of the non-zero coefficients and their values. For example, [272, 300, 323, ...] are feature indices and [-7.520689871384157e-05, -8.11577314684704e-05, 3.814692771846389e-05, ...] are the corresponding coefficient values. Each index refers to a feature used by the model, and each value measures how strongly that feature influences the prediction: the sign and magnitude express the direction and size of its contribution.
  2. Intercept:

    • -0.5991460286401442: the model's intercept, i.e. the model output when all features are 0. It is the baseline prediction in the absence of feature information; in logistic regression it acts as the bias term that shifts the model's predictions.
  3. Multinomial coefficients:

    • 2 X 692 CSRMatrix: the coefficient matrix of the multinomial model, stored as a sparse matrix with one row per class and one column per feature; each entry is the coefficient of one class on one feature.
    • (0, 272) 0.0001: one entry of the matrix, meaning that class 0 (usually the negative class) has a coefficient of 0.0001 on feature 272; likewise, (0, 300) 0.0001 is class 0's coefficient on feature 300.
    • The shape and values of this matrix describe how strongly each feature influences each class, and they are what the model uses to classify samples.
  4. Multinomial intercepts:

    • [0.2750587585718083, -0.2750587585718083]: the intercept vector of the multinomial model, one entry per class. In this example there are two classes, so the vector has two values: the first is the intercept of class 0 (usually the negative class) and the second is the intercept of class 1 (usually the positive class).

Credit scoring model

Suppose a bank wants to assess the credit risk of credit card applicants. It can build a binomial logistic regression model from the customers' personal information and historical data. The model's goal is to predict, from a customer's attributes, whether the customer will default, i.e. fail to repay the credit card debt on time.

The bank can collect customer attributes such as age, gender, income, and credit history, and use them as the model's input. It can then train a binomial logistic regression model on historical data so that the model can predict default from those attributes.

Once the model is trained, the bank feeds a new customer's attributes into it and the model outputs a probability that the customer will default. If the probability is above some threshold the bank may reject the application; if it is below the threshold the bank may approve it.

In this way the bank can use a binomial logistic regression model to assess customers' credit risk effectively and make sound credit decisions.
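
As a sketch of what such a model might look like in PySpark (the column names, the tiny in-memory dataset, and the decision threshold are all invented for illustration):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("credit-scoring-sketch").master("local[*]").getOrCreate()

# hypothetical customer records: age, income, years of credit history, defaulted (label)
rows = [(25, 6.0, 1.0, 1.0), (40, 20.0, 10.0, 0.0), (33, 12.0, 5.0, 0.0), (22, 4.0, 0.5, 1.0)]
df = spark.createDataFrame(rows, ["age", "income", "history_years", "label"])

# assemble the raw columns into the single "features" vector column that MLlib expects
assembler = VectorAssembler(inputCols=["age", "income", "history_years"], outputCol="features")
train = assembler.transform(df)

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)

# "probability" holds [P(no default), P(default)]; compare P(default) with a business threshold
model.transform(train).select("age", "income", "probability", "prediction").show(truncate=False)
spark.stop()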

Training a binomial logistic regression model involves the following steps:

  1. Prepare the data: first assemble the training dataset, with features and labels. Features are the attributes or variables used for prediction; the label is the outcome we want to predict, usually a binary variable indicating one of two classes.

  2. Preprocess the data: handle missing values, standardize or normalize the features, and so on, so that the data is ready for the model.

  3. Split the dataset: divide the data into a training set and a test set. The training set is used to fit the model and the test set to evaluate its performance.

  4. Choose the model: pick an appropriate binomial logistic regression model and set its hyperparameters, such as the regularization parameter.

  5. Train the model: fit the logistic regression model on the training set. Training adjusts the model parameters with an optimization algorithm (such as gradient descent) so that the model fits the training data as well as possible.

  6. Evaluate the model: assess performance on the test set, typically with metrics such as accuracy, precision, recall, and F1 score.

  7. Tune the model: adjust the model based on the evaluation results; this may mean changing hyperparameters or doing feature engineering.

  8. Apply the model: the trained binomial logistic regression model can be used to predict the labels of new, unseen data or to solve a specific classification task.

These are the general steps for training a binomial logistic regression model; a concrete implementation can use a machine learning library such as Scikit-learn or TensorFlow.

# coding: utf8
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

if __name__ == "__main__":
    # Create or get the SparkSession object
    spark = SparkSession.builder \
        .appName("binomial") \
        .master("local[*]") \
        .getOrCreate()

    # Load the training data
    training = spark.read.format("libsvm").load("hdfs://node1:8020/user/hadoop/sample_libsvm_data.txt")

    # Create the logistic regression estimator
    lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

    # Fit the binomial logistic regression model
    lrModel = lr.fit(training)

    # Print the coefficients and intercept of the binomial model
    print("Coefficients: " + str(lrModel.coefficients))
    print("Intercept: " + str(lrModel.intercept))

    # Create a multinomial logistic regression estimator
    mlr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial")

    # Fit the multinomial logistic regression model
    mlrModel = mlr.fit(training)

    # Print the coefficients and intercepts of the multinomial model
    print("Multinomial coefficients: " + str(mlrModel.coefficientMatrix))
    print("Multinomial intercepts: " + str(mlrModel.interceptVector))

    trainingSummary = lrModel.summary

    # Obtain the objective per iteration
    objectiveHistory = trainingSummary.objectiveHistory
    print("objectiveHistory:")
    for objective in objectiveHistory:
        print(objective)

    # Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
    trainingSummary.roc.show()

    print("areaUnderROC: " + str(trainingSummary.areaUnderROC))

    # Set the model threshold to maximize F-Measure
    fMeasure = trainingSummary.fMeasureByThreshold
    maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
    bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']) \
        .select('threshold').head()['threshold']
    lr.setThreshold(bestThreshold)

This code analyzes the performance of the logistic regression model during training and adjusts the model's threshold to maximize the F-measure.

Specifically, it does the following:

  1. Extracts the training summary from the fitted logistic regression model.
  2. Retrieves the objective value of every training iteration and prints it, so the convergence of training can be inspected.
  3. Retrieves and displays the ROC curve from the training summary, and computes and prints the area under the ROC curve.
  4. Finds the threshold that maximizes the F-measure and sets it on the model, so that the model's F-measure is maximized.

These steps help you evaluate the model's performance and make the adjustments needed to get better results on a binary classification problem.

A problem I am currently running into is that the visualization cannot be produced inside PyCharm.
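
One workaround (my own suggestion, not part of the original script) is to collect the small ROC DataFrame to pandas and plot it with matplotlib, which works inside PyCharm without any Spark-side plotting support:

import matplotlib.pyplot as plt

# trainingSummary comes from lrModel.summary in the script above
roc_pdf = trainingSummary.roc.toPandas()  # two columns: FPR and TPR
plt.plot(roc_pdf["FPR"], roc_pdf["TPR"])
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC (AUC = %.3f)" % trainingSummary.areaUnderROC)
plt.show()  # or plt.savefig("roc.png") when running on a machine without a display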

Multinomial logistic regression

Multinomial logistic regression is the extension of logistic regression from two classes to more than two, and it is what Spark fits below on the three-class dataset. In ordinary (binomial) logistic regression the model learns a single weight vector and looks for one linear decision boundary separating two classes; in the multinomial case it learns one weight vector and one intercept per class.

For a given sample the model computes a linear score for every class and turns the scores into a probability distribution over the K classes with the softmax function; the predicted class is the one with the highest probability. It should not be confused with polynomial logistic regression, which instead expands the features with higher-order terms (powers and products of x1, x2, ...) to fit non-linear decision boundaries.

Training proceeds much as in ordinary logistic regression: an optimization algorithm such as gradient descent minimizes the (regularized) multinomial cross-entropy loss to learn the coefficient matrix and intercept vector. Because there are K times as many parameters as in the binary case, regularization and sensible hyperparameters matter for avoiding overfitting.

Overall, multinomial logistic regression keeps the simplicity and probabilistic outputs of logistic regression while handling more than two categories, which makes it a natural baseline for multi-class problems.
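
For K classes the model learns one weight vector w_k and one intercept b_k per class and converts the class scores into probabilities with the softmax function (standard formulation, shown here only for reference):

P(y = k \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_k^{\top}\mathbf{x} + b_k)}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^{\top}\mathbf{x} + b_j)}, \qquad k = 1, \dots, K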

# coding: utf8
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel  # LogisticRegressionModel is needed to load the saved model
from pyspark.sql import SparkSession
import shutil  # shutil is used to delete the model directory
import os
if __name__ == "__main__":
    # Create or get the SparkSession object
    spark = SparkSession.builder \
        .appName("Multinomial Logistic") \
        .master("local[*]") \
        .getOrCreate()

    # Load training data
    training = spark \
        .read \
        .format("libsvm") \
        .load("hdfs://node1:8020/user/hadoop/sample_multiclass_classification_data.txt")

    lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

    # Fit the model
    lrModel = lr.fit(training)

    # If the model save path already exists, delete it
    model_path = "logistic_regression_model"
    if os.path.exists(model_path):
        shutil.rmtree(model_path)

    # Save the model
    lrModel.write().overwrite().save(model_path)

    # Load the model back
    loadedModel = LogisticRegressionModel.load(model_path)

    # Print the coefficients and intercept for multinomial logistic regression
    print("Coefficients: \n" + str(lrModel.coefficientMatrix))
    print("Intercept: " + str(lrModel.interceptVector))

    trainingSummary = lrModel.summary

    # Obtain the objective per iteration
    objectiveHistory = trainingSummary.objectiveHistory
    print("objectiveHistory:")
    for objective in objectiveHistory:
        print(objective)

    # Print the false positive rate by label
    print("False positive rate by label:")
    for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):
        print("label %d: %s" % (i, rate))

    # Print the true positive rate by label
    print("True positive rate by label:")
    for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):
        print("label %d: %s" % (i, rate))

    # Print the precision by label
    print("Precision by label:")
    for i, prec in enumerate(trainingSummary.precisionByLabel):
        print("label %d: %s" % (i, prec))

    # Print the recall by label
    print("Recall by label:")
    for i, rec in enumerate(trainingSummary.recallByLabel):
        print("label %d: %s" % (i, rec))

    # Print the F-measure by label
    print("F-measure by label:")
    fMeasureByLabel = trainingSummary.fMeasureByLabel()
    for i, f in enumerate(fMeasureByLabel):
        print("label %d: %s" % (i, f))

    accuracy = trainingSummary.accuracy
    falsePositiveRate = trainingSummary.weightedFalsePositiveRate
    truePositiveRate = trainingSummary.weightedTruePositiveRate
    fMeasure = trainingSummary.weightedFMeasure
    precision = trainingSummary.weightedPrecision
    recall = trainingSummary.weightedRecall
    print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
          % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))

  1. Load the training data:

    • spark.read.format("libsvm").load() loads data in LIBSVM format.
  2. Build the logistic regression model:

    • A LogisticRegression estimator is created with a few hyperparameters: the maximum number of iterations, the regularization parameter, and the elastic-net parameter.
  3. Fit the model:

    • fit() fits the logistic regression model to the training data.
  4. Check whether the model save path exists and delete it if it does:

    • os.path.exists() checks whether the path exists; if it does, shutil.rmtree() removes the directory.
  5. Save the model:

    • write().overwrite().save() saves the fitted model to the given path.
  6. Load the model:

    • LogisticRegressionModel.load() loads the model back from that path.
  7. Print the model's coefficients and intercepts:

    • The coefficient matrix and intercept vector of the logistic regression model are printed.
  8. Get the training summary:

    • The model's summary attribute exposes the training summary.
  9. Print the objective value of every iteration:

    • The summary's objectiveHistory attribute holds the objective value of each iteration, which is printed.
  10. Print the false positive rate, true positive rate, precision, recall, and F-measure per label:

    • These metrics are read from the corresponding summary attributes and printed.
  11. Print the weighted accuracy, false positive rate, true positive rate, F-measure, precision, and recall:

    • These metrics are read from the corresponding summary attributes and printed.

Applications

  1. Text classification: in natural language processing, multinomial logistic regression is often used for text classification tasks such as spam detection, sentiment analysis, and news categorization. By representing the text features as vectors and feeding them to a multinomial logistic regression model, text data can be classified effectively.

  2. Image classification: in computer vision, multinomial logistic regression can be applied to image classification tasks such as image recognition, face recognition, and object detection. Features extracted from images are used as the model's input, and the model classifies the images.

# Import the required libraries
# coding: utf8
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

# Create or get the SparkSession object
spark = SparkSession.builder \
    .appName("Text Classification with Polynomial Logistic Regression") \
    .getOrCreate()

# Example text data (label and text pairs)
data = [(0, "hello world"), (1, "foo bar"), (0, "spark is amazing"), (1, "pyspark is cool")]

# Convert the data into a DataFrame
df = spark.createDataFrame(data, ["label", "text"])

# Use the tokenizer to split the text into token sequences
tokenizer = Tokenizer(inputCol="text", outputCol="words")
wordsData = tokenizer.transform(df)
# Show the tokens column
wordsData.select("words").show(truncate=False)

# Use HashingTF to turn token sequences into feature vectors
hashingTF = HashingTF(inputCol="words", outputCol="features")
featuresData = hashingTF.transform(wordsData)

# Initialize the logistic regression estimator
lr = LogisticRegression(maxIter=10, regParam=0.001, elasticNetParam=0.1)

# Fit the model
lrModel = lr.fit(featuresData)

# Print the model coefficients and intercept
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

# Stop the SparkSession
spark.stop()

A simple Python example of logistic regression applied to text classification.

In natural language processing, a tokenizer is a tool that splits text into tokens. Tokens are the basic units of text: words, punctuation marks, numbers, and so on. The tokenizer's job is to break a continuous character sequence into tokens with independent meaning according to some rules.

In the code example above, the tokenizer splits each sentence of the text data into a sequence of words, preparing it for the feature extraction and model training that follow. After tokenization, every text sample is represented as a list of tokens, which is convenient for later processing and analysis.

A token sequence is the sequence of tokens produced by the tokenizer. In natural language processing, text usually has to be converted into a numerical representation, and the token sequence is an important intermediate step in that conversion.

A token sequence can be understood simply as the ordered arrangement of the words, punctuation marks, or other basic units in the text. It reflects the order and frequency of the tokens and is the basis for subsequent feature extraction and model training.

In machine learning tasks, token sequences are usually converted into numerical feature vectors, for example with a bag-of-words model or word embeddings, so that the model can process and learn from the text data.

The dimensionality of a feature vector is the dimensionality of the feature space, i.e. the size of the space used to describe a sample. In machine learning, every sample can be represented by a feature vector, and its dimensionality depends on the number of features chosen or on the feature extraction method.

Here the dimensionality is determined by the feature extraction method (HashingTF) and the size of its feature space. In the code above, the hashing feature extractor (HashingTF) converts token sequences into feature vectors by hashing each token into a fixed-size feature space; the size of that space determines the dimensionality of the feature vector.

In the output, the feature vectors have dimensionality 262144, which means the feature space has 262144 slots, so every sample's feature vector has 262144 entries.
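
262144 is simply HashingTF's default dimensionality (2^18). If a smaller feature space is wanted for such a tiny dataset, the dimension can be set explicitly when the transformer is constructed; 1024 below is an arbitrary choice:

# same pipeline as above, but with an explicit, smaller feature space
hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=1024)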

Decision tree classifier

# Import the required libraries
# coding: utf8
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DTC") \
    .getOrCreate()

# Load the data stored in LIBSVM format as a DataFrame.
data = spark.read.format("libsvm").load("hdfs://node1:8020/user/hadoop/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))

treeModel = model.stages[2]
# summary only
print(treeModel)

/home/hadoop/.virtualenvs/sparkkkk/bin/python /tmp/pycharm_project_330/mlib/Decision tree classifier.py 
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/16 10:11:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/04/16 10:11:10 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
+----------+------------+--------------------+
|prediction|indexedLabel|            features|
+----------+------------+--------------------+
|       1.0|         1.0|(692,[121,122,123...|
|       1.0|         1.0|(692,[122,123,148...|
|       1.0|         1.0|(692,[124,125,126...|
|       1.0|         1.0|(692,[124,125,126...|
|       1.0|         1.0|(692,[124,125,126...|
+----------+------------+--------------------+
only showing top 5 rows

Test Error = 0.0454545 
DecisionTreeClassificationModel: uid=DecisionTreeClassifier_026252faa46f, depth=1, numNodes=3, numClasses=2, numFeatures=692

Process finished with exit code 0

Code analysis

  1. Import the required libraries: the PySpark classes used here, including Pipeline, DecisionTreeClassifier, StringIndexer, VectorIndexer, MulticlassClassificationEvaluator, and SparkSession.

  2. Create a SparkSession: a SparkSession named "DTC" is created (or an existing one is reused).

  3. Load the data: spark.read.format("libsvm").load("hdfs://node1:8020/user/hadoop/sample_libsvm_data.txt") loads LIBSVM-format data from HDFS into the DataFrame data.

  4. Label indexing: StringIndexer converts the label column into indices stored in the new column "indexedLabel".

  5. Feature indexing: VectorIndexer automatically detects categorical features and indexes them; maxCategories is set so that features with more than 4 distinct values are treated as continuous.

  6. Split the dataset: randomSplit divides the data into a training set (70%) and a test set (30%).

  7. Define the decision tree model: a DecisionTreeClassifier object dt is created with the label column and feature column specified, and placed in a Pipeline together with the label indexer and the feature indexer.

  8. Train the model: pipeline.fit(trainingData) fits the pipeline on the training data, which also runs the indexers.

  9. Make predictions: the trained model is applied to the test data, and the predictions are stored in the DataFrame predictions.

  10. Display the predictions: the first 5 rows are shown, with the predicted value, the indexed label, and the features.

  11. Compute the test error: MulticlassClassificationEvaluator computes the model's accuracy, and the test error is printed.

  12. Get the decision tree model: the trained decision tree model is retrieved from the fitted Pipeline and stored in the variable treeModel.

  13. Print the decision tree model: the model's summary is printed, including its depth, number of nodes, number of classes, and number of features.

Below is an excerpt of the preprocessed image data in text form:

1 130:116 131:255 132:123 157:29 158:213 159:253 160:122 185:189 186:253 187:253 188:122 213:189 214:253 215:253 216:122 241:189 242:253 243:253 244:122 267:2 268:114 269:243 270:253 271:186 272:19 295:100 296:253 297:253 298:253 299:48 323:172 324:253 325:253 326:253 327:48 351:172 352:253 353:253 354:182 355:19 378:133 379:251 380:253 381:175 382:4 405:107 406:251 407:253 408:253 409:65 432:26 433:194 434:253 435:253 436:214 437:40 459:105 460:205 461:253 462:253 463:125 464:40 487:139 488:253 489:253 490:253 491:81 514:41 515:231 516:253 517:253 518:159 519:16 541:65 542:155 543:253 544:253 545:172 546:4 569:124 570:253 571:253 572:253 573:98 597:124 598:253 599:253 600:214 601:41 624:22 625:207 626:253 627:253 628:139 653:124 654:253 655:162 656:9

This data is stored in LIBSVM format, where each line is one sample. In LIBSVM format, the first number on a line is the sample's label, followed by space-separated featureIndex:featureValue pairs.

For example, in the line above, the leading 1 is the label, and the key-value pairs that follow (130:116 131:255 132:123 ...) give feature indices and feature values. The feature indices are assigned according to pixel positions in the image, and the feature values represent pixel intensities or colors.

Random forest classifier

# coding: utf8
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Random_forest") \
    .getOrCreate()

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("hdfs://node1:8020/user/hadoop/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

rfModel = model.stages[2]
print(rfModel)  # summary only

/home/hadoop/.virtualenvs/sparkkkk/bin/python /tmp/pycharm_project_330/mlib/Random forest classifier.py 
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/16 11:03:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/04/16 11:04:00 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
+--------------+-----+--------------------+
|predictedLabel|label|            features|
+--------------+-----+--------------------+
|           1.0|  0.0|(692,[100,101,102...|
|           0.0|  0.0|(692,[122,123,124...|
|           0.0|  0.0|(692,[122,123,148...|
|           0.0|  0.0|(692,[123,124,125...|
|           0.0|  0.0|(692,[127,128,129...|
+--------------+-----+--------------------+
only showing top 5 rows

Test Error = 0.037037
RandomForestClassificationModel: uid=RandomForestClassifier_e92b91676967, numTrees=10, numClasses=2, numFeatures=692

Process finished with exit code 0
 

  1. Import the required libraries and modules.
  2. Create a SparkSession object.
  3. Load the data from a LIBSVM-format file and convert it into a DataFrame.
  4. Index the labels, adding metadata to the label column.
  5. Automatically identify the categorical features and index them.
  6. Split the data into training and test sets.
  7. Train a random forest classifier model.
  8. Convert the indexed labels back to the original labels.
  9. Chain the indexers, the random forest model, and the label converter in a pipeline.
  10. Train the model and make predictions.
  11. Display a few example rows of the predictions.
  12. Compute the test error.
  13. Print the summary of the random forest model.

Random forests and decision trees are both common machine learning algorithms used for classification and regression. They are related in some ways and differ in others:

Similarities:

  1. Tree-based: both random forests and decision trees are tree-structured algorithms that make predictions by building decision trees.

  2. Interpretability: both are interpretable to some degree; decision trees in particular are easy to explain, because their decision process is a sequence of if-else tests, so the basis of each decision can be read directly.

  3. Easy to implement and use: both are relatively easy to implement and use, and they usually achieve decent performance on classification problems.

Differences:

  1. One tree vs. many trees: a decision tree is a single tree model that classifies or regresses through a sequence of decision nodes and leaf nodes; a random forest is an ensemble of many decision trees whose final prediction is obtained by voting or averaging.

  2. Randomness: a random forest injects randomness while each tree is built, in two ways: each tree is trained on a bootstrap sample drawn from the original dataset, and at each node only a random subset of the features is considered for the split. A decision tree, by contrast, is usually grown greedily, choosing the best feature to split on according to some criterion (information gain, Gini index) without any randomness.

  3. Performance and robustness: a random forest usually performs better and is more robust than a single decision tree, because ensembling many models reduces variance and the risk of overfitting; it also does better on high-dimensional data and on data with strongly correlated features.

  4. Computational cost: a random forest is usually more expensive than a single decision tree, because many trees have to be built and combined, but its training can be accelerated with parallel computation.

The difference between classification and regression
  1. Classification

    • Concept: classification is a supervised learning task whose goal is to assign data instances to predefined classes or labels; we build a model that predicts which class an instance belongs to.
    • Examples: predicting whether an email is spam or not, recognizing the animal in a picture as a cat, dog, or bird, or predicting from a patient's symptoms whether they have a certain disease.
  2. Regression

    • Concept: regression is also a supervised learning task, but its goal is to predict a continuous output value rather than a discrete class label; we build a model that predicts a numeric target.
    • Examples: predicting the selling price of a house, forecasting a stock price from historical data, or estimating a person's age from their height, weight, and other attributes.

In short, both classification and regression are supervised learning tasks; the difference is that classification predicts discrete class labels while regression predicts continuous numeric outputs.

Gradient-boosted tree classifier

In machine learning and optimization, the gradient is the rate of change, or slope, of a function at a point. For a function of several variables, the gradient is a vector whose components are the partial derivatives of the function with respect to each variable; it describes how fast the function changes at that point along each variable's direction.
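
In symbols, for a differentiable function f of n variables (standard definition, included only for reference):

\nabla f(x_1, \dots, x_n) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right)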

from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Multilayer perceptron classifier") \
    .getOrCreate()

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("hdfs://node1:8020/user/hadoop/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GBT model.
gbt = GBTClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=10)

# Chain indexers and GBT in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, gbt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

gbtModel = model.stages[2]
print(gbtModel)  # summary only

/home/hadoop/.virtualenvs/sparkkkk/bin/python /tmp/pycharm_project_330/mlib/Gradient-boosted tree classifier.py 
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/16 14:14:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/04/16 14:14:29 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
24/04/16 14:14:43 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/04/16 14:14:43 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
+----------+------------+--------------------+
|prediction|indexedLabel|            features|
+----------+------------+--------------------+
|       1.0|         1.0|(692,[95,96,97,12...|
|       1.0|         1.0|(692,[121,122,123...|
|       1.0|         1.0|(692,[122,123,148...|
|       1.0|         1.0|(692,[123,124,125...|
|       1.0|         1.0|(692,[123,124,125...|
+----------+------------+--------------------+
only showing top 5 rows

Test Error = 0.027027
GBTClassificationModel: uid = GBTClassifier_48705517b6e0, numTrees=10, numClasses=2, numFeatures=692

Process finished with exit code 0
 

Multilayer perceptron classifier

# coding: utf8
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

# Load training data
spark = SparkSession.builder \
    .appName("Multilayer perceptron classifier") \
    .getOrCreate()
data = spark.read.format("libsvm")\
    .load("hdfs://node1:8020/user/hadoop/sample_multiclass_classification_data.txt")

# Split the data into train and test
splits = data.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]

# specify layers for the neural network:
# input layer of size 4 (features), two intermediate of size 5 and 4
# and output of size 3 (classes)
layers = [4, 5, 4, 3]

# create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)

# train the model
model = trainer.fit(train)

# compute accuracy on the test set
result = model.transform(test)
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))

/home/hadoop/.virtualenvs/sparkkkk/bin/python /tmp/pycharm_project_330/mlib/Multilayer perceptron classifier.py 
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/16 14:24:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/04/16 14:24:56 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
24/04/16 14:25:04 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/04/16 14:25:04 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
Test set accuracy = 0.9523809523809523

Process finished with exit code 0

A multilayer perceptron (MLP) is a neural-network-based classifier. It consists of one or more intermediate layers (also called hidden layers) and an output layer, each containing multiple neurons. Each neuron receives the inputs from the previous layer and passes a weighted sum on to the neurons of the next layer; the final result is produced by the output layer.

An MLP can be used for classification: the input data is fed into the network, each layer performs its computation, and the final output is typically a probability distribution or a class label. During training, the MLP adjusts the weights between neurons with the backpropagation algorithm to minimize the error between the predicted outputs and the true labels.

Advantages of MLPs:

  1. They can model complex non-linear relationships.
  2. They generalize well and can adapt to many types of data.
  3. The network structure and parameters can be adjusted to fit different problems.

However, MLPs also have drawbacks:

  1. They are sensitive to parameter choices; the network structure and parameters need careful tuning.
  2. Training can be time-consuming, especially on large datasets.
  3. They place high demands on feature engineering; the input data needs appropriate preprocessing and feature selection.

A gradient-boosted tree (GBT) classifier is an ensemble learning algorithm that combines many decision trees for classification. It is iterative: at every step a new decision tree is built on top of the previous model, and the error is reduced step by step in the manner of gradient descent. Concretely, each tree is trained on the residuals (the differences between the predictions and the true values) of the trees before it.

A gradient-boosted tree classifier works as follows:

  1. Initialize a simple model (for example a single-node tree, also called a stump) as the initial prediction.
  2. Compute the residuals of the current model, i.e. the differences between the true values and the current predictions.
  3. Fit a new decision tree to the residuals.
  4. Update the model in the spirit of gradient descent by adding the new tree's predictions to the previous model's predictions, reducing the residuals.
  5. Repeat steps 2-4 until the preset number of iterations or some other stopping condition is reached.
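
A toy sketch of steps 1-5 on a one-dimensional regression problem; it uses scikit-learn stumps purely for illustration and is not how Spark's GBTClassifier is implemented internally:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

prediction = np.full_like(y, y.mean())  # step 1: start from a trivial constant model
learning_rate = 0.1
for _ in range(50):                     # steps 2-5: repeat
    residual = y - prediction           # step 2: residuals of the current ensemble
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)  # step 3: fit a tree to the residuals
    prediction += learning_rate * stump.predict(X)               # step 4: add the scaled correction

print("final training MSE:", np.mean((y - prediction) ** 2))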

Advantages of gradient-boosted tree classifiers:

  • They can handle many kinds of data, including continuous and discrete features.
  • They perform well on high-dimensional, sparse data.
  • By combining many weak classifiers they produce a stronger overall model with high accuracy.

However, gradient-boosted tree classifiers also have drawbacks:

  • They are sensitive to parameter choices, and the parameters must be tuned carefully to get the best performance.
  • Training is relatively slow, especially on large datasets.
  • They can overfit, so appropriate regularization is needed.

Linear Support Vector Machine

from pyspark.ml.classification import LinearSVC
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SVM") \
    .getOrCreate()
# Load training data
training = spark.read.format("libsvm").load("hdfs://node1:8020/user/hadoop/sample_libsvm_data.txt")

lsvc = LinearSVC(maxIter=10, regParam=0.1)

# Fit the model
lsvcModel = lsvc.fit(training)

# Print the coefficients and intercept for linear SVC
print("Coefficients: " + str(lsvcModel.coefficients))
print("Intercept: " + str(lsvcModel.intercept))

/home/hadoop/.virtualenvs/sparkkkk/bin/python /tmp/pycharm_project_330/mlib/SVM.py 
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/16 14:40:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/04/16 14:40:05 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
24/04/16 14:40:12 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/04/16 14:40:12 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
24/04/16 14:40:12 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
24/04/16 14:40:12 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
Coefficients: [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.000151540818914002,-3.435743262827576e-05,6.886872377165413e-05,0.0005825396368790323,0.00026586674379748743,-5.444899023220583e-06,-0.00041087629891130826,-0.00023771334401618917,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001505124319737534,0.0005056741785679462,0.0007739871946118442,-7.439317729362514e-05,2.239542955153309e-07,2.1502767568913162e-05,4.0001557959068066e-05,2.8410459888260185e-05,1.2172703998609037e-05,-1.4702408529921132e-05,-4.00596456869388e-05,3.0693747761902226e-06,0.00015395475863074336,0.00015205858963404875,-0.00021785419667457327,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0015601898436303687,0.000530852975841039,0.0004397009308243566,0.00017801052523689224,1.0432168565532878e-06,-4.521576784295899e-05,-5.364735591921346e-06,-5.082816080173391e-05,-8.719944876885249e-05,-0.00010329102584755735,-5.720372189549556e-05,-1.528195691009418e-05,-7.516369861689187e-06,5.902800318288018e-06,6.1921291400703704e-06,-2.902021818279389e-05,-0.00017809566774662006,-9.983865716687191e-05,-0.000606163275656008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0007310603838725165,0.0003393134198685142,0.0,0.0,4.7022680265078096e-07,-3.0241317159223564e-05,-7.10348104168034e-05,-6.108515880287018e-05,-3.889276771633554e-05,-8.652698004632078e-06,-2.8341941244890384e-05,-2.0466415411014446e-05,-3.674573915017901e-05,-4.947197882344821e-06,-3.2188554418463576e-05,-2.8248522773452058e-05,-6.284496014016098e-05,-3.394514343673645e-05,-5.812524561085008e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0003115063595155373,-0.0,5.553299293139149e-07,7.441711449751372e-07,0.0,-8.30091238147767e-05,-8.553731759621603e-05,-4.181406659858972e-05,-0.0,-0.0,2.0234777223150883e-06,-8.142025582791757e-05,-7.129749332452181e-05,-4.0213407153275385e-05,-3.1803153621365364e-05,1.9648605400311107e-06,4.44595506009114e-05,-3.8779179213229064e-05,-3.394514343673645e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0011643888904754936,-0.0001442635664612041,-9.610481791078417e-05,0.0,1.0743649829862987e-05,-5.139755198444354e-05,-0.00011570846247739716,-3.9208774186147584e-05,1.0112528287628887e-05,3.049939229306956e-05,3.895560584341707e-05,-1.493593811857197e-05,-5.9516946069939794e-05,-0.00011187187632913675,-9.937928168157008e-05,-6.757258785653084e-05,-5.691368023902606e-05,7.799370328841447e-08,-8.279979951133446e-05,-4.001765855548631e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00019968045299434062,-9.854528194292579e-05,-9.06787956689978e-05,-3.2958470805746296e-05,-2.7168478207666738e-05,-9.659694587186775e-05,-0.000133216562658181,1.9035264647730756e-07,2.0656673287296924e-05,2.4034390766692266e-05,5.084861154094744e-05,1.1016212232150315e-05,2.1594040378739754e-05,-5.5580114840823516e-05,-0.00015508630557679168,-0.00012366326328170244,-0.00015020950238405558,-0.00011116516675139367,-8.802408072215427e-05,-7.553443325953689e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0002458631820038749,-0.00010029984235020674,-8.989186858906073e-05,-5.770623432643776e-05,-7.014304439686145e-05,-0.00012918480020107103,-7.437813271347968e-05,7.1882850247797465e-06,-2.3405278472273958e-05,8.366700159972187e-05,0.0001180531747701669
6,6.786235197852911e-05,8.633480526970694e-05,-7.376604807031049e-07,-0.000151434097017997,-0.00016052664166849652,-0.00016031428263756824,-0.00016712444608586575,-0.0002702285028789599,-0.00010208014389252895,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00011381649808139322,-0.00012853900224732654,-7.858987300608407e-05,-4.961326895893608e-05,-7.368923124545225e-05,-8.904140787642681e-05,-8.176832354046701e-05,-4.641614552837142e-06,6.227181073130982e-05,9.372925449086643e-05,0.00014397438677981196,0.00012049540268348206,0.00011923985682587849,-5.272604839089627e-07,-0.00020505931953059513,-0.00015772514809660589,-9.956028412826106e-05,-0.00014288872963524575,-0.00022510278740142168,-0.00029794915497267064,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00015325073343250197,-0.00012604815072848468,-4.317256349973646e-06,-1.4477656396112423e-05,-8.554728785579455e-05,-0.00012679008560940376,-7.736057830884105e-05,-1.6514032330287893e-05,0.00010615875760051061,0.00016729758900305763,0.00019420966427041673,0.00018477147022107505,0.00014949777799656936,-0.00010194954765362675,-0.00023030613467332517,-0.00016305495834265712,-9.293149102117208e-05,-8.087309643293946e-05,-0.00017920150773378875,-0.000323500783961243,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00021357274846074226,-6.935298874221008e-05,-4.033841423713148e-05,-0.00010319138181598709,-0.0001325551947981126,-0.00015774942194115841,-0.00013206249740797867,-1.4714948506689e-05,0.00016617832380765878,0.00017603927920376526,0.0002125073318448685,0.00021641375548555675,0.00019174541878850927,-0.00019028890977160822,-0.0001981140237184899,-0.0001692251300989073,-0.00010578794487033083,-7.460329842588546e-05,-0.00014474304562096655,-0.00031154045654731803,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00044664662416758484,-0.00020933591722690955,-0.00010000066384810988,-7.635465682249809e-05,-0.00012395312447696194,-0.000154047582760572,-0.00017144872308501,-0.0001803373579249996,6.0406941582332507e-05,0.0001694072676243569,0.00018190988075040748,0.0002398686582274298,0.00026176654027378315,0.00022201038400100373,-0.00022961196551806026,-0.00017177659316280024,-0.00017123648335213264,-0.00013446375032799744,-6.661951198606293e-05,-0.00013297432787403674,-0.00030819277984344865,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0003343106385704522,-0.00023415443894331335,-0.00010289409968140962,-8.365317853941958e-05,-0.00012876320116322128,-0.0001671386182832452,-0.00018560488934962793,-0.0002266501304483015,9.996194014947e-05,0.00016797705458354632,0.0002221386537218973,0.00023911820548994426,0.0003630317804106842,3.558707179600037e-07,-0.0002217493228035484,-0.0001616212414813466,-0.00017262335165929733,-9.765145366137346e-05,-0.00010794021242249001,-0.0001700816693393262,-0.0002769208151028972,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00032059342868248484,-0.00019274642289361955,-0.00010548344141013058,-9.006855001130945e-05,-0.00014162758357978548,-0.0001669941613272473,-0.00020132940540833397,0.0,9.062509847652712e-05,0.00015443925047166118,0.0002294566423104623,0.00023558067616924393,0.0003246208764546575,4.4890058040356185e-07,-0.00016227639709536374,-0.0001570682369622788,-0.0001548548595559019,-0.00016398027475394782,-0.00014925650513146824,-0.00011395588919865061,-0.00011232383931359654,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-8.922486401510357e-05,-0.00017471381878775272,-0.00012695133948604663,-9.320093204992395e-05,-0.0001454925680202322,-0.00016370360895537824,-0.0001733211964705024,-5.093027056578112e-07,7.551361434969898e-05,0.0001542706974096051,0.00022010133887952977,0.0002227708458770283,0.00026739147227770497
,-2.200990413225511e-05,-0.00010912058056384217,-0.00011576848333402944,-0.0001924870269381932,-0.00018299765092443575,-9.844832186325834e-05,-6.853925950903189e-05,-8.923190137342486e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0001439864748763968,-0.00012312957163008355,-0.00015127036919041953,-0.00012454729625486,-0.0001519246272455006,-0.00011276708744913927,-7.742595706608118e-05,-1.0863782311958194e-06,9.45285567479168e-05,0.00019742096418079005,0.00019392492623082654,0.00019118645233514786,6.298261424113727e-05,-0.0001277190368753301,-0.0001288365096541885,-0.00012727420492068204,-0.00011282472516802393,-6.417229621722396e-05,-6.3789344963145e-05,-8.059997981966417e-05,-0.00013517507036611858,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0001475474259972673,-9.192016024768132e-05,-0.00012986904186441852,-0.00015856282055163302,-0.0001861620132347176,-0.00010791390785661079,-1.4844402943742794e-05,9.562257307504934e-06,9.793339513047204e-05,0.0001783239297671775,0.0001456032490062522,7.354807281694767e-05,-5.718262068948761e-05,-9.20618496664701e-05,-6.624439603592721e-05,-2.0249996445565e-05,-3.4275078547386504e-05,-3.897591392047523e-05,-6.981400709195266e-05,-8.687105802525928e-05,-0.0002638800493838414,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0001501842189676077,-9.155635645507783e-05,-6.876580989034069e-05,-6.99863863061361e-05,-0.00013474741540825817,-0.00013149764673220412,-6.405274837139368e-05,-4.589294834697901e-05,-7.115900395761934e-06,4.4177268762394034e-05,7.991417730925031e-05,-6.270768090017473e-06,-4.465249533704511e-05,-2.971509202096875e-05,-0.0,-5.4356289833655976e-05,-1.071930174767036e-05,-2.1810710520004282e-05,-8.322340266075764e-05,-0.000130048593760808,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00011625049122170016,-5.3297640906905504e-05,4.441368650716584e-06,1.772090583579029e-06,-3.873592555499278e-05,-7.963109993815722e-05,-4.6868163212621335e-05,-3.380593785614324e-05,-6.846160004383539e-05,-1.385957654535632e-05,-3.3664902421694043e-07,-1.996510936009237e-05,-4.3518957558364956e-05,1.190900618069273e-06,-3.7381298089749174e-05,-5.335612631049502e-05,-1.836784427568525e-05,-2.8900386182550268e-05,-0.0001090242933191252,-0.0007289428170489049,0.0,0.0,1.1815009038687371e-06,0.0,0.0,0.0,0.0,0.0,-0.0001368755783739373,-4.350604887948135e-05,0.0001269910572855575,8.005658121415814e-05,1.0400146012537673e-05,-1.9921298152031216e-05,-4.837380209162559e-05,-3.24369897718408e-05,-4.4482531438842537e-05,-4.3184927286749934e-05,-3.547449857404612e-05,-3.307553089358565e-05,-4.870008745351271e-05,-4.781035590260356e-05,-2.449269808855253e-05,6.050017533961689e-07,5.617076854743118e-07,-0.0,-0.00010695947623675277,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00010224440794197724,8.064781270992841e-07,0.000105341801503681,4.7861259399939504e-05,4.161450832317059e-05,5.940875403936617e-07,-6.894347397445866e-05,-9.021191881030569e-05,-8.588272080759164e-05,-6.595014381683418e-05,-7.964235031047637e-05,-0.00010656859529076194,-4.082346321723607e-05,5.288457946562819e-07,-0.0,0.00018527198217770186,0.0005891609660550852,-0.0008486285859184112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00022232785090115658,-3.49699481850325e-05,-2.7088787275580592e-05,3.0516365750812148e-05,8.364705619979777e-06,-5.186363349249731e-05,-0.00016275341617348908,-0.00017843013591671848,-0.00013374558661216322,-0.00012220046604981695,-0.0001648483289658346,-8.874569657409987e-05,-0.00011159793689126132,-0.0002517725309076807]
Intercept: 0.5232286178786096

Process finished with exit code 0
 

  1. training = spark.read.format("libsvm").load("hdfs://node1:8020/user/hadoop/sample_libsvm_data.txt"): reads the LIBSVM-format training dataset with the SparkSession. The dataset path is "hdfs://node1:8020/user/hadoop/sample_libsvm_data.txt". LIBSVM is a common format for sparse data and is particularly suited to classification problems.

  2. lsvc = LinearSVC(maxIter=10, regParam=0.1): creates a LinearSVC object with a maximum of 10 iterations (maxIter=10) and a regularization parameter of 0.1 (regParam=0.1).

  3. lsvcModel = lsvc.fit(training): fits (trains) the LinearSVC model on the training dataset (training) and returns the trained model (lsvcModel).

  4. print("Coefficients: " + str(lsvcModel.coefficients)): prints the coefficients of the linear support vector machine model, i.e. the weights of the features.

  5. print("Intercept: " + str(lsvcModel.intercept)): prints the intercept of the linear support vector machine model, i.e. its bias term.

This output describes the parameters of the linear support vector classifier (LinearSVC):

  • Coefficients: the feature coefficients of the model. Every feature has a coefficient that measures how much it influences the classification; the larger its absolute value, the more important the feature. The printed coefficients cover every feature of the input data, and each value is the weight of the corresponding feature.

  • Intercept: the model's intercept term, analogous to the intercept in linear regression. It is the classification bias when all feature values are 0; here it equals 0.5232286178786096.

With these coefficients and the intercept, a linear classifier can be built and used to predict the class of new samples.

One-vs-Rest classifier

from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession
# load data file.
spark = SparkSession.builder \
        .appName("one_vs_Rest") \
        .master("local[*]") \
        .getOrCreate()

# Load training data

inputData = spark.read.format("libsvm") \
    .load("hdfs://node1:8020/user/hadoop/sample_multiclass_classification_data.txt")

# generate the train/test split.
(train, test) = inputData.randomSplit([0.8, 0.2])

# instantiate the base classifier.
lr = LogisticRegression(maxIter=10, tol=1E-6, fitIntercept=True)

# instantiate the One Vs Rest Classifier.
ovr = OneVsRest(classifier=lr)

# train the multiclass model.
ovrModel = ovr.fit(train)

# score the model on test data.
predictions = ovrModel.transform(test)

# obtain evaluator.
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

# compute the classification error on test data.
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

/home/hadoop/.virtualenvs/sparkkkk/bin/python /tmp/pycharm_project_330/mlib/one_vs_Rest.py 
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/16 15:06:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/04/16 15:06:41 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
24/04/16 15:06:54 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/04/16 15:06:54 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
Test Error = 0.0322581

Process finished with exit code 0

The One-vs-Rest (OvR) strategy

In multi-class classification we sometimes need to assign data to more than two classes. One-vs-Rest is one of the common strategies for doing so. The idea is to treat each class as a separate binary classification task: the chosen class is the "positive" class and all other classes together form the "negative" class. For every class we train one classifier whose goal is to separate the samples of that class from the samples of all other classes.

The logistic regression algorithm

Logistic regression is a classic machine learning algorithm for binary classification. It models probabilities with the logistic function, mapping the output of a linear model into the range [0, 1], which is interpreted as the probability that a sample belongs to a given class.
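
The logistic (sigmoid) function it relies on is the standard one:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x} + b)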

Implementation steps

  1. Load and prepare the data: the code first loads a data file in LIBSVM format, one of the formats supported by Spark MLlib, and then splits the dataset into a training set and a test set.

  2. Instantiate the base classifier: logistic regression is chosen as the base classifier, with parameters such as the maximum number of iterations and the convergence tolerance.

  3. Instantiate the One-vs-Rest classifier: a One-vs-Rest classifier is built from the logistic regression estimator; it will separate each class from all the other classes.

  4. Train the model: fit() trains the One-vs-Rest classifier on the training set and produces a multi-class classification model.

  5. Evaluate the model: the trained model is evaluated on the test set by computing its accuracy.

  6. Output the result: finally, the test error, i.e. the model's error rate on the test set, is printed.

Naive Bayes

# coding: utf8
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("SVM") \
    .getOrCreate()

# Load training data
data = spark.read.format("libsvm") \
    .load("hdfs://node1:8020/user/hadoop/sample_libsvm_data.txt")

# Split the data into train and test
splits = data.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]

# create the trainer and set its parameters
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
model = nb.fit(train)

# select example rows to display.
predictions = model.transform(test)
predictions.show()

# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

/home/hadoop/.virtualenvs/sparkkkk/bin/python /tmp/pycharm_project_330/mlib/Naive Bayes.py 
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/16 15:23:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/04/16 15:24:06 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
+-----+--------------------+--------------------+-----------+----------+
|label|            features|       rawPrediction|probability|prediction|
+-----+--------------------+--------------------+-----------+----------+
|  0.0|(692,[95,96,97,12...|[-172664.79564650...|  [1.0,0.0]|       0.0|
|  0.0|(692,[98,99,100,1...|[-176279.15054306...|  [1.0,0.0]|       0.0|
|  0.0|(692,[122,123,124...|[-189600.55409526...|  [1.0,0.0]|       0.0|
|  0.0|(692,[124,125,126...|[-274673.88337431...|  [1.0,0.0]|       0.0|
|  0.0|(692,[124,125,126...|[-183393.03869049...|  [1.0,0.0]|       0.0|
|  0.0|(692,[125,126,127...|[-256992.48807619...|  [1.0,0.0]|       0.0|
|  0.0|(692,[126,127,128...|[-210411.53649773...|  [1.0,0.0]|       0.0|
|  0.0|(692,[127,128,129...|[-170627.63616681...|  [1.0,0.0]|       0.0|
|  0.0|(692,[127,128,129...|[-212157.96750469...|  [1.0,0.0]|       0.0|
|  0.0|(692,[127,128,129...|[-183253.80108550...|  [1.0,0.0]|       0.0|
|  0.0|(692,[128,129,130...|[-246528.93739632...|  [1.0,0.0]|       0.0|
|  0.0|(692,[150,151,152...|[-158348.34683571...|  [1.0,0.0]|       0.0|
|  0.0|(692,[152,153,154...|[-210229.50765957...|  [1.0,0.0]|       0.0|
|  0.0|(692,[152,153,154...|[-242985.16248889...|  [1.0,0.0]|       0.0|
|  0.0|(692,[152,153,154...|[-94622.933454005...|  [1.0,0.0]|       0.0|
|  0.0|(692,[153,154,155...|[-266465.39689814...|  [1.0,0.0]|       0.0|
|  0.0|(692,[153,154,155...|[-144989.71469229...|  [1.0,0.0]|       0.0|
|  0.0|(692,[154,155,156...|[-283834.57437738...|  [1.0,0.0]|       0.0|
|  0.0|(692,[181,182,183...|[-155256.59399829...|  [1.0,0.0]|       0.0|
|  1.0|(692,[100,101,102...|[-147726.11958982...|  [0.0,1.0]|       1.0|
+-----+--------------------+--------------------+-----------+----------+
only showing top 20 rows

Test set accuracy = 1.0

Process finished with exit code 0
 

An accuracy of 1.0 means the predictions on the test set were perfectly correct: every sample's predicted class matches its true class. In other words, on this dataset the naive Bayes classifier reached 100% accuracy and did not misclassify a single sample.

In practice, 100% test accuracy is uncommon; even with a good model and a good dataset there are usually some misclassifications. So when you see a test accuracy of 1.0 you should look carefully for possible problems, for example data leakage or a dataset that is too small. You can also add cross-validation, try more complex models, or do more feature engineering to further verify the model's ability to generalize.

Different classifiers can show different performance, strengths, and weaknesses on different kinds of data and problems. Some common classifiers and their characteristics:

  1. Logistic regression

    • Advantages: simple, easy to implement and understand, works well on problems that are (approximately) linearly separable, and provides probability estimates.
    • Disadvantages: limited ability to fit non-linear data.
  2. Support vector machine (SVM)

    • Advantages: classifies well in high-dimensional spaces and on non-linear data, and different kernel functions can be used to adapt to different kinds of data.
    • Disadvantages: training is slow on large datasets, so it may not be suitable when the data volume is large.
  3. Multilayer perceptron (MLP)

    • Advantages: can model complex non-linear relationships, has strong fitting capacity, and does well on large datasets.
    • Disadvantages: sensitive to parameter selection and tuning, and may require more computation and time.
  4. Gradient-boosted trees

    • Advantages: can handle many kinds of data, for both classification and regression problems, with high predictive performance.
    • Disadvantages: sensitive to outliers and noise, and may need parameter tuning to avoid overfitting.
  5. Naive Bayes

    • Advantages: simple and fast, works well on high-dimensional data and large datasets, and can handle missing values.
    • Disadvantages: assumes the features are mutually independent, which can make it inaccurate in some cases.

Each classifier has the scenarios and domains it is suited to; choosing the right one depends on the characteristics of the data, the nature of the problem, and the requirements on predictive performance and interpretability. In practice you usually need to experiment and compare on the specific case to determine the most suitable classifier.
