A pyspark.ml Pipeline for Decision Tree Classification of the Wine Dataset

11号的乔乔

Category: Spark. Tags: python, decision tree, big data, spark

Article link: https://blog.csdn.net/DaB_za/article/details/115676483

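This post demonstrates how to build a decision tree classifier for the Wine dataset with PySpark's MLlib (the pyspark.ml API). The data is first preprocessed, which includes indexing the features and the labels. A decision tree classifier is then created and assembled into a pipeline for training. The data is randomly split into training and test sets, and finally the model's multiclass classification accuracy is evaluated.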
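To make the label-indexing step concrete, here is a minimal plain-Python sketch of the mapping that StringIndexer builds by default (the most frequent label gets index 0) and that IndexToString later inverts. It illustrates the behavior only and is not Spark's implementation. The helper name `fit_label_index` is hypothetical, and the label list assumes the class counts of the UCI Wine data (59, 71, and 48 samples for classes 1, 2, and 3).

```python
from collections import Counter

def fit_label_index(labels):
    """Return the distinct labels ordered by descending frequency (ties broken alphabetically)."""
    counts = Counter(labels)
    return sorted(counts, key=lambda lab: (-counts[lab], lab))

# Class labels as strings, using the assumed class counts of the UCI Wine data.
labels = ['1'] * 59 + ['2'] * 71 + ['3'] * 48

ordered = fit_label_index(labels)                            # plays the role of labelIndexer.labels
to_index = {lab: float(i) for i, lab in enumerate(ordered)}  # label -> index (StringIndexer)
to_label = {i: lab for lab, i in to_index.items()}           # index -> label (IndexToString)

print(ordered)
print(to_index['1'], to_label[0.0])
```

Because class 2 is the most frequent, the numeric index does not match the original class number, which is why predictions are converted back to label strings before being displayed.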

Data source: http://archive.ics.uci.edu/ml/datasets/Wine
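For orientation, each line of the UCI Wine file holds 14 comma-separated values: the class label (1, 2, or 3) followed by 13 numeric attributes. The sketch below parses one line in that format with plain Python, mirroring what the Spark code does for each record. The sample line and the helper name `parse_line` are illustrative assumptions.

```python
def parse_line(line):
    """Split one comma-separated line into a string label and a list of 13 float features."""
    fields = line.strip().split(',')
    if len(fields) != 14:
        raise ValueError("expected 14 fields, got %d" % len(fields))
    label = fields[0]                          # column 0: class label
    features = [float(v) for v in fields[1:]]  # columns 1-13: numeric attributes
    return label, features

# An illustrative line in the file's format (assumed, for demonstration only).
sample = "1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065"
label, features = parse_line(sample)
print(label, len(features), features[:3])
```

Checking the field count up front gives a clearer error than an IndexError if the file contains a blank or malformed line.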

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row

def getFeaAndLab(x):
    """Turn one split line into a dict with a 'features' vector and a 'label' string."""
    res = {}
    # Columns 1-13 are the numeric attributes; convert them to a dense vector
    res['features'] = Vectors.dense([float(v) for v in x[1:14]])
    # Column 0 is the class label; keep it as a string for StringIndexer
    res['label'] = str(x[0])
    return res

def model(data):
    # ------------------------ data processing ------------------------
    # Map string labels to numeric indices; fit on the full dataset so every label is known
    labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel').fit(data)
    # Index features that have at most maxCategories (default 20) distinct values as categorical
    featureIndexer = VectorIndexer(inputCol='features', outputCol='indexedFeatures').fit(data)
    # Convert predicted indices back to the original label strings
    labelConverter = IndexToString(inputCol='prediction', outputCol='predictedLabel', labels=labelIndexer.labels)

    # ------------------------ choose your model ------------------------
    dtClassifier = DecisionTreeClassifier(labelCol='indexedLabel', featuresCol='indexedFeatures')
    # print("DecisionTree parameters:\n" + dtClassifier.explainParams())  # explain the parameters

    # ------------------------ pipeline ------------------------
    dtPipeline = Pipeline(stages=[labelIndexer, featureIndexer, dtClassifier, labelConverter])

    # ------------------------ split ------------------------
    # randomSplit assigns rows at random, so no separate shuffle is needed
    trainingData, testData = data.randomSplit([0.7, 0.3])

    # ------------------------ train ------------------------
    dtPipelineModel = dtPipeline.fit(trainingData)

    # ------------------------ test ------------------------
    dtPredictions = dtPipelineModel.transform(testData)

    # ------------------------ show ------------------------
    dtPredictions.select('predictedLabel', 'label', 'features').show(20)

    # ------------------------ evaluate ------------------------
    # The evaluator's default metric is F1, so request accuracy explicitly
    evaluator = MulticlassClassificationEvaluator(
        labelCol='indexedLabel', predictionCol='prediction', metricName='accuracy')
    dtAccuracy = evaluator.evaluate(dtPredictions)
    print(dtAccuracy)
    
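def main():
    # Create (or reuse) a SparkSession; in the pyspark shell `spark` already exists
    # and getOrCreate() returns it. The app name is an arbitrary choice.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName('WineDecisionTree').getOrCreate()
    # Read the data
    rdd = spark.sparkContext.textFile('file:///usr/local/spark/mycode/ml/wine.txt')
    # Split each line and convert the records to a DataFrame
    data = rdd.map(lambda x: x.split(',')).map(lambda x: Row(**getFeaAndLab(x))).toDF()
    model(data)


if __name__ == '__main__':
    main()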
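
As a cross-check on the evaluation step, accuracy is simply the fraction of test rows whose predicted index equals the true index. The plain-Python sketch below computes it on a small made-up list of (label, prediction) pairs; the numbers are illustrative and are not results from the Wine data, and the helper name `accuracy` is hypothetical. Note that MulticlassClassificationEvaluator reports F1 by default and returns accuracy only when its metricName is set to "accuracy".

```python
def accuracy(pairs):
    """Fraction of (label, prediction) pairs in which the two values agree."""
    if not pairs:
        raise ValueError("no predictions to evaluate")
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)

# Made-up (indexedLabel, prediction) pairs, for illustration only.
pairs = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (0.0, 0.0), (1.0, 2.0)]
acc = accuracy(pairs)
print(acc)
```

On the predictions DataFrame, the same quantity could be obtained by counting the rows where indexedLabel equals prediction and dividing by the total row count.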
