Spark MLlib Beginner Study Notes - Decision Trees

Latest recommended article published on 2024-05-05 12:15:56

hjh00

Category: Spark. Tags: decision tree, MLlib, Spark

Article link: https://blog.csdn.net/hjh00/article/details/72828060

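This article is a set of study notes on decision trees in Spark MLlib. Using the kyphosis dataset as an example, it introduces the dataset's background and its fields (kyphosis, Age, Number, and Start), and uses test code to show how to train and apply a decision tree model with Spark.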

Summary generated by CSDN using AI technology.
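
The workflow described in the summary might look roughly like the sketch below. It is only an illustration under stated assumptions, not the author's test code: the object name `KyphosisTreeSketch`, the helper `toLabeledPoint`, the CSV layout `kyphosis,Age,Number,Start`, the `present`/`absent` label encoding, and the sample rows are all made up for the example. The calls `DecisionTree.trainClassifier`, `model.predict`, and `model.toDebugString` come from the RDD-based MLlib API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

object KyphosisTreeSketch {

  // Assumed row layout: "kyphosis,Age,Number,Start", where the first
  // field is "present" or "absent" (an assumption for this sketch).
  def toLabeledPoint(line: String): LabeledPoint = {
    val parts = line.split(",").map(_.trim)
    val label = if (parts(0) == "present") 1.0 else 0.0
    val features = parts.drop(1).map(_.toDouble)
    LabeledPoint(label, Vectors.dense(features))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KyphosisTreeSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Made-up rows for illustration only; not values from the real dataset.
    val lines = Seq(
      "absent,70,3,10",
      "absent,15,2,14",
      "present,120,5,4",
      "present,96,6,3",
      "absent,30,3,12",
      "present,110,4,5"
    )
    val data = sc.parallelize(lines).map(toLabeledPoint)

    // Two classes, no categorical features, Gini impurity.
    val model = DecisionTree.trainClassifier(
      input = data,
      numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](),
      impurity = "gini",
      maxDepth = 3,
      maxBins = 32
    )

    // Apply the model to a single (again made-up) feature vector.
    val prediction = model.predict(Vectors.dense(100.0, 5.0, 4.0))
    println(s"Predicted class: $prediction")
    println(model.toDebugString)

    sc.stop()
  }
}
```

All three features are treated as continuous here, which is why `categoricalFeaturesInfo` is an empty map; a feature with a fixed set of category indices would instead get an entry such as `n -> k`.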

The usage is documented in the official API reference.

def trainClassifier(input: RDD[LabeledPoint], numClasses: Int, categoricalFeaturesInfo: Map[Int, Int], impurity: String, maxDepth: Int, maxBins: Int): DecisionTreeModel

Method to train a decision tree model for binary or multiclass classification.

input: Training dataset, an RDD of org.apache.spark.mllib.regression.LabeledPoint. Labels should take values {0, 1, ..., numClasses-1}.
numClasses: Number of classes for classification.
categoricalFeaturesInfo: Map storing arity of categorical features. E.g., an entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.
impurity: Criterion used for information gain calculation. Supported values: "gini" (recommended) or "entropy".
maxDepth: Maximum depth of th