Big Data: Building a Decision Tree Model Pipeline in Spark, and Two Validation Methods (Complete Version)

Dataset link

I. Data Preprocessing

1. Load the data
# Import packages
import os
import time
from pyspark.sql import SparkSession

# Instantiate a SparkSession object; run the Spark program in local mode
spark = SparkSession \
    .builder \
    .appName("PySpark_ML_Pipeline") \
    .master("local[4]")\
    .getOrCreate()


print spark
print spark.sparkContext
'''
<pyspark.sql.session.SparkSession object at 0x00000000066CB5C0>
<SparkContext master=local[4] appName=PySpark_ML_Pipeline>
'''
2. Read the CSV-format file with SparkSession
help(spark.read.csv)
# Read the dataset
raw_df = spark.read.csv('./datas/train.tsv', header='true', sep='\t',\
							 inferSchema='true')
# Show the number of records
print raw_df.count()
==>7395
raw_df.printSchema()

# Since there are too many columns, select just a few of them
raw_df.select('url', 'alchemy_category', 'alchemy_category_score', \
							'label').show(10)





3. Clean the data
# Define a function that converts '?' to '0'
def replace_question_func(x):
    return '0' if x == '?' else x

# Register the function as a UDF
from  pyspark.sql.functions import udf
replace_question = udf(replace_question_func)



# col() turns a column-name string into a Column object referencing that column of the DataFrame
from pyspark.sql.functions import col

# Apply the UDF to clean the data, then cast the cleaned columns to double
df = raw_df.select(['url', 'alchemy_category'] +\
			 [ replace_question(col(column)).cast('double')\
			 .alias(column) for column in raw_df.columns[4:]])


df.printSchema()

df.select('url', 'alchemy_category', 'alchemy_category_score', \
				'label').show(10)


# Split the dataset into a training set and a test set
train_df, test_df = df.randomSplit([0.7, 0.3])

print train_df.cache().count()
print test_df.cache().count()
"""
5216
2179
"""

4. Feature engineering
1. alchemy_category (categorical feature)
    First transformer: StringIndexer
        converts the text category into a numeric index
    Second transformer: OneHotEncoder
        converts the numeric category field into a Vector spread over multiple fields
2. Feature assembly
    Third transformer: VectorAssembler
        combines multiple feature columns into a single feature vector
(A minimal toy sketch of how these three transformers chain together follows right after this list.)
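Before applying these to the real dataset in the subsections below, here is a minimal toy sketch of the chain; the small DataFrame and its column names are invented purely for illustration:

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# toy data: one text category column and one numeric column (illustrative only)
toy_df = spark.createDataFrame(
    [('business', 0.8), ('sports', 0.1), ('business', 0.5)],
    ['category', 'score'])

# StringIndexer is an Estimator: fit() returns the actual Transformer
indexed = StringIndexer(inputCol='category', outputCol='category_index') \
    .fit(toy_df).transform(toy_df)

# OneHotEncoder (a plain Transformer in Spark 2.2) turns the index into a sparse vector
encoded = OneHotEncoder(inputCol='category_index',
                        outputCol='category_vector').transform(indexed)

# VectorAssembler merges the encoded category and the numeric column into 'features'
assembled = VectorAssembler(inputCols=['category_vector', 'score'],
                            outputCol='features').transform(encoded)
assembled.show(truncate=False)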
4.1 StringIndexer

Docs: http://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer

# Import the module
from pyspark.ml.feature import StringIndexer
help(StringIndexer)


# Create a StringIndexer instance
"""
    Parameters:
        inputCol  -> name of the column to convert
        outputCol -> name of the output (converted) column
"""
categoryIndexer = StringIndexer(inputCol='alchemy_category',\
						 outputCol='alchemy_category_index')

print type(categoryIndexer)
"""
==><class 'pyspark.ml.feature.StringIndexer'>
"""

Call the fit method of the StringIndexer to obtain the fitted Transformer.

categoryTransformer = categoryIndexer.fit(df)
print type(categoryTransformer)


# Use the categoryTransformer to transform train_df
df1 = categoryTransformer.transform(train_df)

df1.select('alchemy_category', 'alchemy_category_index').show(10)
"""
+------------------+----------------------+
|  alchemy_category|alchemy_category_index|
+------------------+----------------------+
|                 ?|                   0.0|
|arts_entertainment|                   2.0|
|                 ?|                   0.0|
|          business|                   3.0|
|arts_entertainment|                   2.0|
|                 ?|                   0.0|
|                 ?|                   0.0|
|        recreation|                   1.0|
|          business|                   3.0|
|arts_entertainment|                   2.0|
+------------------+----------------------+
only showing top 10 rows
"""



df1.printSchema()  # inspect the resulting schema
4.2 OneHotEncoder
OneHotEncoder converts a numeric categorical feature column into a Vector spread across multiple binary fields.
from pyspark.ml.feature import OneHotEncoder
# Create an OneHotEncoder instance
encoder = OneHotEncoder(inputCol='alchemy_category_index', 
                        outputCol='alchemy_category_index_vector')

print type(encoder)
"""
<class 'pyspark.ml.feature.OneHotEncoder'>
"""


df2 = encoder.transform(df1)

df2.printSchema()

df2.select('alchemy_category', 'alchemy_category_index',\
			 'alchemy_category_index_vector').show(10)


4.3 VectorAssembler

Feature assembly: the third transformer, VectorAssembler, combines multiple feature columns into a single feature vector.

from pyspark.ml.feature import VectorAssembler
assembler_inputs = ['alchemy_category_index_vector'] \
					+ raw_df.columns[4:-1]
print assembler_inputs

"""
['alchemy_category_index_vector', 'alchemy_category_score', 
'avglinksize', 'commonlinkratio_1', 'commonlinkratio_2', 
'commonlinkratio_3', 'commonlinkratio_4', 'compression_ratio',
 'embed_ratio', 'framebased', 'frameTagRatio', 'hasDomainLink', 
'linkwordscore', 'news_front_page', 'non_markup_alphanum_characters', 
'numberOfLinks', 'numwords_in_url', 'parametrizedLinkRatio', 
'spelling_errors_ratio']
"""
# Create a VectorAssembler instance: specify which columns to combine and the output column name
assembler = VectorAssembler(inputCols=assembler_inputs, 
								outputCol='features')
df3 = assembler.transform(df2)

df3.printSchema()

df3.select('features', 'label').show(5)
"""
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(35,[0,14,15,16,1...|  1.0|
|(35,[2,13,14,15,1...|  1.0|
|(35,[0,14,15,19,2...|  0.0|
|(35,[3,13,14,15,1...|  1.0|
|(35,[2,13,14,15,1...|  0.0|
+--------------------+-----+
only showing top 5 rows
"""

df3.select('features').take(1)
"""
[Row(features=SparseVector(35, 
{0: 1.0, 14: 2.1446, 15: 0.7969, 16: 0.3945, 17: 0.332, 
18: 0.3203, 19: 0.5022, 22: 0.028, 24: 0.1898, 25: 0.2354,
 26: 1.0, 27: 1.0, 28: 17.0, 30: 10588.0, 31: 256.0, 32: 
 5.0, 33: 0.3828, 34: 0.1368}))]
"""

II. Modeling

DecisionTreeClassifier (decision tree classifier)

from pyspark.ml.classification import DecisionTreeClassifier

# Use the decision tree classification algorithm
dtc = DecisionTreeClassifier(featuresCol='features', labelCol='label',
                            impurity='gini', maxDepth=5, maxBins=32)

# Fit the algorithm to the (transformed) training data
dtc_model = dtc.fit(df3)

# Use the model to predict
df4 = dtc_model.transform(df3)
df4.select('label', 'prediction', 'rawPrediction', 'probability') \
    .show(20, truncate=False)

label  prediction  rawPrediction   probability
1.0    1.0         [909.0,1104.0]  [0.45156482861400893,0.5484351713859911]
1.0    1.0         [909.0,1104.0]  [0.45156482861400893,0.5484351713859911]
0.0    0.0         [38.0,1.0]      [0.9743589743589743,0.02564102564102564]
1.0    1.0         [27.0,177.0]    [0.1323529411764706,0.8676470588235294]
0.0    0.0         [95.0,28.0]     [0.7723577235772358,0.22764227642276422]
1.0    1.0         [909.0,1104.0]  [0.45156482861400893,0.5484351713859911]
1.0    1.0         [909.0,1104.0]  [0.45156482861400893,0.5484351713859911]
1.0    0.0         [144.0,95.0]    [0.602510460251046,0.39748953974895396]
0.0    0.0         [363.0,146.0]   [0.7131630648330058,0.2868369351669941]
0.0    0.0         [86.0,23.0]     [0.7889908256880734,0.21100917431192662]
0.0    0.0         [144.0,95.0]    [0.602510460251046,0.39748953974895396]
0.0    0.0         [144.0,95.0]    [0.602510460251046,0.39748953974895396]
0.0    0.0         [43.0,1.0]      [0.9772727272727273,0.022727272727272728]
1.0    1.0         [909.0,1104.0]  [0.45156482861400893,0.5484351713859911]
1.0    1.0         [909.0,1104.0]  [0.45156482861400893,0.5484351713859911]
1.0    1.0         [27.0,177.0]    [0.1323529411764706,0.8676470588235294]
1.0    1.0         [129.0,417.0]   [0.23626373626373626,0.7637362637362637]
1.0    1.0         [909.0,1104.0]  [0.45156482861400893,0.5484351713859911]
0.0    1.0         [909.0,1104.0]  [0.45156482861400893,0.5484351713859911]
1.0    1.0         [909.0,1104.0]  [0.45156482861400893,0.5484351713859911]

only showing top 20 rows

III. Evaluation (ROC curve / AUC)

from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Create the evaluator instance
evaluator = BinaryClassificationEvaluator(labelCol='label',
		 rawPredictionCol='rawPrediction')
# Compute the metric (the default metricName is "areaUnderROC")
auc = evaluator.evaluate(df4)
print auc
"""
0.6087142511
"""
Summary of the workflow above:
    1. Extract feature data from the raw data
    2. Feed the features to the algorithm to obtain a model
    3. Use the model to make predictions
    4. Evaluate the model

Pipeline:
    behaves like a single "algorithm" -> an Estimator (model learner)
    it contains two kinds of stages:
        -a. Estimators (model learners), which expose fit()
        -b. Transformers, which expose transform()

pipeline = Pipeline(stages=[...])
model = pipeline.fit(train_df)
model.transform(test_df)

IV. Packaging it all into an ML Pipeline

Step 1. Create the transformers and the model learner used in the flow
# 1. Import all required modules
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
# a. StringIndexer
string_indexer = StringIndexer(inputCol='alchemy_category',\
					 outputCol='alchemy_category_index')

# b. OneHotEncoder
one_hot_encoder = OneHotEncoder(inputCol='alchemy_category_index',\
					 outputCol='alchemy_category_index_vector')

# c. VectorAssembler
assembler_inputs = ['alchemy_category_index_vector'] \
						+ raw_df.columns[4:-1]
vector_assembler = VectorAssembler(inputCols=assembler_inputs,\
						 outputCol='features')


# d. DecisionTreeClassifier (the model learner / Estimator)
dt = DecisionTreeClassifier(featuresCol='features', labelCol='label',\
                            impurity='gini', maxDepth=5, maxBins=32)
Step 2. Create the Pipeline instance
# stages listed in data-processing order
pipeline = Pipeline(stages=[string_indexer,
			 one_hot_encoder, vector_assembler, dt])
pipeline.getStages()

"""
[StringIndexer_43e8b50676a58dad4d05,
 OneHotEncoder_4bf2a31a6b4b12aebd78,
 VectorAssembler_4429bf16ed1cc6c14207,
 DecisionTreeClassifier_451682088ef8fcaa79ae]
 """
Step 3. Fit the Pipeline: data processing plus model training
# Call the fit method
pipleline_model = pipeline.fit(train_df)

type(pipleline_model)   #pyspark.ml.pipeline.PipelineModel
pipleline_model.stages[3]
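As a quick sanity check (a small sketch; stage 3 of the fitted pipeline is the DecisionTreeClassificationModel), the learned tree structure can be printed:

tree_model = pipleline_model.stages[3]
# toDebugString holds the full if/else structure of the trained tree
print tree_model.toDebugString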
Step 4. Predict with the PipelineModel
predict_df = pipleline_model.transform(test_df)

Step 5. Save the PipelineModel
# Save the model
pipleline_model.save('./datas/dtc-model')
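Note that save() fails if the target directory already exists; a small sketch of overwriting it via the MLWriter interface instead:

# overwrite an existing model directory instead of raising an error
pipleline_model.write().overwrite().save('./datas/dtc-model')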
Step 6. Load the model back and reuse it
# Load the model
from pyspark.ml.pipeline import PipelineModel

load_pipeline_model = PipelineModel.load('./datas/dtc-model')
load_pipeline_model.stages[3]


# Predict on the test set
load_pipeline_model.transform(test_df) \
    .select('label', 'prediction', 'rawPrediction',\
     'probability').show(20, truncate=False)
label  prediction  rawPrediction   probability
0.0    0.0         [361.0,300.0]   [0.546142208774584,0.45385779122541603]
1.0    0.0         [144.0,95.0]    [0.602510460251046,0.39748953974895396]
0.0    1.0         [0.0,8.0]       [0.0,1.0]
1.0    1.0         [129.0,417.0]   [0.23626373626373626,0.7637362637362637]
0.0    0.0         [363.0,146.0]   [0.7131630648330058,0.2868369351669941]
0.0    0.0         [363.0,146.0]   [0.7131630648330058,0.2868369351669941]
1.0    1.0         [909.0,1104.0]  [0.45156482861400893,0.5484351713859911]
1.0    1.0         [129.0,417.0]   [0.23626373626373626,0.7637362637362637]
1.0    1.0         [27.0,177.0]    [0.1323529411764706,0.8676470588235294]
1.0    1.0         [27.0,177.0]    [0.1323529411764706,0.8676470588235294]
1.0    1.0         [27.0,177.0]    [0.1323529411764706,0.8676470588235294]
1.0    1.0         [27.0,177.0]    [0.1323529411764706,0.8676470588235294]
1.0    1.0         [27.0,177.0]    [0.1323529411764706,0.8676470588235294]
1.0    1.0         [909.0,1104.0]  [0.45156482861400893,0.5484351713859911]
0.0    0.0         [363.0,146.0]   [0.7131630648330058,0.2868369351669941]
1.0    1.0         [909.0,1104.0]  [0.45156482861400893,0.5484351713859911]
1.0    1.0         [909.0,1104.0]  [0.45156482861400893,0.5484351713859911]
1.0    1.0         [909.0,1104.0]  [0.45156482861400893,0.5484351713859911]
1.0    0.0         [361.0,300.0]   [0.546142208774584,0.45385779122541603]
0.0    0.0         [86.0,23.0]     [0.7889908256880734,0.21100917431192662]

only showing top 20 rows

V. Validation to select the best model

5.1 Create a TrainValidationSplit instance

(train/validation split to select the best model)
Import the modules

from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder

Build a parameter grid for the decision tree classifier

"""
    调整三个参数:
        -1. 不纯度度量
        -2. 最多深度
        -3. 最大分支数
"""
param_grid = ParamGridBuilder() \
    .addGrid(dt.impurity, ['gini', 'entropy']) \
    .addGrid(dt.maxDepth, [5, 10, 20]) \
    .addGrid(dt.maxBins, [8, 16, 32]) \
    .build()
    
print type(param_grid)
for param in param_grid:
    print param

Create a model evaluator for binary classification

binary_class_evaluator = BinaryClassificationEvaluator(labelCol='label',\
                                rawPredictionCol='rawPrediction')


Create the TrainValidationSplit instance

"""
    __init__(self, estimator=None, estimatorParamMaps=None, evaluator=None, trainRatio=0.75,  seed=None)
    参数解释:
        estimator:
            模型学习器,针对哪个算法进行调整超参数,这里是DT
        estimatorParamMaps:
            算法调整的参数组合
        evaluator:
            评估模型的评估器,比如二分类的话,使用auc面积
        trainRatio:
            训练集与验证集 所占的比例,此处的值表示的是 训练集比例
"""

train_validataion_split = TrainValidationSplit(estimator=dt,
				 evaluator=binary_class_evaluator, 
	              estimatorParamMaps=param_grid, trainRatio=0.8)

type(train_validataion_split)
#pyspark.ml.tuning.TrainValidationSplit

Build a new Pipeline instance

# use train_validataion_split in place of the original dt stage
tvs_pipeline = Pipeline(stages=[string_indexer, \
								one_hot_encoder, vector_assembler, \
                                train_validataion_split])
# tvs_pipeline performs the data processing and model training (finding the best model)
tvs_pipeline_model = tvs_pipeline.fit(train_df)

best_model = tvs_pipeline_model.stages[3].bestModel
"""
DecisionTreeClassificationModel (uid=DecisionTreeClassifier_\
451682088ef8fcaa79ae) of depth 20 with 1851 nodes
"""

Evaluate the best model on the test set

predictions_df = tvs_pipeline_model.transform(test_df)

model_auc = binary_class_evaluator.evaluate(predictions_df)
print model_auc

0.649609702764
5.2 Cross-validation (CrossValidator)
"""
     __init__(self, estimator=None, estimatorParamMaps=None, \
				evaluator=None, numFolds=3, seed=None)
    假设 K-Fold的CrossValidation交叉验证  K = 3,将数据分为3个部分:
        1、A + B作为训练,C作为验证
        2、B + C作为训练,A作为验证
        3、A + C最为训练,B作为验证

"""


# Import the module
from pyspark.ml.tuning import CrossValidator
# Build a CrossValidator instance and set its parameters
cross_validator = CrossValidator(estimator=dt, \
								evaluator=binary_class_evaluator,\
                                estimatorParamMaps=param_grid, numFolds=3)

# Create the Pipeline
cv_pipeline = Pipeline(stages=[string_indexer, one_hot_encoder, \
								vector_assembler, cross_validator])

Use cv_pipeline to train with cross-validation

cv_pipeline_model = cv_pipeline.fit(train_df)

Inspect the best model

best_model = cv_pipeline_model.stages[3].bestModel
"""
DecisionTreeClassificationModel (uid=DecisionTreeClassifier_ \
451682088ef8fcaa79ae) of depth 10 with 527 nodes
"""

Evaluate the best model on the test set

cv_predictions = cv_pipeline_model.transform(test_df)
cv_model_auc =  binary_class_evaluator.evaluate(cv_predictions)
print cv_model_auc
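The fitted CrossValidatorModel likewise records, for every parameter combination, the metric averaged over the folds; a small sketch for listing them (stage 3 of the fitted pipeline is the CrossValidatorModel):

cv_model = cv_pipeline_model.stages[3]
# average AUC across the 3 folds, one value per entry of param_grid
for params, metric in zip(param_grid, cv_model.avgMetrics):
    print metric, params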

VI. Improvement: Random Forest (RF)

# Import the random forest classifier module
from pyspark.ml.classification import RandomForestClassifier

# Create a RandomForestClassifier instance
rfc = RandomForestClassifier(labelCol='label', \
							featuresCol='features',\
							 numTrees=10, \
							 featureSubsetStrategy="auto",\
							 maxDepth=5, \
							 maxBins=32, \
							 impurity="gini")


# Create the Pipeline instance
rfc_pipeline = Pipeline(stages=[string_indexer, one_hot_encoder, \
						 vector_assembler, rfc])


# Train the model on the training data
rfc_pipeline_model = rfc_pipeline.fit(train_df)


# Predict on the test set
rfc_predictions = rfc_pipeline_model.transform(test_df)

rfc_model_auc =  binary_class_evaluator.evaluate(rfc_predictions)
print rfc_model_auc
"""
0.716242043615
"""