Spark ML Pipeline Workflow: Binary Classification

Latest recommended article published on 2024-05-05 12:16:29

大胖头leo

Category: PySpark学习日志 (PySpark study log)  Tags: SparkML

Article link: https://blog.csdn.net/a8131357leo/article/details/100867624

Spark ML operates on DataFrames, so the data must be stored and processed as a DataFrame.

Preparing the data

option('header'): whether the file has a header row
option('delimiter'): the field delimiter
load: the file path
format: the input file format


row_df = sqlContext.read.format('csv')\
        .option('header','true')\
        .option('delimiter','\t')\
        .load(Path+'/train.tsv')

print(row_df.count())

7395
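To make the header and delimiter options concrete, here is a minimal sketch using Python's standard csv module rather than Spark. The two sample rows and their URLs are made up for illustration; only the tab-separated layout mirrors train.tsv.

```python
import csv
import io

# Hypothetical sample in the same tab-separated layout as train.tsv
sample_tsv = (
    "url\tis_news\tlabel\n"
    "http://example.com/a\t1\t0\n"
    "http://example.com/b\t?\t1\n"
)

# DictReader takes column names from the first line, which is the role
# option('header','true') plays; delimiter="\t" mirrors option('delimiter','\t')
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
rows = list(reader)

print(reader.fieldnames)  # column names from the header line
print(len(rows))          # data rows only, the header is not counted
print(rows[1])            # every value is still a string, including "?"
```

As in this sketch, a CSV read without schema inference yields string columns, which is why the numeric columns are cast later on.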

read versus textFile: SparkSession.read (here sqlContext.read) returns a DataFrame, whereas SparkContext.textFile returns an RDD.

row_df.select('url','alchemy_category','alchemy_category_score','is_news','label').show(10)

+--------------------+------------------+----------------------+-------+-----+
|                 url|  alchemy_category|alchemy_category_score|is_news|label|
+--------------------+------------------+----------------------+-------+-----+
|http://www.bloomb...|          business|              0.789131|      1|    0|
|http://www.popsci...|        recreation|              0.574147|      1|    1|
|http://www.menshe...|            health|              0.996526|      1|    1|
|http://www.dumbli...|            health|              0.801248|      1|    1|
|http://bleacherre...|            sports|              0.719157|      1|    0|
|http://www.conven...|                 ?|                     ?|      ?|    0|
|http://gofashionl...|arts_entertainment|               0.22111|      1|    1|
|http://www.inside...|                 ?|                     ?|      ?|    0|
|http://www.valetm...|                 ?|                     ?|      1|    1|
|http://www.howswe...|                 ?|                     ?|      ?|    1|
+--------------------+------------------+----------------------+-------+-----+
only showing top 10 rows

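Some fields hold the unknown marker "?" instead of a value, so these placeholders must be replaced before the columns can be used as numbers. A UDF is used to do this on the DataFrame.

A UDF (user-defined function) is an ordinary Python function that Spark applies to the values of a DataFrame column, row by row.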

from pyspark.sql.functions import udf
from pyspark.sql.functions import col
import pyspark.sql.types

# Replace the "?" placeholder with "0"
def replace_question(x):
    return ("0" if x=="?" else x)

# Wrap the Python function as a UDF
replace_column = udf(replace_question)

df = row_df.select(
    ['url','alchemy_category']+[replace_column(col(column)).cast('double').alias(column) for column in row_df.columns[4:]]
)

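The UDF is applied to every column from the fifth one onward (row_df.columns[4:]). Each result is then cast to double with cast('double') and aliased back to its original column name.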
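What the UDF plus the cast does to a single row can be sketched in plain Python. Here `to_double` is a hypothetical stand-in for cast('double'), and the sample row is made up; the point is only the order of operations: replace the placeholder first, then convert.

```python
def replace_question(x):
    # Same logic as the UDF: map the "?" placeholder to "0"
    return "0" if x == "?" else x

def to_double(x):
    # Stand-in for .cast('double'): parse the cleaned string as a float
    return float(x)

# Hypothetical raw row: all values are strings after reading the file
raw_row = {"alchemy_category_score": "?", "is_news": "1", "label": "0"}

clean_row = {name: to_double(replace_question(value))
             for name, value in raw_row.items()}
print(clean_row)
# {'alchemy_category_score': 0.0, 'is_news': 1.0, 'label': 0.0}
```

Note that replacing "?" with 0 is the choice made in this post; it treats an unknown score as a zero score.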

Split the data into a training set and a test set

train_df,test_df = df.randomSplit([0.7,0.3])
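The idea behind a weighted random split can be sketched as follows. This is an illustration of the concept, not Spark's implementation: each row draws a random number and lands in the split whose weight range contains it, so the resulting sizes are only approximately 70/30. The function name `random_split` and the integer rows are assumptions for the example.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split's range in [0, 1)
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(1000)), [0.7, 0.3], seed=42)
print(len(train), len(test))  # close to 700 and 300, but not exactly
```

DataFrame.randomSplit also accepts a seed argument; passing one makes the split reproducible between runs, which the call above does not do.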

With preprocessing done, model training can begin.

Pipeline stages

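1. StringIndexer: converts a text categorical feature into numeric indices
2. OneHotEncoder: converts the numeric category indices into one-hot vectors
3. VectorAssembler: combines several feature columns into a single vector column
4. DecisionTreeClassifier: the classifier (a decision tree is used here)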

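Regarding stage 2 (OneHotEncoder): one-hot encoding assumes the categorical feature has no inherent order, that is, its values cannot be compared. For example, suppose the text categories are race-car types such as Spike, Marshmallow, and Alloy. These cannot be ranked against each other, yet converting them to numbers gives: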

Category      Index   One-hot code
Spike         1       1, 0, 0
Marshmallow   2       0, 1, 0
Alloy         3       0, 0, 1

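The categories cannot be compared, but the integer indices imply 1 < 2 < 3, which adds ordering information that does not exist in the data. One-hot vectors carry no such ordering between categories.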

Conversely, clothing sizes S, M, L, XL do have a natural order, S < M < L < XL, so they should not be one-hot encoded.
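The two encodings can be sketched in plain Python. This is a simplified illustration, not Spark's StringIndexer or OneHotEncoder: the helper names are hypothetical, indices are assigned in order of first appearance and start at 0, and the English category names are used only for the example.

```python
def string_index(categories):
    """Give each distinct category an integer index, in order of first appearance."""
    index = {}
    for c in categories:
        if c not in index:
            index[c] = len(index)
    return index

def one_hot(i, size):
    """Vector of length `size` with a single 1 at position i."""
    return [1 if j == i else 0 for j in range(size)]

# Unordered categories: index first, then one-hot so no order is implied
kart_types = ["Spike", "Marshmallow", "Alloy"]
index = string_index(kart_types)
encoded = {name: one_hot(i, len(index)) for name, i in index.items()}
print(index)    # {'Spike': 0, 'Marshmallow': 1, 'Alloy': 2}
print(encoded)  # {'Spike': [1, 0, 0], 'Marshmallow': [0, 1, 0], 'Alloy': [0, 0, 1]}

# Ordered categories: a plain integer code keeps S < M < L < XL
size_code = {"S": 0, "M": 1, "L": 2, "XL": 3}
print(size_code["S"] < size_code["XL"])  # True
```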

Parameters

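StringIndexer: inputCol (the input column), outputCol (the name of the generated column)
OneHotEncoder: dropLast (whether to drop the last category; if it is dropped, the last category is encoded as an all-zero vector), inputCol (the input column), outputCol (the output column name)
VectorAssembler: inputCols (the names of all input columns), outputCol (the name of the assembled column)
The model itself (a classification model, a regression model, and so on)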
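The effect of dropLast and of assembling columns into one vector can be sketched like this. The helpers `one_hot` and `assemble` are hypothetical plain-Python stand-ins, not the Spark classes, and the numeric values are taken from the first row shown earlier only as sample input.

```python
def one_hot(i, size, drop_last=True):
    """One-hot vector; with drop_last the final category has no slot of its own."""
    width = size - 1 if drop_last else size
    return [1.0 if j == i else 0.0 for j in range(width)]

def assemble(*parts):
    """Flatten vectors and scalars into one feature list, in the given order."""
    features = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            features.extend(part)
        else:
            features.append(float(part))
    return features

# Category index 2 is the last of three categories
print(one_hot(2, 3, drop_last=True))   # [0.0, 0.0]
print(one_hot(2, 3, drop_last=False))  # [0.0, 0.0, 1.0]

# One category vector plus two numeric columns become a single feature vector
features = assemble(one_hot(1, 3, drop_last=False), 0.789131, 1)
print(features)  # [0.0, 1.0, 0.0, 0.789131, 1.0]
```

The pipeline in this post sets dropLast=False, so every category keeps its own position in the vector.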

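Building the Pipeline

The stages parameter of Pipeline takes the list of component objects, in the order they should run.

Here the training flow is: [categoryIndexer (the StringIndexer), encoder, assembler, dt]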

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder,VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Index the category strings as numbers
categoryIndexer = StringIndexer(inputCol='alchemy_category',outputCol='alchemy_category_Index')

# One-hot encode the category index
encoder = OneHotEncoder(dropLast=False,
                       inputCol='alchemy_category_Index',
                       outputCol='alchemy_category_IndexVec')
# Assemble the one-hot vector and the numeric columns (excluding label) into one vector
assemblerInput = ['alchemy_category_IndexVec'] + row_df.columns[4:-1]
assembler = VectorAssembler(
            inputCols=assemblerInput,
            outputCol='features'
)

# Classification model
dt = DecisionTreeClassifier(labelCol='label',featuresCol='features',impurity='gini',maxDepth=10,maxBins=14)

# Pipeline: stages run in this order
pipeline = Pipeline(stages=[categoryIndexer,encoder,assembler,dt])

Predicting with the Pipeline

# Fit the whole pipeline on the training data
pipelineModel = pipeline.fit(train_df)

# Predict on the test data
predicted = pipelineModel.transform(test_df)

The result is test_df with the generated columns, including the predictions, appended at the end.

print(predicted.columns)

['url', 'alchemy_category', 'alchemy_category_score', 'avglinksize', 'commonlinkratio_1', 
'commonlinkratio_2', 'commonlinkratio_3', 'commonlinkratio_4', 'compression_ratio', 
'embed_ratio', 'framebased', 'frameTagRatio', 'hasDomainLink', 'html_ratio', 'image_ratio',
 'is_news', 'lengthyLinkDomain', 'linkwordscore', 'news_front_page', 
'non_markup_alphanum_characters', 'numberOfLinks', 'numwords_in_url', 
'parametrizedLinkRatio', 'spelling_errors_ratio', 'label', 'alchemy_category_Index', 
'alchemy_category_IndexVec', 'features', 'rawPrediction', 'probability', 'prediction']

The last three columns are new: rawPrediction (used when evaluating the model), probability (the class probabilities), and prediction (the predicted label).
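The fit-then-transform chaining of a pipeline can be pictured with a toy version in plain Python. This is a sketch of the idea only: the classes `Scale`, `Threshold`, and `ToyPipeline` are invented for the example, they work on lists of numbers instead of DataFrames, and unlike Spark they store the fitted state on the stage itself rather than returning a separate model object.

```python
class Scale:
    """Toy stage: learns the maximum during fit, divides by it in transform."""
    def fit(self, data):
        self.max_ = max(data)
        return self

    def transform(self, data):
        return [x / self.max_ for x in data]


class Threshold:
    """Toy final stage: learns the mean during fit, outputs 1.0 above it, else 0.0."""
    def fit(self, data):
        self.mean_ = sum(data) / len(data)
        return self

    def transform(self, data):
        return [1.0 if x > self.mean_ else 0.0 for x in data]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        # Each stage is fitted on the output of the stages before it
        for stage in self.stages:
            stage.fit(data)
            data = stage.transform(data)
        return self

    def transform(self, data):
        # New data passes through the already-fitted stages in the same order
        for stage in self.stages:
            data = stage.transform(data)
        return data


model = ToyPipeline([Scale(), Threshold()]).fit([2.0, 4.0, 6.0, 8.0])
print(model.transform([1.0, 7.0]))  # [0.0, 1.0]
```

The point to carry over is that everything learned during fit (here the maximum and the mean) comes from the training data only and is then reused unchanged on the test data.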

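Evaluating the Pipeline model

The model predicts either 0 or 1, so this is binary classification and BinaryClassificationEvaluator() is the matching evaluator.

Parameters:

rawPredictionCol: the rawPrediction column
labelCol: the column holding the true labels
metricName: the metric used to score the model

Creating the Evaluator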

from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol='rawPrediction',
    labelCol='label',
    metricName='areaUnderROC' # use AUC, the area under the ROC curve
)
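For intuition about the metric, AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it that way in plain Python on four made-up examples; it illustrates the definition and is not how the evaluator computes the value internally.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and scores: 3 of the 4 positive/negative pairs are ordered correctly
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.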

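Hyperparameter tuning

Hyperparameters are tuned with a grid search.

A denser grid takes longer to search but is more likely to contain a better-performing setting, so the grid size is a trade-off between time and accuracy.

Train/validation split: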

from pyspark.ml.tuning import ParamGridBuilder,TrainValidationSplit

# Parameter grid to search
paramGrid = ParamGridBuilder()\
        .addGrid(dt.impurity,['gini','entropy'])\
        .addGrid(dt.maxDepth,[5,10,15])\
        .addGrid(dt.maxBins,[10,15,20])\
        .build()

tvs = TrainValidationSplit(estimator=dt,evaluator=evaluator,estimatorParamMaps=paramGrid,trainRatio=0.8)

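# estimator: the model to tune
# evaluator: the metric used to compare settings
# estimatorParamMaps: the parameter grid
# trainRatio=0.8: fraction of the data used for training; the rest is used for validation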


tvs_pipline= Pipeline(stages=[categoryIndexer,encoder,assembler,tvs])

tvs_piplineModel = tvs_pipline.fit(train_df)

# Stage index 3 (the fourth stage) is the fitted tvs, which holds the best model
bestModel=tvs_piplineModel.stages[3].bestModel
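The size of the grid and the selection step can be sketched in plain Python. The parameter values match the grid above, giving 2 × 3 × 3 = 18 combinations. The function `validation_score` is a made-up formula standing in for "fit on the 80% part, compute AUC on the 20% part"; its numbers mean nothing and only serve to show how the best combination is picked.

```python
from itertools import product

grid = {
    "impurity": ["gini", "entropy"],
    "maxDepth": [5, 10, 15],
    "maxBins": [10, 15, 20],
}

# Every combination of one value per parameter
param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(param_maps))  # 18

def validation_score(params):
    # Hypothetical stand-in for training a model and scoring it on the validation part
    return 0.6 + 0.01 * params["maxDepth"] - 0.001 * params["maxBins"]

# Train/validation split tuning keeps the setting with the best validation score
best = max(param_maps, key=validation_score)
print(best["maxDepth"], best["maxBins"])  # 15 10
```

Each extra value added to any one parameter multiplies the number of models to train, which is where the time cost of a denser grid comes from.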

Cross-validation

from pyspark.ml.tuning import CrossValidator

cv = CrossValidator(estimator=dt,evaluator=evaluator,estimatorParamMaps=paramGrid,numFolds=3)
# numFolds=3: the number of cross-validation folds

cv_pipeline = Pipeline(stages=[categoryIndexer,encoder,assembler,cv])
cv_pipelineModel = cv_pipeline.fit(train_df)
bestModel = cv_pipelineModel.stages[3].bestModel
bestModel
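How k-fold cross-validation rotates the validation part can be sketched on row indices. This is an illustration under simplifying assumptions: the helper `k_fold_indices` is hypothetical and assigns rows to folds in a fixed interleaved pattern, whereas a real cross-validator assigns them randomly.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (training, validation) index lists; each fold validates exactly once."""
    folds = [list(range(i, n_rows, num_folds)) for i in range(num_folds)]
    for k in range(num_folds):
        validation = folds[k]
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, validation

for training, validation in k_fold_indices(9, 3):
    print(sorted(training), validation)

# With the 18-combination grid above and 3 folds, tuning trains 18 * 3 models
num_param_maps = 18
total_fits = num_param_maps * 3
print(total_fits)  # 54
```

Compared with a single train/validation split, every row is used for validation once, at roughly numFolds times the training cost.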

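Classification with a random forest

A decision tree predicts with a single tree, whereas a random forest (RF) builds many trees at once. Combining many trees reduces the high variance of a single tree and improves generalization.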
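The variance argument can be illustrated with a small simulation in plain Python. It is a simplified sketch, not a random forest: `noisy_estimate` is a hypothetical high-variance predictor (a true value of 0.5 plus Gaussian noise), and the estimates are independent, while real trees in a forest are only partly decorrelated, so the reduction there is smaller.

```python
import random
import statistics

rng = random.Random(0)

def noisy_estimate():
    # Hypothetical high-variance estimator: true value 0.5 plus Gaussian noise
    return 0.5 + rng.gauss(0.0, 0.2)

# Predictions from a single estimator, repeated many times
single = [noisy_estimate() for _ in range(2000)]

# Predictions obtained by averaging 10 estimators each time
averaged = [statistics.mean(noisy_estimate() for _ in range(10))
            for _ in range(2000)]

print(statistics.variance(single) > statistics.variance(averaged))  # True
```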

from pyspark.ml.classification import RandomForestClassifier


rf = RandomForestClassifier(labelCol='label',
                           featuresCol='features',
                           numTrees=10)

# numTrees=10: the number of trees in the random forest

rfpipeline = Pipeline(stages=[categoryIndexer,encoder,assembler,rf])


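# paramGrid above was built from dt's parameters, so build the same grid on rf's own parameters
rfParamGrid = ParamGridBuilder()\
        .addGrid(rf.impurity,['gini','entropy'])\
        .addGrid(rf.maxDepth,[5,10,15])\
        .addGrid(rf.maxBins,[10,15,20])\
        .build()

rfcv = CrossValidator(estimator=rf,evaluator=evaluator,estimatorParamMaps=rfParamGrid,numFolds=3)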
rfcv_pipeline = Pipeline(stages=[categoryIndexer,encoder,assembler,rfcv])
rfcv_pipelineModel = rfcv_pipeline.fit(train_df)
rf_prediction = rfcv_pipelineModel.transform(test_df)


# Check the AUC value
auc = evaluator.evaluate(rf_prediction)

auc

0.7454870754112359

Compared with the single decision tree, the RF's AUC is noticeably higher.