《深度实践Spark机器学习》, Chapter 11: PySpark Decision Tree Models

Since the book does not ship with code, all of the code below was tested by me in an ipynb notebook; the runtime environment is HDP 2.6.2 and Anaconda2 (Python 2).

The complete ipynb and py code is available at:

https://gitee.com/iscas/deep_spark_ml/tree/master


11.3 Loading the Data
Remove the header line:
sed 1d train.tsv > train_noheader.tsv

Upload the file to HDFS:
hdfs dfs -put train_noheader.tsv /u01/bigdata/data/




11.4 Data Exploration
Start PySpark with the IPython notebook as the driver:
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
Path = "hdfs://XX:8020/u01/bigdata/"
raw_data = sc.textFile(Path + "data/train_noheader.tsv")
raw_data.take(2)

Check the total number of rows:
numRaw = raw_data.count()
numRaw

Count by key:
raw_data.countByKey()
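
Note that countByKey expects an RDD of key-value pairs; called on raw text lines it simply groups by the first character of each line, so the result is not very informative here. As a hedged alternative sketch (assuming the label is the last tab-separated field of train.tsv), the rows can instead be counted per label:

# count rows per label (last tab-separated field); an illustrative sketch
labelCounts = raw_data.map(lambda line: line.split('\t')[-1].replace('"', '')).countByValue()
labelCounts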

11.5 Data Preprocessing
1) Split each record on the tab character (the non-feature columns, such as the URL and raw page content, will be filtered out later):
records = raw_data.map(lambda line: line.split('\t'))
records.first()

2) Check the number of columns in each row:
len(records.first())

3) Return all elements of the RDD to the driver as a list:
data = records.collect()

4) Check how many columns one row of data has:
numColumns = len(data[0])
numColumns

5) Clean the data and build a list data1 holding the processed records in the form [(label_1, feature_1), (label_2, feature_2), ...].
The cleaning steps are:
1) Remove the quotation marks
2) Convert the label column (the last column) to an integer
3) Replace "?" in the feature columns with 0.0
The code is as follows:
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier

data1 = []
for i in range(numRaw):
    trimmed = [ each.replace('"', "") for each in data[i] ]
    label = int(trimmed[-1])
    features = map(lambda x: 0.0 if x == "?" else x, trimmed[4: numColumns-1]) # keep only the feature columns (5 through 26); the last column is the label
    c = (label, Vectors.dense(map(float, features)))
    data1.append(c)

Inspect data1:
data1[0] # the output is shown below
(0, DenseVector([0.7891, 2.0556, 0.6765, 0.2059, 0.0471, 0.0235, 0.4438, 0.0, 0.0, 0.0908, 0.0, 0.2458, 0.0039, 1.0, 1.0, 24.0, 0.0, 5424.0, 170.0, 8.0, 0.1529, 0.0791]))


11.6 Building the Decision Tree Model
1) Convert data1 into a DataFrame, with label as the label column and features as the feature column:
df = spark.createDataFrame(data1, ["label", "features"])
df.show(2)

# show the schema of df
df.printSchema()
The output is:
root
|-- label: long (nullable = true)
|-- features: vector (nullable = true)
 
2) Cache df in memory:
df.cache()

3) Build the feature indexer (features with at most 24 distinct values are treated as categorical):
from pyspark.ml.feature import VectorIndexer
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(df)

4) Split the data into an 80% training set and a 20% test set:
(trainData, testData) = df.randomSplit([0.8, 0.2], seed=1234L) # a fixed seed makes the split reproducible across runs
trainData.count()
testData.count()

5) Specify the decision tree's maximum depth, label column, features column, and entropy as the impurity measure (the model is trained through the pipeline below):
dt = DecisionTreeClassifier(maxDepth=5, labelCol="label", featuresCol="indexedFeatures", impurity="entropy")

6) Build the pipeline workflow and train the model:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = [featureIndexer, dt])
model = pipeline.fit(trainData) # train the model
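
To get a feel for what the model learned, the fitted tree stage of the pipeline can be inspected; a minimal sketch, assuming the pipeline built above (the tree is the second stage):

# the second pipeline stage is the fitted DecisionTreeClassificationModel
treeModel = model.stages[1]
treeModel.depth # depth actually reached by the tree
treeModel.numNodes # number of nodes in the tree
print(treeModel.toDebugString) # full set of learned split rules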


11.7 Making Predictions with the Trained Model
1) Predict using the feature values of the first row of the dataset:
test0 = spark.createDataFrame([(data1[0][1],)], ["features"])
result = model.transform(test0)
result.show()
Output:
+--------------------+--------------------+-------------+--------------------+----------+
|            features|     indexedFeatures|rawPrediction|         probability|prediction|
+--------------------+--------------------+-------------+--------------------+----------+
|[0.789131,2.05555...|[0.789131,2.05555...|[564.0,578.0]|[0.49387040280210...|       1.0|
+--------------------+--------------------+-------------+--------------------+----------+

result.select(['prediction']).show() # select only the prediction column
Output:
+----------+
|prediction|
+----------+
|       1.0|
+----------+

2) Copy the first row's feature values into a plain Python list so that individual values can be modified, then build a new vector for prediction. The 1st and 2nd values are displayed below; reassigning them (for example firstRaw[0] = 0.2) would change the input, while the output shown here keeps the original values:
firstRaw = list(data1[0][1])
firstRaw[0]
firstRaw[1]

predictData = Vectors.dense(firstRaw)
predictData
Output:
DenseVector([0.7891, 2.0556, 0.6765, 0.2059, 0.0471, 0.0235, 0.4438, 0.0, 0.0, 0.0908, 0.0, 0.2458, 0.0039, 1.0, 1.0, 24.0, 0.0, 5424.0, 170.0, 8.0, 0.1529, 0.0791])

3) Predict on the new data:
predictRaw = spark.createDataFrame([(predictData,)], ["features"])
predictResult = model.transform(predictRaw)
predictResult.show()
Output:
+--------------------+--------------------+-------------+--------------------+----------+
|            features|     indexedFeatures|rawPrediction|         probability|prediction|
+--------------------+--------------------+-------------+--------------------+----------+
|[0.789131,2.05555...|[0.789131,2.05555...|[564.0,578.0]|[0.49387040280210...|       1.0|
+--------------------+--------------------+-------------+--------------------+----------+

4) Evaluate with the test data:
# run the model over the test set
predictResultAll = model.transform(testData)

predictResultAll.select(['prediction']).show()

# the predictions come back as a DataFrame whose rows are immutable Row objects
# convert them to pandas and then to a plain list
df_predict = predictResultAll.select(['prediction']).toPandas()
dtPredict = list(df_predict.prediction)

# look at the first 10 predictions
dtPredict[:10]

# tally up the correct predictions
dtTotalCorrect = 0

# number of rows in the test set
testRaw = testData.count()
#testLabel = testData.select("label").collect() # this returns Row objects rather than a plain list
df_test = testData.select(['label']).toPandas()
testLabel = list(df_test.label)
testLabel[:10]

for i in range(testRaw):
    if dtPredict[i] == testLabel[i]:
        dtTotalCorrect += 1

1.0 * dtTotalCorrect / testRaw 
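
Converting the predictions to pandas works, but the same accuracy can also be computed directly on the DataFrame without collecting it to the driver; a minimal sketch, assuming Spark 2.0+ where the "accuracy" metric of MulticlassClassificationEvaluator is available:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# compute accuracy on the prediction DataFrame directly
accEvaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accEvaluator.evaluate(predictResultAll)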


11.8 Model Optimization
11.8.1 Optimizing the Features
1) Bring in the code used earlier:
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorIndexer
from pyspark.ml import Pipeline
raw_data = sc.textFile(Path + "data/train_noheader.tsv")
numRaw = raw_data.count()
records = raw_data.map(lambda line: line.split('\t'))
data = records.collect()
numColumns = len(data[0])
data1 = []

2) The page category column has many distinct values, so extract it and handle it separately:
# strip the quotes from the page category column (index 3)
category = records.map(lambda x: x[3].replace("\"", ""))
categories = sorted(category.distinct().collect())
categories

3) Check the number of page categories:
numCategories = len(categories)
numCategories

4) We can use "1-of-K" (one-hot) encoding for the category, i.e. lists such as [0,0,0,0] and [1,0,0,0]. Define a function that returns the encoded list for a given category; a quick check follows the function definition:
def transform_category(x):
    markCategory = [0] * numCategories
    index = categories.index(x)
    markCategory[index] = 1
    return markCategory
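
As a quick check of the function above, encoding the first category in categories should give a list of length numCategories with a 1 in position 0:

transform_category(categories[0]) # e.g. [1, 0, 0, ..., 0]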

5) With this treatment, the single page-category feature is expanded into 14 indicator features, so the overall feature count grows by 14. Next, include these features when processing the data:
for i in range(numRaw):
    trimmed = [ each.replace('"', "") for each in data[i] ]
    label = int(trimmed[-1])
    cate = transform_category(trimmed[3]) # returns the one-hot list for this row's category
    features = cate + map(lambda x: 0.0 if x == "?" else x, trimmed[4: numColumns-1]) # prepend the category indicators to feature columns 5 through 26
    c = (label, Vectors.dense(map(float, features)))
    data1.append(c)

6) Create the DataFrame:
df = spark.createDataFrame(data1, ["label", "features"])
df.cache()

7) Build the feature indexer:
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(df)

8) Split the data into an 80% training set and a 20% test set:
(trainData, testData) = df.randomSplit([0.8, 0.2], seed=1234L) # a fixed seed makes the split reproducible across runs
trainData.count()
testData.count()

9) Create the decision tree model:
dt = DecisionTreeClassifier(maxDepth=5, labelCol="label", featuresCol="indexedFeatures", impurity="entropy")

10) Build the pipeline workflow and train the model:
pipeline = Pipeline(stages = [featureIndexer, dt])
model = pipeline.fit(trainData) # train the model

11) Evaluate the accuracy of the new decision tree on the test data:
predictResultAll = model.transform(testData)
df_predict = predictResultAll.select(['prediction']).toPandas()
dtPredict = list(df_predict.prediction)

# tally up the correct predictions
dtTotalCorrect = 0

# number of rows in the test set
testRaw = testData.count()
#testLabel = testData.select("label").collect() # this returns Row objects rather than a plain list
df_test = testData.select(['label']).toPandas()
testLabel = list(df_test.label)

for i in range(testRaw):
    if dtPredict[i] == testLabel[i]:
        dtTotalCorrect += 1

1.0 * dtTotalCorrect / testRaw 


11.8.2 Cross-Validation and Parameter Grids
# import cross-validation and the parameter grid builder
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# import the binary classification evaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator() # create an evaluator
# set up the parameter grid
paramGrid = ParamGridBuilder().addGrid(dt.maxDepth, [4,5,6]).build()
# configure cross-validation
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=2)  # use 3+ folds in practice
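
The grid above only varies maxDepth; other DecisionTreeClassifier parameters can be searched in the same way. A sketch of a broader grid (not run in the book):

# a wider grid that also varies the impurity measure
paramGridWide = ParamGridBuilder().addGrid(dt.maxDepth, [4,5,6,8]).addGrid(dt.impurity, ["entropy", "gini"]).build()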
  
# train the model via cross-validation and choose the best set of parameters
cvModel = crossval.fit(trainData)
# evaluate the model on the test set
predictResultAll = cvModel.transform(testData)
df_predict = predictResultAll.select(['prediction']).toPandas()
dtPredict = list(df_predict.prediction)

dtTotalCorrect = 0
testRaw = testData.count()
df_test = testData.select(['label']).toPandas()
testLabel = list(df_test.label)

for i in range(testRaw):
    if dtPredict[i] == testLabel[i]:
        dtTotalCorrect += 1

1.0 * dtTotalCorrect / testRaw 
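
Since a BinaryClassificationEvaluator was already created above, the test-set AUC (its default metric, areaUnderROC, computed from the rawPrediction column) can also be checked directly; a minimal sketch:

# area under the ROC curve of the cross-validated model on the test set
evaluator.evaluate(predictResultAll)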


We can also inspect the specific parameters of the best model:
bestModel = cvModel.bestModel.stages[1]
bestModel

bestModel.numFeatures # the tree uses 36 features (22 numeric plus 14 category indicators)

bestModel.depth # depth of the best tree (one of the maxDepth values from the grid)

bestModel.numNodes # number of nodes in the tree

11.9 Running as a Script
Mind the tabs/indentation when copying the code; the runnable file is on Gitee.
# coding: utf-8
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession


if __name__ == "__main__":
    # run Spark locally
    sparkConf = SparkConf().setMaster("local[*]")
    sc = SparkContext(conf = sparkConf)
    spark = SparkSession.builder.master('local').appName("DecisionTree").config("spark.some.config.option", "some-value").getOrCreate()


Run the script with:
spark-submit ch11_decisionTree.py


