Since the book ships without code, all of the code below was tested by the author in an ipynb notebook, running on HDP 2.6.2 and Anaconda2 (Python 2).
The complete ipynb and .py files are available at:
https://gitee.com/iscas/deep_spark_ml/tree/master
11.3 Loading the Data
Strip the header row:
sed 1d train.tsv > train_noheader.tsv
Upload the file to HDFS:
hdfs dfs -put train_noheader.tsv /u01/bigdata/data/
11.4 Exploring the Data
Start pyspark inside a Jupyter notebook:
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
Path = "hdfs://XX:8020/u01/bigdata/"
raw_data = sc.textFile(Path + "data/train_noheader.tsv")
raw_data.take(2)
Count the total number of lines:
numRaw = raw_data.count()
numRaw
Count by key (note: on an RDD of plain strings, countByKey effectively groups lines by their first character, so it is of limited use on this data):
raw_data.countByKey()
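What we usually want from this kind of exploration is the class balance. A minimal plain-Python sketch of counting labels (the two rows below are made-up stand-ins for the TSV layout, where the label is the quoted last field):

```python
from collections import Counter

# Two hypothetical rows in the TSV layout used here: the label is the last field.
sample_lines = [
    'u1\t"?"\t0.79\t"business"\t0.1\t"0"',
    'u2\t"?"\t0.46\t"recreation"\t0.2\t"1"',
]

# Split each line on tabs and count the (quoted) label in the last field.
label_counts = Counter(line.split('\t')[-1].strip('"') for line in sample_lines)
print(label_counts)  # Counter({'0': 1, '1': 1})
```

On the real RDD the same idea is `raw_data.map(lambda line: line.split('\t')[-1]).countByValue()`.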
11.5 Preprocessing the Data
1) Split each line on tab characters (the timestamp and page-content columns are dropped later, during cleaning):
records = raw_data.map(lambda line: line.split('\t'))
records.first()
3) Check the number of columns in the first row:
len(records.first())
4) Return all elements of the RDD as a Python list:
data = records.collect()
5) Check how many columns one row of data has:
numColumns = len(data[0])
numColumns
6) Clean the data and store the result in a list data1 with the format [(label_1, feature_1), (label_2, feature_2), ...].
The cleaning steps are:
1) strip the quotation marks;
2) convert the label column (the last column) to an integer;
3) replace "?" in the feature columns with 0.0.
The code is as follows:
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
data1 = []
for i in range(numRaw):
    trimmed = [each.replace('"', '') for each in data[i]]
    label = int(trimmed[-1])
    # keep only columns 5 to 26 (indices 4..numColumns-2), replacing '?' with 0.0
    features = [0.0 if x == '?' else x for x in trimmed[4:numColumns - 1]]
    c = (label, Vectors.dense([float(x) for x in features]))
    data1.append(c)
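The cleaning logic can be exercised without Spark. A minimal sketch on a single made-up row (the field values are placeholders, not real dataset rows):

```python
# A made-up 8-column row in the same shape as the dataset:
# indices 0-3 are metadata, 4..n-2 are numeric features, n-1 is the label.
row = ['"http://x"', '"123"', '"0.5"', '"business"', '"0.79"', '"?"', '"2.0"', '"1"']

trimmed = [field.replace('"', '') for field in row]   # strip quotes
label = int(trimmed[-1])                              # last column is the label
features = [0.0 if x == '?' else float(x)             # '?' means missing -> 0.0
            for x in trimmed[4:len(trimmed) - 1]]
print(label, features)  # 1 [0.79, 0.0, 2.0]
```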
Inspect data1:
data1[0]  # the output is shown below
(0, DenseVector([0.7891, 2.0556, 0.6765, 0.2059, 0.0471, 0.0235, 0.4438, 0.0, 0.0, 0.0908, 0.0, 0.2458, 0.0039, 1.0, 1.0, 24.0, 0.0, 5424.0, 170.0, 8.0, 0.1529, 0.0791]))
11.6 Building the Decision Tree Model
1) Convert data1 into a DataFrame, with label as the label column and features as the feature column:
df = spark.createDataFrame(data1, ["label", "features"])
df.show(2)
# display the schema of df
df.printSchema()
The output:
root
 |-- label: long (nullable = true)
 |-- features: vector (nullable = true)
2) Cache df in memory:
df.cache()
3) Build the feature indexer:
from pyspark.ml.feature import VectorIndexer
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(df)
4) Split the data 80/20 into training and test sets:
(trainData, testData) = df.randomSplit([0.8, 0.2], seed=1234)  # a fixed seed makes the split reproducible
trainData.count()
testData.count()
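randomSplit assigns each row independently at random, so the two counts above fluctuate around an 80/20 split rather than hitting it exactly; the fixed seed only makes the assignment repeatable. A plain-Python sketch of the idea (not Spark's actual algorithm):

```python
import random

def bernoulli_split(rows, fraction, seed):
    """Assign each row independently, like DataFrame.randomSplit."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_a, test_a = bernoulli_split(rows, 0.8, seed=1234)
train_b, test_b = bernoulli_split(rows, 0.8, seed=1234)

# Same seed -> identical split; the sizes are only approximately 80/20.
print(train_a == train_b, len(train_a), len(test_a))
```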
5) Create the decision tree classifier, specifying the maximum depth, the label and feature columns, and entropy as the impurity measure:
dt = DecisionTreeClassifier(maxDepth=5, labelCol="label", featuresCol="indexedFeatures", impurity="entropy")
6) Build the pipeline:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = [featureIndexer, dt])
model = pipeline.fit(trainData)  # train the model
11.7 Making Predictions with the Model
2) Predict on the feature values of the first row of the dataset:
test0 = spark.createDataFrame([(data1[0][1],)], ["features"])
result = model.transform(test0)
result.show()
The output:
+--------------------+--------------------+-------------+--------------------+----------+
| features| indexedFeatures|rawPrediction| probability|prediction|
+--------------------+--------------------+-------------+--------------------+----------+
|[0.789131,2.05555...|[0.789131,2.05555...|[564.0,578.0]|[0.49387040280210...| 1.0|
+--------------------+--------------------+-------------+--------------------+----------+
result.select(['prediction']).show()  # select only the predicted value
The output:
+----------+
|prediction|
+----------+
| 1.0|
+----------+
3) Change two of the first row's feature values (here the 1st and 2nd) and predict on the modified features. Note that the values must actually be reassigned; as printed in the book the code only read them. The 0.5 and 1.0 below are arbitrary illustrative replacements:
firstRaw = list(data1[0][1])
firstRaw[0] = 0.5  # replace the 1st feature value (arbitrary new value)
firstRaw[1] = 1.0  # replace the 2nd feature value (arbitrary new value)
predictData = Vectors.dense(firstRaw)
predictData
The result is the original DenseVector with its first two entries replaced by 0.5 and 1.0; the remaining entries are unchanged.
4) Predict on the new data:
predictRaw = spark.createDataFrame([(predictData,)], ["features"])
predictResult = model.transform(predictRaw)
predictResult.show()
The output has the same columns as before (features, indexedFeatures, rawPrediction, probability, prediction); whether the predicted class changes depends on whether the altered values cross a split threshold in the tree.
5) Evaluate on the test set:
# run the model over the test set
predictResultAll = model.transform(testData)
predictResultAll.select(['prediction']).show()
# the prediction column comes back as a DataFrame of Row objects, which are immutable,
# so convert it to pandas and then to a plain list
df_predict = predictResultAll.select(['prediction']).toPandas()
dtPredict = list(df_predict.prediction)
# look at the first 10 predictions
dtPredict[:10]
# tally the correct predictions
dtTotalCorrect = 0
# total number of rows in the test set
testRaw = testData.count()
# testLabel = testData.select("label").collect()  # collect() returns Row objects, not plain values
df_test = testData.select(['label']).toPandas()
testLabel = list(df_test.label)
testLabel[:10]
for i in range(testRaw):
    if dtPredict[i] == testLabel[i]:
        dtTotalCorrect += 1
1.0 * dtTotalCorrect / testRaw
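The comparison loop above is just an accuracy computation, which can be pulled out into a small function and tested in isolation (pyspark.ml.evaluation's MulticlassClassificationEvaluator can also compute accuracy directly on the prediction DataFrame). The predictions and labels below are made-up illustrative values:

```python
def accuracy(predictions, labels):
    """Fraction of positions where prediction equals label."""
    assert len(predictions) == len(labels)
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return float(correct) / len(labels)

# Hypothetical predictions and labels, not actual model output:
print(accuracy([1.0, 0.0, 1.0, 1.0], [1, 0, 0, 1]))  # 0.75
```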
11.8 Improving the Model
11.8.1 Improving the Features
1) Re-import what we need and reload the data:
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorIndexer
from pyspark.ml import Pipeline
raw_data = sc.textFile(Path + "data/train_noheader.tsv")
numRaw = raw_data.count()
records = raw_data.map(lambda line: line.split('\t'))
data = records.collect()
numColumns = len(data[0])
data1 = []
2) The page-category field takes many distinct values, so it needs separate handling:
# strip the quotes from the page-category field (the 4th field, index 3)
category = records.map(lambda x: x[3].replace("\"", ""))
categories = sorted(category.distinct().collect())
categories
3) Count the number of page categories:
numCategories = len(categories)
numCategories
4) We can use one-of-K (one-hot) encoding to turn the category into a binary indicator list such as [0,0,0,0] or [1,0,0,0]. Define a function that returns the indicator list for a given category:
def transform_category(x):
    markCategory = [0] * numCategories
    index = categories.index(x)
    markCategory[index] = 1
    return markCategory
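The encoding can be checked in isolation. A minimal sketch with a stand-in category list (the real run has 14 categories):

```python
categories = sorted(['business', 'recreation', 'health'])  # stand-in; real data has 14
numCategories = len(categories)

def transform_category(x):
    """One-hot encode x against the sorted category list."""
    markCategory = [0] * numCategories
    markCategory[categories.index(x)] = 1
    return markCategory

# sorted order is ['business', 'health', 'recreation']
print(transform_category('health'))  # [0, 1, 0]
```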
5) This turns the single page-category feature into 14 binary features, so the overall feature count grows by 14. Include them when building the feature vectors:
for i in range(numRaw):
    trimmed = [each.replace('"', '') for each in data[i]]
    label = int(trimmed[-1])
    cate = transform_category(trimmed[3])  # one-hot list for this row's category
    # prepend the one-hot features to columns 5-26, replacing '?' with 0.0
    features = cate + [0.0 if x == '?' else x for x in trimmed[4:numColumns - 1]]
    c = (label, Vectors.dense([float(x) for x in features]))
    data1.append(c)
6) Create the DataFrame object:
df = spark.createDataFrame(data1, ["label", "features"])
df.cache()
7) Build the feature indexer:
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(df)
8) Split the data 80/20 into training and test sets:
(trainData, testData) = df.randomSplit([0.8, 0.2], seed=1234)  # a fixed seed makes the split reproducible
trainData.count()
testData.count()
9) Create the decision tree model:
dt = DecisionTreeClassifier(maxDepth=5, labelCol="label", featuresCol="indexedFeatures", impurity="entropy")
10) Build the pipeline:
pipeline = Pipeline(stages = [featureIndexer, dt])
model = pipeline.fit(trainData)  # train the model
11) Measure the decision tree's accuracy on the test set again:
predictResultAll = model.transform(testData)
df_predict = predictResultAll.select(['prediction']).toPandas()
dtPredict = list(df_predict.prediction)
# tally the correct predictions
dtTotalCorrect = 0
# total number of rows in the test set
testRaw = testData.count()
# testLabel = testData.select("label").collect()  # collect() returns Row objects, not plain values
df_test = testData.select(['label']).toPandas()
testLabel = list(df_test.label)
for i in range(testRaw):
    if dtPredict[i] == testLabel[i]:
        dtTotalCorrect += 1
1.0 * dtTotalCorrect / testRaw
11.8.2 Cross-Validation and the Parameter Grid
# import cross-validation and the parameter grid builder
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# import the binary-classification evaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()  # initialize an evaluator
# set up the parameter grid
paramGrid = ParamGridBuilder().addGrid(dt.maxDepth, [4, 5, 6]).build()
# configure cross-validation
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=2)  # use 3+ folds in practice
# train with cross-validation and choose the best set of parameters
cvModel = crossval.fit(trainData)
# evaluate the model on the test set
predictResultAll = cvModel.transform(testData)
df_predict = predictResultAll.select(['prediction']).toPandas()
dtPredict = list(df_predict.prediction)
dtTotalCorrect = 0
testRaw = testData.count()
df_test = testData.select(['label']).toPandas()
testLabel = list(df_test.label)
for i in range(testRaw):
    if dtPredict[i] == testLabel[i]:
        dtTotalCorrect += 1
1.0 * dtTotalCorrect / testRaw
We can also inspect the parameters of the best model:
bestModel = cvModel.bestModel.stages[1]
bestModel
bestModel.numFeatures  # the tree uses 36 features (22 numeric + 14 one-hot)
bestModel.depth  # the depth chosen from the grid [4, 5, 6]
bestModel.numNodes  # number of nodes in the tree
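Conceptually, CrossValidator averages the evaluator's score over k folds for each parameter map and keeps the best one (then refits it on the full training set). A plain-Python sketch of that selection loop; the fold construction and the scores in toy_score are made up purely for illustration:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds."""
    folds = []
    size = n // k
    for i in range(k):
        start = i * size
        end = start + size if i < k - 1 else n
        folds.append(list(range(start, end)))
    return folds

def cross_validate(param_values, folds, score):
    """Average score over folds for each param; return the best param."""
    best_param, best_score = None, float('-inf')
    for p in param_values:
        avg = sum(score(p, f) for f in folds) / float(len(folds))
        if avg > best_score:
            best_param, best_score = p, avg
    return best_param

folds = k_fold_indices(10, 2)
# Toy score: pretend depth 5 generalizes best (made-up numbers).
toy_score = lambda depth, fold: {4: 0.60, 5: 0.65, 6: 0.62}[depth]
print(cross_validate([4, 5, 6], folds, toy_score))  # 5
```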
11.9 Running as a Script
When copying the code into a .py file, mind tabs and indentation; the runnable file is on Gitee (address above). Note that the coding declaration must sit at the top of the file:
# coding: utf-8
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # run Spark locally
    sparkConf = SparkConf().setMaster("local[*]")
    sc = SparkContext(conf=sparkConf)
    spark = SparkSession.builder.master("local").appName("DecisionTree").config("spark.some.config.option", "some-value").getOrCreate()

Submit the script with:
spark-submit ch11_decisionTree.py