1. Types of Classification Models
1.1 Linear Models
1.1.1 Logistic Regression
1.1.2 Linear Support Vector Machines
1.2 Naive Bayes Models
1.3 Decision Tree Models
2. Extracting Suitable Features from the Data
Classification models in MLlib operate on LabeledPoint(label: Double, features: Vector) objects, which encapsulate the target variable (the label) and the feature vector.
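For instance, a single training example can be wrapped as follows (the label and feature values here are illustrative only):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Wrap a label and a dense feature vector into the form MLlib expects:
val lp = LabeledPoint(1.0, Vectors.dense(0.789, 2.056, 0.676))
println(lp.label)     // 1.0
println(lp.features)  // [0.789,2.056,0.676]
```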
We extract features from the Kaggle/StumbleUpon evergreen classification dataset.
The dataset concerns whether web pages recommended on a page are ephemeral (short-lived; they quickly stop being popular) or evergreen (popular for a long time).
Running sed 1d train.tsv > train_noheader.tsv removes the header row from the file.
Now let's look at the code.
import org.apache.spark.mllib.classification.{ClassificationModel, LogisticRegressionWithSGD, NaiveBayes, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.optimization.{SimpleUpdater, SquaredL2Updater, Updater}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, Impurity}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Evergreen {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Evergreen").setMaster("local") // run in local mode
    val BASEDIR = "hdfs://pc1:9000/" // HDFS path
    //val BASEDIR = "file:///home/chenjie/" // local file path
    //val sparkConf = new SparkConf().setAppName("Evergreen-cluster").setMaster("spark://pc1:7077").setJars(List("untitled2.jar")) // run in cluster mode
    val sc = new SparkContext(sparkConf) // initialize the SparkContext
    val rawData = sc.textFile(BASEDIR + "train_noheader.tsv") // load the data
    println("rawData.first()=" + rawData.first()) // print the first record
    //"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"
    // "4042"
    // "{""title"":""IBM hic calies"",
    // ""body"":""A sign the tahe cwlett Packard Co t last."",
    // ""url"":""bloomberg news 2010 12 23 ibm predicts holographic calls air breathing batteries by 2015 html""}"
    // "business" "0.789131" "2.055555556" "0.676470588" "0.205882353" "0.047058824" "0.023529412" "0.443783175" "0" "0" "0.09077381" "0" "0.245831182" "0.003883495" "1" "1" "24" "0" "5424" "170" "8" "0.152941176" "0.079129575" "0"
The code above loads the dataset and inspects the first record. Note that each record contains the URL, the page ID, the raw text content, and the category assigned to the page. The next 22 columns contain various numeric or categorical features. The last column is the target: 1 means evergreen, 0 means ephemeral.
Because of formatting issues in the data, some cleaning is needed: strip the extra quotation marks, and replace missing values (recorded as "?") with 0.
The following code is added to main step by step:

val records = rawData.map(line => line.split("\t"))
println(records.first())
val data = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", "")) // strip the quotes
  val label = trimmed(r.size - 1).toInt // the last column is the class label
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble) // replace "?" with 0
  LabeledPoint(label, Vectors.dense(features))
}
data.cache()
val numData = data.count
println("numData=" + numData)
//numData=7395

// Before processing the dataset further, note that the numeric data contains negative feature values.
// Naive Bayes requires non-negative features and throws an exception when it encounters a negative one,
// so we build a separate copy of the feature vectors for naive Bayes with negative values set to 0.
val nbData = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
    .map(d => if (d < 0) 0.0 else d)
  LabeledPoint(label, Vectors.dense(features))
}
3. Training Classification Models
//------------ Training the classification models ------------------------------------------
val numItetations = 10
val maxTreeDepth = 5
val lrModel = LogisticRegressionWithSGD.train(data, numItetations)
val svmModel = SVMWithSGD.train(data, numItetations)
val nbModel = NaiveBayes.train(nbData)
val dtModel = DecisionTree.train(data, Algo.Classification, Entropy, maxTreeDepth)
// For the decision tree, the mode (Algo) is set to Classification and the Entropy impurity measure is used
4. Using the Classification Models
val dataPoint = data.first()
val trueLabel = dataPoint.label
println("true label: " + trueLabel)
val prediction1 = lrModel.predict(dataPoint.features)
val prediction2 = svmModel.predict(dataPoint.features)
val prediction3 = nbModel.predict(dataPoint.features)
val prediction4 = dtModel.predict(dataPoint.features)
println("lrModel prediction: " + prediction1)
println("svmModel prediction: " + prediction2)
println("nbModel prediction: " + prediction3)
println("dtModel prediction: " + prediction4)
/*
 * true label: 0.0
 * lrModel prediction: 1.0
 * svmModel prediction: 1.0
 * nbModel prediction: 1.0
 * dtModel prediction: 0.0
 */
// A whole RDD[Vector] can also be passed in as input for prediction:
/*
val predictions = lrModel.predict(data.map(lp => lp.features))
predictions.take(5).foreach(println)
*/
5. Evaluating Classification Model Performance
5.1 Prediction Accuracy and Error Rate
//-------- Evaluating performance: prediction accuracy and error rate --------------------------------
val lrTotalCorrect = data.map { point =>
  if (lrModel.predict(point.features) == point.label) 1 else 0
}.sum
val svmTotalCorrect = data.map { point =>
  if (svmModel.predict(point.features) == point.label) 1 else 0
}.sum
val nbTotalCorrect = nbData.map { point =>
  if (nbModel.predict(point.features) == point.label) 1 else 0
}.sum
val dtTotalCorrect = data.map { point =>
  val score = dtModel.predict(point.features)
  val predicted = if (score > 0.5) 1 else 0 // the decision tree returns a raw score that must be thresholded
  if (predicted == point.label) 1 else 0
}.sum
val lrAccuracy = lrTotalCorrect / numData
val svmAccuracy = svmTotalCorrect / numData
val nbAccuracy = nbTotalCorrect / numData
val dtAccuracy = dtTotalCorrect / numData
println("lrModel accuracy: " + lrAccuracy)
println("svmModel accuracy: " + svmAccuracy)
println("nbModel accuracy: " + nbAccuracy)
println("dtModel accuracy: " + dtAccuracy)
/*
 * lrModel accuracy: 0.5146720757268425
 * svmModel accuracy: 0.5146720757268425
 * nbModel accuracy: 0.5803921568627451
 * dtModel accuracy: 0.6482758620689655
 */
5.2 Precision and Recall
//-------- Evaluating performance: precision and recall --------------------------------
/** Precision measures the quality of the results; recall measures their completeness.
 *
 * For binary classification:
 *
 *              true positives (samples of class 1 predicted correctly)
 * precision = ------------------------------------------------------------------------
 *              true positives + false positives (samples wrongly predicted as class 1)
 *
 *              true positives (samples of class 1 predicted correctly)
 * recall    = ------------------------------------------------------------------------
 *              true positives + false negatives (samples wrongly predicted as class 0)
 *
 * The area under the precision-recall (PR) curve is the average precision.
 */
val metrics = Seq(lrModel, svmModel).map { model =>
  val scoreAndLabels = data.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val nbMetrics = Seq(nbModel).map { model =>
  val scoreAndLabels = nbData.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val dtMetrics = Seq(dtModel).map { model =>
  // evaluate the decision tree on data (the set it was trained on), not nbData
  val scoreAndLabels = data.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val allMetrics = metrics ++ nbMetrics ++ dtMetrics
allMetrics.foreach { case (model, pr, roc) =>
  println(f"$model, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
}
//LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//SVMModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//NaiveBayesModel, Area under PR : 68.0851%,Area under ROC: 58.3559%
//DecisionTreeModel, Area under PR : 74.3081%,Area under ROC: 64.8837%
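To make the two formulas concrete, here is a small worked example with hypothetical confusion-matrix counts (not taken from this dataset):

```scala
// Hypothetical counts for illustration only:
val tp = 60.0 // true positives
val fp = 20.0 // false positives
val fn = 40.0 // false negatives
val precision = tp / (tp + fp)
val recall    = tp / (tp + fn)
println(precision) // 0.75
println(recall)    // 0.6
```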
5.3 ROC Curve and AUC
//-------- Evaluating performance: ROC curve and AUC --------------------------------
/** The ROC curve is conceptually similar to the PR curve: it is a graphical depiction of
 * the classifier's true positive rate against its false positive rate.
 *
 *        true positives (samples of class 1 predicted correctly)
 * TPR = ------------------------------------------------------------------------
 *        true positives + false negatives (samples wrongly predicted as class 0)
 *
 * The TPR is analogous to recall and is also called sensitivity. The false positive rate
 * (FPR) is the number of false positives divided by the total number of negative samples.
 *
 * The ROC curve shows the trade-off between TPR and FPR as the decision threshold of the
 * classifier varies. The area under the ROC curve, called the AUC, summarizes performance
 * averaged over all thresholds.
 */
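A minimal sketch of how BinaryClassificationMetrics computes the AUC from (score, label) pairs, assuming a local SparkContext sc is available and the imports at the top of the post (the scores here are made up):

```scala
// Perfectly ranked scores: every positive outranks every negative, so AUC = 1.0
val scoreAndLabels = sc.parallelize(Seq(
  (0.9, 1.0), (0.8, 1.0), // positives with high scores
  (0.4, 0.0), (0.2, 0.0)  // negatives with low scores
))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(metrics.areaUnderROC()) // 1.0
```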
6. Improving Model Performance and Parameter Tuning
6.1 Feature Standardization
//------ Improving performance and tuning parameters -------------------------------------------
val vectors = data.map(lp => lp.features)
val matrix = new RowMatrix(vectors)
val matrixSummary = matrix.computeColumnSummaryStatistics()
println("column means:")
println(matrixSummary.mean)
//[0.41225805299526774,2.76182319198661,0.46823047328613876,0.21407992638350257,0.0920623607189991,0.04926216043908034,2.255103452212025,-0.10375042752143329,0.0,0.05642274498417848,0.02123056118999324,0.23377817665490225,0.2757090373659231,0.615551048005409,0.6603110209601082,30.077079107505178,0.03975659229208925,5716.598242055454,178.75456389452327,4.960649087221106,0.17286405047031753,0.10122079189276531]
println("column minima:")
println(matrixSummary.min)
//[0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.045564223,-1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0]
println("column maxima:")
println(matrixSummary.max)
//[0.999426,363.0,1.0,1.0,0.980392157,0.980392157,21.0,0.25,0.0,0.444444444,1.0,0.716883117,113.3333333,1.0,1.0,100.0,1.0,207952.0,4997.0,22.0,1.0,1.0]
println("column variances:")
println(matrixSummary.variance)
//[0.10974244167559023,74.30082476809655,0.04126316989120245,0.021533436332001124,0.009211817450882448,0.005274933469767929,32.53918714591818,0.09396988697611537,0.0,0.001717741034662896,0.020782634824610638,0.0027548394224293023,3.6837889196744116,0.2366799607085986,0.22433071201674218,415.87855895438463,0.03818116876739597,7.877330081138441E7,32208.11624742624,10.453009045764313,0.03359363403832387,0.0062775328842146995]
println("number of non-zero entries per column:")
println(matrixSummary.numNonzeros)
//[5053.0,7354.0,7172.0,6821.0,6160.0,5128.0,7350.0,1257.0,0.0,7362.0,157.0,7395.0,7355.0,4552.0,4883.0,7347.0,294.0,7378.0,7395.0,6782.0,6868.0,7235.0]
// The second column has a much higher variance and mean than the others. To make the data
// better match the assumptions of our models, we can standardize each feature to zero mean
// and unit standard deviation: subtract the column mean from each value and divide by the
// column standard deviation. Spark's StandardScaler does this conveniently.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors)
val scaledData = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
println("before scaling: " + data.first().features)
println("after scaling: " + scaledData.first().features)
//before scaling: [0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575]
//after scaling: [1.1376473364976751,-0.08193557169294784,1.0251398128933333,-0.05586356442541853,-0.4688932531289351,-0.35430532630793654,-0.3175352172363122,0.3384507982396541,0.0,0.8288221733153222,-0.14726894334628504,0.22963982357812907,-0.14162596909880876,0.7902380499177364,0.7171947294529865,-0.29799681649642484,-0.2034625779299476,-0.03296720969690467,-0.04878112975579767,0.9400699751165406,-0.10869848852526329,-0.27882078231369967]
// Now retrain on the standardized data. Only logistic regression is retrained here, since
// decision trees and naive Bayes are not affected by feature standardization.
val lrModelScaled = LogisticRegressionWithSGD.train(scaledData, numItetations)
val lrTotalCorrectScaled = scaledData.map { point =>
  if (lrModelScaled.predict(point.features) == point.label) 1 else 0
}.sum
val lrAccuracyScaled = lrTotalCorrectScaled / numData
val lrPredictionsVsTrue = scaledData.map { point =>
  (lrModelScaled.predict(point.features), point.label)
}
val lrMetricsScaled = new BinaryClassificationMetrics(lrPredictionsVsTrue)
val lrPr = lrMetricsScaled.areaUnderPR()
val lrRoc = lrMetricsScaled.areaUnderROC()
println(f"${lrModelScaled.getClass.getSimpleName}\n Accuracy: ${lrAccuracyScaled * 100}%2.4f%%\n Area under PR: ${lrPr * 100.0}%2.4f%%, Area under ROC: ${lrRoc * 100.0}%2.4f%%")
//LogisticRegressionModel
//Accuracy:62.0419%
//Area under PR : 72.7254%,Area under ROC: 61.9663%
// Compare with the earlier results:
//lrModel accuracy: 0.5146720757268425
//LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
// Accuracy and the ROC AUC improved substantially; this is the effect of feature standardization.
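We can check the first entry of the scaled vector by hand, using the column statistics printed above; this is plain arithmetic, z = (x - mean) / sqrt(variance):

```scala
val x        = 0.789131             // first feature of the first record
val mean     = 0.41225805299526774  // first entry of matrixSummary.mean
val variance = 0.10974244167559023  // first entry of matrixSummary.variance
val z = (x - mean) / math.sqrt(variance)
println(z) // ≈ 1.1376, matching the first element of the scaled vector
```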
6.2 Using Additional Features
//------------- Additional features --------------------------------------------------
// So far we have only used some of the features in the data.
val categories = records.map(r => r(3)).distinct().collect().zipWithIndex.toMap
val numCategories = categories.size
println(categories)
println("number of categories: " + numCategories)
//Map("weather" -> 0, "sports" -> 1, "unknown" -> 10, "computer_internet" -> 11, "?" -> 8, "culture_politics" -> 9, "religion" -> 4, "recreation" -> 7, "arts_entertainment" -> 5, "health" -> 12, "law_crime" -> 6, "gaming" -> 13, "business" -> 2, "science_technology" -> 3)
//number of categories: 14
// Represent the category feature as a vector of length 14: set the dimension matching each
// sample's category index to 1 and all the others to 0. We then treat this new feature
// vector just like the other numeric feature vectors.
val dataCategories = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  val otherFeatures = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
  val features = categoryFeatures ++ otherFeatures
  LabeledPoint(label, Vectors.dense(features))
}
println("first row: " + dataCategories.first())
//first row: (0.0,[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])
// The category feature has been turned into a 14-dimensional vector.
val scalerCats = new StandardScaler(withMean = true, withStd = true).fit(dataCategories.map(lp => lp.features))
val scaledDataCasts = dataCategories.map(lp =>
  LabeledPoint(lp.label, scalerCats.transform(lp.features))
)
scaledDataCasts.cache()
println("after scaling: " + scaledDataCasts.first())
//after scaling: (0.0,[-0.02326210589837061,-0.23272797709480803,2.7207366564548514,-0.2016540523193296,-0.09914991930875496,-0.38181322324318134,-0.06487757239262681,-0.4464212047941535,-0.6807527904251456,-0.22052688457880879,-0.028494000387023734,-0.20418221057887365,-0.2709990696925828,-0.10189469097220732,1.1376473364976751,-0.08193557169294784,1.0251398128933333,-0.05586356442541853,-0.4688932531289351,-0.35430532630793654,-0.3175352172363122,0.3384507982396541,0.0,0.8288221733153222,-0.14726894334628504,0.22963982357812907,-0.14162596909880876,0.7902380499177364,0.7171947294529865,-0.29799681649642484,-0.2034625779299476,-0.03296720969690467,-0.04878112975579767,0.9400699751165406,-0.10869848852526329,-0.27882078231369967])
val nbDataCategories = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  val otherFeatures = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
    .map(d => if (d < 0) 0.0 else d)
  val features = categoryFeatures ++ otherFeatures
  LabeledPoint(label, Vectors.dense(features))
}
println("first row: " + nbDataCategories.first())
//first row: (0.0,[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])
nbDataCategories.cache()
val lrModelScaledCats = LogisticRegressionWithSGD.train(scaledDataCasts, numItetations) // logistic regression with category features, standardized
val svmModelScaledCats = SVMWithSGD.train(scaledDataCasts, numItetations) // SVM with category features, standardized
val nbModelScaledCats = NaiveBayes.train(nbDataCategories) // naive Bayes with category features
val dtModelScaledCats = DecisionTree.train(dataCategories, Algo.Classification, Entropy, maxTreeDepth) // decision tree with category features
// Note: decision trees and naive Bayes are unaffected by standardization; moreover, scaling
// produces negative values, which naive Bayes cannot use, so both are trained on the unscaled data.
val lrTotalCorrectScaledCats = scaledDataCasts.map { point =>
  if (lrModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum
val svmTotalCorrectScaledCats = scaledDataCasts.map { point =>
  if (svmModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum
val nbTotalCorrectScaledCats = nbDataCategories.map { point =>
  if (nbModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum
val dtTotalCorrectScaledCats = dataCategories.map { point =>
  val score = dtModelScaledCats.predict(point.features)
  val predicted = if (score > 0.5) 1 else 0
  if (predicted == point.label) 1 else 0
}.sum
val lrAccuracyScaledCats = lrTotalCorrectScaledCats / numData
val svmAccuracyScaledCats = svmTotalCorrectScaledCats / numData
val nbAccuracyScaledCats = nbTotalCorrectScaledCats / numData
val dtAccuracyScaledCats = dtTotalCorrectScaledCats / numData
println("new lrModel accuracy: " + lrAccuracyScaledCats)
println("new svmModel accuracy: " + svmAccuracyScaledCats)
println("new nbModel accuracy: " + nbAccuracyScaledCats)
println("new dtModel accuracy: " + dtAccuracyScaledCats)
/* previous results:
 * lrModel accuracy: 0.5146720757268425
 * svmModel accuracy: 0.5146720757268425
 * nbModel accuracy: 0.5803921568627451
 * dtModel accuracy: 0.6482758620689655
 */
/* new results:
 * new lrModel accuracy: 0.6657200811359026
 * new svmModel accuracy: 0.6645030425963488
 * new nbModel accuracy: 0.5832319134550372
 * new dtModel accuracy: 0.6655848546315077
 */
val lrPredictionsVsTrueScaledCats = dataCategories.map { point =>
  (lrModelScaledCats.predict(point.features), point.label)
}
val lrMetricsScaledCats = new BinaryClassificationMetrics(lrPredictionsVsTrueScaledCats)
val lrPrScaledCats = lrMetricsScaledCats.areaUnderPR()
val lrRocScaledCats = lrMetricsScaledCats.areaUnderROC()
println(f"${lrModelScaledCats.getClass.getSimpleName}\n Accuracy: ${lrAccuracyScaledCats * 100}%2.4f%%\n Area under PR: ${lrPrScaledCats * 100.0}%2.4f%%, Area under ROC: ${lrRocScaledCats * 100.0}%2.4f%%")
//LogisticRegressionModel
//Accuracy:66.5720%
//Area under PR : 75.6015%,Area under ROC: 52.1977%
val metrics2 = Seq(lrModelScaledCats, svmModelScaledCats).map { model =>
  val scoreAndLabels = dataCategories.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val nbMetrics2 = Seq(nbModelScaledCats).map { model =>
  val scoreAndLabels = nbDataCategories.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val dtMetrics2 = Seq(dtModelScaledCats).map { model =>
  val scoreAndLabels = dataCategories.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val allMetrics2 = metrics2 ++ nbMetrics2 ++ dtMetrics2
allMetrics2.foreach { case (model, pr, roc) =>
  println(f"new $model, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
}
// previous:
//LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//SVMModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//NaiveBayesModel, Area under PR : 68.0851%,Area under ROC: 58.3559%
//DecisionTreeModel, Area under PR : 74.3081%,Area under ROC: 64.8837%
// new:
//new LogisticRegressionModel, Area under PR : 75.6015%,Area under ROC: 52.1977%
//new SVMModel, Area under PR : 75.5180%,Area under ROC: 54.1606%
//new NaiveBayesModel, Area under PR : 68.3386%,Area under ROC: 58.6397%
//new DecisionTreeModel, Area under PR : 75.8784%,Area under ROC: 66.5005%
6.3 Using the Correct Data Format
//-------- Using the correct data format ----------------------------------------------
// Now use only the category feature, i.e. the first 14 elements of the vector, since the
// 1-of-k encoded category feature better matches the naive Bayes model.
val nbDataOnlyCategories = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  LabeledPoint(label, Vectors.dense(categoryFeatures))
}
println("first row: " + nbDataOnlyCategories.first())
val nbModelScaledOnlyCats = NaiveBayes.train(nbDataOnlyCategories) // naive Bayes trained on the category feature only
val nbMetricsOnlyCats = Seq(nbModelScaledOnlyCats).map { model =>
  val scoreAndLabels = nbDataOnlyCategories.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
nbMetricsOnlyCats.foreach { case (model, pr, roc) =>
  println(f"new $model, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
}
//new NaiveBayesModel, Area under PR : 74.0522%,Area under ROC: 60.5138%
// Compared with the earlier result:
//NaiveBayesModel, Area under PR : 68.0851%,Area under ROC: 58.3559%
// the ROC AUC improved by about two percentage points (and the PR AUC by about six).
val nbTotalCorrectScaledOnlyCats = nbDataOnlyCategories.map { point =>
  if (nbModelScaledOnlyCats.predict(point.features) == point.label) 1 else 0
}.sum
val nbAccuracyScaledOnlyCats = nbTotalCorrectScaledOnlyCats / numData
println("new nbModel accuracy: " + nbAccuracyScaledOnlyCats)
//new nbModel accuracy: 0.6096010818120352
// Compared with the earlier result:
//new nbModel accuracy: 0.5832319134550372
// an improvement of nearly three percentage points.
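The 1-of-k encoding used here can be sketched in isolation; the four-entry category map below is hypothetical, chosen only to keep the example small:

```scala
// Hypothetical category-to-index map, mirroring the categories map built above:
val cats = Map("business" -> 0, "sports" -> 1, "health" -> 2, "?" -> 3)
val idx = cats("sports")
val encoded = Array.ofDim[Double](cats.size) // all zeros
encoded(idx) = 1.0                           // one-hot: set only the matching index
println(encoded.mkString(",")) // 0.0,1.0,0.0,0.0
```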
6.4 Model Parameter Tuning
6.4.1 Tuning Linear Models
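The tuning snippets in this and the following sections call several helper functions (trainLRWithParams, createMetrics, trainDTWithParams, trainNBWithParams) that are never defined in this post. The definitions below are a sketch consistent with the call sites; they rely on the imports already shown at the top of the post, and the exact bodies are an assumption:

```scala
// Hedged sketch: plausible definitions for the helpers used by the tuning code.
def trainLRWithParams(input: RDD[LabeledPoint], regParam: Double,
    numIterations: Int, updater: Updater, stepSize: Double) = {
  val lr = new LogisticRegressionWithSGD
  lr.optimizer
    .setNumIterations(numIterations)
    .setUpdater(updater)
    .setRegParam(regParam)
    .setStepSize(stepSize)
  lr.run(input)
}

def trainDTWithParams(input: RDD[LabeledPoint], maxDepth: Int, impurity: Impurity) =
  DecisionTree.train(input, Algo.Classification, impurity, maxDepth)

def trainNBWithParams(input: RDD[LabeledPoint], lambda: Double) = {
  val nb = new NaiveBayes
  nb.setLambda(lambda)
  nb.run(input)
}

// Compute the ROC AUC of a model on a dataset, tagged with a label string:
def createMetrics(label: String, data: RDD[LabeledPoint], model: ClassificationModel) = {
  val scoreAndLabels = data.map(point => (model.predict(point.features), point.label))
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (label, metrics.areaUnderROC())
}
```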
//-------- Parameter tuning: linear models -----------------------------------------------------------------
scaledDataCasts.cache()
// (1) Effect of the number of iterations
val iterResults = Seq(1, 5, 10, 50).map { param =>
  val model = trainLRWithParams(scaledDataCasts, 0.0, param, new SimpleUpdater, 1.0)
  createMetrics(s"$param iterations", scaledDataCasts, model)
}
iterResults.foreach { case (param, auc) =>
  println(f"$param, AUC=${auc * 100}%2.4f%%")
}
/*
1 iterations, AUC=64.9520%
5 iterations, AUC=66.6161%
10 iterations, AUC=66.5483%
50 iterations, AUC=66.8143%
*/
// (2) Effect of the step size
val stepResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
  val model = trainLRWithParams(scaledDataCasts, 0.0, numItetations, new SimpleUpdater, param)
  createMetrics(s"$param step size", scaledDataCasts, model)
}
stepResults.foreach { case (param, auc) =>
  println(f"$param, AUC=${auc * 100}%2.4f%%")
}
/*
0.001 step size, AUC=64.9659%
0.01 step size, AUC=64.9644%
0.1 step size, AUC=65.5211%
1.0 step size, AUC=66.5483%
10.0 step size, AUC=61.9228%
*/
// (3) Effect of regularization
val regResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
  val model = trainLRWithParams(scaledDataCasts, param, numItetations, new SquaredL2Updater, 1.0)
  createMetrics(s"$param L2 regularization parameter", scaledDataCasts, model)
}
regResults.foreach { case (param, auc) =>
  println(f"$param, AUC=${auc * 100}%2.4f%%")
}
/*
0.001 L2 regularization parameter, AUC=66.5475%
0.01 L2 regularization parameter, AUC=66.5475%
0.1 L2 regularization parameter, AUC=66.5475%
1.0 L2 regularization parameter, AUC=66.5475%
10.0 L2 regularization parameter, AUC=66.5475%
*/
6.4.2 Tuning the Decision Tree
//-------- Parameter tuning: decision tree -----------------------------------------------------------------
// Tuning the tree depth
val dtResultsEntropy = Seq(1, 2, 3, 4, 5, 10, 20).map { param =>
  val model = trainDTWithParams(scaledDataCasts, param, Entropy)
  val scoreAndLabels = scaledDataCasts.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param tree depth with Entropy", metrics.areaUnderROC())
}
dtResultsEntropy.foreach { case (param, auc) =>
  println(f"$param, AUC=${auc * 100}%2.4f%%")
}
/*
1 tree depth, AUC=59.3268%
2 tree depth, AUC=59.3268%
3 tree depth, AUC=61.8313%
4 tree depth, AUC=62.1519%
5 tree depth, AUC=66.5005%
10 tree depth, AUC=75.9120%
20 tree depth, AUC=96.4347%
*/
// Tuning the impurity measure: Gini or Entropy
val dtResultsEntropy2 = Seq(1, 2, 3, 4, 5, 10, 20).map { param =>
  val model = trainDTWithParams(scaledDataCasts, param, Gini)
  val scoreAndLabels = scaledDataCasts.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param tree depth with Gini", metrics.areaUnderROC())
}
dtResultsEntropy2.foreach { case (param, auc) =>
  println(f"$param, AUC=${auc * 100}%2.4f%%")
}
/*
1 tree depth with Gini, AUC=59.3268%
2 tree depth with Gini, AUC=61.6106%
3 tree depth with Gini, AUC=61.8349%
4 tree depth with Gini, AUC=62.0433%
5 tree depth with Gini, AUC=66.4518%
10 tree depth with Gini, AUC=76.8962%
20 tree depth with Gini, AUC=98.3514%
*/
Note that these AUC values are computed on the training set, so the large gains at depths 10 and 20 mostly reflect overfitting rather than better generalization.
6.4.3 Tuning Naive Bayes
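The lambda parameter of NaiveBayes controls additive (Laplace) smoothing of the per-class feature counts. A quick sketch of its effect, using hypothetical counts (the numbers below are made up for illustration):

```scala
// Additive smoothing: P(feature | class) = (count + lambda) / (total + lambda * k)
val countFeatureInClass = 0.0  // hypothetical: feature never seen with this class
val totalInClass        = 100.0
val numFeatureValues    = 14.0 // k, e.g. the 14 category indicators
def smoothed(lambda: Double) =
  (countFeatureInClass + lambda) / (totalInClass + lambda * numFeatureValues)
println(smoothed(0.0)) // 0.0    -- an unseen feature zeroes out the whole product
println(smoothed(1.0)) // ≈ 0.0088 -- smoothing keeps the probability positive
```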
//-------- Parameter tuning: naive Bayes ----------------------------------------------------------------
val nbResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
  val model = trainNBWithParams(nbDataCategories, param)
  // Note: the model is trained on nbDataCategories but evaluated on scaledDataCasts here;
  // evaluating on nbDataCategories would be more consistent.
  val scoreAndLabels = scaledDataCasts.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param lambda", metrics.areaUnderROC())
}
nbResults.foreach { case (param, auc) =>
  println(f"$param, AUC=${auc * 100}%2.4f%%")
}
/*
0.001 lambda, AUC=61.2364%
0.01 lambda, AUC=61.3334%
0.1 lambda, AUC=61.4714%
1.0 lambda, AUC=61.5605%
10.0 lambda, AUC=61.8360%
*/

6.4.4 Cross-Validation
Split the dataset into a training set and a test set.
//--------- Cross-validation ---------------------------------------------------------------------------------
val trainTestSplit = scaledDataCasts.randomSplit(Array(0.6, 0.4), 123)
val train = trainTestSplit(0)
val test = trainTestSplit(1)
val regResultsTest = Seq(0.0, 0.001, 0.0025, 0.005, 0.01).map { param =>
  val model = trainLRWithParams(train, param, numItetations, new SquaredL2Updater, 1.0)
  createMetrics(s"$param L2 regularization parameter", test, model)
}
regResultsTest.foreach { case (param, auc) =>
  println(f"$param, AUC=${auc * 100}%2.4f%%")
}
完整代码:
import org.apache.spark.mllib.classification.{ClassificationModel, LogisticRegressionWithSGD, NaiveBayes, SVMWithSGD} import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics import org.apache.spark.mllib.feature.StandardScaler import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.linalg.distributed.RowMatrix import org.apache.spark.mllib.optimization.{SimpleUpdater, SquaredL2Updater, Updater} import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.configuration.Algo import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, Impurity} import org.apache.spark.rdd.RDD import org.apache.spark.{SparkConf, SparkContext} object Evergreen { def main(args: Array[String]): Unit = { val sparkConf = new SparkConf().setAppName("Evergreen").setMaster("local") //设置在本地模式运行 val BASEDIR = "hdfs://pc1:9000/" //HDFS文件 //val BASEDIR = "file:///home/chenjie/" // 本地文件 //val sparkConf = new SparkConf().setAppName("Evergreen-cluster").setMaster("spark://pc1:7077").setJars(List("untitled2.jar")) //设置在集群模式运行 val sc = new SparkContext(sparkConf) //初始化sc val rawData = sc.textFile(BASEDIR + "train_noheader.tsv") //加载数据 println("rawData.first()=" + rawData.first()) //打印第一条 //"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html" // "4042" // "{""title"":""IBM hic calies"", // ""body"":""A sign the tahe cwlett Packard Co t last."", // ""url"":""bloomberg news 2010 12 23 ibm predicts holographic calls air breathing batteries by 2015 html""}" // "business" "0.789131" "2.055555556" "0.676470588" "0.205882353" "0.047058824" "0.023529412" "0.443783175" "0" "0" "0.09077381" "0" "0.245831182" "0.003883495" "1" "1" "24" "0" "5424" "170" "8" "0.152941176" "0.079129575" "0" val records = rawData.map(line => line.split("\t")) println(records.first()) val data = records.map{ r => val trimmed = r.map(_.replaceAll("\"",""))//去掉引号 val label = 
trimmed(r.size - 1).toInt//得到最后一列,即类别信息 val features = trimmed.slice(4, r.size - 1).map( d => if(d == "?") 0.0 else d.toDouble)//将?用0代替 LabeledPoint(label, Vectors.dense(features)) } data.cache() val numData = data.count println("numData=" + numData) //numData=7395 val nbData = records.map{ r => val trimmed = r.map(_.replaceAll("\"","")) val label = trimmed(r.size - 1).toInt val features = trimmed.slice(4, r.size - 1).map( d => if(d == "?") 0.0 else d.toDouble) .map( d => if (d < 0) 0.0 else d) LabeledPoint(label, Vectors.dense(features)) } //在对数据集进一步处理之前,我们发现数值数据中包含负数特征值。我们知道,朴素贝叶斯模型要求特征值非负,否则遇到负的特征值就会抛出异常 //因此需要为朴素贝叶斯模型构建一份输入特征向量的数据,将负特征值设为0 //------------训练分类模型------------------------------------------------------------------ val numItetations = 10 val maxTreeDepth = 5 val lrModel = LogisticRegressionWithSGD.train(data, numItetations) val svmModel = SVMWithSGD.train(data, numItetations) val nbModel = NaiveBayes.train(nbData) val dtModel = DecisionTree.train(data, Algo.Classification, Entropy, maxTreeDepth) //在决策树中,设置模式或者Algo时使用了Entript不纯度估计 //-----------使用分类模型----------------------------------------- val dataPoint = data.first() val trueLabel = dataPoint.label println("真实分类:" + trueLabel) val prediction1 = lrModel.predict(dataPoint.features) val prediction2 = svmModel.predict(dataPoint.features) val prediction3 = nbModel.predict(dataPoint.features) val prediction4 = dtModel.predict(dataPoint.features) println("lrModel预测分类:" + prediction1) println("svmModel预测分类:" + prediction2) println("nbModel预测分类:" + prediction3) println("dtModel预测分类:" + prediction4) /* * 真实分类:0.0 lrModel预测分类:1.0 svmModel预测分类:1.0 nbModel预测分类:1.0 dtModel预测分类:0.0 * */ //也可以将RDD[Vector]整体作为输入做预测 /* val preditions = lrModel.predict(data.map(lp => lp.features)) preditions.take(5).foreach(println)*/ //--------评估分类模型的性能:预测的正确率和错误率-------------------------------- val lrTotalCorrect = data.map{ point => if(lrModel.predict(point.features) == point.label) 1 else 0 }.sum val svmTotalCorrect = data.map{ 
point => if(svmModel.predict(point.features) == point.label) 1 else 0 }.sum val nbTotalCorrect = nbData.map{ point => if(nbModel.predict(point.features) == point.label) 1 else 0 }.sum val dtTotalCorrect = data.map{ point => val socre = dtModel.predict(point.features) val predicted = if(socre > 0.5) 1 else 0 if(predicted == point.label) 1 else 0 }.sum val lrAccuracy = lrTotalCorrect / data.count val svmAccuracy = svmTotalCorrect / numData val nbAccuracy = nbTotalCorrect / numData val dtAccuracy = dtTotalCorrect / numData println("lrModel预测分类正确率:" + lrAccuracy) println("svmModel预测分类正确率:" + svmAccuracy) println("nbModel预测分类正确率:" + nbAccuracy) println("dtModel预测分类正确率:" + dtAccuracy) /* * lrModel预测分类正确率:0.5146720757268425 svmModel预测分类正确率:0.5146720757268425 nbModel预测分类正确率:0.5803921568627451 dtModel预测分类正确率:0.6482758620689655 * */ //--------评估分类模型的性能:准确率和召回律-------------------------------- /**准确率用于评价结果的质量,召回律用来评价结果的完整性 * * 真阳性的数目(被正确预测的类别为1的样本) * 在二分类的问题中,准确率= ------------------------- --------------------- * 真阳性的数目 + 假阳性的数目(被错误预测为类别1的样本) * * 真阳性的数目(被正确预测的类别为1的样本) * 召回率= --------------------------------------------- * 真阳性的数目 + 假阴性的数目(被错误预测为类别0的样本) * 准确率-召回率(PR)曲线下的面积为平均准确率 */ val metrics = Seq(lrModel, svmModel).map{ model => val scoreAndLabels = data.map{ point => (model.predict(point.features), point.label) } val metrics = new BinaryClassificationMetrics(scoreAndLabels) (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC()) } val nbMetrics = Seq(nbModel).map{ model => val scoreAndLabels = nbData.map{ point => val score = model.predict(point.features) (if (score > 0.5) 1.0 else 0.0, point.label) } val metrics = new BinaryClassificationMetrics(scoreAndLabels) (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC()) } val dtMetrics = Seq(dtModel).map{ model => val scoreAndLabels = nbData.map { point => val score = model.predict(point.features) (if (score > 0.5) 1.0 else 0.0, point.label) } val metrics = new 
BinaryClassificationMetrics(scoreAndLabels) (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC()) } val allMetrics = metrics ++ nbMetrics ++ dtMetrics allMetrics.foreach{ case (model,pr,roc) => println(f"$model, Area under PR : ${pr * 100.0}%2.4f%%,Area under ROC: ${roc * 100.0}%2.4f%%") } //LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418% //SVMModel, Area under PR : 75.6759%,Area under ROC: 50.1418% //NaiveBayesModel, Area under PR : 68.0851%,Area under ROC: 58.3559% //DecisionTreeModel, Area under PR : 74.3081%,Area under ROC: 64.8837% //--------评估分类模型的性能:ROC曲线和AUC-------------------------------- /**ROC曲线在概念上与PR曲线类似,它是对分类器的真阳性率-假阳性率的图形化解释。 * * 真阳性的数目(被正确预测的类别为1的样本) * 真阳性率= ----------------------------------------------- , 与召回率类似,也称为敏感度。 * 真阳性的数目 + 假阴性的数目(被错误预测为类别0的样本) * * ROC曲线表现了分类器性能在不同决策阈值下TPR对FPR的折衷。ROC下的面积,称为AUC,表示平均值。 * * */ //------改进模型性能以及参数调优------------------------------------------- val vectors = data.map(lp => lp.features) val matrix = new RowMatrix(vectors) val matrixSummary = matrix.computeColumnSummaryStatistics() println("每列的均值:") println(matrixSummary.mean) //[0.41225805299526774,2.76182319198661,0.46823047328613876,0.21407992638350257,0.0920623607189991,0.04926216043908034,2.255103452212025,-0.10375042752143329,0.0,0.05642274498417848,0.02123056118999324,0.23377817665490225,0.2757090373659231,0.615551048005409,0.6603110209601082,30.077079107505178,0.03975659229208925,5716.598242055454,178.75456389452327,4.960649087221106,0.17286405047031753,0.10122079189276531] println("每列的最小值:") //[0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.045564223,-1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0] println(matrixSummary.min) println("每列的最大值:") println(matrixSummary.max) //[0.999426,363.0,1.0,1.0,0.980392157,0.980392157,21.0,0.25,0.0,0.444444444,1.0,0.716883117,113.3333333,1.0,1.0,100.0,1.0,207952.0,4997.0,22.0,1.0,1.0] println("每列的方差:") println(matrixSummary.variance) 
//[0.10974244167559023,74.30082476809655,0.04126316989120245,0.021533436332001124,0.009211817450882448,0.005274933469767929,32.53918714591818,0.09396988697611537,0.0,0.001717741034662896,0.020782634824610638,0.0027548394224293023,3.6837889196744116,0.2366799607085986,0.22433071201674218,415.87855895438463,0.03818116876739597,7.877330081138441E7,32208.11624742624,10.453009045764313,0.03359363403832387,0.0062775328842146995]
println("Number of non-zero entries per column:")
println(matrixSummary.numNonzeros)
//[5053.0,7354.0,7172.0,6821.0,6160.0,5128.0,7350.0,1257.0,0.0,7362.0,157.0,7395.0,7355.0,4552.0,4883.0,7347.0,294.0,7378.0,7395.0,6782.0,6868.0,7235.0]

// The second column's mean and variance are much larger than the others'. To better match
// the assumptions of the linear models, we can standardize each feature to zero mean and
// unit standard deviation: subtract the column mean from each value, then divide by the
// column standard deviation. Spark's StandardScaler does this conveniently.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors)
val scaledData = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
println("Before standardization: " + data.first().features)
println("After standardization: " + scaledData.first().features)
//Before standardization: [0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575]
//After standardization: [1.1376473364976751,-0.08193557169294784,1.0251398128933333,-0.05586356442541853,-0.4688932531289351,-0.35430532630793654,-0.3175352172363122,0.3384507982396541,0.0,0.8288221733153222,-0.14726894334628504,0.22963982357812907,-0.14162596909880876,0.7902380499177364,0.7171947294529865,-0.29799681649642484,-0.2034625779299476,-0.03296720969690467,-0.04878112975579767,0.9400699751165406,-0.10869848852526329,-0.27882078231369967]

// Retrain on the standardized data. Only logistic regression is retrained here, because
// decision trees and naive Bayes are not affected by feature standardization.
val lrModelScaled = LogisticRegressionWithSGD.train(scaledData, numItetations)
val lrTotalCorrectScaled = scaledData.map{ point =>
  if (lrModelScaled.predict(point.features) == point.label) 1 else 0
}.sum
val lrAccuracyScaled = lrTotalCorrectScaled / numData
val lrPredictionsVsTrue = scaledData.map{ point =>
  (lrModelScaled.predict(point.features), point.label)
}
val lrMetricsScaled = new BinaryClassificationMetrics(lrPredictionsVsTrue)
val lrPr = lrMetricsScaled.areaUnderPR()
val lrRoc = lrMetricsScaled.areaUnderROC()
println(f"${lrModelScaled.getClass.getSimpleName}\n Accuracy: ${lrAccuracyScaled * 100}%2.4f%%\n Area under PR: ${lrPr * 100.0}%2.4f%%, Area under ROC: ${lrRoc * 100.0}%2.4f%%")
//LogisticRegressionModel
//Accuracy: 62.0419%
//Area under PR : 72.7254%, Area under ROC: 61.9663%
// Compared with the earlier results:
//   lrModel accuracy: 0.5146720757268425
//   LogisticRegressionModel, Area under PR : 75.6759%, Area under ROC: 50.1418%
// Accuracy and AUC improved substantially -- this is the effect of feature standardization.

//------------- Additional features --------------------------------------------------
// So far we have used only part of the data's features; the category column r(3) was ignored.
val categories = records.map(r => r(3)).distinct().collect().zipWithIndex.toMap
val numCategories = categories.size
println(categories)
println("Number of categories: " + numCategories)
//Map("weather" -> 0, "sports" -> 1, "unknown" -> 10, "computer_internet" -> 11, "?" -> 8, "culture_politics" -> 9, "religion" -> 4, "recreation" -> 7, "arts_entertainment" -> 5, "health" -> 12, "law_crime" -> 6, "gaming" -> 13, "business" -> 2, "science_technology" -> 3)
//Number of categories: 14

// Encode the category as a length-14 vector (1-of-k): set the dimension at the sample's
// category index to 1.0 and all others to 0.0. We then treat this new vector the same way
// as the other numeric features.
val dataCategories = records.map{ r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  val otherFeatures = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
  val features = categoryFeatures ++ otherFeatures
  LabeledPoint(label, Vectors.dense(features))
}
println("First row: " + dataCategories.first())
//First row: (0.0,[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])
// The category feature has been expanded into a 14-dimensional 1-of-k vector.
val scalerCats = new StandardScaler(withMean = true, withStd = true).fit(dataCategories.map(lp => lp.features))
val scaledDataCasts = dataCategories.map( lp =>
  LabeledPoint(lp.label, scalerCats.transform(lp.features))
)
scaledDataCasts.cache()
println("After standardization: " + scaledDataCasts.first())
//After standardization: (0.0,[-0.02326210589837061,-0.23272797709480803,2.7207366564548514,-0.2016540523193296,-0.09914991930875496,-0.38181322324318134,-0.06487757239262681,-0.4464212047941535,-0.6807527904251456,-0.22052688457880879,-0.028494000387023734,-0.20418221057887365,-0.2709990696925828,-0.10189469097220732,1.1376473364976751,-0.08193557169294784,1.0251398128933333,-0.05586356442541853,-0.4688932531289351,-0.35430532630793654,-0.3175352172363122,0.3384507982396541,0.0,0.8288221733153222,-0.14726894334628504,0.22963982357812907,-0.14162596909880876,0.7902380499177364,0.7171947294529865,-0.29799681649642484,-0.2034625779299476,-0.03296720969690467,-0.04878112975579767,0.9400699751165406,-0.10869848852526329,-0.27882078231369967])

// Naive Bayes requires non-negative features, so build a separate dataset that clamps
// negative values to 0.0:
val nbDataCategories = records.map{ r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  val otherFeatures = trimmed.slice(4, r.size - 1)
    .map(d => if (d == "?") 0.0 else d.toDouble)
    .map(d => if (d < 0) 0.0 else d)
  val features = categoryFeatures ++ otherFeatures
  LabeledPoint(label, Vectors.dense(features))
}
println("First row: " + nbDataCategories.first())
//First row: (0.0,[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])
nbDataCategories.cache()

val lrModelScaledCats = LogisticRegressionWithSGD.train(scaledDataCasts, numItetations) // logistic regression, category features added and standardized
val svmModelScaledCats = SVMWithSGD.train(scaledDataCasts, numItetations)               // SVM, category features added and standardized
val nbModelScaledCats = NaiveBayes.train(nbDataCategories)                              // naive Bayes, category features added (not standardized)
val dtModelScaledCats = DecisionTree.train(dataCategories, Algo.Classification, Entropy, maxTreeDepth) // decision tree, category features added (not standardized)
// Note: decision trees and naive Bayes are not affected by feature standardization; worse,
// standardization produces negative values that naive Bayes cannot accept, so both are
// trained on the unscaled data.
val lrTotalCorrectScaledCats = scaledDataCasts.map{ point =>
  if (lrModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum
val svmTotalCorrectScaledCats = scaledDataCasts.map{ point =>
  if (svmModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum
val nbTotalCorrectScaledCats = nbDataCategories.map{ point =>
  if (nbModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum
val dtTotalCorrectScaledCats = dataCategories.map{ point =>
  val score = dtModelScaledCats.predict(point.features)
  val predicted = if (score > 0.5) 1 else 0
  if (predicted == point.label) 1 else 0
}.sum
val lrAccuracyScaledCats = lrTotalCorrectScaledCats / numData
val svmAccuracyScaledCats = svmTotalCorrectScaledCats / numData
val nbAccuracyScaledCats = nbTotalCorrectScaledCats / numData
val dtAccuracyScaledCats = dtTotalCorrectScaledCats / numData
println("New lrModel accuracy: " + lrAccuracyScaledCats)
println("New svmModel accuracy: " + svmAccuracyScaledCats)
println("New nbModel accuracy: " + nbAccuracyScaledCats)
println("New dtModel accuracy: " + dtAccuracyScaledCats)
/* Previously:
 * lrModel accuracy: 0.5146720757268425
 * svmModel accuracy: 0.5146720757268425
 * nbModel accuracy: 0.5803921568627451
 * dtModel accuracy: 0.6482758620689655
 *
 * Now:
 * New lrModel accuracy: 0.6657200811359026
 * New svmModel accuracy: 0.6645030425963488
 * New nbModel accuracy: 0.5832319134550372
 * New dtModel accuracy: 0.6655848546315077
 */
val lrPredictionsVsTrueScaledCats = dataCategories.map{ point =>
  (lrModelScaledCats.predict(point.features), point.label)
}
val lrMetricsScaledCats = new BinaryClassificationMetrics(lrPredictionsVsTrueScaledCats)
val lrPrScaledCats = lrMetricsScaledCats.areaUnderPR()
val lrRocScaledCats = lrMetricsScaledCats.areaUnderROC()
println(f"${lrModelScaledCats.getClass.getSimpleName}\n Accuracy: ${lrAccuracyScaledCats * 100}%2.4f%%\n Area under PR: ${lrPrScaledCats * 100.0}%2.4f%%, Area under ROC: ${lrRocScaledCats * 100.0}%2.4f%%")
//LogisticRegressionModel
//Accuracy: 66.5720%
//Area under PR : 75.6015%, Area under ROC: 52.1977%
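The 1-of-k (one-hot) encoding applied to the category column above can be isolated into a small, Spark-free sketch. The object name and sample categories here are made up for illustration; the logic mirrors the `zipWithIndex.toMap` plus `Array.ofDim` pattern used in the real code:

```scala
// Plain-Scala sketch of 1-of-k encoding: map each distinct category to a column index,
// then to a vector that is all zeros except for a single 1.0 at that index.
object OneOfKSketch {
  def encode(all: Seq[String]): Map[String, Array[Double]] = {
    val index = all.distinct.zipWithIndex.toMap  // category -> column index
    val k = index.size
    index.map { case (cat, i) =>
      val v = Array.ofDim[Double](k)             // length-k zero vector
      v(i) = 1.0                                 // mark this sample's category
      cat -> v
    }
  }
}
```

Each encoded vector sums to exactly 1.0, and no category's index collides with another's, which is what lets the linear models treat category membership as k independent binary features.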
val metrics2 = Seq(lrModelScaledCats, svmModelScaledCats).map{ model =>
  val scoreAndLabels = dataCategories.map{ point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val nbMetrics2 = Seq(nbModelScaledCats).map{ model =>
  val scoreAndLabels = nbDataCategories.map{ point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val dtMetrics2 = Seq(dtModelScaledCats).map{ model =>
  val scoreAndLabels = dataCategories.map{ point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val allMetrics2 = metrics2 ++ nbMetrics2 ++ dtMetrics2
allMetrics2.foreach{ case (model, pr, roc) =>
  println(f"New $model, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
}
// Before:
//LogisticRegressionModel, Area under PR : 75.6759%, Area under ROC: 50.1418%
//SVMModel, Area under PR : 75.6759%, Area under ROC: 50.1418%
//NaiveBayesModel, Area under PR : 68.0851%, Area under ROC: 58.3559%
//DecisionTreeModel, Area under PR : 74.3081%, Area under ROC: 64.8837%
// After adding category features:
//New LogisticRegressionModel, Area under PR : 75.6015%, Area under ROC: 52.1977%
//New SVMModel, Area under PR : 75.5180%, Area under ROC: 54.1606%
//New NaiveBayesModel, Area under PR : 68.3386%, Area under ROC: 58.6397%
//New DecisionTreeModel, Area under PR : 75.8784%, Area under ROC: 66.5005%

//-------- Using the correct data format ----------------------------------------------
// Now use only the category features, i.e. only the first 14 dimensions, because
// 1-of-k encoded features better match the naive Bayes model's assumptions.
val nbDataOnlyCategories = records.map{ r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  LabeledPoint(label, Vectors.dense(categoryFeatures))
}
println("First row: " + nbDataOnlyCategories.first())
val nbModelScaledOnlyCats = NaiveBayes.train(nbDataOnlyCategories) // naive Bayes trained on the 1-of-k category features only
val nbMetricsOnlyCats = Seq(nbModelScaledOnlyCats).map{ model =>
  val scoreAndLabels = nbDataOnlyCategories.map{ point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
nbMetricsOnlyCats.foreach{ case (model, pr, roc) =>
  println(f"New $model, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
}
//New NaiveBayesModel, Area under PR : 74.0522%, Area under ROC: 60.5138%
// Compared with the earlier:
//NaiveBayesModel, Area under PR : 68.0851%, Area under ROC: 58.3559%
// ROC improved by about two percentage points (and PR by about six).
val nbTotalCorrectScaledOnlyCats = nbDataOnlyCategories.map{ point =>
  if (nbModelScaledOnlyCats.predict(point.features) == point.label) 1 else 0
}.sum
val nbAccuracyScaledOnlyCats = nbTotalCorrectScaledOnlyCats / numData
println("New nbModel accuracy: " + nbAccuracyScaledOnlyCats)
//New nbModel accuracy: 0.6096010818120352
// Compared with the earlier:
//New nbModel accuracy: 0.5832319134550372
// An improvement of nearly three percentage points.

//-------- Model parameter tuning: linear models --------------------------------------
scaledDataCasts.cache()
//(1) Effect of the number of iterations
val iterResults = Seq(1, 5, 10, 50).map{ param =>
  val model = trainLRWithParams(scaledDataCasts, 0.0, param, new SimpleUpdater, 1.0)
  createMetrics(s"$param iterations", scaledDataCasts, model)
}
iterResults.foreach{ case (param, auc) => println(f"$param, AUC=${auc * 100}%2.4f%%") }
/*1 iterations, AUC=64.9520%
5 iterations, AUC=66.6161%
10 iterations, AUC=66.5483%
50 iterations, AUC=66.8143%*/
//(2) Effect of the step size
val stepResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map{ param =>
  val model = trainLRWithParams(scaledDataCasts, 0.0, numItetations, new SimpleUpdater, param)
  createMetrics(s"$param step size", scaledDataCasts, model)
}
stepResults.foreach{ case (param, auc) => println(f"$param, AUC=${auc * 100}%2.4f%%") }
/*0.001 step size, AUC=64.9659%
0.01 step size, AUC=64.9644%
0.1 step size, AUC=65.5211%
1.0 step size, AUC=66.5483%
10.0 step size, AUC=61.9228%*/
//(3) Effect of regularization
val regResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map{ param =>
  val model = trainLRWithParams(scaledDataCasts, param, numItetations, new SquaredL2Updater, 1.0)
  createMetrics(s"$param L2 regularization parameter", scaledDataCasts, model)
}
regResults.foreach{ case (param, auc) => println(f"$param, AUC=${auc * 100}%2.4f%%") }
/*0.001 L2 regularization parameter, AUC=66.5475%
0.01 L2 regularization parameter, AUC=66.5475%
0.1 L2 regularization parameter, AUC=66.5475%
1.0 L2 regularization parameter, AUC=66.5475%
10.0 L2 regularization parameter, AUC=66.5475%*/
// All values identical: the original trainLRWithParams never passed the parameter to the
// optimizer, so regularization had no effect; see the setRegParam fix in the helper below.

//-------- Model parameter tuning: decision trees -------------------------------------
// Tuning the tree depth
val dtResultsEntropy = Seq(1, 2, 3, 4, 5, 10, 20).map{ param =>
  val model = trainDTWithParams(scaledDataCasts, param, Entropy)
  val scoreAndLabels = scaledDataCasts.map{ point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param tree depth with Entropy", metrics.areaUnderROC())
}
dtResultsEntropy.foreach{ case (param, auc) => println(f"$param, AUC=${auc * 100}%2.4f%%") }
/*1 tree depth, AUC=59.3268%
2 tree depth, AUC=59.3268%
3 tree depth, AUC=61.8313%
4 tree depth, AUC=62.1519%
5 tree depth, AUC=66.5005%
10 tree depth, AUC=75.9120%
20 tree depth, AUC=96.4347%*/
// Tuning the impurity measure: Gini vs. Entropy
val dtResultsEntropy2 = Seq(1, 2, 3, 4, 5, 10, 20).map{ param =>
  val model = trainDTWithParams(scaledDataCasts, param, Gini)
  val scoreAndLabels = scaledDataCasts.map{ point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param tree depth with Gini", metrics.areaUnderROC())
}
dtResultsEntropy2.foreach{ case (param, auc) => println(f"$param, AUC=${auc * 100}%2.4f%%") }
/*1 tree depth with Gini, AUC=59.3268%
2 tree depth with Gini, AUC=61.6106%
3 tree depth with Gini, AUC=61.8349%
4 tree depth with Gini, AUC=62.0433%
5 tree depth with Gini, AUC=66.4518%
10 tree depth with Gini, AUC=76.8962%
20 tree depth with Gini, AUC=98.3514%*/

//-------- Model parameter tuning: naive Bayes ----------------------------------------
val nbResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map{ param =>
  val model = trainNBWithParams(nbDataCategories, param)
  val scoreAndLabels = scaledDataCasts.map{ point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param lambda", metrics.areaUnderROC())
}
nbResults.foreach{ case (param, auc) => println(f"$param, AUC=${auc * 100}%2.4f%%") }
/*0.001 lambda, AUC=61.2364%
0.01 lambda, AUC=61.3334%
0.1 lambda, AUC=61.4714%
1.0 lambda, AUC=61.5605%
10.0 lambda, AUC=61.8360%*/

//--------- Cross-validation ----------------------------------------------------------
val trainTestSplit = scaledDataCasts.randomSplit(Array(0.6, 0.4), 123)
val train = trainTestSplit(0)
val test = trainTestSplit(1)
val regResultsTest = Seq(0.0, 0.001, 0.0025, 0.005, 0.01).map{ param =>
  val model = trainLRWithParams(train, param, numItetations, new SquaredL2Updater, 1.0)
  createMetrics(s"$param L2 regularization parameter", test, model)
}
regResultsTest.foreach{ case (param, auc) => println(f"$param, AUC=${auc * 100}%2.4f%%") }
}

/** Train a logistic regression model with the given parameters.
  * @param input         training data
  * @param regParams     regularization parameter
  * @param numIterations number of iterations
  * @param updater       the Updater (regularization scheme) to use
  * @param stepSize      step size
  */
def trainLRWithParams(input: RDD[LabeledPoint], regParams: Double, numIterations: Int, updater: Updater, stepSize: Double) = {
  val lr = new LogisticRegressionWithSGD()
  lr.optimizer
    .setNumIterations(numIterations)
    .setUpdater(updater)
    .setRegParam(regParams) // was missing in the original code, so regParams had no effect
    .setStepSize(stepSize)
  lr.run(input)
}

def createMetrics(label: String, data: RDD[LabeledPoint], model: ClassificationModel) = {
  val scoreAndLabels = data.map{ point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (label, metrics.areaUnderROC())
}

def trainDTWithParams(input: RDD[LabeledPoint], maxDepth: Int, impurity: Impurity) = {
  DecisionTree.train(input, Algo.Classification, impurity, maxDepth)
}

def trainNBWithParams(input: RDD[LabeledPoint], lambda: Double) = {
  val nb = new NaiveBayes()
  nb.setLambda(lambda)
  nb.run(input)
}
}
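The precision and recall formulas quoted in the evaluation section can be checked with plain Scala, independent of Spark. This is an illustrative sketch; the object name and the (prediction, label) pairs below are made up:

```scala
// Counts TP/FP/FN over (prediction, label) pairs with classes 0.0 / 1.0 and applies
//   precision = TP / (TP + FP)
//   recall    = TP / (TP + FN)
object PrecisionRecallSketch {
  def precisionRecall(pairs: Seq[(Double, Double)]): (Double, Double) = {
    val tp = pairs.count { case (p, l) => p == 1.0 && l == 1.0 } // true positives
    val fp = pairs.count { case (p, l) => p == 1.0 && l == 0.0 } // false positives
    val fn = pairs.count { case (p, l) => p == 0.0 && l == 1.0 } // false negatives
    val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    val recall    = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    (precision, recall)
  }
}
```

BinaryClassificationMetrics computes these same quantities at every score threshold to trace out the PR curve; this sketch corresponds to a single fixed threshold.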