Spark Lab | Classification and Regression

I. Building a Classification Model with Spark:

Spark’s MLlib library supports binary classification for linear models, decision trees, and naïve Bayes models and multiclass classification for decision trees and naïve Bayes models.

1 Download the data from the following address.
2 Start the Hadoop and Spark environments.
3 Change to the directory in which you downloaded the data (referred to as PATH here) and run the following command to remove the first line and pipe the result to a new file called train_noheader.tsv:

sed 1d train.tsv > train_noheader.tsv

Now, we are ready to start up our Spark shell (remember to run this command from your Spark installation directory):

./bin/spark-shell --driver-memory 4g

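4 Copy the data to the HDFS file system.
5 Start Eclipse and create a new Scala project (assume it is named SC2). Select "Properties" and, in the dialog that opens, change the Scala Compiler option to version 2.10.
6 Create a new Scala source file; assume the class name is ClassifyModel: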

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD 
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.classification.NaiveBayes 
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Entropy 
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint 
import org.apache.spark.mllib.linalg.Vectors

object ClassifyModel {
    def main(args: Array[String]) {
   // if (args.length < 1) {
   //   System.err.println("Usage: <file>")
   //   System.exit(1)
   // }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)

val numIterations = 10
val maxTreeDepth = 5
val rawData = sc.textFile("/user/train_noheader.tsv") 
val records = rawData.map(line => line.split("\t"))

/* Due to the way the data is formatted, we will have to do a bit of data cleaning during our initial processing by trimming out the extra quotation characters ("). There are also missing values in the dataset; they are denoted by the "?" character. In this case, we will simply assign a zero value to these missing values: */

val data = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}
/*In the preceding code, we extracted the label variable from the last column and an array of features for columns 5 to 25 after cleaning and dealing with missing values. We converted the label to an Int value and the features to an Array[Double]. Finally, we wrapped the label and features in a LabeledPoint instance, converting the features into an MLlib Vector. */

//We will also cache the data and count the number of data points:

data.cache
val numData = data.count
//You will see that the value of numData is 7395.

//Naive Bayes requires non-negative features, so negative values are set to zero here:
val nbData = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble).map(d => if (d < 0) 0.0 else d)
  LabeledPoint(label, Vectors.dense(features))
}

//Training classification models
//We will train an SVM model:

val svmModel = SVMWithSGD.train(data, numIterations)

/*Then, we will train the naïve Bayes model; remember to use your special non-negative feature dataset: */

val nbModel = NaiveBayes.train(nbData)

//Finally, we will train our decision tree:

val dtModel = DecisionTree.train(data, Algo.Classification, Entropy, maxTreeDepth)
//Generating predictions for the Kaggle/StumbleUpon evergreen classification dataset
/*We will use our logistic regression model as an example (the other models are used in the same way): */

val dataPoint = data.first
val lrModel = LogisticRegressionWithSGD.train(data, numIterations)
val prediction = lrModel.predict(dataPoint.features)

val trueLabel = dataPoint.label

val predictions = lrModel.predict(data.map(lp => lp.features))

//Evaluating the performance of classification models

/*We can calculate the accuracy of our models in our training data by making predictions on each input feature and comparing them to the true label. We will sum up the number of correctly classified instances and divide this by the total number of data points to get the average classification accuracy: */

val lrTotalCorrect = data.map { point =>
  if (lrModel.predict(point.features) == point.label) 1 else 0
}.sum
val lrAccuracy = lrTotalCorrect / data.count

//What about the other models? Let’s compute the accuracy for the other three:

val svmTotalCorrect = data.map { point =>
  if (svmModel.predict(point.features) == point.label) 1 else 0
}.sum
val nbTotalCorrect = nbData.map { point =>
  if (nbModel.predict(point.features) == point.label) 1 else 0
}.sum

/*Note that the decision tree prediction threshold needs to be specified explicitly, as highlighted here: */

val dtTotalCorrect = data.map { point =>
  val score = dtModel.predict(point.features)
  val predicted = if (score > 0.5) 1 else 0
  if (predicted == point.label) 1 else 0
}.sum

/*We can now inspect the accuracy for the other three models. First, the SVM model: */
val svmAccuracy = svmTotalCorrect / numData

// Next, our naïve Bayes model:

val nbAccuracy = nbTotalCorrect / numData

/*Finally, we compute the accuracy for the decision tree: */

val dtAccuracy = dtTotalCorrect / numData

/*MLlib comes with a set of built-in routines to compute the area under the PR and ROC curves for binary classification. Here, we will compute these metrics for each of our models: */

val metrics = Seq(lrModel, svmModel).map { model =>
  val scoreAndLabels = data.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
}

/*As we did previously when training the naïve Bayes model and computing its accuracy, we need to use the special nbData version of the dataset that we created to compute the classification metrics: */

val nbMetrics = Seq(nbModel).map { model =>
  val scoreAndLabels = nbData.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
}
/*Note that because the DecisionTreeModel model does not implement the ClassificationModel interface that is implemented by the other three models, we need to compute the results separately for this model in the following code: */

val dtMetrics = Seq(dtModel).map { model =>
  val scoreAndLabels = data.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
}
val allMetrics = metrics ++ nbMetrics ++ dtMetrics
allMetrics.foreach { case (m, pr, roc) =>
  println(f"$m, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
}
    }
}
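
The cleaning step near the top of the listing (strip the quotation characters, turn "?" into zero, take the last column as the label) can also be tried outside Spark. The following is a minimal plain-Python sketch under stated assumptions: parse_record, non_negative, and the sample row are hypothetical names and values for illustration, not part of the lab code or the dataset.

```python
def parse_record(line):
    """Turn one tab-separated line into (label, features)."""
    fields = line.split("\t")
    # Strip the extra quotation characters around every field.
    trimmed = [f.replace('"', "") for f in fields]
    # The label is the last column.
    label = int(trimmed[-1])
    # Features start at the fifth column (index 4) and stop before the label;
    # missing values marked "?" become 0.0.
    features = [0.0 if f == "?" else float(f) for f in trimmed[4:-1]]
    return label, features


def non_negative(features):
    """Clamp negative values to zero, mirroring the nbData variant."""
    return [0.0 if x < 0 else x for x in features]


# Hypothetical row: four leading text columns, three numeric columns, one label.
sample = '"http://example.com"\t"1"\t"{}"\t"news"\t"0.5"\t"?"\t"-1.2"\t"1"'
label, features = parse_record(sample)
nb_features = non_negative(features)
print(label, features, nb_features)
```

Keeping the clamping in a separate function makes it easy to see that the naïve Bayes dataset differs from the main one only in how negative values are treated.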


  1. In the Scala project SC2, right-click "ClassifyModel.scala", choose "Export", and select "Java" -> "JAR File" in the dialog to package the program as a jar, which you can name "ClassifyModel.jar" (for example, with /home/fyj/workspace1/project as the export directory).
  2. Run the following command:
    spark-submit --name ClassifyModel --class ClassifyModel --executor-memory 256M /home/fyj/workspace1/project/ClassifyModel.jar
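
To make the two evaluation ideas in the program concrete, here is a small plain-Python sketch on a toy set of scores. It is an illustration, not MLlib's implementation: the function names and the numbers are hypothetical, and the area under the ROC curve is computed with the pairwise definition (the chance that a positive example outscores a negative one, ties counting half) rather than from the curve itself.

```python
def threshold(scores, cutoff=0.5):
    """Convert raw scores into 0/1 predictions."""
    return [1.0 if s > cutoff else 0.0 for s in scores]


def accuracy(predicted, actual):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / float(len(actual))


def area_under_roc(scores, labels):
    """Pairwise AUC: positives ranked above negatives, ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1.0]
    neg = [s for s, l in zip(scores, labels) if l == 0.0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and raw model scores.
labels = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
raw_scores = [0.9, 0.7, 0.2, 0.4, 0.8, 0.1]

predicted = threshold(raw_scores)
acc = accuracy(predicted, labels)
auc_hard = area_under_roc(predicted, labels)   # from thresholded 0/1 predictions
auc_soft = area_under_roc(raw_scores, labels)  # from the raw scores
print(acc, auc_hard, auc_soft)
```

On this toy data the thresholded predictions give a lower area than the raw scores, because thresholding collapses the ROC curve to a single operating point. This is worth keeping in mind when reading area figures computed from 0/1 predictions.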

II. Building a Regression Model with Spark:

Spark’s MLlib library offers two broad classes of regression models: linear models and decision tree regression models.
(1) Linear models are essentially the same as their classification counterparts; the only difference is that linear regression models use a different loss function, related link function, and decision function. MLlib provides a standard least squares regression model (although other types of generalized linear models for regression are planned).
(2) Decision trees can also be used for regression by changing the impurity measure.

1 Data preparation.
The dataset is available at
Once you have downloaded the file, unzip it. This will create a directory called Bike-Sharing-Dataset, which contains the day.csv, hour.csv, and the Readme.txt files.
The Readme.txt file contains information on the dataset, including the variable names and descriptions. Take a look at the file, and you will see that we have the following variables available:
instant: This is the record ID
dteday: This is the raw date
season: This is the season (spring, summer, winter, or fall)
yr: This is the year (2011 or 2012)
mnth: This is the month of the year
hr: This is the hour of the day
holiday: This is whether the day was a holiday or not
weekday: This is the day of the week
workingday: This is whether the day was a working day or not
weathersit: This is a categorical variable that describes the weather at a particular time
temp: This is the normalized temperature
atemp: This is the normalized apparent temperature
hum: This is the normalized humidity
windspeed: This is the normalized wind speed
cnt: This is the target variable, that is, the count of bike rentals for that hour
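
The later steps address columns by position rather than by name, so it helps to see the positions written out. The sketch below is plain Python. The column order follows the dataset's header as I understand it, and the casual and registered columns (not described in the list above) are an assumption about what sits between windspeed and cnt. The names COLUMNS, index_of, categorical, numerical, and target are hypothetical.

```python
# Assumed column order of hour.csv; "casual" and "registered" are an assumption
# about the columns between windspeed and the target cnt.
COLUMNS = [
    "instant", "dteday", "season", "yr", "mnth", "hr", "holiday", "weekday",
    "workingday", "weathersit", "temp", "atemp", "hum", "windspeed",
    "casual", "registered", "cnt",
]

# Name -> position lookup.
index_of = {name: i for i, name in enumerate(COLUMNS)}

categorical = COLUMNS[2:10]   # indices 2 to 9
numerical = COLUMNS[10:14]    # indices 10 to 13
target = COLUMNS[-1]          # last column

print(categorical)
print(numerical)
print(target, index_of["weathersit"])
```

Under this layout the record ID and raw date (indices 0 and 1) are left out of the features, and the target is always the last field of a record.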

We will work with the hourly data contained in hour.csv. If you look at the first line of the dataset, you will see that it contains the column names as a header. You can do this by running the following command:

head -1 hour.csv

This should output the following result:
Before we work with the data in Spark, we will again remove the header from the first line of the file using the same sed command that we used previously to create a new file called hour_noheader.csv:

sed 1d hour.csv > hour_noheader.csv

Since we will be doing some plotting of our dataset later on, we will use the Python shell for this chapter. This also serves to illustrate how to use MLlib’s linear model and decision tree functionality from PySpark.

2 Start the Hadoop and Spark environments.

3 Copy the data to the HDFS file system.

4 Install pip: apt-get install python-pip

5 Start up your PySpark shell from your Spark installation directory. If you want to use IPython, which we highly recommend, remember to include the IPYTHON=1 environment variable together with the pylab functionality:
Change to the spark-1.3.1/bin directory and run: IPYTHON=1 IPYTHON_OPTS="--pylab" ./pyspark

6 Enter the following commands and code step by step (read the code alongside Chapter 6 of the Packt.Machine.Learning.with.Spark e-book) and check the results.

from pyspark.mllib.regression import LabeledPoint 
import numpy as np

path = "./hour_noheader.csv" 
raw_data = sc.textFile(path) 
num_data = raw_data.count()
records = raw_data.map(lambda x: x.split(","))
first = records.first()
print first 
print num_data

// We will first cache our dataset, since we will be reading from it many times:

records.cache()

/* In order to extract each categorical feature into a binary vector form, we will need to know the feature mapping of each feature value to the index of the nonzero value in our binary vector. Let’s define a function that will extract this mapping from our dataset for a given column: */

def get_mapping(rdd, idx):
    return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()

/* Our function first maps the field to its unique values and then uses the zipWithIndex transformation to zip the value up with a unique index such that a key-value RDD is formed, where the key is the variable and the value is the index. This index will be the index of the nonzero entry in the binary vector representation of the feature. We will finally collect this RDD back to the driver as a Python dictionary. */
//We can test our function on the third variable column (index 2):

print "Mapping of first categorical feature column: %s" % get_mapping(records, 2)
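
For readers who want to see what the mapping looks like without a Spark context, here is a plain-Python sketch of the same idea. get_mapping_local and the sample rows are hypothetical. The distinct values are sorted here only to make the result deterministic. With an RDD, the index assigned to each value depends on the order in which distinct() returns them.

```python
def get_mapping_local(rows, idx):
    """Map each distinct value in column idx to a unique index."""
    distinct = sorted(set(row[idx] for row in rows))
    return {value: i for i, value in enumerate(distinct)}


# Hypothetical records: record ID, raw date, season.
rows = [
    ["1", "2011-01-01", "1"],
    ["2", "2011-04-01", "2"],
    ["3", "2011-04-02", "2"],
    ["4", "2011-07-01", "3"],
]

mapping = get_mapping_local(rows, 2)
print(mapping)
```

Each key is a raw value from the column and each value is the position that will hold the 1 in the binary vector for that column.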

/* Now, we can apply this function to each categorical column (that is, for variable indices 2 to 9): */

mappings = [get_mapping(records, i) for i in range(2,10)]
cat_len = sum(map(len, mappings))
num_len = len(records.first()[10:14])
total_len = num_len + cat_len

/*We now have the mappings for each variable, and we can see how many values in total we need for our binary vector representation: */

print "Feature vector length for categorical features: %d" % cat_len 
print "Feature vector length for numerical features: %d" % num_len
print "Total feature vector length: %d" % total_len

//Creating feature vectors for the linear model

def extract_features(record):
    cat_vec = np.zeros(cat_len)
    i = 0
    step = 0
    # Columns 2 to 9 are the categorical variables, one mapping per column
    for field in record[2:10]:
        m = mappings[i]
        idx = m[field]
        cat_vec[idx + step] = 1
        i = i + 1
        step = step + len(m)
    # Columns 10 to 13 are the normalized numerical variables
    num_vec = np.array([float(field) for field in record[10:14]])
    return np.concatenate((cat_vec, num_vec))

def extract_label(record): return float(record[-1])

/*In the preceding extract_features function, we ran through each column in the row of data. We extracted the binary encoding for each variable in turn from the mappings we created previously. The step variable ensures that the nonzero feature index in the full feature vector is correct (and is somewhat more efficient than, say, creating many smaller binary vectors and concatenating them). The numeric vector is created directly by first converting the data to floating point numbers and wrapping these in a numpy array. The resulting two vectors are then concatenated. The extract_label function simply converts the last column variable (the count) into a float.
With our utility functions defined, we can proceed with extracting feature vectors and labels from our data records:*/

data = records.map(lambda r: LabeledPoint(extract_label(r), extract_features(r)))

//Let’s inspect the first record in the extracted feature RDD:

first_point = data.first()
print "Raw data: " + str(first[2:])
print "Label: " + str(first_point.label)
print "Linear Model feature vector:\n" + str(first_point.features) 
print "Linear Model feature vector length: " + str(len(first_point.features))
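
The role of the step variable is easier to see on a tiny example. The sketch below is plain Python with two hypothetical mappings (a four-valued season and a two-valued year). one_hot is an illustrative name, not a function from the lab code.

```python
def one_hot(record_fields, mappings):
    """Encode categorical fields into one binary vector using per-column mappings."""
    total = sum(len(m) for m in mappings)
    vec = [0.0] * total
    step = 0  # offset of the current column's block inside the full vector
    for field, m in zip(record_fields, mappings):
        vec[m[field] + step] = 1.0
        step += len(m)
    return vec


# Hypothetical mappings for two categorical columns.
mappings = [
    {"1": 0, "2": 1, "3": 2, "4": 3},  # season
    {"0": 0, "1": 1},                  # yr
]

vec = one_hot(["3", "1"], mappings)
print(vec)
```

The first four positions belong to the first column and the last two to the second. The offset grows by the size of each mapping, so every column writes into its own block.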
//Creating feature vectors for the decision tree
/* As we have seen, decision tree models typically work on raw features (that is, it is not required to convert categorical features into a binary vector encoding; they can, instead, be used directly). Therefore, we will create a separate function to extract the decision tree feature vector, which simply converts all the values to floats and wraps them in a numpy array: */

def extract_features_dt(record):
    return np.array(map(float, record[2:14]))

data_dt = records.map(lambda r: LabeledPoint(extract_label(r), extract_features_dt(r)))
first_point_dt = data_dt.first()
print "Decision Tree feature vector: " + str(first_point_dt.features) 
print "Decision Tree feature vector length: " + str(len(first_point_dt.features))

/* We can look at the training methods with Python's help function, first for the linear model and then for the decision tree model, which has a trainRegressor method (in addition to a trainClassifier method for classification models): */

from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.mllib.tree import DecisionTree
help(LinearRegressionWithSGD.train)
help(DecisionTree.trainRegressor)


//Training a regression model on the bike sharing dataset
/* We're ready to use the features we have extracted to train our models on the bike sharing data. First, we'll train the linear regression model and take a look at the first few predictions that the model makes on the data: */

linear_model = LinearRegressionWithSGD.train(data, iterations=10, step=0.1, intercept=False)
true_vs_predicted = data.map(lambda p: (p.label, linear_model.predict(p.features)))
print "Linear Model predictions: " + str(true_vs_predicted.take(5))

/*  Next, we will train the decision tree model simply using the default arguments to the trainRegressor method (which equates to using a tree depth of 5). Note that we need to pass in the other form of the dataset, data_dt, that we created from the raw feature values (as opposed to the binary encoded features that we used for the preceding linear model).
We also need to pass in an argument for categoricalFeaturesInfo. This is a dictionary that maps the categorical feature index to the number of categories for the feature. If a feature is not in this mapping, it will be treated as continuous. For our purposes, we will leave this as is, passing in an empty mapping:  */

dt_model = DecisionTree.trainRegressor(data_dt,{})
preds = dt_model.predict(data_dt.map(lambda p: p.features))
actual = data_dt.map(lambda p: p.label)
true_vs_predicted_dt = actual.zip(preds)
print "Decision Tree predictions: " + str(true_vs_predicted_dt.take(5))
print "Decision Tree depth: " + str(dt_model.depth())
print "Decision Tree number of nodes: " + str(dt_model.numNodes())
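
The lab passes an empty categoricalFeaturesInfo, so every feature is treated as continuous. For comparison, the sketch below shows what a non-empty mapping could look like. It is plain Python under stated assumptions: the two mappings are hypothetical, build_categorical_info and reindex are illustrative names, and the re-indexing reflects my understanding that MLlib expects a categorical feature with k categories to take the values 0 to k-1.

```python
def build_categorical_info(mappings):
    """Feature index -> number of categories for that feature."""
    return {i: len(m) for i, m in enumerate(mappings)}


def reindex(record_fields, mappings):
    """Replace each raw category value by its 0-based index from the mapping."""
    return [float(m[field]) for field, m in zip(record_fields, mappings)]


# Hypothetical mappings for the first two decision tree features.
mappings = [
    {"1": 0, "2": 1, "3": 2, "4": 3},  # season, coded 1 to 4 in the raw data
    {"0": 0, "1": 1},                  # yr
]

info = build_categorical_info(mappings)
features = reindex(["3", "1"], mappings)
print(info)
print(features)
```

A dictionary of this shape would take the place of the empty {} argument, with features built by reindex instead of the raw values. Whether this improves the tree on this dataset is not something the lab establishes.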
//Evaluating the performance of regression models

def squared_error(actual, pred): return (pred - actual)**2

def abs_error(actual, pred): return np.abs(pred - actual)

def squared_log_error(pred, actual):
    return (np.log(pred + 1) - np.log(actual + 1))**2

//Linear model
/* Our approach will be to apply the relevant error function to each record in the RDD we computed earlier, which is true_vs_predicted for our linear model: */

mse = true_vs_predicted.map(lambda (t, p): squared_error(t, p)).mean()
mae = true_vs_predicted.map(lambda (t, p): abs_error(t, p)).mean()
rmsle = np.sqrt(true_vs_predicted.map(lambda (t, p): squared_log_error(t, p)).mean())
print "Linear Model - Mean Squared Error: %2.4f" % mse
print "Linear Model - Mean Absolute Error: %2.4f" % mae
print "Linear Model - Root Mean Squared Log Error: %2.4f" % rmsle

//Decision tree
/* We will use the same approach for the decision tree model, using the
true_vs_predicted_dt RDD: */

mse_dt = true_vs_predicted_dt.map(lambda (t, p): squared_error(t, p)).mean()
mae_dt = true_vs_predicted_dt.map(lambda (t, p): abs_error(t, p)).mean()
rmsle_dt = np.sqrt(true_vs_predicted_dt.map(lambda (t, p): squared_log_error(t, p)).mean())
print "Decision Tree - Mean Squared Error: %2.4f" % mse_dt
print "Decision Tree - Mean Absolute Error: %2.4f" % mae_dt
print "Decision Tree - Root Mean Squared Log Error: %2.4f" % rmsle_dt
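
The three error measures can be checked by hand on a few numbers. The sketch below uses only the Python standard library. The (true, predicted) pairs are hypothetical and the helper names are illustrative.

```python
import math


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / float(len(values))


def mse(pairs):
    """Mean squared error over (true, predicted) pairs."""
    return mean((p - t) ** 2 for t, p in pairs)


def mae(pairs):
    """Mean absolute error over (true, predicted) pairs."""
    return mean(abs(p - t) for t, p in pairs)


def rmsle(pairs):
    """Root mean squared log error; needs true and predicted values above -1."""
    return math.sqrt(mean((math.log(p + 1) - math.log(t + 1)) ** 2 for t, p in pairs))


# Hypothetical (true count, predicted count) pairs.
pairs = [(16.0, 18.0), (40.0, 35.0), (32.0, 32.0), (13.0, 10.0)]

print("MSE: %2.4f" % mse(pairs))
print("MAE: %2.4f" % mae(pairs))
print("RMSLE: %2.4f" % rmsle(pairs))
```

Here the squared errors are 4, 25, 0, and 9, so the mean squared error is 9.5 and the mean absolute error is 2.5. The log-based measure is undefined for predictions at or below -1, which a linear model can produce, so it is worth checking the predictions before relying on it.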

