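Environment: Mac OS X, Python 3.5
The Python that ships with the system is version 2.7, so the configuration has to be changed:
1. Set the system environment variable that tells pyspark which Python to use (for example PYSPARK_PYTHON).
2. Edit the pyspark launcher script directly: open it with vim ./bin/pyspark and change the relevant part to: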
if hash python2.7 2>/dev/null; then
  # Attempt to use Python 2.7, if installed:
  #DEFAULT_PYTHON="python2.7"
  DEFAULT_PYTHON="python3"
else
  DEFAULT_PYTHON="python"
fi
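After either change, it is worth confirming which interpreter the pyspark shell is actually running. The following is a minimal sketch meant to be pasted into the shell; the helper name interpreter_info is made up for illustration, while PYSPARK_PYTHON is the environment variable pyspark reads to choose the Python executable.

```python
import os
import sys

def interpreter_info():
    """Return (major, minor) of the running Python and the value of PYSPARK_PYTHON, if set."""
    version = (sys.version_info.major, sys.version_info.minor)
    # os.environ.get returns None when the variable has not been exported.
    return version, os.environ.get("PYSPARK_PYTHON")

version, pyspark_python = interpreter_info()
print("Running Python %d.%d, PYSPARK_PYTHON=%s" % (version[0], version[1], pyspark_python))
```

If the reported major version is still 2, the shell is not picking up the new setting.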
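Dataset
This post uses the KDD Cup 1999 dataset.
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99
Getting the data and creating the RDD
Download the dataset
http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data.gz
Opening the link downloads the file directly.
Start pyspark with ./bin/pyspark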
data_file = "./kddcup.data.gz"
raw_data = sc.textFile(data_file)
raw_data.count()
#4898431
# Look at the data, e.g. the first 4 records
for data in raw_data.take(4):
    print(data)
#0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
#0,tcp,http,SF,162,4528,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,1,1,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,normal.
#0,tcp,http,SF,236,1228,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,2,2,1.00,0.00,0.50,0.00,0.00,0.00,0.00,0.00,normal.
#0,tcp,http,SF,233,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,3,3,1.00,0.00,0.33,0.00,0.00,0.00,0.00,0.00,normal.
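sc.textFile reads the gzip-compressed file transparently. To peek at the same records without starting Spark, Python's standard gzip module can be used. The helper name head_gz below is hypothetical, and this is only a sketch.

```python
import gzip
from itertools import islice

def head_gz(path, n=4):
    """Return the first n lines of a gzip-compressed text file, without trailing newlines."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]

# Usage, assuming kddcup.data.gz has been downloaded to the current directory:
# for record in head_gz("./kddcup.data.gz"):
#     print(record)
```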
The test dataset is loaded the same way:
http://kdd.ics.uci.edu/databases/kddcup99/corrected.gz
test_data_file = "./corrected.gz"
test_raw_data = sc.textFile(test_data_file)
test_raw_data.count()
#311029
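Preparing the training data
In this example we only want to detect network attacks, without needing to know which type of attack is involved. Each network connection is therefore labeled as either non-attack (normal) or attack.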
from pyspark.mllib.regression import LabeledPoint
from numpy import array

def parse_interaction(line):
    line_split = line.split(",")
    # Keep column 0 and columns 4-40; drop the string columns 1-3 and the label (41)
    clean_line_split = line_split[0:1] + line_split[4:41]
    attack = 1.0
    if line_split[41] == "normal.":
        # or: if "normal" in line_split[41]:
        attack = 0.0
    return LabeledPoint(attack, array([float(x) for x in clean_line_split]))

training_data = raw_data.map(parse_interaction)
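The labeling rule can be checked on a single record without Spark. The sketch below mirrors parse_interaction in plain Python and returns a (label, features) tuple instead of a LabeledPoint. The function name parse_record and the two synthetic records are illustrative only, with "smurf." standing in for an attack label.

```python
def parse_record(line):
    """Split one KDD Cup 1999 record into (label, features).

    label is 0.0 for "normal." and 1.0 for anything else; features are
    column 0 plus columns 4-40, converted to float.
    """
    fields = line.split(",")
    label = 0.0 if fields[41] == "normal." else 1.0
    features = [float(x) for x in fields[0:1] + fields[4:41]]
    return label, features

# Synthetic 42-field records: duration, 3 string columns, 37 numeric columns, label.
normal_line = ",".join(["0", "tcp", "http", "SF", "215", "45076"] + ["0"] * 35 + ["normal."])
attack_line = ",".join(["0", "icmp", "ecr_i", "SF", "1032", "0"] + ["0"] * 35 + ["smurf."])

print(parse_record(normal_line)[0])  # label of the normal record
print(parse_record(attack_line)[0])  # label of the attack record
```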
Preparing the test data
In the same way:
test_data = test_raw_data.map(parse_interaction)
Training the classifier
Logistic regression is widely used for binary classification. Here we use the L-BFGS algorithm rather than mini-batch gradient descent.
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from time import time
t0 = time()
logit_model = LogisticRegressionWithLBFGS.train(training_data)
tt = time() - t0
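Once trained, a logistic regression model classifies a record by applying the sigmoid function to a weighted sum of its features and comparing the result with a threshold. A minimal plain-Python sketch of that rule follows. The weights and feature values are made up, the function name is hypothetical, and 0.5 is assumed as the threshold (MLlib's default).

```python
import math

def predict_logistic(weights, intercept, features, threshold=0.5):
    """Return 1 if sigmoid(w . x + b) exceeds the threshold, else 0."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    # Evaluate the sigmoid in a form that cannot overflow for large |margin|.
    if margin >= 0:
        probability = 1.0 / (1.0 + math.exp(-margin))
    else:
        e = math.exp(margin)
        probability = e / (1.0 + e)
    return 1 if probability > threshold else 0

# Hypothetical weights and features, for illustration only.
weights = [0.8, -1.5]
print(predict_logistic(weights, 0.0, [2.0, 0.5]))  # margin = 0.85, prints 1
print(predict_logistic(weights, 0.0, [0.5, 2.0]))  # margin = -2.6, prints 0
```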
Evaluating the model on new data
The test set is used to evaluate the error. With the map method, the model predicts the class of each test record.
labels_and_preds = test_data.map(lambda p: (p.label, logit_model.predict(p.features)))
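The classification results are returned as pairs of the true label and the predicted label (0 or 1). These pairs are then passed through filter and count to compute the classification accuracy.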
t0 = time()
test_accuracy = labels_and_preds.filter(lambda vp: vp[0] == vp[1]).count() / float(test_data.count())
tt = time() - t0
test_accuracy
# 0.9164
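The filter and count step amounts to the fraction of pairs whose two elements agree. On a plain Python list the same computation looks like the sketch below, with made-up pairs and a hypothetical helper name accuracy. Each pair is indexed rather than unpacked in the lambda signature, since Python 3 does not accept tuple parameters.

```python
def accuracy(labels_and_preds):
    """Fraction of (label, prediction) pairs in which the two values agree."""
    pairs = list(labels_and_preds)
    if not pairs:
        raise ValueError("no predictions to evaluate")
    correct = len(list(filter(lambda vp: vp[0] == vp[1], pairs)))
    return correct / float(len(pairs))

# Made-up (label, prediction) pairs: 3 of 4 agree.
print(accuracy([(0.0, 0), (1.0, 1), (1.0, 0), (0.0, 0)]))  # 0.75
```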
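The accuracy is fairly good. It could of course be improved by choosing better variables.
Feature selection
Finding suitable features before training a model is important.
Using hypothesis testing
Hypothesis testing is a powerful tool in statistical inference and learning for determining whether a result is statistically significant. MLlib supports Pearson's chi-squared (χ2) tests for goodness of fit and for independence. The goodness-of-fit test requires a vector as input, whereas the independence test requires a matrix. MLlib also accepts input of type RDD[LabeledPoint], which enables feature selection through chi-squared independence tests. These methods are part of the Statistics package.
In this example we want to rank features for selection, so LabeledPoint is used.
MLlib computes a contingency matrix and performs Pearson's chi-squared (χ2) test on it. Features need to be categorical: each distinct value of a real-valued feature is treated as its own category. We therefore either drop some features or categorize them. In this dataset, we restrict ourselves to features that are boolean or take only a small number of distinct values, and define a separate function, parse_interaction_categorical, that keeps only these features.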
feature_names = ["land", "wrong_fragment",
                 "urgent", "hot", "num_failed_logins",
                 "logged_in", "num_compromised",
                 "root_shell", "su_attempted",
                 "num_root", "num_file_creations",
                 "num_shells", "num_access_files",
                 "num_outbound_cmds",
                 "is_host_login", "is_guest_login",
                 "count", "srv_count", "serror_rate",
                 "srv_serror_rate", "rerror_rate",
                 "srv_rerror_rate", "same_srv_rate",
                 "diff_srv_rate", "srv_diff_host_rate",
                 "dst_host_count", "dst_host_srv_count",
                 "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
                 "dst_host_same_src_port_rate",
                 "dst_host_srv_diff_host_rate", "dst_host_serror_rate",
                 "dst_host_srv_serror_rate",
                 "dst_host_rerror_rate", "dst_host_srv_rerror_rate"]
def parse_interaction_categorical(line):
    line_split = line.split(",")
    # Keep columns 6-40, which match feature_names one to one
    clean_line_split = line_split[6:41]
    attack = 1.0
    if line_split[41] == 'normal.':
        attack = 0.0
    return LabeledPoint(attack,
                        array([float(x) for x in clean_line_split]))

training_data_categorical = raw_data.map(parse_interaction_categorical)
from pyspark.mllib.stat import Statistics
chi = Statistics.chiSqTest(training_data_categorical)
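To make concrete what the chi-squared independence test measures for a single feature, the sketch below computes Pearson's statistic and the degrees of freedom from (feature value, label) pairs in plain Python. It is an illustration of the formula, not MLlib's implementation, which may handle degenerate cases differently; the function name and the sample data are made up.

```python
from collections import Counter

def chi_squared_independence(pairs):
    """Pearson chi-squared statistic and degrees of freedom for (feature_value, label) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one observation is required")
    observed = Counter(pairs)                           # contingency table cells
    row_totals = Counter(value for value, _ in pairs)   # counts per feature value
    col_totals = Counter(label for _, label in pairs)   # counts per label
    n = float(len(pairs))
    statistic = 0.0
    for value, row_total in row_totals.items():
        for label, col_total in col_totals.items():
            expected = row_total * col_total / n        # count expected under independence
            statistic += (observed[(value, label)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# A feature that tracks the label closely, and a feature that never changes.
informative = [(1, 1.0)] * 40 + [(1, 0.0)] * 10 + [(0, 1.0)] * 10 + [(0, 0.0)] * 40
constant = [(0, 1.0)] * 50 + [(0, 0.0)] * 50
print(chi_squared_independence(informative))  # (36.0, 1)
print(chi_squared_independence(constant))     # (0.0, 0)
```

A large statistic means the feature and the label are unlikely to be independent. A feature that takes a single value yields a statistic of zero and carries no information about the label.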
The results can be inspected by putting them into a pandas DataFrame.
import pandas as pd
pd.set_option('display.max_colwidth', 30)
records = [(result.statistic, result.pValue) for result in chi]
chi_df = pd.DataFrame(data=records, index= feature_names, columns=["Statistic","p-value"])
chi_df
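From these results we can conclude that the predictors land and num_outbound_cmds contribute little to the model and can be removed without noticeably affecting accuracy.
Evaluating the new model
Modify the parse_interaction function to drop columns 6 and 19, which correspond to these two predictors.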
def parse_interaction_chi(line):
    line_split = line.split(",")
    # leave_out = [1,2,3,6,19,41]
    clean_line_split = line_split[0:1] + line_split[4:6] + line_split[7:19] + line_split[20:41]
    attack = 1.0
    if line_split[41] == 'normal.':
        attack = 0.0
    return LabeledPoint(attack, array([float(x) for x in clean_line_split]))

training_data_chi = raw_data.map(parse_interaction_chi)
test_data_chi = test_raw_data.map(parse_interaction_chi)
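The chained slices are easy to get wrong by one. As a sanity check, the sketch below confirms that they select exactly the columns left after removing indices 1, 2, 3, 6, 19 and 41. The helper drop_columns is hypothetical and not part of the original code.

```python
def drop_columns(fields, leave_out):
    """Return the fields whose index is not in leave_out, preserving order."""
    excluded = set(leave_out)
    return [f for i, f in enumerate(fields) if i not in excluded]

# A record with 42 columns, each holding its own index for easy checking.
fields = list(range(42))
by_slices = fields[0:1] + fields[4:6] + fields[7:19] + fields[20:41]
by_index = drop_columns(fields, [1, 2, 3, 6, 19, 41])
print(by_slices == by_index)  # True
print(len(by_index))          # 36
```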
Rebuilding the model
t0 = time()
logit_model_chi = LogisticRegressionWithLBFGS.train(training_data_chi)
tt = time() - t0
Evaluating on the test set
labels_and_preds = test_data_chi.map(lambda p: (p.label, logit_model_chi.predict(p.features)))
t0 = time()
test_accuracy = labels_and_preds.filter(lambda vp: vp[0] == vp[1]).count() / float(test_data_chi.count())
tt = time() - t0
test_accuracy
#0.9164
Using a correlation matrix
Reference
https://www.codementor.io/spark/tutorial/spark-mllib-logistic-regression