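Machine Learning: Regression and Outlier Handling (Enron Dataset)
The code below was run under Python 3.6.
The Enron scandal led to what was then the largest corporate bankruptcy in history. In 2000, Enron was the largest energy company in the United States; after its fraud was exposed, it went bankrupt within a year.
We use the Enron data for this machine learning project because the Enron email corpus is available: it contains about 500,000 emails exchanged among roughly 150 former Enron employees, most of them senior managers. It is also the only large, public database of real email.
If you are interested, the Enron documentary is a classic and a sobering one: "Enron: The Smartest Guys in the Room".
Or read the article on the Enron scandal.
For an analysis of the Enron dataset, see the previous post:
Enron Dataset Analysis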
Regression of net worth on age
#!/usr/bin/python
import random
import numpy
import matplotlib.pyplot as plt
import pickle
from outlier_cleaner import outlierCleaner

# Python 2 to Python 3: wrap a text-mode file so that pickle receives bytes
class StrToBytes:
    def __init__(self, fileobj):
        self.fileobj = fileobj
    def read(self, size):
        return self.fileobj.read(size).encode()
    def readline(self, size=-1):
        return self.fileobj.readline(size).encode()

### load up some practice data with outliers in it
ages = pickle.load( StrToBytes(open("practice_outliers_ages.pkl", "r")) )
net_worths = pickle.load( StrToBytes(open("practice_outliers_net_worths.pkl", "r")) )

### ages and net_worths need to be reshaped into 2D numpy arrays
### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
### by convention, n_rows is the number of data points
### and n_columns is the number of features
ages = numpy.reshape( numpy.array(ages), (len(ages), 1))
net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

# sklearn.cross_validation was removed in newer scikit-learn releases
#from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths, test_size=0.1, random_state=42)

### fill in a regression here! Name the regression object reg so that
### the plotting code below works, and you can see what your regression looks like
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(ages_train, net_worths_train)
print("slope: ", reg.coef_)
print("intercept: ", reg.intercept_)
print("train score: ", reg.score(ages_train, net_worths_train))
print("test score: ", reg.score(ages_test, net_worths_test))

try:
    plt.plot(ages, reg.predict(ages), color="blue")
except NameError:
    pass
plt.scatter(ages, net_worths)
plt.show()

slope: [[5.07793064]]
intercept: [25.21002155]
train score: 0.4898725961751499
test score: 0.8782624703664671
At this point the slope is about 5.08, the R-squared on the training set is 0.4899, and the R-squared on the test set is 0.878.
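The scores printed above come from `reg.score()`, which for `LinearRegression` returns the coefficient of determination R². As a minimal sketch of what that number means, the helper below computes R² by hand. The name `r_squared` and the toy values are invented for illustration and are not part of the original project.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: what the model fails to explain.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: spread of the actual values around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy values, invented for illustration only (not the practice data).
actual = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]
mean_only = [2.5, 2.5, 2.5, 2.5]

print(r_squared(actual, perfect))    # 1.0
print(r_squared(actual, mean_only))  # 0.0
```

A model that always predicts the mean scores 0, a perfect model scores 1, and a model worse than the mean can score below 0.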
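Removing outliers from the age and net worth data
In outlier_cleaner.py, define the outlier-cleaning function outlierCleaner(), which removes the 10% of points with the largest residual errors.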
#!/usr/bin/python
import math

def outlierCleaner(predictions, ages, net_worths):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).

        Return a list of tuples named cleaned_data where
        each tuple is of the form (age, net_worth, error).
    """
    cleaned_data = []

    ### your code goes here
    # absolute residual of every training point
    errors = abs(predictions - net_worths)
    cleaned_data = zip(ages, net_worths, errors)
    # sort by error, smallest first, and keep the first 90%
    cleaned_data = sorted(cleaned_data, key=lambda clean: clean[2])
    clean_num = int(math.ceil(len(cleaned_data)*0.9))
    cleaned_data = cleaned_data[:clean_num]
    print('data length: ', len(ages))
    print('cleaned_data length: ', len(cleaned_data))
    return cleaned_data
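To see the cleaning rule in isolation, here is a small pure-Python sketch of the same idea on invented data. The function name `drop_largest_residuals` and the `keep_fraction` parameter are hypothetical and only mirror the logic of outlierCleaner(); the predictions are a stand-in for a fitted model.

```python
import math


def drop_largest_residuals(predictions, xs, ys, keep_fraction=0.9):
    """Return (x, y, error) tuples for the keep_fraction of points
    with the smallest absolute residuals."""
    rows = [(x, y, abs(p - y)) for p, x, y in zip(predictions, xs, ys)]
    rows.sort(key=lambda row: row[2])        # smallest error first
    n_keep = int(math.ceil(len(rows) * keep_fraction))
    return rows[:n_keep]


# Ten toy points on y = 2x with one corrupted target (invented data).
xs = list(range(10))
ys = [2.0 * x for x in xs]
ys[4] = 100.0                                # the planted outlier
predictions = [2.0 * x for x in xs]          # stand-in for model output

cleaned = drop_largest_residuals(predictions, xs, ys)
print(len(cleaned))                          # 9
print(4 in [x for x, _, _ in cleaned])       # False
```

Because the rule always discards a fixed share of points, it also removes the worst-fitting 10% of a dataset that has no real outliers, so the fraction is a choice to make per dataset.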
Call outlierCleaner() and refit the regression on the cleaned data
from outlier_cleaner import outlierCleaner

cleaned_data = outlierCleaner(reg.predict(ages_train), ages_train, net_worths_train)
# unpack the (age, net_worth, error) tuples back into separate sequences
ages_train_new, net_worths_train_new, e = zip(*cleaned_data)
ages_train_new = numpy.reshape( numpy.array(ages_train_new), (len(ages_train_new), 1))
net_worths_train_new = numpy.reshape( numpy.array(net_worths_train_new), (len(net_worths_train_new), 1))

# refit on the cleaned training data
reg.fit(ages_train_new, net_worths_train_new)
print("slope_removal: ", reg.coef_)
print("intercept_removal: ", reg.intercept_)
print("train score_removal: ", reg.score(ages_train_new, net_worths_train_new))
print("test score_removal: ", reg.score(ages_test, net_worths_test))

try:
    plt.plot(ages_train_new, reg.predict(ages_train_new), color="blue")
except NameError:
    pass
plt.scatter(ages_train_new, net_worths_train_new)
plt.show()

slope_removal: [[6.36859481]]
intercept_removal: [-6.91861069]
train score_removal: 0.9513734907601892
test score_removal: 0.9831894553955322
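After removing the outliers, the slope is 6.369, the R-squared on the training set is 0.95, and the R-squared on the test set is 0.98, a clear improvement on both sets.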
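The shift in slope is what one would expect when a few bad points drag a least-squares line. The sketch below reproduces the fit, clean, refit cycle with a hand-written closed-form fit on invented data; `fit_line` is a hypothetical helper, not the scikit-learn estimator used above, and the numbers are not the practice data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented data: y = 6x exactly, except two points whose targets are zeroed.
xs = [float(x) for x in range(20, 40)]
ys = [6.0 * x for x in xs]
ys[-1] = ys[-2] = 0.0

slope, intercept = fit_line(xs, ys)
print(slope < 1.0)          # True: two bad points pull the slope far below 6

# Keep the 90% of points with the smallest absolute residuals, then refit.
by_error = sorted(zip(xs, ys), key=lambda p: abs(slope * p[0] + intercept - p[1]))
kept = by_error[: (len(by_error) * 9) // 10]   # integer arithmetic: 18 of 20
new_slope, new_intercept = fit_line([x for x, _ in kept], [y for _, y in kept])
print(round(new_slope, 2))  # 6.0
```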
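Examining the dataset by salary and bonus
Here we look at the salary and bonus features of the Enron dataset to determine whether it contains outliers.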
import pickle
import sys
import matplotlib.pyplot
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

# Python 2 to Python 3: wrap a text-mode file so that pickle receives bytes
class StrToBytes:
    def __init__(self, fileobj):
        self.fileobj = fileobj
    def read(self, size):
        return self.fileobj.read(size).encode()
    def readline(self, size=-1):
        return self.fileobj.readline(size).encode()

### read in data dictionary, convert to numpy array
data_dict = pickle.load( StrToBytes(open("../final_project/final_project_dataset.pkl", "r")) )
features = ["salary", "bonus"]
data = featureFormat(data_dict, features)
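The StrToBytes wrapper opens the pickle in text mode and re-encodes what it reads. One plausible reason, stated here as an assumption rather than a fact about the project files, is that an ASCII (protocol 0) pickle saved with Windows-style line endings is normalized by text mode before pickle sees it. The sketch below imitates that situation with an invented payload and a temporary file.

```python
import os
import pickle
import tempfile


class StrToBytes:
    """Wrap a text-mode file so that pickle receives bytes."""

    def __init__(self, fileobj):
        self.fileobj = fileobj

    def read(self, size):
        return self.fileobj.read(size).encode()

    def readline(self, size=-1):
        return self.fileobj.readline(size).encode()


# Build a protocol-0 (ASCII) pickle and save it with "\r\n" line endings,
# imitating a file that was written in text mode. The payload is invented.
payload = {"salary": 1000, "bonus": "NaN"}
ascii_pickle = pickle.dumps(payload, protocol=0).replace(b"\n", b"\r\n")

path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
with open(path, "wb") as f:
    f.write(ascii_pickle)

# Text mode translates "\r\n" back to "\n"; the wrapper re-encodes to bytes.
with open(path, "r") as f:
    restored = pickle.load(StrToBytes(f))

print(restored == payload)  # True
```

The wrapper only makes sense for ASCII pickles; a binary-protocol pickle should be opened with mode "rb" and passed to pickle.load directly.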
Visualizing the salary and bonus data
for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()
Clearly, the point in the upper right corner lies far away from all the others; it is an outlier.
Finding the salary and bonus outlier
# the row with the largest salary
max_value = sorted(data, reverse=True, key=lambda sal: sal[0])[0]
print('the max_value is: ', max_value)

# look up which dictionary key that salary belongs to
for i in data_dict:
    if data_dict[i]['salary'] == max_value[0]:
        print('the max_value belongs to: ', i)

the max_value is: [26704229. 97343619.]
the max_value belongs to: TOTAL
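The same lookup can be done directly on the dictionary, as long as missing values are skipped; in this dataset they are stored as the string 'NaN'. The sketch below uses an invented three-entry stand-in for data_dict, so the names and numbers are hypothetical.

```python
# Hypothetical miniature of data_dict; names and numbers are invented.
toy_dict = {
    "PERSON A": {"salary": 200000, "bonus": 50000},
    "PERSON B": {"salary": "NaN", "bonus": 10000},
    "TOTAL": {"salary": 900000, "bonus": 60000},
}

# Skip records whose salary is the string 'NaN', then take the largest.
with_salary = {name: rec for name, rec in toy_dict.items() if rec["salary"] != "NaN"}
top_name = max(with_salary, key=lambda name: with_salary[name]["salary"])

print(top_name)  # TOTAL
```

A key such as TOTAL is a spreadsheet summary row rather than a person, which is why it dwarfs every real record.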
Re-plotting salary and bonus after removing the outlier
# drop the spreadsheet total row, then rebuild the feature array
data_dict.pop( 'TOTAL', 0 )
data = featureFormat(data_dict, features)

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()
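A short note on the `pop` call used above: passing a default as the second argument means the call does not raise KeyError if the key has already been removed, which is convenient when re-running a notebook cell or script. A minimal illustration with an invented dictionary:

```python
# Invented stand-in for data_dict.
records = {"TOTAL": {"salary": 900000}, "PERSON A": {"salary": 200000}}

removed = records.pop("TOTAL", 0)        # key present: returns the record
removed_again = records.pop("TOTAL", 0)  # key absent: returns the default

print(removed)        # {'salary': 900000}
print(removed_again)  # 0
print(list(records))  # ['PERSON A']
```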
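We would argue that there are 4 more outliers worth investigating; let's look at a couple of them. Two people received bonuses of at least 5 million dollars and salaries of over 1 million dollars; in other words, they made out like bandits.
What are the names associated with those points?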
for i in data_dict:
    # missing values are stored as the string 'NaN'
    if data_dict[i]['salary'] != 'NaN' and data_dict[i]['bonus'] != 'NaN':
        if data_dict[i]['salary'] > 1e6 and data_dict[i]['bonus'] > 5e6:
            print(i)

LAY KENNETH L
SKILLING JEFFREY K
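Do you think these two outliers should be removed, or kept as data points? The options are: keep them, they are valid data points; remove them, they are a spreadsheet quirk; remove them, they are errors.
These two points should of course not be removed. The facts show that both men took large sums of money illegally, and they were central subjects of the legal investigation.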