Machine Learning (8): Regression and Outlier Handling (Enron Dataset)

Article link: https://blog.csdn.net/xiao_lxl/article/details/91860104

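Contents

Regression on age and net worth
Removing outliers from the age and net worth data
Processing the dataset by salary and bonus
Visualizing the salary and bonus data
Finding the salary and bonus outlier
Re-plotting salary and bonus after removing the outlier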

Machine Learning: Regression and Outlier Handling (Enron Dataset)

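The code below was run under Python 3.6.

The Enron scandal led to what was then the largest corporate bankruptcy in history. In 2000 Enron was one of the largest energy companies in the United States, yet within a year of its fraud being exposed it was bankrupt.

We use the Enron data for this machine learning project because the Enron email corpus is available: it contains about 500,000 emails exchanged among roughly 150 former Enron employees, most of them senior executives. It is often described as the only large public collection of real emails.

If you are interested, there is a classic and sobering documentary about Enron: Enron: The Smartest Guys in the Room
Or read an article on the Enron scandal.

For an analysis of the Enron dataset, see the previous post:
Enron Dataset Analysis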

Regression on age and net worth

#!/usr/bin/python

import random
import numpy
import matplotlib.pyplot as plt
import pickle

from outlier_cleaner import outlierCleaner


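# Python 2 -> Python 3 shim: the pickle files are opened in text mode ("r"),
# so wrap the file object and hand pickle the bytes it expects.
class StrToBytes:
    def __init__(self, fileobj):
        self.fileobj = fileobj
    def read(self, size):
        return self.fileobj.read(size).encode()
    def readline(self, size=-1):
        return self.fileobj.readline(size).encode()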


### load up some practice data with outliers in it
ages = pickle.load( StrToBytes(open("practice_outliers_ages.pkl", "r")))
net_worths = pickle.load( StrToBytes(open("practice_outliers_net_worths.pkl", "r")) )



### ages and net_worths need to be reshaped into 2D numpy arrays
### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
### by convention, n_rows is the number of data points
### and n_columns is the number of features
ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))
#from sklearn.cross_validation import train_test_split

from sklearn.model_selection import train_test_split
ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths, test_size=0.1, random_state=42)

### fill in a regression here!  Name the regression object reg so that
### the plotting code below works, and you can see what your regression looks like




from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(ages_train, net_worths_train)

print("slope: ", reg.coef_)
print("intercept: ", reg.intercept_)
print("train score: ", reg.score(ages_train, net_worths_train))
print("test score: ", reg.score(ages_test, net_worths_test))

try:
    plt.plot(ages, reg.predict(ages), color="blue")
except NameError:
    pass
plt.scatter(ages, net_worths)
plt.show()

[Figure: scatter plot of age vs. net worth with the fitted regression line]

slope: [[5.07793064]]
intercept: [25.21002155]
train score: 0.4898725961751499
test score: 0.8782624703664671

At this point the slope is about 5.08, the R² on the training set is 0.490, and the R² on the test set is 0.878.
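
The values returned by score() above are R² (coefficient of determination) values. As a minimal sketch of what that number measures, the NumPy snippet below computes R² by hand on made-up data. The helper name r_squared and the toy arrays are assumptions for illustration and are not part of the original project.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation the line does not explain
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Toy data (made up): y = 2x exactly, plus a copy with one corrupted point.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_clean = 2.0 * x
y_noisy = y_clean.copy()
y_noisy[2] = 0.0  # a single outlier

perfect = r_squared(y_clean, 2.0 * x)   # predictions match the data exactly
damaged = r_squared(y_noisy, 2.0 * x)   # same line, scored against data with an outlier
print(perfect, damaged)
```

A single bad point is enough to pull R² well below 1 even when the line itself is right. That would be consistent with the low training score above if most of the outliers in the practice data landed in the training split.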

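Removing outliers from the age and net worth data

Define the outlier-removal function outlierCleaner() in outlier_cleaner.py. It discards the 10% of points with the largest residual errors.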

#!/usr/bin/python

import math

def outlierCleaner(predictions, ages, net_worths):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).

        Return a list of tuples named cleaned_data where 
        each tuple is of the form (age, net_worth, error).
    """
    
    # Absolute residual for every training point.
    errors = abs(predictions - net_worths)
    # Sort (age, net_worth, error) tuples by error, smallest first.
    cleaned_data = sorted(zip(ages, net_worths, errors), key=lambda clean: clean[2])
    # Keep the 90% of points with the smallest errors.
    clean_num = int(math.ceil(len(cleaned_data) * 0.9))
    cleaned_data = cleaned_data[:clean_num]

    print('data length: ', len(ages))
    print('cleaned_data length: ', len(cleaned_data))
    return cleaned_data
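
The same cleaning step can also be written without sorting tuples. The sketch below is one possible NumPy variant, assuming one-dimensional inputs. The name keep_smallest_residuals and the keep_fraction parameter are hypothetical, and the toy data is made up.

```python
import numpy as np

def keep_smallest_residuals(predictions, x, y, keep_fraction=0.9):
    """Return rows of (x, y, error) for the points with the smallest |error|."""
    predictions = np.asarray(predictions, dtype=float).ravel()
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    errors = np.abs(predictions - y)
    n_keep = int(np.ceil(len(y) * keep_fraction))
    # Indices of the n_keep smallest errors, in ascending order of error.
    order = np.argsort(errors, kind="stable")[:n_keep]
    return np.column_stack((x[order], y[order], errors[order]))

# Ten made-up points; the prediction is off by 50 for exactly one of them.
x = np.arange(10, dtype=float)
y = 3.0 * x
y[4] += 50.0
cleaned = keep_smallest_residuals(3.0 * x, x, y)
print(cleaned.shape)
```

Returning a single array instead of a list of tuples avoids the zip(*cleaned_data) and reshape steps later, at the cost of a slightly different return type from the function above.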

Call outlierCleaner() and refit the regression on the cleaned data.

from outlier_cleaner import outlierCleaner

cleaned_data = outlierCleaner(reg.predict(ages_train), ages_train, net_worths_train)

ages_train_new, net_worths_train_new, e = zip(*cleaned_data)

ages_train_new       = numpy.reshape( numpy.array(ages_train_new), (len(ages_train_new), 1))
net_worths_train_new = numpy.reshape( numpy.array(net_worths_train_new), (len(net_worths_train_new), 1))

reg.fit(ages_train_new, net_worths_train_new)



print("slope_removal: ", reg.coef_)
print("intercept_removal: ", reg.intercept_)
print("train score_removal: ", reg.score(ages_train_new, net_worths_train_new))
print("test score_removal: ", reg.score(ages_test, net_worths_test))

try:
    plt.plot(ages_train_new, reg.predict(ages_train_new), color="blue")
except NameError:
    pass
plt.scatter(ages_train_new, net_worths_train_new)
plt.show()

[Figure: scatter plot of the cleaned training data with the refitted regression line]

slope_removal: [[6.36859481]]
intercept_removal: [-6.91861069]
train score_removal: 0.9513734907601892
test score_removal: 0.9831894553955322

After removing the outliers the slope is 6.369, the R² on the training set is 0.951, and the R² on the test set is 0.983, a clear improvement.
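
To see the fit, clean, refit loop in isolation, here is a small sketch on synthetic data where the true slope is known. Everything in it (the slope of 6.25, the noise level, the corrupted points) is an assumption chosen for illustration, and np.polyfit stands in for LinearRegression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): true slope 6.25 with small noise.
x = np.linspace(20, 65, 100)
y = 6.25 * x + rng.normal(0.0, 5.0, size=x.size)
y[:10] = 0.0   # corrupt 10% of the points so they act as outliers

def fit_line(x, y):
    """Least-squares straight line; returns (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

slope_before, intercept_before = fit_line(x, y)

# Drop the 10% of points with the largest absolute residuals, then refit.
residuals = np.abs(y - (slope_before * x + intercept_before))
keep = np.argsort(residuals)[: int(np.ceil(0.9 * x.size))]
slope_after, intercept_after = fit_line(x[keep], y[keep])

print("slope before:", slope_before)
print("slope after: ", slope_after)
```

The first fit is distorted by the corrupted points, but they still end up with the largest residuals, so one round of cleaning is enough here. On messier data the outliers can hide behind the distorted fit, and a single pass may not remove all of them.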

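Processing the dataset by salary and bonus

This part works with the salary and bonus fields of the Enron dataset to check whether the dataset contains outliers.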

import pickle
import sys
import matplotlib.pyplot
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

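# Python 2 -> Python 3 shim: the pickle file is opened in text mode ("r"),
# so wrap the file object and hand pickle the bytes it expects.
class StrToBytes:
    def __init__(self, fileobj):
        self.fileobj = fileobj
    def read(self, size):
        return self.fileobj.read(size).encode()
    def readline(self, size=-1):
        return self.fileobj.readline(size).encode()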

### read in data dictionary, convert to numpy array
data_dict = pickle.load( StrToBytes(open("../final_project/final_project_dataset.pkl", "r") ))
features = ["salary", "bonus"]


data = featureFormat(data_dict, features)
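
featureFormat comes from the course's tools directory and is not shown in this post. To my understanding, with its default settings it turns the requested features into a NumPy array, replaces the string 'NaN' with 0, and drops rows whose values are all zero. The function below is a simplified, hypothetical stand-in written under those assumptions. It is not the actual helper, and the records are made up.

```python
import numpy as np

def format_features(data_dict, features):
    """Simplified stand-in for featureFormat (hypothetical, not the course helper).

    Builds one row per person with the requested features, replaces the
    string 'NaN' with 0.0, and skips rows where every value is zero.
    """
    rows = []
    for name, record in data_dict.items():
        row = []
        for feature in features:
            value = record.get(feature, "NaN")
            row.append(0.0 if value == "NaN" else float(value))
        if any(v != 0.0 for v in row):   # drop people with no usable values
            rows.append(row)
    return np.array(rows)

# Made-up records that mimic the shape of the Enron dictionary.
toy_dict = {
    "PERSON A": {"salary": 200000, "bonus": 100000},
    "PERSON B": {"salary": "NaN", "bonus": 50000},
    "PERSON C": {"salary": "NaN", "bonus": "NaN"},
}
toy_data = format_features(toy_dict, ["salary", "bonus"])
print(toy_data)
```

If the real helper behaves this way, a missing salary shows up in the scatter plot as a point on the axis, not as a gap, which is worth keeping in mind when reading the figures.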

Visualizing the salary and bonus data

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()

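[Figure: scatter plot of salary vs. bonus]
Clearly, the single point in the upper-right corner lies far from all the others and is an outlier.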

Finding the salary and bonus outlier

max_value = sorted(data, reverse=True, key=lambda sal:sal[0])[0]
print('the max_value is: ', max_value)

for i in data_dict:
    if data_dict[i]['salary'] == max_value[0]:
        print('Who is the max_value is: ',i)

the max_value is: [26704229. 97343619.]
Who is the max_value is: TOTAL
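
The lookup above sorts the formatted array and then matches the value back against the dictionary. A more direct route is to search the dictionary itself while skipping 'NaN' entries. The sketch below shows one way to do that. The helper name person_with_max and the toy records are assumptions for illustration.

```python
def person_with_max(data_dict, feature):
    """Return (name, value) for the largest numeric value of `feature`, ignoring 'NaN'."""
    candidates = {
        name: record[feature]
        for name, record in data_dict.items()
        if record.get(feature, "NaN") != "NaN"
    }
    name = max(candidates, key=candidates.get)
    return name, candidates[name]

# Made-up records; "TOTAL" mimics a spreadsheet sum row.
toy_dict = {
    "PERSON A": {"salary": 200000},
    "PERSON B": {"salary": "NaN"},
    "TOTAL": {"salary": 1000000},
}
result = person_with_max(toy_dict, "salary")
print(result)
```

Working on the dictionary keeps the person's name attached to the value, so no equality comparison between floats and dictionary entries is needed.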

Re-plotting salary and bonus after removing the outlier

data_dict.pop( 'TOTAL', 0 )
data = featureFormat(data_dict, features)

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()

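[Figure: scatter plot of salary vs. bonus after removing TOTAL]

We would argue that there are 4 more outliers to investigate; let's look at a couple of them. Two people received bonuses of at least 5 million dollars and salaries of over 1 million dollars; in other words, they made out like bandits.

What are the names associated with those points?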

for i in data_dict:
    if data_dict[i]['salary'] != 'NaN' and data_dict[i]['bonus'] != 'NaN':
        if data_dict[i]['salary'] > 1e6 and data_dict[i]['bonus'] > 5e6:
            print(i)

LAY KENNETH L
SKILLING JEFFREY K

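Do you think these two outliers should be removed, or kept as data points?
Keep them, they are valid data points
Remove them, they are a spreadsheet quirk
Remove them, they are an error

These two outliers should of course be kept. The record shows that both men obtained large sums of money illegally, and they were key subjects of the judicial investigation.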