A Summary of Machine Learning Algorithms

weixin_42317631

Article link: https://blog.csdn.net/weixin_42317631/article/details/131104220

1. Linear regression

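Regression analysis is a statistical method for analyzing data. Its goal is to determine whether two or more variables are related, the direction and strength of that relationship, and to build a mathematical model so that changes in a variable of interest can be predicted from observed values of the others.
Building a linear regression model means using the data points to find the best-fitting line. The formula is y = ax + b, where y is the dependent variable and x is the independent variable; a and b are estimated from the given dataset. Linear regression comes in two types: simple linear regression (a single independent variable) and multiple linear regression (two or more independent variables).
For example: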

from sklearn import linear_model, datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()  # load the dataset
clf = linear_model.LinearRegression()  # create the linear model
# Hold out 8 samples for testing. The iris target is a class label (0, 1, 2);
# it is treated as a number here only to demonstrate the regression API.
x_train, x_test, y_train, y_true = train_test_split(
    iris.data, iris.target, test_size=8, random_state=0)
clf.fit(x_train, y_train)  # fit the model
y_pred = clf.predict(x_test)
print(y_pred, y_true)

# Evaluate the regression with sklearn's MSE, RMSE, MAE and R^2 metrics
# from math import sqrt
# from sklearn.metrics import mean_absolute_error
# from sklearn.metrics import mean_squared_error
# from sklearn.metrics import r2_score
# print("mean_absolute_error:", mean_absolute_error(y_true, y_pred))
# print("mean_squared_error:", mean_squared_error(y_true, y_pred))
# print("rmse:", sqrt(mean_squared_error(y_true, y_pred)))
# print("r2 score:", r2_score(y_true, y_pred))
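
To make the formula y = ax + b concrete, here is a minimal sketch that estimates a and b with the closed-form least-squares solution. The toy data and the variable names are my own assumptions for illustration; they are not part of the iris example above.

```python
import numpy as np

# Hypothetical toy data generated from y = 2x + 1 plus small noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Closed-form least-squares estimates for y = a*x + b
x_mean, y_mean = x.mean(), y.mean()
a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - a * x_mean
print(a, b)  # should land close to the true slope 2 and intercept 1
```

The same estimates can be obtained with np.polyfit(x, y, 1), which is a convenient cross-check.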

2. Support vector machine

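The support vector machine (SVM) is mainly used for classification. An SVM represents instances as points in space and separates the classes with a maximum-margin hyperplane (a straight line in two dimensions). SVMs need fully labeled input data. A single SVM is a binary classifier, but that does not limit it to two-class tasks: multi-class problems are handled by combining several binary classifiers (one-vs-one or one-vs-rest). scikit-learn's SVC does this internally with a one-vs-one scheme, and the example below classifies the three-class iris dataset.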

The code is as follows:

from sklearn import svm, datasets

iris = datasets.load_iris()  # load the dataset

clf = svm.SVC(gamma=0.001, C=100)  # create the SVM classifier (RBF kernel by default)

x, y = iris.data[:-1], iris.target[:-1]  # train on every sample except the last

clf.fit(x, y)  # fit the model
y_pred = clf.predict([iris.data[-1]])
y_true = iris.target[-1]
print(y_pred, y_true)
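
On the question of whether SVMs only handle two classes: the short sketch below suggests that scikit-learn's SVC accepts a three-class target directly. Setting decision_function_shape="ovo" exposes one score per pair of classes, which is how the binary classifiers are combined. The choice of iris and of default hyperparameters is just an assumption for illustration.

```python
from sklearn import datasets, svm

iris = datasets.load_iris()  # three classes: 0, 1, 2
clf = svm.SVC(decision_function_shape="ovo")  # expose one score per class pair
clf.fit(iris.data, iris.target)

scores = clf.decision_function(iris.data[:1])
print(clf.classes_)   # the class labels seen during training
print(scores.shape)   # one column per pair of classes: 3 * (3 - 1) / 2 = 3
```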

3. K-nearest neighbors (KNN)

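The KNN algorithm says that if most of the k most similar samples to a given sample in feature space belong to one class, then the sample belongs to that class too. In other words, for a new instance, find its k nearest instances in the dataset and assign it to the class that the majority of those k instances belong to.

KNN can be used for both classification and regression.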

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()  # load the dataset

clf = KNeighborsClassifier(n_neighbors=6)  # create the model

x, y = iris.data[:-1], iris.target[:-1]  # train on every sample except the last

clf.fit(x, y)  # fit the model
y_pred = clf.predict([iris.data[-1]])
y_true = iris.target[-1]
print(y_pred, y_true)
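
The majority-vote rule can also be written out by hand. The following is a minimal sketch, not how scikit-learn implements it; the function name knn_predict and the toy points are hypothetical.

```python
from collections import Counter
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = np.linalg.norm(train_x - query, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(train_y[nearest].tolist())            # count labels among the neighbors
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters labeled 0 and 1
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
train_y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(train_x, train_y, np.array([0.3, 0.3])))
print(knn_predict(train_x, train_y, np.array([4.8, 5.0])))
```

For regression, the same neighbor search would average the neighbors' target values instead of counting votes.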

4. Logistic regression

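Logistic regression is generally used where a definite, categorical output is required, such as whether an event will occur (for example, predicting whether it will rain). It typically uses a function to squash a value into a specific range so that the result can be read as a probability.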

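For example, the sigmoid (an S-shaped function) is used for binary classification. It maps any real-valued score to the range 0 to 1, which is interpreted as the probability that the event occurs.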
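
As a quick illustration of that squashing behavior, here is a minimal sigmoid sketch. The 0.5 threshold used to turn a probability into a label is a common convention that I am assuming here, not something fixed by the algorithm.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
p = sigmoid(z)
print(p)                        # a value near 0, exactly 0.5, and a value near 1
print((p >= 0.5).astype(int))   # threshold at 0.5 to get a binary label
```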

#!/usr/bin/python
# -*- coding:utf-8 -*-
# Iris classification
# 1. Import the packages:
import numpy as np  # numerical computing
from sklearn.linear_model import LogisticRegression  # logistic regression
import matplotlib.pyplot as plt  # plotting
import matplotlib as mpl  # colormaps
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import datasets  # datasets bundled with sklearn

 
# 2. Load the dataset bundled with sklearn
iris = datasets.load_iris()
x = iris.data[:, :2]  # take only the first two features (sepal length and sepal width)
y = iris.target
 
from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=666)
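# 3. Build the model with a Pipeline
# StandardScaler(): removes the mean and scales to unit variance. It standardizes each feature (column) separately, not each sample.
# PolynomialFeatures(degree=1) is an optional step that is not used in the pipeline below. It constructs polynomial features:
# for two features a and b, the degree-2 expansion is (1, a, b, a^2, ab, b^2). Its three main parameters are:
# 1. degree: the degree of the polynomial
# 2. interaction_only: False by default; if True, no feature is combined with itself, so a^2 and b^2 are dropped from the expansion above.
# 3. include_bias: True by default; if True, the constant term 1 above is included.
# LogisticRegression() creates the logistic regression model. With a multi-class target, the newton-cg solver
# fits a multinomial model by default; the multi_class argument is deprecated in recent scikit-learn releases.
lr = Pipeline([('sc', StandardScaler()),
               ('clf', LogisticRegression(solver="newton-cg"))])
lr.fit(X_train, y_train)  # y_train is already a one-dimensional label array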
 
# 4. Prepare for plotting
N, M = 500, 500  # number of sampling points along each axis
x1_min, x1_max = x[:, 0].min(), x[:, 0].max()  # range of column 0
x2_min, x2_max = x[:, 1].min(), x[:, 1].max()  # range of column 1
t1 = np.linspace(x1_min, x1_max, N)
t2 = np.linspace(x2_min, x2_max, M)
x1, x2 = np.meshgrid(t1, t2)  # build the sampling grid
x_test = np.stack((x1.ravel(), x2.ravel()), axis=1)  # grid points to classify
 
# 5. Draw the plot
cm_light = mpl.colors.ListedColormap(['#77E0A0', '#FF8080', '#A0A0FF'])
cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
y_hat = lr.predict(x_test)  # predictions on the grid
y_hat = y_hat.reshape(x1.shape)  # reshape to match the grid
# print(y_hat)
plt.pcolormesh(x1, x2, y_hat, shading='auto', cmap=cm_light)  # predicted regions, drawn as the background
plt.scatter(x[:, 0], x[:, 1], c=y.ravel(), edgecolors='k', s=50, cmap=cm_dark)  # the samples
plt.xlabel('sepal length')  # column 0 of the iris data
plt.ylabel('sepal width')   # column 1 of the iris data
plt.xlim(x1_min, x1_max)
plt.ylim(x2_min, x2_max)
plt.grid()
# plt.savefig('2.png')
plt.show()
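# 6. Predictions on the full dataset (training and test samples together)
y_hat = lr.predict(x)  # predicted labels
y = y.ravel()  # flatten to one dimension
print(y)
# y = y.reshape(-1)  # alternative way to flatten
# print(y)
result = y_hat == y  # compare predictions with the true labels
print(y_hat)
print(result)
acc = np.mean(result)  # fraction of correct predictions
print('Accuracy: %.2f%%' % (100 * acc))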

5. Decision tree

Background on decision trees

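A decision tree model generally has two kinds of nodes: internal nodes, usually drawn as rectangles, which test a sample feature, and leaf nodes, usually drawn as ellipses, which carry a sample label. To classify with a decision tree, a sample starts at the root node, one of its features is tested to decide which child node it belongs to, and this repeats until a node with no children (a leaf node) is reached, which gives the class.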
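
The root-to-leaf walk described above can be sketched with a small hand-built tree. The thresholds and the dictionary layout here are hypothetical and chosen only for illustration; a trained scikit-learn tree stores its nodes differently.

```python
# A hand-built toy tree (hypothetical thresholds, for illustration only).
# Internal nodes test one feature; leaves carry a class label.
toy_tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def classify(node, sample):
    """Walk from the root to a leaf, testing one feature at each internal node."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(classify(toy_tree, {"petal_length": 1.4, "petal_width": 0.2}))
print(classify(toy_tree, {"petal_length": 5.1, "petal_width": 1.8}))
```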

# Imports
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets

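# Load and explore the data
iris = datasets.load_iris()
x_features = iris.data[:-1]   # training features: every sample except the last
x_target = iris.target[:-1]   # training labels
y_features = iris.data[-1]    # the held-out sample to predict
y_target = iris.target[-1]    # its true label

print(iris.data)  # the four features are sepal length, sepal width, petal length and petal width
print(iris.target)  # the target is the species: Iris Setosa, Iris Versicolour and Iris Virginica, encoded as 0, 1, 2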

# Create the decision tree classifier
model_tree=DecisionTreeClassifier(random_state=0)

# Fit the model
model=model_tree.fit(x_features,x_target)

# Predict
pre=model.predict([y_features])  # predicted class
# pre_proba=model.predict_proba([y_features])  # predicted class probabilities
print(pre)

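# DOT data for the fitted tree
# In this step we export the trained model in DOT format (a graph description language).
# To do so, we use the tree module imported from sklearn
# and call export_graphviz with the decision tree, the feature names and the class names as arguments.
from sklearn import tree

dot_data = tree.export_graphviz(model_tree, out_file=None,
feature_names=iris.feature_names, 
class_names=iris.target_names
)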

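# Plot the tree
import graphviz  # needs the graphviz Python package and the Graphviz binaries
import matplotlib.pyplot as plt

tree.plot_tree(model)  # draw the fitted tree with matplotlib
plt.show()
dot=tree.export_graphviz(model,filled=True)  # DOT source with colored nodes
graph=graphviz.Source(dot)  # in a notebook, evaluating `graph` renders the tree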

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.9 1.5]
 [5.5 2.3 4.  1.3]
 [6.5 2.8 4.6 1.5]
 [5.7 2.8 4.5 1.3]
 [6.3 3.3 4.7 1.6]
 [4.9 2.4 3.3 1. ]
 [6.6 2.9 4.6 1.3]
 [5.2 2.7 3.9 1.4]
 [5.  2.  3.5 1. ]
 [5.9 3.  4.2 1.5]
 [6.  2.2 4.  1. ]
 [6.1 2.9 4.7 1.4]
 [5.6 2.9 3.6 1.3]
 [6.7 3.1 4.4 1.4]
 [5.6 3.  4.5 1.5]
 [5.8 2.7 4.1 1. ]
 [6.2 2.2 4.5 1.5]
 [5.6 2.5 3.9 1.1]
 [5.9 3.2 4.8 1.8]
 [6.1 2.8 4.  1.3]
 [6.3 2.5 4.9 1.5]
 [6.1 2.8 4.7 1.2]
 [6.4 2.9 4.3 1.3]
 [6.6 3.  4.4 1.4]
 [6.8 2.8 4.8 1.4]
 [6.7 3.  5.  1.7]
 [6.  2.9 4.5 1.5]
 [5.7 2.6 3.5 1. ]
 [5.5 2.4 3.8 1.1]
 [5.5 2.4 3.7 1. ]
 [5.8 2.7 3.9 1.2]
 [6.  2.7 5.1 1.6]
 [5.4 3.  4.5 1.5]
 [6.  3.4 4.5 1.6]
 [6.7 3.1 4.7 1.5]
 [6.3 2.3 4.4 1.3]
 [5.6 3.  4.1 1.3]
 [5.5 2.5 4.  1.3]
 [5.5 2.6 4.4 1.2]
 [6.1 3.  4.6 1.4]
 [5.8 2.6 4.  1.2]
 [5.  2.3 3.3 1. ]
 [5.6 2.7 4.2 1.3]
 [5.7 3.  4.2 1.2]
 [5.7 2.9 4.2 1.3]
 [6.2 2.9 4.3 1.3]
 [5.1 2.5 3.  1.1]
 [5.7 2.8 4.1 1.3]
 [6.3 3.3 6.  2.5]
 [5.8 2.7 5.1 1.9]
 [7.1 3.  5.9 2.1]
 [6.3 2.9 5.6 1.8]
 [6.5 3.  5.8 2.2]
 [7.6 3.  6.6 2.1]
 [4.9 2.5 4.5 1.7]
 [7.3 2.9 6.3 1.8]
 [6.7 2.5 5.8 1.8]
 [7.2 3.6 6.1 2.5]
 [6.5 3.2 5.1 2. ]
 [6.4 2.7 5.3 1.9]
 [6.8 3.  5.5 2.1]
 [5.7 2.5 5.  2. ]
 [5.8 2.8 5.1 2.4]
 [6.4 3.2 5.3 2.3]
 [6.5 3.  5.5 1.8]
 [7.7 3.8 6.7 2.2]
 [7.7 2.6 6.9 2.3]
 [6.  2.2 5.  1.5]
 [6.9 3.2 5.7 2.3]
 [5.6 2.8 4.9 2. ]
 [7.7 2.8 6.7 2. ]
 [6.3 2.7 4.9 1.8]
 [6.7 3.3 5.7 2.1]
 [7.2 3.2 6.  1.8]
 [6.2 2.8 4.8 1.8]
 [6.1 3.  4.9 1.8]
 [6.4 2.8 5.6 2.1]
 [7.2 3.  5.8 1.6]
 [7.4 2.8 6.1 1.9]
 [7.9 3.8 6.4 2. ]
 [6.4 2.8 5.6 2.2]
 [6.3 2.8 5.1 1.5]
 [6.1 2.6 5.6 1.4]
 [7.7 3.  6.1 2.3]
 [6.3 3.4 5.6 2.4]
 [6.4 3.1 5.5 1.8]
 [6.  3.  4.8 1.8]
 [6.9 3.1 5.4 2.1]
 [6.7 3.1 5.6 2.4]
 [6.9 3.1 5.1 2.3]
 [5.8 2.7 5.1 1.9]
 [6.8 3.2 5.9 2.3]
 [6.7 3.3 5.7 2.5]
 [6.7 3.  5.2 2.3]
 [6.3 2.5 5.  1.9]
 [6.5 3.  5.2 2. ]
 [6.2 3.4 5.4 2.3]
 [5.9 3.  5.1 1.8]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
[2]

6. k-means

k-means is an unsupervised learning algorithm that provides a solution to clustering problems.

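The k-means algorithm partitions n points (each a single observation or instance) into k clusters so that every point belongs to the cluster whose mean (the cluster center) is nearest. Assigning points and updating the centers are repeated until the centers stop changing.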

import numpy as np
import matplotlib.pyplot as plt
# Distance function from scipy (Euclidean by default)
from scipy.spatial.distance import cdist
# Generate clustering data directly with sklearn (public import path)
from sklearn.datasets import make_blobs
 
 
# -------------1. Load the data---------
x, y = make_blobs(n_samples=100, centers=6, random_state=1234, cluster_std=0.6)
 
#plt.figure(figsize=(6, 6))
#plt.scatter(x[:, 0], x[:, 1], c=y)
#plt.show()
 
# --------------2. Implementation--------------
class K_Means(object):
    # Parameters: n_clusters (K), number of iterations max_iter, initial centroids
    def __init__(self, n_clusters=5, max_iter=300, centroids=[]):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        # np.float was removed from recent NumPy releases; the builtin float is equivalent
        self.centroids = np.array(centroids, dtype=float)

    # Train the model: run the k-means clustering procedure on the raw data
    def fit(self, data):
        # If no initial centroids were given, pick random points from data
        if (self.centroids.shape == (0,)):
            # Draw n_clusters random row indices in [0, number of rows)
            self.centroids = data[np.random.randint(0, data.shape[0], self.n_clusters), :]

        # Iterate
        for _ in range(self.max_iter):
            # 1. Distance matrix of shape (n_samples, n_clusters)
            distances = cdist(data, self.centroids)

            # 2. Assign each point to its nearest centroid
            c_ind = np.argmin(distances, axis=1)

            # 3. Update each centroid to the mean of the points assigned to it
            for i in range(self.n_clusters):
                # Skip clusters that received no points
                if i in c_ind:
                    # The mean of all points in cluster i becomes the i-th centroid
                    self.centroids[i] = np.mean(data[c_ind == i], axis=0)

    # Prediction
    def predict(self, samples):
        # As above: compute the distance matrix, then pick the nearest centroid
        distances = cdist(samples, self.centroids)
        c_ind = np.argmin(distances, axis=1)

        return c_ind
 
 
dist = np.array([[121, 221, 32, 43],
                 [121, 1, 12, 23],
                 [65, 21, 2, 43],
                 [1, 221, 32, 43],
                 [21, 11, 22, 3], ])
c_ind = np.argmin(dist, axis=1)
print(c_ind)
x_new = x[0:5]
print(x_new)
print(c_ind == 2)
print(x_new[c_ind == 2])
np.mean(x_new[c_ind == 2], axis=0)
 
# --------------3. Test------------
# Helper that draws one subplot
def plotKMeans(x, y, centroids, subplot, title):
    # Select the subplot; 121 means the first cell of a 1-row, 2-column grid
    plt.subplot(subplot)
    plt.scatter(x[:, 0], x[:, 1], c='cyan')
    # Draw the centroids, one color per centroid
    plt.scatter(centroids[:, 0], centroids[:, 1], c=np.arange(len(centroids)), s=100)
    plt.title(title)
 
kmeans = K_Means(max_iter=300, centroids=[[2, 1], [2, 2], [2, 3], [2, 4], [2, 5]])
 
plt.figure(figsize=(16, 6))
plotKMeans(x, y, kmeans.centroids, 121, 'start')
 
# Run the clustering
kmeans.fit(x)
 
plotKMeans(x, y, kmeans.centroids, 122, 'k-means')
 
# Predict the cluster of new data points
x_new = np.array([[0, 0], [10, 7]])
y_pred = kmeans.predict(x_new)
 
print(kmeans.centroids)
print(y_pred)
 
plt.scatter(x_new[:, 0], x_new[:, 1], s=100, c='black')
plt.show()
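
The class above always runs max_iter iterations. As a variation, here is a minimal sketch that stops as soon as the centroids no longer move, which matches the "repeat until the centers stop changing" description. The function name kmeans_until_stable and the toy data are my own assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kmeans_until_stable(data, centroids, max_iter=300):
    """Run k-means updates until the centroids stop changing (or max_iter is hit)."""
    centroids = np.array(centroids, dtype=float)
    for n_iter in range(1, max_iter + 1):
        labels = np.argmin(cdist(data, centroids), axis=1)  # nearest centroid per point
        new_centroids = centroids.copy()
        for k in range(len(centroids)):
            if np.any(labels == k):  # leave empty clusters where they are
                new_centroids[k] = data[labels == k].mean(axis=0)
        if np.allclose(new_centroids, centroids):  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, labels, n_iter

# Hypothetical toy data: two well-separated groups
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])
centroids, labels, n_iter = kmeans_until_stable(data, [[0.0, 0.0], [10.0, 10.0]])
print(centroids, labels, n_iter)
```

On this toy data the loop exits after the second pass, because the first update already places each centroid at the mean of its group.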

7. Random forest

Random forest is a supervised ensemble learning algorithm built on decision trees.

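In short, where a decision tree is a single tree, a random forest builds many decision trees and combines them. The combination is by voting: instead of taking the result of one tree, the forest takes the result that the majority of the trees agree on.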
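
The voting idea can be sketched by hand with a few bootstrap-trained trees. This is a simplified illustration under my own assumptions (25 trees, iris data, hard majority vote); as far as I know, scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities instead of counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
rng = np.random.default_rng(0)

# Train each tree on its own bootstrap sample (rows drawn with replacement)
trees = []
for seed in range(25):
    rows = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X[rows], y[rows]))

# Every tree votes; the most common label per sample wins
all_votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
majority = np.array([np.bincount(col).argmax() for col in all_votes.T])
print("training accuracy of the vote:", np.mean(majority == y))
```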

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X,y=make_classification()
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)
estimator=RandomForestClassifier(oob_score=True,random_state=1)
estimator.fit(X_train,y_train)
print(estimator.oob_score_)

"""
Tune the outer bagging framework, i.e. the n_estimators parameter; all other parameters keep their defaults
"""
param_test1={'n_estimators':range(1,101,10)}
grid_search=GridSearchCV(estimator=RandomForestClassifier(random_state=1),param_grid=param_test1,scoring='roc_auc',cv=10)
grid_search.fit(X_train,y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)

"""
Tune the trees' max_features parameter, holding the other parameters fixed and using the best n_estimators found above
"""
param_test2={'max_features':range(1,21,1)}
grid_search_1=GridSearchCV(estimator=RandomForestClassifier(n_estimators=grid_search.best_params_['n_estimators'],random_state=1),
                           param_grid=param_test2,scoring='roc_auc',cv=10)
grid_search_1.fit(X_train,y_train)
print(grid_search_1.best_params_)
print(grid_search_1.best_score_)

"""
Retrain with the best parameters and estimate the generalization performance with the out-of-bag score
"""
rfl=RandomForestClassifier(n_estimators=grid_search.best_params_['n_estimators'],max_features=grid_search_1.best_params_['max_features'],
                           oob_score=True,random_state=1)
rfl.fit(X_train,y_train)
print(rfl.oob_score_)

8. Naive Bayes

Reference article

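The naive Bayes algorithm is based on Bayes' theorem from probability theory and is widely used, for example in text classification, spam filtering and medical diagnosis. Naive Bayes suits settings where the features are roughly independent of one another, for example predicting a flower's species from its petal length and width. "Naive" refers to the assumption that the features are conditionally independent given the class.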

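A concept closely related to naive Bayes is maximum likelihood estimation (MLE), which is commonly used to estimate a model's parameters from data. For example, to model the height of a population, it is hard to find the manpower and resources to measure everyone in the country, but we can sample part of the population, measure their heights, and use maximum likelihood estimation to obtain the mean and variance of the distribution.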
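
The height example can be sketched as follows. The "true" mean of 170 cm and standard deviation of 8 cm are hypothetical values used only to simulate a sample; for a normal distribution, the maximum likelihood estimates are the sample mean and the divide-by-n sample variance.

```python
import numpy as np

# Hypothetical population parameters, used only to simulate a sample of heights (cm)
rng = np.random.default_rng(42)
sample = rng.normal(loc=170.0, scale=8.0, size=2000)

# For a normal distribution, the maximum likelihood estimates are
# the sample mean and the (biased, divide-by-n) sample variance
mu_hat = sample.mean()
var_hat = np.mean((sample - mu_hat) ** 2)
print(mu_hat, var_hat)
```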

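The link above gives a detailed explanation; this section only shows how to use Bernoulli naive Bayes to predict whether it will rain, where 0 means no rain and 1 means rain.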

import numpy as np
x = np.array([[0,1,0,1],[1,1,1,1],[1,1,1,0],[0,1,1,0],[0,1,0,0],[0,1,0,1],
              [1,1,0,1],[1,0,0,1],[1,1,0,1],[0,0,0,0]])
y = np.array([1,1,1,1,0,1,0,1,1,0])
 
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
bnb.fit(x,y)
day_pre=[[0,0,1,0]]
pre = bnb.predict(day_pre)
print("Prediction result:\n", '*'*50)
print('Result:', pre)
print('*'*50)
 
# Inspect the predicted probabilities
pre_pro = bnb.predict_proba(day_pre)
print("Probability of no rain:", pre_pro[0][0], "\nProbability of rain:", pre_pro[0][1])

9. Dimensionality reduction

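In machine learning and statistics, dimensionality reduction is the process of reducing the number of random variables under given constraints to obtain a set of "uncorrelated" principal variables. It can be subdivided into two approaches: feature selection and feature extraction.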

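Some datasets contain many variables that are hard to handle. Especially when data collection resources are plentiful, the data in a system can be extremely detailed. A dataset may then contain thousands of variables, most of which may be unnecessary, and it becomes almost impossible to pick out by hand the variables with the greatest influence on the prediction. This is where dimensionality reduction comes in. The process may itself rely on other algorithms, for example using a random forest or a decision tree to identify the most important variables.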
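
As a rough sketch of using a random forest to rank variables, the example below fits a forest on synthetic data and sorts the features by feature_importances_. The data shape (10 features, 3 informative) is an assumption made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, of which only the first 3 are informative
# (shuffle=False keeps the informative features in the first columns)
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features from most to least important
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
print(ranking)
print(importances.round(3))
```

Features near the front of the ranking are candidates to keep; the rest can be considered for removal.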

Reference link on dimensionality reduction

Listed here is only the PCA code I wrote earlier, which uses the Alibaba industrial steam prediction dataset:

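# Use PCA to remove multicollinearity from the data and reduce its dimensionality
import pandas as pd
from sklearn.decomposition import PCA
# Keep 90% of the information (explained variance)
pca=PCA(n_components=0.9)

# train_data_scaler and test_data_scaler are the scaled DataFrames from earlier preprocessing (not shown here)
new_train_pca_90=pca.fit_transform(train_data_scaler.iloc[:,0:-1])  # exclude the last column, which is the target
new_test_pca_90=pca.transform(test_data_scaler)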

new_train_pca_90=pd.DataFrame(new_train_pca_90)
new_test_pca_90=pd.DataFrame(new_test_pca_90)
new_train_pca_90


new_train_pca_90['target']=train_data_scaler['target']
new_train_pca_90.describe()
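
Since the snippet above depends on DataFrames that are not shown, here is a self-contained sketch of the same idea on synthetic data. The data (10 columns driven by 3 hidden factors) is purely an assumption standing in for a real scaled feature table.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a scaled feature table: 10 columns driven by 3 hidden factors
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
features = factors @ mixing + 0.1 * rng.normal(size=(200, 10))
features = StandardScaler().fit_transform(features)

pca = PCA(n_components=0.9)  # keep enough components to explain at least 90% of the variance
reduced = pca.fit_transform(features)
print(reduced.shape)
print(pca.explained_variance_ratio_.sum())
```

Passing a float between 0 and 1 as n_components lets PCA choose the number of components from the explained variance, so the output width depends on the data.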