机器学习之数据预处理

最新推荐文章于 2024-08-02 13:43:48 发布

若云流风

最新推荐文章于 2024-08-02 13:43:48 发布

阅读量717

点赞数 1

分类专栏：机器学习利用python进行数据分析文章标签：机器学习数据分析数据预处理特征缩放

本文链接：https://blog.csdn.net/ruoyunliufeng/article/details/79943915

版权

利用python进行数据分析同时被 2 个专栏收录

45 篇文章 51 订阅

订阅专栏

机器学习

36 篇文章 1 订阅

订阅专栏

一、导入标准库

    In [1]: 
  

# Importing the libraries 导入库
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

二、导入数据

    In [24]: 
  

# Importing the dataset 导入数据
dataset = pd.read_csv('./Data.csv')

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
dataset

      Out[24]: 
    

	Country	Age	Salary	Purchased
0	France	44.0	72000.0	No
1	Spain	27.0	48000.0	Yes
2	Germany	30.0	54000.0	No
3	Spain	38.0	61000.0	No
4	Germany	40.0	NaN	Yes
5	France	35.0	58000.0	Yes
6	Spain	NaN	52000.0	No
7	France	48.0	79000.0	Yes
8	Germany	50.0	83000.0	No
9	France	37.0	67000.0	Yes

    In [25]: 
  

      Out[25]: 
    

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

    In [26]: 
  

      Out[26]: 
    

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)

三、处理缺失数据

    In [27]: 
  

from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0) # 将na用平均值代替
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
X

      Out[27]: 
    

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

四、数据转换

    In [28]: 
  

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0]) # 将国家变成数字
X

      Out[28]: 
    

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

    In [29]: 
  

onehotencoder = OneHotEncoder(categorical_features = [0]) # 将国家数字变成虚拟编码
X = onehotencoder.fit_transform(X).toarray()
X

      Out[29]: 
    

array([[  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          4.40000000e+01,   7.20000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          2.70000000e+01,   4.80000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          3.00000000e+01,   5.40000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          3.80000000e+01,   6.10000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          4.00000000e+01,   6.37777778e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          3.50000000e+01,   5.80000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          3.87777778e+01,   5.20000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          4.80000000e+01,   7.90000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          5.00000000e+01,   8.30000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          3.70000000e+01,   6.70000000e+04]])

    In [30]: 
  

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y) # 将是否购买变成数字
y

      Out[30]: 
    

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

五、区分训练集和测试集

    In [31]: 
  

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state = 0)

六、特征缩放

    In [33]: 
  

X_test

      Out[33]: 
    

array([[  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          3.00000000e+01,   5.40000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          5.00000000e+01,   8.30000000e+04]])

    In [34]: 
  

from sklearn.preprocessing import StandardScaler  # 导入标准化库
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
X_test

      Out[34]: 
    

array([[-1.        ,  2.64575131, -0.77459667, -1.45882927, -0.90166297],
       [-1.        ,  2.64575131, -0.77459667,  1.98496442,  2.13981082]])

    In [35]: 
  

      Out[35]: 
    

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

git项目地址:

git连接

若云流风

关注

1
点赞
踩
4

收藏

觉得还不错? 一键收藏
打赏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录