data_preprocess - CSDN Blog

Article link: https://blog.csdn.net/qq_43460068/article/details/127600618

Data Preprocessing

Reading the Dataset

import os

os.makedirs(os.path.join('.','data'),exist_ok=True)
data_file = os.path.join('.','data','house_tiny.csv')
with open(data_file,'w') as f:
    f.write('NumRoos,Alley,Price\n')  # column names
    f.write('NA,Pave,127500\n')  # each row is one sample
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')

import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRoos Alley   Price
0      NaN  Pave  127500
1      2.0   NaN  106000
2      4.0   NaN  178100
3      NaN   NaN  140000
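The printed frame shows NaN wherever the file contained the literal text NA. That follows from the default behavior of pd.read_csv, which treats a set of strings including "NA" as missing-value markers. A minimal sketch of the difference, assuming the same CSV content is read from an in-memory string (the variable names here are illustrative, not from the original post):

```python
import io

import pandas as pd

# Same content as house_tiny.csv, held in memory instead of on disk.
csv_text = (
    "NumRoos,Alley,Price\n"
    "NA,Pave,127500\n"
    "2,NA,106000\n"
    "4,NA,178100\n"
    "NA,NA,140000\n"
)

# Default parsing: the string "NA" is converted to NaN.
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False turns the built-in markers off, so "NA" stays a string.
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

missing_default = int(default_df.isna().sum().sum())
missing_raw = int(raw_df.isna().sum().sum())
print(missing_default, missing_raw)
```

With the markers switched off, the NumRoos column is read as text rather than numbers, which is why the default is usually the convenient choice for files like this one.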

Handling Missing Values

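Note that "NaN" entries are missing values. Typical ways to handle missing data are imputation and deletion: imputation replaces a missing value with a substitute, while deletion discards the rows or columns that contain it. Using integer-location indexing (iloc), we split data into inputs and outputs, where the former takes the first two columns of data and the latter takes the last column. For numeric values missing from inputs, we replace the "NaN" entries with the mean of the same column.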

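inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
# numeric_only=True averages only the numeric column; recent pandas versions
# raise an error when asked to average the string column Alley
inputs = inputs.fillna(inputs.mean(numeric_only=True))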
print(inputs)

   NumRoos Alley
0      3.0  Pave
1      2.0   NaN
2      4.0   NaN
3      3.0   NaN
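The text names deletion as the other common strategy but only demonstrates imputation. As a rough sketch of what deletion could look like on the same data, the example below rebuilds the frame by hand and uses DataFrame.dropna; the names rows_kept and cols_kept are illustrative only.

```python
import numpy as np
import pandas as pd

# The same four samples as house_tiny.csv, built directly as a DataFrame.
data = pd.DataFrame({
    "NumRoos": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

# Deletion by row: keep only the samples whose NumRoos value is present.
rows_kept = data.dropna(subset=["NumRoos"])

# Deletion by column: drop every column that contains at least one NaN.
cols_kept = data.dropna(axis=1)

print(rows_kept)
print(cols_kept)
```

On a dataset this small, deletion throws away half of the samples or two of the three columns, which illustrates why imputation is used in the walkthrough.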

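For categorical or discrete values in inputs, we treat "NaN" as a category of its own. Because the "Alley" column takes only two kinds of values, "Pave" and "NaN", pandas can automatically convert it into two columns, "Alley_Pave" and "Alley_nan". A row whose alley type is "Pave" gets "Alley_Pave" set to 1 and "Alley_nan" set to 0. A row with a missing alley type gets "Alley_Pave" and "Alley_nan" set to 0 and 1 respectively.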

# dtype=int keeps the 0/1 indicators; recent pandas versions default to True/False
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(inputs)

   NumRoos  Alley_Pave  Alley_nan
0      3.0           1          0
1      2.0           0          1
2      4.0           0          1
3      3.0           0          1
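To make the encoding less of a black box, the two indicator columns can also be built by hand. This is only a sketch of the idea for the Alley column, not how pandas implements get_dummies internally; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# The Alley column from the example: one known category and three missing entries.
alley = pd.Series(["Pave", np.nan, np.nan, np.nan], name="Alley")

manual = pd.DataFrame({
    # 1 where the value equals "Pave"; a comparison against NaN is False, so 0 there.
    "Alley_Pave": (alley == "Pave").astype(int),
    # 1 where the value is missing, which is what dummy_na=True adds.
    "Alley_nan": alley.isna().astype(int),
})

print(manual)
```

Each row has exactly one indicator set to 1, which is the defining property of a one-hot encoding.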

Converting to Tensor Format

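Now that every entry in inputs and outputs is numeric, both can be converted to tensor format. Once the data is in tensor format, it can be manipulated further with the tensor functions introduced in the sec_ndarray section.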

import paddle
X,y = paddle.to_tensor(inputs.values),paddle.to_tensor(outputs.values)
X,y

(Tensor(shape=[4, 3], dtype=float64, place=Place(gpu:0), stop_gradient=True,
        [[3., 1., 0.],
         [2., 0., 1.],
         [4., 0., 1.],
         [3., 0., 1.]]),
 Tensor(shape=[4], dtype=int64, place=Place(gpu:0), stop_gradient=True,
        [127500, 106000, 178100, 140000]))
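The tensors above inherit float64 and int64 from the underlying NumPy arrays. If single precision is preferred, one option is to cast while converting to NumPy and then pass the result to paddle.to_tensor. The sketch below covers only the NumPy step so that it runs without paddle installed; choosing float32 is an assumption here, not something the original post does.

```python
import numpy as np
import pandas as pd

# The preprocessed inputs and outputs from above, rebuilt by hand.
inputs = pd.DataFrame({
    "NumRoos": [3.0, 2.0, 4.0, 3.0],
    "Alley_Pave": [1, 0, 0, 0],
    "Alley_nan": [0, 1, 1, 1],
})
outputs = pd.Series([127500, 106000, 178100, 140000], name="Price")

# Cast to float32 during the conversion to NumPy arrays.
X_np = inputs.to_numpy(dtype=np.float32)
y_np = outputs.to_numpy(dtype=np.float32)

print(X_np.shape, X_np.dtype)
print(y_np.shape, y_np.dtype)
```

Arrays prepared this way can be handed to paddle.to_tensor in place of inputs.values and outputs.values.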

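Summary

Like many other extension packages in the vast Python ecosystem, pandas works together with tensors.
Imputation and deletion can be used to handle missing data.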

Exercises

Delete the column with the most missing values.
Convert the preprocessed dataset to tensor format.

data_test_file = os.path.join('.', 'data', 'test_file.csv')
with open(data_test_file, 'w', encoding='utf-8') as f:
    f.write('小狗,小猫,价格\n')  # column names: dog, cat, price
    f.write('哈士奇,橘猫,1000\n')
    f.write('柴犬,蓝猫,800\n')
    f.write('柯基,NA,500\n')  # NA marks a missing value
    f.write('NA,加菲猫,1200\n')
    f.write('NA,英短,1500\n')

test_data = pd.read_csv(data_test_file)
print(test_data)

    小狗   小猫    价格
0  哈士奇   橘猫  1000
1   柴犬   蓝猫   800
2   柯基  NaN   500
3  NaN  加菲猫  1200
4  NaN   英短  1500

# Count the NaN values in each column
nan_number = test_data.isnull().sum(axis=0)  # axis=0 sums down the rows, giving one count per column
print(nan_number)

小狗    2
小猫    1
价格    0
dtype: int64

# Find the label of the column with the largest NaN count
nan_max_id = nan_number.idxmax()
print(nan_max_id)

小狗

# Drop the column with the most NaN values
test_data = test_data.drop([nan_max_id], axis=1)  # axis=1 means the label refers to a column
print(test_data)

    小猫    价格
0   橘猫  1000
1   蓝猫   800
2  NaN   500
3  加菲猫  1200
4   英短  1500
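The three steps above (count, locate, drop) can be folded into one reusable helper. The function below is a sketch under a few assumptions: the name drop_most_missing_column is hypothetical, ties go to the first such column because that is how idxmax behaves, and a frame without any missing values is returned unchanged.

```python
import numpy as np
import pandas as pd


def drop_most_missing_column(df):
    """Return a copy of df without the column that has the most NaN values."""
    nan_counts = df.isna().sum()      # one NaN count per column
    if nan_counts.max() == 0:         # nothing is missing, so nothing to drop
        return df.copy()
    return df.drop(columns=nan_counts.idxmax())


# Small demonstration frame: column a has two NaN, b has one, c has none.
demo = pd.DataFrame({
    "a": [np.nan, np.nan, 1.0],
    "b": [np.nan, 2.0, 3.0],
    "c": [1, 2, 3],
})

result = drop_most_missing_column(demo)
print(result.columns.tolist())
```

Because drop returns a new frame, the original demo frame keeps all three columns.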

inputs,outputs = test_data.iloc[:,0],test_data.iloc[:,1]
print(inputs)
print(outputs)

0     橘猫
1     蓝猫
2    NaN
3    加菲猫
4     英短
Name: 小猫, dtype: object
0    1000
1     800
2     500
3    1200
4    1500
Name: 价格, dtype: int64

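# dtype='uint8' keeps the 0/1 indicators; recent pandas versions default to True/False
inputs = pd.get_dummies(inputs, dummy_na=True, dtype='uint8')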
print(inputs)

   加菲猫  橘猫  英短  蓝猫  NaN
0    0   1   0   0    0
1    0   0   0   1    0
2    0   0   0   0    1
3    1   0   0   0    0
4    0   0   1   0    0

# Convert to tensor format
X,y = paddle.to_tensor(inputs.values),paddle.to_tensor(outputs.values)
X,y

(Tensor(shape=[5, 5], dtype=uint8, place=Place(gpu:0), stop_gradient=True,
        [[0, 1, 0, 0, 0],
         [0, 0, 0, 1, 0],
         [0, 0, 0, 0, 1],
         [1, 0, 0, 0, 0],
         [0, 0, 1, 0, 0]]),
 Tensor(shape=[5], dtype=int64, place=Place(gpu:0), stop_gradient=True,
        [1000, 800 , 500 , 1200, 1500]))