Python Data Analysis: Data Cleaning and Feature Processing

First, import the required packages and load the data.

import numpy as np
import pandas as pd

df=pd.read_csv('train.csv')
df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

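Data cleaning

1. Observing and handling missing values

info() shows the non-null count of each column, and isnull().sum() gives the number of missing values per column.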

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The results above show that the Age, Cabin, and Embarked columns all contain missing values. head() gives a quick look at these three columns.

df[['Age','Cabin','Embarked']].head(5)
   Age   Cabin  Embarked
0  22.0  NaN    S
1  38.0  C85    C
2  26.0  NaN    S
3  35.0  C123   S
4  35.0  NaN    S
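
As a small complement to the counts above, the sketch below also computes the fraction of missing values per column, which helps when deciding whether a column is worth filling at all. The `toy` frame and its values are made up for illustration; they are not rows of train.csv.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for three Titanic columns (values are invented)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "Q"],
})

missing_count = toy.isnull().sum()    # number of missing values per column
missing_ratio = toy.isnull().mean()   # fraction of missing values per column

summary = pd.DataFrame({"count": missing_count, "ratio": missing_ratio})
print(summary)
```

The mean of a boolean mask is the share of True values, so `isnull().mean()` gives the missing ratio directly.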

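Missing values can be handled either by filtering them out or by filling them in. Filtering mainly means dropping the rows or columns that contain missing values. Filling can use a constant, interpolation, or the median or mean.

df.dropna().head(3)  # drop the rows that contain missing values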
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
6  7            0         1       McCarthy, Mr. Timothy J                            male    54.0  0      0      17463             51.8625  E46    S
df.fillna(0).head()  # fill all missing values with 0
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   0      S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   0      S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   0      S
data=df.fillna({'Age':0})
data.head()

   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
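
The text mentions median, mean, and interpolation filling, while the calls above only fill with a constant. The following is a minimal sketch of the other three options on a made-up `Age` column (the values are invented, not taken from train.csv); which one suits the real data is a judgment call.

```python
import numpy as np
import pandas as pd

# Made-up ages with two gaps
toy = pd.DataFrame({"Age": [20.0, np.nan, 30.0, np.nan, 70.0]})

age_median = toy["Age"].median()   # NaN is skipped when computing the median
age_mean = toy["Age"].mean()       # NaN is skipped when computing the mean

filled_median = toy["Age"].fillna(age_median)
filled_mean = toy["Age"].fillna(age_mean)
filled_interp = toy["Age"].interpolate()   # linear interpolation between neighbours

print(pd.DataFrame({
    "median": filled_median,
    "mean": filled_mean,
    "interpolate": filled_interp,
}))
```

Interpolation depends on row order, so it is usually more meaningful for ordered data such as time series than for a passenger list.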

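Parameters of fillna:
value: a scalar or dict-like object used to fill missing values
method: fill method, 'ffill' (forward) or 'bfill' (backward); the default is None, and recent pandas versions deprecate this parameter in favor of the ffill() and bfill() methods
axis: the axis to fill along, default axis=0
inplace: modify the calling object in place instead of returning a new copy
limit: for forward or backward filling, the maximum number of consecutive missing values to fill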
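
To make the `limit` and dict-valued `value` parameters concrete, here is a small sketch on invented data. It uses the `ffill()` and `bfill()` methods rather than `fillna(method=...)`; the fill value "Unknown" is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill, but fill at most one consecutive missing value
forward = s.ffill(limit=1)
# Backward fill with the same limit
backward = s.bfill(limit=1)

# A dict fills each column with its own value
toy = pd.DataFrame({"Age": [22.0, np.nan], "Cabin": [np.nan, "C85"]})
by_column = toy.fillna({"Age": 0, "Cabin": "Unknown"})

print(forward.tolist())
print(backward.tolist())
print(by_column)
```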

2. Observing and handling duplicates

Duplicated rows are mainly inspected with the duplicated() function.

df[df.duplicated()]
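PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked

The result is empty, so this dataset has no fully duplicated rows. Duplicates are mainly handled by removing them.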

df.drop_duplicates().head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
data.to_csv('test_clear.csv')
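
Because the Titanic data has no duplicated rows, the calls above do not show much. The sketch below uses a tiny invented frame to illustrate how `duplicated()` and `drop_duplicates()` behave, including the `subset` and `keep` parameters; the names and tickets are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Ann"],
    "Ticket": ["T1", "T2", "T1", "T3"],
})

# A row is flagged when every column matches an earlier row
dup_mask = toy.duplicated()
# Only compare the Name column
dup_by_name = toy.duplicated(subset=["Name"])

# Drop fully duplicated rows, keeping the first occurrence
deduped = toy.drop_duplicates()
# Keep only the last row for each Name
last_per_name = toy.drop_duplicates(subset=["Name"], keep="last")

print(toy[dup_mask])
print(deduped)
print(last_per_name)
```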

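Feature observation and processing

Features can be divided into numeric features and text features, and numeric features can be further divided into discrete and continuous ones.

1. Discretization and binning

Binning discretizes continuous data: a continuous variable is placed into different "bins" according to some criterion.
Three ways of binning Age are shown here: splitting Age into 5 equal-width age bands labeled with the categories 1 to 5, binning by predefined intervals, and binning by percentiles.

df['AgeBand']=pd.cut(df['Age'],5,labels=['1','2','3','4','5'])  # split Age into 5 equal-width bins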
df.head()

   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         2
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         3
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         2
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S         3
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S         3
df.to_csv('test_ave.csv')
# Split Age into the 5 intervals (0,5], (5,15], (15,30], (30,50], (50,80]; pd.cut closes intervals on the right by default
df['AgeBand']=pd.cut(df['Age'],[0,5,15,30,50,80],labels=['1','2','3','4','5'])
df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         3
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         4
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S         4
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S         4
df.to_csv('test_cut.csv')
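# Split Age into 5 classes at the 0%, 10%, 30%, 50%, 70%, and 90% quantiles; ages above the 90% quantile fall outside the last bin and become NaN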
df['AgeBand']=pd.qcut(df['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels=['1','2','3','4','5'])
df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         2
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         5
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S         4
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S         4
df.to_csv('test_pr.csv')
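
Two details of the binning calls above are easy to miss: which side of each interval is closed, and what happens to values beyond the last quantile. The sketch below checks both on small invented series (the ages and the 1 to 10 range are made up); it illustrates pd.cut and pd.qcut behaviour and is not a re-run of the Titanic results.

```python
import pandas as pd

labels = ['1', '2', '3', '4', '5']
edges = [0, 5, 15, 30, 50, 80]
ages = pd.Series([3.0, 5.0, 22.0, 38.0, 80.0])  # made-up ages

# Default right=True: intervals are (0,5], (5,15], ..., (50,80]
right_closed = pd.cut(ages, edges, labels=labels)
# right=False: intervals are [0,5), [5,15), ..., [50,80); 80 is left out
left_closed = pd.cut(ages, edges, right=False, labels=labels)

# Quantile edges that stop at 0.9 leave the top 10% unassigned (NaN)
values = pd.Series(range(1, 11), dtype=float)
quantile_bins = pd.qcut(values, [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=labels)

print(pd.DataFrame({"age": ages, "right_closed": right_closed, "left_closed": left_closed}))
print(quantile_bins.tolist())
```

If every value should receive a bin, one option is to extend the quantile list to end at 1 and supply one more label.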

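2. Converting text variables

The categories of a categorical text variable can be inspected with value_counts or unique.

# View the categories of Sex, Cabin, and Embarked with value_counts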
df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64
df['Cabin'].value_counts()
G6                 4
C23 C25 C27        4
B96 B98            4
F33                3
F2                 3
C22 C26            3
D                  3
E101               3
C125               2
B77                2
C83                2
B18                2
E33                2
F4                 2
C124               2
D35                2
B57 B59 B63 B66    2
D26                2
C78                2
B22                2
B58 B60            2
D33                2
C52                2
E25                2
C68                2
B28                2
E44                2
E121               2
B49                2
B51 B53 B55        2
                  ..
A19                1
D46                1
B71                1
E63                1
A14                1
B82 B84            1
C46                1
C85                1
D30                1
D49                1
D21                1
B78                1
B37                1
B73                1
A16                1
C91                1
B69                1
A6                 1
B30                1
F G63              1
E58                1
B101               1
C82                1
E46                1
E68                1
D6                 1
B102               1
B4                 1
T                  1
A34                1
Name: Cabin, Length: 147, dtype: int64
df['Embarked'].value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
# View the categories of Sex, Cabin, and Embarked with unique
df['Sex'].unique()
array(['male', 'female'], dtype=object)
df['Cabin'].unique()
array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
       'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
       'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
       'C148'], dtype=object)
df['Embarked'].unique()
array(['S', 'C', 'Q', nan], dtype=object)



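For convenience, categorical text is sometimes converted to numeric labels such as 1, 2, 3, 4, 5. The following shows how to do this with the replace function, the map function, and the LabelEncoder class from sklearn.preprocessing.
# replace function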
df['Sex_num']=df['Sex'].replace(['male','female'],[1,2])
df.head(3)
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand  Sex_num
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         2        1
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         5        2
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3        2
# map function
df['Sex_num']=df['Sex'].map({'male':1,'female':2})
df.head(3)
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand  Sex_num
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         2        1
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         5        2
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3        2
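
replace and map give the same result on Sex because both categories are covered by the mapping. They differ when a value is not covered, which the following sketch shows on an invented series (the 'unknown' entry is made up and does not occur in train.csv).

```python
import numpy as np
import pandas as pd

# object dtype keeps the example independent of pandas string-dtype defaults
sex = pd.Series(['male', 'female', 'unknown', np.nan], dtype=object)
mapping = {'male': 1, 'female': 2}

# map: values missing from the dict become NaN
mapped = sex.map(mapping)
# replace: values missing from the dict are left unchanged
replaced = sex.replace(mapping)

print(pd.DataFrame({"original": sex, "map": mapped, "replace": replaced}))
```

So map is the stricter choice when unexpected categories should surface as missing values, while replace silently passes them through.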
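from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin','Ticket']:
    lbl=LabelEncoder()  # encodes labels as integers between 0 and n_classes-1

    # Option 1: build a category -> integer dict by hand
    # nunique() returns the number of distinct non-null categories
    label_dict=dict(zip(df[feat].unique(),range(df[feat].nunique())))
    df[feat+"_labelEncode"]=df[feat].map(label_dict)

    # Option 2 (overwrites the column written by option 1): let LabelEncoder assign the codes
    # astype(str) converts the values to strings, so NaN is encoded as the label 'nan'
    df[feat+"_labelEncode"]=lbl.fit_transform(df[feat].astype(str))
df.head(3)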
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand  Sex_num  Cabin_labelEncode  Ticket_labelEncode  Title
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         2        1        147                523                 Mr
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         5        2        81                 596                 Mrs
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3        2        147                669                 Miss
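
For comparison, pandas itself offers `pd.factorize`, which also turns categories into integer codes. This is a pandas-only alternative, not the LabelEncoder approach used above, and the small `cabin` series is invented for illustration. Unlike the `astype(str)` route, it marks missing values with the code -1 instead of giving them a label of their own.

```python
import numpy as np
import pandas as pd

cabin = pd.Series(['C85', np.nan, 'C123', 'C85'], dtype=object)  # made-up values

# Codes follow order of first appearance; NaN gets the sentinel -1
codes, uniques = pd.factorize(cabin)

# sort=True assigns codes in sorted order of the categories instead
sorted_codes, sorted_uniques = pd.factorize(cabin, sort=True)

print(codes, list(uniques))
print(sorted_codes, list(sorted_uniques))
```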

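One-hot encoding: one-hot encoding maps the values of a discrete feature into Euclidean space, so that each value of the discrete feature corresponds to a point in that space.

# Convert the features to one-hot encoding
for feat in ['Age','Embarked']:
    x=pd.get_dummies(df[feat],prefix=feat)
    df=pd.concat([df,x],axis=1)  # concat joins the new dummy columns onto df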
df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     ...  Age_65.0  Age_66.0  Age_70.0  Age_70.5  Age_71.0  Age_74.0  Age_80.0  Embarked_C  Embarked_Q  Embarked_S
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   ...  0         0         0         0         0         0         0         0           0           1
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  ...  0         0         0         0         0         0         0         1           0           0
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   ...  0         0         0         0         0         0         0         0           0           1
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  ...  0         0         0         0         0         0         0         0           0           1
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   ...  0         0         0         0         0         0         0         0           0           1

5 rows × 108 columns
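
The wide output above makes it hard to see what get_dummies does to a single column. Here is a minimal sketch on an invented `Embarked` column with one missing value; `dtype=int` is passed only so the result prints as 0/1 regardless of pandas version.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q"]})  # made-up values

# One indicator column per category; a missing value yields an all-zero row
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

# dummy_na=True adds one extra indicator column for missing values
with_na = pd.get_dummies(toy["Embarked"], prefix="Embarked", dummy_na=True, dtype=int)

# Join the indicator columns back onto the original frame
joined = pd.concat([toy, dummies], axis=1)
print(joined)
```

Applied to a continuous column such as Age, the same call creates one column per distinct value, which is why the frame above grows to 108 columns; one-hot encoding the binned AgeBand column would keep the result much narrower.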

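3. Extracting features from plain text

# Extract the title (Mr, Miss, Mrs, ...) from Name with a regular expression
df['Title']=df.Name.str.extract(r'([A-Za-z]+)\.',expand=False)  # expand=False returns the extracted titles as a single column (Series)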
df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand  Sex_num  Cabin_labelEncode  Ticket_labelEncode  Title
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         2        1        147                523                 Mr
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         5        2        81                 596                 Mrs
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3        2        147                669                 Miss
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S         4        2        55                 49                  Mrs
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S         4        1        147                472                 Mr
df.to_csv('test_fin.csv')
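
To see how the regular expression picks out the title, the sketch below applies it to three names copied from the head() output above. The pattern captures the first run of letters that is immediately followed by a period, which in these names is the title after the surname.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# ([A-Za-z]+) captures letters only; \. requires a literal period right after them
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)

print(titles.tolist())
print(titles.value_counts())
```

With expand=False and a single capture group the result is a Series, so it can be assigned to a new column directly.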
