Task02: Data Cleaning and Feature Processing

This article is based on Datawhale's open-source learning materials.

# Load the required libraries
import numpy as np
import pandas as pd

# Load the data from train.csv
df = pd.read_csv('train.csv')
df.head(3)
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S

2.1 Observing and Handling Missing Values

2.1.1 Task 1: Observing missing values

(1) Count the missing values of each feature

# Method 1
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
# Method 2
df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
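
A small extension (not in the original notebook): the counts above can also be expressed as per-column missing ratios, which makes the severity easier to compare across features.

# Share of missing values per column, derived from the counts above
(df.isnull().sum() / len(df)).sort_values(ascending=False).head(3)
# Cabin       0.771044
# Age         0.198653
# Embarked    0.002245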

(2) Inspect the data in the Age, Cabin, and Embarked columns

df[['Age','Cabin','Embarked']].head(3)
   Age   Cabin  Embarked
0  22.0  NaN    S
1  38.0  C85    C
2  26.0  NaN    S
2.1.2 Task 2: Handling missing values

(1) Handle the missing values in the Age column

# Method 1 (does not work here: comparing a column with None matches no rows)
df[df['Age'] == None] = 0
df.head(3)
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
# Method 2 (finds the missing rows, but sets every column of those rows to 0, not just Age)
df[df['Age'].isnull()] = 0
df.head(3)
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
# Method 3 (does not work: NaN never compares equal to anything, not even np.nan)
df[df['Age'] == np.nan] = 0
df.head()
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                             male    35.0  0      0      373450            8.0500   NaN    S
# Method 4 (recommended: modifies only the Age column)
# df.loc[df['Age'].isnull(), 'Age'] = 0
df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          529
Embarked         2
dtype: int64
type(np.nan)
float
type(np.NaN)
float

【Thoughts】Which is better for detecting missing values: np.nan, None, or .isnull(), and why? If one of these approaches cannot find the missing values, what is the reason?

  • After a numeric column is read in, its missing values are stored as float64 NaN. Comparing with None therefore matches nothing, and because NaN does not compare equal to anything (not even np.nan), comparing with np.nan matches nothing either; .isnull() / .isna() is the reliable check, as the short demo below shows.
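
A minimal, self-contained sketch of the three approaches (the small Series here is purely illustrative):

s = pd.Series([22.0, np.nan, 26.0])
print((s == None).sum())    # 0 -- elementwise comparison with None matches nothing
print((s == np.nan).sum())  # 0 -- NaN != NaN, so this matches nothing either
print(s.isnull().sum())     # 1 -- isnull()/isna() reliably finds the missing value

# which is why the recommended fix above targets only the Age column:
# df.loc[df['Age'].isnull(), 'Age'] = 0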

(2) Handle the missing values of the whole table with different methods

# Method 1
df.dropna().head(3)
# (note: df.drop() removes rows/columns by label and does not handle missing values)
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket    Fare     Cabin  Embarked
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599  71.2833  C85    C
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35.0  1      0      113803    53.1000  C123   S
5  0            0         0       0                                                    0       0.0   0      0      0         0.0000   0      0
# Method 2
df.fillna(0).head(3)
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   0      S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   0      S

【Thoughts】What parameters do dropna and fillna take, and how are they used?

# DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Parameter  Purpose
axis       axis=0: drop rows that contain missing values; axis=1: drop columns that contain missing values
how        how='any': drop the row/column as soon as it contains any missing value; how='all': drop it only if all of its values are missing
thresh     keep the row/column only if it has at least thresh non-missing values
subset     restrict the search for missing values to the given set of columns (or rows)
inplace    whether to modify the original DataFrame in place

# DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

Parameter  Purpose
value      fill value: scalar, dict, Series, or DataFrame; a dict can specify a different fill value per column
method     {'backfill', 'bfill', 'pad', 'ffill', None}: propagate the previous ('pad'/'ffill') or next ('bfill'/'backfill') valid value
limit      maximum number of consecutive missing values to fill
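
A minimal sketch of how these parameters might be used (the column names, threshold, and fill values below are illustrative choices, not from the original notebook):

df_rows = df.dropna(subset=['Age', 'Embarked'])                          # drop rows missing Age or Embarked
df_thresh = df.dropna(thresh=10)                                         # keep rows with at least 10 non-missing values
df_filled = df.fillna({'Age': df['Age'].median(), 'Cabin': 'Unknown'})   # per-column fill values via a dict
df_ffill = df.fillna(method='ffill', limit=1)                            # propagate the previous valid value, at most once
# (newer pandas prefers df.ffill(limit=1) over fillna(method='ffill'))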

2.2 Observing and Handling Duplicate Values

2.2.1 Task 1: Inspect the duplicate values in the data
df.duplicated()
0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17      True
18     False
19      True
20     False
21     False
22     False
23     False
24     False
25     False
26      True
27     False
28      True
29      True
       ...  
861    False
862    False
863     True
864    False
865    False
866    False
867    False
868     True
869    False
870    False
871    False
872    False
873    False
874    False
875    False
876    False
877    False
878     True
879    False
880    False
881    False
882    False
883    False
884    False
885    False
886    False
887    False
888     True
889    False
890    False
Length: 891, dtype: bool
df[df.duplicated()]
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
17000000.00000.000
19000000.00000.000
26000000.00000.000
28000000.00000.000
29000000.00000.000
31000000.00000.000
32000000.00000.000
36000000.00000.000
42000000.00000.000
45000000.00000.000
46000000.00000.000
47000000.00000.000
48000000.00000.000
55000000.00000.000
64000000.00000.000
65000000.00000.000
76000000.00000.000
77000000.00000.000
82000000.00000.000
87000000.00000.000
95000000.00000.000
101000000.00000.000
107000000.00000.000
109000000.00000.000
121000000.00000.000
126000000.00000.000
128000000.00000.000
140000000.00000.000
154000000.00000.000
158000000.00000.000
.......................................
718000000.00000.000
727000000.00000.000
732000000.00000.000
738000000.00000.000
739000000.00000.000
740000000.00000.000
760000000.00000.000
766000000.00000.000
768000000.00000.000
773000000.00000.000
776000000.00000.000
778000000.00000.000
783000000.00000.000
790000000.00000.000
792000000.00000.000
793000000.00000.000
815000000.00000.000
825000000.00000.000
826000000.00000.000
828000000.00000.000
832000000.00000.000
837000000.00000.000
839000000.00000.000
846000000.00000.000
849000000.00000.000
859000000.00000.000
863000000.00000.000
868000000.00000.000
878000000.00000.000
888000000.00000.000

176 rows × 12 columns

2.2.2 Task 2: Handle the duplicate values

What are the ways to handle duplicate values?

# Method 1: boolean indexing with the inverted duplicate mask
isDuplicated = df.duplicated()
df[~isDuplicated].shape
(715, 12)
# Method 2: drop_duplicates
df = df.drop_duplicates()
print(df.shape)
df.head()
(715, 12)
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                             male    35.0  0      0      373450            8.0500   NaN    S
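drop_duplicates also accepts subset and keep parameters; a small sketch (the column choice below is illustrative, not from the original notebook):

# treat rows as duplicates when Ticket and Cabin match, and keep the last occurrence
df_dedup = df.drop_duplicates(subset=['Ticket', 'Cabin'], keep='last')
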
2.2.3 Task 3: Save the cleaned data in csv format
df.to_csv('test_clear.csv')
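
If the row index should not be written out as an extra column, to_csv accepts index=False (an optional tweak, not in the original notebook):

# df.to_csv('test_clear.csv', index=False)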

2.3 Observing and Processing Features

Looking at the features, we can roughly divide them into two broad classes:

  • Numeric features
    • Survived, Pclass, Age, SibSp, Parch, Fare; among these, Survived and Pclass are discrete numeric features,
    • while Age, SibSp, Parch, and Fare are continuous numeric features
  • Text features
    • Name, Sex, Cabin, Embarked, Ticket; among these, Sex, Cabin, Embarked, and Ticket are categorical text features
  • Numeric features can usually be fed to a model directly, but continuous variables are sometimes discretized to make the model more stable and robust
  • Text features generally need to be converted into numeric features before they can be used for modelling (a quick way to separate the two groups is sketched below)
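
A minimal sketch of that separation, assuming df is the Titanic DataFrame loaded above:

# select_dtypes splits the columns into the two groups described above
numeric_cols = df.select_dtypes(include='number').columns.tolist()
text_cols = df.select_dtypes(include='object').columns.tolist()
print(numeric_cols)  # PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
print(text_cols)     # Name, Sex, Ticket, Cabin, Embarked
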
2.3.1 Task 1: Bin (discretize) the Age variable

(1) What is binning?

  • Binning groups a range of continuous values into a smaller number of "bins"
  • Data mainly consists of continuous and categorical variables, and binning is applied mainly to continuous variables

(2) Bin the continuous variable Age into 5 equal-width age groups, labelled with the categorical values 1-5

# Split Age into 5 equal-width bins
df['AgeBand'] = pd.cut(df['Age'], 5, labels=[1, 2, 3, 4, 5])
df.head()
[df.head() output truncated: 5 rows × 110 columns. The displayed frame already includes the Age_*, Embarked_*, and Title columns created in later cells, indicating this output comes from a re-run of the notebook; the newly created AgeBand column is not among the displayed columns.]

df.to_csv('test_ave.csv')

(3) Bin the continuous variable Age into the five age groups (0,5], (5,15], (15,30], (30,50], (50,80], labelled with the categorical values 1-5

df['AgeBand'] = pd.cut(df['Age'], [0, 5, 15, 30, 50, 80], labels=[1, 2, 3, 4, 5])
df.head(3)
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand
0  1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S         3
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C         4
2  3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3
df.to_csv('test_cut.csv')

(4) Bin the continuous variable Age at the 10%, 30%, 50%, 70%, and 90% quantiles, labelling the groups with the categorical values 1-5

df['AgeBand'] = pd.qcut(df['Age'], [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=[1, 2, 3, 4, 5])
df.head()
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand
0  1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S         2
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C         5
2  3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35.0  1      0      113803            53.1000  C123   S         4
4  5            0         3       Allen, Mr. William Henry                             male    35.0  0      0      373450            8.0500   NaN    S         4
df.to_csv('test_pr.csv')
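
For reference, a minimal sketch of the difference between the two binning functions, using a small illustrative series (not from the dataset): pd.cut uses equal-width or explicitly given bin edges, while pd.qcut places the edges at quantiles so the bins hold roughly equal numbers of observations.

ages = pd.Series([2, 10, 21, 22, 25, 34, 40, 63, 80])
# cut: three bins of equal width over the value range
print(pd.cut(ages, 3).value_counts(sort=False))
# qcut: three bins containing roughly equal numbers of values
print(pd.qcut(ages, 3).value_counts(sort=False))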

【References】

  • https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
  • https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
2.3.2 Task 2: Convert the text variables

(1) Inspect the names and distinct values of the text variables

# Method 1: value_counts
df['Sex'].value_counts()
male      453
female    261
0           1
Name: Sex, dtype: int64
df['Cabin'].value_counts()
G6             4
C23 C25 C27    4
B96 B98        4
F33            3
D              3
F2             3
C22 C26        3
C78            2
C93            2
B49            2
E25            2
F G73          2
E8             2
E121           2
D35            2
D17            2
B35            2
B20            2
B22            2
F4             2
C83            2
E67            2
E44            2
C68            2
B51 B53 B55    2
D33            2
C123           2
B58 B60        2
C65            2
B28            2
              ..
C30            1
B94            1
C70            1
B86            1
D19            1
C87            1
C86            1
E38            1
E40            1
E68            1
C148           1
B19            1
E49            1
E31            1
B39            1
A16            1
B80            1
C101           1
D30            1
B37            1
C126           1
C92            1
E50            1
A31            1
C99            1
C82            1
D48            1
B30            1
C111           1
0              1
Name: Cabin, Length: 135, dtype: int64
df['Embarked'].value_counts()
S    554
C    130
Q     28
0      1
Name: Embarked, dtype: int64
# Method 2: unique / nunique
df['Sex'].unique()
array(['male', 'female', 0], dtype=object)
df['Sex'].nunique()
3

(2) Represent the text variables Sex, Cabin, and Embarked with the numeric values 1, 2, 3, 4, 5

# Method 1: replace
df['Sex_num'] = df['Sex'].replace(['male', 'female'], [1, 2])
df.head()
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand  Sex_num
0  1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S         2        1
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C         5        2
2  3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3        2
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35.0  1      0      113803            53.1000  C123   S         4        2
4  5            0         3       Allen, Mr. William Henry                             male    35.0  0      0      373450            8.0500   NaN    S         4        1
# Method 2: map
df['Sex_num'] = df['Sex'].map({'male': 1, 'female': 2})
df.head()
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand  Sex_num
0  1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S         2        1.0
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C         5        2.0
2  3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3        2.0
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35.0  1      0      113803            53.1000  C123   S         4        2.0
4  5            0         3       Allen, Mr. William Henry                             male    35.0  0      0      373450            8.0500   NaN    S         4        1.0
# Method 3: LabelEncoder from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin', 'Ticket']:
    lbl = LabelEncoder()
    # manual encoding: build a category -> integer dictionary and map it
    label_dict = dict(zip(df[feat].unique(), range(df[feat].nunique())))
    df[feat + "_labelEncode"] = df[feat].map(label_dict)
    # equivalent encoding with LabelEncoder (this overwrites the mapped column above)
    df[feat + "_labelEncode"] = lbl.fit_transform(df[feat].astype(str))

df.head()
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand  Sex_num  Cabin_labelEncode  Ticket_labelEncode
0  1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S         2        1.0      135                409
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C         5        2.0      74                 472
2  3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3        2.0      135                533
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35.0  1      0      113803            53.1000  C123   S         4        2.0      50                 41
4  5            0         3       Allen, Mr. William Henry                             male    35.0  0      0      373450            8.0500   NaN    S         4        1.0      135                374

(3) Represent the text variables Sex, Cabin, and Embarked with one-hot encoding

# Method 1: pd.get_dummies (here applied to the Age and Embarked columns)
for feat in ["Age", "Embarked"]:
    x = pd.get_dummies(df[feat], prefix=feat)
    df = pd.concat([df, x], axis=1)
df.head()
[df.head() output truncated: 5 rows × 203 columns. The one-hot columns Age_* and Embarked_0, Embarked_C, Embarked_Q, Embarked_S have been appended to the original columns.]
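
The comment above says "OneHotEncoder", but the code actually uses pd.get_dummies. A minimal sketch of the scikit-learn alternative, applied to the columns named in the task; this is an illustration rather than the original notebook's code, and it assumes scikit-learn >= 1.0 for get_feature_names_out:

from sklearn.preprocessing import OneHotEncoder

cols = ['Sex', 'Cabin', 'Embarked']
enc = OneHotEncoder(handle_unknown='ignore')
# replace NaN with a placeholder string so every cell is a category label
encoded = enc.fit_transform(df[cols].fillna('missing').astype(str)).toarray()
encoded_df = pd.DataFrame(encoded, columns=enc.get_feature_names_out(cols), index=df.index)
df_onehot = pd.concat([df, encoded_df], axis=1)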

2.3.3 Task 3: Extract the Title feature from the plain-text Name feature (the Titles are Mr, Miss, Mrs, and so on)
# Use a regular expression:
# capture a run of letters that is immediately followed by a '.'
df['Title'] = df.Name.str.extract('([A-Za-z]+)\.', expand=False)
df.head()
[df.head() output truncated: 5 rows × 203 columns; the newly added Title column is not among the displayed columns.]
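
A minimal sketch of what the pattern captures, using two names from the data:

names = pd.Series(['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'])
# '([A-Za-z]+)\.' captures the run of letters immediately before the first '.'
print(names.str.extract(r'([A-Za-z]+)\.', expand=False))
# 0      Mr
# 1    Miss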

# Save the above as the final result
df.to_csv('test_fin.csv')