第二章:第一节数据清洗及特征处理-课程

【回顾&引言】前面一章的内容大家可以感觉到我们主要是对基础知识做一个梳理,让大家了解数据分析的一些操作,主要做了数据的各个角度的观察。那么在这里,我们主要是做数据分析的流程性学习,主要是包括了数据清洗以及数据的特征处理,数据重构以及数据可视化。这些内容是为数据分析最后的建模和模型评价做一个铺垫。

开始之前,导入numpy、pandas包和数据
#加载所需的库
import numpy as np
import pandas as pd
#加载数据train.csv
df=pd.read_csv('train.csv')
df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS

2 第二章:数据清洗及特征处理

我们拿到的数据通常是不干净的,所谓的不干净,就是数据中有缺失值,有一些异常点等,需要经过一定的处理才能继续做后面的分析或建模,所以拿到数据的第一步是进行数据清洗,本章我们将学习缺失值、重复值、字符串和数据转换等操作,将数据清洗成可以分析或建模的亚子。

2.1 缺失值观察与处理

我们拿到的数据经常会有很多缺失值,比如我们可以看到Cabin列存在NaN,那其他列还有没有缺失值,这些缺失值要怎么处理呢

2.1.1 任务一:缺失值观察

(1) 请查看每个特征缺失值个数
(2) 请查看Age, Cabin, Embarked列的数据
以上方式都有多种方式,所以大家多多益善

#写入代码
df.isna().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
#写入代码
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
#写入代码
df[df.Age.isna()]
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
5603Moran, Mr. JamesmaleNaN003308778.4583NaNQ
171812Williams, Mr. Charles EugenemaleNaN0024437313.0000NaNS
192013Masselmani, Mrs. FatimafemaleNaN0026497.2250NaNC
262703Emir, Mr. Farred ChehabmaleNaN0026317.2250NaNC
282913O'Dwyer, Miss. Ellen "Nellie"femaleNaN003309597.8792NaNQ
.......................................
85986003Razi, Mr. RaihedmaleNaN0026297.2292NaNC
86386403Sage, Miss. Dorothy Edith "Dolly"femaleNaN82CA. 234369.5500NaNS
86886903van Melkebeke, Mr. PhilemonmaleNaN003457779.5000NaNS
87887903Laleff, Mr. KristomaleNaN003492177.8958NaNS
88888903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.4500NaNS

177 rows × 12 columns

df[df.Cabin.isna()]
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
5603Moran, Mr. JamesmaleNaN003308778.4583NaNQ
7803Palsson, Master. Gosta Leonardmale2.03134990921.0750NaNS
.......................................
88488503Sutehall, Mr. Henry Jrmale25.000SOTON/OQ 3920767.0500NaNS
88588603Rice, Mrs. William (Margaret Norton)female39.00538265229.1250NaNQ
88688702Montvila, Rev. Juozasmale27.00021153613.0000NaNS
88888903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.4500NaNS
89089103Dooley, Mr. Patrickmale32.0003703767.7500NaNQ

687 rows × 12 columns

df[df.Embarked.isna()]
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
616211Icard, Miss. Ameliefemale38.00011357280.0B28NaN
82983011Stone, Mrs. George Nelson (Martha Evelyn)female62.00011357280.0B28NaN
2.1.2 任务二:对缺失值进行处理

(1)处理缺失值一般有几种思路

(2) 请尝试对Age列的数据的缺失值进行处理

(3) 请尝试使用不同的方法直接对整张表的缺失值进行处理

#处理缺失值的一般思路:
#提醒:可使用的函数有--->dropna函数与fillna函数
df_drop=df.dropna(subset=['Age'])
df_drop[df_drop.Age.isna()]
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
#写入代码
df_ffill=df.copy()
df_ffill.Age.fillna(method='ffill',inplace=True)
df_ffill[df_ffill.Age.isna()]
# df.Age.fillna(method='ffill')
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
#写入代码
df_bfill=df.copy()
df_bfill.Age.fillna(method='bfill',inplace=True)
df_bfill[df_bfill.Age.isna()]
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
#写入代码
df_meanfill=df.copy()
df_meanfill.Age.mean()
df_meanfill.Age.fillna(df_meanfill.Age.mean(),inplace=True)
df_meanfill[df_meanfill.Age.isna()]

PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
df_meanfill.Age.mean()
29.699117647058763

【思考1】dropna和fillna有哪些参数,分别如何使用呢?

dropna

  • (a)axis参数
  • (b)how参数(可以选all或者any,表示全为缺失去除和存在缺失去除)
  • (c)subset参数(即在某一组列范围中搜索缺失值)

fillna

  • (a)值填充与前后向填充(分别与ffill方法和bfill方法等价)method=

【思考2】检索空缺值用np.nan要比用None好,这是为什么?

#思考回答

  • 在所有的表格读取后,无论列是存放什么类型的数据,默认的缺失值全为np.nan类型
  • 但这里存在一个问题就是np.nan!=np.nan

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

2.2 重复值观察与处理

由于这样那样的原因,数据中会不会存在重复值呢,如果存在要怎样处理呢

2.2.1 任务一:请查看数据中的重复值
  • .duplicated()可选参数keep默认为first,即首次出现设为不重复,若为last,则最后一次设为不重复,若为False,则所有重复项为True
df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
df[df.duplicated()]
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
df.drop_duplicates()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
.......................................
88688702Montvila, Rev. Juozasmale27.00021153613.0000NaNS
88788811Graham, Miss. Margaret Edithfemale19.00011205330.0000B42S
88888903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.4500NaNS
88989011Behr, Mr. Karl Howellmale26.00011136930.0000C148C
89089103Dooley, Mr. Patrickmale32.0003703767.7500NaNQ

891 rows × 12 columns

df[df.duplicated('Ticket')]
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
242503Palsson, Miss. Torborg Danirafemale8.03134990921.0750NaNS
717203Goodwin, Miss. Lillian Amyfemale16.052CA 214446.9000NaNS
888911Fortune, Miss. Mabel Helenfemale23.03219950263.0000C23 C25 C27S
11711802Turpin, Mr. William John Robertmale29.0101166821.0000NaNS
11912003Andersson, Miss. Ellis Anna Mariafemale2.04234708231.2750NaNS
.......................................
87687703Gustafsson, Mr. Alfred Ossianmale20.00075349.8458NaNS
87988011Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)female56.0011176783.1583C50C
88088112Shelley, Mrs. William (Imanita Parrish Hall)female25.00123043326.0000NaNS
88588603Rice, Mrs. William (Margaret Norton)female39.00538265229.1250NaNQ
88888903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.4500NaNS

210 rows × 12 columns

#写入代码

for name ,group in df.groupby('Ticket'):
    print(name)
    display(group)
    break
110152
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
25725811Cherry, Miss. Gladysfemale30.00011015286.5B77S
50450511Maioni, Miss. Robertafemale16.00011015286.5B79S
75976011Rothes, the Countess. of (Lucy Noel Martha Dye...female33.00011015286.5B77S

可以发现有很多人是使用了同样的票上船的,如果是要求一人一票的话那么可能这部分的数据就是存在问题的,也可能是逃票了或者其他原因。

2.2.2 任务二:对重复值进行处理

(1)重复值有哪些处理方式呢?

(2)处理我们数据的重复值

方法多多益善

  • .drop_duplicates(‘Class’)

#重复值有哪些处理方式:

  • 删除重复值
  • 单独对于重复值进行修改
#写入代码


df.drop_duplicates()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
.......................................
88688702Montvila, Rev. Juozasmale27.00021153613.0000NaNS
88788811Graham, Miss. Margaret Edithfemale19.00011205330.0000B42S
88888903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.4500NaNS
88989011Behr, Mr. Karl Howellmale26.00011136930.0000C148C
89089103Dooley, Mr. Patrickmale32.0003703767.7500NaNQ

891 rows × 12 columns

2.2.3 任务三:将前面清洗的数据保存为csv格式
#写入代码
df.to_csv('test_clear.csv')


2.3 特征观察与处理

我们对特征进行一下观察,可以把特征大概分为两大类:
数值型特征:Survived ,Pclass, Age ,SibSp, Parch, Fare,其中Survived, Pclass为离散型数值特征,Age,SibSp, Parch, Fare为连续型数值特征
文本型特征:Name, Sex, Cabin,Embarked, Ticket,其中Sex, Cabin, Embarked, Ticket为类别型文本特征,数值型特征一般可以直接用于模型的训练,但有时候为了模型的稳定性及鲁棒性会对连续变量进行离散化。文本型特征往往需要转换成数值型特征才能用于建模分析。

df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
2.3.1 任务一:对年龄进行分箱(离散化)处理

(1) 分箱操作是什么?

(2) 将连续变量Age平均分箱成5个年龄段,并分别用类别变量12345表示

(3) 将连续变量Age划分为[0,5) [5,15) [15,30) [30,50) [50,80)五个年龄段,并分别用类别变量12345表示

(4) 将连续变量Age按10% 30% 50 70% 90%五个年龄段,并用分类变量12345表示

(5) 将上面的获得的数据分别进行保存,保存为csv格式

df['Age_cut']=pd.cut(df.Age,5, right=False, labels=[1,2,3,4,5])#
df['Age_cut']
0        2
1        3
2        2
3        3
4        3
      ... 
886      2
887      2
888    NaN
889      2
890      2
Name: Age_cut, Length: 891, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
df.to_csv('test_ave.csv')
#分箱操作是什么:
df['Age_cut']=pd.cut(df.Age,[0,5,15,30,50,80], right=False, labels=[1,2,3,4,5])
df['Age_cut']
0        3
1        4
2        3
3        4
4        4
      ... 
886      3
887      3
888    NaN
889      3
890      4
Name: Age_cut, Length: 891, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
df.to_csv('test_ave.csv')
#写入代码
df['Age_cut']=pd.qcut(df.Age,[0,.1,.3,0.5,0.7,0.9], labels=[1,2,3,4,5])
df['Age_cut']
0        2
1        5
2        3
3        4
4        4
      ... 
886      3
887      2
888    NaN
889      3
890      4
Name: Age_cut, Length: 891, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
#写入代码
df.to_csv('test_ave.csv')
#写入代码

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html

2.3.2 任务二:对文本变量进行转换

(1) 查看文本变量名及种类
(2) 将文本变量Sex, Cabin ,Embarked用数值变量12345表示
(3) 将文本变量Sex, Cabin, Embarked用one-hot编码表示

#写入代码
df.Sex.describe()
count      891
unique       2
top       male
freq       577
Name: Sex, dtype: object
df.Sex.value_counts()
male      577
female    314
Name: Sex, dtype: int64
df.Cabin.value_counts()
G6             4
C23 C25 C27    4
B96 B98        4
F33            3
F2             3
              ..
A5             1
B71            1
C111           1
B79            1
D48            1
Name: Cabin, Length: 147, dtype: int64
df.Embarked.value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
df['Sex_num'] = df['Sex'].replace(['male','female'],[1,2])
df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAge_cutSex_num
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS21
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C52
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS32
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S42
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS41
#写入代码
df['Sex_num'] = df['Sex'].replace(r'male',1).replace(r'female',2)
df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAge_cutSex_num
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS21
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C52
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS32
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S42
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS41
#写入代码
df['Sex_num'] = df['Sex'].map({'male':1,'female':2})
df.head()

PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAge_cutSex_num
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS21
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C52
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS32
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S42
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS41
#方法三: 使用sklearn.preprocessing的LabelEncoder
from sklearn.preprocessing import LabelEncoder
for feat in ['Sex','Cabin', 'Ticket']:
    lbl = LabelEncoder()  
    label_dict = dict(zip(df[feat].unique(), range(df[feat].nunique())))
    df[feat + "_labelEncode_map"] = df[feat].map(label_dict)
    df[feat + "_labelEncode_lbl"] = lbl.fit_transform(df[feat].astype(str))

df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAge_cutSex_numSex_labelEncode_mapSex_labelEncode_lblCabin_labelEncode_mapCabin_labelEncode_lblTicket_labelEncode_mapTicket_labelEncode_lbl
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS21010.01470523
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C52101.0811596
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS32100.01472669
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S42102.055349
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS41010.01474472
###  factorize方法
#### 该方法主要用于自然数编码,并且缺失值会被记做-1,其中sort参数表示是否排序后赋值
codes, uniques = pd.factorize(df['Sex'], sort=True)
df['Sex_factorize']=codes
codes, uniques = pd.factorize(df['Cabin'], sort=True)
df['Cabin_factorize']=codes
codes, uniques = pd.factorize(df['Embarked'], sort=True)
df['Embarked_factorize']=codes
df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFare...Sex_numSex_labelEncode_mapSex_labelEncode_lblCabin_labelEncode_mapCabin_labelEncode_lblTicket_labelEncode_mapTicket_labelEncode_lblSex_factorizeCabin_factorizeEmbarked_factorize
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500...1010.014705231-12
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833...2101.08115960810
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250...2100.014726690-12
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000...2102.0553490552
4503Allen, Mr. William Henrymale35.0003734508.0500...1010.014744721-12

5 rows × 23 columns

#one_hot
df.join(pd.get_dummies(df[['Sex', 'Cabin','Embarked']])).head()
#可选prefix参数添加前缀,prefix_sep添加分隔符
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFare...Cabin_F G73Cabin_F2Cabin_F33Cabin_F38Cabin_F4Cabin_G6Cabin_TEmbarked_CEmbarked_QEmbarked_S
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500...0000000001
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833...0000000100
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250...0000000001
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000...0000000001
4503Allen, Mr. William Henrymale35.0003734508.0500...0000000001

5 rows × 175 columns

2.3.3 任务三:从纯文本Name特征里提取出Titles的特征(所谓的Titles就是Mr,Miss,Mrs等)
#写入代码
df['Titles']=df['Name'].str.extract('(["Mr","Miss","Mrs"]+)\.')
df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFare...Sex_labelEncode_mapSex_labelEncode_lblCabin_labelEncode_mapCabin_labelEncode_lblTicket_labelEncode_mapTicket_labelEncode_lblSex_factorizeCabin_factorizeEmbarked_factorizeTitles
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500...010.014705231-12Mr
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833...101.08115960810Mrs
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250...100.014726690-12Miss
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000...102.0553490552Mrs
4503Allen, Mr. William Henrymale35.0003734508.0500...010.014744721-12Mr

5 rows × 24 columns

#保存最终你完成的已经清理好的数据
# 保存上面的为最终结论
df.to_csv('test_fin.csv')

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值