Hands-on Data Analysis - Task 2: Data Cleaning and Feature Processing

Data analysis mainly includes:

  • Data cleaning
  • Feature processing
  • Data restructuring
  • Data visualization
Before we start, import the numpy and pandas packages and load the data.
# Load the required libraries
import numpy as np
import pandas as pd
# Load the data train.csv
train_data = pd.read_csv('../titanic/train.csv')

2 Data Cleaning and Feature Processing

The data we get is usually not clean. By "not clean" we mean that it contains missing values, outliers, and so on, and it needs some processing before we can continue with analysis or modeling. So the first step after getting the data is to clean it. In this chapter we will learn how to handle missing values, duplicate values, strings and data type conversions, so that the data ends up in a form ready for analysis or modeling.

2.1 Observing and Handling Missing Values

The data we get often contains many missing values. For example, we can see that the Cabin column contains NaN. Do the other columns have missing values as well, and how should these missing values be handled?

2.1.1 Task 1: Observing missing values

(1) Check the number of missing values in each feature
(2) Inspect the data in the Age, Cabin and Embarked columns

Two ways to check the number of missing values:
  • ① train_data.info()
  • ② train_data.isnull().sum()
# Code
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
# Code
train_data.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Tips: from the output above we can see that there are 891 rows in total.

Age is missing 177 values (roughly 20%), so it can be filled with various imputation methods.

Cabin is missing 687 values (roughly 77%). Given how much noise that would introduce into a model, we can either drop the column entirely or build a model to predict and fill it.

Embarked is missing only 2 values, so either imputation or dropping those rows is fine; the impact is negligible.

# Inspect the Age, Cabin and Embarked columns
columns = ['Age','Cabin','Embarked']
train_data[columns].head()
  | Age  | Cabin | Embarked
0 | 22.0 | NaN   | S
1 | 38.0 | C85   | C
2 | 26.0 | NaN   | S
3 | 35.0 | C123  | S
4 | 35.0 | NaN   | S
2.1.2 Task 2: Handling missing values

(1) What are the general approaches to handling missing values?

  • Drop them
  • Fill them in (see the sketch after this list)
    • Statistical methods: for numerical data, fill with the mean, weighted mean, median, etc.; for categorical data, fill with the most frequent category (the mode)
    • Model-based methods: treat the column with missing values as the target variable and predict it from the other available fields, which gives the most likely fill value. If the column with missing values is numerical, use a regression model; if it is categorical, use a classification model.
  • True-value conversion: acknowledge the missing values and treat "missing" as a value in its own right
  • Leave them as they are
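The drop and model-based fill options can be sketched as follows. This is only an illustration on a fresh copy of train.csv; the regression features chosen here are an assumption made for the sketch, not part of the original notebook.

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('../titanic/train.csv')

# 1) Drop: remove every row that contains at least one missing value
dropped = df.dropna()

# 2) Model-based fill (sketch): treat Age as the target and predict it from a few
#    complete numerical columns; real use would need proper feature engineering
features = ['Pclass', 'SibSp', 'Parch', 'Fare']   # these columns have no missing values in train.csv
known = df[df['Age'].notnull()]
unknown = df[df['Age'].isnull()]
reg = LinearRegression().fit(known[features], known['Age'])
df.loc[df['Age'].isnull(), 'Age'] = reg.predict(unknown[features])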

(2) Try to handle the missing values in the Age column

(3) Try different methods to handle the missing values of the whole table at once

# Handle the missing values in Age by filling with the overall median
age_median = train_data.Age.median()
train_data.Age.fillna(age_median, inplace=True)
train_data.Age.describe()
count    891.000000
mean      29.361582
std       13.019697
min        0.420000
25%       22.000000
50%       28.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64
# A better approach is to group by Sex and fill with each group's median age.
# (To make a difference this should be done before the overall-median fill above;
#  here the group medians are computed on the already-filled Age column, which is
#  why both come out as 28.0 and the describe() result below does not change.)
age_median_sex = train_data.groupby('Sex').Age.median()
train_data.set_index('Sex', inplace=True)  # set 'Sex' as the index of the original data
# train_data.head()
age_median_sex
Sex
female    28.0
male      28.0
Name: Age, dtype: float64
"""
Pandas 的值在运算的过程中,会根据索引的值来进行自动的匹配。
在这里我们可以看到上一步骤的Series:age_median_sex的索引是 female 和 male 两个值,
所以需要把原始数据titanic_df中的性别也设置为索引,用 fillna 自动匹配相应的索引进行填充。
"""
train_data.Age.fillna(age_median_sex,inplace = True)
train_data.reset_index(inplace=True)
train_data.Age.describe()
count    891.000000
mean      29.361582
std       13.019697
min        0.420000
25%       22.000000
50%       28.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64
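A more compact way to do the same group-wise fill, assuming we start again from the raw data, is groupby().transform(), which avoids the set_index / reset_index round trip; a minimal sketch:

import pandas as pd

train_data = pd.read_csv('../titanic/train.csv')
# Fill each passenger's missing Age with the median Age of that passenger's Sex group
train_data['Age'] = train_data['Age'].fillna(
    train_data.groupby('Sex')['Age'].transform('median')
)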

Cabin has too many missing values, so we simply drop the column.

# Code
train_data.drop(['Cabin'],axis = 1,inplace = True)
train_data.head()
  | Sex    | PassengerId | Survived | Pclass | Name                                               | Age  | SibSp | Parch | Ticket           | Fare    | Embarked
0 | male   | 1           | 0        | 3      | Braund, Mr. Owen Harris                            | 22.0 | 1     | 0     | A/5 21171        | 7.2500  | S
1 | female | 2           | 1        | 1      | Cumings, Mrs. John Bradley (Florence Briggs Th...  | 38.0 | 1     | 0     | PC 17599         | 71.2833 | C
2 | female | 3           | 1        | 3      | Heikkinen, Miss. Laina                             | 26.0 | 0     | 0     | STON/O2. 3101282 | 7.9250  | S
3 | female | 4           | 1        | 1      | Futrelle, Mrs. Jacques Heath (Lily May Peel)       | 35.0 | 1     | 0     | 113803           | 53.1000 | S
4 | male   | 5           | 0        | 3      | Allen, Mr. William Henry                           | 35.0 | 0     | 0     | 373450           | 8.0500  | S

Handling missing values in categorical data: Embarked

# include=[np.object] shows descriptive statistics for the categorical (object) columns
# (np.object is deprecated in newer NumPy; include=['object'] is the equivalent)
train_data.describe(include=[np.object])
       | Sex  | Name                         | Ticket   | Embarked
count  | 891  | 891                          | 891      | 889
unique | 2    | 891                          | 681      | 3
top    | male | Nicholson, Mr. Arthur Ernest | CA. 2343 | S
freq   | 577  | 1                            | 7        | 644
# We can see that 'S' has by far the highest frequency
# Alternatively, simple counting gives the most frequent value of the Embarked column
train_data.Embarked.value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
train_data.fillna({'Embarked':'S'},inplace=True)
train_data['Embarked'].isnull().sum()
0
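Instead of hard-coding 'S', the same fill can be written in terms of the column's mode; a one-line variant of the step above:

# Fill Embarked with its most frequent value (the mode), which happens to be 'S'
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])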

[Thinking 1] What parameters do dropna and fillna take, and how are they used?

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameters (usage examples follow the parameter lists below):

  • axis:
    • axis=0: drop rows that contain missing values
    • axis=1: drop columns that contain missing values
  • how: used together with axis
    • how='any': drop the row or column as soon as it contains any missing value
    • how='all': drop the row or column only when all of its values are missing
  • thresh: keep a row/column only if it has at least thresh non-missing values
    • e.g. axis=0, thresh=10: a row is dropped if it contains fewer than 10 non-missing values
  • subset: list — the columns in which to look for missing values
  • inplace: whether to operate on the original data; if True, None is returned, otherwise a new copy with the missing values removed is returned

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

  • labels: the row or column labels to drop
  • axis: 0 for rows; 1 for columns
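A short usage sketch of dropna and drop on the whole table, working on a freshly loaded copy so train_data itself is untouched:

df = pd.read_csv('../titanic/train.csv')

df.dropna()                                     # drop every row that has at least one missing value
df.dropna(how='all')                            # drop only rows whose values are all missing
df.dropna(axis=1, thresh=int(len(df) * 0.8))    # keep columns with at least 80% non-missing values
df.dropna(subset=['Age', 'Embarked'])           # only look at Age and Embarked when deciding
df.drop(columns=['Cabin'])                      # drop the Cabin column by name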

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

  • value: scalar, dict, Series, or DataFrame
    • a dict can specify a different fill value for each column
  • method: {'backfill', 'bfill', 'pad', 'ffill', None}, default None — operates along each column
    • ffill / pad: fill a missing value with the previous valid value
    • backfill / bfill: fill a missing value with the next valid value
  • limit: the maximum number of missing values to fill
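And a few fillna variants on the whole table, again as a sketch on a freshly loaded copy:

df = pd.read_csv('../titanic/train.csv')

df.fillna(0)                                  # fill every missing value with 0
df.fillna({'Age': df['Age'].median(),         # a dict gives a different fill value per column
           'Embarked': 'S'})
df.fillna(method='ffill')                     # propagate the previous valid value forward
df.fillna(method='bfill', limit=1)            # fill backward, at most 1 consecutive NaN per gap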

[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

2.2 Observing and Handling Duplicate Values

For one reason or another the data may contain duplicate values. Does it, and if so, how should they be handled?

2.2.1 Task 1: Inspect the duplicate values in the data
# Code
train_data[train_data.duplicated()]
Empty DataFrame
Columns: [Sex, PassengerId, Survived, Pclass, Name, Age, SibSp, Parch, Ticket, Fare, Embarked]
Index: []
2.2.2 Task 2: Handling duplicate values

(1) What are the ways to handle duplicate values?

(2) Handle the duplicate values in our data

The more methods, the better; a few common variants are sketched after the output below.

# Ways to handle duplicate values:
train_data.drop_duplicates().head()
  | Sex    | PassengerId | Survived | Pclass | Name                                               | Age  | SibSp | Parch | Ticket           | Fare    | Embarked
0 | male   | 1           | 0        | 3      | Braund, Mr. Owen Harris                            | 22.0 | 1     | 0     | A/5 21171        | 7.2500  | S
1 | female | 2           | 1        | 1      | Cumings, Mrs. John Bradley (Florence Briggs Th...  | 38.0 | 1     | 0     | PC 17599         | 71.2833 | C
2 | female | 3           | 1        | 3      | Heikkinen, Miss. Laina                             | 26.0 | 0     | 0     | STON/O2. 3101282 | 7.9250  | S
3 | female | 4           | 1        | 1      | Futrelle, Mrs. Jacques Heath (Lily May Peel)       | 35.0 | 1     | 0     | 113803           | 53.1000 | S
4 | male   | 5           | 0        | 3      | Allen, Mr. William Henry                           | 35.0 | 0     | 0     | 373450           | 8.0500  | S
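A few common duplicated / drop_duplicates variants, sketched on a copy; the subset columns below are only examples:

df = train_data.copy()

df.duplicated().sum()                           # count fully duplicated rows
df.drop_duplicates()                            # keep the first occurrence of each duplicated row
df.drop_duplicates(keep='last')                 # keep the last occurrence instead
df.drop_duplicates(subset=['Ticket', 'Fare'])   # compare only these columns when looking for duplicates
df.drop_duplicates(inplace=True)                # remove duplicates from df in place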

2.3 Observing and Processing Features

Let us take a closer look at the features. They can roughly be divided into two groups:
Numerical features: Survived, Pclass, Age, SibSp, Parch, Fare. Of these, Survived and Pclass are discrete numerical features, while Age, SibSp, Parch and Fare are continuous numerical features.
Text features: Name, Sex, Cabin, Embarked, Ticket. Of these, Sex, Cabin, Embarked and Ticket are categorical text features. Numerical features can usually be fed to a model directly, but continuous variables are sometimes discretized to improve model stability and robustness. Text features usually need to be converted into numerical features before they can be used for modeling and analysis.

2.3.1 Task 1: Binning (discretizing) Age

(1) What is binning?

(2) Bin the continuous variable Age into 5 equal-width age groups and label them with the categorical values 1-5

(3) Split the continuous variable Age into the five age groups [0,5) [5,15) [15,30) [30,50) [50,80) and label them 1-5

(4) Bin the continuous variable Age by the 10% 30% 50% 70% 90% quantiles into five age groups and label them 1-5

(5) Save each of the datasets obtained above in csv format

  • What is binning? Binning groups a continuous variable into a small number of discrete intervals (bins). Its main benefits:
    1.1 Strong robustness to outliers: for example, a session-length feature of 702341 sec works out to 8.1 days, which is clearly an outlier. If the feature is not discretized, an outlier such as "session length = 8.1 days" would seriously distort the model. (In many web-analytics systems, sessions are forcibly cut at midnight, so a session length longer than one day should not be possible.)
    1.2 In a logistic regression model, discretizing a single variable into N dummy variables gives each dummy its own weight. This effectively introduces non-linearity, increases the model's expressive power and improves the fit.
    1.3 Missing values can enter the model as a special category of their own.
    1.4 Binning lowers the computational complexity of the model, speeds it up, and makes it easier to put into production later.

# Bin the continuous variable Age into 5 equal-width age groups, labelled 1-5
train_data['Age'] = pd.cut(train_data['Age'], 5, labels=['1','2','3','4','5'])
train_data.head()

  | Sex    | PassengerId | Survived | Pclass | Name                                               | Age | SibSp | Parch | Ticket           | Fare    | Embarked
0 | male   | 1           | 0        | 3      | Braund, Mr. Owen Harris                            | 2   | 1     | 0     | A/5 21171        | 7.2500  | S
1 | female | 2           | 1        | 1      | Cumings, Mrs. John Bradley (Florence Briggs Th...  | 3   | 1     | 0     | PC 17599         | 71.2833 | C
2 | female | 3           | 1        | 3      | Heikkinen, Miss. Laina                             | 2   | 0     | 0     | STON/O2. 3101282 | 7.9250  | S
3 | female | 4           | 1        | 1      | Futrelle, Mrs. Jacques Heath (Lily May Peel)       | 3   | 1     | 0     | 113803           | 53.1000 | S
4 | male   | 5           | 0        | 3      | Allen, Mr. William Henry                           | 3   | 0     | 0     | 373450           | 8.0500  | S
# Split Age into the age groups [0,5) [5,15) [15,30) [30,50) [50,80), labelled 1-5
# (run this on the raw Age column again; the previous step has already turned Age into a categorical)
train_data['Age'] = pd.cut(train_data['Age'], [0,5,15,30,50,80], labels=['1','2','3','4','5'])
train_data.head()
  | Sex    | PassengerId | Survived | Pclass | Name                                               | Age | SibSp | Parch | Ticket           | Fare    | Embarked
0 | male   | 1           | 0        | 3      | Braund, Mr. Owen Harris                            | 3   | 1     | 0     | A/5 21171        | 7.2500  | S
1 | female | 2           | 1        | 1      | Cumings, Mrs. John Bradley (Florence Briggs Th...  | 4   | 1     | 0     | PC 17599         | 71.2833 | C
2 | female | 3           | 1        | 3      | Heikkinen, Miss. Laina                             | 3   | 0     | 0     | STON/O2. 3101282 | 7.9250  | S
3 | female | 4           | 1        | 1      | Futrelle, Mrs. Jacques Heath (Lily May Peel)       | 4   | 1     | 0     | 113803           | 53.1000 | S
4 | male   | 5           | 0        | 3      | Allen, Mr. William Henry                           | 4   | 0     | 0     | 373450           | 8.0500  | S
# Bin Age by the 10% 30% 50% 70% 90% quantiles, labelled 1-5
# Note: pd.cut bins by raw value, so the edges 0.1/0.3/... are treated as ages here and almost
# every row falls outside them, which is why the output below is all NaN. Quantile binning
# needs pd.qcut; see the sketch after the output.
train_data['Age'] = pd.cut(train_data['Age'], [0,0.1,0.3,0.5,0.7,0.9], labels=['1','2','3','4','5'])
train_data.head()
  | Unnamed: 0 | Sex    | PassengerId | Survived | Pclass | Name                                               | Age | SibSp | Parch | Ticket           | Fare    | Embarked
0 | 0          | male   | 1           | 0        | 3      | Braund, Mr. Owen Harris                            | NaN | 1     | 0     | A/5 21171        | 7.2500  | S
1 | 1          | female | 2           | 1        | 1      | Cumings, Mrs. John Bradley (Florence Briggs Th...  | NaN | 1     | 0     | PC 17599         | 71.2833 | C
2 | 2          | female | 3           | 1        | 3      | Heikkinen, Miss. Laina                             | NaN | 0     | 0     | STON/O2. 3101282 | 7.9250  | S
3 | 3          | female | 4           | 1        | 1      | Futrelle, Mrs. Jacques Heath (Lily May Peel)       | NaN | 1     | 0     | 113803           | 53.1000 | S
4 | 4          | male   | 5           | 0        | 3      | Allen, Mr. William Henry                           | NaN | 0     | 0     | 373450           | 8.0500  | S
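A hedged sketch of what tasks (4) and (5) presumably intend: quantile binning with pd.qcut, followed by saving the result to csv. The new column and file names are illustrative:

import pandas as pd

train_data = pd.read_csv('../titanic/train.csv')

# Cut Age at its 10%, 30%, 50%, 70% and 90% quantiles (values outside that range become NaN)
train_data['AgeBand'] = pd.qcut(train_data['Age'],
                                [0, 0.1, 0.3, 0.5, 0.7, 0.9],
                                labels=['1', '2', '3', '4', '5'])

# Task (5): save the binned result in csv format (file name is illustrative)
train_data.to_csv('train_age_qcut.csv', index=False)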

[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html

2.3.2 Task 2: Converting text variables

(1) Inspect the names and categories of the text variables

  • First use train_data.describe(include = [np.object]) to see the names of the text variables
  • Then use train_data.feature.value_counts() to see the categories of each feature

(2) Represent the text variables Sex, Cabin and Embarked with the numerical values 1-5

  • replace
  • map

(3) Represent the text variables Sex, Cabin and Embarked with one-hot encoding (a sketch for exactly these columns appears at the end of this subsection)

# Inspect the names and categories of the text variables
train_data.describe(include = [np.object])
       | Sex  | Name                         | Ticket   | Embarked
count  | 891  | 891                          | 891      | 891
unique | 2    | 891                          | 681      | 3
top    | male | Nicholson, Mr. Arthur Ernest | CA. 2343 | S
freq   | 577  | 1                            | 7        | 646
train_data.Embarked.value_counts()
S    646
C    168
Q     77
Name: Embarked, dtype: int64
train_data.Sex.value_counts()
male      577
female    314
Name: Sex, dtype: int64
train_data.Sex.unique()
array(['male', 'female'], dtype=object)
# Represent the text variables Sex, Cabin and Embarked with numerical values (method 1: replace)
train_data['Sex'] = train_data.Sex.replace(['male','female'],[1,2])
train_data.head()
  | Sex | PassengerId | Survived | Pclass | Name                                               | Age  | SibSp | Parch | Ticket           | Fare    | Embarked
0 | 1   | 1           | 0        | 3      | Braund, Mr. Owen Harris                            | 22.0 | 1     | 0     | A/5 21171        | 7.2500  | S
1 | 2   | 2           | 1        | 1      | Cumings, Mrs. John Bradley (Florence Briggs Th...  | 38.0 | 1     | 0     | PC 17599         | 71.2833 | C
2 | 2   | 3           | 1        | 3      | Heikkinen, Miss. Laina                             | 26.0 | 0     | 0     | STON/O2. 3101282 | 7.9250  | S
3 | 2   | 4           | 1        | 1      | Futrelle, Mrs. Jacques Heath (Lily May Peel)       | 35.0 | 1     | 0     | 113803           | 53.1000 | S
4 | 1   | 5           | 0        | 3      | Allen, Mr. William Henry                           | 35.0 | 0     | 0     | 373450           | 8.0500  | S
train_data['Embarked'] = train_data.Embarked.replace(['S','C','Q'],[1,2,3])
train_data.head()
  | Sex | PassengerId | Survived | Pclass | Name                                               | Age  | SibSp | Parch | Ticket           | Fare    | Embarked
0 | 1   | 1           | 0        | 3      | Braund, Mr. Owen Harris                            | 22.0 | 1     | 0     | A/5 21171        | 7.2500  | 1
1 | 2   | 2           | 1        | 1      | Cumings, Mrs. John Bradley (Florence Briggs Th...  | 38.0 | 1     | 0     | PC 17599         | 71.2833 | 2
2 | 2   | 3           | 1        | 3      | Heikkinen, Miss. Laina                             | 26.0 | 0     | 0     | STON/O2. 3101282 | 7.9250  | 1
3 | 2   | 4           | 1        | 1      | Futrelle, Mrs. Jacques Heath (Lily May Peel)       | 35.0 | 1     | 0     | 113803           | 53.1000 | 1
4 | 1   | 5           | 0        | 3      | Allen, Mr. William Henry                           | 35.0 | 0     | 0     | 373450           | 8.0500  | 1
# Method 2: use map
# (this assumes Sex still holds the raw 'female'/'male' strings, hence the commented-out reloads)
# train_data = pd.read_csv('data_tmp.csv')
# train_data.head()
sex_map = {'female': 1, 'male': 2}   # avoid calling the variable `dict`, which would shadow the builtin
train_data['Sex'] = train_data.Sex.map(sex_map)
# train_data = pd.read_csv('../titanic/train.csv')
# train_data.Cabin.unique()
# train_data.Cabin.nunique()
array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
       'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
       'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
       'C148'], dtype=object)
# Represent the text variables with integer labels via a manually built mapping
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_data = pd.read_csv('../titanic/train.csv')
for f in ['Ticket', 'Cabin']:
    # build a value -> integer mapping by hand and apply it with map
    # (unique() keeps NaN while nunique() does not, which is why NaN ends up with code 0 for Cabin)
    label_dict = dict(zip(train_data[f].unique(), range(train_data[f].nunique())))
    train_data[f + '_labelEncoder'] = train_data[f].map(label_dict)
train_data.head()
  | PassengerId | Survived | Pclass | Name                                               | Sex    | Age  | SibSp | Parch | Ticket           | Fare    | Cabin | Embarked | Ticket_labelEncoder | Cabin_labelEncoder
0 | 1           | 0        | 3      | Braund, Mr. Owen Harris                            | male   | 22.0 | 1     | 0     | A/5 21171        | 7.2500  | NaN   | S        | 0                   | 0.0
1 | 2           | 1        | 1      | Cumings, Mrs. John Bradley (Florence Briggs Th...  | female | 38.0 | 1     | 0     | PC 17599         | 71.2833 | C85   | C        | 1                   | 1.0
2 | 3           | 1        | 3      | Heikkinen, Miss. Laina                             | female | 26.0 | 0     | 0     | STON/O2. 3101282 | 7.9250  | NaN   | S        | 2                   | 0.0
3 | 4           | 1        | 1      | Futrelle, Mrs. Jacques Heath (Lily May Peel)       | female | 35.0 | 1     | 0     | 113803           | 53.1000 | C123  | S        | 3                   | 2.0
4 | 5           | 0        | 3      | Allen, Mr. William Henry                           | male   | 35.0 | 0     | 0     | 373450           | 8.0500  | NaN   | S        | 4                   | 0.0
# The same label encoding using sklearn's LabelEncoder directly
for feat in ['Ticket', 'Cabin']:
    lbl = LabelEncoder()
    # astype(str) turns NaN into the string 'nan' so LabelEncoder can handle it
    train_data[feat + '_labelEncoder'] = lbl.fit_transform(train_data[feat].astype(str))

# One-hot encoding with get_dummies
for f in ['Age', 'Embarked']:
    x = pd.get_dummies(train_data[f], prefix=f)
    train_data = pd.concat([train_data, x], axis=1)
train_data.head()
  | PassengerId | Survived | Pclass | Name                                               | Sex    | Age  | SibSp | Parch | Ticket           | Fare    | ... | Age_65.0 | Age_66.0 | Age_70.0 | Age_70.5 | Age_71.0 | Age_74.0 | Age_80.0 | Embarked_C | Embarked_Q | Embarked_S
0 | 1           | 0        | 3      | Braund, Mr. Owen Harris                            | male   | 22.0 | 1     | 0     | A/5 21171        | 7.2500  | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
1 | 2           | 1        | 1      | Cumings, Mrs. John Bradley (Florence Briggs Th...  | female | 38.0 | 1     | 0     | PC 17599         | 71.2833 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0
2 | 3           | 1        | 3      | Heikkinen, Miss. Laina                             | female | 26.0 | 0     | 0     | STON/O2. 3101282 | 7.9250  | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
3 | 4           | 1        | 1      | Futrelle, Mrs. Jacques Heath (Lily May Peel)       | female | 35.0 | 1     | 0     | 113803           | 53.1000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
4 | 5           | 0        | 3      | Allen, Mr. William Henry                           | male   | 35.0 | 0     | 0     | 373450           | 8.0500  | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1

5 rows × 195 columns
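Task (3) asks specifically for one-hot encoding of Sex, Cabin and Embarked; a sketch of that on a freshly loaded copy (dummy_na adds an extra indicator column for missing values, which is mainly useful for Cabin):

import pandas as pd

df = pd.read_csv('../titanic/train.csv')
# One 0/1 indicator column per category of each listed column
df = pd.get_dummies(df, columns=['Sex', 'Cabin', 'Embarked'], dummy_na=True)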

2.3.3 Task 3: Extract a Titles feature from the plain-text Name feature (Titles are Mr, Miss, Mrs, etc.)
# Code
train_data['title'] = train_data.Name.str.extract(r'([A-Za-z]+\.)', expand=False)
train_data.title[:20]
0         Mr.
1        Mrs.
2       Miss.
3        Mrs.
4         Mr.
5         Mr.
6         Mr.
7     Master.
8        Mrs.
9        Mrs.
10      Miss.
11      Miss.
12        Mr.
13        Mr.
14      Miss.
15       Mrs.
16    Master.
17        Mr.
18       Mrs.
19       Mrs.
Name: title, dtype: object
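As a quick check of the extraction, one might look at how often each extracted title occurs:

# Frequency of each extracted title
train_data['title'].value_counts()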