动手数据分析:数据清洗及特征处理Task02

1 数据清洗及特征处理

#加载所需的库
import numpy as np
import pandas as pd


#加载数据train.csv
data=pd.read_csv('train.csv')
data.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS

1.1 缺失值观察与处理

缺失值观察

(1) 请查看每个特征缺失值个数

#写入代码
"""各类标签的缺失数量"""
data.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
#写入代码
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

(2) 请查看Age, Cabin, Embarked列的数据

data[['Age','Cabin','Embarked']].head()
AgeCabinEmbarked
022.0NaNS
138.0C85C
226.0NaNS
335.0C123S
435.0NaNS
data.loc[:,['Age','Cabin','Embarked']].head()
AgeCabinEmbarked
022.0NaNS
138.0C85C
226.0NaNS
335.0C123S
435.0NaNS

对缺失值进行处理

(1)处理缺失值一般有几种思路

(2) 请尝试对Age列的数据的缺失值进行处理

(3) 请尝试使用不同的方法直接对整张表的缺失值进行处理

#使用fillna函数进行中位数填补

data.loc[:,'Age'].fillna(data.loc[:,'Age'].median())   #中位数进行填补
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    28.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
#fillna函数进行平均值填补

data.loc[:,'Age'].fillna(data.loc[:,'Age'].mean())   #平均值进行填补
0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: Age, Length: 891, dtype: float64
#使用dropna删除空值
data.dropna().head()     
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
6701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
101113Sandstrom, Miss. Marguerite Rutfemale4.011PP 954916.7000G6S
111211Bonnell, Miss. Elizabethfemale58.00011378326.5500C103S
#查看删除空值后的数据信息
data.dropna().info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 1 to 889
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  183 non-null    int64  
 1   Survived     183 non-null    int64  
 2   Pclass       183 non-null    int64  
 3   Name         183 non-null    object 
 4   Sex          183 non-null    object 
 5   Age          183 non-null    float64
 6   SibSp        183 non-null    int64  
 7   Parch        183 non-null    int64  
 8   Ticket       183 non-null    object 
 9   Fare         183 non-null    float64
 10  Cabin        183 non-null    object 
 11  Embarked     183 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 18.6+ KB

【思考1】dropna和fillna有哪些参数,分别如何使用呢?

【回答】filln函数:

value:需要用什么值去填充缺失值

axis:确定填充维度,从行开始或是从列开始

method:ffill:用缺失值前面的一个值代替缺失值

backfill/bfill,缺失值后面的一个值代替前面的缺失值

dropna函数:
.dropna(axis=0)删除所有有缺失值的行,.dropna(axis=1)删除所有有缺失值的列

参数inplace,为True表示在原数据集上进行修改,为False表示生成一个复制对象,不修改原数据,默认False

【思考】检索空缺值用np.nan,None以及.isnull()哪个更好,这是为什么?如果其中某个方式无法找到缺失值,原因又是为什么?

#思考回答
"""数值列读取数据后,空缺值的数据类型为float64所以用None一般索引不到,比较的时候一般用np.nan"""

1.2 重复值观察与处理

#duplicated返回一个布尔的Series,显示各行是否有重复行,没有显示为FALSE,有为TRUE;

data.duplicated()
0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Length: 891, dtype: bool
#drop_duplicates去掉重复行

data = data.drop_duplicates()     
data.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
#将清洗的数据保存

data.to_csv('test_clear.csv')

1.3对年龄进行分箱(离散化)处理

(1) 分箱操作是什么?

1)
数据分箱(也称为离散分箱或分段)是一种数据预处理技术,用于减少次要观察误差的影响,是一种将多个连续值分组为较少数量的“分箱”的方法。

#举一个小例子
#DataFrame的创建

test1 = {'name': ['张三', '李四', '王二麻子', '李华'],
        'ages': [35, 60, 25, 15],
        'height': [170,165,175,180]
       }
example = pd.DataFrame(test1)
example
nameagesheight
0张三35170
1李四60165
2王二麻子25175
3李华15180
example['agesband']=pd.cut(data['ages'],3,labels=['青年','中年','老年'])
example
nameagesheightagesband
0张三35170中年
1李四60165老年
2王二麻子25175青年
3李华15180青年

(2) 将连续变量Age平均分箱成5个年龄段,并分别用类别变量12345表示


data['AgeBand'] = pd.cut(data['Age'], 5,labels = [1,2,3,4,5])
data.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAgeBand
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS2
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C3
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS2
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S3
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS3
data.to_csv('test_ave.csv')

(3) 将连续变量Age划分为[0,5) [5,15) [15,30) [30,50) [50,80)五个年龄段,并分别用类别变量12345表示

data['AgeBand'] = pd.cut(data['Age'],[0,5,15,30,50,80],labels = [1,2,3,4,5])
data.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAgeBand
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS3
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C4
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS3
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S4
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS4
data.to_csv('test_cut.csv')

(4) 将连续变量Age按10% 30% 50% 70% 90%五个年龄段,并用分类变量12345表示

data['AgeBand']=pd.qcut(data['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels = [1,2,3,4,5])
data.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAgeBand
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS2
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C5
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS3
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S4
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS4
data.to_csv('test_pr.csv')

1.4对文本变量进行转换

(1) 查看文本变量名及种类

"""查看文本种类有的方法有:
    1、value_counts
    2、unique
    3、nunique 
"""
data['Sex'].value_counts()
male      577
female    314
Name: Sex, dtype: int64
print(data['Cabin'].nunique())
data['Cabin'].unique()
147





array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
       'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
       'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
       'C148'], dtype=object)
data['Embarked'].value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64

(2) 将文本变量Sex, Cabin ,Embarked用数值变量12345表示

data['Sex_num'] = data['Sex'].map({"male": 1, "female": 0})
data.head(3)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAgeBandSex_num
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS21
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C50
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS30
#这里用map映射
for feat in ['Cabin', 'Ticket']:
    label_dict = dict(zip(data[feat].unique(), range(1,data[feat].nunique())))
    data[feat + "_num"] = data[feat].map(label_dict)
    data[feat + "_num"] = data[feat].map(label_dict)
data.head(3)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAgeBandSex_numCabinnumTicketnumCabin_numTicket_num
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS210.001.01.0
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C501.012.02.0
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS300.021.03.0

3)独热编码

# 独热编码
for feature in ["Sex","Cabin", "Embarked"]:
    temp=pd.get_dummies(data[feature],prefix=feature)
    data=pd.concat([data, temp], axis=1)
data.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFare...Cabin_F G73Cabin_F2Cabin_F33Cabin_F38Cabin_F4Cabin_G6Cabin_TEmbarked_CEmbarked_QEmbarked_S
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500...0000000001
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833...0000000100
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250...0000000001
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000...0000000001
4503Allen, Mr. William Henrymale35.0003734508.0500...0000000001

5 rows × 164 columns

1.5 从Name特征里提取出Mr,Miss,Mrs

#正则表达
data['Title'] = data.Name.str.extract('([A-Za-z]+)\.', expand=False)
data.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFare...Cabin_F2Cabin_F33Cabin_F38Cabin_F4Cabin_G6Cabin_TEmbarked_CEmbarked_QEmbarked_STitle
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500...000000001Mr
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833...000000100Mrs
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250...000000001Miss
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000...000000001Mrs
4503Allen, Mr. William Henrymale35.0003734508.0500...000000001Mr

5 rows × 1083 columns

#保存已经清理好的数据
data.to_csv('test_fin.csv')

本文主要学习内容来源:datawhale

  • 1
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值