datawhale课程[动手学数据分析]——Task02：数据清洗简述

本文链接：https://blog.csdn.net/sd3145265/article/details/108161520

【回顾&引言】前面一章的内容大家可以感觉到我们主要是对基础知识做一个梳理，让大家了解数据分析的一些操作，主要做了数据的各个角度的观察。那么在这里，我们主要是做数据分析的流程性学习，主要是包括了数据清洗以及数据的特征处理，数据重构以及数据可视化。这些内容是为数据分析最后的建模和模型评价做一个铺垫。

2 第二章：数据清洗及特征处理

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

开始之前，导入numpy、pandas包和数据

#加载所需的库
import numpy as np
import pandas as pd

#加载数据train.csv
df=pd.read_csv('train.csv')
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

数据清洗简述

我们拿到的数据通常是不干净的，所谓的不干净，就是数据中有缺失值，有一些异常点等，需要经过一定的处理才能继续做后面的分析或建模，所以拿到数据的第一步是进行数据清洗，本章我们将学习缺失值、重复值、字符串和数据转换等操作，将数据清洗成可以分析或建模的样子。

2.1 缺失值观察与处理

我们拿到的数据经常会有很多缺失值，比如我们可以看到Cabin列存在NaN，那其他列还有没有缺失值，这些缺失值要怎么处理呢

2.1.1 任务一：缺失值观察

(1) 请查看每个特征缺失值个数
(2) 请查看Age， Cabin， Embarked列的数据
以上方式都有多种方式，所以大家多多益善

(1) 请查看每个特征缺失值个数

#写入代码
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

#写入代码
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

【小结】可以看出只有Age、Cabin和Embark三项有缺失。

df[df.Age.isnull()]

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
17	18	1	2	Williams, Mr. Charles Eugene	male	NaN	0	0	244373	13.0000	NaN	S
19	20	1	3	Masselmani, Mrs. Fatima	female	NaN	0	0	2649	7.2250	NaN	C
26	27	0	3	Emir, Mr. Farred Chehab	male	NaN	0	0	2631	7.2250	NaN	C
28	29	1	3	O'Dwyer, Miss. Ellen "Nellie"	female	NaN	0	0	330959	7.8792	NaN	Q
...	...	...	...	...	...	...	...	...	...	...	...	...
859	860	0	3	Razi, Mr. Raihed	male	NaN	0	0	2629	7.2292	NaN	C
863	864	0	3	Sage, Miss. Dorothy Edith "Dolly"	female	NaN	8	2	CA. 2343	69.5500	NaN	S
868	869	0	3	van Melkebeke, Mr. Philemon	male	NaN	0	0	345777	9.5000	NaN	S
878	879	0	3	Laleff, Mr. Kristo	male	NaN	0	0	349217	7.8958	NaN	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S

177 rows × 12 columns

可以看到Age缺失旅客的具体相关信息，下面同理。

df[df.Cabin.isnull()]

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
884	885	0	3	Sutehall, Mr. Henry Jr	male	25.0	0	0	SOTON/OQ 392076	7.0500	NaN	S
885	886	0	3	Rice, Mrs. William (Margaret Norton)	female	39.0	0	5	382652	29.1250	NaN	Q
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

687 rows × 12 columns

df[df.Embarked.isnull()]

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
61	62	1	1	Icard, Miss. Amelie	female	38.0	0	0	113572	80.0	B28	NaN
829	830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	0	113572	80.0	B28	NaN

#写入代码
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

(2) 请查看Age， Cabin， Embarked列的数据

#写入代码
df[['Age','Cabin','Embarked']].head()

	Age	Cabin	Embarked
0	22.0	NaN	S
1	38.0	C85	C
2	26.0	NaN	S
3	35.0	C123	S
4	35.0	NaN	S

df.loc[:,['Age','Cabin','Embarked']].head()

	Age	Cabin	Embarked
0	22.0	NaN	S
1	38.0	C85	C
2	26.0	NaN	S
3	35.0	C123	S
4	35.0	NaN	S

df.iloc[:,[5,10,11]].head()

	Age	Cabin	Embarked
0	22.0	NaN	S
1	38.0	C85	C
2	26.0	NaN	S
3	35.0	C123	S
4	35.0	NaN	S

2.1.2 任务二：对缺失值进行处理

(1) 处理缺失值一般有几种思路

(2) 请尝试对Age列的数据的缺失值进行处理

(3) 请尝试使用不同的方法直接对整张表的缺失值进行处理

#(1) 处理缺失值的一般思路：
#提醒：可使用的函数有--->dropna函数与fillna函数

#回答：主要有两种思路：
#        1.删除存在缺失值的个例
#        2.缺失值插补

#(2) 请尝试对Age列的数据的缺失值进行处理
#用0填充
df[df['Age']==None]=0
df.head()
df['Age'].count()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

NaN没处理成功

#写入代码
df[df['Age']==np.nan]=0
df.head()
df['Age'].count()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

同样NaN也没处理成功

#写入代码
df[df['Age'].isnull()]=0
df.head()
df['Age'].count()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

NaN处理成功

#写入代码
df[df['Age'].isna()]=0
df.head()
df['Age'].count()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

NaN处理成功

#写入代码
df['Age']=df['Age'].fillna(0)
df.head()
df['Age'].count()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

df.dropna().head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S

df.fillna(0).head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	0	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	0	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	0	S

【思考1】dropna和fillna有哪些参数，分别如何使用呢?

【思考1回答】详情见下面参考资料

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

【思考2】检索空缺值用np.nan要比用None好，这是为什么？

【思考2回答】数值列读取数据后，空缺值的数据类型为float64，np.nan的数据类型为float64，而None的类型为特殊的NoneType，所以用None一般索引不到，比较的时候最好用np.nan。

type(None)

NoneType

type(np.nan)

float

上面可以看出None是一种特殊的NoneType类型，而np.nan是比较常见的float类型，这是上述问题的根源。这会导致：

对函数支持。numpy有不少函数可以自动处理np.nan，但是却不支持处理None，pandas库也是如此。
对容器数据类型的影响。主要是对数组的影响，None会导致整个数组转化为object类型，无法数值计算。

【参考】https://www.cnblogs.com/onemorepoint/p/8966791.html

2.2 重复值观察与处理

由于这样那样的原因，数据中会不会存在重复值呢，如果存在要怎样处理呢

2.2.1 任务一：请查看数据中的重复值

#写入代码，方法一
df[df.duplicated()]
any(df.duplicated())

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked

False

结果可以看出没有重复值

# 方法二
df.index.is_unique

True

2.2.2 任务二：对重复值进行处理

(1)重复值有哪些处理方式呢？

(2)处理我们数据的重复值

方法多多益善

重复值有哪些处理方式：

用df.drop_duplicates(subset,keep,inplace)函数。

subset 接收string和sequence。表示进行去重的列
keep 接收特定的string。表示去重是保留第几个数据，或者不保留。“first”：保留第一个，“last”：保留最后一个，False：只要有重复都不保留。默认为"first"
inplace 接收boolean。表示是否在原表上进行操作。默认为False

【参考】 https://www.cda.cn/bigdata/27829.html

#写入代码
df.drop_duplicates().head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

2.2.3 任务三：将前面清洗的数据保存为csv格式

#写入代码
df.to_csv('test_clear.csv')

2.3 特征观察与处理

我们对特征进行一下观察，可以把特征大概分为两大类：
数值型特征：Survived ，Pclass， Age ，SibSp， Parch， Fare，其中Survived， Pclass为离散型数值特征，Age，SibSp， Parch， Fare为连续型数值特征
文本型特征：Name， Sex， Cabin，Embarked， Ticket，其中Sex， Cabin， Embarked， Ticket为类别型文本特征，数值型特征一般可以直接用于模型的训练，但有时候为了模型的稳定性及鲁棒性会对连续变量进行离散化。文本型特征往往需要转换成数值型特征才能用于建模分析。

【感想】在做进一步的数据分析过程前，我们要把数据转化成计算机能处理的数据类型，所以上述特征类型都进行了数字化，将其转化成计算机最好处理的离散数值型。

2.3.1 任务一：对年龄进行分箱（离散化）处理

(1) 分箱操作是什么？

(2) 将连续变量Age平均分箱成5个年龄段，并分别用类别变量12345表示

(3) 将连续变量Age划分为[0,5) [5,15) [15,30) [30,50) [50,80)五个年龄段，并分别用类别变量12345表示

(4) 将连续变量Age按10% 30% 50% 70% 90%五个年龄段，并用分类变量12345表示

(5) 将上面的获得的数据分别进行保存，保存为csv格式

【回答】

(1)分箱操作是什么：

即连续数值离散化。数据分箱操作，即把一段连续的值切分成若干段，每一段的值看成一个分类。

#(2) 将连续变量Age平均分箱成5个年龄段，并分别用类别变量12345表示
df['AgeBand'] = pd.cut(df['Age'], 5,labels = ['1','2','3','4','5'])
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	2
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	3
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	2
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	3
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	3

df.to_csv('test_ave.csv')

观察发现最后面多了一列AgeBand，按Age平均分5组，以12345进行分箱

#（3）将连续变量Age划分为[0,5) [5,15) [15,30) [30,50) [50,80)五个年龄段，并分别用类别变量12345表示
df['AgeBand'] = pd.cut(df['Age'],[0,5,15,30,50,80],labels = ['1','2','3','4','5'])
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	3
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	4
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	3
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	4
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	4

df.to_csv('test_cut.csv')

#(4) 将连续变量Age按10% 30% 50% 70% 90%五个年龄段，并用分类变量12345表示
df['AgeBand'] = pd.qcut(df['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels = ['1','2','3','4','5'])
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	2
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	5
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	3
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	4
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	4

df.to_csv('test_pr.csv')

【总结】pd.cut与pd.qcut

pd.cut(x,bins,right=True,labels=None,retbins=False,precision=3,include_lowest=False)

需要将数据值分段并排序到bins中时使用cut。 此函数对于从连续变量转换为离散变量也很有用。 例如，cut可以将年龄转换为年龄范围组。 支持bins到相同数量的箱柜或预先指定的bins阵列。
x: 进行划分的一维数组
bins : 1,整数---将x划分为多少个等间距的区间；2,序列—将x划分在指定的序列中，若不在该序列中，则是NaN
right : 是否包含右端点
labels : 是否用标记来代替返回的bins
retbins: 是否返回间距bins
precision: 精度
include_lowest:是否包含左端点
返回值:如果retbins = False 则返回x中每个值对应的bin的列表，否者则返回x中每个值对应的bin的列表和对应的bins

pd.qcut(x, q, labels=None, retbins=False, precision=3, duplicates=‘raise’)

基于分位数的离散化功能。 根据等级或基于样本分位数将变量分离为相等大小的桶。 例如，10个分位数的1000个值将产生一个分类对象，表示每个数据点的分位数成员资格。
x： 要进行分组的数据，数据类型为一维数组，或Series对象
q： 组数，即要将数据分成几组，后边举例说明
labels： 可以理解为组标签，这里注意标签个数要和组数相等
retbins： 默认为False,当为False时，返回值是Categorical类型（具有value_counts()方法），为True是返回值是元组

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html

2.3.2 任务二：对文本变量进行转换

(1) 查看文本变量名及种类
(2) 将文本变量Sex， Cabin ，Embarked用数值变量12345表示
(3) 将文本变量Sex， Cabin， Embarked用one-hot编码表示

【分析】通过刚才的观察，我们知道文本型特征：Name， Sex， Cabin，Embarked， Ticket，其中Sex， Cabin， Embarked， Ticket为类别型文本特征。

#(1）查看类别文本变量名及种类

#方法一: value_counts
df['Sex'].value_counts()
df.Sex.value_counts()

male      577
female    314
Name: Sex, dtype: int64






male      577
female    314
Name: Sex, dtype: int64

df['Cabin'].value_counts()
df.Cabin.value_counts()

B96 B98        4
G6             4
C23 C25 C27    4
C22 C26        3
E101           3
              ..
B41            1
B38            1
C90            1
E36            1
C118           1
Name: Cabin, Length: 147, dtype: int64






B96 B98        4
G6             4
C23 C25 C27    4
C22 C26        3
E101           3
              ..
B41            1
B38            1
C90            1
E36            1
C118           1
Name: Cabin, Length: 147, dtype: int64

df['Embarked'].value_counts()
df.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64






S    644
C    168
Q     77
Name: Embarked, dtype: int64

df['Ticket'].value_counts()
df.Ticket.value_counts()

CA. 2343    7
1601        7
347082      7
3101295     6
347088      6
           ..
349912      1
349251      1
PC 17475    1
PP 4348     1
2649        1
Name: Ticket, Length: 681, dtype: int64






CA. 2343    7
1601        7
347082      7
3101295     6
347088      6
           ..
349912      1
349251      1
PC 17475    1
PP 4348     1
2649        1
Name: Ticket, Length: 681, dtype: int64

#方法二: unique
df['Sex'].unique()
df.Sex.unique()

array(['male', 'female'], dtype=object)






array(['male', 'female'], dtype=object)

df['Sex'].nunique()
df.Sex.nunique()

#(2)将类别文本转换为12345

#方法一: replace
df['Sex_num'] = df['Sex'].replace(['male','female'],[1,2])
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand	Sex_num
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	2	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	5	2
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	3	2
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	4	2
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	4	1

#方法二: map
df['Sex_num'] = df['Sex'].map({'male': 1, 'female': 2})
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand	Sex_num
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	2	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	5	2
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	3	2
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	4	2
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	4	1

#方法三: 使用sklearn.preprocessing的LabelEncoder
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin', 'Ticket']:
    lbl = LabelEncoder()  
    label_dict = dict(zip(df[feat].unique(), range(df[feat].nunique())))
    df[feat + "_labelEncode"] = df[feat].map(label_dict)
    df[feat + "_labelEncode"] = lbl.fit_transform(df[feat].astype(str))

df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand	Sex_num	Cabin_labelEncode	Ticket_labelEncode
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	2	1	147	523
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	5	2	81	596
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	3	2	147	669
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	4	2	55	49
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	4	1	147	472

其中，zip([iterable, …])，返回元组列表。

【参考】https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

#(3)将类别文本转换为one-hot编码

#方法一: OneHotEncoder
for feat in ["Age", "Embarked"]:
#     x = pd.get_dummies(df["Age"] // 6)
#     x = pd.get_dummies(pd.cut(df['Age'],5))
    x = pd.get_dummies(df[feat], prefix=feat)
    df = pd.concat([df, x], axis=1)
    #df[feat] = pd.get_dummies(df[feat], prefix=feat)
    
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	...	Embarked_C	Embarked_S
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	...	0	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	...	1	0
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	...	0	1
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	...	0	1
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	...	0	1

5 rows × 107 columns

# 方法二：使用sklearn.preprocessing的OneHotEncoder

'''OneHotEncoder输入必须是int数组，所以直接传入字符串特征值是不行的，需要先转化成整型特征，
再传入OneHotEncoder。'''

from sklearn.preprocessing  import OneHotEncoder
for feat in ["Age", "Embarked"]:    
# 必须转换为2D矩阵
    enc = OneHotEncoder()
    enc_array = enc.fit_transform(df[feat].astype(str).values.reshape(-1,1)).toarray() #fit_transform()输入要为ndarray
    enc_array.shape
    x = pd.DataFrame(enc_array)
    df = pd.concat([df, x], axis=1, sort= False)
df.head()

'OneHotEncoder输入必须是int数组，所以直接传入字符串特征值是不行的，需要先转化成整型特征，\n再传入OneHotEncoder。'






(891, 89)






(891, 4)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	...	0	2
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	...	0.0	1.0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	...	1.0	0.0
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	...	0.0	1.0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	...	0.0	1.0
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	...	0.0	1.0

5 rows × 200 columns

【参考】https://www.jianshu.com/p/421f040dfe2f
在sklearn 包中，OneHotEncoder 函数非常实用，它可以实现将分类特征的每个元素转化为一个可以用来计算的值。

【参考】https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

2.3.3 任务三：从纯文本Name特征里提取出Titles的特征(所谓的Titles就是Mr,Miss,Mrs等)

#写入代码
df['Title'] = df.Name.str.extract('([A-Za-z]+)\.', expand=False)
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand	Sex_num	Cabin_labelEncode	Ticket_labelEncode	Title
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	2	1	147	523	Mr
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	5	2	81	596	Mrs
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	3	2	147	669	Miss
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	4	2	55	49	Mrs
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	4	1	147	472	Mr

【分析】上面用到了str.extract()函数和正则表达式，可以处理数字、符号和字母混合的字符串。

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html

#保存最终你完成的已经清理好的数据
df.to_csv('test_fin.csv')

附加题

【问题】

观察DataFrame数据类型，可以分为哪些类型？
数据文件的NAN/NULL数据读取为DataFrame后最终会变为哪种数据？
如何筛选姓名中含有"Mr."的数据出来？
如何查看数据中的95%分位数？
关于缺失值部分：
df[df[‘Age’]==None]=0 # 不推荐
df[df[‘Age’] == np.nan] = 0 # 不推荐
df[df[‘Age’].isnull()] = 0 # 还好
df[‘Age’] = df[‘Age’].fillna(0) # 推荐
你能说说原因吗？

【附加题回答】

DataFrame数据类型：
(1). float
(2). int
(3). bool
(4). datetime64[ns]
(5). datetime64[ns, tz]
(6). timedelta[ns]
(7). category
(8). object

用df.dtypes可以查看数据类型

df.dtypes

PassengerId              int64
Survived                 int64
Pclass                   int64
Name                    object
Sex                     object
Age                    float64
SibSp                    int64
Parch                    int64
Ticket                  object
Fare                   float64
Cabin                   object
Embarked                object
AgeBand               category
Sex_num                  int64
Cabin_labelEncode        int32
Ticket_labelEncode       int32
Title                   object
dtype: object

最终会变为float
步骤如下：

#(1) 首先将他们进行字符串化，并得到其对应的布尔值：
bool = df.Name.str.contains('Mr\.')  #不要忘记正则表达式的写法，'.'在里面要用'\.'表示
print('bool : \n', bool)

bool : 
 0       True
1      False
2      False
3      False
4       True
       ...  
886    False
887    False
888    False
889     True
890     True
Name: Name, Length: 891, dtype: bool

#(2) 通过dataframe的基本操作将其选取出来：
filter_data = df[bool]
filter_data

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	AgeBand	Sex_num	Cabin_labelEncode	Ticket_labelEncode	Title
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	2	1	147	523	Mr
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S	4	1	147	472	Mr
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q	NaN	1	147	275	Mr
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S	NaN	1	129	85	Mr
12	13	0	3	Saundercock, Mr. William Henry	male	20.0	0	0	A/5. 2151	8.0500	NaN	S	2	1	147	535	Mr
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
881	882	0	3	Markun, Mr. Johann	male	33.0	0	0	349257	7.8958	NaN	S	4	1	147	394	Mr
883	884	0	2	Banfield, Mr. Frederick James	male	28.0	0	0	C.A./SOTON 34068	10.5000	NaN	S	3	1	147	565	Mr
884	885	0	3	Sutehall, Mr. Henry Jr	male	25.0	0	0	SOTON/OQ 392076	7.0500	NaN	S	3	1	147	650	Mr
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C	3	1	60	8	Mr
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q	4	1	147	466	Mr

517 rows × 17 columns

【结论】可以看出有517个人有Mr.的Title。

可以用pandas中的quantile(0.95)和df.describe(percentiles=[.95])，对象可以是dataframe和array，里面可以有缺失值。或者numpy中的np.percentile(a,0.95)和np.quantile(a,0.95)，a是array或可以转换成array的对象。前提是a里面不能有缺失值，否则返回nan。

df.quantile(0.95)
df['Age'].quantile(0.95) #pandas的方法，数组可以有缺失值，这里Age有缺失值
df.describe(percentiles=[.95])
np.percentile(df['Age'],0.95) #numpy的方法，数组不能有缺失值，这里Age有缺失值
np.quantile(df['Fare'],0.95)

PassengerId    846.50000
Survived         1.00000
Pclass           3.00000
Age             56.00000
SibSp            3.00000
Parch            2.00000
Fare           112.07915
Name: 0.95, dtype: float64






56.0

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
95%	846.500000	1.000000	3.000000	56.000000	3.000000	2.000000	112.079150
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

nan






112.07915

关于缺失值部分：输出详情见 2.1.2 任务二：对缺失值进行处理

type(np.nan)

float

type(None)

NoneType

np.nan == None

False

np.nan == np.nan

False

df['Age'].isnull()

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool

df['Age'].fillna(True)

0        22
1        38
2        26
3        35
4        35
       ... 
886      27
887      19
888    True
889      26
890      32
Name: Age, Length: 891, dtype: object

【小结】由上可以看出，有慎用==来判断缺失值，因为返回值是False，判断不出来，从而无法替换。而后面两种方法可以判断出来。

【参考】https://blog.csdn.net/weixin_41712499/article/details/82719987