python数据清洗算法_Python数据分析,清洗数据 7 步走!

数据清洗 (data cleaning) 是机器学习和深度学习进入算法步前的一项重要任务,我平时比较习惯使用的 7 个步骤,总结如下:Step1 : read csv

Step2 : preview data

Step3: check null value for every column

Step4: complete null value

Step5: feature engineering

Step 5.1: delete some features

Step 5.2: create new feature

Step6: encode for categories columns

Step 6.1: Sklearn LabelEncode

Step 6.2: Pandas get_dummies

Step 7: check for data cleaning

今天使用泰坦尼克数据集,完整介绍以上 7 步的具体操作过程。

1 读入数据

这不废话吗,第一步就是读入数据。

data_raw = pd.read_csv('../input/titanicdataset-traincsv/train.csv')

data_raw

结果:

PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked

0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS

1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C

2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS

3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S

4503Allen, Mr. William Henrymale35.0003734508.0500NaNS

.......................................

88688702Montvila, Rev. Juozasmale27.00021153613.0000NaNS

88788811Graham, Miss. Margaret Edithfemale19.00011205330.0000B42S

88888903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.4500NaNS

88989011Behr, Mr. Karl Howellmale26.00011136930.0000C148C

89089103Dooley, Mr. Patrickmale32.0003703767.7500NaNQ

891 rows × 12 columns

2 数据预览

data_raw.info()

data_raw.describe(include='all')

结果:

RangeIndex: 891 entries, 0 to 890

Data columns (total 12 columns):

PassengerId 891 non-null int64

Survived 891 non-null int64

Pclass 891 non-null int64

Name 891 non-null object

Sex 891 non-null object

Age 714 non-null float64

SibSp 891 non-null int64

Parch 891 non-null int64

Ticket 891 non-null object

Fare 891 non-null float64

Cabin 204 non-null object

Embarked 889 non-null object

dtypes: float64(2), int64(5), object(5)

memory usage: 83.7+ KB

PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked

count891.000000891.000000891.000000891891714.000000891.000000891.000000891891.000000204889

uniqueNaNNaNNaN8912NaNNaNNaN681NaN1473

topNaNNaNNaNHakkarainen, Mr. Pekka PietarimaleNaNNaNNaN1601NaNG6S

freqNaNNaNNaN1577NaNNaNNaN7NaN4644

mean446.0000000.3838382.308642NaNNaN29.6991180.5230080.381594NaN32.204208NaNNaN

std257.3538420.4865920.836071NaNNaN14.5264971.1027430.806057NaN49.693429NaNNaN

min1.0000000.0000001.000000NaNNaN0.4200000.0000000.000000NaN0.000000NaNNaN

25%223.5000000.0000002.000000NaNNaN20.1250000.0000000.000000NaN7.910400NaNNaN

50%446.0000000.0000003.000000NaNNaN28.0000000.0000000.000000NaN14.454200NaNNaN

75%668.5000001.0000003.000000NaNNaN38.0000001.0000000.000000NaN31.000000NaNNaN

max891.0000001.0000003.000000NaNNaN80.0000008.0000006.000000NaN512.329200NaNN

3 检查null值

data1 = data_raw.copy(deep=True)

data1.isnull().sum()

结果:

PassengerId 0

Survived 0

Pclass 0

Name 0

Sex 0

Age 177

SibSp 0

Parch 0

Ticket 0

Fare 0

Cabin 687

Embarked 2

dtype: int64

Age 列 177 个空值,Cabin 687 个空值,一共才 891 行,估计没啥价值了!Embarked 2 个。

4 补全空值

data1['Age'].fillna(data1['Age'].median(), inplace = True)

data1['Embarked'].fillna(data1['Embarked'].mode()[0], inplace = True)

data1.isnull().sum()

补全操作check:

PassengerId 0

Survived 0

Pclass 0

Name 0

Sex 0

Age 0

SibSp 0

Parch 0

Ticket 0

Fare 0

Cabin 687

Embarked 0

dtype: int64

5 特征工程

5.1 干掉 3 列:

drop_column = ['PassengerId','Cabin', 'Ticket']

data1.drop(drop_column, axis=1, inplace = True)

5.2 增加 3 列

增加一列FamilySize

data1['FamilySize'] = data1 ['SibSp'] + data1['Parch'] + 1

data1

打印结果:

SurvivedPclassNameSexAgeSibSpParchFareEmbarkedFamilySize

003Braund, Mr. Owen Harrismale22.0107.2500S2

111Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01071.2833C2

213Heikkinen, Miss. Lainafemale26.0007.9250S1

311Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01053.1000S2

403Allen, Mr. William Henrymale35.0008.0500S1

.................................

88602Montvila, Rev. Juozasmale27.00013.0000S1

88711Graham, Miss. Margaret Edithfemale19.00030.0000S1

88803Johnston, Miss. Catherine Helen "Carrie"female28.01223.4500S4

88911Behr, Mr. Karl Howellmale26.00030.0000C1

89003Dooley, Mr. Patrickmale32.0007.7500Q1

891 rows × 10 columns

再创建一列:

data1['IsAlone'] = np.where(data1['FamilySize'] > 1,0,1)

再创建一列:

data1['Title'] = data1['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

data1

结果:

SurvivedPclassNameSexAgeSibSpParchFareEmbarkedFamilySizeIsAloneTitle

003Braund, Mr. Owen Harrismale22.0107.2500S20Mr

111Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01071.2833C20Mrs

213Heikkinen, Miss. Lainafemale26.0007.9250S11Miss

311Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01053.1000S20Mrs

403Allen, Mr. William Henrymale35.0008.0500S11Mr

.......................................

88602Montvila, Rev. Juozasmale27.00013.0000S11Rev

88711Graham, Miss. Margaret Edithfemale19.00030.0000S11Miss

88803Johnston, Miss. Catherine Helen "Carrie"female28.01223.4500S40Miss

88911Behr, Mr. Karl Howellmale26.00030.0000C11Mr

89003Dooley, Mr. Patrickmale32.0007.7500Q11Mr

891 rows × 12 columns

5.3 分箱走起

data1['FareCut'] = pd.qcut(data1['Fare'], 4)

data1['AgeCut'] = pd.cut(data1['Age'].astype(int), 6)

data1

结果:

SurvivedPclassNameSexAgeSibSpParchFareEmbarkedFamilySizeIsAloneTitleFareCutAgeCut

003Braund, Mr. Owen Harrismale22.0107.2500S20Mr(-0.001, 7.91](13.333, 26.667]

111Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01071.2833C20Mrs(31.0, 512.329](26.667, 40.0]

213Heikkinen, Miss. Lainafemale26.0007.9250S11Miss(7.91, 14.454](13.333, 26.667]

311Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01053.1000S20Mrs(31.0, 512.329](26.667, 40.0]

403Allen, Mr. William Henrymale35.0008.0500S11Mr(7.91, 14.454](26.667, 40.0]

.............................................

88602Montvila, Rev. Juozasmale27.00013.0000S11Rev(7.91, 14.454](26.667, 40.0]

88711Graham, Miss. Margaret Edithfemale19.00030.0000S11Miss(14.454, 31.0](13.333, 26.667]

88803Johnston, Miss. Catherine Helen "Carrie"female28.01223.4500S40Miss(14.454, 31.0](26.667, 40.0]

88911Behr, Mr. Karl Howellmale26.00030.0000C11Mr(14.454, 31.0](13.333, 26.667]

89003Dooley, Mr. Patrickmale32.0007.7500Q11Mr(-0.001, 7.91](26.667, 40.0]

891 rows × 14 columns

6 编码

6.1 LabelEncoder 方法

使用 Sklearn 的 LabelEncoder

from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()

data1['Sex_Code'] = label.fit_transform(data1['Sex'])

data1['Embarked_Code'] = label.fit_transform(data1['Embarked'])

data1['Title_Code'] = label.fit_transform(data1['Title'])

data1['AgeBin_Code'] = label.fit_transform(data1['AgeCut'])

data1['FareBin_Code'] = label.fit_transform(data1['FareCut'])

data1

结果 data1 选取某些列,算法模型终于能认出它们了,多不容易!

6.2 get_dummies 方法

get_dummies 将长 DataFrame 变为宽 DataFrame:

pd.get_dummies(data1['Sex'])

结果:

femalemale

001

110

210

310

401

.........

88601

88710

88810

88901

89001

891 rows × 2 columns

而 LabelEncoder 编码后,仅仅是把 Female 编码为 0, male 编码为 1.

label.fit_transform(data1['Sex'])

0 1

1 0

2 0

3 0

4 1

..

886 1

887 0

888 0

889 1

890 1

Name: Sex_Code, Length: 891, dtype: int64

7 再 check

# Step 7: data cleaning check

data1[data1_x_alg].info()

print('-'*50)

data1_dummy.info()

结果:

RangeIndex: 891 entries, 0 to 890

Data columns (total 8 columns):

Sex_Code 891 non-null int64

Pclass 891 non-null int64

Embarked_Code 891 non-null int64

Title_Code 891 non-null int64

SibSp 891 non-null int64

Parch 891 non-null int64

Age 891 non-null float64

Fare 891 non-null float64

dtypes: float64(2), int64(6)

memory usage: 55.8 KB

--------------------------------------------------

RangeIndex: 891 entries, 0 to 890

Data columns (total 29 columns):

Pclass 891 non-null int64

SibSp 891 non-null int64

Parch 891 non-null int64

Age 891 non-null float64

Fare 891 non-null float64

FamilySize 891 non-null int64

IsAlone 891 non-null int64

Sex_female 891 non-null uint8

Sex_male 891 non-null uint8

Embarked_C 891 non-null uint8

Embarked_Q 891 non-null uint8

Embarked_S 891 non-null uint8

Title_Capt 891 non-null uint8

Title_Col 891 non-null uint8

Title_Don 891 non-null uint8

Title_Dr 891 non-null uint8

Title_Jonkheer 891 non-null uint8

Title_Lady 891 non-null uint8

Title_Major 891 non-null uint8

Title_Master 891 non-null uint8

Title_Miss 891 non-null uint8

Title_Mlle 891 non-null uint8

Title_Mme 891 non-null uint8

Title_Mr 891 non-null uint8

Title_Mrs 891 non-null uint8

Title_Ms 891 non-null uint8

Title_Rev 891 non-null uint8

Title_Sir 891 non-null uint8

Title_the Countess 891 non-null uint8

dtypes: float64(2), int64(5), uint8(22)

memory usage: 68.0 KB

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值