Hands-on data analysis Task02

import numpy as np
import pandas as pd
train_data = pd.read_csv(r"C:\Users\sheng\1.JupyterNotes\titanic\train.csv")
train_data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Task01: Inspect missing values

train_data.isna().shape
(891, 12)
train_data.notna().shape
(891, 12)
train_data.isnull().head() # Flag every missing value. Returns a DataFrame of booleans with the same shape as the data; the output is large, so only the first 5 rows are shown.
   PassengerId  Survived  Pclass   Name    Sex    Age  SibSp  Parch  Ticket   Fare  Cabin  Embarked
0        False     False   False  False  False  False  False  False   False  False   True     False
1        False     False   False  False  False  False  False  False   False  False  False     False
2        False     False   False  False  False  False  False  False   False  False   True     False
3        False     False   False  False  False  False  False  False   False  False  False     False
4        False     False   False  False  False  False  False  False   False  False   True     False
train_data.notnull().head() # Same idea as above, with the test inverted.
   PassengerId  Survived  Pclass   Name    Sex    Age  SibSp  Parch  Ticket   Fare  Cabin  Embarked
0         True      True    True   True   True   True   True   True    True   True  False      True
1         True      True    True   True   True   True   True   True    True   True   True      True
2         True      True    True   True   True   True   True   True    True   True  False      True
3         True      True    True   True   True   True   True   True    True   True   True      True
4         True      True    True   True   True   True   True   True    True   True  False      True
train_data.isnull().any() # Check whether each column contains any missing value: True if it does, False otherwise. Here the Age, Cabin and Embarked columns have missing values.
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool
train_data.isnull().all() # Check whether any column is entirely missing; no column of train_data is.
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin          False
Embarked       False
dtype: bool
train_data.isnull().sum()/len(train_data) # Simple arithmetic gives the share of missing values in each column
PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64
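
The same ratio can also be computed in one call, because the mean of a boolean column is the share of True values in it. A minimal sketch on a made-up frame (the data and the name `toy` are assumptions for illustration, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; not the Titanic data.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
})

ratio_sum = toy.isnull().sum() / len(toy)  # the approach used above
ratio_mean = toy.isnull().mean()           # mean of booleans = share of True

print(ratio_mean)
```

Both expressions give the same Series; which one to use is a matter of taste.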

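Task02: Handle missing values

Two functions are commonly used here, dropna() and fillna(). Start with their parameters.
.dropna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html?highlight=dropna#pandas.DataFrame.dropna

axis: 0 or 1. 0 drops rows that contain missing values; 1 drops columns that contain missing values.

how: "any" or "all", default "any". "any" drops a row or column if it contains at least one missing value; "all" drops it only if every value in it is missing.

thresh: optional integer. Rows (or columns) with at least thresh non-NA values are kept; the rest are dropped.

subset: optional array-like of labels. When dropping rows, pass a list of column names so that only missing values in those columns are considered.

inplace: whether to modify the object in place; default False.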
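
To make the dropna() parameters above concrete, here is a small sketch on a made-up frame (the data and variable names are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in different places.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [1.0, 2.0, np.nan, np.nan],
    "z": [1.0, 2.0, 3.0, np.nan],
})

rows_any = toy.dropna(axis=0, how="any")   # keep only rows with no missing value
rows_all = toy.dropna(axis=0, how="all")   # drop only rows that are entirely missing
cols_any = toy.dropna(axis=1, how="any")   # drop every column containing a missing value
at_least_2 = toy.dropna(thresh=2)          # keep rows with at least 2 non-NA values
x_present = toy.dropna(subset=["x"])       # drop rows where "x" is missing

print(rows_any.shape, rows_all.shape, cols_any.shape, at_least_2.shape, x_present.shape)
```

None of these calls changes `toy` itself, since inplace defaults to False.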

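.fillna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html?highlight=fillna#pandas.DataFrame.fillna

value: a scalar, dict, Series or DataFrame. A scalar fills every missing value; a dict, Series or DataFrame gives the fill value to use for each column (or index label).

method: one of backfill, bfill, pad, ffill or None; default None. backfill/bfill fills with the next valid value; pad/ffill fills with the previous valid value; None means fill with the given value. Recent pandas releases deprecate this argument in favor of the separate .ffill() and .bfill() methods.

axis: the axis along which to fill. 0 fills down each column (along the index); 1 fills across each row (along the columns).

inplace: whether to modify the object in place; default False.

limit: caps how many missing values are filled (with ffill/bfill, how many consecutive ones).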
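
The fillna() options above can be tried on a small made-up frame. This is only a sketch: the data and names are assumptions, and it uses .ffill() / .bfill() rather than fillna(method=...), since newer pandas versions deprecate the method argument.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; not the Titanic data.
toy = pd.DataFrame({
    "num": [np.nan, 2.0, np.nan, np.nan, 5.0],
    "cat": ["a", None, "b", "b", None],
})

# Per-column fill values via a dict: mean for "num", mode for "cat".
# mode() returns a Series, so take its first element.
filled = toy.fillna({"num": toy["num"].mean(), "cat": toy["cat"].mode()[0]})

# Forward / backward fill. The leading NaN has no previous value to copy.
forward = toy["num"].ffill()
backward = toy["num"].bfill()

# limit caps how many consecutive NaNs are filled.
forward_1 = toy["num"].ffill(limit=1)

print(filled)
```

The leading NaN that survives the forward fill is the same effect the Cabin column shows in the outputs that follow.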

train_data.shape
(891, 12)
train_data.dropna(subset=["Age"]).shape # drop rows where Age is missing
(714, 12)
train_data.shape # the original data is unchanged
(891, 12)
train_data.dropna(subset=["Cabin"]).shape # drop rows where Cabin is missing
(204, 12)
train_data.drop("Cabin",axis=1).shape # drop the Cabin column
(891, 11)
train_data.dropna(thresh=11).shape # keep rows with at most one missing value
(733, 12)
train_data.dropna(thresh=12).shape # keep only rows with no missing values
(183, 12)
train_data["Age"].fillna(train_data["Age"].mean()).isnull().sum() # fill missing Age values with the mean
0
train_data.fillna(method="bfill").isnull().sum() # backward fill; one Cabin value is still missing
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          1
Embarked       0
dtype: int64
train_data.fillna(method="bfill").tail() # the last Cabin value is missing and has no later value to copy, so it cannot be filled
     PassengerId  Survived  Pclass  Name                                      Sex     Age   SibSp  Parch  Ticket      Fare   Cabin  Embarked
886  887          0         2       Montvila, Rev. Juozas                     male    27.0  0      0      211536      13.00  B42    S
887  888          1         1       Graham, Miss. Margaret Edith              female  19.0  0      0      112053      30.00  B42    S
888  889          0         3       Johnston, Miss. Catherine Helen "Carrie"  female  26.0  1      2      W./C. 6607  23.45  C148   S
889  890          1         1       Behr, Mr. Karl Howell                     male    26.0  0      0      111369      30.00  C148   C
890  891          0         3       Dooley, Mr. Patrick                       male    32.0  0      0      370376      7.75   NaN    Q
train_data.fillna(method="ffill").isnull().sum() # forward fill; one Cabin value is still missing
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          1
Embarked       0
dtype: int64
train_data.fillna(method="ffill").head() # the first Cabin value is missing and has no earlier value to copy, so it cannot be filled
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   C85    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   C123   S

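Plain forward or backward filling is not a good fit here, though. The plan instead: fill Age with its mean, fill Embarked with its mode, and drop Cabin entirely.

dic = {"Age":train_data["Age"].mean(),"Embarked":np.array(train_data["Embarked"].mode())[0]}
train_data.drop("Cabin",axis=1).fillna(dic).isnull().sum()  
# On the first attempt the two missing Embarked values were not filled: train_data["Embarked"].mode() returns a Series, not a string, so its first element has to be taken.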
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
np.array([train_data["Embarked"]==np.nan]).sum()
0
np.array([train_data["Embarked"]==None]).sum()
0
train_data["Embarked"].isnull().sum()
2
type(np.nan),type(None),type(pd.NaT),type(""),type(" ")
(float, NoneType, pandas._libs.tslibs.nattype.NaTType, str, str)

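[Question] Which is the better way to find missing values: comparing with np.nan, comparing with None, or calling .isnull()? Why? If one of these approaches fails to find the missing values, what is the reason?

Data can be missing for two reasons: (1) the value truly does not exist, or (2) an error occurred while the data was collected.

None is a Python built-in constant and the only value of the type NoneType; Python uses it to represent a null value. .isnull() detects it, but arithmetic on None raises a TypeError. It can be read as "no result": the data truly does not exist, rather than having been lost to a collection error.

np.nan is a floating-point constant, not a function, so it is written np.nan and not np.nan(). Compared with == it is unequal to everything, including itself. It can be read as "there is a value, but it is unknown", so two such values cannot be assumed to be equal. Arithmetic on it is allowed and returns NaN again. This is why train_data["Embarked"]==np.nan above matches nothing. The comparison with None matches nothing either, because pandas stores the empty cells read from the CSV as NaN, not as None.

.isnull() detects both None and NaN, so it is the reliable choice. Other placeholders such as "?", "" and " " (a question mark, an empty string, a space) are all treated as present values.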

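np.nan == np.nan # NaN is not equal to itself
False
df = pd.DataFrame([[None,1,2],["?",""," "]]) # NaN values are assigned afterwards (passing np.nan directly here reportedly raised an error)
df.loc[0, 1] = np.nan # .loc avoids chained assignment
df.loc[0, 2] = np.nan # np.NaN is an alias that newer NumPy versions no longer provide
df
      0    1    2
0  None  NaN  NaN
1     ?
df.isnull() # "?", "" and " " all count as present values
       0      1      2
0   True   True   True
1  False  False  False
df[0][0]/2 # arithmetic on None raises an error
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-5ec1b6f840b7> in <module>()
----> 1 df[0][0]/2 # arithmetic on None raises an error

TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
df[1][0]/2 # arithmetic on NaN returns NaN
nan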
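
Since placeholders such as "?" or an empty string are not detected by .isnull(), one option is to convert them to NaN first. A minimal sketch, assuming that in this made-up column (`raw`, `placeholders` and the values are all hypothetical) those three strings really do mean "missing":

```python
import pandas as pd

# Hypothetical column mixing real values, placeholders and None.
raw = pd.DataFrame({"port": ["S", "?", "", " ", None, "C"]})

# Assumption: these strings stand for "missing" in this toy data.
placeholders = ["?", "", " "]

# mask() replaces entries with NaN wherever the condition is True.
cleaned = raw["port"].mask(raw["port"].isin(placeholders))

print(raw["port"].isnull().sum())      # only None is detected
print(cleaned.isnull().sum())          # placeholders are now detected too
```

Which strings count as placeholders depends on the dataset, so the list has to be chosen by inspecting the data.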

Task03: Handle duplicate values

train_data.duplicated().sum() # count fully duplicated rows in train_data
0
train_data.nunique() # number of distinct values in each column of train_data
PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64
.duplicated() reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html?highlight=duplicate
.drop_duplicates() reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html#pandas.DataFrame.drop_duplicates
train_data.drop_duplicates(subset=["Ticket"],keep="first").shape # using Ticket as an example: keep the first row for each ticket and drop the later duplicates
(681, 12)
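
The keep argument decides which of the duplicated rows survive. A small sketch on made-up data (the frame and its column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame in which some tickets are shared by several passengers.
toy = pd.DataFrame({
    "ticket": ["A1", "A1", "B2", "C3", "C3", "C3"],
    "name": ["p1", "p2", "p3", "p4", "p5", "p6"],
})

dup_mask = toy.duplicated(subset=["ticket"])                 # True for repeats after the first
first = toy.drop_duplicates(subset=["ticket"], keep="first") # keep the first row per ticket
last = toy.drop_duplicates(subset=["ticket"], keep="last")   # keep the last row per ticket
none = toy.drop_duplicates(subset=["ticket"], keep=False)    # drop every ticket that repeats

print(first["name"].tolist(), last["name"].tolist(), none["name"].tolist())
```

With keep="first" or keep="last" the number of remaining rows equals the number of distinct tickets, which matches the 681 rows above and the nunique() result for Ticket.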
dic = {"Age":train_data["Age"].mean(),"Embarked":np.array(train_data["Embarked"].mode())[0]}
data = train_data.drop("Cabin",axis=1).fillna(dic)
data.to_csv("data.csv") # save the cleaned data

Task04: Binning

Binning converts a continuous variable into a discrete one.

d_cut_1 = data.copy()
d_cut_1["Age_Group"] = pd.cut(d_cut_1["Age"],bins=5,labels=[1,2,3,4,5]) # split Age into 5 equal-width bins
d_cut_1.to_csv("d_cut_1.csv") # save the data
d_cut_1.isnull().sum()
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
Age_Group      0
dtype: int64
d_cut_1.head(10)
   PassengerId  Survived  Pclass  Name                                               Sex     Age        SibSp  Parch  Ticket            Fare     Embarked  Age_Group
0  1            0         3       Braund, Mr. Owen Harris                            male    22.000000  1      0      A/5 21171         7.2500   S         2
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.000000  1      0      PC 17599          71.2833  C         3
2  3            1         3       Heikkinen, Miss. Laina                             female  26.000000  0      0      STON/O2. 3101282  7.9250   S         2
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.000000  1      0      113803            53.1000  S         3
4  5            0         3       Allen, Mr. William Henry                           male    35.000000  0      0      373450            8.0500   S         3
5  6            0         3       Moran, Mr. James                                   male    29.699118  0      0      330877            8.4583   Q         2
6  7            0         1       McCarthy, Mr. Timothy J                            male    54.000000  0      0      17463             51.8625  S         4
7  8            0         3       Palsson, Master. Gosta Leonard                     male    2.000000   3      1      349909            21.0750  S         1
8  9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.000000  0      2      347742            11.1333  S         2
9  10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14.000000  1      0      237736            30.0708  C         1
d_cut_2 = data.copy()
# split Age into 5 groups at the specified bin edges
d_cut_2["Age_Group"] = pd.cut(d_cut_2["Age"],bins=[0,5,15,30,50,90],right=False,labels=[1,2,3,4,5]) 
d_cut_2.to_csv("d_cut_2.csv")
d_cut_2.isnull().sum()
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
Age_Group      0
dtype: int64
d_cut_2.head(10)
   PassengerId  Survived  Pclass  Name                                               Sex     Age        SibSp  Parch  Ticket            Fare     Embarked  Age_Group
0  1            0         3       Braund, Mr. Owen Harris                            male    22.000000  1      0      A/5 21171         7.2500   S         3
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.000000  1      0      PC 17599          71.2833  C         4
2  3            1         3       Heikkinen, Miss. Laina                             female  26.000000  0      0      STON/O2. 3101282  7.9250   S         3
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.000000  1      0      113803            53.1000  S         4
4  5            0         3       Allen, Mr. William Henry                           male    35.000000  0      0      373450            8.0500   S         4
5  6            0         3       Moran, Mr. James                                   male    29.699118  0      0      330877            8.4583   Q         3
6  7            0         1       McCarthy, Mr. Timothy J                            male    54.000000  0      0      17463             51.8625  S         5
7  8            0         3       Palsson, Master. Gosta Leonard                     male    2.000000   3      1      349909            21.0750  S         1
8  9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.000000  0      2      347742            11.1333  S         3
9  10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14.000000  1      0      237736            30.0708  C         2
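
The two pd.cut calls above differ in how the bin edges are chosen. A minimal sketch on a handful of made-up ages (the values are assumptions for illustration) shows the effect, including how right=False moves a boundary value such as 15 into the next bin:

```python
import pandas as pd

# Hypothetical ages, chosen to sit on and around the bin edges.
ages = pd.Series([2, 14, 15, 22, 38, 54, 80])

# bins=5: five equal-width intervals spanning min..max of the data.
equal_width = pd.cut(ages, bins=5, labels=[1, 2, 3, 4, 5])

# Explicit edges; right=False makes each interval closed on the left: [0, 5), [5, 15), ...
custom = pd.cut(ages, bins=[0, 5, 15, 30, 50, 90], right=False, labels=[1, 2, 3, 4, 5])

print(equal_width.tolist())
print(custom.tolist())
```

Equal-width bins depend on the range of the data, while explicit edges stay fixed, so the same age can land in different groups under the two schemes.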
d_cut_3 = data.copy()
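# split Age by quantiles (shares of the data) with the qcut function
d_cut_3["Age_Group"] = pd.qcut(d_cut_3["Age"],q=[0,.1,.3,.5,.7,.9],labels=[1,2,3,4,5]) 
d_cut_3.to_csv("d_cut_3.csv")
d_cut_3.isnull().sum() # 89 samples (the top 10%) are not assigned to any group, because the quantile edges stop at 0.9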
PassengerId     0
Survived        0
Pclass          0
Name            0
Sex             0
Age             0
SibSp           0
Parch           0
Ticket          0
Fare            0
Embarked        0
Age_Group      89
dtype: int64
d_cut_3.head(10)
   PassengerId  Survived  Pclass  Name                                               Sex     Age        SibSp  Parch  Ticket            Fare     Embarked  Age_Group
0  1            0         3       Braund, Mr. Owen Harris                            male    22.000000  1      0      A/5 21171         7.2500   S         2.0
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.000000  1      0      PC 17599          71.2833  C         5.0
2  3            1         3       Heikkinen, Miss. Laina                             female  26.000000  0      0      STON/O2. 3101282  7.9250   S         3.0
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.000000  1      0      113803            53.1000  S         5.0
4  5            0         3       Allen, Mr. William Henry                           male    35.000000  0      0      373450            8.0500   S         5.0
5  6            0         3       Moran, Mr. James                                   male    29.699118  0      0      330877            8.4583   Q         3.0
6  7            0         1       McCarthy, Mr. Timothy J                            male    54.000000  0      0      17463             51.8625  S         NaN
7  8            0         3       Palsson, Master. Gosta Leonard                     male    2.000000   3      1      349909            21.0750  S         1.0
8  9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.000000  0      2      347742            11.1333  S         3.0
9  10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14.000000  1      0      237736            30.0708  C         1.0
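
The unassigned rows come from the quantile list ending at 0.9: anything above the 90th percentile falls outside every bin. A small sketch on made-up ages (the series and the six-group variant are assumptions for illustration, not the author's grouping) shows that extending the list to 1 covers every sample:

```python
import pandas as pd

# Hypothetical ages 1..100, so each percentile is easy to read off.
ages = pd.Series(range(1, 101), dtype="float64")

# Edges stop at the 0.9 quantile: values above it get no group (NaN).
partial = pd.qcut(ages, q=[0, .1, .3, .5, .7, .9], labels=[1, 2, 3, 4, 5])

# Edges end at 1: every value is assigned to a group.
full = pd.qcut(ages, q=[0, .1, .3, .5, .7, .9, 1], labels=[1, 2, 3, 4, 5, 6])

print(partial.isnull().sum(), full.isnull().sum())
```

Whether the top 10% should form its own group or be merged into the last one is a modeling choice; merging would mean using q=[0, .1, .3, .5, .7, 1] with five labels.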

Task05: Convert text variables

train_data.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
train_data.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
train_data.nunique() # Cabin has 147 distinct values and many missing values, so it cannot be mapped directly onto 5 classes
PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64
train_data.head(10)
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
5  6            0         3       Moran, Mr. James                                   male    NaN   0      0      330877            8.4583   NaN    Q
6  7            0         1       McCarthy, Mr. Timothy J                            male    54.0  0      0      17463             51.8625  E46    S
7  8            0         3       Palsson, Master. Gosta Leonard                     male    2.0   3      1      349909            21.0750  NaN    S
8  9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0  0      2      347742            11.1333  NaN    S
9  10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14.0  1      0      237736            30.0708  NaN    C
train_data.sort_values(by="Cabin",ascending=True)[0:220:6] 

# Sort and slice to see what Cabin values look like: in this sample they start with a letter from A to F or are missing, 7 categories in all
# Idea: extract the first character of Cabin as a deck letter, then convert it to a number
     PassengerId  Survived  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket      Fare      Cabin        Embarked
583  584          0         1       Ross, Mr. John Hugo                           male    36.0  0      0      13049       40.1250   A10          C
867  868          0         1       Roebling, Mr. Washington Augustus II          male    31.0  0      0      PC 17590    50.4958   A24          S
96   97           0         1       Goldschmidt, Mr. George B                     male    71.0  0      0      PC 17754    34.6542   A5           C
329  330          1         1       Hippach, Miss. Jean Gertrude                  female  16.0  0      1      111361      57.9792   B18          C
61   62           1         1       Icard, Miss. Amelie                           female  38.0  0      0      113572      80.0000   B28          NaN
487  488          0         1       Kent, Mr. Edward Austin                       male    58.0  0      0      11771       29.7000   B37          C
484  485          1         1       Bishop, Mr. Dickinson H                       male    25.0  1      0      11967       91.0792   B49          C
679  680          1         1       Cardeza, Mr. Thomas Drake Martinez            male    36.0  0      1      PC 17755    512.3292  B51 B53 B55  C
671  672          0         1       Davidson, Mr. Thornton                        male    31.0  1      0      F.C. 12750  52.0000   B71          S
195  196          1         1       Lurette, Miss. Elise                          female  58.0  0      0      PC 17569    146.5208  B80          C
802  803          1         1       Carter, Master. William Thornton II           male    11.0  1      2      113760      120.0000  B96 B98      S
110  111          0         1       Porter, Mr. Walter Chamberlain                male    47.0  0      0      110465      52.0000   C110         S
711  712          0         1       Klaber, Mr. Herman                            male    NaN   0      0      113028      26.5500   C124         S
889  890          1         1       Behr, Mr. Karl Howell                         male    26.0  0      0      111369      30.0000   C148         C
341  342          1         1       Fortune, Miss. Alice Elizabeth                female  24.0  3      2      19950       263.0000  C23 C25 C27  S
716  717          1         1       Endres, Miss. Caroline Louise                 female  38.0  0      0      PC 17757    227.5250  C45          C
430  431          1         1       Bjornstrom-Steffansson, Mr. Mauritz Hakan     male    28.0  0      0      110564      26.5500   C52          S
698  699          0         1       Thayer, Mr. John Borland                      male    49.0  1      1      17421       110.8833  C68          C
230  231          1         1       Harris, Mrs. Henry Birkhardt (Irene Wallach)  female  35.0  1      0      36973       83.4750   C83          S
332  333          0         1       Graham, Mr. George Edward                     male    38.0  0      1      PC 17582    153.4625  C91          S
269  270          1         1       Bissette, Miss. Amelia                        female  35.0  0      0      PC 17760    135.6333  C99          S
218  219          1         1       Bazzani, Miss. Albina                         female  32.0  0      0      11813       76.2917   D15          C
457  458          1         1       Kenyon, Mrs. Frederick R (Marion)             female  NaN   1      0      17464       51.8625   D21          S
52   53           1         1       Harper, Mrs. Henry Sleeper (Myna Haxtun)      female  49.0  1      0      PC 17572    76.7292   D33          C
740  741          1         1       Hawksford, Mr. Walter James                   male    NaN   0      0      16988       30.0000   D45          S
21   22           1         2       Beesley, Mr. Lawrence                         male    34.0  0      0      248698      13.0000   D56          S
303  304          1         2       Keane, Miss. Nora A                           female  NaN   0      0      226593      12.3500   E101         Q
707  708          1         1       Calderhead, Mr. Edward Pennington             male    42.0  0      0      PC 17476    26.2875   E24          S
166  167          1         1       Chibnall, Mrs. (Edith Martha Bowerman)        female  NaN   0      1      113505      55.0000   E33          S
434  435          0         1       Silvey, Mr. William Baird                     male    50.0  1      0      13507       55.9000   E44          S
262  263          0         1       Taussig, Mr. Emil                             male    52.0  1      1      110413      79.6500   E67          S
128  129          1         3       Peter, Miss. Anna                             female  NaN   1      1      2668        22.3583   F E69        C
148  149          0         2       Navratil, Mr. Michel ("Louis M Hoffman")      male    36.5  0      2      230080      26.0000   F2           S
618  619          1         2       Becker, Miss. Marion Louise                   female  4.0   2      1      230136      39.0000   F4           S
0    1            0         3       Braund, Mr. Owen Harris                       male    22.0  1      0      A/5 21171   7.2500    NaN          S
9    10           1         2       Nasser, Mrs. Nicholas (Adele Achem)           female  14.0  1      0      237736      30.0708   NaN          C
17   18           1         2       Williams, Mr. Charles Eugene                  male    NaN   0      0      244373      13.0000   NaN          S
train_data["Sex_num"]=train_data["Sex"].map({"male":1,"female":2}) # convert Sex to a two-valued numeric variable with map()
train_data.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  Sex_num
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         1
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         2
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         2
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S         2
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S         1
train_data["Embarked"].value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
train_data["Embarked_num"]=train_data["Embarked"].replace(["S","C","Q"],[1,2,3]) # encode the port of embarkation as a number with replace()
train_data.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  Sex_num  Embarked_num
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         1        1.0
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         2        2.0
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         2        1.0
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S         2        1.0
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S         1        1.0
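train_data["Cabin_str"]=train_data["Cabin"].str[0] # take the first character of Cabin as a deck letter, to be converted to a number next (not fully accurate, since some entries list several cabins)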
train_data["Cabin_str"].value_counts()
C    59
B    47
D    33
E    32
A    15
F    13
G     4
T     1
Name: Cabin_str, dtype: int64
train_data[train_data["Cabin_str"]=="T"] # look at the unexpected value
     PassengerId  Survived  Pclass  Name                          Sex   Age   SibSp  Parch  Ticket  Fare  Cabin  Embarked  Sex_num  Embarked_num  Cabin_num  Cabin_str
339  340          0         1       Blackwell, Mr. Stephen Weart  male  45.0  0      0      113784  35.5  T      S         1        1.0           T          T
train_data["Cabin_num"]=train_data["Cabin_str"].replace(["A","B","C","D","E","F","G","T"],[1,2,3,4,5,6,7,8])

# convert the letters to numbers with replace()
train_data.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  Sex_num  Embarked_num  Cabin_num  Cabin_str
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         1        1.0           NaN        NaN
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         2        2.0           3.0        C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         2        1.0           NaN        NaN
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S         2        1.0           3.0        C
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S         1        1.0           NaN        NaN
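
The letter-to-number step can also be written with map() and a dictionary, the same way Sex was encoded above. This is only a sketch on a made-up sample in Cabin's format; the names `cabin`, `deck` and `codes` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the same format as the Cabin column.
cabin = pd.Series(["C85", np.nan, "B51 B53 B55", "F E69", "T"])

deck = cabin.str[0]  # first character; missing values stay NaN

# Letter -> number lookup: A=1, B=2, ..., G=7, T=8.
codes = {letter: i for i, letter in enumerate("ABCDEFGT", start=1)}

# map() turns any value that is not a key of the dict into NaN.
deck_num = deck.map(codes)

print(deck_num.tolist())
```

Unlike replace(), map() leaves no unmapped letter behind as a string, so an unexpected value shows up as NaN instead of mixing text and numbers in one column.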
pd.get_dummies(train_data["Age"])  #将One-hot编码concat到原数据上即可
0.420.670.750.830.921.02.03.04.05.0...62.063.064.065.066.070.070.571.074.080.0
00000000000...0000000000
10000000000...0000000000
20000000000...0000000000
30000000000...0000000000
40000000000...0000000000
50000000000...0000000000
60000000000...0000000000
70000001000...0000000000
80000000000...0000000000
90000000000...0000000000
100000000010...0000000000
110000000000...0000000000
120000000000...0000000000
130000000000...0000000000
140000000000...0000000000
150000000000...0000000000
160000001000...0000000000
170000000000...0000000000
180000000000...0000000000
190000000000...0000000000
200000000000...0000000000
210000000000...0000000000
220000000000...0000000000
230000000000...0000000000
240000000000...0000000000
250000000000...0000000000
260000000000...0000000000
270000000000...0000000000
280000000000...0000000000
290000000000...0000000000
..................................................................
8610000000000...0000000000
8620000000000...0000000000
8630000000000...0000000000
8640000000000...0000000000
8650000000000...0000000000
8660000000000...0000000000
8670000000000...0000000000
8680000000000...0000000000
8690000000010...0000000000
8700000000000...0000000000
8710000000000...0000000000
8720000000000...0000000000
8730000000000...0000000000
8740000000000...0000000000
8750000000000...0000000000
8760000000000...0000000000
8770000000000...0000000000
8780000000000...0000000000
8790000000000...0000000000
8800000000000...0000000000
8810000000000...0000000000
8820000000000...0000000000
8830000000000...0000000000
8840000000000...0000000000
8850000000000...0000000000
8860000000000...0000000000
8870000000000...0000000000
8880000000000...0000000000
8890000000000...0000000000
8900000000000...0000000000

891 rows × 88 columns

for col in ["Age","Cabin_str","Sex"]:  # one-hot encode with the get_dummies function
    x = pd.get_dummies(train_data[col],prefix=col,dummy_na=True) # Cabin has null values, so add an indicator column for NaN
    train_data = pd.concat([train_data,x],axis=1)
train_data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare',
       ...
       'Cabin_str_C', 'Cabin_str_D', 'Cabin_str_E', 'Cabin_str_F',
       'Cabin_str_G', 'Cabin_str_T', 'Cabin_str_nan', 'Sex_female', 'Sex_male',
       'Sex_nan'],
      dtype='object', length=218)
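
The effect of prefix and dummy_na is easier to see on a tiny made-up column (the frame and the column name `deck` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical low-cardinality column with one missing value.
toy = pd.DataFrame({"deck": ["C", np.nan, "B", "C"]})

# One indicator column per category, plus one for missing values.
dummies = pd.get_dummies(toy["deck"], prefix="deck", dummy_na=True)

# Append the indicator columns to the original frame.
encoded = pd.concat([toy, dummies], axis=1)

print(list(encoded.columns))
```

Each row has exactly one indicator set. One-hot encoding a column with many distinct values, such as the raw Age above, produces one column per value, so it may be worth binning such a column first.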

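get_dummies(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.get_dummies.html?highlight=get_dummies#pandas.Series.str.get_dummies (note: this page documents Series.str.get_dummies; the code above calls the top-level pd.get_dummies, which is a separate function)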

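Task06: Extract features from plain text

This is done with the .str.extract() function and a regular expression.

train_data["Name"] # look at the data first to help write the regular expression; the pattern is "xxx, [target]. xxxx"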
0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
5                                       Moran, Mr. James
6                                McCarthy, Mr. Timothy J
7                         Palsson, Master. Gosta Leonard
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                    Nasser, Mrs. Nicholas (Adele Achem)
10                       Sandstrom, Miss. Marguerite Rut
11                              Bonnell, Miss. Elizabeth
12                        Saundercock, Mr. William Henry
13                           Andersson, Mr. Anders Johan
14                  Vestrom, Miss. Hulda Amanda Adolfina
15                      Hewlett, Mrs. (Mary D Kingcome) 
16                                  Rice, Master. Eugene
17                          Williams, Mr. Charles Eugene
18     Vander Planke, Mrs. Julius (Emelia Maria Vande...
19                               Masselmani, Mrs. Fatima
20                                  Fynney, Mr. Joseph J
21                                 Beesley, Mr. Lawrence
22                           McGowan, Miss. Anna "Annie"
23                          Sloper, Mr. William Thompson
24                         Palsson, Miss. Torborg Danira
25     Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...
26                               Emir, Mr. Farred Chehab
27                        Fortune, Mr. Charles Alexander
28                         O'Dwyer, Miss. Ellen "Nellie"
29                                   Todoroff, Mr. Lalio
                             ...                        
861                          Giles, Mr. Frederick Edward
862    Swift, Mrs. Frederick Joel (Margaret Welles Ba...
863                    Sage, Miss. Dorothy Edith "Dolly"
864                               Gill, Mr. John William
865                             Bystrom, Mrs. (Karolina)
866                         Duran y More, Miss. Asuncion
867                 Roebling, Mr. Washington Augustus II
868                          van Melkebeke, Mr. Philemon
869                      Johnson, Master. Harold Theodor
870                                    Balkic, Mr. Cerin
871     Beckwith, Mrs. Richard Leonard (Sallie Monypeny)
872                             Carlsson, Mr. Frans Olof
873                          Vander Cruyssen, Mr. Victor
874                Abelson, Mrs. Samuel (Hannah Wizosky)
875                     Najib, Miss. Adele Kiamie "Jane"
876                        Gustafsson, Mr. Alfred Ossian
877                                 Petroff, Mr. Nedelio
878                                   Laleff, Mr. Kristo
879        Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)
880         Shelley, Mrs. William (Imanita Parrish Hall)
881                                   Markun, Mr. Johann
882                         Dahlberg, Miss. Gerda Ulrika
883                        Banfield, Mr. Frederick James
884                               Sutehall, Mr. Henry Jr
885                 Rice, Mrs. William (Margaret Norton)
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
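train_data["Title"]=train_data["Name"].str.extract(r"([A-Za-z]+)\.",expand=False) # extract the run of letters immediately before a period; the raw string keeps \. as an escaped period

Regular expression reference: https://c.runoob.com/front-end/854/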

train_data["Title"].value_counts()
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Don           1
Ms            1
Capt          1
Sir           1
Lady          1
Mme           1
Jonkheer      1
Countess      1
Name: Title, dtype: int64
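
The extraction can be checked on a few names copied from the listing above. A minimal sketch (the variable names are assumptions for illustration):

```python
import pandas as pd

# A few names taken from the Name column shown earlier.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Montvila, Rev. Juozas",
])

# Capture the letters directly before the first period.
# The raw string passes \. to the regex engine as an escaped period.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

print(titles.tolist())
```

With expand=False and a single capture group the result is a Series, so it can be assigned straight to a new column.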
train_data.to_csv("my_data.csv")
