Titanic_Data analysis


大大大房子


Original article: https://blog.csdn.net/weixin_47339994/article/details/108114848



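This project follows the Datawhale data analysis group in working through the classic Titanic dataset, moving from simple to more advanced analysis. It is suitable for beginners.
Dataset download: https://www.kaggle.com/c/titanic/overview

Part 1 Exploring the Titanic dataset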

import numpy as np
import pandas as pd

# Read the file and look at the first three rows
df = pd.read_csv('train.csv')
df.head(3)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S

# Read the file in chunks of 1000 records each (the result is an iterator over chunks, not a DataFrame)
chunker = pd.read_csv("train.csv", chunksize=1000)
print(chunker)

<pandas.io.parsers.TextFileReader object at 0x7f5634984cd0>

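Rename the header to Chinese and use the passenger ID as the index

# Note: header=0 is required here, not header=None; otherwise the original English header row is read in as a data row below the new names
df = pd.read_csv('train.csv', names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'], index_col='乘客ID', header=0)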
df.head()

# Show basic information about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   是否幸存    891 non-null    int64  
 1   仓位等级    891 non-null    int64  
 2   姓名      891 non-null    object 
 3   性别      891 non-null    object 
 4   年龄      714 non-null    float64
 5   兄弟姐妹个数  891 non-null    int64  
 6   父母子女个数  891 non-null    int64  
 7   船票信息    891 non-null    object 
 8   票价      891 non-null    float64
 9   客舱      204 non-null    object 
 10  登船港口    889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

# View the first 10 rows
df.head(10)

# View the last 15 rows
df.tail(15)

# Count missing values in each column
df.isnull().sum()

是否幸存        0
仓位等级        0
姓名          0
性别          0
年龄        177
兄弟姐妹个数      0
父母子女个数      0
船票信息        0
票价          0
客舱        687
登船港口        2
dtype: int64

The 年龄 (age) and 客舱 (cabin) columns have many missing values; they could be filled with the mean or the mode.

df.describe()

	是否幸存	仓位等级	年龄	兄弟姐妹个数	父母子女个数	票价
count	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

# Save the dataset
df.to_csv('train_chinese.csv')

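Pandas basics

Distinguishing the two pandas data types: DataFrame and Series

Mixing up data types is a common cause of errors, so this section walks through the three types you will meet most often, DataFrame, Series and ndarray, and how to tell them apart:

ndarray

ndarray is the array type provided by the NumPy library. NumPy can create n-dimensional array objects, but all elements must share the same type, for example:

# Create an array with NumPy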
import numpy as np
arr = np.array([[1,2,3],[4,5,6],[7,8,9]], dtype=np.int32)
print(arr)

Output:
[[1 2 3]
[4 5 6]
[7 8 9]]

print(arr.shape)  # array shape
print(arr.dtype)   # element type

Output:
(3, 3)
int32

This is an array with 3 rows and 3 columns holding integer (int32) values

Common ndarray operations:

print(np.zeros(10))    # array of 10 zeros
print(np.ones((2,3,4)))   # 2x3x4 array of ones

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

[[[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]
 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]

arr.shape  # array shape (note: no parentheses, shape is an attribute)
arr.dtype  # element type of the array
arr.astype(np.int32)  # return a copy converted to int32
arr.mean()  # mean
arr.sum()  # sum
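
To make the "all elements share one type" rule concrete, here is a minimal sketch. The array names `mixed` and `as_int` are made up for illustration; the point is that NumPy upcasts mixed input to a common dtype and that `astype` returns a new array rather than changing the original.

```python
import numpy as np

# Mixed int/float input is upcast so that every element has the same dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# astype returns a new array; the fractional part is truncated for int targets.
as_int = mixed.astype(np.int32)
print(as_int)
print(mixed)  # the original array is unchanged
```

If the truncation is not wanted, rounding first (for example with `np.round`) is the usual alternative.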

Series

# Create a Series
s = pd.Series(['a','b','c','d'])
print(s)

0    a
1    b
2    c
3    d
dtype: object

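As shown, when no index is given a Series automatically gets an integer index in the leading column. You can also specify any index you like:

# Use a, b, c, d, e as the index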
s1 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
print(s1)

a    1
b    2
c    3
d    4
e    5
dtype: int64

This is a bit like a dictionary: the index of a Series plays the role of the dict keys, and the values play the role of the dict values.
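
The dictionary analogy can be sketched as follows. The names `prices`, `other` and `total` are illustrative only. The example also shows one way a Series goes beyond a dict: arithmetic aligns on labels.

```python
import pandas as pd

# Keys become the index, values become the data.
prices = pd.Series({'a': 1, 'b': 2, 'c': 3})

print(prices['b'])       # label lookup, like dict access
print('c' in prices)     # membership tests the index, like dict keys
print(prices.to_dict())  # back to a plain dict

# Unlike a dict, arithmetic is vectorised and aligned by label;
# labels present on only one side produce NaN.
other = pd.Series({'b': 10, 'c': 20, 'd': 30})
total = prices + other
print(total)
```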

Common operations:

# View the values
s1.values

array([1, 2, 3, 4, 5])

# View the index
s1.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

# View the value at index a in s1; prints 1
print(s1['a'])

s1.isnull()   # check for missing values; returns booleans

a    False
b    False
c    False
d    False
e    False
dtype: bool

s1.notnull()  # check for non-missing values; returns booleans

a    True
b    True
c    True
d    True
e    True
dtype: bool

s1.sort_index()  # sort by index
s1.sort_values()  # sort by value

Series objects come up all the time: selecting a single column of a DataFrame returns a Series.

DataFrame

A DataFrame is like a table, with both a row index and a column index

# Create a 4x5 DataFrame of random values in [0, 1), with row and column labels
df = pd.DataFrame(np.random.rand(4,5),index=list("ABCD"),columns=list('abcde'))
print(df)

          a         b         c         d         e
A  0.789380  0.071060  0.551364  0.107874  0.490083
B  0.215729  0.803024  0.611663  0.774096  0.032621
C  0.442154  0.685745  0.330844  0.654136  0.769399
D  0.762726  0.461278  0.748808  0.400604  0.375030

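Some common DataFrame operations:

The first element of column a can be read in several ways:

# Read the first element of column a:

print(df['a'].iloc[0])  # Method 1 (df['a'][0] relies on positional fallback, which newer pandas versions deprecate for a non-integer index)

# .loc selects by label
print(df.loc['A','a'])  # Method 2

# print(df.loc['A':'C'])  # rows A to C, all columns
 
# .loc[n] selects the row labelled n (when the index holds integers)
# .loc['d'] selects the row labelled 'd' (when the index holds strings)

print(df.iloc[0].iloc[0])   # Method 3
print(df.iloc[0,0])    # Method 4
# print(df.iloc[0:3])  # rows at positions 0 to 2, all columns

# .iloc selects rows by integer position; labels are not accepted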

Methods 1-4 all output:

0.7893799882122178

The above also shows the difference between .loc and .iloc:

df.loc[['A','B'],['a','b']]  # select rows and columns by label

          a         b
A  0.789380  0.071060
B  0.215729  0.803024

df.loc['A':'B',['a','b']]  # equivalent to df.loc[['A','B'],['a','b']]

          a         b
A  0.789380  0.071060
B  0.215729  0.803024

df.iloc[0].iloc[0]  # first element of column a

0.7893799882122178

df.iloc[0,0]    # equivalent to df.iloc[0].iloc[0]

0.7893799882122178

df.iloc[0:3]  # rows at positions 0 to 2, all columns; equivalent to df.loc['A':'C']

	a	b	c	d	e
A	0.789380	0.071060	0.551364	0.107874	0.490083
B	0.215729	0.803024	0.611663	0.774096	0.032621
C	0.442154	0.685745	0.330844	0.654136	0.769399

df.loc['A':'C']

	a	b	c	d	e
A	0.789380	0.071060	0.551364	0.107874	0.490083
B	0.215729	0.803024	0.611663	0.774096	0.032621
C	0.442154	0.685745	0.330844	0.654136	0.769399

The difference between .loc and .iloc: .iloc selects only by integer position, while .loc selects by label (string labels in this example).
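
One detail that is easy to miss is how the two accessors treat slice end points. The sketch below uses a small made-up frame `demo` with predictable values (not the random frame above): a label slice with `.loc` includes both ends, while a position slice with `.iloc` excludes the end.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(20).reshape(4, 5),
                    index=list('ABCD'), columns=list('abcde'))

by_label = demo.loc['A':'C']    # label slice: both ends included -> rows A, B, C
by_position = demo.iloc[0:3]    # position slice: end excluded -> positions 0, 1, 2

print(by_label.equals(by_position))          # the same three rows
print(demo.loc['B', 'c'], demo.iloc[1, 2])   # the same cell reached two ways
```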

For more detail on the differences and usage of .loc and .iloc, see:
https://blog.csdn.net/qq_21840201/article/details/80725433

Other common DataFrame operations:

df1 = df.drop('b',axis=1)  # drop column 'b'; axis=0 means rows, axis=1 means columns, and rows are the default
print(df1)

          c         d         e
A  0.551364  0.107874  0.490083
B  0.611663  0.774096  0.032621
C  0.330844  0.654136  0.769399
D  0.748808  0.400604  0.375030

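# Drop duplicate rows based on columns b and c
df.drop_duplicates(['b','c'],keep='first',inplace=True) 

df = df.fillna(5)  # fill missing values (NaN) with 5
df = df.dropna()  # drop rows that contain NaN
df.replace(1, -1)  # replace 1 with -1
df['a'].unique()  # unique values of a column (unique() is a Series method)
df.reset_index()  # reset the index; drop=True discards the old index instead of keeping it as a column
df.columns  # column names of df

df.sort_values(by='a')  # sort by column a

pd.merge(df1, df2)  # key-based join: matches rows on shared columns and puts df2's columns to the right of df1's

pd.concat([df1, df2])  # stack rows by default (axis=0): df2 goes below df1
df['a'].apply(lambda x: x * 2)  # apply a function to every value of a column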

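That covers the basics of ndarray, Series and DataFrame and their common operations. Next we move on to some slightly more advanced operations and exploration on the Titanic dataset.

# Load train.csv
data = pd.read_csv('titanic_dataset/train.csv')

data.columns  # view the column names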

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

# View the 'Cabin' column
data.Cabin.head(3)   # Method 1

0    NaN
1    C85
2    NaN
Name: Cabin, dtype: object

data['Cabin'].head(3)  # Method 2

0    NaN
1    C85
2    NaN
Name: Cabin, dtype: object

Load test_1.csv, compare it with train.csv, and delete the extra columns in test_1.csv

# Load test_1.csv and list its columns
ts = pd.read_csv('titanic_dataset/test_1.csv')
print(ts.columns)

Index(['Unnamed: 0', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
       'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'a'],
      dtype='object')

# Delete a column with del
del ts['a']      # the attribute form (del ts.a) does not work for deleting a column
print(ts.columns)

Index(['Unnamed: 0', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
       'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

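# DataFrame.drop() can also remove rows or columns
print(ts.drop(columns=['Unnamed: 0']))  # drop a column; ts itself is unchanged unless inplace=True is passed
print(ts.drop([13], axis=0))  # drop the row labelled 13
# print(ts.drop(['Unnamed: 0', 'Cabin'], axis=1))  # drop several columns

Filtering with conditions in pandas

Show passengers younger than 10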

data[data['Age']<10].head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
16	17	0	3	Rice, Master. Eugene	male	2.0	4	1	382652	29.1250	NaN	Q
24	25	0	3	Palsson, Miss. Torborg Danira	female	8.0	3	1	349909	21.0750	NaN	S
43	44	1	2	Laroche, Miss. Simonne Marie Anne Andree	female	3.0	1	2	SC/Paris 2123	41.5792	NaN	C

# Select passengers older than 10 and younger than 50 and name the result midage

midage = data[(data['Age']>10) & (data['Age']<50)]
print(midage.head(3))

Show 'Pclass' and 'Sex' for row 100 of midage (using df.reset_index(drop=True))

mid_age = data[(data['Age']>10) & (data['Age']<50)]
print(mid_age.loc[[100],['Pclass','Sex']])  # without resetting the index, this returns the same row as the original dataset
print(data.loc[[100],['Pclass','Sex']])

     Pclass     Sex
100       3  female
     Pclass     Sex
100       3  female

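# Show 'Pclass' and 'Sex' for row 100 of midage
print(mid_age.loc[[100],['Pclass','Sex']])  # without resetting the index, .loc still looks up the label from the original train data, so this is not the row at position 100 of the filtered data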

     Pclass     Sex
100       3  female

Use reset_index(drop=True) to renumber midage from 0 (drop=True discards the old index; the default is False)

midage = midage.reset_index(drop=True)  
print(midage.head())

# The correct 'Pclass' and 'Sex' for row 100 (the 101st row) of midage
midage.loc[[100],['Pclass','Sex']]

	Pclass	Sex
100	2	male

# Show 'Pclass', 'Name' and 'Sex' for rows 100, 105 and 108 (the 101st, 106th and 109th rows) of midage

midage.loc[[100,105,108],['Pclass','Name','Sex']]

# .iloc works too, but it accepts integer positions only, not labels

midage.iloc[[100,105,108],[2,3,4]]

	Pclass	Name	Sex
100	2	Byles, Rev. Thomas Roussel Davids	male
105	3	Cribb, Mr. John Hatfield	male
108	3	Calic, Mr. Jovo	male

mid_age.loc[[100,105,108],['Pclass','Name','Sex']]  # without the index reset, the lookup returns the wrong rows

	Pclass	Name	Sex
100	3	Petranec, Miss. Matilda	female
105	3	Mionoff, Mr. Stoytcho	male
108	3	Rekic, Mr. Tido	male
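
The filter-then-reset pattern above can be reproduced on a tiny made-up frame. The names `people`, `adults` and `renumbered` are illustrative. The sketch shows that a boolean filter keeps the original labels, that `reset_index(drop=True)` renumbers them, and that `.iloc` gives the same row either way because it ignores labels.

```python
import pandas as pd

people = pd.DataFrame({'Age': [5, 20, 60, 30, 45],
                       'Sex': ['male', 'female', 'male', 'female', 'male']})

adults = people[(people['Age'] > 10) & (people['Age'] < 50)]
print(adults.index.tolist())       # the original labels survive the filter

renumbered = adults.reset_index(drop=True)
print(renumbered.index.tolist())   # fresh labels starting at 0

# Position-based access does not depend on the labels at all.
second_by_position = adults.iloc[1]['Age']
second_by_new_label = renumbered.loc[1, 'Age']
print(second_by_position, second_by_new_label)
```

When only "the n-th row of the filtered data" is needed, `.iloc` on the filtered frame avoids the reset entirely.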

Load train_chinese.csv and sort by ['票价', '年龄'] in descending order

text = pd.read_csv('titanic_dataset/train_chinese.csv')
print(text.head(3))

text.sort_values(['票价','年龄'], ascending=False).head(3)

	乘客ID	是否幸存	仓位等级	姓名	性别	年龄	兄弟姐妹个数	父母子女个数	船票信息	票价	客舱	登船港口
679	680	1	1	Cardeza, Mr. Thomas Drake Martinez	male	36.0	0	1	PC 17755	512.3292	B51 B53 B55	C
258	259	1	1	Ward, Miss. Anna	female	35.0	0	0	PC 17755	512.3292	NaN	C
737	738	1	1	Lesurer, Mr. Gustave J	male	35.0	0	0	PC 17755	512.3292	B101	C

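In these top three rows, the passengers with the most expensive tickets were in their mid-thirties, travelled first class and all survived. Three rows are not evidence of a trend, but they fit the intuition that money can sometimes save your life at a critical moment (so work hard and earn some = =).

Later sections visualise how age group, fare and passenger class relate to survival.

Find the size of the largest family on board (siblings/spouses + parents/children)

# Method 1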
text['家族人数'] = text['兄弟姐妹个数']+ text['父母子女个数']
text.sort_values(['家族人数'], ascending=False).head(10)

text.drop(['家族人数'],axis=1).head(3)

	乘客ID	是否幸存	仓位等级	姓名	性别	年龄	兄弟姐妹个数	父母子女个数	船票信息	票价	客舱	登船港口
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S

# Method 2
max(text['兄弟姐妹个数'] + text['父母子女个数'])

Part 2 Data cleaning and feature processing

2.1 Data cleaning

import numpy as np
import pandas as pd

# Check for missing values
tp = pd.read_csv('titanic_dataset/train.csv')
print(tp.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

# Handle the missing values
# Fill the Age column with its mean (.median() or .mode()[0] are alternatives)
tp['Age'] = tp['Age'].fillna(tp['Age'].mean())

tp.head(10)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.000000	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.000000	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.000000	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.000000	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.000000	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	29.699118	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.000000	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.000000	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.000000	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.000000	1	0	237736	30.0708	NaN	C

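# Fill the Cabin column with the previous non-missing value (forward fill)
tp['Cabin'] = tp['Cabin'].ffill()  # use .bfill() to fill from the next value instead; fillna(method='ffill') is the older spelling, deprecated in newer pandas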

tp.head(5)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	B96 B98	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	G6	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	C123	S

tp['Embarked'] = tp['Embarked'].bfill()

tp.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

All missing values have now been filled.
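
The filling strategies used above behave differently at the edges of a column, which is worth checking before trusting a zero missing-value count. The sketch below uses small made-up Series (`cabins`, `ages`), not the real data: a forward fill cannot fill a gap in the first row, so a backward pass is needed to close it.

```python
import numpy as np
import pandas as pd

cabins = pd.Series([np.nan, 'C85', np.nan, 'C123', np.nan])

forward = cabins.ffill()       # the first value stays NaN: nothing precedes it
both = cabins.ffill().bfill()  # a backward pass then fills the leading gap
print(forward.tolist())
print(both.tolist())

ages = pd.Series([22.0, np.nan, 26.0])
ages_filled = ages.fillna(ages.mean())  # mean of the observed values only
print(ages_filled.tolist())
```

Note that filling Cabin from a neighbouring row assigns one passenger's cabin to another, so it is a convenience for this exercise rather than a meaningful imputation.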

# Check for duplicate rows; if there are any, drop_duplicates() removes them
tp[tp.duplicated()].count()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

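# Save the cleaned data as test_clear.csv
tp.to_csv('titanic_dataset/test_clear.csv')

2.2 Feature processing

The features of this dataset fall into two broad groups:

Numeric features: Survived, Pclass, Age, SibSp, Parch and Fare. Survived and Pclass are discrete; Age and Fare are continuous, and this tutorial also treats the counts SibSp and Parch as continuous.
Text features: Name, Sex, Ticket, Cabin and Embarked. Of these, Sex, Ticket, Cabin and Embarked are categorical.

Numeric features can usually be fed to a model directly, although continuous variables are sometimes discretised to make the model more robust and stable. Text features generally have to be converted to numeric features before modelling.

Bin (discretise) the Age column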

tp['Age'].describe()

count    891.000000
mean      29.699118
std       13.002015
min        0.420000
25%       22.000000
50%       29.699118
75%       35.000000
max       80.000000
Name: Age, dtype: float64

# Method 1: cut the continuous variable Age into 5 equal-width bins, labelled 1 to 5
tp['AgeBand'] = pd.cut(tp['Age'], 5 , labels=['1','2','3','4','5'])

tp.head(3)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	B96 B98	S	2
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	3
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	G6	S	2

tp.to_csv('titanic_dataset/test_ave.csv')

# Method 2: cut Age into the five bands (0,5], (5,15], (15,30], (30,50], (50,80] with pd.cut(), whose bins are right-closed by default, labelled 1 to 5
tp['AgeBand']=pd.cut(tp['Age'],[0,5,15,30,50,80],labels=['1','2','3','4','5'])

tp.head(3)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	B96 B98	S	3
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	4
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	G6	S	3

tp.to_csv('titanic_dataset/test_cut.csv')

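# Method 3: cut Age at the 10%, 30%, 50%, 70% and 90% quantiles with pd.qcut(), labelled 1 to 5
# Note: the quantile list stops at 0.9, so ages above the 90th percentile fall outside every bin and become NaN
tp['AgeBand']=pd.qcut(tp['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels=['1','2','3','4','5'])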

tp.head(3)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	B96 B98	S	2
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	5
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	G6	S	3

tp.to_csv('titanic_dataset/test_pr.csv')
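
The three binning methods differ mainly in where the bin edges come from and which side of each bin is closed. The following sketch uses a short made-up list of ages (the variable names are illustrative) to show equal-width bins, explicit right-closed and left-closed edges, equal-frequency bins, and the NaN that appears when a quantile list does not reach 1.0.

```python
import pandas as pd

ages = pd.Series([1, 4, 10, 22, 26, 35, 38, 54, 80])

equal_width = pd.cut(ages, 4)                                    # 4 bins of equal width
custom = pd.cut(ages, [0, 5, 15, 30, 50, 80])                    # right-closed: (0, 5], (5, 15], ...
left_closed = pd.cut(ages, [0, 5, 15, 30, 50, 80], right=False)  # left-closed: [0, 5), [5, 15), ...
quartiles = pd.qcut(ages, 4, labels=['q1', 'q2', 'q3', 'q4'])    # equal-frequency bins
partial = pd.qcut(ages, [0, 0.5, 0.9])                           # stops at the 90th percentile

print(custom.tolist())
print(left_closed.isna().sum())  # age 80 is outside [50, 80)
print(partial.isna().sum())      # age 80 is above the 90th percentile
```

If the oldest passengers should stay in the last band, either keep the default right-closed bins or extend the quantile list to 1.0.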

Convert text variables

Method 1: number the distinct values of a column in order (with map or a dict)

tp.head(5)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	B96 B98	S	2
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	5
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	G6	S	3
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	5
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	C123	S	5

# Method 1: represent the text variables Sex, Cabin and Embarked with integer codes
tp['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

tp['Sex_num']=tp['Sex'].map({'female': 0, 'male': 1})
# tp['Sex_num']=tp['Sex'].replace(['female','male'],[0,1])

tp.head(3)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand	Sex_num
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	B96 B98	S	2	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	5	0
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	G6	S	3	0

print(tp['Cabin'].value_counts())
print(tp['Cabin'].nunique())

G6             25
B78            21
C78            20
C23 C25 C27    19
F33            19
               ..
D47             1
C111            1
C90             1
C106            1
B4              1
Name: Cabin, Length: 147, dtype: int64
147

The Cabin column has 147 distinct values, so a dict can be used to number them in order of first appearance

label_dict = dict(zip(tp['Cabin'].unique(), range(tp['Cabin'].nunique())))
tp['Cabin_labelEncode']= tp['Cabin'].map(label_dict)

tp.head(5)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand	Sex_num	Cabin_labelEncode
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	B96 B98	S	2	1	0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	5	0	1
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	G6	S	3	0	2
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	5	0	3
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	C123	S	5	1	3

print(tp['Embarked'].value_counts())
print(tp['Embarked'].nunique())

S    645
C    169
Q     77
Name: Embarked, dtype: int64
3

# Number the distinct values of 'Embarked' in the same way
label_dict = dict(zip(tp['Embarked'].unique(), range(tp['Embarked'].nunique())))
tp['Embarked_labelEncode']= tp['Embarked'].map(label_dict)

tp.head(5)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand	Sex_num	Cabin_labelEncode	Embarked_labelEncode
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	B96 B98	S	2	1	0	0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	5	0	1	1
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	G6	S	3	0	2	0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	5	0	3	0
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	C123	S	5	1	3	0

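One-hot encode text columns with pd.get_dummies(data, prefix='...'); the prefix argument is optional and sets a fixed prefix for the new column names

# Method 2: one-hot encode the text variable Embarked (Sex and Cabin work the same way)
x = pd.get_dummies(tp['Embarked'],prefix='Embarked')  # returns a new DataFrame
tp=pd.concat([tp,x], axis=1)  # concatenate column-wise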

tp.head(5)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	AgeBand	Sex_num	Cabin_labelEncode	Embarked_labelEncode	Embarked_C	Embarked_Q	Embarked_S
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	B96 B98	S	2	1	0	0	0	0	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	5	0	1	1	1	0	0
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	G6	S	3	0	2	0	0	0	1
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S	5	0	3	0	0	0	1
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	C123	S	5	1	3	0	0	0	1

# One-hot encode the Age column
# y=pd.get_dummies(pd.cut(tp['Age'],[0,5,15,30,50,80],labels=['1','2','3','4','5']))
y=pd.get_dummies(pd.cut(tp['Age'],[0,5,15,30,50,80]),prefix='Age')

y.head(5)

	Age_(0, 5]	Age_(5, 15]	Age_(15, 30]	Age_(30, 50]	Age_(50, 80]
0	0	0	1	0	0
1	0	0	0	1	0
2	0	0	1	0	0
3	0	0	0	1	0
4	0	0	0	1	0
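
The two encodings discussed in this section can be compared side by side on a toy column. In this sketch `ports`, `codes` and `dummies` are illustrative names, and `dtype=int` is passed so the dummy columns print as 0/1 regardless of the pandas version's default.

```python
import pandas as pd

ports = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

# Label encoding: one integer per category, numbered in order of first appearance.
label_dict = {value: code for code, value in enumerate(ports['Embarked'].unique())}
codes = ports['Embarked'].map(label_dict)
print(codes.tolist())

# One-hot encoding: one 0/1 column per category, columns sorted by category name.
dummies = pd.get_dummies(ports['Embarked'], prefix='Embarked', dtype=int)
encoded = pd.concat([ports, dummies], axis=1)
print(encoded)
```

Label codes impose an arbitrary order on the categories, which linear models may misread as a magnitude; one-hot columns avoid that at the cost of more features.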

Extract a Titles feature (Mr, Miss, Mrs and so on) from the Name column with Series.str.extract(pattern, expand=False|True)

# Extract a feature from a column: df.column.str.extract(pattern, expand=False)
tp['Titles'] = tp.Name.str.extract(r'([A-Za-z]+)\.',expand=False)

# Flag rows whose value contains a given word
#tp.Name.str.contains('Mr')

# Split a column on a separator into two columns
# tp.Name.str.split('.', expand=True)  # without expand=True the result is a single Series of lists

tp.head(5)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	AgeBand	Sex_num	Cabin_labelEncode	Embarked_labelEncode	Embarked_C	Embarked_S	Titles
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	B96 B98	S	2	1	0	0	0	1	Mr
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	5	0	1	1	1	0	Mrs
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	G6	S	3	0	2	0	0	1	Miss
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	5	0	3	0	0	1	Mrs
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	C123	S	5	1	3	0	0	1	Mr
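
The title extraction can be checked in isolation on a few names taken from the tables above. The pattern captures the first run of letters that is immediately followed by a period; `names` and `titles` are illustrative variable names.

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris',
                   'Heikkinen, Miss. Laina',
                   'Futrelle, Mrs. Jacques Heath (Lily May Peel)'])

# One capture group with expand=False returns a Series rather than a DataFrame.
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```

On the full dataset, `titles.value_counts()` is a quick way to see which rare titles might be worth grouping together.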

tp.to_csv('titanic_dataset/test_fin.csv')

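Part 3 Data restructuring and visualisation

3.1 Data restructuring

Ways to combine data (pd.concat(), pd.merge(), df.join(), df.append())

pd.concat([df1, df2],axis=1)  # combine column-wise; the default axis=0 stacks rows
pd.merge(df1, df2, on='key')  # join the two tables on the 'key' column; inner join by default, how='inner'|'outer' selects the join type

For the parameters and usage of pd.merge(), see https://zhuanlan.zhihu.com/p/132579724

# DataFrame's own methods: join() and append()
df1.join(df2)   # column-wise: df2's columns go to the right of df1's, aligned on the index
df1.append(df2) # row-wise: df2 goes below df1 (append() was removed in pandas 2.0; use pd.concat([df1, df2]) there)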
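
A small sketch may help separate key-based joining from plain concatenation. The frames `left` and `right` and their columns are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'name': ['x', 'y', 'z']})
right = pd.DataFrame({'key': [2, 3, 4], 'fare': [7.5, 8.0, 9.5]})

merged = pd.merge(left, right, on='key')               # inner join: only keys 2 and 3 match
outer = pd.merge(left, right, on='key', how='outer')   # keeps keys 1 to 4, filling gaps with NaN
stacked = pd.concat([left, left], ignore_index=True)   # rows appended below
side_by_side = pd.concat([left, right], axis=1)        # columns placed next to each other by index

print(merged)
print(outer.shape, stacked.shape, side_by_side.shape)
```

`pd.concat(..., axis=1)` lines rows up by index label, not by the `key` column, so it is only a substitute for `merge` when both frames already share the same row order and index.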

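Aggregation with groupby()

# groupby() splits the data into groups and runs a computation within each group. Common patterns:
# Group by one column or by several columns
# df.groupby('col')   or   df.groupby(['col1', 'col2'])

# Group col2 by col1 and take the mean of each group (count(), sum(), std() and so on work the same way)
# df.groupby('col1')['col2'].mean()   is equivalent to   df['col2'].groupby(df['col1']).mean()

# Apply several functions to the grouped data with .agg()
# df.groupby('col').agg([np.sum, np.mean, np.std])   is equivalent to   df.groupby('col').agg(['sum', 'mean', 'std'])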
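
The groupby patterns listed above can be run on a tiny made-up frame. The name `trips` and its values are assumptions for illustration, chosen so the group means are easy to verify by hand.

```python
import pandas as pd

trips = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male'],
                      'Pclass': [3, 1, 3, 1],
                      'Fare': [7.25, 71.0, 8.0, 52.0],
                      'Survived': [0, 1, 1, 0]})

# Two equivalent spellings of "mean Fare per Sex".
mean_fare = trips.groupby('Sex')['Fare'].mean()
same_result = trips['Fare'].groupby(trips['Sex']).mean()

# Several aggregations at once; the result has two-level column labels.
summary = trips.groupby('Sex').agg({'Fare': ['mean', 'max'], 'Survived': 'sum'})

print(mean_fare)
print(summary)
```

Passing function names as strings ('sum', 'mean') rather than NumPy functions avoids warnings in recent pandas versions.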

3.2 Data visualisation

# Visualisation often combines groupby() with plot()
import matplotlib.pyplot as plt
%matplotlib inline

res = pd.read_csv('titanic_dataset/test_clear.csv')

res.head(10)

	Unnamed: 0	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	0	1	0	3	Braund, Mr. Owen Harris	male	22.000000	1	0	A/5 21171	7.2500	B96 B98	S
1	1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.000000	1	0	PC 17599	71.2833	C85	C
2	2	3	1	3	Heikkinen, Miss. Laina	female	26.000000	0	0	STON/O2. 3101282	7.9250	G6	S
3	3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.000000	1	0	113803	53.1000	C123	S
4	4	5	0	3	Allen, Mr. William Henry	male	35.000000	0	0	373450	8.0500	C123	S
5	5	6	0	3	Moran, Mr. James	male	29.699118	0	0	330877	8.4583	C123	Q
6	6	7	0	1	McCarthy, Mr. Timothy J	male	54.000000	0	0	17463	51.8625	E46	S
7	7	8	0	3	Palsson, Master. Gosta Leonard	male	2.000000	3	1	349909	21.0750	E46	S
8	8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.000000	0	2	347742	11.1333	E46	S
9	9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.000000	1	0	237736	30.0708	E46	C

res['Age'] = np.round(res['Age'],1)  # round Age to one decimal place

res.head(5)

	Unnamed: 0	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	B96 B98	S
1	1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	G6	S
3	3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	C123	S

# Compare the mean fare of men and women

mean_fare = res['Fare'].groupby(res['Sex']).mean().plot(kind='bar',legend=True)  
# =res.groupby(['Sex'])['Fare'].mean().plot(kind='bar',legend=True)

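[Figure: bar chart of mean fare by sex]

The mean fare for women is clearly higher than for men (about half again as high, by the author's reading of the chart), which suggests that women were more likely to hold tickets in the higher classes.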

# Count survivors by sex

survived_count = res.groupby(['Sex'])['Survived'].sum()
print(survived_count)

Sex
female    233
male      109
Name: Survived, dtype: int64

survived_count.plot(kind='bar',title='Survived_sex',legend=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f9125dab2d0>

[Figure: bar chart of survivor counts by sex]

# Survival rate of infants (younger than 2)
baby_sur = len(res[(res.Survived == 1) & (res.Age < 2)]) / len(res[res['Age'] < 2 ])
print(baby_sur)

0.8571428571428571

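More than twice as many women as men survived (233 versus 109), and the survival rate of infants is as high as 85.7%, which matches the "women and children first" scenes in the film.

# Count survivors in each passenger class
survived_pclass = res.groupby('Pclass')['Survived'].sum().plot(kind='bar',title='Survived_Pclass',legend=True)

[Figure: bar chart of survivor counts by passenger class]

First class has the most survivors, so money may buy not only comfort but sometimes survival = =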

# Mean fare for each age within each ticket class
res.groupby(['Pclass','Age'])['Fare'].mean()

Pclass  Age 
1       0.9     151.5500
        2.0     151.5500
        4.0      81.8583
        11.0    120.0000
        14.0    120.0000
                  ...   
3       61.0      6.2375
        63.0      9.5875
        65.0      7.7500
        70.5      7.7500
        74.0      7.7750
Name: Fare, Length: 185, dtype: float64

# Total survivors at each age, sorted to find the age with the most survivors; that count is then compared with the overall number of survivors
survived_age = res['Survived'].groupby(res['Age']).sum().sort_values(ascending=False)
print(survived_age)

Age
29.7    52
24.0    15
22.0    11
36.0    11
35.0    11
        ..
30.5     0
14.5     0
40.5     0
74.0     0
23.5     0
Name: Survived, Length: 88, dtype: int64

The top entry, Age=29.7, is an artefact of filling the missing ages with the mean earlier, so we take the genuine value Age=24 as the age with the most survivors.

res['Survived'].sum()

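# Share of all survivors who were aged 24: 15 survivors aged 24 divided by all survivors (this is not the survival rate of 24-year-olds, which would divide by the number of passengers aged 24)
print('Share of all survivors aged 24: %s' %(15 / res['Survived'].sum()))

Share of all survivors aged 24: 0.043859649122807015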
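
The two ratios are easy to confuse, so here is a minimal sketch that computes both per age. The frame `sample` is invented for illustration and does not reproduce the Titanic counts; `survival_rate` divides by the passengers of that age, while `share_of_all_survivors` divides by the total number of survivors, as the calculation above does.

```python
import pandas as pd

sample = pd.DataFrame({'Age': [24, 24, 24, 24, 30, 30],
                       'Survived': [1, 1, 0, 0, 1, 0]})

by_age = sample.groupby('Age')['Survived'].agg(survivors='sum', passengers='count')
by_age['survival_rate'] = by_age['survivors'] / by_age['passengers']
by_age['share_of_all_survivors'] = by_age['survivors'] / sample['Survived'].sum()

print(by_age)
```

Because `Survived` is coded 0/1, `sample.groupby('Age')['Survived'].mean()` gives the survival rate directly.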

# Visualise survivors versus deaths for each sex
res.groupby(['Sex','Survived'])['Survived'].count()

Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64

# unstack() moves the inner index level (Survived) into columns; stacked=True draws the bars stacked
# 1 = survived, 0 = died
res.groupby(['Sex','Survived'])['Survived'].count().unstack().plot(kind='bar',stacked=True,title='survived_count')

<matplotlib.axes._subplots.AxesSubplot at 0x7f9122472890>

[Figure: stacked bar chart of survived and died counts by sex]

# Count survivors in each age band
res['_age'] = pd.cut(res['Age'],[0,5,15,30,50,80])

res.groupby(['_age','Survived'])['Survived'].count().unstack().plot(kind='bar',stacked=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f9121b5bc50>

[Figure: stacked bar chart of survived and died counts by age band]

# Age distribution for each passenger class (kernel density estimate)
res.Age[res.Pclass == 1].plot(kind='kde')
res.Age[res.Pclass == 2].plot(kind='kde')
res.Age[res.Pclass == 3].plot(kind='kde')
plt.xlabel('Age')
plt.xlim(0, res.Age.max())
plt.legend((1,2,3),loc='best')

<matplotlib.legend.Legend at 0x7f9122324390>

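[Figure: kernel density plot of age for passenger classes 1, 2 and 3]

Part 4 Building models

Finally we get to the modelling stage~ We will build a model that predicts whether the passengers in the test set survive (ps: which feels a little cruel…)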

res = res.drop('_age', axis=1)

res.columns

Index(['Unnamed: 0', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
       'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

res = res.drop('Unnamed: 0', axis=1)

res.isnull().sum()  # check whether any missing values remain

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

# Select the input features and one-hot encode them (the text columns Sex and Embarked are encoded)
data = res[['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']]
data = pd.get_dummies(data)

data.head()

	Pclass	Age	SibSp	Parch	Fare	Sex_female	Sex_male	Embarked_C	Embarked_Q	Embarked_S
0	3	22.0	1	0	7.2500	0	1	0	0	1
1	1	38.0	1	0	71.2833	1	0	1	0	0
2	3	26.0	0	0	7.9250	1	0	0	0	1
3	1	35.0	1	0	53.1000	1	0	0	0	1
4	3	35.0	0	0	8.0500	0	1	0	0	1

# Choose models and train
## Split the dataset
from sklearn.model_selection import train_test_split

For details of the train_test_split() parameters, see
https://www.cnblogs.com/SupremeBoy/p/12247864.html

X = data  # X holds the input features
y = res['Survived']  # y is the target: whether the passenger survived

# Split the data (X_train, y_train: training set; X_test, y_test: test set)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
print(X_train.shape)  # (rows, features) of the training set
print(X_test.shape)  # (rows, features) of the test set

(668, 10)
(223, 10)

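4.1 Building the models

Linear classification model: logistic regression

Tree-based classification models: decision tree, random forest

Notes:
1. Logistic regression is a classification model (predicts a class); linear regression is a regression model (predicts a number)
2. A random forest is an ensemble of decision trees, built to reduce the overfitting of a single tree
3. Linear models live in the module sklearn.linear_model
4. Random forests live in sklearn.ensemble; single decision trees live in sklearn.tree

 Given this dataset and task, we use logistic regression and a random forest for the final classification.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# LR model with its parameters written out (the defaults of the scikit-learn version the author used; this object is not used below)
LogisticRegression(C=1.0,class_weight=None,dual=False,fit_intercept=True,intercept_scaling=1,max_iter=100,multi_class='ovr',n_jobs=1,
                  penalty='l2',random_state=None,solver='liblinear',tol=0.0001,verbose=0,warm_start=False)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
                   solver='liblinear', tol=0.0001, verbose=0, warm_start=False)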

lr = LogisticRegression()
lr.fit(X_train, y_train)  # 输入训练集训练LR模型

# Score (accuracy) on the training and test sets
print('Training set Score: {: .2f}'.format(lr.score(X_train, y_train)))
print('Testing set Score: {: .2f}'.format(lr.score(X_test, y_test)))

Training set Score:  0.81
Testing set Score:  0.78

These are the results of the LR model; tuning the parameter C may improve the score.
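
One way to explore the effect of C is to compare cross-validated scores for a few candidate values. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the Titanic features, and the candidate values, `max_iter=1000` and the names `X_demo`, `y_demo`, `mean_scores` are all choices made here rather than anything from the original tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

mean_scores = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C means stronger regularisation.
    model = LogisticRegression(C=C, max_iter=1000)
    mean_scores[C] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_C = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(best_C)
```

Choosing C by cross-validation on the training data, rather than by the test-set score, keeps the test set usable as an unbiased final check.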

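# RF model with its parameters written out (as in the scikit-learn version the author used; newer releases have removed min_impurity_split and the 'auto' option of max_features, so this line may need adjusting there)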
RandomForestClassifier(bootstrap=True,class_weight=None,criterion='gini',max_depth=None,max_features='auto',
                       max_leaf_nodes=None,min_impurity_decrease=0.0,min_impurity_split=None,min_samples_leaf=1,min_samples_split=2,
                       min_weight_fraction_leaf=0.0,n_estimators=10,n_jobs=1,oob_score=False,random_state=None,verbose=0,warm_start=False)

RF = RandomForestClassifier()
RF.fit(X_train, y_train)

# Score (accuracy) on the training and test sets
print('Training set Score: {: .2f}'.format(RF.score(X_train, y_train)))
print('Testing set Score: {: .2f}'.format(RF.score(X_test, y_test)))

Training set Score:  0.99
Testing set Score:  0.80

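Without any tuning, the random forest scores slightly higher than logistic regression on the test set (0.80 versus 0.78), although the gap between its training score (0.99) and its test score suggests overfitting. Tuning n_estimators and max_depth may give a better score.

Producing predictions

There are two ways to output predictions:

1. Output the class labels predicted by the model directly
2. Output the predicted probability of each class; the class with the higher probability is the predicted class

Supervised models in sklearn generally provide predict() for the predicted labels, and classifiers also provide predict_proba() for the class probabilities

# Predicted labels from the logistic regression model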
pred = lr.predict(X_train)

pred[:10]  # 1 means survived, 0 means died

array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])

# Predicted class probabilities from the logistic regression model
pred_prob = lr.predict_proba(X_train)

pred_prob[:10]

array([[0.62177432, 0.37822568],
       [0.14859439, 0.85140561],
       [0.47842849, 0.52157151],
       [0.20120029, 0.79879971],
       [0.86886004, 0.13113996],
       [0.90963818, 0.09036182],
       [0.13523801, 0.86476199],
       [0.90597394, 0.09402606],
       [0.04782863, 0.95217137],
       [0.11994542, 0.88005458]])

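Each row gives the probability of each class. We can set a threshold such as 0.5: if the probability of class 1 exceeds the threshold the passenger is labelled as survived, otherwise as died.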
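
Applying a threshold to the probabilities is a one-line operation. The sketch below copies the first five rows of the predict_proba output shown above; the 0.6 threshold and the variable names are hypothetical, added only to show how a stricter cut-off changes the labels.

```python
import numpy as np

# First five rows of the probabilities above: [P(class 0), P(class 1)]
proba = np.array([[0.62177432, 0.37822568],
                  [0.14859439, 0.85140561],
                  [0.47842849, 0.52157151],
                  [0.20120029, 0.79879971],
                  [0.86886004, 0.13113996]])

default_labels = (proba[:, 1] > 0.5).astype(int)  # matches the first five predict() labels
strict_labels = (proba[:, 1] > 0.6).astype(int)   # hypothetical stricter threshold

print(default_labels)
print(strict_labels)
```

Raising the threshold trades recall for precision: the third passenger, at 0.52, is no longer labelled as a survivor.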

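4.2 Evaluating the models

1. Model evaluation measures how well a model generalises
2. Cross-validation is a statistical method for estimating generalisation performance; it is more thorough and stable than a single train/test split
3. In cross-validation the data is split several times and a model is trained for each split

The most common form is K-fold cross-validation, where K is chosen by the user, typically 5 or 10
Precision measures how many of the samples predicted positive are truly positive
Recall measures how many of the truly positive samples are predicted positive
F-score is the harmonic mean of precision and recall

K-fold cross-validation

# Evaluate the logistic regression model with 10-fold cross-validation
# and compute the mean cross-validation accuracy
from sklearn.model_selection import cross_val_score

lr = LogisticRegression(C=100)

Scores = cross_val_score(lr, X_train,y_train,cv=10)  # 10-fold cross-validation; returns the score of each fold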

Scores

array([0.82089552, 0.7761194 , 0.82089552, 0.79104478, 0.85074627,
       0.86567164, 0.73134328, 0.86567164, 0.74242424, 0.6969697 ])

# Mean cross-validation score
print('Average Cross-validation Score: {:.2f}'.format(Scores.mean()))

Average Cross-validation Score: 0.80

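Confusion matrix, precision, recall and F-score

1. The confusion matrix function is in the sklearn.metrics module
2. It takes the true labels and the predicted labels as input

The confusion matrix is defined as follows (rows are true labels, columns are predicted labels, the layout used by sklearn's confusion_matrix):

            Predicted 0            Predicted 1
Actual 0    TN (true negative)     FP (false positive)
Actual 1    FN (false negative)    TP (true positive)

Precision, recall and F-score are computed as:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-score = 2 * Precision * Recall / (Precision + Recall)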

from sklearn.metrics import confusion_matrix

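# Train the model (parameters written out for reference; the model actually trained is the one created on the following lines)
LogisticRegression(C=100,class_weight=None,dual=False,fit_intercept=True,intercept_scaling=1,max_iter=100,multi_class='ovr',n_jobs=1,
                  penalty='l2',random_state=None,solver='liblinear',tol=0.0001,verbose=0,warm_start=False)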

lr = LogisticRegression(C=100)
lr.fit(X_train, y_train) 

# Get the predicted labels
pred = lr.predict(X_train)

# Compute the confusion matrix
confusion_matrix(y_train, pred)

array([[353,  59],
       [ 72, 184]])

# Compute precision, recall and F-score
from sklearn.metrics import classification_report

print(classification_report(y_train, pred))

              precision    recall  f1-score   support

           0       0.83      0.86      0.84       412
           1       0.76      0.72      0.74       256

    accuracy                           0.80       668
   macro avg       0.79      0.79      0.79       668
weighted avg       0.80      0.80      0.80       668
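
As a cross-check, the class 1 row of the report can be recomputed by hand from the confusion matrix printed earlier, using the standard definitions of precision, recall and F1. The variable names below are illustrative.

```python
# Confusion matrix from above: rows = true class, columns = predicted class
tn, fp = 353, 59
fn, tp = 72, 184

precision = tp / (tp + fp)   # of those predicted to survive, how many did
recall = tp / (tp + fn)      # of those who survived, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

The rounded values agree with the class 1 row and the accuracy line of the classification report.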

Plotting the ROC curve

1. Available from sklearn.metrics
2. The larger the area under the ROC curve, the better

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

For the parameters and return values of roc_curve(), see https://blog.csdn.net/w1301100424/article/details/84546194

fpr: False Positive rate
tpr: True Positive rate

fpr,tpr,thresholds = roc_curve(y_test, lr.decision_function(X_test))
                               
plt.plot(fpr,tpr, label='ROC Curve')
plt.xlabel('fpr')
plt.ylabel('tpr(recall)')
# Find the threshold closest to 0
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero],tpr[close_zero],'o',markersize=10,label='threshold zero',fillstyle='none',c='k',mew=2)
plt.legend(loc='best')

<matplotlib.legend.Legend at 0x7f911a9cf050>

[Figure: ROC curve with the point at threshold zero marked]
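
The area under the ROC curve can also be computed as a single number with `roc_auc_score`. The sketch below uses four hand-written labels and scores (an assumption, not model output from this post) so that the result is easy to verify.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])  # higher score = more likely class 1

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

print(fpr, tpr)
print(auc)
```

For the model above, the analogous call would be `roc_auc_score(y_test, lr.decision_function(X_test))`; an AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one.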

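Summary

A few days of focused study gave me a much clearer picture of how to do data analysis in Python. The rough workflow is: explore the data → clean the data and process features → visualise → build a suitable model. In a real project the data still has to be handled according to the business context. The biggest benefit of this project was building a clear framework for data analysis and reviewing the libraries and methods it uses most. I hope you also got something out of this post~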
