Kaggle Titanic Project: Hands-On Practice and a Summary of Related Concepts

Hazel1811

Article link: https://blog.csdn.net/Sage165/article/details/108429067

Titanic

Loading the Data

1. Reading the data

pandas is a widely used Python data-processing package; it can read a CSV file into a DataFrame.
Detailed pandas documentation: https://www.pypandas.cn/docs/

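import pandas
titanic = pandas.read_csv("train.csv")
# head(n) returns the first n rows; n defaults to 5
head3 = titanic.head(3)
print(head3)
# Descriptive statistics for the numeric columns (count, mean, min, max, quartiles)
print(titanic.describe())
# Column names, dtypes, and non-null counts; info() prints directly and returns None,
# so it does not need to be wrapped in print()
titanic.info()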
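
Since info() reports non-null counts, a natural follow-up is to count the missing values per column directly. The snippet below is only a minimal sketch on a small hypothetical frame (the rows are invented for illustration and are not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic a few train.csv columns (values are made up).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", None, "S"],
})

# isnull() marks the missing cells; sum() then counts them per column.
missing_counts = sample.isnull().sum()
print(missing_counts)
```

On the real data the same call would be titanic.isnull().sum().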

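The training data has 12 columns. The Survived column records whether the passenger survived; the remaining columns describe the passenger:

PassengerId => passenger ID

Pclass => passenger class (1st/2nd/3rd)

Name => passenger name

Sex => sex

Age => age

SibSp => number of siblings/spouses aboard

Parch => number of parents/children aboard

Ticket => ticket information

Fare => ticket fare

Cabin => cabin

Embarked => port of embarkation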

2. Reading CSV, Excel, and TXT files

Excel and CSV:
https://www.jianshu.com/p/0fd5551bac37
Reading with pandas: https://www.cnblogs.com/happymeng/p/10481293.html
Other ways to read files: https://www.cnblogs.com/caiyishuai/p/9462833.html
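
As a rough illustration of the pandas route, the sketch below reads comma-separated and tab-separated text with read_csv. The in-memory strings stand in for real files and are made up for this example; with files on disk you would pass a path instead.

```python
import io
import pandas as pd

# Made-up file contents; io.StringIO lets read_csv treat a string like a file.
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"
txt_text = "PassengerId\tSurvived\tPclass\n1\t0\t3\n2\t1\t1\n"

# Comma is the default separator.
from_csv = pd.read_csv(io.StringIO(csv_text))
# A tab-delimited .txt file only needs a different sep.
from_txt = pd.read_csv(io.StringIO(txt_text), sep="\t")

print(from_csv)
print(from_txt)
```

Excel files go through pandas.read_excel instead, which needs an optional engine such as openpyxl to be installed.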

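Visual Data Analysis

Use plots to get a first impression of the data and of how each feature relates to survival.

Plots

Single features versus survival
1. Passenger class (Pclass) versus Survived: the survival rate within each class
2. Male/female split among the survivors (pie chart)
3. Overall age histogram, plus the age distribution of survivors and non-survivors separately (Survived on the x-axis)
4. Number of siblings/spouses and parents/children (SibSp/Parch): same as above, or with the count on the x-axis
5. Fare
6. Port of embarkation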
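
The numbers behind the first plot in the list above can be sketched with a groupby. This is only an illustration on invented rows, not the real training data:

```python
import pandas as pd

# Invented toy rows; on the real data you would use the titanic DataFrame.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Survived is 0/1, so its mean within each class is that class's survival rate.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_class)
```

Calling rate_by_class.plot(kind="bar") would then draw the bar chart, assuming matplotlib is installed.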

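Relationships within the data
Age distribution for each passenger class (three curves, one per class, with age on the x-axis)
Port of embarkation versus fare / passenger class
Family size versus survival rate
Joint effect of passenger class and sex on the survival rate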
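
For the last item, the joint effect of class and sex, a pivot table is one possible way to get the survival rate for every combination. A small sketch on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "female", "male"],
    "Survived": [1, 0, 1, 0, 0, 0],
})

# Rows are classes, columns are sexes, cells are mean survival (the rate).
rate_table = toy.pivot_table(index="Pclass", columns="Sex",
                             values="Survived", aggfunc="mean")
print(rate_table)
```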

matplotlib tutorial: https://www.ctolib.com/docs/sfile/matplotlib-intro/index.html

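Data Analysis

1. Data preparation: feature engineering

Filling missing values

# Fill the missing ages with the median age (the median, not the mean)
median_age = titanic["Age"].median()
titanic["Age"] = titanic["Age"].fillna(median_age)
print(titanic.describe())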
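
To see why the choice between median and mean matters, here is a small sketch on an invented age column that contains one outlier:

```python
import numpy as np
import pandas as pd

# Invented ages with one missing value and one outlier (80).
ages = pd.Series([20.0, 22.0, 24.0, np.nan, 80.0])

# median() and mean() both skip NaN by default.
filled_median = ages.fillna(ages.median())
filled_mean = ages.fillna(ages.mean())

print(filled_median.iloc[3])  # the median of 20, 22, 24, 80
print(filled_mean.iloc[3])    # the mean, pulled up by the outlier
```

The median fill stays close to the typical passenger, while the mean fill is dragged toward the outlier.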

Replacing strings with integers

# Series.unique() returns the distinct values in order of first appearance (it does not sort them)
print(titanic["Sex"].unique())

# titanic["Sex"] is a pandas Series, not a dict
print(type(titanic["Sex"]))
# .values is an attribute, not a method: titanic["Sex"].values returns every value
# as a NumPy array, while titanic["Sex"].values() raises a TypeError
# print(titanic["Sex"].values)

# Encode male as 0 and female as 1
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
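
An alternative to the two .loc assignments is Series.map with a dictionary. This is just a sketch of that variant on a made-up column, not what the original code does:

```python
import pandas as pd

# Made-up values standing in for titanic["Sex"].
sex = pd.Series(["male", "female", "female", "male"])

# map() looks each value up in the dict and returns a new integer Series.
encoded = sex.map({"male": 0, "female": 1})
print(encoded.tolist())
print(encoded.dtype)
```

One side effect of map is that the result has an integer dtype, whereas assigning through .loc leaves the column with its original dtype.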

Filling missing values and converting to integers

print(titanic["Embarked"].unique())
# A categorical column has no mean, so fill the gaps with a frequently occurring value, here "S"
titanic["Embarked"] = titanic["Embarked"].fillna("S")
# Encode the ports as integers
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
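
Rather than hard-coding "S", the most frequent value can be looked up with mode(). A minimal sketch on invented data:

```python
import pandas as pd

# Invented ports with one missing entry.
embarked = pd.Series(["S", "C", None, "S", "Q"])

# mode() returns the most frequent value(s), ignoring missing entries; take the first.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

# Same integer encoding as above.
codes = filled.map({"S": 0, "C": 1, "Q": 2})
print(most_common, codes.tolist())
```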

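2. Linear regression

# Binary classification with linear regression

# scikit-learn: Python machine-learning library
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Instantiate the model
alg = LinearRegression()

# Split the samples into 3 equal parts for 3-fold cross-validation.
# random_state is omitted: it has no effect when shuffle=False, and recent
# scikit-learn versions raise a ValueError if it is set in that case
kf = KFold(n_splits=3, shuffle=False)

predictions = []
for train, test in kf.split(titanic):
    # Training features for this fold
    train_predictors = titanic[predictors].iloc[train, :]
    # Training labels; a Series is one-dimensional, so iloc takes a single indexer
    train_target = titanic["Survived"].iloc[train]
    # Fit the model
    alg.fit(train_predictors, train_target)
    # Predict on the held-out fold
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    # Collect this fold's predictions
    predictions.append(test_predictions)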
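
To make the train/test indices in the loop above more concrete, this sketch prints what KFold.split yields for a tiny made-up array of six samples:

```python
import numpy as np
from sklearn.model_selection import KFold

# Six made-up samples, one feature each.
X = np.arange(6).reshape(6, 1)

# Without shuffling, the folds are consecutive blocks of rows.
kf = KFold(n_splits=3, shuffle=False)
folds = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample lands in the test part exactly once, which is why concatenating the per-fold predictions gives one prediction per row, in the original row order.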

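Computing accuracy

import numpy as np
# Concatenate the per-fold arrays into a single 1-D array
predictions = np.concatenate(predictions, axis=0)

# Map the regression outputs to class labels with a 0.5 threshold
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0

# predictions == titanic["Survived"] is a boolean Series; True counts as 1,
# so the sum is the number of correct predictions
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)

print(accuracy)
# For a binary problem, random guessing already gives about 50% accuracy

Output: 0.7833894500561167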
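
The thresholding and accuracy steps can be checked by hand on a few numbers. A small sketch with invented regression outputs and labels:

```python
import numpy as np

# Invented regression outputs and true labels.
raw = np.array([0.2, 0.7, 0.5, 0.9])
labels = np.array([0, 1, 1, 1])

# Values above 0.5 become class 1, everything else (including exactly 0.5) class 0.
predicted = (raw > 0.5).astype(int)

# The mean of a boolean array is the fraction of True entries, i.e. the accuracy.
accuracy = (predicted == labels).mean()
print(predicted, accuracy)
```

Here the third sample sits exactly on the threshold, is mapped to 0, and counts as the one mistake out of four.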

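3. Logistic regression

# Logistic regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

alg = LogisticRegression(random_state=1)

# cv=3 runs 3-fold cross-validation and returns one accuracy score per fold
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
print(scores.mean())

Output: 0.7957351290684623

# The score above comes from the validation folds of cross-validation;
# the final predictions should be made on the test set
titanic_test = pandas.read_csv("test.csv")
# Apply the same preprocessing steps as above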
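
One way to apply "the same preprocessing" to both files is to wrap the steps in a function. The preprocess helper and the toy frames below are hypothetical, written only for this sketch. Filling Fare and reusing the training-set medians for the test set are assumptions, not steps shown in the original.

```python
import numpy as np
import pandas as pd

def preprocess(df, age_fill, fare_fill):
    """Hypothetical helper: apply the fill and encoding steps to a copy of df."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(age_fill)
    out["Fare"] = out["Fare"].fillna(fare_fill)
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return out

# Toy stand-ins for train.csv and test.csv (values are made up).
train_toy = pd.DataFrame({"Age": [20.0, 30.0, np.nan], "Fare": [7.0, 8.0, 9.0],
                          "Sex": ["male", "female", "male"],
                          "Embarked": ["S", None, "Q"]})
test_toy = pd.DataFrame({"Age": [np.nan], "Fare": [np.nan],
                         "Sex": ["female"], "Embarked": ["C"]})

# Assumption: fill values are computed on the training data and reused for the test data.
age_fill = train_toy["Age"].median()
fare_fill = train_toy["Fare"].median()

train_ready = preprocess(train_toy, age_fill, fare_fill)
test_ready = preprocess(test_toy, age_fill, fare_fill)
print(test_ready)
```

A fitted model could then call predict on test_ready[predictors] once every predictor column has been prepared this way.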

4. Random forest

# Random forest
# Each tree is trained on a sample drawn with replacement and on randomly chosen features (the number can be specified)
# Many decision trees are built; this helps guard against overfitting, shows which
# features influence the result most, and makes it easier to drop harmful ones
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

alg = RandomForestClassifier(random_state=1,
                             n_estimators=10,  # number of decision trees
                             min_samples_split=2,
                             min_samples_leaf=1)

# random_state is omitted because it has no effect when shuffle=False
# (recent scikit-learn versions raise a ValueError if it is set)
kf = KFold(n_splits=3, shuffle=False)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())

Output: 0.7856341189674523. This is not very good, so the parameters need tuning.

alg = RandomForestClassifier(random_state=1,
                             n_estimators=100,
                             min_samples_split=4,
                             min_samples_leaf=2)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
kf = KFold(n_splits=3, shuffle=False)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

Output: 0.8148148148148148
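
The tuning above was done by hand. As a possible alternative, GridSearchCV can try the combinations automatically, and the fitted forest exposes feature_importances_ to show which inputs matter most. The sketch below runs on synthetic data (random numbers where only the first column carries signal), and the parameter grid is an assumption chosen for illustration, not the author's search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data: 120 samples, 3 features; the label depends only on column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] > 0).astype(int)

# Illustrative grid built around the values tried by hand above.
param_grid = {
    "n_estimators": [10, 50],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 2],
}

search = GridSearchCV(RandomForestClassifier(random_state=1),
                      param_grid,
                      cv=KFold(n_splits=3),
                      scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)

# Importances of the refitted best model; they sum to 1 across features.
importances = search.best_estimator_.feature_importances_
print(importances)
```

On the Titanic data the same pattern would use titanic[predictors] and titanic["Survived"] in place of X and y.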
