Kaggle Study Notes (2): Titanic

陌生的天花板

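Learning the workflow through the Titanic case study.

Kaggle notebook: https://www.kaggle.com/startupsci/titanic-data-science-solutions

I had translated about half of the notebook when the page crashed, so I did not redo the translation.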

Analyze by describing data

Which features are available in the dataset?
Which features are categorical?
Which features are numerical?
Which features are mixed data types? Different feature types need different handling.
Which features may contain errors or typos?
Which features contain blank, null or empty values?
What are the data types for various features?
What is the distribution of numerical feature values across the samples?
What is the distribution of categorical features?

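So far the work has mostly been about understanding the problem and analyzing the data: a few pandas commands are enough to inspect the feature types, the amount of data, and even how the features are distributed.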

train_df.describe(include=['O'])

For what the pandas describe function does, see: https://blog.csdn.net/xckkcxxck/article/details/84799220
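To make the describe call above concrete, here is a minimal sketch on a toy DataFrame. The name toy_df and all of its values are made up for illustration and stand in for train_df; the string columns are created with an explicit object dtype so that include=["O"] selects them.

```python
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Sex": pd.Series(["male", "female", "female", "male"], dtype=object),
    "Embarked": pd.Series(["S", "C", "S", None], dtype=object),
    "Age": [22.0, 38.0, None, 35.0],
})

# Default call: numerical columns only (count, mean, std, min, quartiles, max).
numeric_summary = toy_df.describe()

# include=["O"]: object columns only (count, unique, top, freq).
object_summary = toy_df.describe(include=["O"])

print(numeric_summary)
print(object_summary)
```

The count row is a quick way to spot missing values: Age and Embarked each report 3 non-null entries out of 4 rows.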

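Assumptions based on data analysis

Based on the data analysis so far, we can make some assumptions.

Correlating: we want to know how each feature relates to survival. We want to form this judgment early in the project and fit the relationship precisely with a model later.

Completing: we want to fill in the missing Age values, because this feature is clearly related to survival.

Embarked should be completed as well, because it may also be related to survival or to other features.

Correcting: the Ticket feature is just the ticket number. Its duplicate ratio is given as 22% in the notebook, and it probably has little to do with survival, so we drop it.

The Cabin feature has a great many missing (null) values, so it does not look usable.

PassengerId is not useful.

Name is not useful as it stands either, and it follows no uniform format.

Creating: create a new feature, Family, based on Parch and SibSp.

Re-engineer the Name feature by extracting the Title from each name as a new feature.

Split Age into several ranges, turning a continuous numerical feature into an ordinal categorical feature.

Likewise, we can create a Fare range feature.

Classifying: we can also make some assumptions based on the problem description.

Women were more likely to survive.

Children were more likely to survive.

First-class passengers were more likely to survive.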

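1. Analyze by pivoting features

We can verify our assumptions by pivoting features against each other.

Pclass=1 has a survival rate above 50% (about 63% in the training set), far higher than Pclass=3, so we decide to include this feature in our model.

Sex=female has a survival rate of about 74%.

SibSp and Parch show no clear correlation with survival, so we may need to derive a new feature from them.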

train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

This part requires fluency with the data analysis functions in pandas. A short introduction to groupby: https://www.jianshu.com/p/42f1d2909bb6
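The pivot above can be tried on a small example. This is a minimal sketch: toy_df and its values are made up for illustration, and only the groupby, mean and sort_values chain mirrors the line from the notebook.

```python
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the survival rate of each group.
survival_by_class = (
    toy_df[["Pclass", "Survived"]]
    .groupby("Pclass", as_index=False)
    .mean()
    .sort_values(by="Survived", ascending=False)
)

print(survival_by_class)
```

Because Survived is coded as 0 or 1, taking the mean per group directly gives the share of survivors, which is why no explicit counting is needed.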

Analyze by visualizing data

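Analyze the data through visualization.

Correlating numerical features

How numerical features relate to our target.

A histogram shows how a continuous feature such as Age is distributed, which helps guide later modeling choices.

Observations:

Children had a high survival rate
The oldest passenger survived
A large number of 15-25 year olds did not survive
Most passengers were 15-35 years old

Decisions:

This analysis confirms our assumptions and guides the rest of the workflow.

Consider Age in the model
Complete the missing Age values
Band Age into ranges

This section uses the seaborn package: https://blog.csdn.net/weixin_42398658/article/details/82960379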
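The numbers behind such a histogram, and the idea of banding Age, can be sketched with pandas alone. This is an illustration under assumptions: toy_df, its values and the bin edges are made up, and the notebook itself draws the plots with seaborn rather than printing a table.

```python
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Age":      [2, 4, 18, 21, 24, 30, 33, 45, 62, 80],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 1, 0, 1],
})

# Bin edges are an assumption chosen for this example.
bins = [0, 16, 32, 48, 64, 80]
toy_df["AgeBand"] = pd.cut(toy_df["Age"], bins=bins)

# Per band: how many passengers fall in it, and what share survived.
band_stats = (
    toy_df.groupby("AgeBand", observed=False)["Survived"]
    .agg(passengers="count", survival_rate="mean")
)

print(band_stats)
```

The passengers column corresponds to the bar heights of a histogram over Age, and AgeBand is the kind of ordinal feature the decisions above call for.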

Correlating numerical and ordinal features

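Look at numerical and ordinal features.

Use a single plot to analyze several features at once.

Observations:

Pclass=3 had the most passengers, and most of them did not survive
Most children in Pclass=2 and Pclass=3 survived
Most passengers in Pclass=1 survived
The age distribution differs across Pclass values

Decisions:

Consider Pclass when modeling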

Correlating categorical features

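Look at categorical features.

Observations:

Women were more likely to survive than men
For Embarked=C, men had a higher survival rate, but this is likely because the port of embarkation is related to Pclass and so affects survival only indirectly; it need not be treated as a direct effect

Decisions:

Consider Sex when modeling
Complete the Embarked feature and add it to the model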

Correlating categorical and numerical features

2. Wrangle data

Modify the data.

Correcting by dropping features

Drop features that are not useful, applying the same operation to both the training set and the test set.

# combine is assumed to be defined earlier as [train_df, test_df]
print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

# Drop the same columns from both sets so their features stay aligned
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]

print("After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

Creating new feature extracting from existing

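Before dropping Name and PassengerId, we want to see whether anything useful can be extracted from Name.

Observations:

Most titles correspond to particular age groups
Survival varies only slightly across Title and Age bands
Certain titles mostly survived (Mme, Lady, Sir), while others mostly did not (Don, Rev, Jonkheer)

Decisions:

Keep the newly engineered Title feature for modeling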
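One way to extract a Title from Name is a regular expression that captures the word followed by a period. The sketch below assumes names follow the "Surname, Title. Given names" pattern; toy_df, the sample names, the list of common titles and the "Rare" label are all illustrative choices, not something these notes specify.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" pattern (illustrative).
toy_df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
        "Doe, Rev. John",  # hypothetical passenger
    ],
})

# Capture the first run of letters that follows a space and ends with a period.
toy_df["Title"] = toy_df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Assumption: group infrequent titles under one label to keep categories few.
common_titles = ["Mr", "Mrs", "Miss", "Master"]
is_rare = ~toy_df["Title"].isin(common_titles)
toy_df.loc[is_rare, "Title"] = "Rare"

print(toy_df[["Name", "Title"]])
```

With Title in place, Name itself carries little extra signal and can be dropped along with PassengerId.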

Converting a categorical feature

Convert categorical features to numerical values.
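For a two-valued feature such as Sex, a dictionary lookup is enough. This is a minimal sketch: toy_df and the particular 0/1 coding are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Assumed coding: female -> 1, male -> 0.
sex_codes = {"female": 1, "male": 0}
toy_df["Sex"] = toy_df["Sex"].map(sex_codes).astype(int)

print(toy_df)
```

Any value missing from the dictionary would become NaN and make astype(int) fail, which is a useful early warning that the data contains an unexpected category.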

Completing a numerical continuous feature

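Complete a continuous numerical feature.

There are three methods:

1. Generate random numbers from the mean and variance
2. Estimate the value from other features such as Pclass and Gender
3. Combine the two methods above

Methods 1 and 3 introduce random noise into the dataset, so we use method 2.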
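Method 2 can be sketched as filling each missing Age with the median Age of passengers who share the same Pclass and Sex. The code below is one possible way to do this with groupby and transform, which may differ from how the notebook implements it; toy_df and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for train_df; Sex is assumed to be already coded as 0/1.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Age":    [40.0, 50.0, None, 20.0, None, 30.0],
})

# Median Age of each (Pclass, Sex) group, aligned row by row with toy_df.
group_median = toy_df.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Only the missing entries are replaced; known ages stay untouched.
toy_df["Age"] = toy_df["Age"].fillna(group_median)

print(toy_df)
```

Since the filled value depends only on the data, repeated runs give the same result, unlike the random approaches in methods 1 and 3.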

Create new feature combining existing features

Create a new feature, FamilySize, which combines Parch and SibSp; Parch and SibSp can then be dropped.
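A minimal sketch of this step follows. The + 1, which counts the passenger themselves, is an assumption about how the two columns are combined; toy_df and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Siblings/spouses plus parents/children, plus 1 for the passenger (assumption).
toy_df["FamilySize"] = toy_df["SibSp"] + toy_df["Parch"] + 1

# The two source columns are now redundant.
toy_df = toy_df.drop(columns=["SibSp", "Parch"])

print(toy_df)
```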

Completing a categorical feature

Embarked feature takes S, Q, C values based on port of embarkation. Our training dataset has two missing values. We simply fill these with the most common occurrence.
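Filling with the most common value can be sketched as follows. toy_df and its values are made up for illustration; only the idea of using the mode comes from the text above.

```python
import pandas as pd

# Toy stand-in for train_df with two missing ports (illustrative values).
toy_df = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None]})

# mode() returns the most frequent value(s); take the first one.
most_common_port = toy_df["Embarked"].dropna().mode()[0]
toy_df["Embarked"] = toy_df["Embarked"].fillna(most_common_port)

print(most_common_port)
print(toy_df)
```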

Converting categorical feature to numeric

Quick completing and converting a numeric feature

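At this point the dataset is fully processed. We did the following:

Dropped the features that were not useful
Built new features from existing ones
Converted all features to numerical types

3. Model, predict and solve

First determine the type of problem: this is supervised learning, with classification and regression. Then choose applicable methods based on the problem type.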

Logistic Regression
KNN or k-Nearest Neighbors
Support Vector Machines
Naive Bayes classifier
Decision Tree
Random Forest
Perceptron
Artificial neural network
RVM or Relevance Vector Machine

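This example first uses logistic regression.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
# Accuracy on the training set, as a percentage
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log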

Then SVC.

# Support Vector Machines
from sklearn.svm import SVC

svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
# Accuracy on the training set, as a percentage
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc

Many other methods were tried as well, for example random forest.

# Random Forest
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
# Accuracy on the training set, as a percentage
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
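The snippets above score each model on the same data it was trained on. As a rough, self-contained sketch of that pattern, the code below fits two of the models on synthetic data and also reports accuracy on a held-out split for comparison. The synthetic dataset, the split, the variable names and max_iter=1000 are all assumptions made for this illustration and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the Titanic features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in candidates.items():
    model.fit(X_fit, y_fit)
    # Same scoring style as above: accuracy as a percentage.
    train_acc = round(model.score(X_fit, y_fit) * 100, 2)
    val_acc = round(model.score(X_val, y_val) * 100, 2)
    results[name] = (train_acc, val_acc)
    print(f"{name}: train={train_acc}, held-out={val_acc}")
```

Training accuracy tends to flatter flexible models such as random forests, so a held-out score is usually a fairer basis for comparing them.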

4. Model evaluation

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)

Pick the best-performing model and submit its predictions.

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
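The ranking and submission steps can be tried end to end on made-up numbers. In this sketch the scores, passenger IDs and predictions are hypothetical, and the CSV is written to an in-memory buffer rather than to a real file path.

```python
import io

import pandas as pd

# Hypothetical scores; in the notebook these come from the fitted models.
models = pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest", "KNN"],
    "Score": [80.0, 86.5, 84.0],
})

# Highest score first, then read off the top row.
ranked = models.sort_values(by="Score", ascending=False).reset_index(drop=True)
best_model_name = ranked.loc[0, "Model"]

# Hypothetical IDs and predictions standing in for test_df and Y_pred.
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# index=False keeps the row index out of the file; only two columns remain.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(best_model_name)
print(csv_text)
```

When doing this for real, Y_pred should come from the chosen model, since each snippet above overwrites Y_pred with the predictions of whichever model ran last.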

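5. Summary

So the workflow breaks down into the following steps:

1. Understand the problem

2. Analyze the data

3. Wrangle the data

4. Build and train models

5. Submit the solution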
