Kaggle_1_Titanic

YCheng10

Article link: https://blog.csdn.net/chengyn810/article/details/65939356

Introduction
On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. The dataset contains information on more than half of the people aboard. In this challenge, I am asked to analyze what sorts of people were likely to survive and to predict which passengers survived the tragedy.
Data description
The data are split into a training set (891 passengers) and a test set (418 passengers). The dataset includes 10 variables.

| Variable | Definition | Explanation |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | Male, female |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | Sibling, spouse |
| parch | # of parents / children aboard the Titanic | Parent, child |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
In addition, some data are missing, especially for age and cabin; the cabin variable is of little importance. I do not treat these missing values in this experiment. The figure below shows that sex, age and class make a clear difference to passenger survival. The survival rates of female passengers and of higher-class passengers are higher.

Fig 1.
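The two checks described above, counting missing values and comparing survival rates across groups, can be sketched in base R. This is only an illustration: the `toy` data frame is a hypothetical stand-in with made-up rows, not the real train.csv, and the variable names `missing_counts`, `rate_by_sex` and `rate_by_class` are my own.

```r
# Toy stand-in for train.csv (hypothetical rows, not the real data)
toy <- data.frame(
  Survived = c(1, 0, 1, 0, 1, 0, 0, 1),
  Pclass   = c(1, 3, 1, 3, 2, 3, 2, 1),
  Sex      = c("female", "male", "female", "male",
               "female", "male", "male", "male"),
  Age      = c(29, NA, 35, 22, NA, 40, 31, 54),
  Cabin    = c("C85", NA, "C123", NA, NA, NA, NA, "E46"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# The mean of a 0/1 column within a group is that group's survival rate
rate_by_sex   <- aggregate(Survived ~ Sex, data = toy, FUN = mean)
rate_by_class <- aggregate(Survived ~ Pclass, data = toy, FUN = mean)
print(rate_by_sex)
print(rate_by_class)
```

On the real file, empty Cabin fields are read as empty strings rather than NA by default, so reading with `read.csv("train.csv", na.strings = c("", "NA"))` is probably needed for the missing-value count to include them.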

Method
A decision tree, a classification method, is used in this task because the data are low-dimensional, most variables are categorical, and the resulting model is easy to interpret and explain.
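To make the idea of a tree split concrete, here is a small sketch of the Gini impurity, the default splitting criterion in rpart for classification. It is not rpart's internal code; the helper names `gini` and `split_gini` and the toy labels are assumptions made for illustration only.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini impurity of the two children produced by a logical split
split_gini <- function(labels, goes_left) {
  n <- length(labels)
  left <- labels[goes_left]
  right <- labels[!goes_left]
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

# Hypothetical toy labels, not taken from the Titanic data
survived <- c(1, 1, 1, 0, 0, 0, 0, 1)
sex <- c("female", "female", "female", "male",
         "male", "male", "male", "male")

parent_impurity <- gini(survived)                       # before splitting
child_impurity <- split_gini(survived, sex == "female") # after splitting on sex
gain <- parent_impurity - child_impurity                # impurity reduction
print(c(parent = parent_impurity, child = child_impurity, gain = gain))
```

A tree grows by choosing, at each node, the split with the largest impurity reduction among the candidate variables and cut points, which is why the fitted tree can be read directly as a set of rules.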
Result
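I use the rpart package in R to predict the test data, with only age, sex and pclass as predictor variables. The decision tree is shown below, and the predictions agree with gender_submission for 96.8% of the test passengers. Note that gender_submission is Kaggle's sample submission, which predicts survival from sex alone, so this figure measures agreement with that baseline rather than accuracy against the true outcomes.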
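The comparison with gender_submission comes down to the share of test passengers on which two label vectors match. A minimal sketch follows; the function name `agreement_rate` and the toy vectors are hypothetical and only stand in for the real predictions and the real file.

```r
# Share of positions at which two label vectors agree
agreement_rate <- function(pred, ref) {
  stopifnot(length(pred) == length(ref))
  # predict(..., type = "class") returns a factor, so compare as character
  mean(as.character(pred) == as.character(ref))
}

# Hypothetical toy vectors standing in for the tree's predictions
# and for the Survived column of gender_submission.csv
pred <- factor(c(0, 1, 0, 0, 1, 1, 0, 0))
ref  <- c(0, 1, 0, 0, 1, 0, 0, 0)

rate <- agreement_rate(pred, ref)
print(rate)
```

With the real data, this would be called on the tree's predictions and on the Survived column read from gender_submission.csv, assuming both are in the same PassengerId order.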

Conclusion

Even though the reported accuracy is surprisingly high, this report has many shortcomings, such as the handling of missing data and the selection of predictor variables. In future, more comparisons should be made between different decision tree algorithms and other classification methods.
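As one possible way to handle missing ages, the sketch below fills them with the median of the observed values. This is a simple technique I am substituting for illustration, not what the experiment above did; the function name `impute_median` and the example ages are hypothetical.

```r
# Replace missing values in a numeric vector with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Hypothetical ages, not taken from the Titanic data
age <- c(22, NA, 35, 54, NA, 4)
age_filled <- impute_median(age)
print(age_filled)
```

Median imputation keeps every row usable but ignores any relation between age and the other variables, so it is best seen as a baseline to compare against more careful approaches.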

R scripts

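# Install the required packages (only needed once)
install.packages(c("ggplot2", "rpart", "rpart.plot", "rattle"))
library(ggplot2)
library(rpart)
library(rpart.plot)
library(rattle)

# Load the Kaggle training and test sets
train <- read.csv("train.csv")
test <- read.csv("test.csv")

# Age histogram coloured by survival, faceted by ticket class (rows) and sex (columns)
ggplot(train, aes(x = Age, fill = factor(Survived))) +
  geom_histogram() +
  facet_grid(Pclass ~ Sex)

# Fit a classification tree on age, sex and ticket class, then plot it
fit <- rpart(Survived ~ Age + Sex + Pclass, data = train, method = "class")
fancyRpartPlot(fit)

# Predict the test set and write the submission file
# (Kaggle expects the columns PassengerId and Survived)
testresult <- predict(fit, test, type = "class")
solution <- data.frame(PassengerId = test$PassengerId, Survived = testresult)
write.csv(solution, file = "solution.csv", row.names = FALSE)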