Kaggle_1_Titanic

Introduction
On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. The dataset contains several kinds of information on more than half of the passengers. In this challenge, I’m asked to analyze what sorts of people were likely to survive and to predict which passengers survived the tragedy.
Data description
The data are split into a training set (891 passengers) and a test set (418 passengers). The dataset includes 10 variables.
Variable   Definition                                    Explanation
survival   Survival                                      0 = No, 1 = Yes
pclass     Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex                                           Male, female
Age        Age in years
sibsp      # of siblings / spouses aboard the Titanic    Sibling, spouse
parch      # of parents / children aboard the Titanic    Parent, child
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton
In addition, some data are missing, especially in the age and cabin variables, the latter of which is of no importance here. I set these missing values aside during the experiment. The figure below shows that sex, age and class make a significant difference to the survival of passengers. The survival rates of female passengers and of passengers in higher classes are higher.
 
Fig 1.
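The missing values and the group survival rates described above can be checked with a few lines of base R. The sketch below uses a small made-up data frame named toy (a hypothetical stand-in with the same column names as train.csv, not real passenger records); with the real data, train would take its place.

```r
# Hypothetical toy rows that mimic the columns of train.csv (not real data)
toy <- data.frame(
  Survived = c(1, 0, 1, 0, 1, 0, 0, 1),
  Pclass   = c(1, 3, 2, 3, 1, 3, 2, 1),
  Sex      = c("female", "male", "female", "male", "female", "male", "male", "male"),
  Age      = c(29, NA, 35, 22, NA, 40, 31, 54),
  Cabin    = c("C85", "", "", "", "B20", "", "", "E46"),
  stringsAsFactors = FALSE
)

# Count missing values per column; read.csv keeps a blank Cabin as ""
missing_counts <- sapply(toy, function(col) sum(is.na(col) | col == ""))
print(missing_counts)

# The mean of the 0/1 Survived column is the survival rate within each group
rate_by_sex   <- tapply(toy$Survived, toy$Sex, mean)
rate_by_class <- tapply(toy$Survived, toy$Pclass, mean)
print(rate_by_sex)
print(rate_by_class)
```

On the real training set, the same two tapply calls give the survival rate per sex and per ticket class, which is the comparison the figure makes visually.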
 
Method
A decision tree, one method of classification, is used in this task because the data are low-dimensional, most variables are categorical, and the model is easy to interpret and explain.
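To make the idea of a classification tree more concrete, here is a minimal sketch of how one candidate split can be scored with Gini impurity, which is the default splitting criterion of rpart for classification. The helper names gini and split_impurity and the toy vectors are hypothetical illustrations, not part of rpart or of the Kaggle data.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity of the two child nodes produced by a logical split
split_impurity <- function(labels, goes_left) {
  n <- length(labels)
  n_left <- sum(goes_left)
  (n_left / n) * gini(labels[goes_left]) +
    ((n - n_left) / n) * gini(labels[!goes_left])
}

# Hypothetical toy labels and predictors (not taken from the Kaggle data)
survived <- c(1, 1, 1, 0, 0, 0, 0, 1)
sex      <- c("female", "female", "female", "male", "male", "male", "male", "male")
pclass   <- c(1, 2, 3, 3, 3, 1, 2, 1)

parent_impurity <- gini(survived)
by_sex   <- split_impurity(survived, sex == "female")
by_class <- split_impurity(survived, pclass == 1)

# The split with the lower weighted impurity separates the classes better
print(c(parent = parent_impurity, sex = by_sex, pclass = by_class))
```

In this toy example the split on sex leaves purer child nodes than the split on first class, so a tree grown this way would prefer it at the root.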
Result
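I use the rpart package in R to predict the test data, selecting only age, sex and pclass as predictor variables. The decision tree is plotted with fancyRpartPlot, and the predictions agree with gender_submission in 96.8% of cases. Note that gender_submission is Kaggle's sample submission, which assumes that all female passengers and no male passengers survived, so this figure measures agreement with that baseline rather than accuracy against the true outcomes.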
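The comparison with gender_submission amounts to measuring how often two label vectors agree. A minimal sketch of that calculation follows; the vectors predicted and baseline are made-up values for six passengers, standing in for the tree's predictions and the Survived column of gender_submission.csv.

```r
# Hypothetical predictions and baseline labels for six test passengers
passenger_id <- 892:897
predicted <- c(0, 1, 0, 0, 1, 0)
baseline  <- c(0, 1, 0, 0, 1, 1)  # e.g. 1 for female, 0 for male

# Share of passengers for which the two label vectors agree
agreement <- mean(predicted == baseline)
print(agreement)

# Cross-tabulation showing where the two sets of labels differ
confusion <- table(Predicted = predicted, Baseline = baseline)
print(confusion)

# Passengers on which the tree and the baseline disagree
print(passenger_id[predicted != baseline])
```

With the real files, the same mean(...) expression applied to the predicted labels and the baseline column would reproduce the percentage reported above, assuming both are in the same passenger order.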

Conclusion

Even though the accuracy is surprisingly high, this report has many limitations, such as the handling of missing data and the selection of predictor variables. In future, more comparisons should be made between different decision tree algorithms and classification methods.
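As one possible way to address the missing-data limitation, the gaps in age could be filled with the median of the observed ages. This is only a sketch of a common option, not what was done in this report; the vector age and the names age_median, age_was_missing and age_filled are hypothetical.

```r
# Hypothetical ages with gaps (not real passenger data)
age <- c(22, NA, 38, 26, NA, 35, 54)

# Median of the observed ages, ignoring the missing entries
age_median <- median(age, na.rm = TRUE)

# Fill the gaps and keep a flag recording which values were imputed
age_was_missing <- is.na(age)
age_filled <- ifelse(age_was_missing, age_median, age)

print(age_median)
print(age_filled)
```

Keeping the age_was_missing flag as an extra column would let a later model distinguish observed ages from imputed ones.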


R scripts

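# Install the required packages (only needed once)
install.packages(c("ggplot2", "rpart", "rpart.plot", "rattle"))
library(ggplot2)
library(rpart)
library(rpart.plot)
library(rattle)

# Load the Kaggle training and test sets
train <- read.csv("train.csv")
test <- read.csv("test.csv")

# Age distribution by survival, faceted by ticket class and sex
ggplot(train, aes(x = Age, fill = factor(Survived))) +
  geom_histogram() +
  facet_grid(Pclass ~ Sex)

# Fit a classification tree on age, sex and ticket class
fit <- rpart(Survived ~ Age + Sex + Pclass, data = train, method = "class")
fancyRpartPlot(fit)

# Predict the test set and write the submission file
# (Kaggle expects the columns PassengerId and Survived)
testresult <- predict(fit, test, type = "class")
solution <- data.frame(PassengerId = test$PassengerId, Survived = testresult)
write.csv(solution, file = "solution.csv", row.names = FALSE)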


