#The problem statement is here
#These are just raw notes from working through the problem, not yet reorganized, so they lack structure.
#This was my first time using R and my first Kaggle competition, so parts of the analysis exist purely as R practice. I implemented all of the main classification algorithms; with the same features, random forests gave the best results, so the later improvements all build on the random forest model.
#The plotting parts follow https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md, which is where I started learning R plotting.
#That tutorial was also my first introduction to R, and I followed it closely at the start, so some features come from it; a few of those features, such as FamilySize, did not actually improve accuracy.
#Current score: 0.81818, rank 68/2257 (ranks 40-69 all score 0.81818). I have been stuck here for a long time...
1. Reading the data
> train <- read.csv("F:/断鸿/project/Titanic/train.csv", stringsAsFactors = FALSE)
> test <- read.csv("F:/断鸿/project/Titanic/test.csv", stringsAsFactors = FALSE)
> head(train)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 S
2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
3 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 S
4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
5 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 S
6 6 0 3 Moran, Mr. James male NA 0 0 330877 8.4583 Q
> str(train)
'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
2. Exploring the data
2.1 Missing values
> train[train == ""] <- NA
> sum(is.na(train$Cabin)) / nrow(train)
[1] 0.7710438
> sum(is.na(train$Age)) / nrow(train)
[1] 0.1986532
77% of the Cabin values are missing, which is very hard to impute, so Cabin is left out of the initial analysis.
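The per-column checks above can be done in one pass (a sketch; it assumes the empty strings have already been converted to NA as above):

```r
# Count NAs in every column and express them as a fraction of rows
na_fraction <- colSums(is.na(train)) / nrow(train)
# Show only the columns that actually have missing values, worst first
sort(na_fraction[na_fraction > 0], decreasing = TRUE)
```

This surfaces Cabin, Age, and Embarked in one call instead of testing columns one by one.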
2.2 Analysis of each variable
2.2.1 Pclass
> prop.table(table(train$Pclass, train$Survived), 1)
0 1
1 0.3703704 0.6296296
2 0.5271739 0.4728261
3 0.7576375 0.2423625
> mosaicplot(train$Pclass ~ train$Survived, main="Survival ~ Pclass", shade=FALSE, color=TRUE, xlab = "Pclass", ylab="Survived")
2.2.2 Sex
> prop.table(table(train$Sex, train$Survived), 1)
0 1
female 0.2579618 0.7420382
male 0.8110919 0.1889081
> mosaicplot(train$Sex ~ train$Survived, main="Survival ~ Sex", shade=FALSE, color=TRUE, xlab = "Sex", ylab="Survived")
2.2.3 Age
> boxplot(train$Age ~ train$Survived, main="Survival ~ Age", ylab = "Age", xlab="Survived")
2.2.4 SibSp
> prop.table(table(train$SibSp, train$Survived), 1)
0 1
0 0.6546053 0.3453947
1 0.4641148 0.5358852
2 0.5357143 0.4642857
3 0.7500000 0.2500000
4 0.8333333 0.1666667
5 1.0000000 0.0000000
8 1.0000000 0.0000000
2.2.5 Parch
> prop.table(table(train$Parch, train$Survived), 1)
0 1
0 0.6563422 0.3436578
1 0.4491525 0.5508475
2 0.5000000 0.5000000
3 0.4000000 0.6000000
4 1.0000000 0.0000000
5 0.8000000 0.2000000
6 1.0000000 0.0000000
2.2.6 Fare
> boxplot(train$Fare ~ train$Survived, main="Survival ~ Fare", ylab = "Fare", xlab="Survived")
2.2.7 Embarked
> prop.table(table(train$Embarked, train$Survived), 1)
0 1
C 0.4464286 0.5535714
Q 0.6103896 0.3896104
S 0.6630435 0.3369565
2.2.8 Name
The name itself should not affect survival, but it encodes information that might.
From a name such as "Braund, Mr. Owen Harris" we can extract the title, which in turn hints at marital status, social standing, and nationality: Dona, Lady, and Countess all mark wealthy, high-status passengers, and Countess points to continental Europe. These titles are too scattered, though; some occur only once, so similar, rare titles are merged into one class. As for nationality, the very common titles such as Mr and Mrs are used across many countries and cannot discriminate, so nationality is ignored. We can therefore add one column for status/upbringing and one for marital status.
> test$Survived <- NA
> combi <- rbind(train, test)
> combi$Title <- sapply(combi$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
> combi$Title <- sub(' ', '', combi$Title)
> table(combi$Title)
Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle
1 4 1 1 8 1 1 2 61 260 2
Mme Mr Mrs Ms Rev Sir the Countess
1 757 197 2 8 1 1
> combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
> combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess')] <- 'Lady'
> combi$Title[combi$Title %in% c('Capt', 'Don', 'Major', 'Sir', 'Jonkheer', 'Dr')] <- 'Sir'
> combi$Title <- factor(combi$Title)
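The strsplit-plus-sub extraction above can also be written as a single regular expression (a sketch; it is equivalent for names in the "Surname, Title. Given names" format):

```r
# One-step alternative: delete everything up to ", " and everything from
# the first "." onwards, leaving just the title between them
title <- gsub('(.*, )|(\\..*)', '', combi$Name)
```

The second alternation removes the period together with the given names, so no separate whitespace-stripping step is needed.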
Given names carry little information. Surnames, combined with the sibling/parent counts, could in principle tell us who is family with whom, but surnames repeat too often for a reliable match; the best we can do is find passengers who share a surname and report the same number of family members on board.
> combi$FamilySize <- combi$SibSp + combi$Parch + 1
> combi$Surname <- sapply(combi$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})
> combi$FamilyID <- paste(as.character(combi$FamilySize), combi$Surname, sep="")
> combi$FamilyID[combi$FamilySize <= 2] <- 'Small'
> famIDs <- data.frame(table(combi$FamilyID))
> famIDs <- famIDs[famIDs$Freq <= 2,]
> combi$FamilyID[combi$FamilyID %in% famIDs$Var1] <- 'Small'
> combi$FamilyID <- factor(combi$FamilyID)
2.2.9 Ticket
Tickets come in several visibly different formats. I could not find out what the letter prefixes mean, but tickets with different letters, or with and without letters, clearly belong to different groups, so they can be classified on that basis.
> combi$TicketClass[strtoi(combi$Ticket) < 10000] <- 1
> combi$TicketClass[strtoi(combi$Ticket) < 100000 & strtoi(combi$Ticket) >= 10000] <- 2
> combi$TicketClass[strtoi(combi$Ticket) >= 100000 & strtoi(combi$Ticket) < 200000] <- 3
> combi$TicketClass[strtoi(combi$Ticket) >= 200000 & strtoi(combi$Ticket) < 300000] <- 4
> combi$TicketClass[strtoi(combi$Ticket) >= 300000 & strtoi(combi$Ticket) < 400000] <- 5
> combi$TicketClass[is.na(strtoi(combi$Ticket))] <- 6
> combi$TicketClass <- factor(combi$TicketClass)
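The six range assignments above can be collapsed with cut (a sketch; it reproduces the same boundaries, including the original's gap for purely numeric tickets of 400000 and above, which stay NA in both versions):

```r
num <- strtoi(combi$Ticket)   # NA whenever the ticket contains letters
tc  <- cut(num, breaks = c(-Inf, 10000, 100000, 200000, 300000, 400000),
           labels = 1:5, right = FALSE)
tc  <- as.character(tc)
tc[is.na(num)] <- "6"         # letter tickets form their own class
combi$TicketClass <- factor(tc)
```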
In summary, all of the above can be treated as candidate predictors: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Title, Married (to be validated), FamilySize, FamilyID, Ticket (to be validated).
Train first with the validated predictors.
2.3 Imputing missing values
> which(is.na(combi$Pclass))
integer(0)
> which(is.na(combi$Age))
[1] 6 18 20 27 29 30 32 33 37 43 46 47 48 49 56 65 66 77 78 83 88 96 102 108 110 122 127 129 141 155 159 160
[33] 167 169 177 181 182 186 187 197 199 202 215 224 230 236 241 242 251 257 261 265 271 275 278 285 296 299 301 302 304 305 307 325
[65] 331 335 336 348 352 355 359 360 365 368 369 376 385 389 410 411 412 414 416 421 426 429 432 445 452 455 458 460 465 467 469 471
[97] 476 482 486 491 496 498 503 508 512 518 523 525 528 532 534 539 548 553 558 561 564 565 569 574 579 585 590 594 597 599 602 603
[129] 612 613 614 630 634 640 644 649 651 654 657 668 670 675 681 693 698 710 712 719 728 733 739 740 741 761 767 769 774 777 779 784
[161] 791 793 794 816 826 827 829 833 838 840 847 850 860 864 869 879 889 902 914 921 925 928 931 933 939 946 950 957 968 975 976 977
[193] 980 983 985 994 999 1000 1003 1008 1013 1016 1019 1024 1025 1038 1040 1043 1052 1055 1060 1062 1065 1075 1080 1083 1091 1092 1097 1103 1108 1111 1117 1119
[225] 1125 1135 1136 1141 1147 1148 1157 1158 1159 1160 1163 1165 1166 1174 1178 1180 1181 1182 1184 1189 1193 1196 1204 1224 1231 1234 1236 1249 1250 1257 1258 1272
[257] 1274 1276 1300 1302 1305 1308 1309
> which(is.na(combi$Sex))
integer(0)
> which(is.na(combi$SibSp))
integer(0)
> which(is.na(combi$Parch))
integer(0)
> which(is.na(combi$Fare))
[1] 1044
> combi$Fare[1044] <- median(combi$Fare, na.rm=TRUE)
> which(is.na(combi$Embarked))
[1] 62 830
> combi$Embarked[c(62,830)] = "S"
> which(is.na(combi$Title))
integer(0)
> which(is.na(combi$FamilySize))
integer(0)
> which(is.na(combi$FamilyID))
integer(0)
Age is imputed with a decision tree:
> library(rpart)
> Agefit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize, data=combi[!is.na(combi$Age),], method="anova")
> combi$Age[is.na(combi$Age)] <- predict(Agefit, combi[is.na(combi$Age),])
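A quick sanity check on the imputation (a sketch; it re-runs the same rpart fit but saves the NA mask first, since the mask must be taken before the values are filled in):

```r
# Save the NA mask before imputing so imputed rows can be inspected afterwards
missing_age <- is.na(combi$Age)
Agefit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize,
                data = combi[!missing_age, ], method = "anova")
combi$Age[missing_age] <- predict(Agefit, combi[missing_age, ])
# The imputed ages should sit in a plausible range, comparable to the observed ones
summary(combi$Age[missing_age])
summary(combi$Age[!missing_age])
```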
3. Models
> combi$Sex <- factor(combi$Sex)
> combi$Embarked <- factor(combi$Embarked)
> combi$Survived <- factor(combi$Survived)
> train <- combi[1:891,]
> test <- combi[892:1309,]
3.1 Decision tree
> library(rattle)
Rattle: A free graphical interface for data mining with R.
Version 3.3.0 Copyright (c) 2006-2014 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
> library(rpart.plot)
> library(RColorBrewer)
> fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=train, method="class")
> fancyRpartPlot(fit)
Result: 0.78469
3.2 Random forest
3.2.1 The randomForest package
This package has an upper limit on the number of factor levels it can handle, and FamilyID exceeds it, so its levels are merged further to reduce their number.
> library(randomForest)
> combi$FamilyID2 <- combi$FamilyID
> combi$FamilyID2 <- as.character(combi$FamilyID2)
> combi$FamilyID2[combi$FamilySize <= 3] <- 'Small'
> combi$FamilyID2 <- factor(combi$FamilyID2)
> train <- combi[1:891,]
> test <- combi[892:1309,]
> set.seed(415)
> fit <- randomForest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID2, data=train, importance=TRUE, ntree=2000)
Result: 0.77512
3.2.2 The party package (able to handle factors with more levels)
> library(party)
> set.seed(415)
> fit <- cforest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID, data = train, controls=cforest_unbiased(ntree=2000, mtry=3))
> Prediction <- predict(fit, test, OOB=TRUE, type = "response")
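The cforest prediction is then written out as a submission the same way as for the other models (a sketch; "submission.csv" is a placeholder filename, not the path used in the actual run):

```r
# Kaggle expects exactly two columns: PassengerId and the predicted Survived
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "submission.csv", row.names = FALSE)  # placeholder path
```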
Result: 0.81340
#Under the same conditions, the party random forest gives the best result of all the methods tried.
#All subsequent feature tuning therefore uses the random forest model.
3.3 SVM
> library(kernlab)
Attaching package: ‘kernlab’
The following object is masked from ‘package:modeltools’:
prior
> svm.model <- ksvm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID, data = train)
Result: 0.78947
3.4 kNN
> library(kknn)
> knn.model <- kknn(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID, train, test)
> submit <- data.frame(PassengerId = test$PassengerId, Survived = knn.model$fit)
Result: 0.73206
3.5 AdaBoost
3.5.1 The adabag package
> library(adabag)
> ada.model <- boosting(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID, data=train)
> Prediction <- predict(ada.model, test)
> submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction$class)
Result: 0.69856
3.5.2 The ada package
> library(ada)
> ada.model <- ada(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID, data=train)
Result: 0.71770
> Prediction <- predict(fit,test)
> submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
> write.csv(submit, file = "F:/断鸿/project/Titanic/ .csv", row.names = FALSE)
5. Optimization
5.1 Tuning the random forest parameters
Original (3rd) ntree=2000, mtry=3 0.81340
4th ntree=3000, mtry=5 0.79426
5th ntree=2000, mtry=5 0.79904
6th ntree=2000, mtry=2 0.81818
7th ntree=1000, mtry=2 0.81340
8th ntree=3000, mtry=2 0.81818
9th ntree=5000, mtry=2 0.81818
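Rather than spending a Kaggle submission on every setting, the out-of-bag predictions from cforest can score each candidate locally (a sketch, assuming train and the feature set from section 3.2.2 are in scope; OOB accuracy will not match the leaderboard exactly but usually ranks settings similarly):

```r
library(party)
for (m in c(2, 3, 5)) {
  set.seed(415)
  fit <- cforest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare +
                   Embarked + Title + FamilySize + FamilyID,
                 data = train,
                 controls = cforest_unbiased(ntree = 2000, mtry = m))
  # Calling predict without newdata and with OOB=TRUE returns
  # out-of-bag predictions for the training rows
  oob <- predict(fit, OOB = TRUE, type = "response")
  cat("mtry =", m, "OOB accuracy =", mean(oob == train$Survived), "\n")
}
```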
5.2 Adding TicketClass
5.2.1 All letter-prefixed tickets in one class; numeric tickets classified by leading digits
> fit <- cforest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID + TicketClass, data = train, controls=cforest_unbiased(ntree=2000, mtry=2))
1st ntree=2000, mtry=2 0.80383
5.2.2 Two classes only: letters vs. numbers
> combi$TicketClass2[combi$TicketClass != 6] <- 1
> combi$TicketClass2[combi$TicketClass == 6] <- 2
> train <- combi[1:891,]
> test <- combi[892:1309,]
> fit <- cforest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID + TicketClass2, data = train, controls=cforest_unbiased(ntree=2000, mtry=2))
2nd ntree=2000, mtry=2 0.81818
Tune the parameters again:
3rd ntree=5000, mtry=2 0.81340
4th ntree=1000, mtry=2 0.80861
5th ntree=3000, mtry=2 0.81340
6th ntree=2000, mtry=3 0.81818
7th ntree=2000, mtry=5 0.81340
5.2.3 Classifying both letters and numbers
The letter prefixes include A/5 PC STON/O2 PP C.A. SC/Paris S.C./A.4. CA S.P. S.O.C. SO/C W./C.
SOTON/OQ W.E.P. ... No good. There are far too many to classify, and splitting them up would just be a recipe for overfitting.
5.3 Age
Not many rows in the training set lack Age; could the fitted ages be acting as noise?
Since the test set also has many missing ages, imputation is still needed and is done as before, but for training we drop the rows that originally had no Age.
> combi$isAgethere[!is.na(combi$Age)] <- 1
> combi$isAgethere[is.na(combi$Age)] <- 0
> combi$isAgethere <- factor(combi$isAgethere)
> train <- train[train$isAgethere == "1",]
ntree=2000, mtry=2 0.80861
Better to keep the imputed rows after all.
5.4 Weights
Some predictors matter more than others, and some, such as the family features, may be largely redundant. ##Setting this keeps throwing errors; how on earth is it supposed to be written?...
5.5 Marital status
For men, marital status can occasionally be inferred (e.g. from a same-surname married woman or child on board), but such cases are too few to be representative and the inference may be wrong. Women can be split by title into married, unmarried, and unknown. Since most men did not survive, the effect of male marital status is ignored for now and all men go into one class, giving four classes in total.
> combi$Married <- 0
> combi$Married[combi$Title %in% c("Capt", "Col", "Don", "Dr", "Jonkheer", "Major", "Master", "Mr", "Rev", "Sir")] <- 1
> combi$Married[combi$Married != 1 & combi$Title %in% c("Dona", "Lady", "Mme", "Mrs", "the Countess")] <- 2
> combi$Married[combi$Married != 1 & combi$Married != 2 & combi$Title %in% c("Miss", "Mlle")] <- 3
> combi$Married <- factor(combi$Married)
> train <- combi[1:891,]
> test <- combi[892:1309,]
> fit <- cforest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID + Married, data = train, controls=cforest_unbiased(ntree=2000, mtry=2))
ntree=2000, mtry=2 0.80383
5.6 The above changed one factor at a time; what about combinations?
Adding everything at once: ntree=2000, mtry=2 0.79904. So marital status really has to go?
Without marital status:
> fit <- cforest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilyID + CabinClass + FamilySize + TicketClass2, data = train, controls=cforest_unbiased(ntree=2000, mtry=3))
ntree=2000, mtry=2 0.81818
ntree=5000, mtry=2 0.81818
5.7 Dropping FamilySize
ntree=2000, mtry=2 0.81818
5.8 Cabin
Too much Cabin data is missing, but the first letter of Cabin indicates the deck. For first and second class the deck may not have affected their escape, but at least in third class, position, i.e. distance to the lifeboats, could have mattered.
> combi$CabinClass <- 0
> combi$CabinClass[substr(combi$Cabin,1 ,1) == "A"] <- 1
> combi$CabinClass[substr(combi$Cabin,1 ,1) == "B"] <- 2
> combi$CabinClass[substr(combi$Cabin,1 ,1) == "C"] <- 3
> combi$CabinClass[substr(combi$Cabin,1 ,1) == "D"] <- 4
> combi$CabinClass[substr(combi$Cabin,1 ,1) == "E"] <- 5
> combi$CabinClass[substr(combi$Cabin,1 ,1) == "F"] <- 6
> combi$CabinClass[substr(combi$Cabin,1 ,1) == "G"] <- 7
> combi$CabinClass <- factor(combi$CabinClass)
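The seven assignments above compress to one line with match (a sketch; decks A-G map to 1-7 and anything else, including missing cabins, falls through to 0 exactly as before):

```r
# match() returns the position of the deck letter within A..G, or 0 if absent
combi$CabinClass <- factor(match(substr(combi$Cabin, 1, 1), LETTERS[1:7], nomatch = 0))
```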
> train <- combi[1:891,]
> test <- combi[892:1309,]
ntree=2000, mtry=2 0.81340
5.10 Fare
Fare does not look like each passenger's own ticket price; it is more like each member's share of a family total. First class has fares of only a few dollars, even 0, while third class has plenty of fares of tens of dollars. Dropping Fare: ntree=2000, mtry=2 0.80861
Why did it get worse? So Fare is not noise after all?
6. Open questions and puzzles
6.1 Stuck at 0.81818 for a long time now. What other features could be extracted, or how else could the method be improved?
6.2 Setting weights in the random forest always throws an error.
6.3 For accuracies above 0.85, even up to 0.97, the forum consensus is that it is either cheating or overfitting. Is it really impossible to hit 0.9 without overfitting?