Picking up where we left off in Part 2: we have filled in all the missing Age values. Let's review the data as it stood at the end of last time:
str(full)
Now we want to add two new variables, Child and Mother.
Before doing that, let's look at the relationship between age and survival, split by sex.
# First we'll look at the relationship between age & survival
ggplot(full[1:891,], aes(Age, fill = factor(Survived))) +
  geom_histogram() +
  # I include Sex since we know (a priori) it's a significant predictor
  facet_grid(.~Sex) +
  theme_few()
Next we create Child and Mother. We define a Child as anyone younger than 18; everyone else is an Adult.
# Create the column Child, and indicate whether child or adult
full$Child[full$Age < 18] <- 'Child'
full$Child[full$Age >= 18] <- 'Adult'

# Show counts
table(full$Child, full$Survived)
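Then comes Mother. Here a mother is defined as a passenger who is female, has at least one parent or child aboard (Parch > 0), does not carry the title Miss, and is older than 18.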
# Adding Mother variable
full$Mother <- 'Not Mother'
full$Mother[full$Sex == 'female' & full$Parch > 0 & full$Age > 18 & full$Title != 'Miss'] <- 'Mother'

# Show counts
table(full$Mother, full$Survived)
Convert the two new variables to factors:
# Finish by factorizing our two new factor variables
full$Child <- factor(full$Child)
full$Mother <- factor(full$Mother)
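To make the two definitions concrete, here is a minimal sketch on a made-up data frame. The `toy` data frame and its values are hypothetical and chosen only to hit the boundary cases; it uses `ifelse()` rather than the indexed assignment above, which should give the same labels when Age has no missing values.

```r
# Toy passengers (hypothetical values, not rows from the Titanic data)
toy <- data.frame(
  Sex   = c('female', 'female', 'female', 'male', 'female'),
  Age   = c(30, 18, 17, 40, 25),
  Parch = c(2, 1, 1, 3, 0),
  Title = c('Mrs', 'Mrs', 'Miss', 'Mr', 'Mrs'),
  stringsAsFactors = FALSE
)

# Child: younger than 18; everyone else is an adult
toy$Child <- ifelse(toy$Age < 18, 'Child', 'Adult')

# Mother: female, Parch > 0, older than 18, and not titled 'Miss'
toy$Mother <- ifelse(
  toy$Sex == 'female' & toy$Parch > 0 & toy$Age > 18 & toy$Title != 'Miss',
  'Mother', 'Not Mother'
)

# Convert both label columns to factors, as in the main text
toy$Child  <- factor(toy$Child)
toy$Mother <- factor(toy$Mother)

print(toy[, c('Sex', 'Age', 'Parch', 'Title', 'Child', 'Mother')])
```

Note the second row: an 18-year-old counts as an Adult (Age >= 18) but not as a Mother, because the Mother rule uses a strict Age > 18.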
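With that, the variables we need are mostly in place, and next we predict survival! Recall from Part 1 that the data originally came as a training set and a test set; now we split the combined data back into those two.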
# Split the data back into a train set and a test set
train <- full[1:891,]
test <- full[892:1309,]
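The split relies on the combined data keeping the training rows first and the test rows after them. A small sketch of the same combine-then-split idea, using made-up data frames (`toy_train`, `toy_test` and their values are hypothetical) and row counts instead of hard-coded indices:

```r
# Toy stand-ins for the two Kaggle files (hypothetical values)
toy_train <- data.frame(PassengerId = 1:4,
                        Survived    = c(0, 1, 1, 0),
                        Age         = c(22, 38, 26, 35))
toy_test  <- data.frame(PassengerId = 5:6,
                        Survived    = NA,   # unknown for the test set
                        Age         = c(34, 47))

# Combine so that feature engineering is applied to both sets at once
toy_full <- rbind(toy_train, toy_test)

# Split back using the original row counts
n_train     <- nrow(toy_train)
train_again <- toy_full[seq_len(n_train), ]
test_again  <- toy_full[(n_train + 1):nrow(toy_full), ]

# Sanity check: only the test rows should lack a Survived value
print(table(is.na(train_again$Survived)))
print(table(is.na(test_again$Survived)))
```

Checking that Survived is missing only in the test part is a cheap way to confirm the row boundary is in the right place.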
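Then we pick a seed, so the random forest is reproducible, and start the machine learning!
We select the variables we consider most important and first fit our model on the training set: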
# Set a random seed
set.seed(754)

# Build the model (note: not all possible variables are used)
rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
                           Fare + Embarked + Title + FsizeD + Child + Mother,
                         data = train)

# Show model error
plot(rf_model, ylim=c(0,0.36))
legend('topright', colnames(rf_model$err.rate), col=1:3, fill=1:3)
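In the plot above, the black line is the overall out-of-bag (OOB) error rate, which ends up a little under 20%, so this is a passable model. The other two lines are the error rates for the individual classes (0 and 1).
Next, let's look at the importance of each variable.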
# Get importance
importance <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance),
                            Importance = round(importance[ ,'MeanDecreaseGini'],2))

# Create a rank variable based on importance
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(Importance))))

# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance),
                           y = Importance, fill = Importance)) +
  geom_bar(stat='identity') +
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
            hjust=0, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variables') +
  coord_flip() +
  theme_few()
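It looks like Title, which we built earlier, is quite a useful variable, while the variables created later were less successful. It may also be that the model itself is not that polished, but Megan's attempts are still well worth appreciating.
Finally, of course, we predict survival for the test set: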
# Predict using the test set
prediction <- predict(rf_model, test)

# Save the solution to a dataframe with two columns: PassengerId and Survived (prediction)
solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)

# Write the solution to file
write.csv(solution, file = 'rf_mod_Solution.csv', row.names = FALSE)
Finally, let's compare our predictions (solution) against gender_submission.csv and work out the proportion that match. First we read that file in.
gender_submission <- read.csv('F:/Users/yueh/Desktop/titanic08/gender_submission.csv', stringsAsFactors = F)
solution$submission <- as.factor(gender_submission$Survived)
table(solution$Survived, solution$submission)
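The table gives the counts; the matching rate is the sum of the diagonal cells divided by the total. A minimal sketch of that arithmetic on made-up labels (the `predicted` and `reference` vectors are hypothetical, not the real output):

```r
# Hypothetical predictions and reference labels (not the real Titanic output)
predicted <- factor(c(0, 1, 1, 0, 1, 0, 0, 1), levels = c(0, 1))
reference <- factor(c(0, 1, 0, 0, 1, 0, 1, 1), levels = c(0, 1))

# Cross-tabulate: rows are predictions, columns are reference labels
agreement_table <- table(predicted, reference)
print(agreement_table)

# Share of matching labels: diagonal cells over all cells
agreement_rate <- sum(diag(agreement_table)) / sum(agreement_table)

# The same number, computed directly from the two vectors
direct_rate <- mean(as.character(predicted) == as.character(reference))

print(agreement_rate)
print(direct_rate)
```

Applied to the objects above, the direct version would presumably be `mean(as.character(solution$Survived) == as.character(solution$submission))`, assuming both columns hold 0/1 labels in the same passenger order.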
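So our predictions match gender_submission.csv for 88.9% of the test passengers. Keep in mind that this file is Kaggle's sample submission, a baseline that predicts survival from sex alone, rather than the true test labels, so the figure measures agreement with that baseline and not real accuracy. That's all for this post. Thanks for reading!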