[R] Hands-on Data Analysis: Titanic Survival Analysis (Part 3)

Picking up where Part 2 left off: we have now filled in all the missing Age values. Let's review the data as it stood at the end of last time:

str(full)

Now we want to add two new variables, Child and Mother.

Before that, let's look at the relationship between age and survival, broken down by sex.

# First we'll look at the relationship between age & survival
ggplot(full[1:891,], aes(Age, fill = factor(Survived))) +
  geom_histogram() +
  # I include Sex since we know (a priori) it's a significant predictor
  facet_grid(.~Sex) +
  theme_few()

Next we create Child and Mother. We define a Child as anyone younger than 18; everyone else is an adult.

# Create the column child, and indicate whether child or adult
full$Child[full$Age < 18] <- 'Child'
full$Child[full$Age >= 18] <- 'Adult'

# Show counts
table(full$Child, full$Survived)
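The two assignments above can also be written as a single ifelse() call. The following is a minimal sketch on made-up ages (the vector `age` is hypothetical, not taken from the Titanic data); it also shows that a missing age would stay NA rather than being labelled.

```r
# Toy ages (made up for illustration, not taken from the Titanic data)
age <- c(4, 17, 18, 35, NA)

# ifelse() assigns both labels in one pass; a missing age stays NA
child <- ifelse(age < 18, "Child", "Adult")

print(child)
print(table(child, useNA = "ifany"))
```

Since Age was already imputed in Part 2, the NA case should not arise in `full`, but the check is cheap if the rule is reused on other data.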

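Next is Mother. Here a mother is defined as a passenger who is female, has at least one parent or child aboard (Parch > 0), does not have the title Miss, and is older than 18.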

# Adding Mother variable
full$Mother <- 'Not Mother'
full$Mother[full$Sex == 'female' & full$Parch > 0 & full$Age > 18 & full$Title != 'Miss'] <- 'Mother'

# Show counts
table(full$Mother, full$Survived)
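To see how the four conditions combine, here is a small sketch on a hypothetical data frame (`toy` and its values are invented for illustration). Each row fails or passes for a different reason, and only a row that satisfies all four conditions is labelled Mother.

```r
# Toy passengers (hypothetical values, for illustration only)
toy <- data.frame(
  Sex   = c("female", "female", "female", "male", "female"),
  Parch = c(2, 0, 1, 3, 1),
  Age   = c(30, 40, 16, 45, 25),
  Title = c("Mrs", "Mrs", "Miss", "Mr", "Miss"),
  stringsAsFactors = FALSE
)

# All four conditions must hold at once
is_mother <- toy$Sex == "female" & toy$Parch > 0 &
  toy$Age > 18 & toy$Title != "Miss"

toy$Mother <- ifelse(is_mother, "Mother", "Not Mother")
print(toy)
```

Note that the rule uses Age > 18 while Child uses Age < 18, so an 18-year-old counts as an adult but can never be flagged as a mother.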

Convert the two new variables to factors:

# Finish by factorizing our two new factor variables
full$Child  <- factor(full$Child)
full$Mother <- factor(full$Mother)

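With that, the variables we need are mostly in place, and next we will predict survival! Recall from Part 1 that the data came as a training set and a test set; now we split the combined data back into those two sets.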

# Split the data back into a train set and a test set
train <- full[1:891,]
test <- full[892:1309,]
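The split above relies on fixed row positions. An alternative sketch, assuming the test rows carry NA in Survived after the two sets were combined (an assumption about how `full` was built, not something shown in this part), is to split on that column instead. The data frame `toy_full` below is a hypothetical stand-in.

```r
# Toy stand-in for `full`: the last two rows play the role of test rows
toy_full <- data.frame(
  PassengerId = 1:5,
  Survived    = c(0, 1, 1, NA, NA)
)

# Rows with a known outcome go to train, the rest to test
toy_train <- toy_full[!is.na(toy_full$Survived), ]
toy_test  <- toy_full[is.na(toy_full$Survived), ]

# Every row should land in exactly one of the two sets
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy_full))
print(nrow(toy_train))
print(nrow(toy_test))
```

Either way, a quick row-count check like the one above helps catch an off-by-one in the indices.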

Pick a random seed, and let the machine learning begin!
We choose the variables we consider most important and first fit our model on the training set:

# Set a random seed
set.seed(754)

# Build the model (note: not all possible variables are used)
rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
                                            Fare + Embarked + Title +
                                            FsizeD + Child + Mother,
                                            data = train)

# Show model error
plot(rf_model, ylim=c(0,0.36))
legend('topright', colnames(rf_model$err.rate), col=1:3, fill=1:3)

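In the plot, the black line is the overall out-of-bag (OOB) error rate, which stays a little under 20%, so this is a reasonably good model.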
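Instead of reading the error off the plot, the number can be taken from the model object. The sketch below assumes the randomForest package is installed and uses the built-in iris data purely as a stand-in (`rf_toy` is a hypothetical model, not the Titanic one); the same indexing should apply to `rf_model`.

```r
library(randomForest)

set.seed(1)
# iris is a built-in dataset, used here only as a stand-in for the Titanic data
rf_toy <- randomForest(Species ~ ., data = iris, ntree = 100)

# err.rate has one row per tree; the "OOB" column is the overall error,
# and the remaining columns are the per-class errors
final_oob <- rf_toy$err.rate[nrow(rf_toy$err.rate), "OOB"]
print(colnames(rf_toy$err.rate))
print(round(final_oob, 3))
```

The last row is used because it reflects the error once all trees have been grown.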

Next, let's look at the importance of each variable.

# Get importance
importance    <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance),
                            Importance = round(importance[ ,'MeanDecreaseGini'],2))

# Create a rank variable based on importance
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(Importance))))

# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance),
    y = Importance, fill = Importance)) +
  geom_bar(stat='identity') +
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
    hjust=0, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variables') +
  coord_flip() +
  theme_few()
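The Rank labels come from dense_rank(desc(Importance)): the largest value gets rank 1, ties share a rank, and no rank numbers are skipped. As a rough illustration of that behaviour, here is a base-R sketch on made-up importance values (`importance_toy` is hypothetical and unrelated to the model's actual output).

```r
# Made-up importance values; Sex and Fare are tied on purpose
importance_toy <- c(Title = 70.2, Sex = 55.1, Fare = 55.1, Child = 3.4)

# Position of each value among the distinct values, sorted high to low
dense <- match(importance_toy, sort(unique(importance_toy), decreasing = TRUE))

rank_labels <- paste0("#", dense)
names(rank_labels) <- names(importance_toy)
print(rank_labels)
```

Tied variables receive the same label, and the next distinct value takes the following rank.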

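It looks like Title, which we built earlier, is a fairly useful variable, while the other variables created later were less successful. It may also simply be that the model is not fully tuned, but Megan's attempts are still well worth appreciating.

Finally, of course, we predict survival for our test set: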

# Predict using the test set
prediction <- predict(rf_model, test)

# Save the solution to a dataframe with two columns: PassengerId and Survived (prediction)
solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)

# Write the solution to file
write.csv(solution, file = 'rf_mod_Solution.csv', row.names = FALSE)

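Finally, let's compare our predictions (solution) with a reference answer and compute the prediction success rate. First, read the reference file in. Note that on Kaggle, gender_submission.csv is the sample submission that predicts survival from sex alone, so this comparison measures agreement with that baseline rather than accuracy against the true outcomes.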

gender_submission <- read.csv('F:/Users/yueh/Desktop/titanic08/gender_submission.csv', stringsAsFactors = F)

solution$submission <- as.factor(gender_submission$Survived)
table(solution$Survived, solution$submission)
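The success rate can be computed from such a table as the sum of the diagonal (cases where the two columns agree) divided by the total count. A minimal sketch on made-up labels (`predicted` and `reference` are hypothetical vectors, not the Titanic results):

```r
# Made-up labels for illustration; these are not the Titanic results
predicted <- factor(c(0, 1, 1, 0, 1, 0, 0, 1, 0, 0), levels = c(0, 1))
reference <- factor(c(0, 1, 0, 0, 1, 0, 1, 1, 0, 0), levels = c(0, 1))

# Rows are predictions, columns are reference labels
tab <- table(predicted, reference)

# Share of cases on the diagonal, i.e. where both labels agree
agreement <- sum(diag(tab)) / sum(tab)
print(tab)
print(agreement)
```

Fixing the factor levels on both vectors keeps the table square, so the diagonal always lines up the same labels.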

From this table, the success rate of our predictions against that file is 88.9%. That's all for this post; thank you, everyone.
