[R] Hands-On Data Analysis: Titanic Survival Analysis (Part 2)

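Picking up from the previous post, today we handle missing values. Let's start by looking at the structure the data ended up in last time.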

str(full)

First, we use table to see which distinct values Embarked takes.

table(full$Embarked)

Two of the values turn out to be blank, so let's locate them.

filter(full, Embarked == "")
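Blank strings are easy to overlook because is.na() does not flag them. Here is a minimal base-R sketch of the same check on a toy data frame; the toy object and its values are invented for illustration and are not the Titanic data.

```r
# Toy data standing in for `full`; all values are made up.
toy <- data.frame(
  PassengerId = 1:5,
  Embarked    = c("S", "", "C", "Q", ""),
  stringsAsFactors = FALSE
)

# is.na() does not treat "" as missing, so count blanks explicitly.
n_na    <- sum(is.na(toy$Embarked))
n_blank <- sum(toy$Embarked == "")

# Row positions of the blank entries.
blank_rows <- which(toy$Embarked == "")

# Optionally turn blanks into NA so later checks handle them uniformly.
toy$Embarked[blank_rows] <- NA
```

Whether to convert blanks to NA is a design choice; the post keeps them as blanks and fills them in directly later on.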

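We can see that rows 62 and 830 are the blank ones. There are several ways to deal with missing values: we can drop the rows, or we can impute a value. Before deciding, let's look at what the rest of the data looks like. We put Embarked (port of embarkation) on the x-axis and fare on the y-axis, and draw a separate box plot for each of the three passenger classes.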

# Get rid of our missing passenger IDs
embark_fare <- full %>%
  filter(PassengerId != 62 & PassengerId != 830)

# Use ggplot2 to visualize embarkment, passenger class, & median fare
ggplot(embark_fare, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
  geom_boxplot() +
  geom_hline(aes(yintercept = 80),
             colour = 'red', linetype = 'dashed', lwd = 2) +
  scale_y_continuous(labels = dollar_format()) +
  theme_few()

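Both passengers paid a Fare of $80 and travelled in Pclass 1. Compared against the red $80 line in the plot, that makes port C the most likely place passengers 62 and 830 boarded, so we can reasonably set their Embarked value to "C".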

# Since their fare was $80 for 1st class, they most likely embarked from 'C'
full$Embarked[c(62, 830)] <- 'C'
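The same reasoning can be checked numerically instead of by eye. The sketch below computes the median fare per port and class on an invented toy data frame, then picks the first-class port whose median is closest to a known fare of 80. The numbers are made up, so only the technique carries over, not the result.

```r
# Toy data; fares are invented for illustration only.
toy <- data.frame(
  Embarked = c("C", "C", "C", "S", "S", "S", "Q", "Q"),
  Pclass   = c(1, 1, 3, 1, 3, 3, 3, 3),
  Fare     = c(78, 82, 10, 50, 8, 9, 7, 8)
)

# Median fare for every (port, class) combination present in the data.
med <- aggregate(Fare ~ Embarked + Pclass, data = toy, FUN = median)

# Among first-class groups, find the port whose median is closest to 80.
first_class <- med[med$Pclass == 1, ]
closest <- as.character(
  first_class$Embarked[which.min(abs(first_class$Fare - 80))]
)
```

On the real data the equivalent call would use embark_fare in place of toy.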

Next, let's check whether Fare has any missing values.

filter(full, is.na(Fare))

Row 1044 is the one with the missing fare, so let's look at that record directly.

full[1044, ]

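Row 1044 has Pclass = 3 and Embarked = S. Going by the earlier plot, the safest choice is to set this fare to the median for passengers with Pclass = 3 and Embarked = S. Let's draw one more plot to check that idea.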

ggplot(full[full$Pclass == '3' & full$Embarked == 'S', ],
       aes(x = Fare)) +
  geom_density(fill = '#99d6ff', alpha = 0.4) +
  geom_vline(aes(xintercept = median(Fare, na.rm = TRUE)),
             colour = 'red', linetype = 'dashed', lwd = 1) +
  scale_x_continuous(labels = dollar_format()) +
  theme_few()

Indeed, for Pclass = 3 and Embarked = S the fares cluster around the median (the red line), so we can be fairly comfortable setting the fare in row 1044 to that median.

# Replace missing fare value with median fare for class/embarkment
full$Fare[1044] <- median(full[full$Pclass == '3' & full$Embarked == 'S', ]$Fare, na.rm = TRUE)
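As a variation, the fill can be written without a hard-coded row number by selecting the missing entries with a logical condition. This is a minimal sketch on an invented toy data frame, not the post's own code; in_group and group_median are names introduced here for illustration.

```r
# Toy data; values are invented for illustration only.
toy <- data.frame(
  Pclass   = c(3, 3, 3, 3, 1),
  Embarked = c("S", "S", "S", "S", "C"),
  Fare     = c(7, 8, NA, 9, 80)
)

# Rows belonging to the group we want to borrow the median from.
in_group <- toy$Pclass == 3 & toy$Embarked == "S"

# Median of the observed fares in that group (NA values are ignored).
group_median <- median(toy$Fare[in_group], na.rm = TRUE)

# Fill only the fares that are missing AND belong to the group.
toy$Fare[is.na(toy$Fare) & in_group] <- group_median
```

Selecting by condition keeps working if rows are reordered, which an index like 1044 does not.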

Next, Megan L. Risdal takes a less common approach: predicting the missing values.

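Age also has quite a lot of missing values, but Age looks too important for predicting survival to simply drop. Let's first see how many values are actually missing.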

# Show number of missing Age values
sum(is.na(full$Age))

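A full 263 ages are missing, so before predicting survival, Megan decides to predict age first. The choice here is the mice package, which performs multiple imputation: after dropping some less useful variables, the remaining variables are used together for the imputation, as follows: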

# Make variables factors into factors
factor_vars <- c('PassengerId','Pclass','Sex','Embarked',
                 'Title','Surname','Family','FsizeD')

full[factor_vars] <- lapply(full[factor_vars], function(x) as.factor(x))

# Set a random seed
set.seed(129)

# Perform mice imputation, excluding certain less-than-useful variables:
mice_mod <- mice(full[, !names(full) %in% c('PassengerId','Name','Ticket','Cabin','Family','Surname','Survived')], method='rf')

# Save the complete output
mice_output <- complete(mice_mod)

The imputed output contains the 11 variables above, but we only need the imputed Age.

A natural question is whether this imputation is actually any good. Let's compare the distributions before and after.

# Plot age distributions
par(mfrow=c(1,2))
hist(full$Age, freq=F, main='Age: Original Data',
     col='darkgreen', ylim=c(0,0.04))
hist(mice_output$Age, freq=F, main='Age: MICE Output',
     col='lightgreen', ylim=c(0,0.04))
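Alongside the histograms, the two distributions can also be compared by their quantiles. The sketch below uses invented data: age_original stands in for full$Age and age_imputed for mice_output$Age, and the gaps are filled by crude resampling of observed values rather than by mice, purely so the example runs with base R.

```r
set.seed(42)

# Invented stand-in for full$Age: 200 observed ages and 50 missing ones.
age_original <- c(pmax(round(rnorm(200, mean = 30, sd = 14)), 1), rep(NA, 50))

# Invented stand-in for mice_output$Age: gaps filled by resampling
# observed ages (NOT what mice does; just a placeholder fill).
age_imputed <- age_original
gaps <- is.na(age_imputed)
age_imputed[gaps] <- sample(age_original[!gaps], sum(gaps), replace = TRUE)

# Side-by-side quantiles of the two distributions.
probs <- c(0.10, 0.25, 0.50, 0.75, 0.90)
comparison <- rbind(
  original = quantile(age_original, probs, na.rm = TRUE),
  imputed  = quantile(age_imputed, probs)
)

# Largest difference across the compared quantiles.
max_gap <- max(abs(comparison["original", ] - comparison["imputed", ]))
print(comparison)
```

Small gaps between the two rows suggest the imputation did not distort the overall shape, which is the same point the histograms make visually.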

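The two plots show very similar distributions, so this looks like a reasonable imputation model. We therefore take only Age from the output, since the imputed values for the other variables may be poor and are not what we are after.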

# Replace Age variable from the mice model.
full$Age <- mice_output$Age

# Show new number of missing Age values
sum(is.na(full$Age))
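Before overwriting a column with imputed values, it can be reassuring to confirm that the observed entries are unchanged and only the gaps were filled. This is a minimal sketch with invented vectors: age_before stands in for the original full$Age and age_after for mice_output$Age.

```r
# Invented stand-ins; the numbers are made up for illustration.
age_before <- c(22, NA, 35, NA, 54)
age_after  <- c(22, 28, 35, 41, 54)

# Which positions were missing before imputation?
was_missing <- is.na(age_before)

# Observed values should be identical before and after.
observed_unchanged <- identical(age_before[!was_missing],
                                age_after[!was_missing])

# No gaps should remain, and we can count how many were filled.
none_missing <- !anyNA(age_after)
n_filled     <- sum(was_missing)
```

On the real data, the same check would need a copy of full$Age saved before the replacement.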

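Finally, we write the imputed Age back into our original data. That's it for missing values today. In the next post we'll do another round of feature engineering on the current data, and then finally predict the survival we're after. That's all for today; thanks for reading.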

Original source: [https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic]
