-
Exploratory data analysis
Working with the data
summary(nc$gained)
This is similar to summarize(), except that you cannot choose which statistics are computed; it returns a fixed set of simple summary statistics:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 20.00 30.00 30.33 38.00 85.00 27
The output shows that 27 values are missing.
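To make the difference between summary() and summarise() concrete, here is a minimal sketch. The `toy` data frame and its values are made up for illustration and only stand in for nc$gained; the column names `mean_gained`, `sd_gained`, and `n_missing` are arbitrary choices.

```r
library(dplyr)

# Hypothetical toy data standing in for nc$gained (not the real dataset)
toy <- data.frame(gained = c(10, 20, NA, 30, 40))

# summary() reports a fixed set of statistics plus the number of NAs
summary(toy$gained)

# summarise() lets you choose the statistics yourself;
# na.rm = TRUE tells mean() and sd() to skip missing values
toy_stats <- toy %>%
  summarise(
    mean_gained = mean(gained, na.rm = TRUE),
    sd_gained   = sd(gained, na.rm = TRUE),
    n_missing   = sum(is.na(gained))
  )
toy_stats
```

Without na.rm = TRUE, mean() and sd() would return NA as soon as one value is missing, which is easy to forget when moving from summary() to summarise().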
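Consider the relationship between two variables: a mother's smoking habit and the weight of her baby.
Note: one of these variables is numerical and the other is categorical. In my earlier disease-report project I got this wrong and tried a scatterplot-style geom or a line plot; because the variable types differ, that simply could not work. A much better approach is the side-by-side boxplot.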
ggplot(data = nc, mapping = aes(x = habit, y = weight)) +
geom_boxplot()
The plot shows:
- Median birth weight of babies born to non-smoker mothers is slightly higher than that of babies born to smoker mothers.
- Range of birth weights of babies born to non-smoker mothers is greater than that of babies born to smoker mothers.
- The IQRs of the distributions are roughly equal.
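The plot gives a rough picture of the trend. For a more precise numerical comparison of the two groups (here, their mean weights), use: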
nc %>%
group_by(habit) %>%
summarise(mean_weight = mean(weight))
## # A tibble: 3 x 2
## habit mean_weight
## <fct> <dbl>
## 1 nonsmoker 7.14
## 2 smoker 6.83
## 3 <NA> 3.63
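Always remember: group_by() is given a categorical variable (as are the select() and filter() steps in these exercises), while summarise() computes statistics on a numerical variable. (I keep mixing these up.)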
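The grouped summary above also reports a row for mothers whose habit is missing. A minimal sketch of one way to drop that row and to compare medians alongside means follows. The `toy` data frame is made up for illustration and is not the nc dataset.

```r
library(dplyr)

# Hypothetical toy data: habit is categorical, weight is numerical
toy <- data.frame(
  habit  = factor(c("nonsmoker", "nonsmoker", "smoker", "smoker", NA)),
  weight = c(7.0, 8.0, 6.0, 7.0, 4.0)
)

by_habit <- toy %>%
  filter(!is.na(habit)) %>%          # drop rows with an unknown habit
  group_by(habit) %>%                # group on the categorical variable
  summarise(                         # summarise the numerical variable
    mean_weight   = mean(weight),
    median_weight = median(weight)
  )
by_habit
```

Filtering before grouping keeps the NA group out of the table; whether dropping those rows is appropriate depends on the analysis.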
-
Inference
The powerful inference() function
1. Hypothesis test
inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
y: the response variable (numerical).
x: the explanatory variable (categorical); it splits the data into groups.
data: the data frame these variables are stored in.
statistic: the sample statistic in use, matching the population parameter being estimated, e.g. "mean", "median", or "proportion".
type: a hypothesis test ("ht") or a confidence interval ("ci").
null: the null value, needed when performing a hypothesis test; here it is 0 because the null hypothesis is that the two group means are equal.
alternative: can be "less", "greater", or "twosided".
method: "theoretical" or "simulation".
2. Confidence interval
inference(y = weight, data = nc, statistic = "mean",
type = "ci",conf_level = 0.95 , method = "theoretical")
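The arguments are essentially the same as for the hypothesis test. This call has no x because it builds an interval for a single mean; supply x as well to get an interval for the difference between two groups.
y: the response variable (numerical).
data: the data frame these variables are stored in.
conf_level: the confidence level as a proportion, e.g. 0.90 or 0.95.
statistic: the sample statistic in use, matching the population parameter being estimated, e.g. "mean", "median", or "proportion".
type: a hypothesis test ("ht") or a confidence interval ("ci").
order: when x is supplied, order = c("smoker","nonsmoker") sets which group is subtracted from which (here smoker minus nonsmoker).
method: "theoretical" or "simulation".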
-
A Non-Inference Task:
Determine the age cutoff for younger and mature mothers.
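Approach: keep the rows where mature is not NA and equals younger mom, then apply min() and max() to mage inside summarize() to get the age extremes within the younger mom group; then repeat the same filtering for mature mom.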
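The approach above can be sketched as follows. Instead of filtering twice, this version groups by mature so both categories are handled in one pass. The `toy` data frame and its ages are made up for illustration, so the resulting cutoff says nothing about the real nc data; the column names mage and mature follow the notes.

```r
library(dplyr)

# Hypothetical toy data (not the nc dataset)
toy <- data.frame(
  mage   = c(19, 25, 34, 35, 41, 30),
  mature = factor(c("younger mom", "younger mom", "younger mom",
                    "mature mom", "mature mom", NA))
)

cutoffs <- toy %>%
  filter(!is.na(mature)) %>%   # drop rows with an unknown category
  group_by(mature) %>%         # one row per category
  summarise(
    min_age = min(mage),
    max_age = max(mage)
  )
cutoffs
```

The cutoff is read off by comparing the maximum age among younger mothers with the minimum age among mature mothers.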