Statistics with R-Inferential statistics-Week 3-Foundations for inference - Inference for numerical

羊shy

Published 2018-07-28 22:16:07


Article link: https://blog.csdn.net/weixin_41808937/article/details/81265883


Exploratory data analysis

Working with the data

summary(nc$gained)

This is similar to summarise(), except that you cannot choose which statistics to compute. It returns a fixed set of simple summary statistics:

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 20.00 30.00 30.33 38.00 85.00 27

The output shows that gained has 27 missing values.
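As a minimal sketch of where the NA's column comes from, the snippet below uses a made-up vector (the values are invented and are not taken from nc). It shows that the count summary() reports is the same one you get from is.na(), and that functions such as mean() need na.rm = TRUE once missing values are present.

```r
# Toy stand-in for nc$gained; the values are invented for illustration.
gained <- c(0, 20, 30, NA, 38, 85, NA)

# summary() reports quartiles, the mean, and the number of NA's.
print(summary(gained))

# Count the missing values directly.
n_missing <- sum(is.na(gained))

# mean() returns NA unless the missing values are dropped first.
mean_gained <- mean(gained, na.rm = TRUE)
```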

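Next, consider the relationship between two variables: a mother's smoking habit and the weight of her baby.

Note: one of these variables is numerical and the other is categorical. When I worked on the disease report project earlier I got this wrong and tried to draw a scatterplot or a line plot; with variables of different types that simply does not work. A side-by-side boxplot is a much better approach here.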

ggplot(data = nc, mapping = aes(x = habit, y = weight)) +
  geom_boxplot()

The plot shows:

Median birth weight of babies born to non-smoker mothers is slightly higher than that of babies born to smoker mothers.
Range of birth weights of babies born to non-smoker mothers is greater than that of babies born to smoker mothers.
The IQRs of the distributions are roughly equal.

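The plot only shows the rough trend. For a more precise comparison of the two groups, compute the group means (the original note said "medians", but the code below uses mean()):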

nc %>%
  group_by(habit) %>%
  summarise(mean_weight = mean(weight))

## # A tibble: 3 x 2
## habit mean_weight
## <fct> <dbl>
## 1 nonsmoker 7.14
## 2 smoker 6.83
## 3 <NA> 3.63

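Be sure to remember: in this workflow the grouping and subsetting steps (group_by(), filter()) operate on the categorical variable, while summarise() computes statistics of the numerical variable. (I keep getting this wrong.)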
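The <NA> row in the output above appears because some mothers have no recorded habit. A minimal sketch of one way to drop those rows before grouping is shown here. It uses a small invented data frame named toy (an assumption for illustration, not the nc data) and also computes the median, since the boxplot compares medians.

```r
library(dplyr)

# Invented data with the same column names as nc; values are illustrative only.
toy <- data.frame(
  habit  = c("nonsmoker", "nonsmoker", "smoker", "smoker", NA),
  weight = c(7.0, 7.4, 6.6, 7.0, 3.6)
)

by_habit <- toy %>%
  filter(!is.na(habit)) %>%          # drop rows with an unknown habit
  group_by(habit) %>%                # categorical variable defines the groups
  summarise(
    mean_weight   = mean(weight),    # numerical variable is summarised
    median_weight = median(weight)
  )

print(by_habit)
```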

Inference

The powerful inference() function

1. Hypothesis test

inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")

y: the response variable (numerical).

x: the explanatory variable (categorical), which splits the data into groups.

data: the data frame these variables are stored in.

statistic: the sample statistic we are using (equivalently, the population parameter we are estimating), e.g. "mean", "median", or "proportion".

type: a hypothesis test ("ht") or a confidence interval ("ci").

null: the null value, needed when performing a hypothesis test (0 here, i.e. no difference between the group means).

alternative: the direction of the alternative hypothesis; can be "less", "greater", or "twosided".

method: "theoretical" or "simulation".
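To make the arguments above more concrete, here is a rough cross-check using base R's t.test() on an invented data frame (toy is a hypothetical stand-in for nc). This is a different function from inference(): mu plays the role of null, alternative is spelled "two.sided", and t.test() uses the Welch approximation for the degrees of freedom, so its p-value need not match the inference() output exactly.

```r
# Invented data; column names mirror nc but the values are illustrative only.
toy <- data.frame(
  habit  = rep(c("nonsmoker", "smoker"), each = 6),
  weight = c(7.1, 7.5, 6.9, 8.0, 7.3, 7.8,   # nonsmoker
             6.5, 6.9, 7.0, 6.2, 6.8, 7.1)   # smoker
)

# Two-sided test of H0: mean(nonsmoker) - mean(smoker) = 0.
res <- t.test(weight ~ habit, data = toy,
              mu = 0, alternative = "two.sided")

print(res)

# Observed difference in sample means (first factor level minus second).
obs_diff <- unname(res$estimate[1] - res$estimate[2])
```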

2. Confidence interval

inference(y = weight, data = nc, statistic = "mean", 
          type = "ci", conf_level = 0.95, method = "theoretical")

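The arguments are largely the same as for the hypothesis test. This call omits x because it builds an interval for a single mean (the average birth weight); an interval for the difference between two groups would keep x = habit.

y: the response variable (numerical).

data: the data frame these variables are stored in.

conf_level: the confidence level as a decimal, e.g. 0.90 or 0.95.

statistic: the sample statistic we are using (equivalently, the population parameter we are estimating), e.g. "mean", "median", or "proportion".

type: a hypothesis test ("ht") or a confidence interval ("ci").

order: when the interval is for a difference between groups (that is, when x is supplied), order = c("smoker","nonsmoker") sets which group is subtracted from which.

method: "theoretical" or "simulation".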
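As a sketch of what a theoretical 95% interval for a mean computes, the snippet below builds it by hand as mean ± t* × SE on invented values and compares it with base R's t.test(). It then shows how the direction of a two-group difference can be controlled in base R through factor level order, which is only an analogue of the order argument, not the inference() implementation. All data and variable names here are assumptions for illustration.

```r
# Invented birth weights; not taken from nc.
habit  <- rep(c("nonsmoker", "smoker"), each = 6)
weight <- c(7.1, 7.5, 6.9, 8.0, 7.3, 7.8,
            6.5, 6.9, 7.0, 6.2, 6.8, 7.1)

# One-sample interval by hand: mean +/- t* x standard error.
n         <- length(weight)
se        <- sd(weight) / sqrt(n)
t_star    <- qt(0.975, df = n - 1)   # 0.975 leaves 2.5% in each tail
manual_ci <- mean(weight) + c(-1, 1) * t_star * se

# The same interval from t.test().
builtin_ci <- as.numeric(t.test(weight, conf.level = 0.95)$conf.int)

# Difference interval as smoker minus nonsmoker: put "smoker" first.
habit_f <- factor(habit, levels = c("smoker", "nonsmoker"))
diff_ci <- as.numeric(t.test(weight ~ habit_f, conf.level = 0.95)$conf.int)

print(manual_ci)
print(diff_ci)
```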

A Non-Inference Task:

Determine the age cutoff for younger and mature mothers.

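Approach: filter to the rows where mature is not NA and equals younger mom, then apply min() and max() to mage inside summarise() to get the age extremes for the younger mom group. Then do the same after filtering for mature mom.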
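The approach above can be sketched as follows on an invented data frame (toy and its ages are made up, so the actual cutoff still has to be read from nc). As a small variation, group_by(mature) is used so that both groups are summarised in one pass instead of filtering twice.

```r
library(dplyr)

# Invented data with the same column names as nc; ages are illustrative only.
toy <- data.frame(
  mature = c("younger mom", "younger mom", "younger mom",
             "mature mom", "mature mom", NA),
  mage   = c(19, 27, 34, 35, 41, 30)
)

age_range <- toy %>%
  filter(!is.na(mature)) %>%   # drop rows with no classification
  group_by(mature) %>%         # one row per category
  summarise(
    min_age = min(mage),
    max_age = max(mage)
  )

print(age_range)
```

The cutoff lies between the largest age in the younger mom group and the smallest age in the mature mom group.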
