Statistics with R-Inferential statistics-Week 3-Foundations for inference - Inference for numerical

  • Exploratory data analysis

Working with the data

summary(nc$gained)

This is similar to summarize(), but you cannot choose which statistics are computed; it reports a fixed set of basic statistics, as shown below:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   20.00   30.00   30.33   38.00   85.00      27

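The output shows that 27 values are missing.

Next, consider the relationship between two variables: the mother's smoking habit and the weight of her baby.

Note: one of these variables is numerical and the other is categorical. When I worked on the disease report project earlier I got this wrong and tried to draw the plot with geom_plot() (probably meaning geom_point()) or a line. Because the variable types differ, that simply does not work. A side-by-side boxplot is a good approach here.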
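To make the contrast between summary() and summarise() concrete, here is a minimal sketch on a small made-up data frame (a stand-in, not the real nc data). It assumes dplyr is installed. summary() returns its fixed set of statistics, while summarise() computes only what is asked for and needs na.rm = TRUE when values are missing.

```r
library(dplyr)

# Toy stand-in for nc$gained; the values are made up for illustration.
toy <- data.frame(gained = c(10, 20, 30, NA, 40))

# Fixed set of statistics, plus a count of NA's.
summary(toy$gained)

# Statistics of our own choosing; na.rm = TRUE drops the missing value.
toy_stats <- toy %>%
  summarise(mean_gained = mean(gained, na.rm = TRUE),
            sd_gained   = sd(gained, na.rm = TRUE),
            n_missing   = sum(is.na(gained)))
toy_stats
```

Without na.rm = TRUE, mean() and sd() would return NA for a column that contains missing values.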

ggplot(data = nc, mapping = aes(x = habit, y = weight)) +
  geom_boxplot()

From the plot we can see:

  1. Median birth weight of babies born to non-smoker mothers is slightly higher than that of babies born to smoker mothers.
  2. Range of birth weights of babies born to non-smoker mothers is greater than that of babies born to smoker mothers.
  3. The IQRs of the distributions are roughly equal.

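The plot shows the rough trend. For a more precise numerical comparison between the groups (the code below compares means rather than medians), use: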

nc %>%
  group_by(habit) %>%
  summarise(mean_weight = mean(weight))

## # A tibble: 3 x 2
##   habit     mean_weight
##   <fct>           <dbl>
## 1 nonsmoker        7.14
## 2 smoker           6.83
## 3 <NA>             3.63
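The <NA> row in this output comes from mothers whose habit was not recorded. As a minimal sketch on made-up data (a stand-in for nc, assuming dplyr is installed), those rows can be dropped first, and the median can be reported next to the mean so that the numbers line up with what the boxplot shows:

```r
library(dplyr)

# Toy stand-in for nc; the values are made up for illustration.
toy <- data.frame(
  habit  = factor(c("nonsmoker", "nonsmoker", "nonsmoker",
                    "smoker", "smoker", NA)),
  weight = c(7.0, 7.5, 8.0, 6.0, 7.0, 3.6)
)

by_habit <- toy %>%
  filter(!is.na(habit)) %>%                    # drop rows with unknown habit
  group_by(habit) %>%                          # group on the categorical variable
  summarise(median_weight = median(weight),    # summarise the numerical variable
            mean_weight   = mean(weight))
by_habit
```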

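Be sure to remember: in these exercises, grouping (group_by) and subsetting (filter) are done on the categorical variable, while the statistics in summarize() are computed on the numerical variable. select() only picks columns, whatever their type. (I keep getting this wrong.)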

  • Inference

  The powerful inference() function

1. Hypothesis test

inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")

y: the response variable (numerical).

x: the explanatory variable (categorical); it splits the data into groups.

data: the data frame these variables are stored in.

statistic: the sample statistic we're using (equivalently, the population parameter we're estimating), e.g. "mean", "median", or "proportion".

type: a hypothesis test ("ht") or a confidence interval ("ci").

When performing a hypothesis test, we also need:

null: the null value.

alternative: can be "less", "greater", or "twosided".

method: "theoretical" or "simulation".
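As a rough cross-check of what a theoretical two-sided test for a difference in means involves, base R's t.test() can be run on the same kind of data. This is a sketch on simulated data, not the nc data, and it is an assumption that its numbers will be close to those from inference(): t.test() uses Welch's degrees of freedom by default, which may differ from what inference() uses internally.

```r
set.seed(1)

# Simulated stand-in for nc; group sizes, means and sds are made up.
toy <- data.frame(
  habit  = rep(c("nonsmoker", "smoker"), times = c(80, 20)),
  weight = c(rnorm(80, mean = 7.1, sd = 1.5),
             rnorm(20, mean = 6.8, sd = 1.4))
)

# H0: mu_nonsmoker - mu_smoker = 0, against a two-sided alternative.
res <- t.test(weight ~ habit, data = toy, mu = 0, alternative = "two.sided")

res$estimate    # the two group means
res$statistic   # t statistic
res$p.value     # two-sided p-value
```

Here mu plays the role of the null argument, and alternative = "two.sided" corresponds to "twosided" in inference().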

2. Confidence interval

inference(y = weight, data = nc, statistic = "mean",
          type = "ci", conf_level = 0.95, method = "theoretical")

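The arguments are essentially the same as for the hypothesis test. This call has no x because it builds an interval for a single mean; to get an interval for the difference between two group means, supply x as well.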

y: the response variable (numerical).

data: the data frame these variables are stored in.

conf_level: the confidence level as a decimal, e.g. 0.90 or 0.95.

statistic: the sample statistic we're using (equivalently, the population parameter we're estimating), e.g. "mean", "median", or "proportion".

type: a hypothesis test ("ht") or a confidence interval ("ci").

When constructing a confidence interval for a difference between two groups (that is, when x is supplied), we can also set:

order = c("smoker", "nonsmoker"): sets which group is subtracted from which (here, smoker minus nonsmoker).

method: "theoretical" or "simulation".
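For a single mean, the theoretical interval is the sample mean plus or minus a t critical value times the standard error. The following is a minimal sketch of that arithmetic on a simulated sample (made-up values, not the nc data), checked against base R's t.test(); it illustrates the formula and is not a description of how inference() is implemented.

```r
set.seed(2)
weight <- rnorm(50, mean = 7, sd = 1.5)   # made-up sample for illustration

conf_level <- 0.95
n      <- length(weight)
se     <- sd(weight) / sqrt(n)                       # standard error of the mean
t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)   # t critical value
ci     <- mean(weight) + c(-1, 1) * t_star * se      # lower and upper bounds
ci

# The same interval from base R, as a check.
t.test(weight, conf.level = conf_level)$conf.int
```

Raising conf_level widens the interval, because t_star grows while the standard error stays the same.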

  • A Non-Inference Task:

Determine the age cutoff for younger and mature mothers. 

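Approach: filter for the rows where mature is not NA and equals younger mom, then apply min() and max() to mage inside summarize() to get the age extremes of the younger mom group. Then do the same filtering for mature mom.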
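The approach above can be sketched as follows, on a small made-up data frame standing in for nc (the ages are invented, so the cutoff it shows is not the real one). It assumes dplyr is installed and that the column names mature and mage match the data set.

```r
library(dplyr)

# Toy stand-in for nc; ages and labels are made up for illustration.
toy <- data.frame(
  mature = c("younger mom", "younger mom", "younger mom",
             "mature mom", "mature mom", NA),
  mage   = c(19, 27, 34, 35, 41, 30)
)

# Age range of the younger mom group.
younger <- toy %>%
  filter(!is.na(mature), mature == "younger mom") %>%
  summarise(min_age = min(mage), max_age = max(mage))

# Age range of the mature mom group.
older <- toy %>%
  filter(!is.na(mature), mature == "mature mom") %>%
  summarise(min_age = min(mage), max_age = max(mage))

younger
older
```

The cutoff lies between the maximum age in the younger mom group and the minimum age in the mature mom group.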

 
