
# Week 1

## 1.1 Data visualisation

# install.packages("palmerpenguins")
library(palmerpenguins)


help(penguins, package = "palmerpenguins")
# or more simply
?penguins


library(dplyr)
dplyr::glimpse(penguins) # glimpse the structure of the penguins data frame


library(ggplot2) # needed for ggplot(), geom_bar(), facet_grid(), etc.
# stacked proportion of each sex within species, one panel per island
ggplot(data = penguins) + aes(x = species, fill = sex) +
  geom_bar(position = "fill") +
  labs(x = "", y = "Proportion of penguins", fill = "Sex") +
  scale_y_continuous(labels = scales::percent_format()) +
  facet_grid(cols = vars(island), scales = "free_x", space = "free_x") +
  theme_linedraw(base_size = 22)


## 1.2 Data collection

Sample and population

• A sample is part of a population.
• A statistic can be computed from a sample and used to estimate a parameter.
• A statistic summarises what the researcher knows. A parameter is what the researcher wants to know.
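The distinction between a statistic and a parameter can be sketched with a small simulation. Everything here is made up for illustration: the population is simulated, and the names `population`, `mu`, `smp` and `x_bar` are not from the course material.

```r
# Hypothetical population: 10,000 simulated measurements (values are made up).
set.seed(1)
population <- rnorm(10000, mean = 4200, sd = 800)

# Parameter: what the researcher wants to know (rarely observable in practice).
mu <- mean(population)

# Statistic: computed from a simple random sample of 100 units.
smp <- sample(population, size = 100)
x_bar <- mean(smp)

# The statistic is close to, but not equal to, the parameter.
c(parameter = mu, statistic = x_bar)
```

Re-running with a different seed gives a different sample mean, while the population mean stays fixed for a given population.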

Why not observe the whole population?

• It is hard to observe the whole population
• Not enough time
• Not enough money
• Not enough resources

Advantages of sampling:

• Reduces the number of measurements
• Saves time, money and resources
• Might be essential in destructive testing

Definition of sampling

Sampling is the process of selecting a subset of observations from an entire population of interest so that characteristics of the subset (the sample) can be used to draw conclusions or make inferences about the entire population.

Bias

Bias is any factor that favours certain outcomes or responses, or influences an individual’s responses. Bias may be unintentional (accidental), or intentional (to achieve certain results).

• Selection bias / sampling bias: the sample does not accurately represent the population. Example: Attendees at a Star Trek convention may report that their favorite genre is science fiction.

• Non-response bias: Certain groups are under-represented because they elect not to participate. Example: a restaurant may give each table a “customer satisfaction” survey with their bill.

• Measurement or design bias: Bias factors in the sampling method influence the data obtained. Example: a respondent may answer questions in the way she thinks the questioner wants her to answer.

## 1.3 Controlled experiments

Randomised controlled double-blind trials

• Investigators obtain a representative sample of subjects.
• Investigators randomly allocate the subjects into a treatment group and a control group. (Note: "randomly" should bring "independent" to mind.)
• The control group is given a placebo, but neither the subjects nor the investigators know the identity of the 2 groups (double-blind).
• Investigators compare the responses of the 2 groups.
• The design is good because we expect the 2 groups to be similar, hence any difference in the responses is likely to be caused by the treatment.
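The random allocation step above can be sketched in base R. The subject IDs and group sizes are hypothetical; the point is only that shuffling the labels gives every subject the same chance of landing in either group.

```r
set.seed(2)
subjects <- paste0("S", 1:20)   # hypothetical subject IDs

# Shuffle 10 "treatment" and 10 "control" labels at random.
group <- sample(rep(c("treatment", "control"), each = 10))

allocation <- data.frame(subject = subjects, group = group)
table(allocation$group)         # 10 subjects in each group
```

Blinding is a matter of procedure rather than code: in practice the `group` column would be held by a third party until the responses are recorded.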

Observational studies

• By necessity, many research questions require an observational study, rather than a controlled experiment.
• Similarly, most educational research is based on observational studies.
• The conclusions of observational studies require great care.

• A good randomised controlled experiment can establish causation; an observational study can only establish association.
• An observational study may suggest causation, but it can’t prove causation.

• Confounding occurs when the treatment group and control group differ by some third variable (other than the treatment) which influences the response that is studied.
• Confounders can be hard to find, and can mislead about a cause and effect relationship.

Simpson's paradox

• Sometimes there is a clear trend in individual groups of data that disappears or reverses when the groups are pooled together.
• It occurs when relationships between percentages in subgroups are reversed when the subgroups are combined, because of a confounding or lurking variable.
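A small numerical sketch makes the reversal concrete. The counts below are illustrative only and do not come from these notes; `severity` plays the role of the lurking variable.

```r
# Illustrative counts only: successes and totals for two treatments,
# split by a lurking variable (case severity).
d <- data.frame(
  treatment = c("A", "A", "B", "B"),
  severity  = c("mild", "severe", "mild", "severe"),
  success   = c(81, 192, 234, 55),
  total     = c(87, 263, 270, 80)
)

# Success rate within each subgroup: A is ahead in both.
d$rate <- d$success / d$total
d

# Pooled over severity: B is ahead, so the direction reverses.
pooled <- aggregate(cbind(success, total) ~ treatment, data = d, FUN = sum)
pooled$rate <- pooled$success / pooled$total
pooled
```

The reversal happens because treatment A was mostly given to severe cases, which have a lower success rate whichever treatment is used.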

## 1.4 Chi-squared tests

1. Understand the experiment clearly (the first and most important step), then set up the research question
Set hypotheses: H0 vs H1

2. Compute the evidence
Choose a test statistic T
State the assumptions
Select a level of significance (α): common values are 5% and 1%

3. Draw a conclusion
Calculate the p-value
Reject the null hypothesis or do not reject it

Chi-squared tests

Hypotheses

Null hypothesis: The statement against which you search for evidence is called the null hypothesis, and is denoted by H0. It is generally a “no difference” statement.

Alternative hypothesis: The statement you claim is called the alternative hypothesis, and is denoted by H1 (or sometimes you’ll see HA).

Assumptions

• Observations are generally assumed to have been chosen at random from a population.
• We say that such random variables are iid (independently and identically distributed).
• Each test we consider will have its own set of assumptions.

Test statistic

• The observed test statistic, t0, is what we get when we plug our observed data into the formula for the test statistic.

• Large (positive or negative, depending on H1) observed test statistic values are taken as evidence of poor agreement with H0.

Decision

• An observed large positive or negative value of t0 and hence small p-value is taken as evidence of poor agreement with H0.

– If the p-value is small, then either H0 is true and the poor agreement is due to an unlikely event, or H0 is false. The smaller the p-value, the stronger the evidence against the null hypothesis.

A large p-value does not mean that there is evidence that the null hypothesis is true.
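The whole workflow (hypotheses, test statistic, p-value, decision) can be sketched for a chi-squared goodness of fit test. The die-roll counts are hypothetical, and the variable names are chosen here for illustration.

```r
observed <- c(8, 9, 12, 11, 6, 14)   # hypothetical counts from 60 rolls of a die
p0 <- rep(1 / 6, 6)                   # H0: every face has probability 1/6
expected <- sum(observed) * p0        # expected counts under H0 (10 per face)

# Observed test statistic: sum of (observed - expected)^2 / expected
t0 <- sum((observed - expected)^2 / expected)

# p-value: P(chi-squared with 6 - 1 = 5 degrees of freedom >= t0)
p_value <- pchisq(t0, df = length(observed) - 1, lower.tail = FALSE)

# Decision at the 5% level of significance
alpha <- 0.05
p_value < alpha                       # TRUE would mean "reject H0"

# Cross-check against the built-in test
chisq.test(observed, p = p0)
```

Only large values of t0 count against H0 here, which is why the p-value uses the upper tail.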

# Week 2

## 2.1 Goodness of fit tests

In goodness of fit tests we meet two kinds of distributions: discrete distributions and continuous distributions.

• Discrete distribution: I see 1 car, 2 cars, 3 cars; I cannot see 1.23 cars or 3.4 cars. The binomial and Poisson distributions are of this kind.

• Continuous distribution: my weight is 134.56 jin, yours is 100.34 jin, his is 180.2 jin; decimal values are possible. The normal distribution is of this kind.

Poisson distribution

A Poisson random variable counts the number of events occurring in a fixed interval (e.g. a fixed period of time) when these events occur independently at some known average rate λ per unit time. Its distribution gives the probability of observing each possible number of events.
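As a quick check of the definition, the Poisson probabilities P(X = x) = exp(−λ) λ^x / x! can be computed by hand and compared with R's built-in `dpois()`. The rate `lambda <- 3` is an arbitrary value chosen for illustration.

```r
lambda <- 3          # hypothetical average rate per interval
x <- 0:10

# P(X = x) from the formula exp(-lambda) * lambda^x / x!
manual <- exp(-lambda) * lambda^x / factorial(x)

# The same probabilities from the built-in probability function
builtin <- dpois(x, lambda = lambda)

all.equal(manual, builtin)
```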

Chi-squared tests for discrete distributions
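A minimal sketch of a chi-squared goodness of fit test for a Poisson model follows. It makes three assumptions that are not in these notes: the counts are hypothetical, λ is estimated by the sample mean, and the sparse upper cells are pooled so that every expected count is at least 5. One extra degree of freedom is subtracted for the estimated parameter, which is the usual convention.

```r
# Hypothetical data: number of intervals in which 0, 1, 2, 3, 4 events were seen.
x   <- 0:4
obs <- c(30, 38, 20, 9, 3)
n   <- sum(obs)

lambda_hat <- sum(x * obs) / n     # estimate the rate by the sample mean

# Cells 0, 1, 2 and "3 or more" (pooled so expected counts are large enough)
obs_pooled <- c(obs[1:3], sum(obs[4:5]))
p <- c(dpois(0:2, lambda_hat), ppois(2, lambda_hat, lower.tail = FALSE))
expected <- n * p

t0 <- sum((obs_pooled - expected)^2 / expected)
df <- length(obs_pooled) - 1 - 1   # cells - 1 - number of estimated parameters
p_value <- pchisq(t0, df = df, lower.tail = FALSE)

c(t0 = t0, df = df, p_value = p_value)
```

`chisq.test()` is not used for the p-value here because it does not know that λ was estimated and would use one degree of freedom too many.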

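## 2.2 Measures of performance

Types of errors (key points):

• True positive = correctly identified (a positive case that the test detects)

• False positive = incorrectly identified (the result is positive, but the truth is negative)

• True negative = correctly rejected (a negative case that the test reports as negative)

• False negative = incorrectly rejected (the result is negative, but the truth is positive)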
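The four outcomes can be counted directly from paired vectors of true status and test result. The ten labels below are made up for illustration.

```r
# Hypothetical test results for 10 people (made-up labels).
truth  <- c("pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos", "neg", "neg")
result <- c("pos", "neg", "pos", "neg", "pos", "neg", "neg", "pos", "neg", "neg")

tp <- sum(result == "pos" & truth == "pos")  # true positives
fp <- sum(result == "pos" & truth == "neg")  # false positives
tn <- sum(result == "neg" & truth == "neg")  # true negatives
fn <- sum(result == "neg" & truth == "pos")  # false negatives

c(TP = tp, FP = fp, TN = tn, FN = fn)

# The same information as a 2x2 table
table(result, truth)
```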

## 2.3 Measures of risk

Prospective and retrospective studies

• A prospective study is based on subjects who are initially identified as disease-free and classified by presence or absence of a risk factor. (The study is planned in advance around the research question.)
• A random sample from each group is followed forward in time (prospectively) until eventually classified by disease outcome.

Estimating population proportions

Relative risk

The relative risk is the ratio of the probability of having the disease in the group with the risk factor to the probability of having the disease in the group without the risk factor.
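A sketch of the calculation on a hypothetical 2×2 table (the counts and the object name `tab` are made up for illustration):

```r
# Hypothetical prospective study: rows = risk factor, columns = disease outcome.
tab <- matrix(c(30, 70,
                10, 90), nrow = 2, byrow = TRUE,
              dimnames = list(risk_factor = c("yes", "no"),
                              disease     = c("yes", "no")))

# Probability of disease in each risk-factor group
p_exposed   <- tab["yes", "yes"] / sum(tab["yes", ])
p_unexposed <- tab["no", "yes"]  / sum(tab["no", ])

# Relative risk: ratio of the two probabilities
rr <- p_exposed / p_unexposed
rr
```

With these counts the group with the risk factor has three times the estimated probability of disease.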

Odds ratio

• A common alternative to the relative risk is the odds ratio, denoted OR.
• Odds are a ratio of probabilities: the odds of an event with probability p are p / (1 − p). The odds are used as an alternative way of measuring the likelihood of an event occurring.
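The odds ratio can be sketched on a hypothetical 2×2 table; the counts, the helper `odds()` and the other names are illustrative assumptions.

```r
# Hypothetical counts: rows = risk factor, columns = disease outcome.
tab <- matrix(c(30, 70,
                10, 90), nrow = 2, byrow = TRUE,
              dimnames = list(risk_factor = c("yes", "no"),
                              disease     = c("yes", "no")))

odds <- function(p) p / (1 - p)   # odds of an event with probability p

p_exposed   <- tab["yes", "yes"] / sum(tab["yes", ])
p_unexposed <- tab["no", "yes"]  / sum(tab["no", ])

# Odds ratio: odds of disease with the risk factor over odds without it
or <- odds(p_exposed) / odds(p_unexposed)

# Equivalent cross-product form (a * d) / (b * c)
or_cross <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

c(or = or, or_cross = or_cross)
```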

Standard errors and confidence intervals for odds ratios
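The notes give no formula under this heading, so the following is a sketch under a stated assumption: it uses the common large-sample approach in which the log odds ratio is treated as approximately normal with standard error sqrt(1/a + 1/b + 1/c + 1/d). The cell counts are hypothetical.

```r
# Hypothetical 2x2 cell counts:
#                 disease   no disease
# risk factor        n11        n12
# no risk factor     n21        n22
n11 <- 30; n12 <- 70; n21 <- 10; n22 <- 90

log_or <- log((n11 * n22) / (n12 * n21))

# Large-sample standard error of the log odds ratio (assumed method)
se_log_or <- sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)

# Approximate 95% confidence interval, back-transformed to the OR scale
ci <- exp(log_or + c(-1, 1) * qnorm(0.975) * se_log_or)
ci
```

If the interval excludes 1, that is evidence of an association between the risk factor and the disease at the 5% level.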

# Week 3

## 3.1 Testing for homogeneity

Chi-squared test of homogeneity

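With the observed and expected counts in each cell, we can construct a chi-squared test for homogeneity: the test statistic sums (observed − expected)² / expected over all cells.

The expected cell counts are (row total × column total) / overall total.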
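A sketch of the calculation: for a table with row totals, column totals and overall total n, the usual expected count in each cell under homogeneity is (row total × column total) / n. The table below is hypothetical, and `correct = FALSE` is set so that the built-in result matches the hand calculation.

```r
# Hypothetical counts: rows are populations, columns are response categories.
tab <- matrix(c(20, 30, 50,
                30, 30, 40), nrow = 2, byrow = TRUE,
              dimnames = list(group    = c("g1", "g2"),
                              response = c("low", "mid", "high")))
n <- sum(tab)

# Expected counts: (row total * column total) / overall total
expected <- outer(rowSums(tab), colSums(tab)) / n

# Test statistic and p-value with (rows - 1) * (columns - 1) degrees of freedom
t0 <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_value <- pchisq(t0, df = df, lower.tail = FALSE)

# Cross-check against the built-in test
res <- chisq.test(tab, correct = FALSE)
res
```

The test of independence uses the same expected counts and statistic; only the sampling scheme and the wording of the hypotheses differ.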

Testing for homogeneity in general tables

## 3.2 Testing for independence

Testing for independence in 2×2 tables

Independence

Test statistic

## 3.3 Testing in small samples

Fisher’s exact test

• The χ² approximation for the test statistic is only reasonable when n is sufficiently large, that is, when all expected cell frequencies are 5 or more. If this is not the case, we need to take care and consider an exact test, i.e. one that calculates the exact p-value for the test statistic.

• In R the function fisher.test() is available to carry out these calculations both for 2×2 tables and general contingency tables.
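A small sketch of `fisher.test()` on a hypothetical 2×2 table whose expected counts are all below 5, so the chi-squared approximation would not be trusted:

```r
# Hypothetical small-sample 2x2 table (counts are made up).
tab <- matrix(c(3, 1,
                1, 3), nrow = 2, byrow = TRUE,
              dimnames = list(truth = c("A", "B"),
                              guess = c("A", "B")))

# Expected counts under independence: all equal to 2, well below 5
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected

# Exact two-sided test
res <- fisher.test(tab)
res$p.value
```

For a one-sided alternative, `fisher.test()` accepts `alternative = "greater"` or `alternative = "less"`.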

Yates’ chi-squared test
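The notes leave this heading empty, so the following is only a sketch of the standard continuity correction as implemented in R: for 2×2 tables `chisq.test()` applies it by default (`correct = TRUE`), subtracting 0.5 from each |observed − expected| before squaring. The counts are hypothetical.

```r
# Hypothetical 2x2 table (counts are made up).
tab <- matrix(c(12, 8,
                5, 15), nrow = 2, byrow = TRUE)

with_yates    <- chisq.test(tab, correct = TRUE)    # default for 2x2 tables
without_yates <- chisq.test(tab, correct = FALSE)

# Hand calculation of the corrected statistic
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
manual <- sum((abs(tab - expected) - 0.5)^2 / expected)

c(corrected   = unname(with_yates$statistic),
  uncorrected = unname(without_yates$statistic),
  manual      = manual)
```

The corrected statistic is smaller than the uncorrected one, so the corrected test is more conservative.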

# Summary
