These are mainly my notes from the Coursera Basic Statistics course.
Week 1: Exploring Data
Descriptive Statistics
Different Levels of Measurement:
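Nominal, Ordinal, Interval, Ratio. The difference between Interval and Ratio is that zero on an interval scale does not mean "nothing": for example, a temperature of 0 does not mean there is no temperature.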
Central Tendency and Dispersion:
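Measures of central tendency: Mode, Median, Mean (the "3 Ms")
Measures of dispersion: Range, Interquartile Range (IQR), Variance, Standard Deviation
There is also the z-score, which indicates whether an observation is common or exceptional: z = (value − mean) / standard deviation.
The corresponding functions in R: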
Measurement | Function |
---|---|
Mode | N/A |
Mean | mean() |
Median | median() |
Range | range() (returns min and max; diff(range(x)) gives max − min) |
IQR | IQR() |
Variance | var() |
Standard Deviation | sd() |
Z-Scores | scale() (or (x - mean(x)) / sd(x)) |
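To make the table concrete, here is a minimal sketch in R. The data vector and the `stat_mode` helper are my own assumptions for illustration (base R's `mode()` returns the storage mode, not the statistical mode); everything else is a base R function.

```r
# Hypothetical sample data, chosen only for illustration.
x <- c(2, 4, 4, 5, 7, 9, 11)

# Central tendency
mean(x)      # arithmetic mean
median(x)    # middle value of the sorted data

# Hypothetical helper (not part of base R): returns every value that is
# tied for the highest frequency.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(x)

# Dispersion
range(x)         # returns c(min, max), not a single number
diff(range(x))   # width of the range: max - min
IQR(x)           # interquartile range
var(x)           # sample variance (divides by n - 1)
sd(x)            # sample standard deviation

# Z-scores: (value - mean) / standard deviation
z <- (x - mean(x)) / sd(x)
z
```

A z-score far from 0 (a common rule of thumb is beyond about ±2 or ±3) marks an observation as exceptional relative to the rest of the sample.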
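Week 2: Correlation and Regression
Frequency table: one variable
Contingency table: two variables
When both variables are quantitative, we use a scatterplot.
Correlation: Pearson's r ranges over [-1, 1]. A positive value means positive correlation, a negative value means negative correlation, and the absolute value indicates the strength:
0.8-1.0: very strong correlation
0.6-0.8: strong correlation
0.4-0.6: moderate correlation
0.2-0.4: weak correlation
0.0-0.2: very weak or no correlation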
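Regression:
$\hat{y} = a + bx$, where $b = r \frac{s_y}{s_x}$ and $a = \bar{y} - b\bar{x}$.
Pearson's r is computed from z-scores as $r = \frac{\sum z_x z_y}{n - 1}$, where $z$ is the z-score, $n$ is the sample size, and $s_x$, $s_y$ are the standard deviations of x and y.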
Explained variance: The percentage of the variance in the dependent variable that can be explained using the formula of the regression line. You can measure this with r-squared.
Corresponding R functions:
Name | function |
---|---|
Frequency Table or Contingency Table | table() |
Pearson’s r/Correlation | cor() |
Linear Regression | lm() |
Scatter Plot | plot() |
Regression Line | abline() |
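A short sketch tying these functions together. The data are made up for illustration. The checks rely on the standard identities for simple linear regression: the slope equals r times sd(y) / sd(x), the intercept equals mean(y) minus slope times mean(x), and R-squared equals r squared.

```r
# Hypothetical paired observations, for illustration only.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Frequency table for one (hypothetical) categorical variable
table(c("a", "b", "a", "c", "a"))

r   <- cor(x, y)     # Pearson's r
fit <- lm(y ~ x)     # least-squares regression of y on x
a   <- coef(fit)[["(Intercept)"]]
b   <- coef(fit)[["x"]]

# Slope and intercept expressed through r, the standard deviations, and the means
all.equal(b, r * sd(y) / sd(x))
all.equal(a, mean(y) - b * mean(x))

# Pearson's r as the sum of z-score products divided by n - 1
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)
all.equal(r, sum(zx * zy) / (length(x) - 1))

# Explained variance: in simple linear regression, R-squared equals r^2
all.equal(summary(fit)$r.squared, r^2)

plot(x, y)     # scatter plot
abline(fit)    # add the fitted regression line
```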
Week 3: Probability
Experiment
Trial
Outcome
Event
Random Variable
Marginal Probability
Two methods to calculate probability:
- Tree Diagram
- Contingency Table
The complement of $X$ is the event that $X$ does not occur: $P(X^c) = 1 - P(X)$.
Independent intersecting events are two events that do not influence each other and can occur simultaneously. An example is the outcome of rolling two dice.
Disjoint events are mutually exclusive, so only one of them can happen at a time; exhaustive events together cover every possible outcome.
Intersection: $P(A \cap B)$
Union: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Joint Probability: $P(AB)$, i.e. P(A and B)
Conditional Probability: $P(A \mid B) = \frac{P(AB)}{P(B)}$, i.e. P(A given B), reduced sample space
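The union and conditional probability rules can be checked by enumerating a small sample space. This is only a sketch: the die, the events A and B, and the `prob` helper are assumptions made for illustration.

```r
# Sample space: one roll of a fair six-sided die (hypothetical example).
omega <- 1:6
A <- omega[omega %% 2 == 0]   # event A: the roll is even (2, 4, 6)
B <- omega[omega > 3]         # event B: the roll is greater than 3 (4, 5, 6)

# With equally likely outcomes, P(E) = |E| / |omega|.
prob <- function(event) length(event) / length(omega)

p_joint <- prob(intersect(A, B))   # P(A and B): outcomes 4 and 6
p_union <- prob(union(A, B))       # P(A or B): outcomes 2, 4, 5, 6

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
all.equal(p_union, prob(A) + prob(B) - p_joint)

# Conditional probability: P(A | B) = P(A and B) / P(B).
# The sample space shrinks to B, and 2 of its 3 outcomes are even.
p_A_given_B <- p_joint / prob(B)
p_A_given_B
```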
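A bag contains 6 red balls and 4 green balls, and two balls are drawn at random:
Without replacement: dependent events
With replacement: independent events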
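As a worked example, the probability that both balls are red can be computed for each case. The choice of "both red" as the event and the simulation check are my own additions; the ball counts come from the example above.

```r
red <- 6; green <- 4; total <- red + green

# Without replacement: the second draw depends on the first.
# P(both red) = P(first red) * P(second red | first red) = 6/10 * 5/9
p_no_replace <- (red / total) * ((red - 1) / (total - 1))

# With replacement: the draws are independent.
# P(both red) = P(red) * P(red) = 6/10 * 6/10
p_replace <- (red / total) * (red / total)

# Optional simulation check of the without-replacement case.
set.seed(1)
bag   <- c(rep("red", red), rep("green", green))
draws <- replicate(20000, all(sample(bag, 2) == "red"))  # sample() draws without replacement by default
mean(draws)   # should be close to p_no_replace
```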
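If event A and event B are independent:
$P(AB) = P(A) \cdot P(B)$
$P(A \mid B) = \frac{P(AB)}{P(B)} = \frac{P(A) \cdot P(B)}{P(B)} = P(A)$
(For example, when tossing two coins, the result of one coin does not affect the result of the other.)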
And how to calculate $P(AB)$ when events are dependent?
See this course
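Bayes' Law:
$\because P(AB) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)$
$\therefore P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$
where $P(A)$ is called the prior probability, $P(B \mid A)$ is the likelihood, and $P(A \mid B)$ is called the posterior probability.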
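A numeric sketch of Bayes' law, using a diagnostic-test scenario. All three input probabilities are made-up numbers for illustration, and P(B) is obtained from the law of total probability, which these notes do not state explicitly.

```r
# Hypothetical numbers, for illustration only.
p_A             <- 0.01   # prior P(A): the condition is present
p_B_given_A     <- 0.95   # P(B | A): the test is positive when the condition is present
p_B_given_not_A <- 0.05   # P(B | not A): false-positive rate

# Law of total probability: P(B) = P(B | A) P(A) + P(B | not A) P(not A)
p_B <- p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Bayes' law: P(A | B) = P(B | A) * P(A) / P(B)
p_A_given_B <- p_B_given_A * p_A / p_B
p_A_given_B   # about 0.161
```

Even with a fairly accurate test, P(A | B) stays low here because the prior P(A) is small.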
Fallible:
Union
Independence
Bayesian Probability II: It is 0.5 because there are two sides, left and right.
Digression:
Do you know the result of 0.1+0.2 in Python?
Why don’t my numbers add up?
Basic Answers
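The question above is about Python, where `0.1 + 0.2` prints `0.30000000000000004`. The cause is IEEE 754 double-precision arithmetic rather than the language, so R behaves the same way. A small sketch in R, to stay consistent with the rest of these notes:

```r
x <- 0.1 + 0.2

x == 0.3                    # FALSE: neither 0.1 nor 0.2 is exactly representable in binary
sprintf("%.17f", x)         # "0.30000000000000004"

# Compare floating-point numbers with a tolerance instead of ==
isTRUE(all.equal(x, 0.3))   # TRUE
```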
Week 4: Probability distributions
Probability distributions:
- probability mass function (for discrete random variables; the function value equals the probability)
- probability density function (for continuous random variables; the area under the curve equals the probability)
- cumulative probability distribution
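The three kinds of function above can be sketched with R's built-in distribution functions. The coin-toss setup and the interval [-1, 1] are assumptions chosen for illustration.

```r
# Discrete case: number of heads in 10 tosses of a fair coin (hypothetical example).
n <- 10; p <- 0.5
k <- 0:n
pmf <- dbinom(k, size = n, prob = p)   # probability mass function: P(X = k)
sum(pmf)                               # the probabilities sum to 1

# Cumulative distribution: P(X <= 3) is the running sum of the PMF.
cdf_3 <- pbinom(3, size = n, prob = p)
all.equal(cdf_3, sum(pmf[k <= 3]))

# Continuous case: standard normal distribution.
dnorm(0)   # a density value, not a probability

# A probability is the area under the density curve over an interval.
area <- integrate(dnorm, lower = -1, upper = 1)$value
all.equal(area, pnorm(1) - pnorm(-1))  # same area via the cumulative distribution
```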
Mean and variance of a random variable:
The normal distribution:
The binomial distribution: