Notes for Statistics
Sharing my notes from Business Statistics.
by Feiran Jia
Lecture 1 Introduction
Variables
- Quantitative
- Categorical
- Ordinal: categories with a natural order
- Nominal: categories with no natural order, defined first and then assigned numeric codes
Data sets
- Cross-section
- Time series
Sampling Error or Noise
Sampling error is a purely random difference between a sample and the population of interest; it arises because the sample is a random subset of the population.
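To make the definition concrete, here is a small simulation sketch. The population (the integers 1 to 1000), the sample size of 50, and the seed are all made up for illustration; none of them come from the lecture.

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable

# Hypothetical population: the integers 1..1000 (an assumption for illustration).
population = list(range(1, 1001))
population_mean = sum(population) / len(population)

# A simple random sample drawn without replacement.
sample = random.sample(population, 50)
sample_mean = sum(sample) / len(sample)

# Sampling error of the mean: how far this particular sample lands from the
# population value. It changes every time a new sample is drawn.
sampling_error = sample_mean - population_mean
print(f"population mean = {population_mean}, sample mean = {sample_mean}")
print(f"sampling error  = {sampling_error}")
```

Changing the seed gives a different sample and therefore a different sampling error, even though the population is unchanged.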
Lecture 2 Displaying and Describing Quantitative Data
Histogram
- Frequency histogram: bar height = frequency
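- Relative frequency histogram: bar height = frequency / total number of observations, so the bar heights sum to 1
- Density histogram: bar height = relative frequency / bin width, so the bar areas sum to 1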
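The three bar-height definitions can be compared side by side with a short sketch. The helper name `histogram_heights` and the bin edges are hypothetical choices; the values are the eleven observations used in the IQR example in these notes, and the sketch assumes every observation falls inside the given edges.

```python
def histogram_heights(data, edges):
    """Return (frequency, relative frequency, density) heights per bin.

    Bins are [edges[i], edges[i+1]); the last bin also includes its right edge.
    """
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for value in data:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            in_bin = edges[i] <= value < edges[i + 1]
            if in_bin or (is_last and value == edges[-1]):
                counts[i] += 1
                break
    total = len(data)
    relative = [c / total for c in counts]
    density = [rel / (edges[i + 1] - edges[i]) for i, rel in enumerate(relative)]
    return counts, relative, density


values = [102, 104, 105, 107, 108, 109, 110, 112, 115, 118, 118]
edges = [100, 105, 110, 115, 120]  # hypothetical bins of width 5

counts, relative, density = histogram_heights(values, edges)
print("frequency:         ", counts)
print("relative frequency:", relative)
print("density:           ", density)
```

The relative frequencies add up to 1, while for the density version it is the areas (height times bin width) that add up to 1.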
Central Tendency (where the data cluster)
Mean
Quantity | Population | Sample |
---|---|---|
Number of observations | $N$ | $n$ |
Mean | $\mu = \frac{\sum_{i=1}^{N} y_i}{N}$ | $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$ |
Median
- Order observations from smallest (in value) to the largest
- Find the middle value; that is the median of your data (with an even number of observations, take the average of the two middle values)
Mode
- The observation that occurs most often
- Not necessarily unique: data can be unimodal (one mode) or bimodal (two modes)
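As a quick check of the three measures, the sketch below uses Python's standard `statistics` module on the eleven values from the IQR example in these notes. `multimode` is used rather than `mode` because it returns every most-frequent value, which matches the point that the mode need not be unique.

```python
import statistics

values = [102, 104, 105, 107, 108, 109, 110, 112, 115, 118, 118]

mean = sum(values) / len(values)      # sum of observations / number of observations
median = statistics.median(values)    # middle value of the sorted data
modes = statistics.multimode(values)  # list of all most-frequent values

print("mean:  ", mean)
print("median:", median)
print("modes: ", modes)
```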
Spread (how dispersed the data are)
Range is the absolute difference between the smallest and the largest value in the data.
Interquartile Range (IQR)
- Sort your data in ascending order.
- Divide your data into two equal groups at the median.
- Find the median of the first, “low” group. This is called Q1, or first quartile.
- The median of the second, “high” group is the third quartile, Q3.
- The interquartile range (IQR) is the difference between Q3 and Q1
Position | Value | Quartile |
---|---|---|
1 | 102 | |
2 | 104 | |
3 | 105 | Q1 |
4 | 107 | |
5 | 108 | |
6 | 109 | Q2 (median) |
7 | 110 | |
8 | 112 | |
9 | 115 | Q3 |
10 | 118 | |
11 | 118 | |
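The median-split steps can be sketched in a few lines of Python using the values from the table. The function name `quartiles_by_median_split` is a hypothetical helper. One assumption to note: when the number of observations is odd, this sketch leaves the median itself out of both halves, which is the convention that reproduces the table; some textbooks include it in both halves instead.

```python
import statistics


def quartiles_by_median_split(data):
    """Return (Q1, Q2, Q3) by splitting the sorted data at the median."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    low = ordered[:half]
    # With an odd count, skip the middle observation so both groups are equal in size.
    high = ordered[half + 1:] if n % 2 else ordered[half:]
    return statistics.median(low), statistics.median(ordered), statistics.median(high)


values = [102, 104, 105, 107, 108, 109, 110, 112, 115, 118, 118]
q1, q2, q3 = quartiles_by_median_split(values)
iqr = q3 - q1
print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "IQR =", iqr)
```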
Percentiles
- median: 50th percentile
- first quartile Q1: 25th percentile
- third quartile Q3: 75th percentile
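The link between quartiles and percentiles can be checked with `statistics.quantiles` from the Python standard library. This is only a sketch: the function's default `method="exclusive"` is one of several interpolation conventions, so other software may report slightly different percentiles for the same data.

```python
import statistics

values = [102, 104, 105, 107, 108, 109, 110, 112, 115, 118, 118]

# n=4 returns the three cut points that split the data into quarters.
quartiles = statistics.quantiles(values, n=4)

# n=100 returns 99 cut points; index k-1 holds the k-th percentile.
percentiles = statistics.quantiles(values, n=100)
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

print("quartiles:", quartiles)
print("25th, 50th, 75th percentiles:", p25, p50, p75)
```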
Variance
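Quantity | Population | Sample |
---|---|---|
Number of observations | $N$ | $n$ |
Variance | $\sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}$ | $s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$ |
Total Sum of Squares = TSS = $\sum_{i=1}^{n} (y_i - \bar{y})^2$
degrees of freedom = $\nu = n - 1$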
Standard Deviation
population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$
sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$
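The difference between dividing by the number of observations and dividing by the degrees of freedom is easy to see numerically. The sketch below computes both versions by hand for the eleven values from the IQR example and cross-checks them against the standard `statistics` module. Treating the same eleven values as a full population in the second calculation is purely for illustration.

```python
import math
import statistics

values = [102, 104, 105, 107, 108, 109, 110, 112, 115, 118, 118]
n = len(values)
mean = sum(values) / n

# Total sum of squares: squared deviations from the mean, added up.
tss = sum((y - mean) ** 2 for y in values)

sample_variance = tss / (n - 1)    # divide by degrees of freedom, n - 1
population_variance = tss / n      # divide by N if the data were the whole population

sample_sd = math.sqrt(sample_variance)
population_sd = math.sqrt(population_variance)

print("TSS:", tss)
print("sample variance:", sample_variance, " library:", statistics.variance(values))
print("population variance:", population_variance, " library:", statistics.pvariance(values))
print("sample sd:", sample_sd, " population sd:", population_sd)
```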
Comparison/Standardization
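Coefficient of Variation (CV)
- CV = Standard deviation / Mean
- How much variability is in the data relative to the mean: variables with a higher average level tend to have larger measures of dispersion, and vice versa. As a rule of thumb in data analysis, if the CV is greater than 15%, consider that the data may be abnormal and should be removed.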
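A minimal sketch of the CV calculation, again using the eleven values from the IQR example. The 15% cutoff is the rule of thumb stated in these notes, not a universal standard, and the variable name `looks_abnormal` is a hypothetical label.

```python
import statistics

values = [102, 104, 105, 107, 108, 109, 110, 112, 115, 118, 118]

mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)

cv = sd / mean
looks_abnormal = cv > 0.15     # rule-of-thumb threshold quoted in the notes

print(f"CV = {cv:.4f} ({cv:.2%}), flagged: {looks_abnormal}")
```

Because the CV is a ratio of two quantities in the same units, it is unit-free, which is what makes it usable for comparing variables measured on different scales.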
z-score
- $z = \frac{y - \bar{y}}{s}$
- Variable z has a mean of 0 and standard deviation equal to 1
- Value of z-score indicates how many standard deviations a value is from the mean
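The claim that standardized values have mean 0 and standard deviation 1 can be verified directly. This sketch standardizes the eleven values from the IQR example using the sample mean and sample standard deviation.

```python
import statistics

values = [102, 104, 105, 107, 108, 109, 110, 112, 115, 118, 118]

mean = statistics.mean(values)
s = statistics.stdev(values)  # sample standard deviation

# Each z-score says how many standard deviations a value lies from the mean.
z_scores = [(y - mean) / s for y in values]

print("z-scores:", [round(z, 3) for z in z_scores])
print("mean of z:", statistics.mean(z_scores))
print("sd of z:  ", statistics.stdev(z_scores))
```

Up to floating-point rounding, the printed mean is 0 and the printed standard deviation is 1.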
Lecture 3&4 Linear Relationship: Association, Correlation and Linear Regression
Covariance
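Quantity | Population | Sample |
---|---|---|
Number of observations | $N$ | $n$ |
Covariance | $\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$ | $s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ |
Covariance shows whether two variables move in the same direction as they vary: it is positive when they tend to move together and negative when they tend to move in opposite directions.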
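Here is a small sketch of the covariance calculation. The paired data `x` and `y` are made-up numbers chosen only to keep the arithmetic easy to follow, and the helper name `covariance` is hypothetical.

```python
def covariance(x, y, sample=True):
    """Covariance of paired data; divides by n - 1 for a sample, by n otherwise."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / (n - 1) if sample else cross / n


# Hypothetical paired observations (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = covariance(x, y)                      # sample covariance
sigma_xy = covariance(x, y, sample=False)    # same data treated as a population
s_xy_reversed = covariance(x, [-v for v in y])

print("sample covariance:    ", s_xy)
print("population covariance:", sigma_xy)
print("with y negated:       ", s_xy_reversed)
```

Negating `y` flips the sign of the covariance, which reflects the two variables now moving in opposite directions.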
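Correlation Coefficient
To measure precisely how similarly two variables vary, we need to remove the effect of the size of their variation from the covariance.
Quantity | Population | Sample |
---|---|---|
Covariance | $\sigma_{xy}$ | $s_{xy}$ |
Standard Deviations | $\sigma_x, \sigma_y$ | $s_x, s_y$ |
How to find | $\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$ | $r = \frac{s_{xy}}{s_x s_y}$ |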
The coefficient of correlation is always between -1 and 1.
-1: perfect negative linear relationship
1: perfect positive linear relationship
0: no linear relationship
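The sketch below computes the sample correlation as covariance divided by the product of the standard deviations, and checks the two extreme cases. The data and the helper name `correlation` are hypothetical, chosen only for illustration.

```python
import math


def correlation(x, y):
    """Sample correlation: covariance scaled by both standard deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)


# Hypothetical paired observations (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r_perfect_pos = correlation(x, [2 * v + 1 for v in x])   # exact increasing line
r_perfect_neg = correlation(x, [-3 * v + 7 for v in x])  # exact decreasing line

print("r:", r)
print("exact increasing line:", r_perfect_pos)
print("exact decreasing line:", r_perfect_neg)
```

Unlike the covariance, the result does not change if `x` or `y` is rescaled by a positive constant, because the scale cancels between numerator and denominator.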
The Linear Model
$\hat{y} = b_0 + b_1 x$
$b_0$: y-intercept
$b_1$: slope of the line
Residual: $e = y - \hat{y}$, where $y$ is the observed value and $\hat{y}$ is the predicted value
OLS = Ordinary Least Squares (proof)
Minimize the sum of squared residuals: $\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, or equivalently $\min \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$
Solution:
$b_1 = r \frac{s_y}{s_x}$
$b_0 = \bar{y} - b_1 \bar{x}$
To apply these, first calculate $\bar{x}$, $\bar{y}$, $s_x$, $s_y$, and $r$.
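The slope and intercept formulas can be tried out on a small data set. This is a sketch under stated assumptions: the paired data are made-up numbers, and the slope is computed two ways (from the correlation and standard deviations, and from the ratio of sums) to show that they agree.

```python
import math

# Hypothetical paired observations (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Building blocks: sample covariance, standard deviations, correlation.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
r = s_xy / (s_x * s_y)

# Slope and intercept of the least-squares line.
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

# Same slope from the ratio-of-sums form used in the proof.
b1_from_sums = (
    sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    / sum((xi - x_bar) ** 2 for xi in x)
)

# Residuals: observed y minus predicted y.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print("b1 =", b1, " b0 =", b0)
print("b1 from sums =", b1_from_sums)
print("sum of residuals =", sum(residuals))
```

The residuals of a least-squares line with an intercept sum to zero up to floating-point rounding, which is a handy sanity check on the fitted coefficients.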
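Proof:
$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / (n - 1)}{\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)} = \frac{s_{xy}}{s_x^2}$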