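First, a few concepts:
1. IQR: interquartile range, IQR = Q3 - Q1.
Outlier: a value that lies abnormally far from the rest of the data.
2. How do we decide whether a value is an outlier? Use the 1.5 * IQR rule.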
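Example: the data set is 1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19.
Step 1: find the first quartile, the median (second quartile), and the third quartile. Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the overall median. Here Q1 = 13, median = 14, Q3 = 18.
Step 2: calculate IQR = Q3 - Q1 = 18 - 13 = 5.
Step 3: judge. A number x is an outlier if x < Q1 - 1.5 * IQR or x > Q3 + 1.5 * IQR. Here the two limits are 13 - 1.5 * 5 = 5.5 and 18 + 1.5 * 5 = 25.5. For example, given the number 30: because 30 > 25.5, 30 is judged to be an outlier.
That is the whole rule.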
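The three steps above can be sketched in Python. This is a minimal sketch, assuming quartiles are taken as the medians of the lower and upper halves (the convention that gives Q1 = 13 and Q3 = 18 here); the function names `quartiles` and `is_outlier` are made up for illustration.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3); Q1 and Q3 are medians of the lower and upper halves."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    # For an odd count, the middle value is left out of both halves.
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return median(lower), median(s), median(upper)

def is_outlier(x, q1, q3):
    """Apply the 1.5 * IQR rule to a single value x."""
    iqr = q3 - q1
    return x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr

data = [1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19]
q1, med, q3 = quartiles(data)
print(q1, med, q3)                                  # 13 14 18
print(is_outlier(30, q1, q3))                       # True, since 30 > 25.5
print([x for x in data if is_outlier(x, q1, q3)])   # [1, 1], both below 5.5
```

Other quartile conventions (for example, interpolating between neighbours) can give slightly different Q1 and Q3 on other data sets, so the limits may shift a little depending on the tool used.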
3. An alternative formula for the population variance:
sigma^2 = ∑ (xi^2)/n - µ^2
where µ is the population mean and n is the number of values.
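As a quick check, the shortcut formula can be compared with the definition-based result from Python's standard library. This is only a sketch; `pvariance_shortcut` is a hypothetical helper name, and the data set is the one from the outlier example.

```python
import math
from statistics import pvariance

def pvariance_shortcut(values):
    """Population variance as the mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2

data = [1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19]
shortcut = pvariance_shortcut(data)
reference = pvariance(data)   # computed from squared deviations around the mean
print(math.isclose(shortcut, reference))   # True
```

With floating-point data whose mean is large compared with its spread, the shortcut can lose precision, so the definition-based form is usually the safer choice in code.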
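4. What is a z-score?
A z-score is also called a z-value, standard score, or normal score. Given a value x from a distribution (here a normal distribution) with mean µ and standard deviation sigma, its z-score is (x - µ)/sigma. The z-score shows how many standard deviations a single data point lies from the mean.
For example, in a normal distribution with µ = 81 and sigma = 6.3, the z-score of x = 65 is (65 - 81)/6.3 = -2.54.
What is the empirical rule?
It is the 68-95-99.7 rule: in a normal distribution, about 68% of the values lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.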
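Both ideas can be checked numerically. The sketch below reproduces the z-score example and then reads the 68-95-99.7 shares off the standard normal distribution using `statistics.NormalDist` (available from Python 3.8); `z_score` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

print(round(z_score(65, 81, 6.3), 2))   # -2.54

std_normal = NormalDist()   # mean 0, standard deviation 1
shares = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}
for k, share in shares.items():
    # share of values within k standard deviations of the mean
    print(k, round(share, 4))   # 1 0.6827, 2 0.9545, 3 0.9973
```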
5. How to calculate the correlation coefficient r:
r = ∑ (Zxi * Zyi) / (n-1), where Zxi is the z-score of xi and Zyi is the z-score of yi, both computed with the sample standard deviation.
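A minimal sketch of this formula, assuming sample standard deviations (the n - 1 form returned by `statistics.stdev`). The four points are the ones used in the regression example of these notes, and `correlation` is a hypothetical helper name.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation coefficient r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_mean, y_mean = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)   # sample standard deviations (divide by n - 1)
    z_products = [((x - x_mean) / sx) * ((y - y_mean) / sy)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]
r = correlation(xs, ys)
print(round(r, 3))   # 0.945
```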
6. In linear regression, the formula is:
y_pred = mx + b (1)
In (1), the slope m is calculated by
m = r * Sy/Sx (2)
where Sx and Sy are the sample standard deviations of x and y.
Example: given 4 points, namely (1,1), (2,2), (2,3), (3,6), find the linear regression formula.
Step 1: by calculation, x_mean = 2, Sx = 0.816; y_mean = 3, Sy = 2.16.
Step 2: calculate r: r = ∑ (Zxi * Zyi) / (n-1) = 0.945.
Step 3: calculate m: m = r * Sy/Sx = 2.5.
Step 4: the regression line passes through (x_mean, y_mean), so substitute (2,3) into (1): 3 = 2.5 * 2 + b, which gives b = -2. The result is:
y_pred = 2.5x - 2
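The four steps can be reproduced with a short script. This is a sketch under the same assumptions as the worked example (sample standard deviations, and a line through the point of means); `fit_line` is a hypothetical helper name.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Return (m, b) using m = r * Sy / Sx and the point (x_mean, y_mean)."""
    n = len(xs)
    x_mean, y_mean = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)   # sample standard deviations
    r = sum((x - x_mean) / sx * (y - y_mean) / sy
            for x, y in zip(xs, ys)) / (n - 1)
    m = r * sy / sx
    b = y_mean - m * x_mean         # the line passes through (x_mean, y_mean)
    return m, b

m, b = fit_line([1, 2, 2, 3], [1, 2, 3, 6])
print(round(m, 3), round(b, 3))   # 2.5 -2.0
```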
What is the coefficient of determination?
r^2 is called the coefficient of determination.
(1) SE(y_mean) = (y1-y_mean)^2 + (y2-y_mean)^2 + (y3-y_mean)^2 + ...
(2) SE(line) = (y1-y1_pred)^2 + (y2-y2_pred)^2 + (y3-y3_pred)^2 + ...
(3) r^2 = 1 - SE(line)/SE(y_mean)
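Example: ① without a regression line (using the horizontal line y = y_mean), SE(y_mean) is 41.1879;
② for the linear regression line, SE(line) is 13.7627.
So, according to ① and ②, r^2 = 1 - 13.7627/41.1879 ≈ 0.6659.
Thus 0.6659 is the coefficient of determination: about 66.59% of the variation in y is explained by the line, which also indicates how well the line fits the data.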
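Formulas (1) to (3) translate directly into code. The sketch below applies them to the four points and the fitted line y_pred = 2.5x - 2 from the regression example, then repeats the arithmetic for the two SE values quoted above; `r_squared` is a hypothetical helper name.

```python
def r_squared(xs, ys, m, b):
    """Coefficient of determination: 1 - SE(line) / SE(y_mean)."""
    y_mean = sum(ys) / len(ys)
    se_mean = sum((y - y_mean) ** 2 for y in ys)                    # formula (1)
    se_line = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))   # formula (2)
    return 1 - se_line / se_mean                                    # formula (3)

xs, ys = [1, 2, 2, 3], [1, 2, 3, 6]
print(round(r_squared(xs, ys, 2.5, -2), 4))   # 0.8929
print(round(1 - 13.7627 / 41.1879, 4))        # 0.6659
```

For the four-point example the result agrees with squaring the correlation coefficient, since 0.945^2 is about 0.893.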
7. What is the root-mean-square deviation (RMSD)?
It is also called the "standard deviation of the residuals".
Ri is the residual of the i-th point, calculated as
Ri = yi - yi_pred (1)
so
RMSD = sqrt( ∑ (Ri^2) / (n-1) ) (2)
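A minimal sketch of the RMSD calculation, following the n - 1 denominator used in these notes and taking the square root of the averaged squared residuals. Other conventions divide by n or by n - 2, so results from other tools may differ slightly. The points and line come from the regression example, and `rmsd` is a hypothetical helper name.

```python
import math

def rmsd(xs, ys, m, b):
    """Standard deviation of the residuals of the line y_pred = m * x + b."""
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    return math.sqrt(sum(res * res for res in residuals) / (len(residuals) - 1))

value = rmsd([1, 2, 2, 3], [1, 2, 3, 6], 2.5, -2)
print(round(value, 4))   # 0.7071
```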
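8. Another way to find the linear regression line: take the partial derivatives of the squared error SE(line) with respect to m and b, set each equal to 0, and solve the resulting system of two linear equations in two unknowns. The result is:
m = (x_mean * y_mean - xy_mean) / (x_mean^2 - x2_mean)
b = y_mean - m * x_mean
where xy_mean is the mean of the products xi * yi and x2_mean is the mean of the squares xi^2.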
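The closed-form solution of that system can be checked on the four points from the regression example. This is a sketch, with the formula written out in the comments; `least_squares` and the intermediate variable names are made up for illustration.

```python
def least_squares(xs, ys):
    """Slope and intercept that minimise the sum of squared errors."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    xy_mean = sum(x * y for x, y in zip(xs, ys)) / n   # mean of the products xi * yi
    x2_mean = sum(x * x for x in xs) / n               # mean of the squares xi^2
    # Setting both partial derivatives of the squared error to zero gives:
    m = (x_mean * y_mean - xy_mean) / (x_mean ** 2 - x2_mean)
    b = y_mean - m * x_mean
    return m, b

print(least_squares([1, 2, 2, 3], [1, 2, 3, 6]))   # (2.5, -2.0)
```

The result matches the line y_pred = 2.5x - 2 obtained from m = r * Sy/Sx, as expected, because both routes describe the same least-squares line.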
Reference: Khan Academy: https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data
https://www.khanacademy.org/math/statistics-probability/modeling-distributions-of-data