Machine Learning Simple Notes
(Some basic notes…)
Basics
Machine Learning
- Model + Evaluation (criterion) + Optimization (algorithm) + Validation
- Use datasets D to learn a specific model G from the model space (hypothesis space) H,
so that G is close to the best model F in H.
ML Problem Classification
- output space
- Classification
- binary-class
- multi-class
- one-v-all: k binary classifiers
- one-v-one: train a binary classifier for each pair of classes,
k(k-1)/2 SVM binary classifiers
- softmax regression (multiclass logistic regression; see the sketch after this outline)
- structure(sentence classification)
- Regression
- Clustering
- data input space
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
- algorithms space
- Parametric
- Non-parametric
- K-nearest Neighbors
- Kernel Estimation
- Locally Weighted Linear Regression
- Semi-parametric
- learning with different protocols
- batch learning
- online learning
- active learning
- features input space
- concrete features
- raw features
- abstract features
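The softmax view referenced in the outline above, as a minimal NumPy sketch; the weight matrix W and input x are made-up illustrative values:

```python
import numpy as np
from itertools import combinations

def softmax(z):
    # Subtract the max before exponentiating, for numerical stability.
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

# Hypothetical example: 3 classes, 2 features, one input point.
W = np.array([[1.0, -0.5], [0.2, 0.8], [-1.0, 0.3]])  # one weight row per class
x = np.array([0.5, 1.5])
print(softmax(W @ x))                  # class probabilities summing to 1

# one-v-one needs k*(k-1)/2 binary classifiers: for k = 4, the 6 class pairs are
print(list(combinations(range(4), 2)))
```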
Theories
Describes learning feasibility and the learning process (PAC learning theory)
- finite hypothesis space H
- infinite hypothesis space H
- dichotomy
(a specific labeling of the N input samples), denoted
H(S1, S2, S3, …, SN) (a family of hypotheses; every hypothesis in the same family realizes the same dichotomy)
- growth function
(maximum number of dichotomies for a specific hypothesis space H and N samples), denoted mH(N)
- bound function
(maximum growth function for a specific break point k and N samples across all H),
denoted B(N,k); B(N,k) <= B(N-1,k) + B(N-1,k-1), which is polynomial in N of order N^(k-1) = N^(Dvc)
- Essentially, because the sample set is sparse and discrete, only some families of hypotheses in an
infinite hypothesis space separate the samples in distinct ways, and every hypothesis within one family
separates this sample set identically. In the Hoeffding inequality, the bad events |E(in,h)-E(out,h)| > epsilon
for hypotheses in the same family overlap (they are not independent): they happen simultaneously, so the
union of P(|E(in,h)-E(out,h)| > epsilon) over all hypotheses does not grow without bound. The M on the
right-hand side, 2*M*e^{-2*epsilon^2*N}, therefore degenerates to a finite number bounded by the sample
size N, which reduces the problem to the finite case. In the finite case we have the conclusion: given a
large enough N, learning is Probably Approximately Correct (PAC):
P(some h in H has |E(in,h)-E(out,h)| > epsilon) < 2*M*e^{-2*epsilon^2*N},
which theoretically establishes that learning is feasible and correct, i.e., the model really learns and improves.
- Empirical Risk Minimization
- VC bound
- for any h in H:
P(|E(in,h)-E(out,h)| > epsilon) <= 4*mH(2N)*exp(-(1/8)*epsilon^2*N) <= 4*(2N)^(k-1)*exp(-(1/8)*epsilon^2*N)
- VC dimension: the largest number of samples H can shatter (maximum non-break point = minimum break point - 1)
- Dvc is not infinite (a break point exists), so given a large enough N the Hoeffding inequality
holds under PAC conditions: within a given error, E(in) estimates E(out), i.e., the input
datasets can be used to estimate generalization.
- The hypothesis space H can shatter at most Dvc samples, i.e., realize every labeling of a sample
set of that size; this measures the learning complexity of the hypothesis space.
- Relation between the VC dimension and the number of input features (for the perceptron):
Dvc = d (number of features) + 1
- Proof idea: Dvc <= d+1 && Dvc >= d+1
- Dvc <= d+1 <=>
no set of d+2 inputs can be shattered <=>
for some labeling Y there is no W such that XW = Y <=>
since X is (d+2)x(d+1), any d+2 input vectors in d+1 dimensions are linearly dependent, so the
number of equations exceeds the number of free variables in W and some Y admits no solution W
- Dvc >= d+1 <=>
some set of d+1 inputs can be shattered <=>
for every labeling Y there exists W such that XW = Y <=>
since X is (d+1)x(d+1) and can be constructed to be invertible, W = inv(X)Y always exists
- Measures the learning complexity (degrees of freedom) of the hypothesis space, or, from another
angle, the learnability of the sample set
- Relation between the VC dimension, the sample size N, and the feature dimension
- with probability 1-a:
E(out,g) <= E(in,g) + sqrt((8/N)*ln[4*((2N)^Dvc)/a]) = E(in,g) + Omega(N,H,a) (model complexity)
- sample complexity N: theory suggests N ~ 10000*Dvc, but in practice N ~ 10*Dvc is often enough
(see the numeric sketch below)
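A small sketch (an illustration added here, not from the original notes) that tabulates the bound function B(N,k) via the recursion above; the recursion computed as an equality is known to be tight, and the printout compares it with the closed form sum over i < k of C(N,i) and the polynomial scale N^(k-1):

```python
from functools import lru_cache
from math import comb

# Bound function B(N, k): the maximum number of dichotomies on N points
# when k is a break point. Base cases: B(N, 1) = 1 and B(1, k) = 2 for k >= 2.
@lru_cache(maxsize=None)
def B(N, k):
    if k == 1:
        return 1
    if N == 1:
        return 2
    # Recursion from the notes: B(N, k) <= B(N-1, k) + B(N-1, k-1)
    # (computed here as an equality, which is known to be tight).
    return B(N - 1, k) + B(N - 1, k - 1)

for N in (5, 10, 20):
    for k in (2, 3, 4):
        closed = sum(comb(N, i) for i in range(k))  # closed form: sum_{i<k} C(N, i)
        print(N, k, B(N, k), closed, N ** (k - 1))  # B matches the closed form; poly scale N^(k-1)
```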
Bias and Variance(Underfitting and Overfitting) trade-off
- underfitting
- overfitting
- datasets too small
- noise too large (stochastic noise and deterministic noise [depends on H])
- model too complex (VC dimension too big)
- Regularization
- Essence: by constraining the feature weights w, regularization reduces the complexity (the
effective dimension) of the hypothesis space that must be searched, so a highly complex model
incurs less deterministic noise and a better trade-off is reached (see the ridge sketch below);
from a Bayesian view, it amounts to adding prior knowledge (a prior probability) and then
maximizing the posterior probability.
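A minimal sketch of this idea as L2 regularization (ridge regression) in closed form; the synthetic data and the penalty values lam are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam*I)^(-1) X^T y.
    # The penalty lam constrains the weights w, shrinking the effective hypothesis space.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge(X, y, lam), 3))  # larger lam => smaller weights
```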
- Training Error(Risk)
- Model Selection
- Feature Selection
- Cross Validation
- Model Metrics
- accuracy, precision, recall
- accuracy = (TP+TN)/(TP+TN+FP+FN) ==> fraction of samples classified correctly
- precision = TP/(TP+FP) ==> measures how rarely negatives are misclassified as positive (fraud detection)
- recall = TP/(TP+FN) ==> measures the ability to find all positive samples in the dataset
- confusion_matrix
- Rows represent the predicted class, columns the actual class; A(i,j) is the number of samples
predicted as class i whose true class is j (conventions vary across libraries)
- f1-measure = 2*(precision*recall)/(precision+recall) [f-measure with a=1]
- f-measure = (a^2+1)*(precision*recall)/(a^2*precision+recall)
- jointly accounts for the effects of precision and recall
- ROC and AUC ==> imbalanced datasets
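A short sketch computing the metrics above from raw counts; the TP/TN/FP/FN values are hypothetical:

```python
# Hypothetical binary-classification counts.
TP, TN, FP, FN = 40, 45, 5, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # fraction of samples classified correctly
precision = TP / (TP + FP)                    # how rarely negatives are flagged as positive
recall    = TP / (TP + FN)                    # how many of the true positives were found

def f_measure(p, r, a=1.0):
    # General F-measure; a = 1 gives the usual F1 score.
    return (a**2 + 1) * p * r / (a**2 * p + r)

print(accuracy, precision, recall, f_measure(precision, recall))
```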
Algorithms(Models)
Supervised Learning
- 1.Least Squares Method(LSM)
- 2.Logistic Regression(LR)
- 3.Perceptron
- 4.Naive Bayes(NB)
- 5.Support Vector Machine(SVM)
- 6.Decision Tree
- 7.K-Nearest Neighbors(KNN)
- 8.Linear Discriminant Analysis(LDA)
- 9.Ensemble Methods
- (1)Boosting : Gradient Boosted Decision[Regression] Trees(GBRT[GBDT])
- (2)Bagging : Random Forests
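As a taste of the list above, a minimal Perceptron Learning Algorithm (PLA) sketch, in the spirit of the referenced course; the linearly separable data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 2 * X[:, 1] + 0.5)      # separable labels from a known line
X = np.hstack([np.ones((100, 1)), X])         # prepend the bias feature x0 = 1

w = np.zeros(3)
for _ in range(1000):                         # iteration cap as a safety net
    mistakes = np.where(np.sign(X @ w) != y)[0]
    if len(mistakes) == 0:
        break
    i = mistakes[0]
    w += y[i] * X[i]                          # PLA update: fix one misclassified point
print(w, "training error:", np.mean(np.sign(X @ w) != y))
```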
Unsupervised Learning
Clustering / Dimension Reduction / Density Estimation
- 1.KMeans Clustering
- 2.Hierarchical Clustering
- 3.Expectation Maximization(EM)
- 4.Gaussian Mixture Models(GMM)
- 5.Density-Based Spatial Clustering of Applications with Noise(DBSCAN)
- 6.Mean Shift
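A minimal NumPy sketch of item 1 (k-means); the two Gaussian blobs and k = 2 are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two made-up blobs

k = 2
centers = X[rng.choice(len(X), k, replace=False)]      # random initial centers
for _ in range(100):
    # Assignment step: each point joins its nearest center.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Update step: each center moves to the mean of its assigned points.
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers
print(centers)
```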
Others
- 1.Artificial Neural Network and Deep Learning
- 2.Dimension Reduction
- (1)PCA/Kernel PCA
- (2)Matrix Factorization/SVD
- 3.Gaussian Process
- 4.Bayesian Network and Graphical Models
- 5.LDA(Latent Dirichlet Allocation)
- 6.PageRank
- 7.Apriori
- 8.Empirical Risk Minimization(ERM)
Techniques
This section covers some theoretical aspects and techniques used in machine learning.
- Normalization
- Principal Components Analysis
- Singular Value Decomposition
- Matrix Factorization
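A short sketch tying Normalization, PCA, and SVD together: center the data, then read the principal components off the SVD; the anisotropic synthetic data is an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])  # stretched synthetic data

Xc = X - X.mean(axis=0)                  # normalization step: center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                        # project onto the top-2 principal components
explained = S**2 / np.sum(S**2)          # variance ratio captured per component
print(Z.shape, np.round(explained, 3))
```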
imbalanced datasets
- oversampling and undersampling
- different metrics such as AUC
- ensemble different algorithms
- cost matrix
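A minimal random-oversampling sketch for the first remedy in the list; the 10:1 class ratio is invented:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(110, 2))
y = np.array([0] * 100 + [1] * 10)       # 10:1 imbalanced labels

minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=90, replace=True)  # resample minority with replacement
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print(np.bincount(y), "->", np.bincount(y_bal))      # 100:10 becomes 100:100
```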
nonlinear transformation for linear models
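A sketch of the idea: map the inputs through a nonlinear feature transform phi, then fit an ordinary linear model in phi-space; the quadratic transform and synthetic target are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-2, 2, size=100)
y = 1.0 - 2.0 * x**2 + 0.1 * rng.normal(size=100)  # nonlinear target

# Nonlinear transform phi(x) = (1, x, x^2); plain least squares in phi-space.
Phi = np.stack([np.ones_like(x), x, x**2], axis=1)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.round(w, 2))  # recovers roughly (1, 0, -2)
```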
The essence of Lagrange multipliers
- min{ f(x) }, s.t. g(x) <= 0
- L(x) = f(x) + C*g(x) with C > 0 (for an active constraint)
- L'(x) = f'(x) + C*g'(x) = 0 ==> suppose a level curve of f touches g at the minimizer x0;
then the normal vectors of f and g at x0 are collinear, i.e., the gradient of f has no component
left along the tangent of the constraint curve g, so gradient descent stops there
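A numeric sanity check of the picture above, a sketch only: the objective and constraint are made up, and scipy's SLSQP uses the convention fun(x) >= 0 for inequality constraints, hence the sign flip:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2)**2 + (x[1] - 2)**2   # objective to minimize
g = lambda x: x[0] + x[1] - 1                 # constraint g(x) <= 0

# SLSQP expects inequality constraints as fun(x) >= 0, so pass -g.
res = minimize(f, x0=[0.0, 0.0], method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: -g(x)}])
xs = res.x
grad_f = 2 * (xs - 2)                         # gradient of f at the optimum (0.5, 0.5)
grad_g = np.array([1.0, 1.0])                 # gradient of g (constant)
print(xs, grad_f, grad_f / grad_g)            # grad_f = -C * grad_g with C = 3 > 0: collinear
```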
ref: Machine Learning Foundations course (机器学习基石, NTU)