- machine learning in practice
10.3-7 model selection
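Model selection is the process of choosing among candidate values of a model's parameters and hyperparameters.
Choices that may require a judgment call during model selection:
Degree of polynomial
Regularization parameter
Number of hidden layers in a neural network
Approach:
Propose several candidate models (hypotheses) and fit each model's parameters on the training set.
With each candidate's parameters fixed, compute its error on the cross-validation set and select the model with the lowest error.
The test set is used to compute the generalization error of the selected model.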
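A minimal sketch of the approach above, under assumptions that are not in these notes: a toy one-parameter model y ≈ theta * x, the regularization parameter as the quantity being selected, synthetic data, and a 60/20/20 split. The names fit_ridge_1d, mse, and select_model are hypothetical.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form fit of y ≈ theta * x with an L2 penalty lam (toy model, an assumption)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(theta, xs, ys):
    """Mean squared error of the prediction theta * x, without the penalty term."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def select_model(train, cv, test, candidates):
    """Fit every candidate on train, choose by CV error, touch the test set once."""
    fitted = {lam: fit_ridge_1d(*train, lam) for lam in candidates}
    cv_err = {lam: mse(theta, *cv) for lam, theta in fitted.items()}
    best = min(cv_err, key=cv_err.get)
    return best, cv_err, mse(fitted[best], *test)

# Synthetic data (an assumption for illustration): y = 2x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(100)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]
train = (xs[:60], ys[:60])    # 60% training set
cv = (xs[60:80], ys[60:80])   # 20% cross-validation set
test = (xs[80:], ys[80:])     # 20% test set

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam, cv_err, test_err = select_model(train, cv, test, candidates)
print("selected lambda:", best_lam, "test error:", test_err)
```

The test error is computed only for the model that won on the CV set, so it stays an honest estimate of generalization error.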
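Why split the original data set into training, cross-validation (CV), and test sets, rather than just training and test sets?
With only two sets, the best model is chosen by minimum test error, so judging that model's performance on the same test set again is meaningless: the result does not reflect generalization. The correct approach is to test on samples that have never been used before, which gives the generalization error.
- machine learning system design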
11.1 Prioritizing what to work on?
Example: building a spam classifier
11.2 error analysis (based on CV)
Manually inspect the errors: a source of inspiration for building features
“stemming” software (E.g. Porter stemmer)
Need a numerical evaluation (e.g. cross-validation error), which helps the decision maker choose between models
11.3 Error metrics for skewed classes
E.g. cancer classification: skewed classes call for a different error evaluation metric
Precision/recall: by convention y=1 denotes the rare class
11.4-5 Tradeoff between precision and recall
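The precision-recall curve can take many different shapes
F1 score (F score): how to compare precision (P) / recall (R) numbers
Average? (Counterexample: always predicting y=1)
F1 score = 2PR/(P+R); the product form is what makes it reasonable, since it is high only when both P and R are high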
Varying the threshold, you can control the tradeoff between precision and recall.
The F1 score gives you a single real-number evaluation metric
To select the threshold, use the same method as model selection: pick the value that does best on the CV set
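A short sketch of the metrics above and of threshold selection on the CV set. The labels, scores, and candidate thresholds are made-up illustration data, and the function names are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1, with y=1 as the rare (positive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def best_threshold(y_true, scores, thresholds):
    """Return the candidate threshold with the highest F1 on the given (CV) set."""
    def f1_at(th):
        preds = [1 if s >= th else 0 for s in scores]
        return precision_recall_f1(y_true, preds)[2]
    return max(thresholds, key=f1_at)

# Hypothetical CV labels and predicted probabilities (3 positives out of 10).
y_cv = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.2, 0.6, 0.7, 0.1, 0.3, 0.8, 0.05, 0.35]

# Always predicting y=1: recall is 1 and the plain average (P+R)/2 is 0.65,
# but F1 stays low because precision is only 0.3.
p_all, r_all, f1_all = precision_recall_f1(y_cv, [1] * len(y_cv))

chosen = best_threshold(y_cv, scores, [0.3, 0.5, 0.7, 0.9])
print(p_all, r_all, f1_all, chosen)
```

On this toy data the threshold 0.5 wins, with precision 0.75 and recall 1.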
11.6 Data for machine learning
Perceptron (logistic regression)
Winnow
Memory-based
Naive Bayes
Data vs. algorithm
What really drives performance is giving the algorithm a ton of training data.
“It is not who has the best algorithm that wins, it is who has the most data” (Banko and Brill, 2001).
Large data rationale
Many parameters: a low-bias algorithm, so J_train will be small
Large training set: low variance, so J_train ≈ J_test
12 support vector machines
12.1 Optimization objective
Supervised algorithm
Performance depends on the choice of features and on how the regularization parameter is chosen
SVM is a supervised learning algorithm that performs well compared with logistic regression and neural networks.
12.2 Large margin intuition (large margin classifier)
The larger margin: worth looking up
12.3 mathematics
Vector inner product
SVM decision boundary
12.4-5 Kernels and how to choose the landmarks
SVM with kernels
SVM parameters
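A small sketch of kernel features built from landmarks. These notes do not say which kernel is meant, so the Gaussian (RBF) similarity and the choice of one landmark per training example are assumptions here; gaussian_kernel and kernel_features are hypothetical names.

```python
import math

def gaussian_kernel(x, landmark, sigma):
    """Similarity exp(-||x - l||^2 / (2 sigma^2)); equals 1 when x is the landmark."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def kernel_features(x, landmarks, sigma):
    """Map x to one similarity feature per landmark."""
    return [gaussian_kernel(x, l, sigma) for l in landmarks]

# Toy training set (made up); assumed choice: every training example is a landmark.
X_train = [[1.0, 2.0], [3.0, 1.0], [0.0, 0.0]]
landmarks = X_train
features = kernel_features(X_train[0], landmarks, sigma=1.0)
print(features)
```

A larger sigma makes the similarity fall off more slowly with distance, which gives smoother features (higher bias, lower variance).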
12.6 use SVM
Other choices of kernel
Note: not all similarity functions make valid kernels. They need to satisfy a technical condition called “Mercer’s Theorem” so that SVM packages’ optimizations run correctly and do not diverge.
Many off-the-shelf kernels available:
Polynomial kernel
More esoteric: string kernel, chi-square kernel, histogram intersection kernel…
Multi-class classification
Many SVM packages already have built-in multi-class classification functionality
Otherwise, use the one-vs.-all method
13 Clustering
13.1 Unsupervised learning
We give an unlabeled training set to an algorithm and simply ask it to find some structure in the data
13.2 k-means algorithm
random initialization
(option works best)
Iterative algorithm: cluster assignment step, then move-centroid step
Repeat until the labels and the centroids no longer change
13.3 Optimization objective
13.4 Random initialization
With K < m, pick K training examples at random and set them as the initial cluster centroids
Why “random initialization”? Whether the algorithm ends up in a local optimum, and how slowly it converges, depends on the initialization, so it is worth running it from several random starts and keeping the lowest-cost result.
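A compact sketch of k-means with random initialization and several restarts, in plain Python. The toy points, the number of restarts, and the function names are assumptions for illustration.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """One run of k-means; returns (centroids, labels, cost)."""
    centroids = rng.sample(points, k)  # random initialization from k training examples
    labels = None
    for _ in range(max_iter):
        # Cluster assignment step: index of the closest centroid for each point.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:       # labels (hence centroids) stopped changing
            break
        labels = new_labels
        # Move-centroid step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    cost = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels)) / len(points)
    return centroids, labels, cost

def kmeans_restarts(points, k, n_restarts=10, seed=0):
    """Run k-means from several random starts and keep the lowest-cost result."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(n_restarts)),
               key=lambda result: result[2])

# Two well-separated toy blobs (made-up data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels, cost = kmeans_restarts(points, k=2)
print(labels, cost)
```

The cost returned here is the average squared distance from each point to its assigned centroid, which is the quantity the restarts are compared on.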
13.5 Choosing the number of clusters
Elbow method
Sometimes k-means is run to get clusters for a later/downstream purpose; in that case, evaluate it by a metric for how well it serves that later purpose
14 Dimensionality Reduction
14.1 Motivation 1: data compression
Compressed data uses less computer memory or disk space and speeds up the learning algorithm.
If we allow ourselves to approximate the original data set by projecting every example onto a line, then one number is enough to specify the position of each point.
14.2 Motivation 2: data visualization
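14.3 Principal component analysis (PCA) problem formulation
PCA finds a vector (direction) onto which to project the data so as to minimize the projection error. (Perform mean normalization and feature scaling first.)
PCA is not linear regression: it minimizes the orthogonal projection distance rather than the vertical prediction error, and there is no special target variable y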
14.4 PCA algorithm
Data preprocessing
Compute “covariance matrix” and “eigenvectors”
14.5 Choosing the number of principal components K
14.6 Reconstruction from compressed representation
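A sketch of the PCA steps listed above (preprocessing, covariance matrix, eigenvectors, choice of K, reconstruction) using NumPy. Assumptions not in these notes: the eigenvectors are obtained with an SVD of the covariance matrix, K is the smallest value that retains 99% of the variance, and feature scaling is skipped because the toy features share one scale. Function names are hypothetical.

```python
import numpy as np

def pca_fit(X, variance_to_retain=0.99):
    """Learn the mean, the top-K directions, and K from the training set X (m x n)."""
    mu = X.mean(axis=0)
    Xc = X - mu                            # mean normalization
    Sigma = Xc.T @ Xc / X.shape[0]         # covariance matrix (n x n)
    U, S, _ = np.linalg.svd(Sigma)         # columns of U: principal directions
    retained = np.cumsum(S) / np.sum(S)    # fraction of variance kept by the first k
    k = int(np.searchsorted(retained, variance_to_retain)) + 1
    k = min(k, len(S))
    return mu, U[:, :k], k

def pca_project(X, mu, U_reduce):
    """Compress x -> z using a mapping learned elsewhere (e.g. on the training set)."""
    return (X - mu) @ U_reduce

def pca_reconstruct(Z, mu, U_reduce):
    """Approximate the original x from the compressed z."""
    return Z @ U_reduce.T + mu

# Toy data lying exactly on a line, so one component keeps all the variance.
X_train = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
mu, U_reduce, k = pca_fit(X_train)
Z = pca_project(X_train, mu, U_reduce)
X_approx = pca_reconstruct(Z, mu, U_reduce)
print(k, np.allclose(X_approx, X_train))
```

Because pca_project takes mu and U_reduce as arguments, the same mapping fitted on the training set can be reused unchanged on other data.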
14.7 advice for applying PCA
Run PCA on the training set only to learn the mapping x->z, then apply that same mapping to the CV and test sets.
Using PCA to prevent overfitting is a bad use of it (it discards information)
Advice: before running PCA, try the raw data. Only if you have evidence or a strong reason to believe the raw data will not work should you run PCA and consider using the compressed data.
15 Anomaly detection
15.1 problem motivation
Example:
Fraud detection
Manufacturing
15.2 Gaussian Distribution (Normal distribution)
15.3 Algorithm
Model p(x) from the data set, to work out which feature values have high probability and which have low probability
p(x) = p(x_1; \mu_1, \sigma_1^2) \, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)
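A minimal sketch of the density model above: fit mu_j and sigma_j^2 per feature, then multiply the univariate Gaussian densities. Flagging an example when p(x) falls below a threshold epsilon is an assumption here (the notes do not state the decision rule), and the data, the value of epsilon, and the function names are made up.

```python
import math

def fit_gaussians(X):
    """Per-feature mean and variance (dividing by m) from the training examples."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2

def p(x, mu, sigma2):
    """Product over features of the univariate Gaussian densities p(x_j; mu_j, sigma_j^2)."""
    prob = 1.0
    for xj, mj, s2 in zip(x, mu, sigma2):
        prob *= math.exp(-(xj - mj) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return prob

def is_anomaly(x, mu, sigma2, epsilon):
    """Assumed decision rule: flag x when its density is below epsilon."""
    return p(x, mu, sigma2) < epsilon

# Toy training set of normal examples (made-up numbers).
X = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 9.0], [2.0, 8.0]]
mu, sigma2 = fit_gaussians(X)
epsilon = 0.01  # hypothetical threshold; in practice it would be tuned on a CV set
print(is_anomaly([2.0, 10.0], mu, sigma2, epsilon),
      is_anomaly([8.0, 30.0], mu, sigma2, epsilon))
```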
15.4 Anomaly detection vs. supervised learning
When the number of positive examples (y=1) is comparatively large, supervised learning is appropriate; if not, use anomaly detection (recall the “cheating” algorithm).
Examples for anomaly detection: fraud detection, manufacturing, monitoring machines in a data center
Examples for supervised learning: e-mail spam classification, weather prediction, cancer classification
15.5 Choose what features to use
Non-Gaussian features: the most common idea is to log-transform the data (other transformations also work)
Error analysis
15.6-7 Multivariate Gaussian distribution
Parameter fitting
Relationship to original model
The original model corresponds to a multivariate Gaussian distribution whose covariance matrix Sigma is constrained to be diagonal (variances on the diagonal, zeros everywhere else), i.e. axis-aligned contours
Original model:
Manually create features to capture correlations
Computationally cheaper
OK even if m is small
Multivariate Gaussian model:
Automatically captures correlations between features
Computationally more expensive
Must have m > n, else Sigma is non-invertible
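A quick numerical check of the relationship above, using NumPy: with a diagonal covariance matrix, the multivariate Gaussian density equals the product of per-feature univariate densities. The parameter values and function names are made up for illustration.

```python
import numpy as np

def multivariate_gaussian(x, mu, Sigma):
    """Density of N(mu, Sigma) at x; needs Sigma to be invertible."""
    n = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

def product_of_univariate(x, mu, sigma2):
    """Original model: product of independent univariate Gaussian densities."""
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2))

mu = np.array([2.0, 10.0])
sigma2 = np.array([0.4, 2.0])   # per-feature variances (illustrative values)
x = np.array([2.5, 9.0])

p_multi = multivariate_gaussian(x, mu, np.diag(sigma2))  # diagonal Sigma
p_orig = product_of_univariate(x, mu, sigma2)
print(np.isclose(p_multi, p_orig))
```

With off-diagonal entries in Sigma the two values would differ, which is how the multivariate model captures correlations between features.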
16 Recommender System
16.1 problem formulation
16.2 Content-based recommendation
Content-based means that the features of the movies are known
16.3 Collaborative Filtering
16.4 Collaborative Filtering Algorithm
For content-based recommendation (CBR): given movie features, learn user parameters
For collaborative filtering (CF): given user parameters, learn movie features
The collaborative filtering algorithm puts the two together and learns both at once.
16.5 Vectorization: low-rank matrix factorization
Example: how to find movies j related to movie i, i.e. the most similar movies?
16.6 Implementation detail
Use mean normalization to handle users who have not rated any movies.
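A small sketch of mean normalization for a ratings matrix with missing entries. The layout (rows are movies, columns are users, None marks "not rated"), the toy ratings, and the function names are assumptions for illustration.

```python
def mean_normalize(Y):
    """Subtract each movie's mean rating, computed over the users who rated it."""
    Y_norm, means = [], []
    for row in Y:
        rated = [r for r in row if r is not None]
        mu = sum(rated) / len(rated) if rated else 0.0
        means.append(mu)
        Y_norm.append([None if r is None else r - mu for r in row])
    return Y_norm, means

def predict(theta_user, x_movie, mu_movie):
    """Predicted rating: learned score on normalized data plus the movie's mean."""
    return sum(t * x for t, x in zip(theta_user, x_movie)) + mu_movie

# 3 movies x 4 users; the last user has rated nothing.
Y = [[5, 5, 0, None],
     [4, None, 0, None],
     [0, 0, 5, None]]
Y_norm, means = mean_normalize(Y)

# Assumption: regularization leaves a user with no ratings at theta = 0,
# so the prediction falls back to the movie's mean rating.
theta_new_user = [0.0, 0.0]
x_movie_1 = [0.9, 0.1]  # hypothetical learned features of movie 1
print(predict(theta_new_user, x_movie_1, means[1]))
```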
17 Large scale machine learning
17.1 Learning with large datasets
Two computationally reasonable approaches are stochastic gradient descent and MapReduce.
17.2-4 Stochastic Gradient descent & Mini-batch Gradient descent
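A minimal sketch contrasting the variants named in this heading, on an assumed toy problem (fit y ≈ theta * x with squared error). Batch size 1 gives stochastic gradient descent, a small batch gives mini-batch gradient descent, and a batch the size of the data set gives ordinary batch gradient descent. The learning rate, epoch count, and data are illustrative choices.

```python
import random

def minibatch_gd(xs, ys, batch_size, alpha=0.5, epochs=200, seed=0):
    """Fit y ≈ theta * x by gradient steps on batches of the given size."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    theta = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # randomly reorder the examples on every pass
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the squared-error cost averaged over this batch only.
            grad = sum((theta * x - y) * x for x, y in batch) / len(batch)
            theta -= alpha * grad
    return theta

# Noise-free toy data with true slope 3 (made up).
xs = [i / 10 for i in range(1, 11)]
ys = [3 * x for x in xs]

theta_sgd = minibatch_gd(xs, ys, batch_size=1)     # stochastic
theta_mini = minibatch_gd(xs, ys, batch_size=2)    # mini-batch
theta_batch = minibatch_gd(xs, ys, batch_size=10)  # full batch
print(theta_sgd, theta_mini, theta_batch)
```

On noisy data the stochastic and mini-batch variants would wander around the minimum rather than settle exactly; this noise-free example hides that.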
17.5 Online learning setting
Continuous stream of data
Adapts to changing user preferences and keeps track of user behavior/taste
Example: product search
17.6 Map-reduce and data parallelism
Example: split the training set across 4 machines, compute the temporary variables separately, then combine them
Many learning algorithms can be expressed as computing sums of functions over the training set
Example: logistic regression
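A sketch of the idea for logistic regression: the gradient is a sum over training examples, so each machine can sum over its own share and the partial sums are then added. This toy version runs the "machines" one after another in a single process; the data and function names are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def partial_gradient(theta, chunk):
    """Map step: sum of (h(x) - y) * x over one machine's share of the data."""
    grad = [0.0] * len(theta)
    for x, y in chunk:
        err = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad

def combined_gradient(theta, data, n_machines=4):
    """Reduce step: add the partial sums and divide by the number of examples m."""
    chunks = [data[i::n_machines] for i in range(n_machines)]
    partials = [partial_gradient(theta, c) for c in chunks]  # parallel in a real setup
    m = len(data)
    return [sum(part[j] for part in partials) / m for j in range(len(theta))]

# Toy data: each x has a leading 1 for the intercept term.
data = [([1.0, 0.5], 0), ([1.0, 1.5], 0), ([1.0, 2.0], 1), ([1.0, 3.0], 1),
        ([1.0, 0.2], 0), ([1.0, 2.5], 1), ([1.0, 1.0], 0), ([1.0, 3.5], 1)]
theta = [0.1, -0.2]

split = combined_gradient(theta, data)
single = [g / len(data) for g in partial_gradient(theta, data)]
print(split, single)
```

Both routes give the same gradient up to floating-point rounding, which is why the split costs nothing in accuracy.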
Multi-core machines
18 application example: photo OCR
18.1 problem description and pipeline
How a complex machine learning system can be put together
The concept of a machine learning pipeline and how to allocate resources
Applying machine learning to a computer vision problem, and artificial data synthesis
Photo OCR: photo Optical Character Recognition
How to get the computer to understand the content of these pictures a little better, i.e. to read the text that appears in the images we take
Applications: camera aids for blind users, car navigation systems
Photo OCR pipeline: text detection -> character segmentation -> character classification/recognition
18.2 Sliding windows
Step size/stride
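A small sketch of how a sliding window with a given stride enumerates candidate patches. The image and window sizes are made up, and in a real detector each window would be cropped and passed to a classifier, which is not shown here.

```python
def sliding_windows(image_w, image_h, win_w, win_h, stride):
    """Yield (x, y, w, h) for every window that fits fully inside the image."""
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

# A 10x6 image scanned with a 4x4 window and a stride of 2 pixels.
windows = list(sliding_windows(image_w=10, image_h=6, win_w=4, win_h=4, stride=2))
print(len(windows), windows[0], windows[-1])
```

A smaller stride gives more windows and finer localization at a higher computational cost.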
18.3 Getting lots of data and artificial data
Character recognition: with a font library, a training set can be built from scratch; it can also be grown by augmenting existing samples.
Ways of Getting more data: artificial data synthesis; collect/label it yourself; “crowd source” (E.g. Amazon Mechanical Turk)
18.4 Ceiling analysis: what part of the pipeline to work on next?