拼命三娘冲 (2019-11-18): Andrew Ng Course Notes 2

This article covers the model selection process in machine learning, such as choosing the polynomial degree, the regularization parameter, and the number of hidden layers in a neural network, and explains why the data is split into training, cross-validation, and test sets. It also discusses error analysis, prioritization, the precision/recall trade-off, and the role of the F1 score in model evaluation, and touches on support vector machines, clustering, dimensionality reduction, anomaly detection, and recommender systems.
  1. machine learning in practice
    10.3-7 model selection
    The process of choosing among different values for the parameters/settings of the model
    In model selection, cases where we may need to make a subjective choice include:
    Degree of the polynomial
    Regularization parameter
    The number of hidden layers in a neural network
    Approach:
    Propose several candidate model hypotheses and fit each model's parameters on the training set.
    With each candidate's parameters fixed, compute the error on the cross-validation set and choose the model with the smallest CV error.
    The test set is then used to estimate the generalization error of the chosen model.
    Why split the original data set into training, cross-validation (CV), and test sets rather than only training and test sets?
    With only two sets, the best model would be chosen by minimizing the test error; re-using that same test set to judge how well the chosen model performs is then meaningless, because the error is no longer a measure of generalization. The correct approach is to evaluate on data that was never used during model selection, which gives a proper generalization error.
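    A minimal sketch of this selection procedure in Python (the polynomial-degree candidates, the 60/20/20 split, and the toy data are illustrative assumptions, not from the course):

import numpy as np

def poly_features(x, degree):
    # Map a 1-D input to columns [x, x^2, ..., x^degree]
    return np.column_stack([x ** d for d in range(1, degree + 1)])

def fit_linear(X, y):
    # Least-squares fit with an intercept term
    Xb = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return theta

def mse(theta, X, y):
    Xb = np.column_stack([np.ones(len(X)), X])
    return np.mean((Xb @ theta - y) ** 2) / 2

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 100)
y = x ** 3 - 2 * x + rng.normal(0, 1, 100)
idx = rng.permutation(100)
tr, cv, te = idx[:60], idx[60:80], idx[80:]          # train / CV / test split

best_deg, best_cv_err = None, np.inf
for degree in range(1, 11):                                    # candidate models
    theta = fit_linear(poly_features(x[tr], degree), y[tr])    # fit on the training set
    cv_err = mse(theta, poly_features(x[cv], degree), y[cv])   # compare on the CV set
    if cv_err < best_cv_err:
        best_deg, best_cv_err, best_theta = degree, cv_err, theta

# Generalization error is reported on the untouched test set
test_err = mse(best_theta, poly_features(x[te], best_deg), y[te])
print(best_deg, test_err)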
  2. machine learning system design
    11.1 Prioritizing what to work on?
    Example: building a spam classifier
    11.2 error analysis (based on CV)
    Manually inspecting the errors (e.g. misclassified CV examples) can inspire new features
    "Stemming" software (e.g. the Porter stemmer)
    A numerical evaluation metric is needed (e.g. cross-validation error); it makes it much easier to decide whether a change to the model helps
    11.3 Error metrics for skewed classes
    E.g. cancer classification: a skewed class calls for a different error evaluation metric
    Precision/recall: used when y=1 denotes the rare class
    11.4-5 Tradeoff between precision and recall
    The precision-recall curve can take many different possible shapes
    F1 score (F score): how to combine the precision (P) and recall (R) numbers into one score
    Simple average (P+R)/2? (counterexample: always predicting y=1 gives recall = 1 and can make the average look good)
    F1 score = 2PR/(P+R); the product form is reasonable because either a low P or a low R drags the score down
    Varying the threshold, you can control the tradeoff between precision and recall.
    The F1 score gives you a single real-number evaluation metric
    To select the threshold, use the same method as model selection: pick the value that does best (e.g. highest F1) on the cross-validation set
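    A small sketch of computing precision, recall, and F1 and then choosing the threshold on the CV set (the labels and predicted probabilities below are made up for illustration):

import numpy as np

def precision_recall_f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# Hypothetical CV-set labels and predicted probabilities h(x)
y_cv = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0, 1])
p_cv = np.array([0.9, 0.4, 0.2, 0.6, 0.1, 0.8, 0.3, 0.5, 0.2, 0.7])

# Try a range of thresholds and keep the one with the highest F1 on the CV set
best = max(
    ((t,) + precision_recall_f1(y_cv, (p_cv >= t).astype(int)) for t in np.arange(0.1, 1.0, 0.1)),
    key=lambda row: row[3],
)
print("threshold=%.1f precision=%.2f recall=%.2f F1=%.2f" % best)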
    11.6 Data for machine learning
    Perceptron (logistic regression)
    Winnow
    Memory-based
    Naïve Bayes
    Data vs. algorithm: the amount of data can matter more than which of these algorithms is used
    What can really drive performance is giving the algorithm a ton of training data.
    "It's not who has the best algorithm that wins, it's who has the most data." (Banko and Brill, 2001)
    Large data rationale
    Many parameters (low-bias algorithm): the training error J_train will be small
    Large training set (low variance): J_train ≈ J_test, so the test error will be small too
    12 support vector machines
    12.1 Optimization objective
    Supervised algorithm
    Performance depends on the choice of features and of the regularization parameter
    The SVM is a supervised learning algorithm that often performs well compared with logistic regression and neural networks
    12.2 large margin intuition (large margin classifier)
    Why the SVM prefers the larger margin (worth reading more about)
    12.3 mathematics
    Vector inner product
    SVM decision boundary
    12.4-5 kernels and how to choose the landmarks
    SVM with kernels
    SVM parameters
    12.6 use SVM
    Other choices of kernel
    Note: not all similarity functions make valid kernels (they need to satisfy a technical condition called "Mercer's Theorem" so that SVM packages' optimizations run correctly and do not diverge)
    Many off-the-shelf kernels available:
    Polynomial kernel
    More esoteric: string kernel, chi-square kernel, histogram intersection kernel…
    Multi-class classification
    Many SVM packages already have built-in multi-class classification functionality
    Otherwise, use the one-vs.-all method.
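    For reference, a minimal sketch of an SVM with a Gaussian (RBF) kernel using scikit-learn as the off-the-shelf package; C plays the role of the regularization parameter and gamma = 1/(2σ²) controls the kernel width (the toy data is an illustrative assumption):

import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: positive class inside the unit circle, negative outside
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (np.sum(X ** 2, axis=1) < 1.0).astype(int)

# Larger C -> lower bias / higher variance; larger gamma -> narrower Gaussian kernel
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
clf.fit(X, y)
print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))   # expect the inside point to be 1, the outside point 0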
    13 Clustering
    13.1 Unsupervised learning
    Give an unlabeled training set to an algorithm and just ask it to find some structure in the data
    13.2 k-means algorithm
    Random initialization (picking K training examples as the initial centroids works best)
    Iterative algorithm: cluster assignment step, then move centroid step (see the sketch below)
    Repeat until the labels and the centroids no longer change
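    A minimal NumPy sketch of one k-means run (the value of K and the toy data are illustrative assumptions; empty clusters are not handled):

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization: pick K distinct training examples as the initial centroids
    centroids = X[rng.choice(len(X), K, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Cluster assignment step: assign each example to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Move centroid step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):   # stop once the centroids no longer change
            break
        centroids = new_centroids
    return centroids, labels

# Illustrative data: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
centroids, labels = kmeans(X, K=2)
print(centroids)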
    13.3 Optimization objective
    13.4 Random initialization
    With K < m, pick K training examples at random and set them as the initial cluster centroids
    Why repeat random initialization? Whether k-means ends up in a bad local optimum, and how slowly it converges, can depend on the initialization, so run it several times and keep the clustering with the lowest cost
    13.5 Choosing the number of clusters
    Elbow method
    Sometimes k-means is run to get clusters for a later/downstream purpose; in that case, evaluate the choice of K by how well the clusters serve that later purpose
    14 Dimensionality Reduction
    14.1 motivation1: data compression
    Compressed data use less computer memory or disk space, and speed up our learning algorithm.
    If we allow ourselves to approximate the original data set by projecting all the examples onto a line (or a lower-dimensional surface), then only one number (or a few) is needed to specify the position of each point.
    14.2 motivation2: data visualization
    14.3 principal component analysis (PCA) problem formulation
    PCA finds a vector (or set of vectors) onto which to project the data so as to minimize the projection error (perform mean normalization and feature scaling first)
    PCA is not Linear regression
    14.4 PCA algorithm
    Data preprocessing
    Compute “covariance matrix” and “eigenvectors”
    14.5 choosing the number of principal components K
    Reconstruction from compressed representation
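    A short sketch of the PCA steps above: mean normalization, the covariance matrix, its eigenvectors via SVD, projection to K dimensions, reconstruction, and choosing K by variance retained (the data and the 99% target are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # m x n data matrix (illustrative)

# Data preprocessing: mean normalization (add feature scaling if the scales differ)
mu = X.mean(axis=0)
X_norm = X - mu

# Covariance matrix Sigma = (1/m) X^T X and its eigenvectors via SVD
m = X_norm.shape[0]
Sigma = (X_norm.T @ X_norm) / m
U, S, _ = np.linalg.svd(Sigma)

K = 2
U_reduce = U[:, :K]                        # first K principal components
Z = X_norm @ U_reduce                      # compressed representation (m x K)
X_approx = Z @ U_reduce.T + mu             # reconstruction from the compressed representation

# Choosing K: smallest K that retains, say, 99% of the variance
variance_retained = np.cumsum(S) / np.sum(S)
K_auto = int(np.argmax(variance_retained >= 0.99)) + 1
print(K_auto, X_approx.shape)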
    14.7 advice for applying PCA
    Run PCA on the training set only (not the CV/test sets) to obtain the mapping x -> z, then apply that same mapping to the CV and test sets.
    Using PCA to prevent overfitting is a bad use of PCA (it throws information away; use regularization instead)
    The advice is to try the raw data first; only if you have evidence or a strong reason to believe it is needed should you run PCA and consider using the compressed data.
    15 Anomaly detection
    15.1 problem motivation
    Example:
    Fraud detection
    Manufacturing
    15.2 Gaussian Distribution (Normal distribution)
    15.3 Algorithm
    Model p(x) from the data set, figuring out which feature values have high probability and which have low probability
    p(x) = p(x_1; μ_1, σ_1²) p(x_2; μ_2, σ_2²) … p(x_n; μ_n, σ_n²)
    p(x) = ∏_{j=1}^{n} p(x_j; μ_j, σ_j²)
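    A brief sketch of the algorithm: estimate μ_j and σ_j² per feature from (mostly normal) training data, then flag examples with p(x) < ε; the threshold ε and the data are illustrative, and ε would normally be chosen on a labeled CV set:

import numpy as np

def fit_gaussians(X):
    # Maximum-likelihood estimates of mu_j and sigma_j^2 for each feature
    return X.mean(axis=0), X.var(axis=0)

def p(X, mu, sigma2):
    # p(x) = prod_j p(x_j; mu_j, sigma_j^2), treating the features as independent
    coef = 1.0 / np.sqrt(2 * np.pi * sigma2)
    densities = coef * np.exp(-((X - mu) ** 2) / (2 * sigma2))
    return densities.prod(axis=1)

rng = np.random.default_rng(0)
X_train = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(500, 2))   # "normal" examples
mu, sigma2 = fit_gaussians(X_train)

X_new = np.array([[0.2, 5.5], [6.0, -4.0]])   # the second point is clearly anomalous
epsilon = 1e-4                                 # illustrative threshold
print(p(X_new, mu, sigma2) < epsilon)          # expect [False, True]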
    15.4 Anomaly detection v.s. supervised learning
    When the number of positive (y=1) examples is comparably large, supervised learning is appropriate; when positives are very rare, anomaly detection is the better fit.
    Examples for anomaly detection: fraud detection, manufacturing, monitoring machine in a data center
    Examples for supervised learning: e-mail spam classification, weather prediction, cancer classification
    15.5 Choose what features to use
    Non-Gaussian features: the most common fix is to apply a log transformation to the data (other transformations also work)
    Error analysis
    15.6-7 multivariate Gaussian distribution
    Parameter fitting
    Relationship to original model
    The original model corresponds to a multivariate Gaussian whose covariance matrix Σ has the per-feature variances on the diagonal and zeros elsewhere (axis-aligned contours)

Original model:
    Manually create features to capture unusual combinations of values
    Computationally cheaper
    OK even if m is small
Multivariate Gaussian model:
    Automatically captures correlations between features
    Computationally more expensive
    Must have m > n, else Σ is non-invertible
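A sketch of parameter fitting and the density for the multivariate Gaussian model, which uses the full covariance matrix Σ instead of per-feature variances (the correlated toy data is an illustrative assumption):

import numpy as np

def fit_multivariate_gaussian(X):
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / X.shape[0]        # full n x n covariance matrix
    return mu, Sigma

def p_multivariate(X, mu, Sigma):
    n = mu.size
    Xc = X - mu
    inv = np.linalg.inv(Sigma)              # requires m > n so that Sigma is invertible
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * np.sum(Xc @ inv * Xc, axis=1))

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 2.0]], size=500)   # correlated features
mu, Sigma = fit_multivariate_gaussian(X)
print(p_multivariate(np.array([[0.0, 0.0], [3.0, -3.0]]), mu, Sigma))     # the second point should get a much lower density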

16 Recommender System
16.1 problem formulation
16.2 Content-based recommendation
Content-based means that the features of the movies are known
16.3 Collaborative Filtering
16.4 Collaborative Filtering Algorithm
For content-based recommendation: given the features, learn the user parameters
For collaborative filtering: given the user parameters, learn the features
The collaborative filtering algorithm puts the two together and learns both simultaneously.
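A compact sketch of the collaborative filtering cost and gradients, learning the movie features X and the user parameters Theta at the same time; Y holds the ratings, R marks which entries are rated, and the dimensions, lambda, and step size are illustrative assumptions:

import numpy as np

def cofi_cost_grad(X, Theta, Y, R, lam):
    # Squared error over rated entries only, plus regularization on X and Theta
    E = (X @ Theta.T - Y) * R
    J = 0.5 * np.sum(E ** 2) + lam / 2 * (np.sum(X ** 2) + np.sum(Theta ** 2))
    X_grad = E @ Theta + lam * X
    Theta_grad = E.T @ X + lam * Theta
    return J, X_grad, Theta_grad

rng = np.random.default_rng(0)
num_movies, num_users, num_features = 5, 4, 3
Y = rng.integers(0, 6, size=(num_movies, num_users)).astype(float)   # ratings 0..5
R = (rng.random((num_movies, num_users)) > 0.3).astype(float)        # 1 where a rating exists

# Small random initialization, then a few steps of plain gradient descent on both X and Theta
X = rng.normal(scale=0.1, size=(num_movies, num_features))
Theta = rng.normal(scale=0.1, size=(num_users, num_features))
alpha, lam = 0.01, 1.0
for _ in range(200):
    J, X_grad, Theta_grad = cofi_cost_grad(X, Theta, Y, R, lam)
    X -= alpha * X_grad
    Theta -= alpha * Theta_grad
print(J)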
16.5 Vectorization: low-rank matrix factorization
Example: how to find the movies j most related to movie i, i.e. the most similar movies
16.6 Implementational detail
Use mean normalization to handle users who have not rated any of the movies.
17 Large scale machine learning
17.1 Learning with large datasets
Two computationally reasonable approaches are stochastic gradient descent and map-reduce.
17.2-4 Stochastic Gradient descent & Mini-batch Gradient descent
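A minimal sketch of mini-batch gradient descent for linear regression (setting batch_size=1 gives stochastic gradient descent); the data, step size, and batch size are illustrative assumptions:

import numpy as np

def minibatch_gd(X, y, alpha=0.01, batch_size=10, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)                  # shuffle, then sweep the data in mini-batches
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
            theta -= alpha * grad                   # update after each mini-batch, not each full pass
    return theta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(1000), rng.normal(size=1000)])
y = X @ np.array([2.0, -3.0]) + rng.normal(0, 0.1, 1000)
print(minibatch_gd(X, y))   # should approach [2, -3]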
17.5 Online learning setting
Continuous stream of data
Adapt to changing user preferences and keep track of users' behavior/taste
Example: product search
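A tiny sketch of an online learning update, assuming logistic regression for a click-prediction style task; the feature vectors and learning rate are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_update(theta, x, y, alpha=0.1):
    # One gradient step on the single (x, y) pair that just arrived; the example is then discarded
    return theta - alpha * (sigmoid(theta @ x) - y) * x

theta = np.zeros(3)
stream = [(np.array([1.0, 0.2, 0.8]), 1),   # hypothetical (features, clicked?) events
          (np.array([1.0, 0.9, 0.1]), 0)]
for x, y in stream:
    theta = online_update(theta, x, y)
print(theta)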
17.6 Map-reduce and data parallelism
Example: Split the training set to 4 machines, compute temporary variable separately, combine together
Many learning algorithms can be expressed as computing sums of functions over the training set
Example: logistic regression
Multi-core machines
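A sketch of the map-reduce idea for a gradient that is a sum over the training set: split the data into shards, compute the partial sums in parallel (a multiprocessing pool stands in here for separate machines or cores), then combine them; the linear-regression data is illustrative:

import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    # "Map" step: the gradient contribution of one shard of the training set
    X_shard, y_shard, theta = args
    return X_shard.T @ (X_shard @ theta - y_shard)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4000, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 4000)
    theta = np.zeros(3)

    shards = [(Xs, ys, theta) for Xs, ys in zip(np.array_split(X, 4), np.array_split(y, 4))]
    with Pool(4) as pool:                        # 4 workers ~ 4 machines / cores
        partials = pool.map(partial_gradient, shards)
    grad = sum(partials) / len(X)                # "Reduce" step: combine the partial sums
    print(grad)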
18 application example: photo OCR
18.1 problem description and pipeline
How a complex machine learning system can be put together
The concept of a machine learning pipeline and how to allocate resources
Applying machine learning to a computer vision problem, and artificial data synthesis
Photo OCR: photo Optical Character Recognition
How to get the computer to understand the content of these pictures a little better, i.e. to read the text that appears in the images we take
Applications: cameras for the blind, car navigation systems
Photo OCR pipeline: text detection -> character segmentation -> character classification/recognition
18.2 Sliding windows
Stepsize/stride
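A small sketch of sliding a fixed-size window across an image with a given step size (stride); looks_like_text is a hypothetical stand-in for the trained text/character classifier applied to each patch:

import numpy as np

def sliding_windows(image, win_h, win_w, stride):
    # Yield (row, col, patch) for every window position
    H, W = image.shape
    for r in range(0, H - win_h + 1, stride):
        for c in range(0, W - win_w + 1, stride):
            yield r, c, image[r:r + win_h, c:c + win_w]

def looks_like_text(patch):
    # Hypothetical classifier stub; in the real pipeline this would be the trained detector
    return patch.mean() > 0.5

image = np.random.default_rng(0).random((60, 100))   # illustrative grayscale image
detections = [(r, c) for r, c, patch in sliding_windows(image, 20, 20, stride=4)
              if looks_like_text(patch)]
print(len(detections))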
18.3 Getting lots of data and artificial data
Character recognition: with a library of fonts, a training set can be built from scratch; the training set can also be expanded from existing examples (e.g. by adding distortions).
Ways of Getting more data: artificial data synthesis; collect/label it yourself; “crowd source” (E.g. Amazon Mechanical Turk)
18.4 Ceiling analysis: what part of the pipeline to work on next?
