Machine Learning Principles Explained

Learning is the Result of Representation, Evaluation, and Optimization

The field of machine learning has exploded in recent years, and researchers have developed an enormous number of algorithms to choose from. Despite this great variety, every model can be distilled into the same three components.

The three components that make up a machine learning model are representation, evaluation, and optimization. These three are most directly related to supervised learning, but they apply to unsupervised learning as well.

Representation - this describes how you want to look at your data. Sometimes you may want to think of your data in terms of individual instances (as in k-nearest neighbors) or as a graph (as in Bayesian networks).

Evaluation - for supervised learning purposes, you’ll need to evaluate, or put a score on, how well your learner is doing so it can improve. This evaluation is done using an evaluation function (also known as an objective function or scoring function). Examples include accuracy and squared error.
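
As a quick illustration of these two evaluation functions (a minimal sketch using NumPy; the function names here are mine for this sketch, not from any particular library):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    return np.mean(y_true == y_pred)

def squared_error(y_true, y_pred):
    """Mean squared error between predictions and targets."""
    return np.mean((y_true - y_pred) ** 2)

# A classifier that gets 3 of 4 labels right scores 0.75 on accuracy.
print(accuracy(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0])))   # 0.75
print(squared_error(np.array([1.0, 2.0]), np.array([1.5, 2.0])))  # 0.125
```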

Optimization - using the evaluation function from above, you need to find the learner with the best score using an optimization technique of your choice. Examples include greedy search and gradient descent.
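
As a sketch of what the optimization step can look like, here is a hand-rolled gradient descent minimizing squared error for a one-feature linear model (the learning rate and step count are arbitrary illustrative choices):

```python
import numpy as np

def gradient_descent(x, y, lr=0.05, steps=1000):
    """Fit y ≈ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        residual = w * x + b - y
        # Gradients of mean((w*x + b - y)^2) with respect to w and b.
        w -= lr * (2.0 / n) * np.sum(residual * x)
        b -= lr * (2.0 / n) * np.sum(residual)
    return w, b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                  # ground truth: w = 2, b = 1
print(gradient_descent(x, y))      # converges toward (2.0, 1.0)
```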

Generalization is Key

The power of machine learning comes from not having to hard-code or explicitly define the parameters that describe your training data and unseen data. This is the essential goal of machine learning: to generalize a learner’s findings.

To test a learner’s generalizability, you’ll want to have a separate test data set that is not used in any way in training the learner. This can be created either by splitting your entire data set into a training set and a test set, or by collecting more data. If the learner were to use data found in the test data set, this would bias the learner to appear better than it really is.
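
A minimal sketch of the first option, splitting one data set into a training portion and a held-out test portion (the 80/20 ratio and the helper name are arbitrary choices for illustration):

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the data, then hold out a fraction as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]

X = np.arange(10).reshape(10, 1)
y = np.arange(10)
X_train, y_train, X_test, y_test = train_test_split(X, y)
print(len(X_train), len(X_test))   # 8 2
```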

One method to get a sense of how your learner will do on a test data set is to perform what is called cross-validation. This randomly splits your training data into a given number of subsets (for example, ten subsets) and leaves one subset out while the learner trains on the rest. Once the learner has been trained, the left-out subset is used for testing. This cycle of training and testing is repeated as you rotate through the subsets, leaving a different one out each time.
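
A bare-bones sketch of this procedure (the `train` and `error` arguments are placeholders standing in for any learner and any evaluation function):

```python
import numpy as np

def k_fold_cv(X, y, train, error, k=10, seed=0):
    """Average held-out error over k rotations of the subsets."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        held_out = folds[i]                               # subset left out this round
        rest = np.concatenate(folds[:i] + folds[i + 1:])  # learner trains on the rest
        model = train(X[rest], y[rest])
        scores.append(error(model, X[held_out], y[held_out]))
    return np.mean(scores)
```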

Beware of Overfitting

If a learning algorithm fits a given training set well, this does not by itself indicate a good hypothesis. Overfitting occurs when the hypothesis fits your training set too closely: it has high variance and low error on the training set while having a high error on any other data.

In other words, overfitting occurs if the error of the hypothesis as measured on the data set that was used to train the parameters happens to be lower than the error on any other data set.
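
To make this concrete, here is a small sketch (synthetic data, NumPy polynomial fitting): a degree-9 polynomial threaded through ten noisy points has near-zero training error, but its error on fresh points from the same underlying curve is typically far higher.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)

# A degree-9 polynomial passes (almost) exactly through all 10 points...
coeffs = np.polyfit(x_train, y_train, deg=9)
train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

# ...but on unseen points from the same curve, the error blows up.
x_test = np.linspace(0.05, 0.95, 10)
y_test = np.sin(2 * np.pi * x_test)
test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print(train_err, test_err)   # train error ~ 0, test error much larger
```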

Choosing an Optimal Polynomial Degree

Choosing the right polynomial degree for the hypothesis function is important in avoiding overfitting. This can be achieved by testing each polynomial degree and observing its effect on the error over different parts of the data set. Hence, we can break our data set down into 3 parts that can be used to optimize both the hypothesis’s parameters Θ and the polynomial degree.

A good breakdown of the data set is:

  • Training set: 60%
  • Cross validation set: 20%
  • Test set: 20%

The three error values can thus be calculated by the following method:

  1. Use the training set to optimize the parameters in Θ for each candidate polynomial degree
  2. Use the cross validation set to find the polynomial degree with the lowest error
  3. Use the test set to estimate the generalization error (a sketch of all three steps follows this list)
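
A sketch of all three steps together, using NumPy polynomial fits as the learner on synthetic data (the data, the degrees tried, and the split are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 100)

# 60/20/20 split into training, cross validation, and test sets.
idx = rng.permutation(100)
tr, cv, te = idx[:60], idx[60:80], idx[80:]

def mse(coeffs, xs, ys):
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

# 1. Optimize the parameters Θ on the training set for each degree.
fits = {d: np.polyfit(x[tr], y[tr], d) for d in range(1, 11)}
# 2. Pick the degree with the lowest cross validation error.
best = min(fits, key=lambda d: mse(fits[d], x[cv], y[cv]))
# 3. Estimate the generalization error on the untouched test set.
print(best, mse(fits[best], x[te], y[te]))
```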

Ways to Fix Overfitting

These are some of the ways to address overfitting:

  1. Getting more training examples
  2. Trying a smaller set of features
  3. Increasing the regularization parameter λ (see the sketch below)
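
For the third fix, here is a sketch of how a larger λ shrinks the parameters, using ridge regression in closed form on degree-9 polynomial features (all of the data and values here are illustrative; λ multiplies the identity in the regularized normal equation):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve min ||Xw - y||^2 + lam*||w||^2 via the regularized normal equation."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)
X = np.vander(x, 10)   # degree-9 polynomial features

for lam in [0.0, 0.1, 10.0]:
    w = ridge_fit(X, y, lam)
    # Larger λ shrinks the weights, taming the wiggly high-degree terms.
    print(lam, round(float(np.linalg.norm(w)), 3))
```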

Translated from: https://www.freecodecamp.org/news/machine-learning-principles-explained/
