Notes on Machine Learning
Ⅰ.Main Steps
Step 1. Model
Create some functions and put them in a function set. For example, the linear model y = b + w · x_cp gives one function for every choice of parameters, where:
b: bias
w: weight
y: predicted output for the current input x_cp
So, now there are many functions in our function set, we'll test them later to find out which one is the best.
Step 2. Goodness of Function
Collect training data.
Test the functions created previously with a Loss Function, e.g. L(f) = Σₙ (ŷⁿ − f(xⁿ))².
ŷⁿ: actual value of the nth output.
Estimation error: the smaller, the better. A minimal sketch of the model and loss follows.
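As a concrete illustration, here is a minimal Python sketch of Steps 1 and 2, assuming the 1-D linear model y = b + w · x_cp above and the squared-error loss; the training data values are hypothetical.

```python
import numpy as np

x_train = np.array([10.0, 20.0, 30.0, 40.0])   # hypothetical inputs (x_cp)
y_train = np.array([15.0, 25.0, 33.0, 46.0])   # hypothetical actual values (y hat)

def predict(w, b, x):
    """One member of the function set: y = b + w * x."""
    return b + w * x

def loss(w, b, x, y):
    """Sum of squared estimation errors over the training data."""
    return np.sum((y - predict(w, b, x)) ** 2)

print(loss(1.0, 0.0, x_train, y_train))  # evaluate one candidate function
```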
Step 3. Best Function (Gradient Descent)
If the Loss Function is differentiable, we can use Gradient Descent.
How does it work?
Consider a Loss Function L(w) with one parameter w:
1. (Randomly) pick an initial value w⁰.
2. Compute the derivative dL/dw at w = w⁰:
(1) if it is negative, increase w;
(2) if it is positive, decrease w.
How much should we increase it (the step size)? It depends on the current derivative value and on η (positive correlation). η is called the "learning rate": a value set in advance that determines how far each update moves the parameter.
3. Compute the update w¹ = w⁰ − η · (dL/dw)|w=w⁰.
4. Iterate many times... until we find the global optimum (there is no local optimum in Linear Regression, since the loss function is convex, so don't worry about that), and we obtain a preliminary function.
5. Compute the average error on the training data.
6. Collect another n data points as testing data and compute the average error on the testing data.
7. Observe the results and test the other functions in our function set (maybe a more complicated model?). Usually a more complex model yields lower error on the training data, but its error on the testing data may be higher (overfitting), so we should select a suitable model.
8. Collect more testing data; if our function doesn't fit it well, go back to Step 1 and redesign it.
9. Go back to Step 2 and add regularization (to prevent overfitting), e.g. L = Σₙ (ŷⁿ − (b + Σᵢ wᵢxᵢ))² + λ Σᵢ (wᵢ)².
λ: a value you adjust yourself; the larger λ is, the less weight the training error gets.
Functions with smaller wᵢ are better (their graphs are smoother); we prefer smooth functions, but not too smooth. A gradient-descent sketch of this step follows.
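Here is a minimal sketch of the whole of Step 3 for the 1-D linear model, including the L2 regularizer from point 9; the data, learning rate η, λ, and iteration count are illustrative assumptions, not values from the notes.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])   # hypothetical training inputs
y = np.array([15.0, 25.0, 33.0, 46.0])   # hypothetical targets

w, b = 0.0, 0.0        # step 1: pick initial values
eta = 1e-4             # learning rate (eta), chosen small enough to converge
lam = 0.1              # regularization strength (lambda), illustrative

for _ in range(10000):                 # step 4: many iterations
    err = y - (b + w * x)              # estimation error per example
    grad_w = -2 * np.sum(err * x) + 2 * lam * w   # dL/dw, including the regularizer
    grad_b = -2 * np.sum(err)                     # dL/db
    w -= eta * grad_w                  # steps 2-3: move against the gradient
    b -= eta * grad_b

print(w, b)   # parameters of the preliminary function
```

Note that the bias b is left out of the regularizer: we want a smooth function, i.e. small wᵢ, and penalizing b would not make the function smoother.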
Ⅱ.Analysis of Error
1. Sources of error
Error is due to "bias" and "variance".
A simple model has a Large Bias but a Small Variance; a complex model has a Small Bias but a Large Variance.
Overfitting: error mainly from variance;
Underfitting: error mainly from bias.
2. Diagnosis
If our model cannot even fit the training examples, then we have large bias (underfitting);
If it can fit the training data but has large error on the testing data, then we probably have large variance (overfitting). A small decision-rule sketch follows.
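The diagnosis can be read as a tiny decision rule. The sketch below assumes the average errors have already been measured, and the tolerance threshold is an arbitrary illustration.

```python
def diagnose(train_error, test_error, tolerance=1.0):
    """Rough bias/variance diagnosis from average train/test errors."""
    if train_error > tolerance:
        return "large bias (underfitting): redesign the model"
    if test_error > train_error + tolerance:
        return "large variance (overfitting): more data or regularization"
    return "bias and variance look balanced"

print(diagnose(train_error=5.0, test_error=6.0))
```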
3. Solution
For large bias, we should redesign our model:
· Add more features as input
· A more complex model
For large variance:
· Collect more data: very effective, but not always practical; alternatively we can generate synthetic data, e.g. converting female voices to male voices.
· Regularization: makes the function smoother, but may hurt bias.
4. Model Selection
· There is usually a trade-off between bias and variance.
· Select a model that balances the two kinds of error to minimize total error.
Use Cross Validation or N-fold Cross Validation, as sketched below.
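A minimal sketch of N-fold cross validation; `fit` and `error` here are hypothetical stand-ins for training a model (Steps 1-3) and measuring its average error.

```python
import numpy as np

def n_fold_cv(x, y, fit, error, n_folds=3):
    """Average validation error of a model over n_folds random splits."""
    idx = np.random.permutation(len(x))
    folds = np.array_split(idx, n_folds)
    errors = []
    for i in range(n_folds):
        val = folds[i]                                   # held-out fold
        trn = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = fit(x[trn], y[trn])                      # train on the rest
        errors.append(error(model, x[val], y[val]))      # validate on the fold
    return np.mean(errors)   # pick the model that minimizes this
```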
Ⅲ.Gradient Descent
Ⅳ.Classification
If the observed data are truly sampled from the Generative Model, then fitting the parameters of the generative model to maximize the data likelihood is a common method. However, since most statistical models are only approximations to the true distribution, if the model's application is to infer about a subset of variables conditional on known values of others, then it can be argued that the approximation makes more assumptions than are necessary to solve the problem at hand. In such cases, it can be more accurate to model the conditional density functions directly using a Discriminative Model, although application-specific details will ultimately dictate which approach is most suitable in any particular case.
1.Generative Model
A generative model is a model we assume by observing the training data. It also has three steps, similar to the ML framework above.
Generative models are used in machine learning for either modeling data directly (i.e., modeling observations drawn from a probability density function), or as an intermediate step to forming a conditional probability density function. Generative models are typically probabilistic, specifying a joint probability distribution over observation and target (label) values. A conditional distribution can be formed from a generative model through Bayes' rule.
Step 1. Function Set(Model)
Use Bayes' rule to compute the posterior probability: P(C₁|x) = P(x|C₁)P(C₁) / (P(x|C₁)P(C₁) + P(x|C₂)P(C₂)).
If it > 0.5, output: Class 1;
Otherwise, output: Class 2.
There are many different kinds of probability distribution in the function set, such as the Gaussian Distribution, the Bernoulli Distribution, etc. We choose one ourselves; for example, for binary features we can assume they come from a Bernoulli Distribution.
Step 2. Goodness of a Function
Take the Gaussian Distribution as an example. In this distribution, what we must estimate are the mean μ and the covariance matrix Σ.
Here we use Maximum Likelihood.
(μ*, Σ*) is the pair that maximizes the likelihood L(μ, Σ) over all (μ, Σ).
And it is easy to find: take the partial derivatives of L(μ, Σ) with respect to (μ, Σ) and solve for the (μ, Σ) where they are zero, which gives μ* = (1/N) Σₙ xⁿ and Σ* = (1/N) Σₙ (xⁿ − μ*)(xⁿ − μ*)ᵀ.
If we assume all the dimensions are independent, then we are using a Naive Bayes Classifier.
Step 3. Best Function
In the Generative Model, we estimate N₁, N₂, ..., μ¹, μ², ..., Σ.
With a shared Σ, the posterior turns out to be P(C₁|x) = σ(w · x + b), so these estimates give us w and b; see the sketch below.
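A minimal sketch of Steps 2-3 of the generative model for two classes, assuming Gaussian class-conditionals with a shared covariance matrix; the sample data are hypothetical. Under that assumption, the closed forms w = Σ⁻¹(μ¹ − μ²) and b = −½ μ¹ᵀΣ⁻¹μ¹ + ½ μ²ᵀΣ⁻¹μ² + ln(N₁/N₂) follow from Bayes' rule.

```python
import numpy as np

x1 = np.random.randn(100, 2) + np.array([2.0, 2.0])   # hypothetical Class 1 data
x2 = np.random.randn(80, 2)                           # hypothetical Class 2 data
n1, n2 = len(x1), len(x2)

# Step 2: maximum likelihood estimates of the means and shared covariance
mu1, mu2 = x1.mean(axis=0), x2.mean(axis=0)
cov = (n1 * np.cov(x1, rowvar=False, bias=True)
       + n2 * np.cov(x2, rowvar=False, bias=True)) / (n1 + n2)

# Step 3: closed-form w and b for the posterior P(C1|x) = sigma(w.x + b)
inv = np.linalg.inv(cov)
w = inv @ (mu1 - mu2)
b = -0.5 * mu1 @ inv @ mu1 + 0.5 * mu2 @ inv @ mu2 + np.log(n1 / n2)

def posterior(x):
    """P(C1|x); output > 0.5 means Class 1, otherwise Class 2."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

print(posterior(np.array([1.5, 1.5])))
```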
Here you may wonder: if our purpose is to estimate w and b, why not do it directly?
Actually, we have such a way: the Discriminative Model.
2.Discriminative Model
Discriminative models, also called conditional models, are a class of models used in machine learning for modeling the dependence of unobserved (target) variables y on observed variables x. Within a probabilistic framework, this is done by modeling the conditional probability distribution P(y|x), which can be used for predicting y from x.
In the Discriminative Model, we don't have to assume what the model will be like; instead, we directly estimate w and b.
Step 1. Function Set(Model)
Function set: f(x) = σ(Σᵢ wᵢxᵢ + b), one function for each choice of (w, b).
Step 2. Goodness of a Function (Loss Function)
H(p, q) = −Σₓ p(x) ln q(x) is the cross entropy between distributions p and q; it describes how close the two distributions are. With 0/1 targets, p is deterministic, so H(p, q) reaches its minimum of zero exactly when q matches p.
Compared with Linear Regression: the output σ(w · x + b) lies between 0 and 1 instead of taking any real value, and the loss is cross entropy instead of squared error, yet the gradient-descent update ends up having the same form.
Step 3. Best Function
We also use Gradient Descent.
For Logistic Regression, the value of the target ŷⁿ is 0 or 1, while the value of the output is between 0 and 1. A minimal training sketch follows.
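A minimal sketch of the discriminative model (logistic regression) trained by gradient descent on the cross-entropy loss; the data and learning rate are hypothetical illustrations.

```python
import numpy as np

X = np.vstack([np.random.randn(100, 2) + 2.0,   # hypothetical Class 1 examples
               np.random.randn(100, 2)])        # hypothetical Class 2 examples
t = np.concatenate([np.ones(100), np.zeros(100)])  # targets y hat: 1 or 0

w, b = np.zeros(2), 0.0
eta = 0.1   # learning rate, illustrative

for _ in range(1000):
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid output, always in (0, 1)
    w -= eta * X.T @ (f - t) / len(t)        # gradient of cross entropy w.r.t. w
    b -= eta * np.sum(f - t) / len(t)        # gradient w.r.t. b

print(((f > 0.5) == t).mean())   # training accuracy of the learned function
```

Note how the update w −= η Σₙ (f(xⁿ) − ŷⁿ) xⁿ has the same form as linear regression's, as mentioned above; only the output f is passed through a sigmoid.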
3.Generative vs. Discriminative
In probability and statistics, a generative model is a model for generating all values for a phenomenon, both those that can be observed in the world and "target" variables that can only be computed from those observed. By contrast, discriminative models provide a model only for the target variable(s), generating them by analyzing the observed variables. In simple terms, discriminative models infer outputs based on inputs, while generative models generate both inputs and outputs, typically given some hidden parameters.
Usually, the w and b generated from these two different models are different.
Benefit of Generative Models:
· With the assumption of a probability distribution, less training data is needed;
· With the assumption of a probability distribution, the model is more robust to noise;
· Priors and class-dependent probabilities can be estimated from different sources. For example, in speech recognition the Generative Model is the main method and the Discriminative Model is only one part of it, because we still need a prior probability (the probability of a sentence being spoken), and estimating that doesn't need any audio data.
Benefit of Discriminative Models:
Discriminative models, as opposed to generative models, do not allow one to generate samples from the joint distribution of observed and target variables. However, for tasks such as classification and regression that do not require the joint distribution, discriminative models can yield superior performance (in part because they have fewer variables to compute). On the other hand, generative models are typically more flexible than discriminative models in expressing dependencies in complex learning tasks. In addition, most discriminative models are inherently supervised and cannot easily support unsupervised learning. Application-specific details ultimately dictate the suitability of selecting a discriminative versus generative model.
Keep studying, keep stepping forward.