NLP(1)----Introduction to neural networks

Nowadays, neural networks are applied very widely. At the same time, deep learning has spread further because the computing power of computers has improved greatly. This post tries to give a brief introduction to neural networks.

These are my class notes from learning NLP, which I share on my blog. — Mr. Wang QingBang

Directory

Neural Networks

  • Numbers
  • Variables
  • Operators
  • Functions
  • Parameters
  • Cost Functions
  • Optimizers
  • Gradients

Deep Learning

  • Nonlinear Neural Models
  • Multilayer Perceptrons
  • Using Discrete Variables
  • Example Application

Body

Numbers
As far as I'm concerned, a number is a symbol for a thing. Because of numbers, we can describe the objective characteristics of things more precisely.

Operators
In fact, an operator directly expresses the relationship between two numbers.

Functions
When the operations become complicated, we need to abstract them. This abstraction is the function.

y = 3x
Here x is the input and y is the output.

Interestingly, once we have a function, we can accurately predict the output y corresponding to each input x.
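As a tiny illustration (my own addition), the function y = 3x in Python, predicting y for any x:

def f(x):
    # y = 3x
    return 3 * x

print(f(2))   # 6
print(f(10))  # 30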

In a sense, many of the things we do are similar to a function, such as translation.

x → function → y
Obviously, the problem is how to find this function, and that is also the hardest part. It reminds me of the expert knowledge bases that emerged twenty years ago, which made predictions or translations with hand-specified rules; nowadays, if we use machine learning, we achieve better results.

Parameters
If, in the formula above, we add symbols to express the relationship between x and y, those symbols are the parameters.

y = Wx + B
W and B are parameters. They come from the data, and they need to be estimated.

So, how do we estimate the parameters?
We need data. For example:

Now, we give you some data to estimate the parameters (W and B).

x | y
--+---
1 | 0
5 | 16
6 | 20

You can get
{
   y = 1x + 0
  1 = 1 * 1 + 0
  5 = 1 * 5 + 0
  6 = 1 * 6 + 0
}

But you can also get
{
   y = 2x + 2
  4 = 2 * 1 + 2
  12 = 2 * 5 + 2
  14 = 2 * 6 + 2
}

So, which one is better?

Cost Function
We need a function to evaluate the model. ⇒ C(w,b)
So:
C(W, b) = \sum_{n \in \{0, 1, 2\}} \left( y_n - \hat{y}_n \right)^2

Therefore, we can use the cost function to compare the two candidates. Computing it on the data above, we conclude that the second one is better.
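A quick check in Python (my own sketch, not part of the original notes) that evaluates C(W, B) for the two candidates on the table above:

def cost(W, B, xs, ys):
    # C(W, B) = sum over the data of (y_n - y_hat_n)^2
    return sum((y - (W * x + B)) ** 2 for x, y in zip(xs, ys))

xs = [1, 5, 6]
ys = [0, 16, 20]
print(cost(1, 0, xs, ys))  # 318 for y = 1x + 0
print(cost(2, 2, xs, ys))  # 68  for y = 2x + 2, the better of the two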

Optimizers
How to find the parameters w and b?
It’s not a simple question. We need to find a method to optimize our model.
In simple terms, we keep trying different values of w and b and keep whichever pair gives a lower cost, as in the sketch below.
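A minimal brute-force sketch of this idea (my own illustration, not from the original notes): scan a grid of candidate (w, b) values and keep the pair with the lowest cost. It reuses the cost() helper and the toy data from the sketch above; the grid range and step size are arbitrary choices.

best = None
for w10 in range(-50, 51):          # w from -5.0 to 5.0 in steps of 0.1
    for b10 in range(-50, 51):      # b from -5.0 to 5.0 in steps of 0.1
        w, b = w10 / 10.0, b10 / 10.0
        c = cost(w, b, xs, ys)      # squared-error cost on the toy data
        if best is None or c < best[0]:
            best = (c, w, b)
print(best)  # (0.0, 4.0, -4.0): y = 4x - 4 happens to fit this toy data exactly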

Gradients
Step size: we try different parameter values one by one, but what is the right amount to change each time?
Sometimes we will have millions of parameters, for example when working with images.

\underset{w,\, b \in [-\infty, \infty]}{\arg\min}\; C(w, b, \ldots)
Of course, you can make the step size small enough, but this also means you will spend more time adjusting the parameters.
In fact, we just use derivatives to solve the problem: we can substitute the data into them.
\frac{\partial C}{\partial w} = \frac{\partial \sum_n (y_n - \hat{y}_n)^2}{\partial w} = \sum_n -2\,(y_n - \hat{y}_n)\, x_n

\frac{\partial C}{\partial b} = \frac{\partial \sum_n (y_n - \hat{y}_n)^2}{\partial b} = \sum_n -2\,(y_n - \hat{y}_n)
In each iteration we substitute the value of the current parameters.
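As a minimal sketch of this loop (my own addition; the learning rate and iteration count are assumed values), plain gradient descent on the toy data using the two derivatives above:

xs = [1, 5, 6]
ys = [0, 16, 20]
w, b = 0.0, 0.0
lr = 0.01  # assumed step size
for _ in range(5000):
    # dC/dw = sum(-2 * (y - y_hat) * x), dC/db = sum(-2 * (y - y_hat))
    dw = sum(-2 * (y - (w * x + b)) * x for x, y in zip(xs, ys))
    db = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys))
    w, b = w - lr * dw, b - lr * db
print(w, b)  # approaches w = 4, b = -4, which fits this toy data exactly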
Code implementation (stochastic gradient descent with a decaying learning rate):

import numpy as np

def dJ_sgd(theta, X_b_i, y_i):
    # gradient of the squared error for a single sample (X_b_i, y_i)
    return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2.

def sgd(X_b, y, initial_theta, n_iters):
    t0 = 5
    t1 = 50

    def learning_rate(t):
        # decaying step size: larger at the start, smaller as iterations go on
        return t0 / (t + t1)

    theta = initial_theta
    for cur_iter in range(n_iters):
        # pick one random sample and step along its negative gradient
        rand_i = np.random.randint(len(X_b))
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
        theta = theta - learning_rate(cur_iter) * gradient
    return theta
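A possible way to call it on the toy data above (my own usage sketch; here X_b prepends a column of ones so that theta = [b, w]):

X_b = np.array([[1., 1.], [1., 5.], [1., 6.]])  # first column is a constant 1 for b
y = np.array([0., 16., 20.])
theta = sgd(X_b, y, initial_theta=np.zeros(2), n_iters=10000)
print(theta)  # should approach [b, w] = [-4, 4]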


Deep Learning

Nonlinear Neural Models
y = 2x is a very typical linear model. However, the problem is that most real-life phenomena are not linear. Adding an activation function is the approach that most people recognize.
For example:

  • sigmoid function
    S(t) = \frac{1}{1 + e^{-t}}
    We can build complex multilayer perceptrons just by using the sigmoid function.
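For reference, a small numpy version of the sigmoid (my own sketch, not from the notes):

import numpy as np

def sigmoid(t):
    # S(t) = 1 / (1 + e^(-t)); squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(0.0))                   # 0.5
print(sigmoid(np.array([-5., 5.])))   # approx [0.0067, 0.9933]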

Multilayer Perceptrons
In most cases, a single layer of perceptrons is not very satisfactory. Therefore, we often use multilayer perceptrons.
In my opinion, multilayer perceptrons are more like functions with multiple parameters. Most of the perceptrons we see look something like the sketch below.
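A minimal sketch of a two-layer perceptron as code (my own illustration; the layer sizes and random weights are made up), using the sigmoid from the previous section:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(x, W1, b1, W2, b2):
    # a two-layer perceptron: linear -> sigmoid -> linear
    h = sigmoid(W1.dot(x) + b1)  # hidden layer
    return W2.dot(h) + b2        # output layer

# hypothetical shapes: 3 inputs, 4 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
print(mlp_forward(np.array([1.0, 2.0, 3.0]), W1, b1, W2, b2))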
Don't think that one or two layers of perceptrons can't complete the task. However, this may cause some problems, such as overfitting and underfitting.
(figure: task complexity versus model complexity)
Obviously, the horizontal axis is model complexity and the vertical axis is task complexity. When you add more layers, you also get more explanatory power, but you run the risk of overfitting unless you have enough data.

In this case, bias plays a very important role. Regularization keeps our model from having to fit every characteristic of the data and becoming overly complicated; one common form is sketched below.
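One common form (my own addition; the notes do not say which regularizer is meant) is to add an L2 penalty to the cost, so that large weights are punished:

C_{\text{reg}}(W, b) = \sum_{n} \left( y_n - \hat{y}_n \right)^2 + \lambda W^2

where λ controls how strongly complexity is penalized.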

Using Discrete Variables
Discrete variables play a very important role in classification-style models. So how do we handle them?

We usually get the outputs as probabilities, one for each possible class.
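One common way to get such probabilities (my own addition; the notes do not name the method) is the softmax function over the model's output scores:

import numpy as np

def softmax(scores):
    # shift by the max for numerical stability, then normalize the exponentials
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx [0.66, 0.24, 0.10]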

Summary
We can use a graph to summarize the core idea (figure omitted).
End

Thanks
