NLP(1)----Introduction to neural networks

Nowadays, neural networks are applied very widely. At the same time, deep learning has spread further because the computing power of computers has improved greatly. This post tries to give a brief introduction to neural networks.

These are my class notes from learning NLP, which I share on my blog. — Mr. Wang QingBang

Directory

Neural Networks

  • Numbers
  • Variables
  • Operators
  • Functions
  • Parameters
  • Cost Functions
  • Optimizers
  • Gradients

Deep Learning

  • Nonlinear Neural Models
  • Multilayer Perceptrons
  • Using Discrete Variables
  • Example Application

Body

Numbers
As far as I'm concerned, a number is a symbol for a thing. Because of numbers, we can describe the objective characteristics of things more precisely.

Operators
In fact, an operator directly expresses the relationship between two numbers.

Functions
When the operations become complicated, we need to abstract them. This abstraction is the function.

y = 3x
Here x is the input and y is the output.

Interestingly, once we have a function, we can accurately predict the output y corresponding to each input x.
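As a tiny illustration (my own addition), the function y = 3x in Python, predicting y for any x:

def f(x):
    # y = 3x
    return 3 * x

print(f(2))   # 6
print(f(10))  # 30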

In a sense, many of the things we do are similar to a function, such as translation.

x → function → y
Obviously, the problem is how to find this function, and that is also the hardest part. It reminds me of the expert knowledge bases that emerged twenty years ago, which made predictions or translations with hand-specified rules; nowadays, if we use machine learning, we achieve better results.

Parameters
If, in the formula above, we add symbols to express the relationship between x and y, those symbols are the parameters.

y = Wx + B
W and B are parameters. They come from the data, and they need to be estimated.

So, how do we estimate the parameters?
We need data. For example:

Now, we give you some data to estimate the parameters (W and B).

x | y
--+---
1 | 0
5 | 16
6 | 20

You can get
{
   y = 1x + 0
  1 = 1 * 1 + 0
  5 = 1 * 5 + 0
  6 = 1 * 6 + 0
}

But you can also get
{
   y = 2x + 2
  4 = 2 * 1 + 2
  12 = 2 * 5 + 2
  14 = 2 * 6 + 2
}

So, which one is better?

Cost Function
We need a function to evaluate the model. ⇒ C(w,b)
So:
C(W, b) = \sum_{n \in \{0, 1, 2\}} \left( y_n - \hat{y}_n \right)^2

Therefore, we can use the cost function to compare the two candidates. Computing it on the data above, we conclude that the second one is better.
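A quick check in Python (my own sketch, not part of the original notes) that evaluates C(W, B) for the two candidates on the table above:

def cost(W, B, xs, ys):
    # C(W, B) = sum over the data of (y_n - y_hat_n)^2
    return sum((y - (W * x + B)) ** 2 for x, y in zip(xs, ys))

xs = [1, 5, 6]
ys = [0, 16, 20]
print(cost(1, 0, xs, ys))  # 318 for y = 1x + 0
print(cost(2, 2, xs, ys))  # 68  for y = 2x + 2, the better of the two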

Optimizers
How to find the parameters w and b?
It’s not a simple question. We need to find a method to optimize our model.
In simple terms, we keep trying different values of w and b and keep whichever pair gives a lower cost, as in the sketch below.
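A minimal brute-force sketch of this idea (my own illustration, not from the original notes): scan a grid of candidate (w, b) values and keep the pair with the lowest cost. It reuses the cost() helper and the toy data from the sketch above; the grid range and step size are arbitrary choices.

best = None
for w10 in range(-50, 51):          # w from -5.0 to 5.0 in steps of 0.1
    for b10 in range(-50, 51):      # b from -5.0 to 5.0 in steps of 0.1
        w, b = w10 / 10.0, b10 / 10.0
        c = cost(w, b, xs, ys)      # squared-error cost on the toy data
        if best is None or c < best[0]:
            best = (c, w, b)
print(best)  # (0.0, 4.0, -4.0): y = 4x - 4 happens to fit this toy data exactly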

Gradients
Step size: we try different parameter values one by one, but what is the right amount to change each time?
Sometimes we will have millions of parameters, for example when working with images.

\underset{w,\, b \in [-\infty, \infty]}{\arg\min}\; C(w, b, \ldots)
Of course, you can make the step size small enough, but this also means you will spend more time adjusting the parameters.
In fact, we just use derivatives to solve the problem: we can substitute the data into them.
\frac{\partial C}{\partial w} = \frac{\partial \sum_n (y_n - \hat{y}_n)^2}{\partial w} = \sum_n -2\,(y_n - \hat{y}_n)\, x_n

\frac{\partial C}{\partial b} = \frac{\partial \sum_n (y_n - \hat{y}_n)^2}{\partial b} = \sum_n -2\,(y_n - \hat{y}_n)
In each iteration we substitute the value of the current parameters.
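As a minimal sketch of this loop (my own addition; the learning rate and iteration count are assumed values), plain gradient descent on the toy data using the two derivatives above:

xs = [1, 5, 6]
ys = [0, 16, 20]
w, b = 0.0, 0.0
lr = 0.01  # assumed step size
for _ in range(5000):
    # dC/dw = sum(-2 * (y - y_hat) * x), dC/db = sum(-2 * (y - y_hat))
    dw = sum(-2 * (y - (w * x + b)) * x for x, y in zip(xs, ys))
    db = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys))
    w, b = w - lr * dw, b - lr * db
print(w, b)  # approaches w = 4, b = -4, which fits this toy data exactly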
Code implementation (stochastic gradient descent with a decaying learning rate):

import numpy as np

def dJ_sgd(theta, X_b_i, y_i):
    # gradient of the squared error for a single sample (X_b_i, y_i)
    return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2.

def sgd(X_b, y, initial_theta, n_iters):
    t0 = 5
    t1 = 50

    def learning_rate(t):
        # decaying step size: larger at the start, smaller as iterations go on
        return t0 / (t + t1)

    theta = initial_theta
    for cur_iter in range(n_iters):
        # pick one random sample and step along its negative gradient
        rand_i = np.random.randint(len(X_b))
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
        theta = theta - learning_rate(cur_iter) * gradient
    return theta
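A possible way to call it on the toy data above (my own usage sketch; here X_b prepends a column of ones so that theta = [b, w]):

X_b = np.array([[1., 1.], [1., 5.], [1., 6.]])  # first column is a constant 1 for b
y = np.array([0., 16., 20.])
theta = sgd(X_b, y, initial_theta=np.zeros(2), n_iters=10000)
print(theta)  # should approach [b, w] = [-4, 4]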


Deep Learning

Nonlinear Neural Models
y = 2x is a very typical linear model. However, the problem is that most real-life phenomena are not linear. Adding an activation function is the approach that most people recognize.
For example:

  • sigmoid function
    S(t) = \frac{1}{1 + e^{-t}}
    We can build complex multilayer perceptrons just by using the sigmoid function.
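For reference, a small numpy version of the sigmoid (my own sketch, not from the notes):

import numpy as np

def sigmoid(t):
    # S(t) = 1 / (1 + e^(-t)); squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(0.0))                   # 0.5
print(sigmoid(np.array([-5., 5.])))   # approx [0.0067, 0.9933]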

Multilayer Perceptrons
In most cases, a single layer of perceptrons is not very satisfactory. Therefore, we often use multilayer perceptrons.
In my opinion, multilayer perceptrons are more like functions with multiple parameters. Most of the perceptrons we see look something like the sketch below.
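A minimal sketch of a two-layer perceptron as code (my own illustration; the layer sizes and random weights are made up), using the sigmoid from the previous section:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(x, W1, b1, W2, b2):
    # a two-layer perceptron: linear -> sigmoid -> linear
    h = sigmoid(W1.dot(x) + b1)  # hidden layer
    return W2.dot(h) + b2        # output layer

# hypothetical shapes: 3 inputs, 4 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
print(mlp_forward(np.array([1.0, 2.0, 3.0]), W1, b1, W2, b2))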
Don't think that one or two layers of perceptrons can't complete the task. However, this may cause some problems, such as overfitting and underfitting.
(figure: task complexity versus model complexity)
Obviously, the horizontal axis is model complexity and the vertical axis is task complexity. When you add more layers, you also get more explanatory power, but you run the risk of overfitting unless you have enough data.

In this case, bias plays a very important role. Regularization keeps our model from having to fit every characteristic of the data and becoming overly complicated; one common form is sketched below.
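One common form (my own addition; the notes do not say which regularizer is meant) is to add an L2 penalty to the cost, so that large weights are punished:

C_{\text{reg}}(W, b) = \sum_{n} \left( y_n - \hat{y}_n \right)^2 + \lambda W^2

where λ controls how strongly complexity is penalized.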

Using Discrete Variables
Discrete variables play a very important role in classification-style models. So how do we handle them?

We usually get the outputs as probabilities, one for each possible class.
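One common way to get such probabilities (my own addition; the notes do not name the method) is the softmax function over the model's output scores:

import numpy as np

def softmax(scores):
    # shift by the max for numerical stability, then normalize the exponentials
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx [0.66, 0.24, 0.10]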

Summary
We can use a graph to summarize the core idea (figure omitted).
End

Thanks
