A Little About Optimization

Optimization is important and useful in many fields, and no one can cover its complex mathematical concepts in such a short article. Here, however, I present some basic knowledge about optimization problems, especially unconstrained optimization. The content of this article was also greatly aided by Amos Storkey.

Unconstrained Optimization Problems

Unconstrained nonlinear optimization methods can be classified by what information they use:

1) If we use only zeroth-order information (function values), we can use basic search approaches such as bracketing.
2) If we use first-order information, we can use gradient methods such as steepest descent.
3) If we use first-order information to approximate the second order, we can use algorithms such as conjugate gradients or quasi-Newton methods.
4) If we use second-order information, we can use the Newton-Raphson approach.
5) We rarely use higher-order information, as the additional computational cost is not worth the accuracy gain.


Usually we get a better result by using higher-order information, so why not just use higher-order methods?

1) Computing higher-order information may be costly. Suppose the problem has N dimensions. Then at each point we have 1 function value, N first derivatives, and N^2 second derivatives. Computing and storing these values takes time and memory.
2) Using higher-order information may also be costly. For example, computing the inverse of the Hessian has computational complexity O(N^3).

To get the benefits of higher-order information at lower cost, we usually use lower-order approximation methods.

Suppose now we want to use first-order information to optimize some function. How can we utilize the first derivative to guide our search through the parameter space? The most common approach is to use the gradient to choose a direction, then move along this direction for one step. However, we then face the problem of choosing the length of the step. The line-search approach takes this problem into consideration; a code sketch follows the steps below.

1) Suppose we are at the position $\theta_t$.
2) We take the first derivative to obtain a direction $v$ to move in the parameter space, e.g. $v = -\nabla E(\theta_t)$.
3) Now we move along this direction, choosing the step length that minimizes the cost function.
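
To make this concrete, here is a minimal sketch in Python of gradient descent with a crude grid-based line search. The quadratic test function and the grid of candidate step sizes are my own choices for illustration, not part of the original description.

```python
import numpy as np

def line_search_gradient_descent(f, grad, theta0, n_steps=50, tol=1e-8):
    """Gradient descent where each step length is chosen by a crude line search."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        v = -grad(theta)                      # step 2: direction from the first derivative
        if np.linalg.norm(v) < tol:
            break
        # step 3: move along v, picking the step from a small grid of candidates
        # (a backtracking or exact line search could be used instead)
        etas = np.logspace(-4, 0, 20)
        losses = [f(theta + eta * v) for eta in etas]
        theta = theta + etas[int(np.argmin(losses))] * v
    return theta

# Toy quadratic example (chosen only for illustration)
H = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda th: b @ th + 0.5 * th @ H @ th
grad = lambda th: b + H @ th

print(line_search_gradient_descent(f, grad, theta0=[0.0, 0.0]))   # close to -H^{-1} b
```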



It seems that gradient descent with line search is clever enough to find the optimal solution. However, it has a problem: the path it follows often zig-zags through the space rather than heading straight for the minimum. So it occurs to us that we need some more information.

Now let us have a look at second-order information. How can we get this second-order information? We can use the Taylor expansion around the current point $\theta_t$:

$$E(\theta) \approx E(\theta_t) + (\theta-\theta_t)^\top \nabla E(\theta_t) + \tfrac{1}{2}\,(\theta-\theta_t)^\top H\,(\theta-\theta_t)$$
where the second-order information is the Hessian matrix:

$$H_{ij} = \left.\frac{\partial^2 E}{\partial \theta_i\,\partial \theta_j}\right|_{\theta_t}$$
If H is positive definite, this models the error surface as a quadratic bowl, as shown in the figure below.

[Figure: error surface modeled as a quadratic bowl]
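
To make the quadratic model concrete, here is a small numerical check that the second-order Taylor model matches the true function near $\theta_t$. The test function, its derivatives, and the expansion point are arbitrary choices for illustration.

```python
import numpy as np

# A toy smooth function with its gradient and Hessian (chosen for illustration).
E    = lambda th: np.exp(th[0]) + th[0] * th[1] + th[1] ** 2
grad = lambda th: np.array([np.exp(th[0]) + th[1], th[0] + 2.0 * th[1]])
hess = lambda th: np.array([[np.exp(th[0]), 1.0], [1.0, 2.0]])

theta_t = np.array([0.5, -1.0])

def taylor2(theta):
    """Second-order Taylor model of E around theta_t."""
    d = theta - theta_t
    return E(theta_t) + d @ grad(theta_t) + 0.5 * d @ hess(theta_t) @ d

theta = theta_t + np.array([0.05, -0.02])   # a nearby point
print(E(theta), taylor2(theta))             # the two values should be very close
```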


Now how can we use this information? We start from a simple problem: suppose we have a quadratic error function

$$E(\theta) = E_0 + b^\top \theta + \tfrac{1}{2}\,\theta^\top H \theta$$
We can obtain the answer directly by setting the gradient to zero:

$$\nabla E(\theta) = b + H\theta = 0 \quad\Longrightarrow\quad \theta^* = -H^{-1} b$$
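
As a concrete sketch of that direct answer, the solve can be done with a single linear solve instead of forming the inverse explicitly (the particular H and b below are toy values chosen for illustration):

```python
import numpy as np

# E(theta) = E0 + b^T theta + 0.5 * theta^T H theta, with H positive definite
H = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

# Setting the gradient b + H theta to zero gives theta* = -H^{-1} b.
# np.linalg.solve avoids forming the inverse, but is still O(N^3) in general.
theta_star = np.linalg.solve(H, -b)
print(theta_star)
```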
However, as mentioned above, computing the Hessian takes O(N^2) and computing its inverse takes O(N^3). Is there a cleverer way to do this?

One good solution is conjugate gradients. Suppose we are still working with the quadratic function above.

1) We can find a basis $V = [v_1, v_2, \dots, v_N]$ with respect to $H$ such that $V^\top H V$ is diagonal. Without loss of generality (rescaling each $v_i$), we can have

$$v_i^\top H v_j = \delta_{ij}$$
2) Now we can express $\theta$ in the new basis:

$$\theta = V\alpha = \sum_{i=1}^{N} \alpha_i v_i$$
3) Now the function becomes:

$$E(\theta) = E_0 + \sum_{i=1}^{N}\Big( (b^\top v_i)\,\alpha_i + \tfrac{1}{2}\alpha_i^2 \Big)$$
Now we have successfully decomposed the original problem into N independent one-dimensional subproblems. Each direction $v_i$ is conjugate to the others ($v_i^\top H v_j = 0$ for $i \neq j$), which means we have good directions to follow, and they can be built up iteratively without computing the Hessian explicitly.

So the conjugate gradients algorithm is roughly:
1) Pick a direction that is conjugate to the previous ones
2) Optimize along that direction with a line search

For a quadratic function it reaches the optimum in at most N such steps, and usually it gets near the optimum in only a few.
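
Below is a minimal sketch of the (linear) conjugate gradient iteration for the quadratic case; it builds each new direction to be conjugate to the previous ones using only Hessian-vector products. The stopping rule and the toy H and b are my own choices for illustration.

```python
import numpy as np

def conjugate_gradient(H, b, theta0=None, n_steps=None, tol=1e-10):
    """Minimize E(theta) = b^T theta + 0.5 theta^T H theta for symmetric
    positive-definite H, i.e. solve H theta = -b, using conjugate directions."""
    N = len(b)
    theta = np.zeros(N) if theta0 is None else np.asarray(theta0, float)
    r = -b - H @ theta          # negative gradient (residual)
    d = r.copy()                # first direction: steepest descent
    for _ in range(n_steps or N):
        if np.linalg.norm(r) < tol:
            break
        Hd = H @ d
        alpha = (r @ r) / (d @ Hd)      # exact line search along d
        theta = theta + alpha * d
        r_new = r - alpha * Hd
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d            # new direction, conjugate to the old ones
        r = r_new
    return theta

H = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
print(conjugate_gradient(H, b))          # matches -H^{-1} b after at most N steps
```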

Can this approach be adapted to general nonlinear functions? Usually we can still use this method, but we face the problem that the Hessian might not be positive definite everywhere. One approach is to add a scaled additive term (a multiple of the identity) to cope with the non-positive-definite matrix, and to use a dynamic step size instead of a line search during the search procedure.
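
One way to read "scaled additive term" is to add $\lambda I$ to the Hessian until it becomes positive definite. The sketch below is my own illustration of that idea (it tests positive definiteness with a Cholesky factorization), not the article's exact recipe.

```python
import numpy as np

def damped_newton_direction(grad, hess, lam0=1e-3, factor=10.0):
    """Newton-like direction where H is replaced by H + lam*I when H is not
    positive definite (lam is increased until a Cholesky factorization succeeds)."""
    lam = 0.0
    while True:
        try:
            np.linalg.cholesky(hess + lam * np.eye(len(hess)))   # succeeds iff PD
            break
        except np.linalg.LinAlgError:
            lam = lam0 if lam == 0.0 else lam * factor
    return np.linalg.solve(hess + lam * np.eye(len(hess)), -grad)

# Example: an indefinite Hessian (one negative eigenvalue), chosen for illustration
hess = np.array([[2.0, 0.0], [0.0, -1.0]])
grad = np.array([1.0, 1.0])
print(damped_newton_direction(grad, hess))
```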

A Little Extension


Now we have some basic ideas about unconstrained optimization problems. What if the problem is under some constraints?

One basic and very useful idea is to remove the constraints by reparametrization. Suppose we have the constraint $\theta > 0$. We can define $\theta$ as a function of a new parameter $\phi$, for example

$$\theta = e^{\phi}$$

Now $\phi$ is unconstrained, and we can optimize over it with any of the methods above.
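
As a sketch of how such a substitution looks in code, here is a toy objective with the constraint $\theta > 0$ handled by the $\theta = e^{\phi}$ substitution; the objective, the use of scipy's minimize, and the starting point are all arbitrary choices for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Constrained problem: minimize E(theta) = (log(theta) - 1)^2 subject to theta > 0.
E = lambda theta: (np.log(theta) - 1.0) ** 2

# Substitute theta = exp(phi); phi is now unconstrained.
E_unconstrained = lambda phi: E(np.exp(phi[0]))

res = minimize(E_unconstrained, x0=[0.0])   # any unconstrained optimizer works here
theta_opt = np.exp(res.x[0])
print(theta_opt)   # close to e, and theta > 0 is satisfied automatically
```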

Another way is to use dedicated constrained optimization methods, which I will not discuss in detail here. These methods can usually be classified into two groups: Linear Programming and Quadratic Programming.

Another question we might ask is how to find the global optimum. Most algorithms will not guarantee this; one simple approach to deal with local minima is to run the optimizer from several different initial points and keep the best result.
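
A minimal sketch of that restart strategy, using scipy's general-purpose minimizer on a toy function with several local minima (the function, the number of restarts, and the sampling range are arbitrary choices for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# A 1-D function with several local minima, chosen only for illustration.
f = lambda x: float(np.sin(3.0 * x[0]) + 0.1 * x[0] ** 2)

rng = np.random.default_rng(0)
results = [minimize(f, x0=rng.uniform(-5.0, 5.0, size=1)) for _ in range(10)]
best = min(results, key=lambda r: r.fun)   # keep the best of the restarts
print(best.x, best.fun)
```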
