Andrew Ng Machine Learning Course Notes -- Sequential Minimal Optimization (SMO)

Kernels: maybe this is the living area of a house that you are trying to make a prediction on, like whether it will be sold in the next six months. Quite often, we will take this feature x and map it to a richer set of features. For example, we will take x and map it to four polynomial features, and let me call this mapping phi, so phi(x) denotes the mapping from your original features to some higher-dimensional set of features. If you want to use the features phi(x), then all you need to do is go back to the learning algorithm and, everywhere you see the inner product between x_i and x_j, replace it with the inner product between phi(x_i) and phi(x_j). This corresponds to running a support vector machine with the features given by phi(x) rather than with your original one-dimensional input feature x.

In the scenario I want to consider, phi(x) will sometimes be very high dimensional; for example, phi(x) can contain very high degree polynomial features, and sometimes phi(x) will even be an infinite-dimensional vector of features. The question is: if phi(x) is extremely high dimensional, it seems you can't compute these inner products efficiently, because the computer would need to represent an extremely high dimensional feature vector and then take the inner product, which is inefficient. It turns out that in many important special cases we can write down what we will call the kernel function, denoted by K, which is the inner product between those feature vectors: K(x, z) = phi(x)^T phi(z). There will be important special cases where computing phi(x) itself is very expensive, or even impossible (an infinite-dimensional vector cannot be represented explicitly), but the kernel can nonetheless be computed cheaply.

Let's say you have two inputs, x and z. Normally I should write these as x_i and x_j, but I am going to write x and z to save on writing. Let's say my kernel is K(x, z) = (x^T z)^2. Expanding the square, (x^T z)^2 = (sum_i x_i z_i)(sum_j x_j z_j) = sum_{i,j} (x_i x_j)(z_i z_j). So this kernel corresponds to the feature mapping where phi(x) is the vector of all products x_i x_j, written out here for the case n = 3. You can verify for yourself that this becomes the inner product between phi(x) and phi(z): the inner product of two vectors is the sum over the elements of the first vector times the corresponding elements of the second, and what you get is exactly the expression above.

The cool thing is that you never need to form phi(x) explicitly. If n is the dimension of x and z, then phi(x) is the vector of all pairs x_i x_j, so its length is n^2 and you would need order n^2 time just to compute phi(x). But to compute the kernel function K, all you need is order n time, because the kernel is defined as (x^T z)^2: you take the inner product between x and z, which is order n time, and you square it. So you have computed the inner product between two vectors that each have n^2 elements, but you did it in order n time.
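To make the n^2-versus-n point concrete, here is a small numerical check (a sketch, not from the notes), assuming phi(x) is the vector of all pairwise products x_i * x_j as described above:

```python
# Compare the explicit feature map against the kernel trick for K(x, z) = (x^T z)^2.
import numpy as np

def phi(x):
    """Explicit O(n^2) feature map: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

def poly_kernel(x, z):
    """The same inner product computed in O(n) time via the kernel trick."""
    return np.dot(x, z) ** 2

rng = np.random.default_rng(0)
x, z = rng.standard_normal(3), rng.standard_normal(3)

# Both numbers agree, but the kernel never materializes the n^2-dimensional vector.
print(np.dot(phi(x), phi(z)))   # explicit inner product in R^{n^2}
print(poly_kernel(x, z))        # (x^T z)^2
```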

Generalizations: if you define K(x, z) = (x^T z + c)^2 -- again, you can compute this kernel in order n time -- then that turns out to correspond to a feature vector where you add a few more elements at the bottom: for n = 3, you add sqrt(2c)*x_1, sqrt(2c)*x_2, sqrt(2c)*x_3, and c. So this is a way of creating a feature vector with both the monomials, meaning the first-order terms, as well as the quadratic terms x_i x_j, and the parameter c allows you to control the relative weighting between the monomial (first-order) terms and the quadratic terms. Again, this is still an inner product between vectors of length on the order of n^2, computed in order n time.

More generally, here are some other examples of kernels. A generalization of the one I just derived is the kernel K(x, z) = (x^T z + c)^d. This corresponds to using all (n + d) choose d monomial features -- monomials just mean products like x_i x_j x_k, i.e. all the polynomial terms up to degree d -- so on the order of (n + d)^d features, which grows exponentially in d. This is a very high dimensional feature vector, but again you can implicitly construct the feature vector and take inner products between such vectors very efficiently: you just compute the inner product between x and z, add c, and raise that real number to the power d. By plugging this in as a kernel, you are implicitly working in an extremely high dimensional feature space.

What I have given so far is just a few specific examples of how to create kernels. Let me ask more generally: if you are faced with a new machine learning problem, how do you come up with a kernel? There are many ways to think about it, but here is one intuition that is sort of useful. Given a set of attributes x, you are going to use a feature vector phi(x), and given a set of attributes z, you are going to use a feature vector phi(z), and the kernel computes the inner product between phi(x) and phi(z). So one intuition -- a partial intuition, not a rigorous one -- is that if x and z are very similar, then phi(x) and phi(z) will be pointing in roughly the same direction, and therefore the inner product will be large; whereas if x and z are very dissimilar, then phi(x) and phi(z) may be pointing in different directions, and the inner product may be small. That intuition is not rigorous, but it is a useful one to keep in mind. So if you are faced with a new learning problem -- if I give you some random thing to classify -- one way to come up with a kernel is to try to come up with a function K(x, z) that is large when you want the learning algorithm to think of x and z as similar, and small when you want it to think of them as dissimilar. Again, this isn't always true, but it is one of several useful intuitions. This raises a question: say I have something I want to classify, and I write down a function that I think is a good measure of how similar or dissimilar x and z are for my specific problem -- say K(x, z) = exp(-||x - z||^2 / (2*sigma^2)), the Gaussian kernel -- and I believe this is a good measure of how similar x and z are.
Does there really exist a phi such that K(x, z) is equal to the inner product phi(x)^T phi(z)? It turns out that there is a result that characterizes necessary and sufficient conditions for when a function you might choose is a valid kernel, and I will show part of that result now. Suppose K is a valid kernel -- by which I mean there does indeed exist some function phi for which K(x, z) = phi(x)^T phi(z) holds. Then let any set of points x_1, ..., x_m be given, and define the kernel matrix K to be the m-by-m matrix with K_ij equal to the kernel function applied to two of my examples, K(x_i, x_j). Now consider z^T K z for any vector z. By the definition of matrix multiplication, z^T K z = sum_i sum_j z_i K_ij z_j, and since K_ij is the kernel between x_i and x_j -- that is, if K is a valid kernel, K_ij = phi(x_i)^T phi(x_j) -- this equals || sum_i z_i phi(x_i) ||^2, which is greater than or equal to zero. So if K is a valid kernel, we have shown that the kernel matrix must be positive semidefinite. It turns out the converse also holds, and so this gives you a test for whether a function K is a valid kernel. This is a theorem due to Mercer, and so kernels are also sometimes called Mercer kernels.
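As a rough empirical illustration of Mercer's condition (not a proof), one can build the kernel matrix on an arbitrary finite set of points and confirm it is positive semidefinite; the Gaussian kernel and the random point set below are just made-up examples:

```python
# Check that a kernel matrix built from a valid kernel is positive semidefinite.
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))                       # 20 arbitrary points in R^5
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(K)                        # K is symmetric
print(eigvals.min() >= -1e-10)                         # True: PSD, as Mercer requires
```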

Application: to apply a support vector machine with a kernel, you choose one of these functions, and the choice depends on your problem -- it depends on what is a good measure of whether two examples are similar or different for your problem. Then, everywhere you see the inner products ⟨x_i, x_j⟩, you replace them with K(x_i, x_j), and you run exactly the same support vector machine algorithm. What you have just done is take a support vector machine and implicitly replace each of your feature vectors x with a very high dimensional feature vector. It turns out that the Gaussian kernel corresponds to a feature vector that is infinite dimensional. Nonetheless, you can run a support vector machine in a finite amount of time, even though you are working with infinite-dimensional feature vectors, because all you ever need to do is compute these kernel values; you never need to represent the infinite-dimensional feature vectors explicitly.

I said at the start that we wanted to develop non-linear learning algorithms, so here is one useful picture to keep in mind. Let's say your original data is one-dimensional input data that is not linearly separable. The kernel maps it into an infinite-dimensional (or exponentially high dimensional) space; you then run the SVM in that space and find the optimal margin classifier -- a linear classifier in that space -- even though the data is not really separable in your original space. One way to choose the kernel parameter sigma is to set aside a small amount of your data: train an SVM using, say, two thirds of your data for several different values of sigma, and then see which value works best on a separate hold-out cross-validation set, a separate set that you test on.
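A hedged sketch of the hold-out procedure just described, assuming scikit-learn's SVC as the SVM implementation; the mapping gamma = 1 / (2 * sigma^2) translates the Gaussian-kernel sigma into sklearn's parameterization, and the candidate sigma values are arbitrary:

```python
# Choose sigma for the Gaussian kernel by hold-out validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def choose_sigma(X, y, sigmas=(0.1, 0.3, 1.0, 3.0, 10.0)):
    # Hold out one third of the data, train on the remaining two thirds.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1/3, random_state=0)
    best_sigma, best_acc = None, -1.0
    for sigma in sigmas:
        clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=1.0)
        clf.fit(X_tr, y_tr)
        acc = clf.score(X_val, y_val)   # accuracy on the hold-out set
        if acc > best_acc:
            best_sigma, best_acc = sigma, acc
    return best_sigma, best_acc
```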

Kernels beyond the SVM: it turns out that the idea of kernels is actually more general than the support vector machine. In particular, we took the SVM algorithm and derived a dual, and that was what let us write the entire algorithm in terms of inner products of the inputs. It turns out that you can take many of the other algorithms you have seen in this class -- in fact, most of the linear algorithms, such as linear regression and logistic regression -- and rewrite them entirely in terms of these inner products. That means you can replace the inner products with K(x_i, x_j), so you can take any of these algorithms, implicitly map the feature vectors into very high dimensional feature spaces, and have the algorithm still work. The idea of kernels is perhaps most widely used with SVMs, but it is actually more general than that: you can take many of the algorithms you have seen, and many of the algorithms we will see later this quarter, write them in terms of inner products, thereby kernelize them, and apply them to infinite-dimensional feature spaces.
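As one illustration of kernelizing a linear algorithm other than the SVM, here is a minimal sketch of kernel ridge regression -- regularized linear regression rewritten entirely in terms of inner products. This particular example is mine, not from the notes:

```python
# Kernel ridge regression: the fit and the prediction only ever touch K(x_i, x_j).
import numpy as np

def fit_kernel_ridge(K, y, lam=1e-2):
    """Solve alpha = (K + lam*I)^{-1} y, where K_ij = K(x_i, x_j) on the training set."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(alpha, k_new):
    """f(x) = sum_i alpha_i * K(x_i, x); k_new holds K(x_i, x) for each training x_i."""
    return k_new @ alpha
```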

L1-norm soft margin SVM: let's say I have a data set. This is a linearly separable data set, but what do I do if I have a couple of other examples there that make the data non-linearly separable? When you derive the dual of this optimization problem and simplify, you have to maximize W(alpha), which is actually the same objective as before. So it turns out that when you derive the dual and simplify, the only way the dual changes compared to the previous one is that rather than the constraint that the alphas are greater than or equal to zero, we now have the constraint that the alphas are between zero and C.
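For reference, the L1-norm soft-margin dual with this box constraint is standardly written as:

```latex
\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{m} \alpha_i
  - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)}\rangle
\qquad \text{s.t. } 0 \le \alpha_i \le C, \quad \sum_{i=1}^{m}\alpha_i y^{(i)} = 0 .
```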

KKT: the necessary conditions for something to be an optimal solution to a constrained optimization problem. From them you can actually derive convergence conditions for the optimization problem we want to solve: when do you know that the alphas have converged to the global optimum?
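Concretely, the convergence test alluded to here is the standard set of KKT dual-complementarity conditions, checked to within some numerical tolerance:

```latex
\alpha_i = 0 \ \Rightarrow\ y^{(i)}\big(w^T x^{(i)} + b\big) \ge 1, \qquad
\alpha_i = C \ \Rightarrow\ y^{(i)}\big(w^T x^{(i)} + b\big) \le 1, \qquad
0 < \alpha_i < C \ \Rightarrow\ y^{(i)}\big(w^T x^{(i)} + b\big) = 1 .
```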

SMO: an algorithm for actually solving this optimization problem. We wrote down the dual optimization problem together with a convergence criterion, so now let's come up with an efficient algorithm. We will change two alphas at a time. The algorithm is called sequential minimal optimization: "minimal" refers to the fact that we choose the smallest number of alpha_i's to change at a time, which in this case means we need to change at least two at a time (changing a single alpha would violate the constraint that sum_i alpha_i y_i = 0). To derive the step where we update with respect to alpha_i and alpha_j: when you minimize the resulting quadratic function, maybe you get a value that lies inside the box, and if so, you are done. But maybe when you optimize the quadratic function you end up with a value outside the box. If that happens, you clip your solution to map it back inside the box.
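A hedged sketch of the clipping step described above; the box endpoints L and H follow from 0 <= alpha <= C together with the equality constraint linking alpha_i and alpha_j, and the function and variable names here are illustrative rather than taken from any particular implementation:

```python
def clip_alpha_j(alpha_j_new, alpha_i, alpha_j, y_i, y_j, C):
    """Clip the unconstrained optimum for alpha_j back onto the feasible segment [L, H]."""
    # Box endpoints for alpha_j given the equality constraint with alpha_i.
    if y_i != y_j:
        L = max(0.0, alpha_j - alpha_i)
        H = min(C, C + alpha_j - alpha_i)
    else:
        L = max(0.0, alpha_i + alpha_j - C)
        H = min(C, alpha_i + alpha_j)
    # If the quadratic's minimizer lies in the box, it is returned unchanged;
    # otherwise it is mapped back to the nearest endpoint.
    return min(H, max(L, alpha_j_new))
```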

Coordinate ascent: we want to maximize some function W of alpha_1 through alpha_m. Coordinate ascent loops over i from 1 to m and, holding all the parameters except alpha_i fixed, maximizes the function with respect to just that one parameter. Coordinate ascent may take many more iterations than, say, Newton's method in order to converge, but sometimes the objective W is very inexpensive to optimize with respect to any one of your parameters, so each iteration is cheap. It turns out there are many optimization problems for which it is particularly easy to fix all but one of the parameters and optimize with respect to just that one parameter; when that is true, the inner loop of coordinate ascent -- optimizing with respect to a single alpha_i -- can be done very quickly. It turns out this will be true when we modify the algorithm to solve the SVM optimization problem.
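A generic sketch of the coordinate ascent loop; W and argmax_one_coordinate are placeholders for the objective and its one-dimensional maximizer (for the SVM dual, SMO replaces this inner step with the two-variable update discussed above):

```python
import numpy as np

def coordinate_ascent(W, argmax_one_coordinate, alpha, n_iters=100):
    """Repeatedly maximize W over one coordinate at a time, holding the others fixed."""
    alpha = np.array(alpha, dtype=float)
    for _ in range(n_iters):
        for i in range(len(alpha)):
            # Hold every coordinate except alpha[i] fixed and maximize W over alpha[i].
            alpha[i] = argmax_one_coordinate(W, alpha, i)
    return alpha
```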

Handwritten digit recognition: it turns out that either the polynomial kernel or the Gaussian kernel works fine for this problem, and just by writing down such a kernel and throwing an SVM at it, the SVM gave performance comparable to the very best neural networks. This is surprising, because the SVM does not take into account any knowledge about the layout of the pixels; in particular, it does not know that one pixel is next to another pixel, because it just represents the pixel intensity values as a vector.

Dynamic programming algorithm:
