Andrew Ng Machine Learning Course Notes -- Naive Bayes Algorithm

Naive Bayes: in this model, each of our features is zero or one, indicating whether a particular word appears, and the length n of the feature vector is the number of words in the dictionary.

The Multivariate Bernoulli Event Model: the name refers to the fact that there are multiple Bernoulli random variables, one for each word in the dictionary.
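As a minimal sketch (the toy dictionary and email below are invented for illustration, not taken from the lecture), this is how such a 0/1 feature vector might be constructed:

```python
# Minimal sketch (assumed toy data): build a multivariate-Bernoulli feature
# vector for one email, with x[j] = 1 iff dictionary word j appears in it.
dictionary = ["a", "buy", "cheap", "hello", "meeting", "now"]  # toy dictionary

def bernoulli_features(email_words, dictionary):
    words = set(email_words)
    return [1 if w in words else 0 for w in dictionary]

email = ["buy", "cheap", "buy", "now"]          # toy email
print(bernoulli_features(email, dictionary))    # [0, 1, 1, 0, 0, 1]
```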

The Multinomial Event Model: my i-th training example x(i) will be a feature vector (x(i)_1, x(i)_2, ..., x(i)_{n_i}), where n_i is equal to the number of words in that email, and each element of the feature vector is an index into my dictionary. So in this second model, n_i is the number of words in a given email, which means n will be different for different training examples, and each x(i)_j takes a value from 1 to 50,000: x(i)_j is essentially the identity of the j-th word in that piece of email. It turns out that for text classification, the Naive Bayes algorithm with this second event model almost always does better than the first Naive Bayes model I talked about, when you apply it to the specific case of text classification. It does not care about the ordering of the words: you can shuffle all the words in the email and it does exactly the same thing, so in natural language processing it is called a unigram model. There are many other models in natural language processing, like higher-order Markov models (for example bigram or trigram models) that take into account some of the ordering of the words, but for this task, I believe, they do only very slightly better.
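A minimal sketch (the vocabulary size, emails, and labels below are toy values assumed for illustration) of the multinomial representation and the maximum-likelihood word probabilities per class:

```python
# Minimal sketch (assumed toy data): in the multinomial event model an email is
# a sequence of word indices into the dictionary, so its length n_i varies.
import numpy as np

V = 6  # toy vocabulary size (the lecture uses 50,000)

# Each email is a list of word indices in [0, V); labels: 1 = spam, 0 = not spam.
emails = [[1, 2, 1, 5], [3, 4, 0], [2, 2, 5], [3, 0, 4, 4]]
labels = [1, 0, 1, 0]

def word_counts(email, V):
    counts = np.zeros(V)
    for idx in email:
        counts[idx] += 1
    return counts

# Maximum-likelihood estimate of p(word = k | class = c): occurrences of word k
# in class-c emails divided by the total number of words in class-c emails.
phi = np.zeros((2, V))
for email, y in zip(emails, labels):
    phi[y] += word_counts(email, V)
phi /= phi.sum(axis=1, keepdims=True)
print(phi)
```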

Laplace smoothing: a method that gives you better estimates of the probability distribution over a multinomial; by adding one to every count, it avoids estimating an outcome you have never seen as having probability zero.
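A minimal sketch (the counts below are invented) of add-one smoothing for a multinomial with k outcomes:

```python
# Minimal sketch (assumed toy counts): Laplace (add-one) smoothing for a
# multinomial over k outcomes: p_hat[j] = (count[j] + 1) / (total + k).
def laplace_estimate(counts):
    k = len(counts)
    total = sum(counts)
    return [(c + 1) / (total + k) for c in counts]

counts = [3, 0, 2, 0]            # the second and fourth outcomes were never observed
print(laplace_estimate(counts))  # [0.444..., 0.111..., 0.333..., 0.111...]
```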

Non-linear classifiers:

Neural network: take my features here and feed them to, say, a few of these little sigmoid units, and these together feed into yet another sigmoid unit, which outputs my final hypothesis h_theta(x). Just to give these names, let me call the values output by these three intermediate sigmoid units a1, a2, a3. The value a1 will be computed as g(x^T theta1) for some set of parameters theta1, and similarly a2 will be computed as g(x^T theta2) (and a3 as g(x^T theta3)), where g is the sigmoid function; the final hypothesis will output g(a^T theta4). One way to learn the parameters of an algorithm like this is to just use gradient descent to minimize J(theta) as a function of theta. A nice property of the neural network is that you can look at what these intermediate nodes are computing; this neural network has what is called a hidden layer before you then have the output layer.
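A minimal sketch (the feature dimension and parameters are random placeholders, and the output parameter is written theta_out rather than the lecture's theta4) of this forward computation:

```python
# Minimal sketch (assumed shapes, random parameters): one hidden layer of three
# sigmoid units a1, a2, a3, followed by a single sigmoid output unit.
import numpy as np

def g(z):
    # Sigmoid function g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta_hidden, theta_out):
    a = g(Theta_hidden @ x)   # hidden activations a_j = g(x^T theta_j)
    return g(theta_out @ a)   # final output h_theta(x) = g(a^T theta_out)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                  # toy 4-dimensional feature vector
Theta_hidden = rng.normal(size=(3, 4))  # rows are theta1, theta2, theta3
theta_out = rng.normal(size=3)          # parameters of the output unit
print(forward(x, Theta_hidden, theta_out))
```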

Other notes: it turns out that a quadratic cost function like the one I wrote down on the chalkboard just now, unlike logistic regression, almost always corresponds to a non-convex optimization problem. Whereas for logistic regression, if you run gradient descent or Newton's method or whatever, you converge to the global optimum, this is not true for neural networks: in general there are lots of local optima, and it is, sort of, a much harder optimization problem.

LeNet: a notable property of the network is its robustness to noise.

Support vector machine: I'm going to say that the functional margin of a hyperplane (w, b) with respect to a specific training example (x(i), y(i)) is defined as gamma-hat_i = y(i) (w^T x(i) + b). The hyperplane defines a linear separating boundary, so when I say hyperplane, I just mean the decision boundary that is defined by the parameters w, b. The functional margin of the whole training set, gamma-hat, is equal to the minimum over all your training examples of gamma-hat_i. If you add a normalization condition that the norm of the parameter w is equal to one, then the functional margin is equal to the geometric margin; in general, the geometric margin is just equal to the functional margin divided by the norm of w.
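A minimal sketch (the parameters and the three training examples are toy values assumed for illustration) of computing these margins:

```python
# Minimal sketch (assumed toy data): functional and geometric margins for a
# hyperplane (w, b) and labeled examples (x_i, y_i) with y_i in {-1, +1}.
import numpy as np

def functional_margins(w, b, X, y):
    # gamma_hat_i = y_i * (w^T x_i + b) for every training example
    return y * (X @ w + b)

w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
y = np.array([1, -1, -1])

gamma_hat_i = functional_margins(w, b, X, y)
gamma_hat = gamma_hat_i.min()          # functional margin of the training set
gamma = gamma_hat / np.linalg.norm(w)  # geometric margin = gamma_hat / ||w||
print(gamma_hat_i, gamma_hat, gamma)
```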

 
