Open Notes: Machine Learning Fundamentals (2: Perceptron, Logistic Regression)

by Max Z. C. Li (843995168@qq.com)

based on lecture notes of Prof. Kai-Wei Chang, UCLA 2018 winter CM 146 Intro. to M.L., with marks and comments (//,==>, words, etc.)

all graphs/pictures are from the lecture notes; I disavow the background ownership watermarks auto-added by csdn.

original acknowledgment: "The instructor gratefully acknowledges Eric Eaton (UPenn), who assembled the original slides, Jessica Wu (Harvey Mudd), David Kauchak (Pomona), Dan Roth (Upenn), Sriram Sankararaman (UCLA), whose slides are also heavily used, and the many others who made their course materials freely available online."

SL Algorithms

Linear Classifier (LC): Perceptron (06)

Linear Classifier:

the value of the label y is decided by a Linear Threshold Unit (LTU) with the following rule:

y = sgn(w ⋅ x + b), or sgn([w, b]ᵀ ⋅ [x, 1]) = sgn(w ⋅ x) in augmented form for convenience (this augmented w, x will be used in the following sections).

i.e.

"b" stands for bias term, and clearly when the result is 0, we make an educated guess by picking the most common value among {y} as the prediction. 

visualization:

Basic ideas:

The Perceptron is suitable for linear classification and learns from its mistakes (mistake-driven):

The goal is to find a separating hyperplane. For separable data, we are guaranteed to find one [hyperplane/classification scheme] ==> the algorithm will converge when the data is separable.

It is an online algorithm, i.e. it processes one example at a time.

Perceptron Algorithm (Rosenblatt 1958)

when we are dealing with binary classification, i.e. y is from {-1, 1}, we can further specify:
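A minimal sketch of the standard mistake-driven training loop for y in {-1, +1} (illustrative code, with the bias folded into an augmented w; eta is the learning rate discussed just below):

    import numpy as np

    def perceptron_train(X, y, epochs=10, eta=1.0):
        """Rosenblatt perceptron on augmented inputs (bias folded into w).

        X: (n_samples, n_features) array; y: labels in {-1, +1}.
        """
        X = np.hstack([X, np.ones((X.shape[0], 1))])  # augment: x -> [x, 1]
        w = np.zeros(X.shape[1])                      # so w -> [w, b]
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                if y_i * np.dot(w, x_i) <= 0:         # mistake (or on the boundary)
                    w += eta * y_i * x_i              # update only on mistakes
        return w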

limits: it learns only what it can represent ==> linear models only

it is shown (Minsky and Papert, 1969) that parity functions can't be learned (e.g. XOR); but we already know that XOR is not linearly separable.

we can control the learning rate (binary classification in the example) by:

Mistake Bound

Convergence:

Margin:

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

The margin of a data set (𝛾) is the maximum margin possible for that dataset using any weight vector.

Mistake Bound Theorem [Novikoff 1962, Block 1962]
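A common statement of the bound (a reconstruction; the slide's exact wording may differ):

    \|x_i\| \le R \ \ \forall i, \quad \exists\, w^*,\ \|w^*\| = 1 \ \text{with}\ y_i\,(w^* \cdot x_i) \ge \gamma \ \ \forall i
    \;\Longrightarrow\; \text{number of mistakes} \le \left(\frac{R}{\gamma}\right)^{2}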

easily proven by induction.

max possible mistakes = (data_radius / data_margin)**2

data_radius will be affected by the dimension of the data ==> e.g. R**2 = n for an n-dim Boolean function (with coordinates in {-1, +1}, ||x||**2 = n).

Advantage:

makes no assumption about the data distribution;

guaranteed convergence (a bounded number of mistakes) for linearly separable data; after convergence, no further data collection is needed and no further mistakes are made.

Disadvantage

linearly separable data is rare in real-world practice ==> craft/select features to make the dataset linearly separable if possible.

Variants

1. on finite data:

2. voting and averaging

Voted perceptron
Remember every weight vector in your sequence of updates. At final prediction time, each weight vector gets to vote on the label.

The number of votes a set of weights gets is the number of iterations it survived before being updated

Comes with strong theoretical guarantees about generalization, but is impractical because of storage issues.

Averaged perceptron (a DP-style, running-sum upgrade of voting)

Instead of using all weight vectors, use the average weight vector (i.e. longer-surviving weight vectors get more say)

A more practical alternative, and widely used

//more "accumulated" than "averaged" ==> we do not care about magnitude, only the sign, averaging is not necessary

3. Marginal Perceptron

obviously a larger margin of the hyperplane means that the model will generalize better.

==> find the maximal margin if possible by redefining the error as:
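A common way to write the redefined mistake condition (a sketch; γ > 0 is a chosen margin parameter, and the slide's exact formulation may differ): update on (x_i, y_i) whenever

    y_i\,(w \cdot x_i) < \gamma \qquad \text{(instead of } y_i\,(w \cdot x_i) \le 0\text{)}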

Online vs. Batch Learning

Batch ==> all data available at training time

Online ==> feed the learner one sample at a time

Different learning protocols:

The assumption & learning goal may different: E.g., batch learning assumes data are i.i.d (independent and identically distributed) while online learning may provide worst-case bound under adversarial data.

Computation
Space: online learning considers one instance at a time
Convergence rate: some fast-converging optimization methods require access to the entire dataset

LC: Logistic Regression

Basic Idea: 

when the data is not linearly separable ==> try to predict P(y=1|x) //a trained perceptron (on separable data) classifies every training example correctly; it does not give probability predictions.

binary H-space:

 

Expand the hypothesis space to functions whose output is in [0, 1] ==> because we want a probability ==> now the problem is a regression problem ==> but the output label y is still discrete-valued (-1 or 1).

Define a transformation function such that:

For the transformation we usually use a normalized exponential function, called the Sigmoid (or logistic) function:
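The standard form, together with the resulting hypothesis:

    \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad h_w(x) = \sigma(w \cdot x) = \frac{1}{1 + e^{-\,w \cdot x}}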

why?

and the modified H-space is:

the goal now is to find:

h_w(x) = P(y = 1 | x, w) ≈ P(y = 1 | x)

Interlude

To study an ML model thoroughly, we need to know:

  • the modeling
    • i/o
    • H-space
    • loss function (criteria of model fitting)
  • the algorithm
    • how to learn the model
    • how to predict with the model
  • the analysis
    • properties/guarantees of the model
    • comparison and connections to other models

//this is a recommended checklist, not necessarily an optimal study order; e.g. the loss function is usually closely tied to the training algorithm.

Prediction

predict probability

y = 1 ==> -z < 0 or wx > 0; y = -1 ==> -z > 0; guess the most common when z = 0;

Decision Boundary

Predict y = 1 if P(y=1|x, w) ≥ P(y=-1|x, w) ==> since P(y=1|x) = 1 - P(y=-1|x):

or use: log z ≥ 0 iff z ≥ 1:

anyway, even though the Sigmoid function itself is non-linear, the resulting decision boundary is linear:
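Spelling the derivation out (using the sigmoid form above):

    P(y{=}1 \mid x, w) \ge P(y{=}{-}1 \mid x, w)
    \iff \sigma(w \cdot x) \ge \tfrac{1}{2}
    \iff e^{-\,w \cdot x} \le 1
    \iff w \cdot x \ge 0

so the boundary w ⋅ x = 0 is a hyperplane.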

Maximum Likelihood (Training Preambles)

Likelihood (joint density)

using Bernoulli distribution as an example:

===> We want our learned function h_w(x) to be P(y=1|x) ==> which requires the weight set w to represent the inherent structure/distribution of the dataset ==> which can be achieved by choosing w that maximizes the likelihood of the data being as collected. ==> because we are assuming the true distribution is the most likely to generate the dataset. ==> which is not necessarily true, but reasonable. 

for this simple example, theta = k / n; but for most models it's unlikely that we would have a closed-form solution.
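For concreteness, with k successes observed in n i.i.d. Bernoulli(θ) trials:

    L(\theta) = \theta^{k}(1-\theta)^{\,n-k}, \qquad
    \frac{d}{d\theta} \log L(\theta) = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0
    \;\Longrightarrow\; \hat{\theta} = \frac{k}{n}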

Maximum Likelihood Estimator:

for logistic regression we have:

or the general training task:

An alternative representation, obtained by assigning y values in {0, 1} (or [0, 1]):
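Concretely, maximizing the log-likelihood is equivalent (up to sign) to the following minimizations, one per label convention:

    y_i \in \{-1, +1\}: \quad \min_{w} \; \sum_{i} \log\!\big(1 + e^{-y_i\, w \cdot x_i}\big)

    y_i \in \{0, 1\}: \quad \min_{w} \; -\sum_{i} \Big[\, y_i \log \sigma(w \cdot x_i) + (1 - y_i)\log\big(1 - \sigma(w \cdot x_i)\big) \Big]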

Training (Gradient Descent)(08)

Why: there is no closed-form solution ===> search iteratively, stepping in the right direction each time.

Convergence:

If the function is convex, it converges to the global optimum (given a proper choice of step size):

Algorithms:

𝜂 is often called the step size or learning rate -- how far our update goes along the direction of the negative gradient

Example:

==> calculate the gradient:

==> follow the iteration procedure:

With a suitable choice of 𝜂, the iterative procedure converges to a stationary point, where the gradient is zero.

Being a stationary point is only a necessary condition for a minimum (it could also be a maximum or a saddle point).
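A minimal sketch of batch gradient descent on the {0, 1}-label objective above (illustrative code; the fixed step size and gradient-norm stopping rule are simplifications):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_regression_gd(X, y, eta=0.1, n_steps=1000, tol=1e-6):
        """Batch gradient descent for logistic regression, y in {0, 1}.

        Gradient of the negative log-likelihood: sum_i (sigma(w.x_i) - y_i) x_i.
        """
        X = np.hstack([X, np.ones((X.shape[0], 1))])   # augment: fold bias into w
        w = np.zeros(X.shape[1])
        for _ in range(n_steps):
            grad = X.T @ (sigmoid(X @ w) - y)           # gradient of the loss
            w -= eta * grad                             # step along the negative gradient
            if np.linalg.norm(grad) < tol:              # (approximate) stationary point
                break
        return w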

//we'll keep working on the training method later, but for now the LR section is over.

Epilogue: Convex Functions (08)

to mathematically determine if a function is convex or not:

for multivariate functions, use its Hessian matrix:
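The standard second-order conditions:

    \text{single variable: } f''(x) \ge 0 \ \ \forall x
    \qquad\qquad
    \text{multivariate: } \nabla^2 f(x) \succeq 0 \ \ \forall x \ \ \big(\text{i.e. } v^{\top} \nabla^2 f(x)\, v \ge 0 \text{ for all } v\big)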

Convex Optimization Methods:

Gradient descent
Stochastic (sub)-gradient descent
Coordinate descent methods
Newton methods
LBFGS
