machine learning

Z_shsf

Categories: machine learning, signal processing. Tags: machine learning algorithms

Article link: https://blog.csdn.net/ZSZ_shsf/article/details/48752761

Key points

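If the target variable being predicted is continuous, we call it a regression problem; if it is discrete, we call it a classification problem.
The biggest difference between supervised and unsupervised learning is whether the dataset comes with "standard" answers for the machine to learn from. Take judging whether a tumor is benign or malignant: if the answers are given (which data are benign and which are malignant), the resulting algorithm has been trained by supervised learning; if we are only given a few piles of data without being told which pile is benign and which is malignant, that is unsupervised learning!
The beauty of mathematics shows through vividly in machine learning. How much data does it take to train an algorithm just well enough? Too much data wastes effort and computing power; too little leaves the learning incomplete. This is a numerical computation problem, though it also has to be verified in practice afterwards. Deriving models depends on mathematics too: math, especially probability, is very important, as is knowledge of data structures such as queues and linked lists. All of this has deeply influenced me! The beauty of ML gets my blood pumping; for the first time I feel I simply have to do this well. It has a really strong pull, at least for me! Applying ML to real life, especially to the parts of research I am interested in, is the most needed and most meaningful part!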

main parts

linear regression
gradient descent
normal equations

notations

m = number of training examples
x = input variables/features
y = output variable
(x, y) = a training example
$(x^{(i)}, y^{(i)})$ = the i-th training example
n = the number of features
theta = parameters
h = hypothesis: takes an input and maps it to an output
the 1/2 in the cost function J(theta) is just there to simplify the math (it cancels when the square is differentiated)
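
To make the notation concrete, here is a minimal sketch of the linear hypothesis and the least-squares cost. The function names `hypothesis` and `cost`, the toy data, and the convention of an intercept column $x_0 = 1$ are assumptions for illustration, not something the notes spell out.

```python
import numpy as np

def hypothesis(theta, X):
    """Linear hypothesis h_theta(x) = theta^T x, evaluated for every row of X."""
    return X @ theta

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2."""
    residual = hypothesis(theta, X) - y
    return 0.5 * float(residual @ residual)

# Hypothetical toy data: m = 3 training examples, n = 1 feature,
# plus an intercept column x_0 = 1 (an assumed convention).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

print(cost(np.array([0.0, 2.0]), X, y))  # theta that fits y = 2x exactly
print(cost(np.zeros(2), X, y))           # theta = 0 leaves every residual untouched
```

The factor 0.5 in `cost` is the 1/2 from the notes: it does not change where the minimum is, it only makes the gradient come out without a stray factor of 2.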

Two ways to check convergence

1, search algorithm (gradient descent): start with some value of theta and keep changing the parameters to reduce J(theta) a little at each step. A common issue is local optima: where you stop depends on where you start.
a := b overwrites a with the value of b (read as "colon equals"); a = b just asserts that a is equal to b.
There is a parameter alpha that controls how long each step is: if it is too large you may overshoot the optimum, and if it is too small convergence is slow. One nice property is that as you get close to a local optimum the gradient becomes smaller and smaller, so the steps automatically become smaller too.
The least-squares cost has only one optimum (the global one); local optima show up in other, more complex situations.
Testing convergence: 1, compare two successive iterations and see whether theta has changed a lot;
2, look at J(theta) and judge whether it is still changing much. These are standard rules of thumb.
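
The points above can be sketched as a small batch gradient descent loop. This is only an illustration under assumptions: the function name `batch_gradient_descent`, the step size `alpha=0.05`, the tolerance `tol`, and the toy data are all made up here, and the stopping rule is the "J(theta) is no longer changing much" test.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, tol=1e-10, max_iters=100_000):
    """Minimize J(theta) = 1/2 * sum (X @ theta - y)^2 with batch gradient descent."""
    theta = np.zeros(X.shape[1])          # starting point (an arbitrary choice)
    prev_cost = np.inf
    for it in range(max_iters):
        residual = X @ theta - y
        cost = 0.5 * float(residual @ residual)
        if abs(prev_cost - cost) < tol:   # convergence test: J(theta) barely changed
            break
        prev_cost = cost
        # theta := theta - alpha * gradient of J, using all m examples at once
        theta = theta - alpha * (X.T @ residual)
    return theta, it

# Hypothetical noise-free data generated from y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta, iters = batch_gradient_descent(X, y)
print(theta, iters)   # theta should end up close to [1, 2]
```

With a much larger `alpha` the same loop overshoots and diverges on this data, which is the behaviour the note about step length describes.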
A more commonly used alternative for very large datasets is stochastic gradient descent (also called incremental gradient descent): repeat { ... }, using only the first example to update theta, then taking the new theta as the old one and updating it again with the next example, and so on.
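
A minimal sketch of that per-example loop follows. The notes only say "repeat {}", so the concrete update used here, theta := theta + alpha * (y_i - h(x_i)) * x_i (the least-mean-squares rule for a linear hypothesis), as well as the names, `alpha`, `epochs`, and the toy data, are assumptions for illustration.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, epochs=500):
    """Update theta one training example at a time (incremental gradient descent)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):                  # "repeat { ... }"
        for x_i, y_i in zip(X, y):           # one example per update
            error = y_i - x_i @ theta        # y_i - h_theta(x_i)
            theta = theta + alpha * error * x_i
    return theta

# Hypothetical noise-free data generated from y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = stochastic_gradient_descent(X, y)
print(theta)   # should end up close to [1, 2]
```

Each update touches a single row of `X`, so theta starts improving before the whole dataset has been seen; on noisy data theta would keep wandering around the minimum instead of settling exactly.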
new notation: derivatives with respect to a matrix, so cool!
probabilistic interpretation

Least squares

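An important reason for choosing least squares is probabilistic: the squared quantity in least squares plays the role of a variance, and the smaller the squared difference between two values, the closer they are.
target variable = theta * input variable + sigma;
Here sigma is the error term, which captures all the features left out of the model as well as random noise. It is usually modeled with a Gaussian probability density, and the error terms are assumed to be IID (independently and identically distributed). From this we derive the distribution of the target variable (via target variable - theta * input variable), that is, the distribution of the output for a given input, parameterized by theta. We then define the likelihood function, which is a product over the examples because of the IID assumption. Since the logarithm is monotonic, we take the log of the likelihood for mathematical convenience and maximize it over the parameters. The result shows that, under the assumptions made about the data, least-squares regression coincides with maximum likelihood estimation of the parameters, which indirectly reflects how broadly applicable it is.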
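
The equivalence can also be checked numerically. The sketch below is an illustration under assumptions: the synthetic data, the function name `gaussian_log_likelihood`, and the noise standard deviation `noise_std` are made up here, and theta is obtained from the normal equations $X^TX\theta = X^Ty$. It then compares the Gaussian log-likelihood at the least-squares solution with the log-likelihood at randomly perturbed parameters.

```python
import numpy as np

def gaussian_log_likelihood(theta, X, y, noise_std=1.0):
    """Log-likelihood of y given X when the error term is IID Gaussian."""
    residual = y - X @ theta
    m = len(y)
    return (-m * np.log(np.sqrt(2 * np.pi) * noise_std)
            - float(residual @ residual) / (2 * noise_std ** 2))

rng = np.random.default_rng(0)

# Hypothetical data: y = 1 + 2x plus Gaussian noise.
X = np.column_stack([np.ones(50), rng.uniform(0, 10, size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1.0, size=50)

# Least-squares solution from the normal equations (X^T X) theta = X^T y.
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# No perturbed theta should reach a higher likelihood than theta_ls.
best = gaussian_log_likelihood(theta_ls, X, y)
perturbed = [gaussian_log_likelihood(theta_ls + d, X, y)
             for d in rng.normal(0, 0.1, size=(100, 2))]
print(all(p <= best for p in perturbed))
```

The only part of the log-likelihood that depends on theta is the negated sum of squared residuals, so maximizing one is the same as minimizing the other, whatever value `noise_std` takes.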
locally weighted linear regression

A few concepts explained

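underfitting: the model fails to capture the structure of the data;
overfitting: the model fits the training data very well but is of little practical use, because it cannot predict well.
parametric learning algorithm: has a fixed number of parameters, which are adjusted to reach the objective;
non-parametric learning algorithm: the number of parameters is not fixed and changes with the dataset;
For example, the linear regression algorithm above adjusts theta to minimize the cost function. In LWR, however, each term of the cost function is multiplied by a weight, so some examples may receive very small weights and can effectively be ignored; in other words, the number of parameters that matter changes and is not fixed. The weight function chosen here is the exponential of the negative squared distance between a training example and the query point, divided by two times the squared bandwidth parameter τ (which determines how quickly the weight falls off with distance):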

$w^{(i)} = \exp\left(-\frac{(x^{(i)}-x)^2}{2\tau^2}\right)$
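
A minimal sketch of how these weights might be used for a prediction. Assumptions: a single feature plus an intercept column, a weighted least-squares fit solved in closed form for each query point, and made-up names (`lwr_predict`), bandwidth `tau=0.5`, and data.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Predict at x_query by fitting a weighted least-squares line around it."""
    # w_i = exp(-(x_i - x)^2 / (2 tau^2)); column 1 of X holds the feature.
    weights = np.exp(-((X[:, 1] - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(weights)
    # Weighted normal equations: (X^T W X) theta = X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return float(np.array([1.0, x_query]) @ theta)

# Hypothetical non-linear data that a single global line would underfit.
x = np.linspace(0, 6, 40)
X = np.column_stack([np.ones_like(x), x])
y = np.sin(x)

print(lwr_predict(3.0, X, y), np.sin(3.0))   # the two values should be close
```

A new theta is fitted for every query point and the whole training set has to be kept around, which is why the method counts as non-parametric. A smaller `tau` makes the fit more local, a larger one moves it toward ordinary linear regression.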

classification and logistic regression

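Classification is to discrete targets what linear regression is to continuous ones. Binary classification is handled by logistic regression, which relies on a very important function, the sigmoid (look it up if you are interested). Its values lie in the interval (0, 1): the larger the argument, the closer the value gets to 1, and the smaller the argument, the closer it gets to 0. Another nice property is that its derivative equals the function itself times one minus itself.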

$g'(s)=g(s)(1-g(s))$

Logistic regression assumes the hypothesis has the sigmoid form, $h_\theta(x)=g(\theta^Tx)$, and then defines the probability of y taking the value 0 or 1,

$p(y|x;\theta)=(h_\theta(x))^y(1-h_\theta(x))^{1-y}$

from which we get the likelihood function and take its logarithm,

$l(\theta)=\sum_{i=1}^m\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$

and then maximize the log-likelihood,

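$\frac{\partial}{\partial\theta_j}l(\theta)=(y-h_\theta(x))x_j$

(written for a single training example (x, y)). This gives a remarkable conclusion: the parameter update rule looks essentially the same as the one for linear regression above, apart from the sign (we are now maximizing, so the step is gradient ascent rather than descent). But note that they are not really the same, because the hypothesis is linear in one case and a sigmoid in the other. From here we can also derive the perceptron: just replace the sigmoid in the hypothesis with a threshold function.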
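
The derivation above can be sketched in a few lines. This is an illustration under assumptions: batch gradient ascent on the log-likelihood, with made-up names (`sigmoid`, `fit_logistic`), step size `alpha`, iteration count, and toy data. It also checks the derivative identity $g'(s)=g(s)(1-g(s))$ with a finite difference.

```python
import numpy as np

def sigmoid(s):
    """g(s) = 1 / (1 + e^(-s)), with values strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-s))

def fit_logistic(X, y, alpha=0.1, iters=2000):
    """Maximize the log-likelihood l(theta) by batch gradient ascent."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)                    # h_theta(x) for every example
        theta = theta + alpha * (X.T @ (y - h))   # sum over examples of (y - h) * x_j
    return theta

# Finite-difference check of g'(s) = g(s) * (1 - g(s)) at an arbitrary point.
s = 0.3
numeric = (sigmoid(s + 1e-6) - sigmoid(s - 1e-6)) / 2e-6
analytic = sigmoid(s) * (1 - sigmoid(s))
print(numeric, analytic)

# Hypothetical 1-D data with an intercept column; label 1 when x is positive.
X = np.array([[1.0, -2.0], [1.0, -1.5], [1.0, -1.0],
              [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

theta = fit_logistic(X, y)
predictions = (sigmoid(X @ theta) >= 0.5).astype(int)
print(predictions)
```

The update line has the same shape as the linear regression one; only `h` changes from `X @ theta` to `sigmoid(X @ theta)`. Because this toy data is perfectly separable, theta keeps growing slowly the longer the loop runs, so the iteration count acts as the stopping rule here.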