[Big Data Analytics] Class Notes 1

萝卜丝皮尔

Category: Big Data Analytics. Tags: big data

Article link: https://blog.csdn.net/qq_43448491/article/details/109515087


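Reference: [Big Data Analytics] lectures by Prof. 李育杰 (Yuh-Jye Lee), National Chiao Tung University, Taiwan
These notes list only a few key points; for the details, read the textbook. I do not know how to translate some of the technical terms, so they stay in English and this post leans heavily on figures.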
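Principal Component Analysis
Outliers strongly affect the principal direction in PCA; in other words, the principal direction is sensitive to outliers.
Use the leave-one-out (LOO) procedure to explore how the principal direction varies.
[Figure not preserved]
The figure above shows that duplicating a normal point has almost no effect on the principal direction, whereas duplicating an outlier amplifies the deviation of the principal direction.
This suggests a way to identify outliers: duplicate (over-sample) the target instance, compute the new principal direction, and compare it with the old one. If the deviation is large, the target instance is likely an outlier (abnormal point).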
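The over-sampling idea above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own, not the lecture's exact osPCA formulation: the function names, the duplication `ratio`, the 2-D toy data, and the score `1 - |cos(angle shift)|` are all chosen for the example.

```python
import math

def principal_angle(points):
    """Angle of the first principal direction of 2-D points (closed form)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Major-axis angle of the 2x2 covariance matrix.
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)

def oversampling_score(points, index, ratio=0.1):
    """Duplicate one instance and measure how far the principal direction moves."""
    copies = max(1, int(ratio * len(points)))  # hypothetical duplication ratio
    augmented = points + [points[index]] * copies
    shift = principal_angle(augmented) - principal_angle(points)
    # Directions are axes (v and -v are the same), hence the absolute value.
    return 1.0 - abs(math.cos(shift))

# Toy data: 20 points close to the line y = x, plus one outlier at the end.
data = [(float(i), float(i) + (0.1 if i % 2 else -0.1)) for i in range(20)]
data.append((0.0, 15.0))
scores = [oversampling_score(data, i) for i in range(len(data))]
suspect = scores.index(max(scores))  # index of the most suspicious instance
```

In this toy run the duplicated outlier rotates the principal direction far more than any duplicated normal point, so it receives the largest score.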

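What is PCA used for?
Dimensionality reduction and simplifying the structure of the data: project the data onto a lower-dimensional space while retaining as much information as possible. The problem reduces to finding the eigenvalues and eigenvectors of the sample covariance matrix.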
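As a rough illustration of "eigenvalues and eigenvectors of the sample covariance matrix", here is a 2-D sketch that uses the closed-form eigen-decomposition of a symmetric 2x2 matrix. The function names and the toy data are assumptions made for this example; real data would normally go through a linear-algebra library.

```python
import math

def pca_2d(points):
    """Return (direction, eigenvalues, 1-D scores) for 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2.0
    disc = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + disc, half_trace - disc
    # Eigenvector for lam1 solves (sxx - lam1) * vx + sxy * vy = 0.
    if abs(sxy) > 1e-12:
        vx, vy = sxy, lam1 - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)
    # Project the centered data onto the first principal direction.
    scores = [(x - mx) * direction[0] + (y - my) * direction[1] for x, y in points]
    return direction, (lam1, lam2), scores

points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, (lam1, lam2), scores = pca_2d(points)
retained = lam1 / (lam1 + lam2)  # fraction of variance kept after reducing to 1-D
```

The variance of the projected scores equals the largest eigenvalue, which is why keeping the top eigenvectors keeps the most information.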
osPCA (over-sampling PCA)
Duplicating the target instance n times leads to a new formulation; take care to avoid redundant computation in the process.

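There are many ways to solve this eigenvalue problem. The traditional power method is time-consuming because it iterates and repeatedly multiplies a matrix by a vector.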
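For reference, a minimal power-method sketch in plain Python. The stopping rule, the start vector, and the example matrix are assumptions for illustration; the point is that every iteration costs one matrix-vector product, which is what makes the method expensive when it has to be rerun for every over-sampled instance.

```python
import math

def power_method(matrix, tol=1e-10, max_iter=1000):
    """Dominant eigenvalue and eigenvector of a symmetric matrix (list of rows)."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n  # arbitrary unit start vector
    eigenvalue = 0.0
    for _ in range(max_iter):
        # One matrix-vector product per iteration.
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector lies in the null space")
        w = [x / norm for x in w]
        # Rayleigh quotient of the normalized iterate.
        new_eigenvalue = sum(
            w[i] * sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)
        )
        converged = abs(new_eigenvalue - eigenvalue) < tol
        v, eigenvalue = w, new_eigenvalue
        if converged:
            break
    return eigenvalue, v

cov = [[4.0, 1.0], [1.0, 3.0]]
eigenvalue, vector = power_method(cov)
```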

For this problem (how to solve for the eigenvector quickly in over-sampling PCA), consider a least-squares approximation.

[Figure not preserved]

Exact and approximate models for osPCA
Exact model: learned at the back-end server (which usually handles the more complex work); exact formulation of osPCA; applies the well-known power method, repeating a fixed-point iteration until convergence.
Approximate model: self-learning at the front-end detector (which usually handles the simpler work); least-squares approximate formulation of osPCA; applies recursive least squares for online updating (closed-form solution).

The end device usually handles the simpler, lower-level work and hands the hard work to the cloud.
The cloud fuses the models coming from different end devices and generates a smarter, more exact model to feed back to the front-end devices.
Data can be reduced and compressed on the front-end device (compressed sensing).

2020.11.09

Basic method 1: k-nearest neighbors (k-NN), or "one who stays near ink turns black, one who stays near vermilion turns red"
A distance-based and instance-based algorithm.
Fundamental philosophy: two instances that are close or similar to each other should share the same label.
Also known as memory-based learning, since it stores the training instances in a lookup table and interpolates from them.
It is a lazy learning algorithm: it does not compute a fixed model when it is given a training set, and postpones the computation until a new test instance (query point) arrives and needs to be classified.

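A small example:

K is a parameter that has to be tuned to find the best value.
The smaller k is, the more complex the model and the more likely it is to overfit.
The larger k is, the simpler the model and the smoother the decision boundary, but a k that is too large may predict poorly.

[Figure not preserved]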
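A minimal k-NN sketch that makes the effect of k visible. The toy training set, including one deliberately mislabeled point, and the function name `knn_predict` are assumptions for illustration. Note that no model is built in advance: all the work happens when the query arrives.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train: list of (features, label). Majority vote among the k nearest points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.8, 1.1), "A"),
    ((3.0, 3.0), "B"), ((3.2, 2.9), "B"), ((2.9, 3.3), "B"),
    ((1.1, 1.2), "B"),  # noisy label sitting inside the "A" cluster
]
query = (1.1, 1.15)

label_k1 = knn_predict(train, query, k=1)  # follows the single noisy neighbor
label_k3 = knn_predict(train, query, k=3)  # the vote smooths the noise away
```

With k=1 the prediction copies the noisy neighbor (an overfitting symptom); with k=3 the majority vote recovers the label of the surrounding cluster.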

Remark:
We might need to normalize the scale of different attributes, for example yearly income vs. daily spending.
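One common way to do this is min-max scaling, sketched below. The notes do not say which normalization the lecture uses, so the choice of min-max and the sample numbers are assumptions for this example.

```python
def min_max_scale(rows):
    """Rescale every column of a list of numeric tuples to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]

# (yearly income, daily spending): without scaling, income dominates any distance.
rows = [(30000.0, 20.0), (90000.0, 25.0), (60000.0, 80.0)]
scaled = min_max_scale(rows)
```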

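Basic method 2: online perceptron algorithm
A mistake-driven algorithm.
The smallest unit of deep neural networks.
Characteristic: it keeps making mistakes and keeps correcting them.
A linearly separable training set is one for which we can actually find a line (or, in higher dimensions, a plane) that separates the two classes of points, as in the figure below.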

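Note that the highlighted function is what picks out the cases we need to handle. When the value is negative and y is -1, or the value is positive and y is +1, the highlighted inequality does not hold. Conversely, if the highlighted inequality holds, the point is misclassified, so we enter the loop to correct it. After the update, the value on the left-hand side of the inequality increases, which shows that the update helps.

The theorem above says that if the training set is linearly separable and non-trivial, the algorithm stops after a finite number of steps.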
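The mistake-driven loop can be sketched as follows. The figure with the lecture's pseudocode is not available, so this is a simplified textbook variant under my own assumptions: learning rate 1, bias update `b += y`, and the misclassification test `y * (<w, x> + b) <= 0`.

```python
def train_perceptron(samples, max_epochs=100):
    """samples: list of (features, label) with label in {-1, +1}."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    mistakes = 0
    for _ in range(max_epochs):
        updated = False
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # the inequality holds: a mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]  # move toward the example
                b += y
                mistakes += 1
                updated = True
        if not updated:  # a full pass without mistakes: stop
            break
    return w, b, mistakes

samples = [
    ((2.0, 2.0), 1), ((3.0, 1.0), 1), ((2.0, 3.0), 1),
    ((-1.0, -1.0), -1), ((-2.0, 0.0), -1), ((0.0, -2.0), -1),
]
w, b, mistakes = train_perceptron(samples)
```

Each update adds `y * (|x|^2 + 1)` to `<w, x> + b` for the offending example, which is the "left-hand side increases" observation from the note.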

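[Figure not preserved]
The algorithm above (the dual form) differs in form from the earlier one: previously we updated $w_i$, whereas here we update $\alpha_i$, which records how many mistakes have been made on the $i$-th example.

To use this algorithm, build a Gram matrix that stores the inner products $\langle x^i,x^j\rangle$.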
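A sketch of the dual form, assuming the same simplified update rule as a standard textbook perceptron (learning rate 1, bias update `b += y`); the names and toy data are illustrative. The training data enters only through the precomputed Gram matrix.

```python
def train_dual_perceptron(xs, ys, max_epochs=100):
    """Dual perceptron: alpha[i] counts the mistakes made on example i."""
    n = len(xs)
    # Gram matrix of pairwise inner products <x^i, x^j>.
    gram = [[sum(a * b for a, b in zip(xs[i], xs[j])) for j in range(n)] for i in range(n)]
    alpha = [0] * n
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for i in range(n):
            activation = sum(alpha[j] * ys[j] * gram[j][i] for j in range(n)) + b
            if ys[i] * activation <= 0:  # mistake on example i
                alpha[i] += 1
                b += ys[i]
                updated = True
        if not updated:
            break
    return alpha, b

xs = [(2.0, 2.0), (3.0, 1.0), (2.0, 3.0), (-1.0, -1.0), (-2.0, 0.0), (0.0, -2.0)]
ys = [1, 1, 1, -1, -1, -1]
alpha, b = train_dual_perceptron(xs, ys)

# The primal weight vector can be recovered as w = sum_i alpha_i * y_i * x^i.
w = [sum(alpha[i] * ys[i] * xs[i][d] for i in range(len(xs))) for d in range(2)]
```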

Basic method 3: naive Bayes
Bayes' formula.
Naive Bayes is used for classification and also works well for multi-class classification.
It can estimate the posterior probability of a class label.
Let each attribute (variable) be a random variable. What is the probability Pr(y=1|x1,x2,…,xn)?
This method makes two unrealistic assumptions: all attributes are equally important, and all attributes are conditionally independent given the class.
Under these two assumptions, a given sample x can be classified as follows:
$Pr(y=1|x)\propto Pr(y=1)\prod_{i=1}^{n}Pr(X_i=x_i|y=1)$
In the same way, compute the corresponding value for the class with label y=-1, then decide which class the sample most likely belongs to.
The zero-frequency problem:
A small problem arises in practice. If an attribute value never occurs together with a class value in the sample set (perhaps only because the sample set is too small, not because the combination cannot happen), one factor in the product above is zero, and the computed posterior is zero no matter how likely the other attribute values are.
The Laplace estimator fixes the zero-frequency problem: $\frac{k+\lambda}{n+a\lambda}$.
k: the number of times the event occurs;
n: the number of trials;
a: the number of elementary events in the sample space.
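The zero-frequency problem and the Laplace fix can be seen in a small categorical naive Bayes sketch. The toy data set, the function names, and the choice to leave the class prior unsmoothed are assumptions for illustration; `a` is taken to be the number of distinct values observed for each attribute.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels, lam=1.0):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    domains = defaultdict(set)           # attribute index -> observed values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains, lam

def posterior(model, row):
    """Normalized posterior Pr(label | row) with Laplace estimate (k + lam) / (n + a * lam)."""
    class_counts, value_counts, domains, lam = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        p = n / total  # class prior
        for i, value in enumerate(row):
            k = value_counts[(i, label)][value]
            a = len(domains[i])
            p *= (k + lam) / (n + a * lam)
        scores[label] = p
    z = sum(scores.values())
    return {label: p / z for label, p in scores.items()}

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot")]
labels = [-1, -1, 1, 1, 1]
query = ("overcast", "mild")  # "overcast" never occurs with label -1

unsmoothed = posterior(train_naive_bayes(rows, labels, lam=0.0), query)
smoothed = posterior(train_naive_bayes(rows, labels, lam=1.0), query)
```

With `lam=0.0` the posterior of label -1 collapses to exactly zero because of a single unseen value; with `lam=1.0` it stays positive.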

Support Vector Machines (SVM)

[Figure not preserved]
In the small example above, the SVM recommends the separator with the widest margin, taking the line through the middle of that margin.

Why use support vector machines?
Powerful tools for data mining.
The SVM classifier is an optimally defined surface.
SVMs have a good geometric interpretation.
SVMs can be generated very efficiently.
They extend from the linear to the nonlinear case: typically nonlinear in the input space and linear in a higher-dimensional "feature space" that is implicitly defined by a kernel function (data that is nonlinear in the original space can be mapped by a kernel function into a high-dimensional space where it is handled linearly).
They have a sound theoretical foundation based on Statistical Learning Theory.

Note: the training error is also called the model bias.

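[Figure not preserved]
Note: in the training set S, the x part is the sample data and the y part is the corresponding label. We need to find w and b that satisfy the condition highlighted in yellow: if $D_{ii}=y_i=+1$, the point goes above the line; otherwise it goes below the line. The yellow-highlighted condition can be written equivalently as the part shaded light red.

For the nonseparable case:

Note: slack variables are introduced so that the inequalities easily admit a feasible solution. Among the feasible solutions we want the optimal one, meaning that the $\xi_i$ should be as small as possible, as shown in the figure below.

Note: robust linear programming only cares about the training error, without the "VC bound".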

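[Figure not preserved]
Note: the value of C has to be chosen by hand. A large C means putting a lot of effort into making $\xi$ small, which easily leads to overfitting (fitting the training set very well but predicting poorly; overfitting is common when model complexity is high).
[Figure not preserved]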
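To make the role of C concrete, the sketch below evaluates the standard soft-margin objective $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i$ with $\xi_i=\max(0,\,1-y_i(\langle w,x_i\rangle+b))$ for a fixed, hand-picked classifier. The lecture's figure is not available, so this textbook form, the function name, and the toy numbers are assumptions; the lecture's formulation may scale the terms differently.

```python
def soft_margin_objective(samples, w, b, C):
    """Return (objective value, slack variables) for a fixed linear classifier."""
    slacks = [
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in samples
    ]
    margin_term = 0.5 * sum(wi * wi for wi in w)  # small norm means wide margin
    return margin_term + C * sum(slacks), slacks

samples = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((-1.0, -1.0), -1), ((0.5, 0.5), -1)]
w, b = (0.5, 0.5), -0.5  # hand-picked, not optimized

small_c, slacks = soft_margin_objective(samples, w, b, C=0.1)
large_c, _ = soft_margin_objective(samples, w, b, C=100.0)
```

Only the last sample violates the margin. With a small C that single violation barely matters; with a large C it dominates the objective, so an optimizer would bend the classifier to remove it, which is the overfitting risk described in the note.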

Note: the feature space is high-dimensional, so overfitting is possible.

Learning in feature space
It could simplify the classification task.
Learning in a high-dimensional space could degrade generalization performance (that is, it may overfit); this phenomenon is called the curse of dimensionality.
By using a kernel function, which represents the inner product of training examples in feature space, we never need to know the nonlinear map explicitly, nor even the dimensionality of the feature space.
There is no free lunch: we have to deal with a huge, dense kernel matrix (a reduced kernel can avoid this difficulty).

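[Figure not preserved]

Note: the first highlighted item indicates that a certain dimension is unknown.
The second highlighted item shows that knowing the inner product is enough to compute the classifier $f(x)$; the explicit expression of $\Phi(x)$ is not needed.
[Figure not preserved]

Note: the small example above shows that such a $\Phi(x)$ can be replaced entirely by a function defined in the original data space: take the inner product, then square it.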
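A quick numerical check of this idea, assuming the usual degree-2 map $\Phi(x)=(x_1^2,\sqrt{2}x_1x_2,x_2^2)$ for 2-D inputs. The lost figure may use a different but equivalent example, so treat the map and the sample vectors as assumptions.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (assumed form)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z):
    """Inner product in the original space, then squared."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))  # inner product computed in the feature space
implicit = poly_kernel(x, z)    # same value without ever forming phi
```

Both routes give the same number up to floating-point rounding, which is why a kernel classifier never needs the explicit $\Phi(x)$.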

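Note: the number of product terms (monomials) of degree d that can be formed from $x_1,x_2,...,x_n$ is p.
For example, the degree d=2 terms formed from $x_1,x_2$ are $\{x_1^2,x_2^2,x_1x_2\}$, three in total, so $p=\binom{2+2-1}{2}=3$.
To see how this binomial coefficient is computed, look closely at $x_1^3x_2^1x_3^4x_4^4$ and its accompanying $\times\circ$ diagram.
Each $\times$ is a separator: the 3 $\circ$ between the first and second $\times$ stand for 3 copies of $x_1$, and so on.
The "-1" in the first argument of the binomial coefficient is there because the first symbol of the $\times\circ$ diagram is fixed to be $\times$; compare the "stars and bars" method from high-school math 😂.
[Figure not preserved]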
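The count can be checked by brute force. The sketch below assumes the general formula $p=\binom{n+d-1}{d}$, which matches the n=2, d=2 example above; the function names are made up for illustration.

```python
from itertools import combinations_with_replacement
from math import comb

def count_monomials(n, d):
    """Number of degree-d monomials in n variables (stars and bars)."""
    return comb(n + d - 1, d)

def list_monomials(variables, d):
    """Enumerate the monomials explicitly, e.g. 'x1*x2'."""
    return ["*".join(term) for term in combinations_with_replacement(variables, d)]

terms = list_monomials(["x1", "x2"], 2)  # the three terms from the example above
p_small = count_monomials(2, 2)
p_large = count_monomials(4, 12)  # same n and degree as x1^3 * x2 * x3^4 * x4^4
```

The feature-space dimension grows quickly with n and d, which is one reason the kernel trick avoids forming these terms explicitly.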

Note: the SVM generalizes from the linear case to the nonlinear case very simply, just by making the substitution highlighted in yellow.

Background that may be worth adding:
A blog post that explains overfitting and underfitting in very plain language.
