Data Warehouse and Data Mining (English-taught course): Final Exam Review


m0_54778759


本文链接：https://blog.csdn.net/m0_54778759/article/details/127793367


MOOC address: Data Warehouse and Data Mining, Beijing Institute of Technology, China University MOOC (icourse163.org): https://www.icourse163.org/course/BIT-1464031178

Chapter 1. Introduction

What Is Data Mining?

Why Data Mining?

Data mining process (briefly describe the process of knowledge discovery and data mining)

Data mining Tasks

Evaluation of Knowledge

Chapter 2 Data

Data Objects and Attribute Types

Basic Statistical Descriptions of Data

Measuring Data Similarity and Dissimilarity

Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

Data Cleaning

How to Handle outlier?

Data Warehouse: Basic Concepts

Data Warehouse Modeling: Data Cube and OLAP

Data Warehouse Design and Usage

Data Warehouse Implementation

Chapter 4: Association Rule Mining

Basic Concept

Frequent Itemset Generation

Rule Generation

Factors Affecting Complexity of Apriori

Compact Representation of Frequent Itemset

Chapter 5 Classification

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Cluster Analysis

Partitioning Methods

Hierarchical Methods

Density- and Grid-Based Methods

Evaluation of Clustering

Chapter 7 Outlier Analysis

Outlier and Outlier Analysis

Outlier Detection Methods

Statistical Approaches

Proximity-Based Approaches

Clustering-Based Approaches

Classification-Based Method I: One-Class Model

Chapter 1. Introduction

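Requirement: briefly describe the process of knowledge discovery and data mining.

What Is Data Mining?

Data mining is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) and ultimately understandable patterns or knowledge from large amounts of data. The data are huge in volume and varied in structure.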

Why Data Mining?

Data mining process

Briefly describe the process of knowledge discovery and data mining.

Data understanding and data preparation take about 70% of the time.

Data mining Tasks

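Predictive tasks

Predict the value of a particular attribute (the target or dependent variable) based on the values of other attributes (the explanatory or independent variables).

Descriptive tasks

Derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in the data. Association rule mining: given a set of records, each containing some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on the occurrences of other items.

Classification

Build a model (function) from a set of given examples to predict unknown class labels. Each given example contains a set of attributes, one of which is the class label.

Usually the given data set is divided into a training set and a test set: the training set is used to build the model and the test set to validate it.

Regression

1. Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.

Cluster analysis

1. Unsupervised learning (i.e., the class labels are unknown)

2. Group the data to form new categories (i.e., clusters)

Outlier detection

1. Identify observations whose characteristics differ from the rest of the data

2. Application: credit card fraud detection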

Evaluation of Knowledge

1 Descriptive vs. predictive

2 Coverage

3 Novelty

4 Accuracy

5 Timeliness

Chapter 2 Data

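Review questions

1. What are the main characteristics of a data set?

2. What methods can identify anomalies in attribute values?

3. If the attributes include nominal, ordinal, and numeric attributes, how is the similarity between data objects computed?

Data Objects and Attribute Types

Data object: an instance (a row). Attribute: a column. Data set: a collection of data objects.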

Types of data set

• Record
•    Data Matrix
•    Document Data
•    Transaction Data
• Graph
•    World Wide Web
•    Molecular Structures
• Ordered
•    Temporal Data (time series)
•    Sequence Data

Important Characteristics of Data set

Dimensionality

Curse of dimensionality: (1) there may be more dimensions than instances; (2) redundancy exists

Sparsity

Resolution

Distribution

Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
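Binary: nominal attribute with only 2 states (0 and 1)

Symmetric binary

Asymmetric binary

Ordinal
Age_range = {under 20, 20-30, 30-40, 40-50}

The values have a meaningful order

Numeric Attribute Types

Interval-scaled: measured on a scale of equal-sized units; ratios are not necessarily meaningful because there is no true zero point

Ratio-scaled: has a true zero point, so ratios are meaningful, e.g., 10 K is twice as high as 5 K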

Basic Statistical Descriptions of Data

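Central tendency

Mean

Median

Mode

Symmetric (normal) distribution

Positively skewed

Negatively skewed

1.5 × IQR rule: values more than 1.5 × IQR below Q1 or above Q3 are flagged as outliers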
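
The 1.5 × IQR rule can be sketched in a few lines of Python. This is only an illustration: the sample values are made up, and quartiles are computed with the standard library's "inclusive" interpolation, which is one of several common conventions.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample: one value sits far above the rest.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(iqr_outliers(data))
```

A different quartile convention can shift the fences slightly, so borderline values may be classified differently.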

Measuring Data Similarity and Dissimilarity

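Converting between similarity and dissimilarity

3. If the attributes include nominal, ordinal, and numeric attributes, how is the similarity between data objects computed?

Nominal Attributes

Method 1: simple matching, d(i, j) = (p - m) / p, where m is the number of matching attributes and p is the total number of attributes

Method 2: Use a large number of binary attributes

Create a new binary attribute for each nominal state

Binary attributes

For asymmetric binary attributes, the number of negative matches t is not counted

Simple matching coefficient

Example

Gender is a symmetric attribute (both values equally likely), so it is left out of the asymmetric computation

Ordinal Variables

Replace each value by its rank, then normalize the rank to [0, 1]: z = (r - 1) / (M - 1)

r: the rank corresponding to the attribute value

M: the number of ordered states of the attribute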

Numeric Data

Three properties of a distance

d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)

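Minkowski distance of order r: d(i, j) = (Σ_f |x_if - x_jf|^r)^(1/r)

r = 1: Manhattan distance (the sum of the absolute differences along each dimension)

r = 2: Euclidean distance

r = ∞: supremum distance (the largest difference along any single dimension)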
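
As a small illustration of the three cases, the sketch below evaluates the Minkowski distance for r = 1, 2, and ∞ on two made-up points. The function name and the points are assumptions for the example only.

```python
def minkowski(x, y, r):
    """Minkowski distance of order r; r = float('inf') gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == float("inf"):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

p, q = (1, 2), (4, 6)                      # hypothetical 2-D objects
manhattan = minkowski(p, q, 1)             # |1-4| + |2-6|
euclidean = minkowski(p, q, 2)             # sqrt(3^2 + 4^2)
supremum = minkowski(p, q, float("inf"))   # max(3, 4)
print(manhattan, euclidean, supremum)
```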

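Document data

Mixed attribute types

d(i, j) = Σ_f δ_ij(f) d_ij(f) / Σ_f δ_ij(f), where f indexes the attributes, d_ij(f) is the dissimilarity of objects i and j on attribute f, and δ_ij(f) is the weight (indicator) of that attribute

First normalize each attribute's dissimilarity to the range 0-1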

Chapter 3: Data Preprocessing

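Review questions:

1. What are the main steps of data preprocessing?

2. Briefly describe the main tasks, common methods, and workflow of data cleaning.

3. What are the common data discretization methods?

4. How can redundant attributes in a data set be identified?

5. Why is data reduction needed in data preprocessing? Briefly describe the common data reduction methods.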

Data Preprocessing: An Overview

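What are the main steps of data preprocessing?

Main steps

Data cleaning

Handle noise, missing values, outliers, and inconsistencies

Data integration

Integrate multiple data sources

Data transformation

Normalization, concept hierarchy generation

Data reduction

Dimensionality reduction, numerosity reduction (sampling), and compression (reducing both the dimensions and the volume)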

Data Cleaning

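Dirty data

Missing values, errors (e.g., salary = -10), inconsistencies

Noise: random measurement error

Outliers: differ from most of the data; they can be good or bad

Data cleaning: resolve these kinds of dirty data

How to Handle Missing Data?

Ignore the tuple, e.g., when the class label is missing; this is not effective when too many values are missing
Fill in the missing value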

How to Handle Outliers?

Detection: clustering, the 3σ rule, boxplot

Handling:

How to Handle Noise?

1. Binning

2. Regression fitting

3. Clustering: detect and remove noise

Data Integration

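Integration of multiple data sources

Redundancy

Data value conflicts: different representations or different scales, e.g., metric vs. imperial units

Handling Redundancy in Data Integration

Detection

Numeric attributes: correlation analysis, covariance

Nominal attributes: chi-square (χ2) test

Null hypothesis: the two attributes are independent

χ2 = Σ (o_ij - e_ij)^2 / e_ij, where o_ij is the observed frequency and e_ij is the expected frequency

The larger the χ2 value, the more likely it is that the variables are related

Example

The degrees of freedom equal (a - 1) × (b - 1), where a and b are the numbers of categories of the two attributes being tested.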
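
The χ2 computation can be sketched as follows. The 2 × 2 contingency table is an illustrative assumption (it is not taken from the notes), and the function name is hypothetical.

```python
def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count if the two attributes were independent.
            e_ij = row_totals[i] * col_totals[j] / total
            stat += (o_ij - e_ij) ** 2 / e_ij
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof

# Assumed counts: rows and columns are the categories of two nominal attributes.
table = [[250, 200], [50, 1000]]
stat, dof = chi_square(table)
print(round(stat, 2), dof)
```

The statistic is then compared with the critical value of the χ2 distribution for the given degrees of freedom; a value far above it rejects the independence hypothesis.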

Variance for Single Variable

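Data transformation: a function that maps the entire set of values of a given attribute to a new set of replacement values, so that each old value can be identified with one of the new values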

Normalization

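Min-max normalization: v' = (v - min) / (max - min) × (new_max - new_min) + new_min

Z-score normalization (μ: mean, σ: standard deviation): v' = (v - μ) / σ

Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1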
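
A minimal sketch of the three normalization methods, assuming the usual textbook definitions (min-max to a target range, z-score with the population standard deviation, decimal scaling by the smallest power of ten that brings every absolute value below 1). The sample values are made up.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    # Find the smallest j such that max(|v|) / 10**j < 1.
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [12000, 73600, 98000]          # hypothetical attribute values
print(min_max(incomes))
print(z_score(incomes))
print(decimal_scaling([-986, 917]))
```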

Data Discretization

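Histogram analysis

Entropy-based methods

Binning

Equal-width

If A and B are the lowest and highest values of the attribute, the range is divided into N intervals of equal size, each of width W = (B - A) / N. This is the most straightforward approach, but outliers may dominate the result and skewed data are not handled well

N is chosen by the user

Equal-frequency (equal-depth): each interval receives the same number of values. The equal-width method divides the attribute's value range into a user-specified number of intervals of the same width; because of outliers it may perform poorly, so the equal-frequency (equal-depth) method, which tries to place the same number of objects in each interval, is usually preferable

Clustering-based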

Data Aggregation

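Combining two or more attribute values (or objects) into a single attribute value (or object)

Example: length and width combined into a perimeter

Attribute Construction

Constructed attributes should not be strongly correlated with one another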

Data Reduction

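Numerosity vs. dimensionality

PCA (parametric, unsupervised)

An orthogonal transformation of the original feature space; the resulting attributes are pairwise uncorrelated

• Normalize the input data so that every attribute falls within the same range

• Compute k orthonormal (unit) vectors, i.e., the principal components

• Each input data vector is a linear combination of the k principal component vectors

A discrete attribute has a finite or countably infinite set of values

Definition of data integration: data integration combines interrelated, distributed, heterogeneous data sources so that users can access them transparently. Integration means maintaining the overall consistency of the data across the sources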

Data Warehousing

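1. Briefly describe the basic architecture of a data warehouse.

2. Briefly describe the main functions of data warehouse ETL software.

3. Briefly describe the data models used in a data warehouse and the characteristics of each.

4. Compare and analyze the differences between a data warehouse and a database.

Data Warehouse: Basic Concepts

A data warehouse is a subject-oriented (organized around what is analyzed), integrated, time-variant (the data carry a time attribute, which an operational relational database does not require), and nonvolatile (physically separate from the operational database, with only loading and access operations) collection of data in support of management's decision-making process

Once stored, the data do not change

Subject-oriented

Only the data relevant to the subject are kept; the warehouse provides a view that serves decision making

Integrated

The data come from source databases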

Data Warehouse Modeling: Data Cube and OLAP

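1. Briefly describe the basic architecture of a data warehouse

Architecture

Top Tier: Front-End Tools (application layer)
Middle Tier: OLAP Server (OLAP analysis)
Bottom Tier: Data Warehouse Server (the data warehouse and data marts, which are subsets of it; modeling is the key)
Data

Types

Briefly describe the main functions of data warehouse ETL software

Models

Enterprise warehouse: costly

A data mart is contained in the data warehouse

Data mart: flexible

Dependent: the warehouse is built first and the data marts are derived from it. Independent: the data mart draws on its own data sources

A data mart is a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart. Data marts are either independent or dependent (sourced directly from the warehouse)

Virtual warehouse: a set of views over operational databases; strictly speaking, it is not a data warehouse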

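Data extraction

Get data from multiple, heterogeneous, and external sources

Data cleaning

Detect errors in the data and rectify them when possible

Data transformation

Convert data from legacy or host format to warehouse format

Load

Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions

Refresh

Propagate the updates from the data sources to the warehouse

Metadata is the data defining warehouse objects

Structure

Schema, views, dimensions, hierarchies, derived data definitions, data mart locations and contents

Operational metadata

Data lineage (history of migrated data and the transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)

The algorithms used for summarization
The mapping from the operational environment to the data warehouse
Data related to system performance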

Data Warehouse Design and Usage

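A data warehouse is based on a multidimensional data model, which views data in the form of a data cube

Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)

The fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

Data cube: a lattice of cuboids

In the data warehousing literature, an n-D base cube is called a base cuboid (n dimensions)

The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid

The lattice of cuboids forms a data cube

Multidimensional presentation: different combinations of dimensions, and different levels within the same dimension

Star schema: a fact table in the middle connected to a set of dimension tables

Snowflake schema: a refinement of the star schema in which some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake

Large dimensions are split into smaller dimension tables, which is cumbersome

Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore called a galaxy schema or fact constellation

The most commonly used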

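Operations

Roll up (drill up)

Drill down

Slice: one dimension is fixed. Dice: a range within the dimensions is selected

Pivot (rotate): reorient the cube for visualization, e.g., turn a 3-D view into a series of 2-D planes

That is, look at the data from a different angle

Drill across: involving (across) more than one fact table

Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)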

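Data Warehouse Implementation

Four views regarding the design of a data warehouse

Top-down view

Allows selection of the relevant information necessary for the data warehouse

Data source view

Exposes the information being captured, stored, and managed by operational systems

Data warehouse view

Consists of fact tables and dimension tables

Business query view

Sees the data in the warehouse from the viewpoint of the end user

Compare and analyze the differences between a data warehouse and a database.

A database is designed around transactions; a data warehouse is designed around subjects.
A database generally stores current business data; a data warehouse generally stores historical data.
Database design avoids redundancy as far as possible and usually targets a single business application; for example, a simple User table that records user names, passwords, and similar data suits the application but not analysis. A data warehouse deliberately introduces redundancy and is designed according to the analysis requirements, analysis dimensions, and analysis measures.
A database is designed to capture data; a data warehouse is designed to analyze data.
The operations also differ.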

Chapter 4: Association Rule Mining

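Review questions

1. Explain the anti-monotonicity of support and its role in frequent itemset mining.

2. Why are maximal frequent itemsets and closed frequent itemsets considered compressed representations of the frequent itemsets?

3. Can support and confidence alone effectively evaluate the usefulness of association rules?

Basic Concept

Antecedent, consequent, itemset, k-itemset, support, frequent itemset

Rule Evaluation Metrics

Confidence

Support

Goal: find all association rules that satisfy the support and confidence thresholds

Support is used to delete meaningless rules (generality), and confidence measures the reliability of rule inference (reliability).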

Two-step approach:

Frequent Itemset Generation

Rule Generation

Frequent Itemset Generation

Complexity ~ O(NMw) => Expensive since M = 2^d

Reduce the number of candidates

Reduce the number of transactions

Reduce the number of comparisons (NM)

w: the transaction width (length)

1. Explain the anti-monotonicity of support and its role in frequent itemset mining

Reducing Number of Candidates

Apriori principle

If an itemset is frequent, then all of its subsets must also be frequent (the anti-monotone property of support)

Fk: frequent k-itemsets; Lk: candidate k-itemsets

Algorithm
Let k = 1. Generate F1 = {frequent 1-itemsets}
Repeat until Fk is empty
• Candidate Generation: Generate Lk+1 from Fk
• Candidate Pruning: Prune candidate itemsets in Lk+1 containing subsets of length k that are infrequent
• Support Counting: Count the support of each candidate in Lk+1 by scanning the DB
• Candidate Elimination: Eliminate candidates in Lk+1 that are infrequent, leaving only those that are frequent => Fk+1

Candidate Generation

Self-join

Merge two frequent (k-1)-itemsets if their first (k-2) items are identical

Candidate pruning
Prune ABCE because ACE and BCE are infrequent
Prune ABDE because ADE is infrequent
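
The level-wise procedure above (generate, prune, count, eliminate) can be sketched in Python. This is a simplified illustration, not an optimized implementation: support is counted by a plain scan rather than a hash tree, and the five transactions and the threshold are assumed example data.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})

    def count(candidates):
        # Support counting: one scan of the database per level.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    frequent = {}
    level = {c: n for c, n in count([frozenset([i]) for i in items]).items()
             if n >= minsup_count}
    k = 1
    while level:
        frequent.update(level)
        prev = sorted(tuple(sorted(s)) for s in level)
        candidates = set()
        for a, b in combinations(prev, 2):
            # Candidate generation: merge two k-itemsets sharing their first k-1 items.
            if a[:-1] == b[:-1]:
                cand = frozenset(a) | frozenset(b)
                # Candidate pruning: every k-subset must itself be frequent.
                if all(frozenset(s) in level for s in combinations(cand, k)):
                    candidates.add(cand)
        # Candidate elimination: keep only candidates meeting the threshold.
        level = {c: n for c, n in count(candidates).items() if n >= minsup_count}
        k += 1
    return frequent

# Assumed market-basket data for illustration.
baskets = [
    ["Bread", "Milk"],
    ["Bread", "Diaper", "Beer", "Eggs"],
    ["Milk", "Diaper", "Beer", "Coke"],
    ["Bread", "Milk", "Diaper", "Beer"],
    ["Bread", "Milk", "Diaper", "Coke"],
]
result = apriori(baskets, minsup_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With this data the 3-itemset {Bread, Diaper, Milk} survives pruning as a candidate but is eliminated after support counting, which shows why both steps are needed.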

Support Counting Using a Hash Tree

Rule Generation

The confidence of rules generated from the same itemset has an anti-monotone property

Factors Affecting Complexity of Apriori

Choice of minimum support threshold

---lowering support threshold results in more frequent itemsets

Dimensionality (number of items) of the data set

Size of database

Average transaction width

Compact Representation of Frequent Itemset

Maximal Frequent Itemset

An itemset is maximal frequent if it is frequent and
none of its immediate supersets is frequent

Closed Itemset

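Maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets

An itemset X is closed if none of its immediate supersets has the same support as X.
X is not closed if at least one of its immediate supersets has the same support count as X.

Maximal frequent itemsets → their subsets (all frequent itemsets) → closed or not closed → recover the support counts

2. Why are maximal frequent itemsets and closed frequent itemsets considered compressed representations of the frequent itemsets?

All frequent itemsets can be derived from the maximal frequent itemsets, and the closed frequent itemsets additionally allow the support of every frequent itemset to be recovered

Pattern Evaluation

3. Can support and confidence alone effectively evaluate the usefulness of association rules?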

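To ensure that people who buy X will more likely buy Y than not buy Y
lift(X → Y) = P(X ∪ Y) / (P(X) × P(Y)) = confidence(X → Y) / support(Y)

= 1: X and Y are independent

> 1: positively correlated (X promotes Y); < 1: negatively correlated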
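
A short sketch shows why support and confidence alone can mislead. The counts below are assumed for illustration: the rule basketball → cereal has 40% support and about 67% confidence, yet its lift is below 1 because cereal is even more common overall.

```python
def lift(transactions, x, y):
    """lift(X -> Y) = P(X and Y) / (P(X) * P(Y)), estimated from the transactions."""
    n = len(transactions)
    x, y = set(x), set(y)
    p_x = sum(1 for t in transactions if x <= set(t)) / n
    p_y = sum(1 for t in transactions if y <= set(t)) / n
    p_xy = sum(1 for t in transactions if (x | y) <= set(t)) / n
    return p_xy / (p_x * p_y)

# Assumed survey of 5000 students (hypothetical counts).
students = (
    [["basketball", "cereal"]] * 2000
    + [["basketball"]] * 1000
    + [["cereal"]] * 1750
    + [[]] * 250
)
value = lift(students, ["basketball"], ["cereal"])
print(round(value, 3))
```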

A frequent k-itemset yields 2^k - 2 candidate association rules

Frequent itemsets satisfy support ≥ minsup

Chapter 5 Classification

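1. Be able to plot an ROC curve

2. Decision trees commonly use information gain, gain ratio, and the Gini index to measure the importance of attributes; compare and analyze the characteristics of these three measures.

3. The principle of the Bayes classifier

4. Main ensemble techniques

Classification: Basic Concepts

The training set trains the classifier; the test set validates it

Classification predicts discrete or nominal class labels

Numeric prediction: continuous values

Difference between clustering and classification: whether the labels and the number of classes are known

Model construction

The set of samples used for model construction is the training set

Model: represented as decision trees, rules, mathematical formulas, or other forms

Model Validation and Testing

Test: estimate the accuracy of the model

Validation: choose among multiple models using a validation framework

Model application

The parameters may be updated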

Decision Tree Induction

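A top-down, recursive (each subset is split by repeating the same procedure), divide-and-conquer process

Conditions for stopping partitioning

All samples for a given node belong to the same class

There are no remaining attributes for further partitioning

There are no samples left

Node splitting

Multi-way split

Binary split (the possible groupings of values must be considered)

Numeric attributes

Multi-way decision: range query Vi ≤ A < Vi+1

Binary decision: (A < v) or (A ≥ v)

The task now is to choose the first split, i.e., the attribute that best separates the classes

m = 2

Nodes with a purer class distribution are preferred

Selection measures:

Information gain

Entropy

A measure of uncertainty associated with a random variable

Information entropy: H(Y) = -Σ p_i log2(p_i)

Gain(X) = H(Y) - H(Y|X)

In practice, one must also decide how to split a continuous attribute such as age

Determine the best split point for a continuous-valued attribute A

Sort the values of A in increasing order: e.g., 15, 18, 21, 22, 24, 25, 29, 31, ...

Possible split points: the midpoint between each pair of adjacent values

Select the point with the maximum information gain as the split point for A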

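Drawback of information gain

Biased toward attributes with a large number of partitions (values)

Customer ID has the highest information gain because the entropy of all its children is zero

Gain ratio

GainRatio(A) = Gain(A) / SplitInfo(A)

SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557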

GainRatio(income) = 0.029/1.557 = 0.019
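
The numbers above can be checked with a short sketch. The partition sizes 4, 6, 4 come from the SplitInfo expression in the notes; the class counts per partition (9 positive and 5 negative overall, split as 2/2, 4/2, 3/1 across the three income values) are an assumption taken from the standard 14-tuple textbook example.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, partitions):
    """Gain = H(parent) minus the size-weighted average entropy of the partitions."""
    total = sum(parent_counts)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - remainder

# Assumed class counts [yes, no] for income = high, medium, low.
income_partitions = [[2, 2], [4, 2], [3, 1]]
gain_income = info_gain([9, 5], income_partitions)
split_info_income = entropy([sum(p) for p in income_partitions])  # entropy of sizes 4, 6, 4
gain_ratio_income = gain_income / split_info_income
print(round(gain_income, 3), round(split_info_income, 3), round(gain_ratio_income, 3))
```

SplitInfo is the entropy of the partition sizes themselves, so attributes that split the data into many small partitions get a large denominator and a smaller gain ratio.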

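Tends to produce unbalanced splits (one partition much smaller than the others)

SplitInfo is simply the entropy of the attribute's own value distribution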

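Gini index: used in CART (binary splits)

The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (all possible split points of each attribute need to be enumerated)

The purer the partition, the smaller the Gini index: it is 0 for a pure node and largest when the classes are evenly split (0.5 for two classes)

The larger the reduction Δ, the better

Gini is mainly used with binary splits, so the grouping of values must be chosen

Choosing the split point

Sort the attribute by value

Linearly scan the values, each time updating the count matrix and computing the Gini index

Choose the split position with the smallest Gini index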
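
The sort-and-scan procedure can be sketched as follows. The ten (income, label) records are assumed example data, and the function names are hypothetical.

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def best_gini_split(values, labels):
    """Scan midpoints of the sorted values; return (split_point, weighted_gini)."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}                  # count matrix, left side
    right = {c: labels.count(c) for c in classes}   # count matrix, right side
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(n - 1):
        v, c = pairs[i]
        left[c] += 1       # move one record from the right side to the left side
        right[c] -= 1
        nxt = pairs[i + 1][0]
        if nxt == v:
            continue       # no split point between equal values
        n_left = i + 1
        score = (n_left / n) * gini(list(left.values())) \
            + ((n - n_left) / n) * gini(list(right.values()))
        if score < best[1]:
            best = ((v + nxt) / 2, score)
    return best

# Assumed data: annual income (in thousands) and a binary class label.
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
label = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
split_point, score = best_gini_split(income, label)
print(split_point, round(score, 3))
```

Updating the count matrix incrementally keeps the scan linear after sorting, instead of recounting the classes for every candidate split.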

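Drawbacks

Has difficulty when the number of classes is large; tends to favor balanced splits

Biased toward multivalued attributes

Overfitting: too many branches. Underfitting: too few branches, so both the training and test errors are high

How to address overfitting: pruning

Prepruning: when building a node, decide whether to split it; do not split if the information gain falls below a preset threshold. A suitable threshold is clearly hard to obtain

Postpruning: build the full tree first and then prune it using a fresh set of labeled test data; remove a subtree if the accuracy stays the same or improves without it

Advantages of decision trees:

Small trees are easy to interpret (the condition tests explain why a label was chosen), fast, accurate, and, after pruning, robust to noise

Drawback: each split considers a single attribute, so interactions between attributes are not captured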

Bayes Classification Methods

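Prior probability: a probability obtained from past experience and analysis, available before the experiment or sampling.

Posterior probability: given that an event has occurred, the probability that it was caused by a particular factor

We want the probability that a has label b, but the label of a is unknown, so we rewrite it with Bayes' theorem: P(b|a) = P(a|b) P(b) / P(a)

P(b|a) is the posterior probability; P(a|b) is the probability of generating a given that the label is b; P(a) is the same for every class

If P(b) = P(c), the comparison reduces to showing that label b is more likely to generate a than label c

P(x) and P(y) can be estimated from the training set; what remains is to estimate P(x|y)

Naive Bayes classifier

Treat each attribute and the class label as random variables

Given a record with attributes (X1, X2, ..., Xd)

- Goal is to predict class Y
- Specifically, we want to find the value of Y that maximizes P(Y| X1, X2,…, Xd )
Assume the attributes Xi are conditionally independent given the class: P(X1, X2, ..., Xd | Y) = P(X1|Y) × P(X2|Y) × ... × P(Xd|Y)

The record is assigned to the class under which it is most likely to have been generated

Estimating the probabilities: for nominal attributes, count directly

For continuous attributes

Assume a normal distribution

Naive Bayes prediction requires each conditional probability to be nonzero.

Otherwise, the predicted probability will be zero

Use Laplacian correction (or Laplacian estimator)

Advantages

Robust

Drawbacks

The attribute independence assumption does not always hold in practice

Ensemble methods: requirements

All base classifiers are independent of each other.

All base classifiers perform better than random guessing (error rate < 0.5 for binary classification).

Ensemble methods work best with unstable base classifiers

Examples: unpruned decision trees, ... (overfitting, high variance)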

Bagging: Bootstrap Aggregation

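Training

Sampling with replacement (bootstrap)

Classification

The bagged classifier M* counts the votes and assigns the class with the most votes to X.

Prediction: bagging can be applied to the prediction of continuous values by taking the average of the individual predictions for a given test tuple.

(e.g., decision trees)

The bootstrap samples differ considerably from one another

Unstable learners: decision trees and the random forests built from them; kNN, by contrast, is usually regarded as stable

Boosting

Weights are assigned to each training tuple

A series of k classifiers is learned iteratively

Adjust the data weights: the weights of misclassified tuples are increased (the training set is still used)

Adjust the classifier weights: the higher a classifier's error rate, the lower its weight

Weight computation

Classifier weights

Bagging can be trained in parallel; boosting is sequential

Either works

Training set

Z is the sum of the weights after adjustment (the normalization factor)

If the error rate exceeds 50%, the weights are reset to their initial values and the round is repeated

The classifiers are combined according to their weights

Random Forest

Each tree uses a subset of the training data obtained by sampling with replacement

At each node, a randomly selected set of attributes serves as the candidates, and the split is made on the best of them

Compared with the original bagging, this increases the diversity among the generated trees.

It reduces the variance of the classifier, and a variety of features get used

Classification of class-imbalanced data sets

The class sizes are unbalanced

Traditional methods assume a balanced class distribution and equal error costs: not suitable for class-imbalanced data

Approaches

Oversampling: replicate tuples of the minority class; however, this cannot help recognize positive instances that never appeared

Undersampling: remove negative (majority-class) instances

Threshold moving: adjust the decision threshold

Ensemble methods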

Model Evaluation and Selection

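When evaluating accuracy, use a validation or test set of class-labeled tuples rather than the training set

Methods for estimating a classifier's accuracy

Holdout method

Cross-validation

ROC Curves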

Classifier Evaluation Metrics: Confusion Matrix

A\P	C	¬C
C	TP	FN	P
¬C	FP	TN	N
	P’	N’	All

Accuracy = (TP + TN)/All

Error rate: 1 – accuracy, or

Error rate = (FP + FN)/All

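Class imbalance

Precision = TP / (TP + FP); Recall = TP / (TP + FN)

Precision and recall tend to have an inverse relationship

F measure (or F-score): the harmonic mean of precision and recall

Fβ: a weighted measure of precision and recall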
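
A small sketch computes these measures from the four confusion-matrix counts. The counts are assumed for illustration and describe a strongly imbalanced problem, where accuracy looks excellent while precision and recall are poor.

```python
def classification_metrics(tp, fn, fp, tn, beta=1.0):
    """Accuracy, error rate, precision, recall, and F-beta from confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F-beta weights recall beta times as much as precision; beta = 1 is the harmonic mean.
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
    }

# Assumed counts: 300 actual positives among 10000 tuples.
m = classification_metrics(tp=90, fn=210, fp=140, tn=9560)
for name, value in m.items():
    print(name, round(value, 4))
```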

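Classifier evaluation: holdout and cross-validation

Holdout: randomly partition the data set, use 2/3 for training and the rest for testing; when the holdout is repeated, average the resulting accuracies

Cross-validation

k-fold: randomly partition the data into k mutually exclusive folds of approximately equal size (sampling without replacement)

Each fold serves once as the test set; the accuracy is the average over the k rounds

This counts as one run

Ten runs of 10-fold cross-validation train 100 classifiers and give 10 accuracy estimates

ROC

The area under the ROC curve (AUC) is a measure of the model's accuracy

A single classifier yields only one (TPR, FPR) pair, so how is the curve drawn?

Use a classifier that produces a continuous-valued score for each instance

The more likely an instance is to be in the + class, the higher its score

Sort the instances by score in decreasing order

Apply a threshold at each unique value of the score

Count the number of TP, FP, TN, FN at each threshold, then plot TPR = TP / (TP + FN) against FPR = FP / (FP + TN)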
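
The thresholding procedure for drawing an ROC curve can be sketched as below. The scores and labels are assumed example data (1 marks the + class), and the AUC is approximated with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by thresholding at each unique score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    # Highest threshold first, so the curve grows from (0, 0) toward (1, 1).
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Assumed classifier scores for ten test instances.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.51, 0.5, 0.4]
labels = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]
curve = roc_points(scores, labels)
area = auc(curve)
print(curve)
print(area)
```

Each threshold turns the scoring classifier into one ordinary classifier with its own (FPR, TPR) pair, which answers the question of how a curve arises from a single model.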

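Rule-based classifiers include decision trees, random forests, and Apriori.

Suppose you have trained a linear SVM and concluded that the model is underfitting.

What should you do in the next round of training?

A. Add more data points
B. Use fewer data points
C. Add more features
D. Use fewer features

Answer: C

The best option is to generate more features.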

Cluster Analysis

Review questions:

1. What is the difference between clustering and classification? What types of clusters are there?

2. How can the parameter k be determined effectively in k-means?

3. What are the drawbacks of k-means? Give corresponding remedies.

4. How can the parameters of DBSCAN be determined effectively?

5. Explain the principle and characteristics of the k-means algorithm.

6. How is the distance between clusters measured in hierarchical clustering? What are the advantages and disadvantages of each measure?

7. How does grid-based clustering use a multiresolution grid structure to perform clustering?

8. How can the validity of a clustering be evaluated?

Collaborative filtering

Types of Cluster Analysis

Partitioning criteria
Partitional vs. hierarchical (tree-like)

Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)

Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)

Clustering space
Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)

Requirements and Challenges

Quality
Ability to deal with different types of attributes: numerical, categorical, text, multimedia, networks, and mixtures of multiple types
Discovery of clusters with arbitrary shape
Ability to deal with noisy data

Scalability
Clustering all the data instead of only samples
High dimensionality
Incremental or stream clustering and insensitivity to input order

Constraint-based clustering
User-given preferences or constraints; domain knowledge; user queries

Interpretability and usability
The final generated clusters should be semantically meaningful and useful

Partitioning Methods

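Discover the groupings in the data by optimizing a specific objective function and iteratively improving the quality of the partitions

SSE = Σ_k Σ_{x in Ck} ||x - ck||^2, where ck is the centroid (mean) of cluster Ck

SSE is largest with a single cluster and smallest when every point is its own cluster

Cohesion (SSE)

Separation (SSB)

|Ci|: the number of points in cluster Ci

mi: the center of cluster Ci; m: the center of the whole data set

SSB + SSE = constant

Problem definition: Given K, find a partition of K clusters that optimizes the chosen partitioning criterion

K-means

Premise: each cluster can be represented by its center

The center is the mean of the points in the cluster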

Given K, the number of clusters, the K-Means clustering algorithm is outlined as follows:

Select K points as initial centroids
Repeat
• Form K clusters by assigning each point to its closest centroid
• Re-compute the centroids (i.e., mean point) of each cluster
Until convergence criterion (e.g., a maximum number of iterations or a threshold on the SSE) is satisfied
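
The outline above can be turned into a minimal Python sketch. Several details are assumptions made for the example: initial centroids are drawn at random from the data, an empty cluster simply keeps its old centroid, and convergence means the centroids stop moving. The six points are made up.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on tuples of numbers; returns (centroids, clusters, sse)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # select K points as initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its closest centroid (squared Euclidean distance).
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Re-compute each centroid as the mean point of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:     # convergence: centroids stopped moving
            break
        centroids = new_centroids
    sse = sum(sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
              for j, c in enumerate(clusters) for p in c)
    return centroids, clusters, sse

# Two well-separated groups of hypothetical 2-D points.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters, sse = kmeans(pts, k=2)
print(centroids, sse)
```

On less cleanly separated data the result depends on the initial centroids, which is why the algorithm is usually run several times or seeded with K-Means++.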

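Difficulties: how to choose the initial K points, and how to determine K (elbow plot)

3. What are the drawbacks of k-means? Give corresponding remedies

Efficiency: O(tKn), where n: # of objects, K: # of clusters, and t: # of iterations
Normally, K, t << n; thus, an efficient method
K-means clustering often terminates at a local optimum
Initialization can be important to find high-quality clusters
Need to specify K, the number of clusters, in advance
There are ways to automatically determine the “best” K
In practice, one often runs a range of values and selects the “best” K value

Sensitive to noisy data and outliers
Variations: using K-medians, K-medoids, etc. (the median or a medoid as the center)
K-means is applicable only to objects in a continuous n-dimensional space
Using K-modes for categorical data

2. How can the parameter k be determined effectively in k-means?

Use the elbow (knee) point of the SSE-versus-K curve

How to choose the K initial objects

K-Means++ (Arthur & Vassilvitskii’07):

Basic idea
The first centroid is selected at random
The next centroid selected is the one that is farthest from the currently selected (selection is based on a weighted probability score)
The selection continues until K centroids are obtained

Algorithm details

Select by weight: the farther a point is from the existing centers, the larger its weight

Weight of a point = (squared distance to its nearest selected center) / (sum of those squared distances over all candidate points)

Empty clusters

Strategies

Choosing the replacement centroid as follows makes the SSE drop quickly

1. Choose the point or points that contribute most to the SSE as the new cluster centers

2. Choose a point or points from the cluster that contributes most to the SSE as the new cluster centers

If there are several empty clusters, repeat the steps above

Alternatively, produce more clusters than needed and then merge them

Check whether the SSE decreases; if it does, swap the center

Cost(S) = SSE_new - SSE_old

Median version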

Hierarchical Methods

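Agglomerative hierarchical clustering

Divisive hierarchical clustering

e.g., text clustering

Agglomerative clustering is generally preferred over divisive clustering

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster

Agglomerative Clustering Algorithm: the core idea is to successively merge the closest clusters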

Basic algorithm is straightforward
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

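Key difficulty: computing the proximity between clusters

Single link: the distance between the two closest points of the two clusters; sensitive to noise and outliers, and does not reflect the overall structure
Complete link: the distance between the two farthest points
Average link: the average of the distances between all pairs of objects; computationally expensive
Centroid link: the distance between the centroids
Ward's method: based on the contribution to SSE; the distance is the increase Δ in SSE caused by merging the two clusters. Advantage: less affected by noise. Drawback: suited to globular clusters

Divisive Clustering

The possible splits cannot be enumerated exhaustively

Hierarchical Clustering: Problems and Limitations
Once a decision is made to combine two clusters, it cannot be undone (irreversible)
No global objective function is directly minimized
Different schemes have problems with one or more of the following:
Sensitivity to noise and outliers
Difficulty handling clusters of different sizes and non-globular shapes
Breaking large clusters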

Density- and Grid-Based Methods

DBSCAN

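DBSCAN (Density-Based Spatial Clustering of Applications with Noise) discovers clusters of arbitrary shape

Border point: not a core point, but falls within the Eps-neighborhood of a core point

Key concepts

Directly density-reachable: p is directly density-reachable from q if (1) p is in the Eps-neighborhood of q and (2) q is a core object

Density-reachable

Density-connected: p and q are both density-reachable from some object o

A cluster is defined as a maximal set of density-connected points

DBSCAN: Algorithm

Find all core objects
Pick an arbitrary unprocessed core point p and retrieve all points density-reachable from p; they form one cluster
Continue the process until all of the points have been processed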


Pseudocode
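
A minimal Python sketch of DBSCAN follows. It assumes Euclidean distance, a brute-force neighborhood search (so O(n^2) overall), and a neighborhood that includes the point itself when counting toward min_pts. The nine points are made up; -1 marks noise.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # Indices of all points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1               # provisionally noise; may become a border point
            continue
        cluster_id += 1                  # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id   # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:     # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels

# Two dense squares and one isolated point (hypothetical data).
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
print(labels)
```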

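Choosing Eps

Compute each point's distance to its k-th nearest neighbor

Sort these distances

Choose the distance at the knee of the sorted curve as Eps (the points beyond it are noise)

MinPts is set to k - 1

Computational complexity
If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects
Otherwise, the complexity is O(n^2)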

Grid-Based Clustering Methods


Evaluation of Clustering

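Clustering tendency: whether the data have any cluster structure at all

Hopkins Statistic

wi: the distance from each sampled data object to its nearest neighbor; ui: the distance from each object of an artificially generated, uniformly distributed data set to its nearest neighbor

Clustering quality evaluation

External methods: labels are available; cluster the data and compare the result with the labels

Q(C, T)

Cluster homogeneity: most objects in a cluster share the same label

Cluster completeness: the objects of one class are assigned to as few clusters as possible

Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., a "miscellaneous" or "other" category)

Small cluster preservation: splitting a small cluster is more harmful than splitting a large one

Internal methods: evaluate characteristics of the clustering itself, e.g., SSE and SSB

Evaluation measures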


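Two matrices
One row and one column for each data point

Ideal similarity matrix
An entry is 1 if the associated pair of points belongs to the same cluster
An entry is 0 if the associated pair of points belongs to different clusters

Proximity matrix
Based on distances or similar measures

Compute the correlation between the two matrices

A high absolute correlation indicates a good clustering

Not suitable for density-based clusters

Visualization

Order the matrix so that points from the same cluster are placed together

3. Clustering stability: sensitivity to the algorithm's parameters, the input order, etc.

Silhouette coefficient: a measure of how good a clustering is. The best value is 1 and the worst is -1; values near 0 indicate overlapping clusters.

Directly density-reachable: if xi lies in the Eps-neighborhood of xj and xj is a core object, then xi is directly density-reachable from xj. The converse does not necessarily hold: xj is directly density-reachable from xi only if xi is also a core object. The relation is therefore not symmetric.
Density-reachable: for xi and xj, if there is a sequence of samples p1, p2, ..., pT with p1 = xi, pT = xj, and each p(t+1) directly density-reachable from p(t), then xj is density-reachable from xi. Density-reachability is thus transitive. The intermediate samples p1, p2, ..., p(T-1) must all be core objects, because only core objects can make other samples directly density-reachable. Density-reachability is not symmetric either, which follows from the asymmetry of direct density-reachability.
Density-connected: for xi and xj, if there is a core object xk from which both xi and xj are density-reachable, then xi and xj are density-connected. Density-connectivity is symmetric.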

Chapter 7 Outlier Analysis

Review questions

1. Types of outliers

2. How the local outlier factor is computed and what it means

Outlier and Outlier Analysis

Outliers

Outliers are different from the noise data
Noise is random error or variance in a measured variable
Noise should be removed before outlier detection

Types of Outliers

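1. Types of outliers

Global outlier (or point anomaly)

Differs from most of the points, e.g., detecting abnormal credit standing among bank customers

Contextual outlier (or conditional outlier)

Depends on the context

Collective outliers

Outlier detection takes place after the anomaly has occurred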

A data set may have multiple types of outlier
One object may belong to more than one type of outlier

Outlier Detection Methods

Based on whether user-labeled examples of outliers can be obtained:
• Supervised, semi-supervised vs. unsupervised methods

Semi-supervised: few labels are available, or labels are known for only part of the data

Based on assumptions about normal data and outliers:
• Statistical, proximity-based, and clustering-based methods

Supervised: treated as a classification problem

Challenges
• Imbalanced classes, i.e., outliers are rare
• Catch as many outliers as possible, i.e., recall is more important than accuracy

Unsupervised: treated as a clustering problem

Weakness:
Cannot detect collective outliers

Unsupervised methods may have a high false positive rate but still miss many real outliers.

• Problem 1: Hard to distinguish noise from outliers
• Problem 2: Costly, since clustering comes first, although there are far fewer outliers than normal objects

Semi-Supervised Methods

Train a model

Statistical Approaches

1 Statistical approaches assume that the objects in a data set are generated by a stochastic process (a generative model)
2 Idea: learn a generative model fitting the given data set, and then identify the objects in low-probability regions of the model as outliers
3 Methods are divided into two categories: parametric vs. non-parametric

Parametric Methods I: Detection of Univariate Outliers Based on Normal Distribution

The 3σ rule

Grubbs' test (maximum normed residual test)

Parametric Methods II: Detection of Multivariate Outliers

Method 1. Compute the Mahalanobis distance

• Let ō be the mean vector for a multivariate data set. The Mahalanobis distance from an object o to ō is MDist(o, ō) = (o - ō)^T S^(-1) (o - ō), where S is the covariance matrix
• Use Grubbs' test on this measure to detect outliers

Method 2. Use the χ2-statistic

• χ2 = Σ_{i=1..n} (o_i - E_i)^2 / E_i, where o_i is the value of object o on the i-th dimension, E_i is the mean of the i-th dimension among all objects, and n is the dimensionality

• If the χ2-statistic is large, then object o is an outlier

Non-Parametric Methods: Detection

Using Histogram

• Too small bin size → normal objects in empty/rare bins, false positive
• Too big bin size → outliers in some frequent bins, false negative
Problem: Hard to choose an appropriate bin size for the histogram

Proximity-Based Approaches

Distance-Based Outlier Detection

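r (distance threshold) and π (fraction threshold) are preset; |D| is the number of objects in the data set. An object o is a DB(r, π) outlier if |{o' | dist(o, o') ≤ r}| / |D| ≤ π

The numerator is the number of objects in the r-neighborhood of o

Equivalently, one can check the distance between o and its k-th nearest neighbor ok, where k = ⌈π|D|⌉; o is an outlier if dist(o, ok) > r

When the data density is uneven, local outliers cannot be found

Density-Based Outlier Detection

k-distance: the distance between o and its k-th nearest neighbor

k-distance neighborhood of o: Nk(o) = {o' | o' in D, dist(o, o') ≤ distk(o)}
• Nk(o) could contain more than k objects, since multiple objects may have identical distance to o

Local reachability density: the inverse of the average reachability distance from o to its k nearest neighbors

2. How the local outlier factor is computed and what it means

The LOF (local outlier factor) of an object o is the average, over o's k nearest neighbors o', of the ratio lrd(o') / lrd(o) between the neighbor's local reachability density and that of o

LOF ≈ 1: o has a density similar to its neighbors and is not an outlier; LOF well above 1 indicates an outlier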

Clustering-Based Approaches

An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance between the object and its closest cluster, or (3) it belongs to a small or sparse cluster

Case 1: Not belonging to any cluster
• Identify animals not part of a flock: using a density-based clustering method such as DBSCAN

Case 2: Far from its closest cluster
• Using k-means, partition the data points into clusters
• For each object o, assign an outlier score based on its distance from its closest center
• If dist(o, co)/avg_dist(co) is large, o is likely an outlier