[Data Mining] BUPT International School, Junior Fall Final Review

Update after the exam:

The 2021 short-answer questions were the curse of dimensionality and hierarchical clustering, which was fine.

The big questions were: 1. A similarity matrix with MIN, MAX, and group average, then drawing a dendrogram

2. The Apriori algorithm, deriving association rules

3. Bayesian probability

4. That series of Laplace-smoothing calculations

(Scientific calculators are not allowed, so the arithmetic really is a hassle; consider buying a basic calculator that can handle the simplest operations. Logs don't come up much, and the teacher said leaving them unevaluated is fine.)
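The Laplace question is pure plug-in arithmetic once you know the formula; a minimal sketch (the counts below are invented, not taken from any exam):

```python
# Laplace (add-one) smoothing for a naive Bayes conditional probability:
# P(x | y) = (n_c + 1) / (n + v), where n_c is the count of attribute value x
# within class y, n is the number of training records in class y, and v is
# the number of distinct values the attribute can take.

def laplace_estimate(n_c, n, v):
    return (n_c + 1) / (n + v)

# Hypothetical example: a value never seen in the class (n_c = 0),
# 10 records in the class, 3 possible attribute values.
print(laplace_estimate(0, 10, 3))  # → 1/13; the zero-frequency problem is avoided
```

Without smoothing, a single unseen attribute value would force the whole naive Bayes product to zero; the +1 in the numerator prevents that.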

Overall it is fairly easy, but the multiple-choice and fill-in-the-blank questions can get detailed; going through the slides and paying attention in class is roughly enough to remember them. Grading is generous because of the coursework marks and score conversion.

This post covers basic concept questions; the remaining big questions all fall within the homework problems the teacher handed out.

Data Quality

Noise

For objects, noise is an extraneous object

For attributes, noise refers to modification of original values

Outlier

Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set

Data Preprocessing

Aggregation

Combining two or more attributes (or objects) into a single attribute (or object)

Purpose

Data reduction

Change of scale

More “stable” data

Sampling

Sampling is the main technique employed for data reduction.
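A minimal sketch of simple random sampling without replacement, the baseline sampling scheme (the data set and sample size below are made up):

```python
import random

# Simple random sampling without replacement: every object has the same
# probability of being selected, and no object is picked twice.
random.seed(42)  # fixed seed so the sketch is reproducible

data = list(range(100))           # a hypothetical data set of 100 objects
sample = random.sample(data, 10)  # keep a 10% sample for data reduction
print(len(sample))                # → 10
```

The key property for data mining is that a representative sample behaves almost like the full data set while being much cheaper to process.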

Dimensionality Reduction

Purpose:

Avoid the curse of dimensionality

Reduce the amount of time and memory required by data mining algorithms

Allow data to be more easily visualized

May help to eliminate irrelevant features or reduce noise
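The curse of dimensionality (a short-answer topic above) can be demonstrated in a few lines: in high dimensions, the nearest and farthest neighbors of a point end up almost equally far away, so distance-based notions of "closeness" degrade. A sketch with made-up uniform data:

```python
import math
import random

# As dimensionality grows, distances from a fixed point to random points
# concentrate around a common value: the relative spread between the
# farthest and nearest neighbor shrinks toward zero.

def spread(dim, n=200):
    rng = random.Random(0)  # seeded so the sketch is reproducible
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    center = [0.5] * dim
    dists = [math.dist(center, p) for p in pts]
    return (max(dists) - min(dists)) / min(dists)  # relative spread

print(spread(2))    # large relative spread in 2 dimensions
print(spread(100))  # far smaller relative spread in 100 dimensions
```

This is exactly why dimensionality reduction helps nearest-neighbor and clustering methods.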

Feature Subset Selection

Feature Creation

Create new attributes that can capture the important information in a data set much more efficiently than the original attributes

Discretization

Discretization is the process of converting a continuous attribute into an ordinal attribute
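Equal-width binning is one common way to do this conversion; a minimal sketch (the ages and bin count below are invented):

```python
# Equal-width discretization: split the range of a continuous attribute into
# k intervals of equal length and replace each value with its bin index,
# which is an ordinal label.

def equal_width_bins(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # min(...) keeps the maximum value inside the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [3, 11, 25, 37, 58, 62, 80]  # hypothetical continuous attribute
print(equal_width_bins(ages, 4))    # → [0, 0, 1, 1, 2, 3, 3]
```

Equal-frequency binning (equal counts per bin) is the usual alternative when the values are skewed.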

Binarization

Binarization maps a continuous or categorical attribute onto one or more binary variables

Attribute Transformation

Clustering Concepts Review

Clustering

-Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Types of clustering:
(!!) Partitional Clustering

-A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

(!!) Hierarchical Clustering

-A set of nested clusters organized as a hierarchical tree

Agglomerative (e.g. MIN, MAX, Group Average)

Start with the points as individual clusters

At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left

Divisive (e.g. DIANA)

Start with one, all-inclusive cluster

At each step, split a cluster until each cluster contains an individual point (or there are k clusters)
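The dendrogram exam question follows exactly the agglomerative merge loop. A minimal sketch under the MIN (single-link) criterion, on an invented symmetric distance matrix; replacing `min` with `max` gives MAX (complete link), and an average of the pairwise distances gives group average:

```python
# Agglomerative clustering with the MIN (single-link) criterion: repeatedly
# merge the two closest clusters, where the inter-cluster distance is the
# minimum pairwise distance between their points.

def single_link(dist, k):
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None  # (distance, index a, index b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the closest pair
    return clusters

# hypothetical symmetric distance matrix over 4 points
D = [[0, 1, 4, 5],
     [1, 0, 3, 6],
     [4, 3, 0, 2],
     [5, 6, 2, 0]]
print(single_link(D, 2))  # → [[0, 1], [2, 3]]
```

The order of merges (first {0,1} at distance 1, then {2,3} at distance 2) is exactly what you read off when drawing the dendrogram by hand.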

Types of clusters:

Well-Separated Clusters

A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

(!!) Prototype-Based (Center-Based) (e.g. K-means)

A cluster is a set of objects in which each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster.
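K-means is the standard prototype-based method: alternate between assigning each point to its nearest centroid and recomputing each centroid as the mean of its points. A minimal sketch with invented 2-D points and hand-picked initial centroids:

```python
import math

# Minimal K-means: the centroids are the cluster prototypes.

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        groups = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda c: math.dist(p, centroids[c]))
            groups[i].append(p)
        # update step: each centroid becomes the mean of its group
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

# two obvious 2-D blobs (hypothetical data) and deliberately chosen seeds
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(kmeans(pts, [(0, 0), (9, 9)]))  # centroids near (0.33, 0.33) and (9.33, 9.33)
```

In practice the initial centroids are chosen randomly and the loop stops when assignments no longer change; the fixed seeds here just keep the sketch deterministic.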

Density-Based (e.g. DBSCAN)

A cluster is a dense region of points, separated from other regions of high density by regions of low density.
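DBSCAN starts from classifying every point as core, border, or noise by counting eps-neighborhoods; clusters are then grown from the core points. A minimal sketch of just that classification step (the points, eps, and min_pts below are made up):

```python
import math

# DBSCAN's point taxonomy: a point is a core point if its eps-neighborhood
# (including itself) contains at least min_pts points; a border point lies in
# some core point's neighborhood without being core itself; the rest is noise.

def label_points(points, eps, min_pts):
    n = len(points)
    core = [
        sum(1 for q in points if math.dist(p, q) <= eps) >= min_pts
        for p in points
    ]
    labels = []
    for i, p in enumerate(points):
        if core[i]:
            labels.append("core")
        elif any(core[j] and math.dist(p, points[j]) <= eps for j in range(n)):
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# hypothetical data: a dense run, one fringe point, one far outlier
pts = [(0, 0), (0.5, 0), (1, 0), (1.8, 0), (9, 9)]
print(label_points(pts, eps=1.0, min_pts=3))
# → ['core', 'core', 'core', 'border', 'noise']
```

Because noise points are simply labeled and left out, density-based clusters can take arbitrary shapes, unlike the globular prototype-based ones.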

Grid-Based

Grid-based clustering quantizes the space into a finite number of cells that form a grid structure; all clustering is then carried out on the grid.

Classification Concepts Review

Causes of overfitting

1. Noise

2. Lack of representative samples

3. Multiple comparisons procedure (e.g. the class monitor guessing the right answers by luck)

Methods for estimating generalization error

1. Resubstitution estimate (training error)

The resubstitution estimate assumes that the training set represents the overall data well, so the training error (resubstitution error) provides an optimistic estimate of the generalization error.

2. Incorporating model complexity

Given two models with similar generalization errors, the simpler model should be preferred over the more complex one

3. Estimating statistical bounds

4. Using a validation set

Other classification methods

Rule-Based Classifier

RIPPER algorithm: To illustrate a direct approach to rule extraction, consider a widely used rule induction algorithm called RIPPER.

Nearest-Neighbor Classifiers

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

Choosing the value of k:

If k is too small, the nearest-neighbor classifier is susceptible to overfitting caused by noise in the training data

If k is too large, the classifier may misclassify a test instance, because the list of nearest neighbors may then include data points that are far from it
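The definition above translates directly into code: sort the training records by distance to x, keep the k closest, and take a majority vote. A sketch with invented 2-D training data:

```python
import math
from collections import Counter

# k-nearest-neighbor classification: predict the majority class among the
# k training records closest to the query point.

def knn_predict(train, x, k):
    # train is a list of (point, label) pairs
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# hypothetical training data: two well-separated classes
train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_predict(train, (1, 1), 3))  # → "A"
print(knn_predict(train, (5, 5), 3))  # → "B"
```

With k = 1 this sketch would overfit any noisy label near the query; with k = 6 it would always vote over the entire training set, which is the too-large-k failure mode described above.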

Artificial Neural Networks

Perceptron: a single-layer network that contains only input and output nodes

Weight update formula: w_j ← w_j + λ(y − ŷ)x_j, where λ is the learning rate, y the true label, and ŷ the predicted label

Hidden layers: intermediate layers between the input and output layers (multilayer networks)
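The weight-update rule is easy to exercise on a toy task; this is a generic perceptron sketch (the learning rate, epoch count, and AND data are my own choices, not from the course):

```python
# Perceptron learning sketch: for each training record, update every weight by
# w_j <- w_j + lr * (y - y_hat) * x_j.

def predict(w, x):
    # a constant 1 is appended to x to act as the bias input
    return 1 if sum(wi * xi for wi, xi in zip(w, x + [1])) > 0 else 0

def train_perceptron(data, lr=0.1, epochs=20):
    w = [0.0, 0.0, 0.0]  # two input weights plus the bias weight
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(w, x)  # 0 when correct, ±1 when wrong
            w = [wi + lr * err * xi for wi, xi in zip(w, x + [1])]
    return w

# logical AND, which is linearly separable, so the perceptron converges
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w = train_perceptron(data)
print([predict(w, x) for x, _ in data])  # → [0, 0, 0, 1]
```

A single-layer perceptron can only learn linearly separable functions (it fails on XOR); that limitation is exactly why the hidden layers above are needed.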

Support Vector Machine

Find a linear hyperplane (decision boundary) that will separate the data

Maximize the margin: the distance between the two parallel supporting hyperplanes is called the margin of the classifier; for a decision boundary w·x + b = 0 with supporting hyperplanes w·x + b = ±1, the margin is 2/‖w‖.
