Chapter 1 Introduction

Jason_Lu_USA

Article link: https://blog.csdn.net/dosion_jack/article/details/53964440

Chapter 1 notes:

Topics:

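What is data mining?
Which challenges motivated the development of data mining?
The origins of data mining (and its overlap with other disciplines)
What are the main data mining tasks?
Scope and organization of the book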

1. What is data mining?

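BI: Business Intelligence
Definition: Data mining is the process of automatically discovering useful information in large data repositories.
Information retrieval vs. data mining: information retrieval (for example, looking up records with a database query or finding pages with a search engine) relies on traditional computer science techniques, whereas data mining goes beyond them to discover patterns that would otherwise remain unknown.
Data mining is one part of the knowledge discovery in databases (KDD) process. Within that process, data preprocessing is the most time-consuming and laborious stage.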

2. Which challenges motivated the development of data mining?

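Scalability (large data volumes): traditional data processing techniques cannot cope with analyzing very large data sets.
High Dimensionality: traditional techniques can handle data with a small number of attributes, but they often break down on high-dimensional data.
Heterogeneous and Complex Data (data of diverse types and complex structure)
Data Ownership and Distribution (data held by separate owners and distributed across locations)
Non-traditional Analysis: traditional analysis follows a hypothesize-and-test approach, which becomes labor-intensive and slow when applied to large data sets.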

3. The origins of data mining (and its overlap with other disciplines)

4. What are the main data mining tasks?

The tasks fall into two main categories:
- Predictive tasks: predict the value of a target (dependent) variable from explanatory (independent) variables
- Descriptive tasks: summarize the relationships in the data
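
The difference between the two categories can be sketched on a toy data set. The example below is only an illustration under stated assumptions: the data (hours studied and exam scores) is made up, and the helper names `fit_line` and `pearson` are hypothetical. The predictive part fits a least-squares line to predict the target from the explanatory variable, and the descriptive part summarizes the relationship with a correlation coefficient.

```python
# Toy data, invented for illustration: hours studied -> exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]        # explanatory (independent) variable
scores = [52.0, 58.0, 61.0, 70.0, 74.0]  # target (dependent) variable


def mean(values):
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Predictive task: estimate the score for an unseen value of the input.
slope, intercept = fit_line(hours, scores)
predicted = slope * 6.0 + intercept

# Descriptive task: summarize how strongly the two variables are related.
r = pearson(hours, scores)

print(f"predicted score for 6 hours: {predicted:.1f}")
print(f"correlation between hours and score: {r:.3f}")
```

The predictive result is a value for a new input, while the descriptive result is a summary of the data already observed.
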
Main tasks covered in this book:

- Predictive modeling: the task of building a model for the target variable as a function of the explanatory variables.
  - classification: predicting a discrete target variable
  - regression: predicting a continuous target variable
- Association analysis: discover patterns that describe strongly associated features in the data.
- Cluster analysis: find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters.
- Anomaly detection: identify observations whose characteristics are significantly different from the rest of the data.
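
Three of the tasks listed above can be sketched in a few lines each. This is a minimal illustration, not the book's algorithms: the data is invented, the function names are hypothetical, and the specific techniques (support and confidence for association rules, a one-dimensional k-means for clustering, a z-score threshold for anomalies) are simple stand-ins chosen for brevity.

```python
from math import sqrt

# --- Association analysis: support and confidence of a rule lhs -> rhs ---
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(lhs, rhs, baskets):
    """How often rhs appears in baskets that contain lhs."""
    return support(lhs | rhs, baskets) / support(lhs, baskets)


rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)


# --- Cluster analysis: a tiny one-dimensional k-means ---
def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then recompute the centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
centers, clusters = kmeans_1d(points, centers=[0.0, 6.0])


# --- Anomaly detection: flag values far from the mean (z-score rule) ---
def zscore_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    m = sum(values) / len(values)
    std = sqrt(sum((v - m) ** 2 for v in values) / len(values))
    if std == 0:
        return []
    return [v for v in values if abs(v - m) / std > threshold]


readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 25.0]
outliers = zscore_outliers(readings)

print("support:", rule_support, "confidence:", rule_confidence)
print("cluster centers:", centers)
print("outliers:", outliers)
```

Each function works on a different shape of data (item sets, unlabeled points, a series of readings), which mirrors how the tasks differ in what they look for.
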

5. Scope and organization of the book

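chapter 1 (introduction)
chapter 2 (data fundamentals) + chapter 3 (data exploration)
chapter 4 (classification basics) + chapter 5 (key classification techniques)
chapter 6 (association analysis basics) + chapter 7 (advanced association analysis)
chapter 8 (different types of clustering and three specific clustering techniques) + chapter 9 (advanced cluster analysis)
chapter 10 (anomaly detection)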