Process of knowledge discovery in databases

Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information.

 

The process of knowledge discovery in databases:

Input Data

-> Data Preprocessing (Feature Selection, Dimensionality Reduction, Normalization, Data Subsetting) (perhaps the most laborious and time-consuming step)

-> Data Mining

-> Postprocessing (Filtering Patterns, Visualization, Pattern Interpretation)

-> Information
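
The flow above can be sketched as three functions applied in sequence. This is only a minimal illustration, not a reference implementation: the market-basket data, the function names (`preprocess`, `mine_pairs`, `postprocess`, `kdd`), and the choice of co-occurring item pairs as the "pattern" are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def preprocess(raw_transactions):
    """Data preprocessing: normalize item names and drop empty transactions."""
    cleaned = []
    for transaction in raw_transactions:
        # Lower-case and trim each item; the set removes repeats within a basket.
        items = {item.strip().lower() for item in transaction if item.strip()}
        if items:
            cleaned.append(sorted(items))
    return cleaned


def mine_pairs(transactions):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(items, 2))
    return counts


def postprocess(pair_counts, min_support):
    """Postprocessing: filter out rare patterns and order the rest."""
    kept = [(pair, n) for pair, n in pair_counts.items() if n >= min_support]
    return sorted(kept, key=lambda entry: (-entry[1], entry[0]))


def kdd(raw_transactions, min_support=2):
    """Input data -> preprocessing -> data mining -> postprocessing -> information."""
    return postprocess(mine_pairs(preprocess(raw_transactions)), min_support)


# Hypothetical raw input with inconsistent case, stray whitespace, and empty rows.
raw = [
    ["Bread", " milk "],
    ["bread", "Milk", "eggs"],
    ["eggs", ""],
    [],
    ["milk", "BREAD", "bread"],
]
information = kdd(raw)
print(information)
```

Keeping each stage as a separate function mirrors the diagram: any stage can be swapped for a different technique without touching the others.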

 

The purpose of preprocessing is to transform the raw input data into a format appropriate for subsequent analysis.

Steps involved in data preprocessing:

1. fusing data from multiple sources;

2. cleaning data to remove noise and duplicate observations;

3. selecting records and features that are relevant to the data mining task at hand.
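
The three steps can be illustrated with a small sketch on plain Python dictionaries. Everything here is hypothetical: the two sources (`customers`, `purchases`), the field names, and the rule that an age outside 0 to 120 counts as noise are assumptions chosen only to make each step visible.

```python
def fuse(customers, purchases):
    """Step 1: fuse two sources by joining records on a shared id."""
    purchases_by_id = {p["id"]: p for p in purchases}
    return [
        {**c, **purchases_by_id[c["id"]]}
        for c in customers
        if c["id"] in purchases_by_id
    ]


def clean(records):
    """Step 2: drop noisy records (implausible age) and exact duplicates."""
    seen, kept = set(), []
    for record in records:
        if not 0 <= record["age"] <= 120:  # assumed noise rule
            continue
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept


def select(records, features):
    """Step 3: keep only the features relevant to the mining task."""
    return [{f: record[f] for f in features} for record in records]


# Hypothetical sources: one noisy age (-5) and one duplicated customer row.
customers = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},
    {"id": 3, "age": 51},
    {"id": 3, "age": 51},
]
purchases = [
    {"id": 1, "spend": 120.0, "coupon": "A"},
    {"id": 2, "spend": 80.0, "coupon": "B"},
    {"id": 3, "spend": 200.0, "coupon": "A"},
]

prepared = select(clean(fuse(customers, purchases)), ["age", "spend"])
print(prepared)
```

In practice these steps are usually done with a data-frame library; the plain-dictionary version is used here only so that each step stays explicit.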

Reposted from: https://www.cnblogs.com/johnpher/archive/2013/01/18/2866950.html
