


  -- Gregory Piatetsky-Shapiro, President of KDnuggets
  -- From the foreword to this book by Professor Christos Faloutsos, Carnegie Mellon University




  The computerization of our society has substantially enhanced our capabilities for both generating and collecting data from diverse sources. A tremendous amount of data has flooded almost every aspect of our lives. This explosive growth in stored or transient data has generated an urgent need for new techniques and automated tools that can intelligently assist us in transforming the vast amounts of data into useful information and knowledge. This has led to the generation of a promising and flourishing frontier in computer science called data mining, and its various applications. Data mining, also popularly referred to as knowledge discovery from data (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories, or data streams.
  This book explores the concepts and techniques of knowledge discovery and data mining. As a multidisciplinary field, data mining draws on work from areas including statistics, machine learning, pattern recognition, database technology, information retrieval, network science, knowledge-based systems, artificial intelligence, high-performance computing, and data visualization. We focus on issues relating to the feasibility, usefulness, effectiveness, and scalability of techniques for the discovery of patterns hidden in large data sets. As a result, this book is not intended as an introduction to statistics, machine learning, database systems, or other such areas, although we do provide some background knowledge to facilitate the reader's comprehension of their respective roles in data mining. Rather, the book is a comprehensive introduction to data mining. It is useful for computing science students, application developers, and business professionals, as well as researchers involved in any of the disciplines previously listed.
  Data mining emerged during the late 1980s, made great strides during the 1990s, and continues to flourish into the new millennium. This book presents an overall picture of the field, introducing interesting data mining techniques and systems and discussing applications and research directions. An important motivation for writing this book was the need to build an organized framework for the study of data mining--a challenging task, owing to the extensive multidisciplinary nature of this fast-developing field. We hope that this book will encourage people with different backgrounds and experiences to exchange their views regarding data mining so as to contribute toward the further promotion and shaping of this exciting and dynamic field.
  Organization of the Book
  Since the publication of the first two editions of this book, great progress has been made in the field of data mining. Many new data mining methodologies, systems, and applications have been developed, especially for handling new kinds of data, including information networks, graphs, complex structures, and data streams, as well as text, Web, multimedia, time-series, and spatiotemporal data. Such fast development and rich, new technical contents make it difficult to cover the full spectrum of the field in a single book. Instead of continuously expanding the coverage of this book, we have decided to cover the core material in sufficient scope and depth, and leave the handling of complex data types to a separate forthcoming book.
  The third edition substantially revises the first two editions of the book, with numerous enhancements and a reorganization of the technical contents. The core technical material, which handles mining on general data types, is expanded and substantially enhanced. Several individual chapters for topics from the second edition (e.g., data preprocessing, frequent pattern mining, classification, and clustering) are now augmented and each split into two chapters for this new edition. For these topics, one chapter encapsulates the basic concepts and techniques while the other presents advanced concepts and methods.
  Chapters from the second edition on mining complex data types (e.g., stream data, sequence data, graph-structured data, social network data, and multirelational data, as well as text, Web, multimedia, and spatiotemporal data) are now reserved for a new book that will be dedicated to advanced topics in data mining. Still, to support readers in learning such advanced topics, we have placed an electronic version of the relevant chapters from the second edition onto the book's web site as companion material for the third edition.
  The chapters of the third edition are described briefly as follows, with emphasis on the new material.
  Chapter 1 provides an introduction to the multidisciplinary field of data mining. It discusses the evolutionary path of information technology, which has led to the need for data mining, and the importance of its applications. It examines the data types to be mined, including relational, transactional, and data warehouse data, as well as complex data types such as time-series, sequences, data streams, spatiotemporal data, multimedia data, text data, graphs, social networks, and Web data. The chapter presents a general classification of data mining tasks, based on the kinds of knowledge to be mined, the kinds of technologies used, and the kinds of applications that are targeted. Finally, major challenges in the field are discussed.
  Chapter 2 introduces the general data features. It first discusses data objects and attribute types and then introduces typical measures for basic statistical data descriptions. It overviews data visualization techniques for various kinds of data. In addition to methods of numeric data visualization, methods for visualizing text, tags, graphs, and multidimensional data are introduced. Chapter 2 also introduces ways to measure similarity and dissimilarity for various kinds of data.
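  As a small illustration of the kind of similarity and dissimilarity measures mentioned for Chapter 2, the following Python sketch computes a Euclidean distance between numeric vectors and a Jaccard similarity between sets. The function names and sample values are illustrative assumptions and are not taken from the book.

```python
import math


def euclidean_distance(x, y):
    """Dissimilarity between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def jaccard_similarity(a, b):
    """Similarity between two sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed in this sketch: two empty sets are identical
    return len(a & b) / len(a | b)


print(euclidean_distance([1, 2], [4, 6]))                       # 5.0
print(jaccard_similarity({"milk", "bread"}, {"milk", "eggs"}))  # one shared item out of three
```

  A distance grows as objects differ, while a similarity grows as they agree, so the two are often converted into one another depending on what an algorithm expects.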
  Chapter 3 introduces techniques for data preprocessing. It first introduces the concept of data quality and then discusses methods for data cleaning, data integration, data reduction, data transformation, and data discretization.
  Chapters 4 and 5 provide a solid introduction to data warehouses, OLAP (online analytical processing), and data cube technology. Chapter 4 introduces the basic concepts, modeling, design architectures, and general implementations of data warehouses and OLAP, as well as the relationship between data warehousing and other data generalization methods. Chapter 5 takes an in-depth look at data cube technology, presenting a detailed study of methods of data cube computation, including Star-Cubing and high-dimensional OLAP methods. Further explorations of data cube and OLAP technologies are discussed, such as sampling cubes, ranking cubes, prediction cubes, multifeature cubes for complex analysis queries, and discovery-driven cube exploration.
  Chapters 6 and 7 present methods for mining frequent patterns, associations, and correlations in large data sets. Chapter 6 introduces fundamental concepts, such as market basket analysis, with many techniques for frequent itemset mining presented in an organized way. These range from the basic Apriori algorithm and its variations to more advanced methods that improve efficiency, including the frequent pattern growth approach, frequent pattern mining with vertical data format, and mining closed and max frequent itemsets. The chapter also discusses pattern evaluation methods and introduces measures for mining correlated patterns. Chapter 7 is on advanced pattern mining methods. It discusses methods for pattern mining in multilevel and multidimensional space, mining rare and negative patterns, mining colossal patterns and high-dimensional data, constraint-based pattern mining, and mining compressed or approximate patterns. It also introduces methods for pattern exploration and application, including semantic annotation of frequent patterns.
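  To give a flavor of frequent itemset mining, here is a minimal Python sketch of the level-wise Apriori idea: frequent k-itemsets are built only from frequent (k-1)-itemsets. It is an illustrative sketch, not the book's pseudocode; the function name, the use of an absolute support count, and the toy transactions are all assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Scan all transactions and keep candidates meeting min_support.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = count(items)
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
for itemset, support in sorted(apriori(baskets, 3).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support)
```

  The prune step is what keeps the candidate set small: an itemset cannot be frequent if any of its subsets is infrequent, so such candidates are discarded before the data is scanned again.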
  Chapters 8 and 9 describe methods for data classification. Due to the importance and diversity of classification methods, the contents are partitioned into two chapters. Chapter 8 introduces basic concepts and methods for classification, including decision tree induction, Bayes classification, and rule-based classification. It also discusses model evaluation and selection methods and methods for improving classification accuracy, including ensemble methods and how to handle imbalanced data. Chapter 9 discusses advanced methods for classification, including Bayesian belief networks, the neural network technique of backpropagation, support vector machines, classification using frequent patterns, k-nearest-neighbor classifiers, case-based reasoning, genetic algorithms, rough set theory, and fuzzy set approaches. Additional topics include multiclass classification, semi-supervised classification, active learning, and transfer learning.
  Cluster analysis forms the topic of Chapters 10 and 11. Chapter 10 introduces the basic concepts and methods for data clustering, including an overview of basic cluster analysis methods, partitioning methods, hierarchical methods, density-based methods, and grid-based methods. It also introduces methods for the evaluation of clustering. Chapter 11 discusses advanced methods for clustering, including probabilistic model-based clustering, clustering high-dimensional data, clustering graph and network data, and clustering with constraints.
  Chapter 12 is dedicated to outlier detection. It introduces the basic concepts of outliers and outlier analysis and discusses various outlier detection methods from the view of degree of supervision (i.e., supervised, semi-supervised, and unsupervised methods), as well as from the view of approaches (i.e., statistical methods, proximity-based methods, clustering-based methods, and classification-based methods). It also discusses methods for mining contextual and collective outliers, and for outlier detection in high-dimensional data.
  Finally, in Chapter 13, we discuss trends, applications, and research frontiers in data mining. We briefly cover mining complex data types, including mining sequence data (e.g., time series, symbolic sequences, and biological sequences), mining graphs and networks, and mining spatial, multimedia, text, and Web data. In-depth treatment of data mining methods for such data is left to a book on advanced topics in data mining, the writing of which is in progress. The chapter then moves ahead to cover other data mining methodologies, including statistical data mining, foundations of data mining, visual and audio data mining, as well as data mining applications. It discusses data mining for financial data analysis, for industries like retail and telecommunication, for use in science and engineering, and for intrusion detection and prevention. It also discusses the relationship between data mining and recommender systems. Because data mining is present in many aspects of daily life, we discuss issues regarding data mining and society, including ubiquitous and invisible data mining, as well as privacy, security, and the social impacts of data mining. We conclude our study by looking at data mining trends.
  Throughout the text, italic font is used to emphasize terms that are defined, while bold font is used to highlight or summarize main ideas. Sans serif font is used for reserved words. Bold italic font is used to represent multidimensional quantities.
  This book has several strong features that set it apart from other texts on data mining. It presents a very broad yet in-depth coverage of the principles of data mining. The chapters are written to be as self-contained as possible, so they may be read in order of interest by the reader. Advanced chapters offer a larger-scale view and may be considered optional for interested readers. All of the major methods of data mining are presented. The book presents important topics in data mining regarding multidimensional OLAP analysis, which is often overlooked or minimally treated in other data mining books. The book also maintains web sites with a number of online resources to aid instructors, students, and professionals in the field. These are described further in the following.
  To the Instructor
  This book is designed to give a broad, yet detailed overview of the data mining field. It can be used to teach an introductory course on data mining at an advanced undergraduate level or at the first-year graduate level. Sample course syllabi are provided on the book's web sites (www.cs.uiuc.edu/~hanj/bk3 and www.booksite.mkp.com/datamining3e) in addition to extensive teaching resources such as lecture slides, instructors' manuals, and reading lists (see p. xiv).
  Figure P.1 A suggested sequence of chapters for a short introductory course.
  Depending on the length of the instruction period, the background of students, and your interests, you may select subsets of chapters to teach in various sequential orderings. For example, if you would like to give only a short introduction to students on data mining, you may follow the suggested sequence in Figure P.1. Notice that depending on the need, you can also omit some sections or subsections in a chapter if desired.
  Depending on the length of the course and its technical scope, you may choose to selectively add more chapters to this preliminary sequence. For example, instructors who are more interested in advanced classification methods may first add "Chapter 9. Classification: Advanced Methods"; those more interested in pattern mining may choose to include "Chapter 7. Advanced Pattern Mining"; whereas those interested in OLAP and data cube technology may like to add "Chapter 4. Data Warehousing and Online Analytical Processing" and "Chapter 5. Data Cube Technology."
  Alternatively, you may choose to teach the whole book in a two-course sequence that covers all of the chapters in the book, plus, when time permits, some advanced topics such as graph and network mining. Material for such advanced topics may be selected from the companion chapters available from the book's web site, accompanied with a set of selected research papers.
  Individual chapters in this book can also be used for tutorials or for special topics in related courses, such as machine learning, pattern recognition, data warehousing, and intelligent data analysis.
  Each chapter ends with a set of exercises, suitable as assigned homework. The exercises are either short questions that test basic mastery of the material covered, longer questions that require analytical thinking, or implementation projects. Some exercises can also be used as research discussion topics. The bibliographic notes at the end of each chapter can be used to find the research literature that contains the origin of the concepts and methods presented, in-depth treatment of related topics, and possible extensions.
  To the Student
  We hope that this textbook will spark your interest in the young yet fast-evolving field of data mining. We have attempted to present the material in a clear manner, with careful explanation of the topics covered. Each chapter ends with a summary describing the main points. We have included many figures and illustrations throughout the text to make the book more enjoyable and reader-friendly. Although this book was designed as a textbook, we have tried to organize it so that it will also be useful to you as a reference book or handbook, should you later decide to perform in-depth research in the related fields or pursue a career in data mining.
  What do you need to know to read this book?
  · You should have some knowledge of the concepts and terminology associated with statistics, database systems, and machine learning. However, we do try to provide enough background of the basics, so that if you are not so familiar with these fields or your memory is a bit rusty, you will not have trouble following the discussions in the book.
  · You should have some programming experience. In particular, you should be able to read pseudocode and understand simple data structures such as multidimensional arrays.
  To the Professional
  This book was designed to cover a wide range of topics in the data mining field. As a result, it is an excellent handbook on the subject. Because each chapter is designed to be as standalone as possible, you can focus on the topics that most interest you. The book can be used by application programmers and information service managers who wish to learn about the key ideas of data mining on their own. The book would also be useful for technical data analysis staff in banking, insurance, medicine, and retailing industries who are interested in applying data mining solutions to their businesses. Moreover, the book may serve as a comprehensive survey of the data mining field, which may also benefit researchers who would like to advance the state-of-the-art in data mining and extend the scope of data mining applications.
  The techniques and algorithms presented are of practical utility. Rather than selecting algorithms that perform well on small "toy" data sets, the algorithms described in the book are geared for the discovery of patterns and knowledge hidden in large, real data sets. Algorithms presented in the book are illustrated in pseudocode. The pseudocode is similar to the C programming language, yet is designed so that it should be easy to follow by programmers unfamiliar with C or C++. If you wish to implement any of the algorithms, you should find the translation of our pseudocode into the programming language of your choice to be a fairly straightforward task.
  Book Web Sites with Resources
  The book has a web site at www.cs.uiuc.edu/~hanj/bk3 and another with Morgan Kaufmann Publishers at www.booksite.mkp.com/datamining3e. These web sites contain many supplemental materials for readers of this book or anyone else with an interest in data mining. The resources include the following:
  · Slide presentations for each chapter. Lecture notes in Microsoft PowerPoint slides are available for each chapter.
  · Companion chapters on advanced data mining. Chapters 8 to 10 of the second edition of the book, which cover mining complex data types, are available on the book's web sites for readers who are interested in learning more about such advanced topics, beyond the themes covered in this book.
  · Instructors' manual. This complete set of answers to the exercises in the book is available only to instructors from the publisher's web site.
  · Course syllabi and lecture plans. These are given for undergraduate and graduate versions of introductory and advanced courses on data mining, which use the text and slides.
  · Supplemental reading lists with hyperlinks. Seminal papers for supplemental reading are organized per chapter.
  · Links to data mining data sets and software. We provide a set of links to data mining data sets and sites that contain interesting data mining software packages, such as IlliMine from the University of Illinois at Urbana-Champaign (http://illimine.cs.uiuc.edu).
  · Sample assignments, exams, and course projects. A set of sample assignments, exams, and course projects is available to instructors from the publisher's web site.
  · Figures from the book. This may help you to make your own slides for your classroom teaching.
  · Contents of the book in PDF format.
  · Errata on the different printings of the book. We encourage you to point out any errors in this book. Once the error is confirmed, we will update the errata list and include acknowledgment of your contribution.
  Comments or suggestions can be sent to hanj@cs.uiuc.edu. We would be happy to hear from you.


foreword to second edition
about the authors
chapter 1 introduction
1.1 why data mining?
1.2 what is data mining?
1.3 what kinds of data can be mined?
1.4 what kinds of patterns can be mined?
1.5 which technologies are used?
1.6 which kinds of applications are targeted?
1.7 major issues in data mining
1.8 summary
1.9 exercises
1.10 bibliographic notes
chapter 2 getting to know your data
2.1 data objects and attribute types
2.2 basic statistical descriptions of data
2.3 data visualization
2.4 measuring data similarity and dissimilarity
2.5 summary
2.6 exercises
2.7 bibliographic notes
chapter 3 data preprocessing
3.1 data preprocessing: an overview
3.2 data cleaning
3.3 data integration
3.4 data reduction
3.5 data transformation and data discretization
3.6 summary
3.7 exercises
3.8 bibliographic notes
chapter 4 data warehousing and online analytical processing
4.1 data warehouse: basic concepts
4.2 data warehouse modeling: data cube and olap
4.3 data warehouse design and usage
4.4 data warehouse implementation
4.5 data generalization by attribute-oriented induction
4.6 summary
4.7 exercises
4.8 bibliographic notes
chapter 5 data cube technology
5.1 data cube computation: preliminary concepts
5.2 data cube computation methods
5.3 processing advanced kinds of queries by exploring cube technology
5.4 multidimensional data analysis in cube space
5.5 summary
5.6 exercises
5.7 bibliographic notes
chapter 6 mining frequent patterns, associations, and correlations: basic concepts and methods
6.1 basic concepts
6.2 frequent itemset mining methods
6.3 which patterns are interesting?--pattern evaluation methods
6.4 summary
6.5 exercises
6.6 bibliographic notes
chapter 7 advanced pattern mining
7.1 pattern mining: a road map
7.2 pattern mining in multilevel, multidimensional space
7.3 constraint-based frequent pattern mining
7.4 mining high-dimensional data and colossal patterns
7.5 mining compressed or approximate patterns
7.6 pattern exploration and application
7.7 summary
7.8 exercises
7.9 bibliographic notes
chapter 8 classification: basic concepts
8.1 basic concepts
8.2 decision tree induction
8.3 bayes classification methods
8.4 rule-based classification
8.5 model evaluation and selection
8.6 techniques to improve classification accuracy
8.7 summary
8.8 exercises
8.9 bibliographic notes
chapter 9 classification: advanced methods
9.1 bayesian belief networks
9.2 classification by backpropagation
9.3 support vector machines
9.4 classification using frequent patterns
9.5 lazy learners (or learning from your neighbors)
9.6 other classification methods
9.7 additional topics regarding classification
9.8 summary
9.9 exercises
9.10 bibliographic notes
chapter 10 cluster analysis: basic concepts and methods
10.1 cluster analysis
10.2 partitioning methods
10.3 hierarchical methods
10.4 density-based methods
10.5 grid-based methods
10.6 evaluation of clustering
10.7 summary
10.8 exercises
10.9 bibliographic notes
chapter 11 advanced cluster analysis
11.1 probabilistic model-based clustering
11.2 clustering high-dimensional data
11.3 clustering graph and network data
11.4 clustering with constraints
11.5 summary
11.6 exercises
11.7 bibliographic notes
chapter 12 outlier detection
12.1 outliers and outlier analysis
12.2 outlier detection methods
12.3 statistical approaches
12.4 proximity-based approaches
12.5 clustering-based approaches
12.6 classification-based approaches
12.7 mining contextual and collective outliers
12.8 outlier detection in high-dimensional data
12.9 summary
12.10 exercises
12.11 bibliographic notes
chapter 13 data mining trends and research frontiers
13.1 mining complex data types
13.2 other methodologies of data mining
13.3 data mining applications
13.4 data mining and society
13.5 data mining trends
13.6 summary
13.7 exercises
13.8 bibliographic notes


Source: ITPUB blog, link: http://blog.itpub.net/16502878/viewspace-740000/. If you repost this, please credit the source; otherwise legal action will be pursued.




