Abstract:
UNIVERSITY OF CALIFORNIA, SAN DIEGO

Learning structure and concepts in data through data clustering

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science and Engineering

by

Gregory James Hamerly

Committee in charge:
    Professor Charles P. Elkan, Chairperson
    Professor Serge Belongie
    Professor Garrison Cottrell
    Professor Sanjoy Dasgupta
    Professor Virginia de Sa
    Professor Kenneth Kreutz-Delgado

2003

Copyright Gregory James Hamerly, 2003
All rights reserved.

The dissertation of Gregory Hamerly is approved, and it is acceptable in quality and form for publication on microfilm:

Chair

University of California, San Diego
2003

TABLE OF CONTENTS

Signature Page
Table of Contents
Acknowledgements
List of Figures
List of Tables
Vita and Publications
Abstract

I Introduction
    A. Data clustering motivation
        1. A small example: customer database
        2. Types of applications
        3. Specific applications
    B. Terms and definitions
    C. Clustering algorithms
        1. Iterative optimization clustering algorithms
        2. Hierarchical clustering
        3. Spectral clustering
        4. Other clustering algorithms
    D. Object similarity and distance metrics
    E. Opportunities in data clustering
    F. Outline of the dissertation

II Finding high-quality clustering solutions
    A. Finding high-quality solutions
    B. Center-based clustering
        1. General iterative clustering
        2. Initialization methods
        3. K-means
        4. Gaussian expectation-maximization
        5. Fuzzy k-means
        6. K-harmonic means
    C. New clustering algorithms
        1. Hybrid 1: hard membership, varying weights
        2. Hybrid 2: soft membership, constant weights
    D. Experimental setup
        1. Experiment 1: BIRCH
        2. Experiment 2: Pelleg and Moore data
    E. Experimental results
        1. Experiment 1: BIRCH
        2. Experiment 2: Pelleg and Moore data
    F. Conclusion
    G. Acknowledgements

III Estimating the number of clusters
    A. Introduction and related work
    B. The Gaussian-means (G-means) algorithm
    C. Testing clust