Introduction to Data Mining

Notes from Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (China Machine Press edition)


1. Process of KDD in databases: Input Data -> Data Preprocessing -> Data Mining -> Postprocessing -> Information  (Figure 1.1, page 3)

2. Data Mining Tasks: Cluster Analysis; Predictive Modeling; Association Analysis; Anomaly Detection  (page 7)

3. Attribute: (page 25~27)

    An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.

    A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.

    Four typical operations on numbers used to describe attributes:

Distinctness: = and ≠

Order: <, ≤, > and ≥

Addition: + and -

Multiplication: * and /

    Four types of attributes: the first two below are categorical (qualitative) and the latter two are numeric (quantitative).

     Nominal: its values are just different names used to distinguish one object from another, e.g., zip codes, employee ID numbers. Permissible transformations: any one-to-one mapping, i.e., a permutation of values.

     Ordinal: its values provide enough information to order objects, e.g., hardness of minerals, grades, {good, better, best}. Permissible transformations: any order-preserving change of values, new_value = f(old_value), where f is a monotonic function.

     Interval: the differences between values are meaningful, e.g., calendar dates, temperature in Celsius. Permissible transformations: new_value = a*old_value + b, where a and b are constants.

     Ratio: both differences and ratios are meaningful, e.g., electrical current, length, age, mass. Permissible transformations: new_value = a*old_value.
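The interval/ratio distinction can be checked directly. A minimal Python sketch (not from the book), using Celsius-to-Fahrenheit as the interval example and meters-to-centimeters as the ratio example:

```python
# Interval attribute: permissible transformations have the form
# new_value = a * old_value + b (here a = 9/5, b = 32).
def c_to_f(c):
    return 9 / 5 * c + 32

# Ratio attribute: permissible transformations have the form
# new_value = a * old_value (here a = 100).
def m_to_cm(m):
    return 100 * m

# Interval scale: differences remain comparable (they scale by a) ...
assert c_to_f(20.0) - c_to_f(10.0) == 9 / 5 * (20.0 - 10.0)
# ... but ratios are not preserved: 20 C is not "twice as hot" as 10 C.
assert c_to_f(20.0) / c_to_f(10.0) != 2.0

# Ratio scale: ratios are preserved, so "twice as long" is meaningful.
assert m_to_cm(2.0) / m_to_cm(1.0) == 2.0
```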

4. For asymmetric attributes, only presence (a non-zero attribute value) is regarded as important. Asymmetric binary attributes are binary attributes for which only non-zero values matter. (page 28)

5. General Characteristics of Data Sets: Dimensionality, Sparsity, Resolution.

    Three types of Data Sets:( page 29~34)

      Record Data: Record data, Transaction data, Data matrix, Document-term matrix.

      Graph-Based Data: e.g., linked Web pages, a benzene molecule.

      Ordered Data: Sequential transaction data, Genomic sequence data, Temperature time series, Spatial temperature data.

6. A variety of problems can result in measurement error: noise, artifacts, bias, precision, and accuracy. (page 37~39)

    Precision: the closeness of repeated measurements (of the same quantity) to one another, commonly measured by the standard deviation.

    Bias: a systematic variation of the measurements from the quantity being measured, commonly measured by the difference between the mean of the measurements and the known true value.
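Both definitions are one-liners in code. A small Python sketch with hypothetical repeated measurements of a quantity whose true value is 1.000:

```python
import statistics

# Hypothetical repeated measurements; the true value is 1.000.
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]
true_value = 1.000

# Precision: closeness of the repeated measurements to one another,
# measured here by the standard deviation.
precision = statistics.stdev(measurements)

# Bias: systematic deviation, measured by (mean - true value).
bias = statistics.mean(measurements) - true_value
```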

7. Three major strategies to cope with missing data: Eliminate Data Objects or Attributes; Estimate Missing Values; Ignore the Missing Values during Analysis (i.e., use only the attribute values that are present). (page 41)

8. Seven Steps of Data Preprocessing:

    Aggregation;

    Sampling: the approaches include simple random sampling, in which every object has an equal probability of being selected (with or without replacement), and stratified sampling, which draws from each group a number of objects proportional to the size of that group.  (page 48)
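The proportionality in stratified sampling can be sketched in a few lines of Python (the group labels and sizes here are made up for illustration):

```python
import random

def stratified_sample(groups, total_n, seed=0):
    """Draw from each group a number of objects proportional to its size."""
    rng = random.Random(seed)
    population = sum(len(g) for g in groups.values())
    sample = {}
    for name, objects in groups.items():
        k = round(total_n * len(objects) / population)
        sample[name] = rng.sample(objects, k)  # simple random sample per group
    return sample

# Group A is twice the size of group B, so it contributes twice as
# many objects to the sample (20 vs. 10 of the 30 drawn).
groups = {"A": list(range(100)), "B": list(range(100, 150))}
s = stratified_sample(groups, total_n=30)
```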

   Dimensionality Reduction: Linear Algebra Techniques for dimensionality Reduction (Principal Components Analysis and Singular Value Decomposition)  (page 52)

   Feature Subset Selection:  (attributes)->(search strategy)->(subset of attributes)->(evaluation)->(check if it meets the stopping criterion)->(selected attributes)->(validation procedure)   (Page 54)

   Feature Creation: e.g., the Fourier transform produces a new data object whose attributes are related to frequencies.

   Discretization and Binarization: how is a continuous attribute transformed into a categorical attribute? Choose n-1 split points to divide the range into n intervals {(x0, x1], (x1, x2], ..., (x(n-1), xn)}, then map all the values within one interval to the same categorical value. Equal width, equal frequency, and K-means are unsupervised discretization approaches, while entropy-based approaches are supervised.  (page 59, 62)
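A minimal sketch (not the book's code) of the two simplest unsupervised schemes, on made-up data with one outlier:

```python
def equal_width_bins(values, n):
    """Split the range [min, max] into n intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    return [min(int((v - lo) / width), n - 1) for v in values]

def equal_frequency_bins(values, n):
    """Put (roughly) the same number of values into each bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_bin = len(values) / n
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), n - 1)
    return bins

data = [1, 2, 3, 4, 10, 11, 12, 100]
equal_width = equal_width_bins(data, 3)      # [0, 0, 0, 0, 0, 0, 0, 2]
equal_freq = equal_frequency_bins(data, 4)   # [0, 0, 1, 1, 2, 2, 3, 3]
```

Note how the outlier 100 forces almost everything into the first equal-width bin, while equal frequency spreads the values evenly.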

   Variable Transformation: simple functions, and normalization or standardization (for robustness, the median can replace the mean and the absolute standard deviation can replace the standard deviation).  (page 63~65)
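A short sketch of ordinary standardization next to the robust variant the notes allude to (median in place of the mean, mean absolute deviation from the median in place of the standard deviation):

```python
import statistics

def standardize(xs):
    """Classic z-score: subtract the mean, divide by the standard deviation."""
    mu, sigma = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mu) / sigma for x in xs]

def robust_standardize(xs):
    """Robust variant: center by the median, scale by the mean
    absolute deviation from the median, so outliers matter less."""
    med = statistics.median(xs)
    aad = statistics.mean(abs(x - med) for x in xs)
    return [(x - med) / aad for x in xs]
```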

9. Proximity refers to either similarity or dissimilarity.  (page 65)

10. Euclidean distance: d(x, y) = sqrt( sum_k (x_k - y_k)^2 )  (page 69, Equation 2.1)
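Equation 2.1 transcribes directly into Python:

```python
import math

def euclidean(x, y):
    # d(x, y) = sqrt( sum over k of (x_k - y_k)^2 )
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

d = euclidean((0, 0), (3, 4))  # the 3-4-5 triangle: d == 5.0
```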

11. Measures of distance that satisfy positivity, symmetry, and the triangle inequality are known as metrics, and the triangle inequality can be used to increase the efficiency of techniques.  (page 71)

12. The Simple Matching Coefficient (SMC) and the Jaccard Coefficient are both similarity measures for binary data. The Jaccard coefficient is typically used for objects consisting of asymmetric binary attributes: if 0-0 matches were counted, most objects would appear highly similar to one another.  (page 74)
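A sketch of the difference, with two made-up binary vectors (think of them as purchase indicators) that share nothing but absences:

```python
def smc(x, y):
    """Simple Matching Coefficient: counts both 1-1 and 0-0 matches."""
    matches = sum(a == b for a, b in zip(x, y))
    return matches / len(x)

def jaccard(x, y):
    """Jaccard: 1-1 matches over positions that are non-zero in either."""
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    nonzero = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return f11 / nonzero if nonzero else 0.0

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
smc(x, y)      # 0.8 -- dominated by the shared zeros
jaccard(x, y)  # 0.0 -- the objects have nothing present in common
```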

13. Cosine similarity is one of the most common measures of document similarity. (page 75)
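A minimal sketch of cos(x, y) = (x . y) / (||x|| ||y||), as it would be applied to term-frequency vectors:

```python
import math

def cosine(x, y):
    # cos(x, y) = dot(x, y) / (||x|| * ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# Parallel vectors have similarity 1, orthogonal vectors 0;
# document length (vector magnitude) does not matter.
cosine([1, 2], [2, 4])   # ~1.0
cosine([1, 0], [0, 1])   # 0.0
```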

14. Correlation, used for data objects with binary or continuous attributes, is a measure of the linear relationship between the attributes of the objects.  (page 76)

15. The Mahalanobis distance is useful when attributes are correlated, have different ranges of values, and the distribution of the data is approximately Gaussian. However, it is expensive to compute.  (page 81)
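A dependency-free two-dimensional sketch, inverting the 2x2 covariance matrix by hand; with the identity covariance it reduces to the Euclidean distance:

```python
import math

def mahalanobis_2d(x, y, cov):
    """sqrt( (x - y)^T * inv(cov) * (x - y) ) for 2-D points."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))  # 2x2 inverse
    dx = (x[0] - y[0], x[1] - y[1])
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# Identity covariance -> ordinary Euclidean distance (5.0 here).
mahalanobis_2d((0, 0), (3, 4), ((1, 0), (0, 1)))
```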

16. Data exploration is a preliminary investigation of the data in order to better understand its specific characteristics. It mainly involves three major topics: summary statistics, visualization, and On-Line Analytical Processing (OLAP).  (page 97)

17. Summary statistics describe numeric features such as frequency, percentiles, mean and median, range and variance, and covariance.  (page 98-105)

18. For data with outliers, the median provides a more robust estimate of the middle of a set of values. The mean can be distorted by outliers, and since the variance is computed using the mean, it is also sensitive to outliers.  (page 102, 103)

19. Some techniques (charts) used in visualization: Stem and Leaf Plots, Histograms (including Two-Dimensional Histograms), Box Plots, Pie Charts, Percentile Plots and Empirical Cumulative Distribution Functions, Scatter Plots, Contour Plots, Surface Plots, Vector Field Plots, Animation, Matrices, Parallel Coordinates, Star Coordinates, and Chernoff Faces.  (page 111-130)

20. OLAP has not only made its way into spreadsheet programs (like Excel) but has also become a subject of research on the interactive analysis of data, with the ability to visualize data and generate summary statistics.  (page 131)

21. Multidimensional data approach: sum over the selected dimensions to reduce dimensionality. (page 135)

22. Accuracy of a classification model=(Number of correct predictions/Total number of predictions)

     Error rate = (Number of wrong predictions / Total number of predictions)  (page 149)
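The two formulas in code, on a made-up list of actual vs. predicted class labels:

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def error_rate(actual, predicted):
    """Fraction of wrong predictions; accuracy and error rate sum to 1."""
    return 1 - accuracy(actual, predicted)

# 3 of the 4 predictions are correct.
accuracy([1, 0, 1, 1], [1, 0, 0, 1])    # 0.75
error_rate([1, 0, 1, 1], [1, 0, 0, 1])  # 0.25
```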

23. Measures of impurity: Entropy, Gini, Classification error.  (page 158)

24. The gain Δ = Impurity(parent) - the weighted average of the impurities of the child nodes.  (page 160)
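The three impurity measures and the gain Δ can be sketched together; the class counts below are made up:

```python
import math

def entropy(counts):
    total = sum(counts)
    ps = [c / total for c in counts if c]
    return -sum(p * math.log2(p) for p in ps)

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def classification_error(counts):
    return 1 - max(counts) / sum(counts)

def gain(parent_counts, children_counts, impurity=gini):
    """Gain = impurity(parent) - weighted average impurity of the children,
    where each child is weighted by its share of the parent's records."""
    n = sum(parent_counts)
    weighted = sum(sum(ch) / n * impurity(ch) for ch in children_counts)
    return impurity(parent_counts) - weighted

# A perfectly pure split of a 50/50 parent removes all Gini impurity.
gain([5, 5], [[5, 0], [0, 5]])  # 0.5
```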

25. Model overfitting: as the number of nodes in a decision tree increases, the training error and generalization error both decrease at first. However, once the tree becomes too large, its generalization error begins to increase even though its training error continues to decrease.  (page 174)

26. The major causes of model overfitting are: presence of noise; lack of representative samples.  (page 175~177)

27. Pessimistic Error Estimate: e_g(t) = (e(t) + Ω(t)) / N_t  (page 181)
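Applied to a whole tree, the estimate sums the training errors e(t) and the complexity penalties Ω(t) over the leaves, then divides by the number of training records N_t. A sketch under the assumption of a constant penalty per leaf (Ω = 0.5 is the value used in the book's worked example):

```python
def pessimistic_error(leaf_errors, n_records, omega=0.5):
    """(sum of leaf training errors + omega per leaf) / N_t.
    omega is a per-leaf complexity penalty (assumed constant here)."""
    return (sum(leaf_errors) + omega * len(leaf_errors)) / n_records

# 4 leaves with 4 total training errors over 24 records:
# (4 + 4 * 0.5) / 24 = 0.25
pessimistic_error([1, 2, 0, 1], n_records=24)
```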

 ASSOCIATION ANALYSIS:

   We could decompose the overall task into two subtasks:

    1. Frequent Itemset Generation: find all the itemsets that satisfy the minsup threshold. (The Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent. page 333)

    2. Rule Generation: extract all the rules that satisfy the minconf threshold. (Theorem: if a rule X -> Y-X does not satisfy the confidence threshold, then any rule X' -> Y-X', where X' is a subset of X, must not satisfy the confidence threshold either.)
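Subtask 1 can be sketched compactly in Python. This uses the contrapositive of the Apriori principle to prune: a (k+1)-itemset is only a candidate if every one of its k-subsets is frequent. The transactions are made up:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frequent itemset: support} for the given minsup threshold."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    current = [frozenset([i]) for i in items]  # candidate 1-itemsets
    k = 1
    while current:
        # support counting for the current candidates
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c for c, cnt in counts.items() if cnt / n >= minsup}
        frequent.update({c: counts[c] / n for c in survivors})
        # candidate generation, pruned by the Apriori principle:
        # keep a union only if all of its k-subsets are frequent
        k += 1
        current = [u for u in
                   {a | b for a in survivors for b in survivors if len(a | b) == k}
                   if all(frozenset(s) in survivors for s in combinations(u, k - 1))]
    return frequent

txns = [frozenset(t) for t in
        [{"bread", "milk"}, {"bread", "diapers", "beer"},
         {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"}]]
freq = apriori(txns, minsup=0.5)
```

With minsup = 0.5 (support count of at least 2 out of 4 transactions), all four items and four of the six pairs survive, and no 3-itemset is frequent.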
