Data mining

CH1 Data mining

Major data mining tasks

  1. Classification and regression

    • Classification predicts categorical attribute values;
    • regression predicts numerical attribute values
  2. Cluster analysis

    Given a set of objects, each having a set of attributes, and a
    similarity measure among them, find clusters (i.e., groups) such
    that

    • objects in one cluster are more similar to one another
    • objects in separate clusters are less similar to one another

    Unlike classification, clustering analyzes objects without
    consulting a known class label.
  3. Association analysis

    Given a transactional database, find the sets of objects that
    frequently appear within the same transactions.
    Also called frequent pattern mining.
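
As a rough illustration of association analysis, the sketch below counts how often each itemset appears across transactions and keeps those that reach a minimum support count. The function name `frequent_itemsets`, the `min_support` and `max_size` parameters, and the sample transactions are all hypothetical; this is a brute-force enumeration for small inputs, not an efficient mining algorithm.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: count} for itemsets found in at least min_support transactions.

    Brute force: enumerates every itemset of size 1..max_size in each transaction.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # de-duplicate; sort so each tuple is canonical
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

# Hypothetical transactional database: one list of items per transaction.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]

result = frequent_itemsets(transactions, min_support=2)
for itemset, count in sorted(result.items()):
    print(itemset, count)
```

Itemsets are stored as sorted tuples so that the same set of objects always maps to the same key regardless of the order within a transaction.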

Various data repositories

  • relational data
  • data warehouses
  • transactional data
  • graph data
  • sequence data
  • time series
  • spatial data
  • text & multimedia data

CH2a Data preprocessing


Data preprocessing tasks

  • types of attributes
    • Categorical
      • nominal: provide enough information to distinguish one object from another
        Example: zip codes, employee ID numbers, eye color, gender
      • binary: assume only two values (e.g., yes/no, true/false, 0/1)
      • ordinal: provide enough information to order objects
        Example: grades, {good, better, best}
    • Numeric (continuous)
  • descriptive data summarization
    gives the overall picture of the data
    • measuring the central tendency
      • mean
        The mean is sensitive to extreme values
      • weighted mean
      • trimmed mean: the mean computed after disregarding the low and
        high extremes
      • median: the middle value of an ordered set of observations;
        unlike the mean, it is not sensitive to extreme values
      • mode: the value that occurs most frequently in the set
      • midrange: average of the largest and smallest values in the set
    • measuring the dispersion
      • range: difference between the largest and smallest values
      • kth percentile: value xi with the property that k percent of
        the data are smaller than xi (what percentile is the median?)
      • quartiles: 25th percentile (denoted by Q1), 50th percentile,
        and 75th percentile (denoted by Q3)
      • interquartile range: IQR = Q3 - Q1
      • five-number summary: consists of minimum, Q1, median, Q3,
        and maximum
      • standard deviation σ: square root of the variance σ²
    • graphical display of descriptive summaries
      • boxplots
      • histograms
      • scatter plots
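
The measures of central tendency and dispersion listed above can be sketched with Python's standard `statistics` module. The helper names, the sample values, and the choice of the population standard deviation (`pstdev`) are assumptions made for illustration; quartile conventions also differ between textbooks and tools, and `statistics.quantiles` is used here with its default method.

```python
import statistics

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values (k is a count)."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k removes every value")
    return statistics.mean(ordered[k:len(ordered) - k])

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

# Hypothetical sample with one extreme value (110).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Central tendency
mean = statistics.mean(data)              # pulled upward by the extreme value
median = statistics.median(data)          # not sensitive to the extreme value
modes = statistics.multimode(data)        # every most-frequent value
midrange = (min(data) + max(data)) / 2
trimmed = trimmed_mean(data, 1)           # drops 30 and 110 before averaging

# Dispersion
value_range = max(data) - min(data)
minimum, q1, q2, q3, maximum = five_number_summary(data)
iqr = q3 - q1
std_dev = statistics.pstdev(data)         # square root of the population variance

print(mean, median, modes, midrange, trimmed)
print(value_range, (minimum, q1, q2, q3, maximum), iqr, std_dev)
```

On this sample the mean is 58 while the median is 54, which shows how a single extreme value shifts the mean but not the median.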

1. Data cleaning
   • fill in missing values
     e.g., Occupation = ""
   • smooth out noise, outliers
     outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
     e.g., Salary = "-10"
   • correct inconsistencies in the data
     e.g., Age = "42", Birthday = "03/07/2010"
     e.g., discrepancy between duplicate records
2. Data integration
3. Data transformation
4. Data reduction
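
The 1.5 x IQR rule of thumb mentioned under data cleaning can be sketched as follows. The function name `iqr_outliers`, the parameter `k`, and the sample values are hypothetical, and the quartiles come from `statistics.quantiles` with its default method, so other quartile conventions may place the fences slightly differently.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical sample: 110 sits far above the rest of the values.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
outliers = iqr_outliers(salaries)
print(outliers)
```

Flagging a value this way only marks it as suspicious; whether to drop, cap, or keep it remains a judgment call for the data cleaning step.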

Data transformation
(Goal: modify the data in order to improve data mining performance)

attribute/feature construction

normalization: values are scaled to fall within a smaller, specified range

min-max normalization

z-score normalization
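
The notes name the two normalization methods without giving formulas. The sketch below uses the commonly taught definitions, which are an assumption here: min-max normalization maps v to (v - min) / (max - min) * (new_max - new_min) + new_min, and z-score normalization maps v to (v - mean) / standard deviation. The function names and the use of the population standard deviation are illustrative choices.

```python
import statistics

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    """Center values on the mean and scale by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mean) / std for v in values]

# Hypothetical attribute values.
incomes = [10, 20, 30]
scaled = min_max_normalize(incomes)
z_scores = z_score_normalize(incomes)
print(scaled)
print(z_scores)
```

Min-max normalization depends entirely on the observed minimum and maximum, so a single outlier compresses the other values; z-score normalization is often preferred when the true range is unknown or outliers are present.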

Data reduction







