CH1 Data mining
Major data mining tasks
Classification and regression
- Classification predicts categorical attribute values;
- regression predicts numerical attribute values
Cluster analysis
Given a set of objects, each having a set of attributes, and a
similarity measure among them, find clusters (i.e., groups) such
that
- objects in one cluster are more similar to one another
- objects in separate clusters are less similar to one another
Unlike classification, clustering analyzes objects without
consulting a known class label
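The definition above does not name an algorithm. As one possible illustration, here is a minimal k-means sketch. It assumes Euclidean distance as the (dis)similarity measure and hand-picked starting centroids; the points and the function names are made up for this example.

```python
import math

def euclidean(a, b):
    """Distance between two points; a smaller distance means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, centroids, max_iter=100):
    """Group points around the given starting centroids (no class labels used)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return clusters, centroids

# Made-up 2-D objects forming two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
```

Here `clusters[0]` collects the three points near (1, 1) and `clusters[1]` the three points near (8, 9), without any class label being supplied.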
Association analysis
Given a transactional database, find the sets of objects that
frequently appear within the same transactions
also called frequent pattern mining
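A brute-force sketch of the idea above: count every itemset up to a given size and keep those that appear in at least `min_support` transactions. The transactions, the threshold, and the function name are assumptions made for illustration; practical algorithms prune the search rather than enumerate every combination.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore duplicates, fix item order
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Made-up market-basket transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
frequent = frequent_itemsets(transactions, min_support=3)
```

With this toy data, `("beer", "diapers")` appears in three transactions and is kept, while `("beer", "bread")` appears in only two and is dropped.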
Various data repositories
- relational data
- data warehouses
- transactional data
- graph data
- sequence data
- time series
- spatial data
- text & multimedia data
CH2a Data preprocessing
Raw data are often
- noisy
- inconsistent
- redundant
Data preprocessing tasks
- types of attributes
- Categorical
- nominal: provide enough information to distinguish one object from another
Example: zip codes, employee ID numbers, eye color, gender
- binary: assume only two values (e.g., yes/no, true/false, 0/1)
- ordinal: provide enough information to order objects
Example: grades, {good, better, best}
- Numeric (continuous)
- descriptive data summarization
gives the overall picture of the data
involves
- measuring the central tendency
- mean
The mean is sensitive to extreme values
- weighted mean
- trimmed mean: disregards the low and high extremes
- a measure that is not sensitive to extreme values is the
median, which represents the middle value of an ordered set
of observations
- mode: the value that occurs most frequently in the set
- midrange: average of the largest and smallest values in the
data
- measuring the dispersion
- range: difference between the largest and smallest values
- kth percentile: value x_i with the property that k percent of
the data are smaller than x_i (what percentile is the median?)
- quartiles: 25th percentile (denoted by Q1), 50th percentile,
and 75th percentile (denoted by Q3)
- interquartile range:
IQR = Q3 - Q1
- five number summary: consists of minimum, Q1, median, Q3,
maximum
- standard deviation σ: square root of the variance σ²
- graphical display of descriptive summaries
- boxplots
- histograms
- scatter plots
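The central-tendency and dispersion measures listed above can be sketched with Python's standard `statistics` module. The data values and the helper names are made up for illustration. Percentile conventions differ between textbooks and tools; this sketch assumes the "inclusive" method of `statistics.quantiles` and the population form of the standard deviation.

```python
import statistics

def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after dropping the given fraction of values at each extreme."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return statistics.mean(ordered[k:len(ordered) - k])

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

data = [2, 3, 3, 4, 5, 6, 7, 8, 97]  # made-up values with one extreme

# Central tendency
mean = statistics.mean(data)           # pulled upward by the extreme value 97
median = statistics.median(data)       # middle value, barely affected by 97
mode = statistics.mode(data)           # most frequent value
midrange = (min(data) + max(data)) / 2
trimmed = trimmed_mean(data, 0.2)      # drops one value at each end here

# Dispersion
value_range = max(data) - min(data)
minimum, q1, q2, q3, maximum = five_number_summary(data)
iqr = q3 - q1
std_dev = statistics.pstdev(data)      # square root of the population variance
```

On this data the mean is 15 while the median is 5, which illustrates how a single extreme value moves the mean but not the median.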
1. Data cleaning
fill in missing values
e.g., Occupation = ""
smooth out noise, outliers
outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
e.g., Salary = "-10"
correct inconsistencies in the data
e.g., Age = "42", Birthday = "03/07/2010"
e.g., discrepancy between duplicate records
2. Data integration
3. Data transformation
4. Data reduction
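Two of the data cleaning steps in item 1 can be sketched as follows: filling in missing values and flagging outliers with the 1.5 x IQR rule. Filling with the median is only one possible strategy and is an assumption of this sketch, as are the salary values, the function names, and the "inclusive" quartile method.

```python
import statistics

def fill_missing(values, placeholder=None):
    """Replace missing entries with the median of the observed values."""
    observed = [v for v in values if v is not placeholder]
    fill = statistics.median(observed)
    return [fill if v is placeholder else v for v in values]

def iqr_fences(values):
    """Return (lower, upper) bounds: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Made-up salaries (in thousands); None marks a missing value.
salaries = [52, 48, None, 50, 55, 47, 500, 51]

filled = fill_missing(salaries)
low, high = iqr_fences(filled)
outliers = [v for v in filled if v < low or v > high]
```

Here the missing entry is filled with the median of the observed salaries, and only the value 500 falls outside the fences. Whether a flagged value is an error or a genuine extreme still has to be decided case by case.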
Data transformation
(Goal: modify the data in order to improve data mining performance)
attribute/feature construction
normalization: attribute values are scaled to fall within a smaller, specified range
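One common way to do this is min-max normalization, sketched below. The notes do not say which normalization method is meant, so the choice of min-max, the default target range [0, 1], and the income values are assumptions for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are identical; map them all to the lower bound.
        return [new_min for _ in values]
    return [
        new_min + (v - old_min) / (old_max - old_min) * (new_max - new_min)
        for v in values
    ]

incomes = [12000, 73600, 98000]  # made-up attribute values
scaled = min_max_normalize(incomes)
```

The smallest value maps to the lower bound, the largest to the upper bound, and everything else keeps its relative position in between.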