Data mining

CaffreyWu

Article link: https://blog.csdn.net/qq_35523224/article/details/79361120

CH1 Data mining

Major data mining tasks

Classification and regression
- classification predicts categorical attribute values
- regression predicts numerical attribute values
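
The difference between the two tasks can be illustrated with a minimal sketch. The nearest-neighbor rule used here is only an assumption chosen for brevity, not a method named in these notes, and the training values are made up.

```python
# Toy training data (made-up values): one numeric attribute per object.
train_x = [150.0, 160.0, 170.0, 180.0]
train_class = ["short", "short", "tall", "tall"]   # categorical target
train_value = [50.0, 58.0, 66.0, 75.0]             # numerical target


def nearest_index(x, xs):
    """Return the index of the training point closest to x."""
    return min(range(len(xs)), key=lambda i: abs(xs[i] - x))


def classify(x):
    """Classification: predict a categorical attribute value."""
    return train_class[nearest_index(x, train_x)]


def regress(x):
    """Regression: predict a numerical attribute value."""
    return train_value[nearest_index(x, train_x)]


predicted_class = classify(172.0)   # a category such as "tall"
predicted_value = regress(172.0)    # a number such as 66.0
```

Both functions share the same lookup; only the type of the predicted attribute differs.
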
Cluster analysis

Given a set of objects, each having a set of attributes, and a
similarity measure among them, find clusters (i.e., groups) such
that
- objects in one cluster are more similar to one another
- objects in separate clusters are less similar to one another

Unlike classification, clustering analyzes objects without
consulting a known class label.
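
The two conditions above can be checked numerically for a given grouping. The sketch below does not implement a clustering algorithm; it assumes a made-up 1-D data set, a hypothetical grouping, and absolute difference as the dissimilarity (a smaller distance means more similar).

```python
from itertools import combinations

# Made-up objects with a single numeric attribute.
points = {"a": 1.0, "b": 1.5, "c": 2.0, "x": 9.0, "y": 9.5}
# Hypothetical grouping to evaluate; no class labels are consulted.
clusters = [["a", "b", "c"], ["x", "y"]]

# Map each object name to the index of its cluster.
cluster_of = {name: i for i, members in enumerate(clusters) for name in members}


def distance(p, q):
    """Dissimilarity between two objects (assumed: absolute difference)."""
    return abs(points[p] - points[q])


def average(values):
    values = list(values)
    return sum(values) / len(values)


pairs = list(combinations(points, 2))
# Average distance between objects in the same cluster.
intra = average(distance(p, q) for p, q in pairs if cluster_of[p] == cluster_of[q])
# Average distance between objects in different clusters.
inter = average(distance(p, q) for p, q in pairs if cluster_of[p] != cluster_of[q])

good_grouping = intra < inter
```

A grouping is reasonable in the sense of the definition when `intra` is clearly smaller than `inter`.
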
Association analysis

Given a transactional database, find the sets of objects that
frequently appear within the same transactions. This task is
also called frequent pattern mining.
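
As a rough illustration, the sketch below counts item sets by brute force over a made-up transaction list. The threshold `min_support` and the size limit `max_size` are hypothetical parameters; practical frequent pattern miners avoid enumerating every candidate like this.

```python
from collections import Counter
from itertools import combinations

# Made-up transactional database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return item sets (as sorted tuples) found in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        for size in range(1, max_size + 1):
            # Sorting makes each item set map to one canonical tuple.
            for itemset in combinations(sorted(transaction), size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


frequent = frequent_itemsets(transactions, min_support=3)
```

With this data, the only pair that reaches the threshold is `("bread", "milk")`, which appears in three transactions.
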

Various data repositories

relational data
data warehouses
transactional data
graph data
sequence data
time series
spatial data
text & multimedia data

CH2a Data preprocessing

Real-world data are often
- noisy
- inconsistent
- redundant

Data preprocessing tasks

types of attributes
- Categorical
  - nominal: provide enough information to distinguish one object from another
    Example: zip codes, employee ID numbers, eye color, gender
  - binary: assume only two values (e.g., yes/no, true/false, 0/1)
  - ordinal: provide enough information to order objects
    Example: grades, {good, better, best}
- Numeric (continuous)
descriptive data summarization
gives the overall picture of the data
involves
- measuring the central tendency
  - mean
    The mean is sensitive to extreme values
  - weighted mean
  - trimmed mean: disregards the low and high extremes
  - median: the middle value of an ordered set of observations;
    unlike the mean, it is not sensitive to extreme values
  - mode: the value that occurs most frequently in the set
  - midrange: average of the largest and smallest values in the
    data
- measuring the dispersion
  - range: difference between the largest and smallest values
  - kth percentile: value x_i with the property that k percent of
    the data are smaller than x_i (what percentile is the median?)
  - quartiles: 25th percentile (denoted by Q1), 50th percentile
    (the median), and 75th percentile (denoted by Q3)
  - interquartile range: IQR = Q3 - Q1
  - five-number summary: consists of minimum, Q1, median, Q3,
    maximum
  - standard deviation σ: square root of the variance σ²
- graphical display of descriptive summaries
  - boxplots
  - histograms
  - scatter plots
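
The central tendency and dispersion measures listed above can be computed with Python's standard `statistics` module. This is a sketch on a made-up sample. The helpers `weighted_mean` and `trimmed_mean` are hypothetical names, and quartile values depend on the interpolation convention (here `method="inclusive"`), so other tools may report slightly different Q1 and Q3.

```python
import statistics

# Made-up sample, e.g., salaries in thousands.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the k lowest and k highest values."""
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)


# Central tendency
mean = statistics.mean(data)          # pulled upward by the extreme value 110
median = statistics.median(data)      # middle of the ordered observations
modes = statistics.multimode(data)    # most frequent value(s); may be several
midrange = (min(data) + max(data)) / 2
trimmed = trimmed_mean(data, k=1)
weighted = weighted_mean([80, 90], weights=[1, 3])

# Dispersion
value_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
five_number_summary = (min(data), q1, median, q3, max(data))
std_dev = statistics.pstdev(data)     # square root of the population variance
```

Here the mean (58) lies above the median (54) because of the single large value 110, which is the sensitivity to extreme values noted above.
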

1. Data cleaning
   - fill in missing values
     e.g., Occupation = ""
   - smooth out noise, outliers
     outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
     e.g., Salary = "-10"
   - correct inconsistencies in the data
     e.g., Age = "42", Birthday = "03/07/2010"
     e.g., discrepancy between duplicate records
2. Data integration
3. Data transformation
4. Data reduction
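
Two of the data cleaning operations in step 1 can be sketched as follows. The salary values are made up, filling missing values with the median is only one possible strategy (an assumption, not prescribed by these notes), and the outlier rule is the 1.5 x IQR criterion measured from Q1 and Q3.

```python
import statistics

# Made-up salary column: None marks a missing value, -10 is noise.
salaries = [30, 36, 47, None, 52, 52, 56, 60, 63, 70, 70, -10, 110]

# Fill in missing values (assumed strategy: median of the observed values).
observed = [s for s in salaries if s is not None]
fill_value = statistics.median(observed)
filled = [fill_value if s is None else s for s in salaries]

# Flag outliers: values more than 1.5 x IQR below Q1 or above Q3.
q1, _, q3 = statistics.quantiles(filled, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in filled if s < lower_fence or s > upper_fence]

# One simple way to smooth them out is to drop the flagged values.
cleaned = [s for s in filled if lower_fence <= s <= upper_fence]
```

Whether a flagged value is dropped, capped, or corrected by hand depends on the application; the negative salary is clearly an error, while 110 may be a legitimate extreme.
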