Data Quality: Why Preprocess the Data?
Accuracy: are the recorded values correct?
Completeness: are values missing (not recorded or unavailable)?
Consistency: were updates applied everywhere, or are some copies modified and others not (e.g., dangling references)?
Timeliness: is the data updated in a timely manner?
Believability: how much do users trust that the data are correct?
Interpretability: how easily can the data be understood?
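As a rough illustration of the completeness dimension, the sketch below measures the fraction of records that carry a value for each attribute. The function name `completeness`, the attribute names, and the sample records are all hypothetical, chosen only for this example.

```python
def completeness(records, attributes):
    """Return, per attribute, the fraction of records with a recorded value.

    A value counts as missing if the key is absent or the value is None.
    """
    n = len(records)
    return {
        attr: sum(1 for r in records if r.get(attr) is not None) / n
        for attr in attributes
    }


# Hypothetical records: some values were never recorded.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": 41},  # income not recorded at all
]

scores = completeness(records, ["age", "income"])
print(scores)
```

A low score for an attribute suggests it needs attention during data cleaning before any analysis relies on it.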
Major Tasks in Data Preprocessing
Data cleaning
Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., from faulty instruments, human or computer error, or transmission errors
Incomplete (Missing) Data
Noisy Data
Binning
Regression
Clustering
Combined computer and human inspection
Data integration
Combines data from multiple sources into a coherent store
Handling Redundancy in Data Integration
Correlation Analysis
1) Nominal data: chi-square test
2) Numeric data: correlation coefficient and covariance
Data reduction
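The two kinds of correlation analysis can be sketched in plain Python: a chi-square statistic for two nominal attributes, computed from a contingency table, and Pearson's correlation coefficient for two numeric attributes. The function names and all counts and values below are illustrative assumptions, not data from any real source.

```python
from math import sqrt


def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    return chi2


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Illustrative 2x2 table of counts for two nominal attributes.
table = [[250, 200],
         [50, 1000]]
chi2 = chi_square(table)

# Illustrative numeric attributes; the second is exactly twice the first.
r = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(chi2, 2), round(r, 2))
```

A large chi-square value relative to the critical value for the table's degrees of freedom, or a coefficient near +1 or -1, suggests that one attribute is largely redundant given the other.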
Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
Numerosity reduction (some simply call it data reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
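Two of the numerosity reduction techniques listed above, histograms and sampling, can be sketched as follows. This is a minimal illustration using only the standard library; the function names, the fixed seed, and the sample values are assumptions made for the example.

```python
import random


def equal_width_histogram(values, num_bins):
    """Summarize values as (bin_start, bin_end, count) tuples of equal width."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: a single bin holds everything.
        return [(lo, hi, len(values))]
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]


def simple_random_sample(data, k, seed=0):
    """Draw k items without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(data, k)


# Illustrative values for a single numeric attribute.
values = [4, 8, 15, 21, 21, 24, 25, 28, 34]

histogram = equal_width_histogram(values, 3)
sample = simple_random_sample(values, 3)
print(histogram)
print(sample)
```

The histogram replaces nine values with three bin counts, and the sample keeps three of the nine values. Both trade some detail for a smaller representation, which is the aim of data reduction.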