Data Mining2复习笔记1 Data Preprocessing

大白要努力啊

已于 2024-06-12 21:31:43 修改

阅读量684

点赞数 17

分类专栏：笔记文章标签：笔记 data mining 数据挖掘高级数据分析数据分析

于 2024-06-05 19:38:33 首次发布

本文链接：https://blog.csdn.net/weixin_45012798/article/details/139480477

版权

笔记专栏收录该内容

40 篇文章 14 订阅

订阅专栏

1. Data Preprocessing

1.1 Errors in Data

Source

malfunctioning sensors
errors in manual data processing (e.g., twisted digits)
storage/transmission errors
encoding problems, misinterpreted file formats
bugs in processing code
…

Simple remedy
remove data points outside a given interval (requires some domain knowledge)

Typical Examples
remove temperature values outside -30 and +50 °C
remove negative durations
remove purchases above 1M Euro

Advanced remedies
automatically find suspicious data points (Anomaly Detection)

1.2 Missing Values

Possible reasons

Failure of a sensor
Data loss
Information was not collected
Customers did not provide their age, sex, marital status, …
…

Treatments

Ignore records with missing values in training data
Replace missing value with…
- default or special value (e.g., 0, “missing”)
- average/median value for numerics
- most frequent value for nominals
Try to predict missing values:
- handle missing values as learning problem
- target: attribute which has missing values
- training data: instances where the attribute is present
- test data: instances where the attribute is missing

Note: values may be missing for various reasons, and more importantly: at random vs. not at random

Examples for not random

Non-mandatory questions in questionnaires
Values that are only collected under certain conditions
Values only valid for certain data sub-populations
Sensors failing under certain conditions

In those cases, averaging and imputation causes information loss -> “missing” can be information!

Missing Values vs. Missing Observations
Missing values: Typically single fields in a record + Can be handled with imputation etc.

Missing observations: Entire records missing + Various forms (Selection bias, Missing values in time series)

1.3 Unbalanced Distribution

Example: learn a model that recognizes HIV, given a set of symptoms.
Data set: records of patients who were tested for HIV
Class distribution: 99.9% negative, 0.01% positive

Learn a decision tree
Purity measure: Gini index

Gini Index

Model has very high accuracy, but 0 recall/precision on positive class, which is what we were interested in.

Remedy
re-balance dataset for training but evaluate on unbalanced dataset!

Resampling Unbalanced Data
Two conflicting goals
(1) use as much training data as possible
(2) use as diverse training data as possible

Strategies
(1) Downsampling larger class - conflicts with goal 1
(2) Upsampling smaller class - conflicts with goal 2

Consider an extreme example: 1,000 examples of class A, 10 examples of class B

Downsampling: does not use 990 examples
Upsampling: creates 100 copies of each example of B, likely for the classifier to simply memorize the 10 B cases

(3) Resampling with SMOTE (Synthetic Minority Over Sampling Technique) - creates synthetic examples of minority class
SMOTE

Stratification vs. changing the distribution

Stratified sampling: keep class distribution
Upsampling and downsampling: balance class distribution
Kennard-Stone sampling tries to select heterogenous points

Kennard-Stone Sampling

Compute pairwise distances of points
Add points with largest distance from one another
While target sample size not reached
1. For each candidate, find smallest distance to any point in the sample
2. Add candidate with largest smallest distance

This guarantees that heterogeneous data points are added, i.e., sample gets more diverse. It includes more corner cases but potentially also more outliers. Distribution may be altered

Kennard-Stone Sampling

Pro: a lot of rare cases covered
Con: original distribution gets lost

Sampling Strategies and Learning Algorithms
There are interaction effects. Some learning algorithms rely on distributions, e.g., Naive Bayes –> usually, stratified sampling works better. Some rely less on distributions, e.g., Decision Trees -> may work better if they see more corner cases

Often, the training data in a real-world project is already a sample, e.g., sales figures of last month -> to predict the sales figures for the rest of the year
How representative is that sample? What if last month was December? Or February?

Effect known as selection bias
Example: phone survey with 3,000 participants, carried out Monday, 9-17
Thought experiment: effect of selection bias for prediction,
e.g., with a Naive Bayes classifier

1.4 False Predictors

~100% accuracy are a great result and a result that should make you suspicious!
False predictor: target variable was included in attributes

Example: mark<5 → passed=true; sales>1000000 → bestseller=true

Recognizing False Predictors
(1) By analyzing models - rule sets consisting of only one rule; decision trees with only one node
Process: learn model, inspect model, remove suspect, repeat until the accuracy drops

(2) By analyzing attributes - compute correlation of each attribute with label; correlation near 1 (or -1) marks a suspect

Caution: there are also strong (but not false) predictors – it’s not always possible to decide automatically

1.5 Unsupported Data Types

Not every learning operator supports all data types
– some (e.g., ID3) cannot handle numeric data
– others (e.g., SVM) cannot nominal data
– dates are difficult for most learners

solutions
– convert nominal to numeric data
– convert numeric to nominal data (discretization, binning)
– extract valuable information from dates

1.5.1 Conversion

1.5.1.1 Binary to Numeric

Binary fields -> 0 & 1

1.5.1.2 Conversion: Nominal to Numeric

Multi-valued, unordered attributes with small no. of values -> one hot encoding: N binary or 0/1 variables, only one is “hot” (true or 1)

1.5.1.3 Conversion: Ordinal to Numeric

Some nominal attributes incorporated an order -> Ordered attributes (e.g. grade) can be converted to numbers preserving natural order.
Using such a coding schema allows learners to learn valuable rules

1.5.1.4 Conversion: Nominal to Numeric

manual, with background knowledge (e.g. group US states into west, mideast, northeast, south)
use binary attributes, then apply dimensionality reduction

1.5.2 Discretization

1.5.2.1 Discretization: Equal-width

Discretization: Equal-width

1.5.2.2 Discretization: Equal-height

Discretization: Equal-height

1.5.2.3 Discretization: Entropy

Top-down approach
Tries to minimize the entropy in each bin
Entropy (x are all the attribute values):
Entropy

Goal: make intra-bin similarity as high as possible
a bin with only equal values has entropy=0

Algorithm

Split into two bins so that overall entropy is minimized
Split each bin recursively as long as entropy decreases significantly

1.5.2.4 Trading and Test Data

Training and test data have to be equally discretized!

Learned model:
– income=high → give_credit=true
– income=low → give_credit=false

Applying model:
income=low has to have the same semantics on training and test data!
Naively applying discretization will lead to different ranges!

wrong
right

1.5.2.5 Semi-supervised Learning

Labeling data with ground truth can be expensive
Example: Medical images annotated with diagnoses by medical experts

Typical case:
Smaller subset of labeled data (gold standard). Larger subset of unlabeled data

Semi-supervised learning: Tries to combine both types of data

Semi-supervised learning can be applied to discretization
Learn distribution of an attribute on larger dataset → find better bins

1.5.3 Dealing with Data Attributes

Dates (and times) can be formatted in various ways
first step: normalize and parse

Further Datatypes: Text & Multi-modal data: Images, Videos, Audio
Typically, encoders are used to create (numeric) representations
from such data

High Dimensionality
Datasets with large number of attributes
Examples: text classification, image classification, genome classification, …
(not only a) scalability problem, e.g., decision tree: search all attributes for determining one single split

1.5.4 Curse of Dimensionality

Learning models gets more complicated in high-dimensional spaces
Higher number of observations are needed -> For covering a meaningful number of combinations (“Combinatorial Explosion”))

Distance functions collapse
i.e., all distances converge in high dimensions
Nearest neighbor classifiers are no longer meaningful

Why does Euclidean Distance Collapse维度灾难?
高维数据的距离都会变得无意义，因为彼此之间都非常远并收敛于一个值，变化趋于0，此时基于距离的算法像KNN这种会失效。

1.6 Feature Subset Selection

Preprocessing step
Idea: only use valuable features

Basic heuristics: remove nominal attributes…
– which have more than p% identical values
example: millionaire=false
– which have more than p% different values
• example: names, IDs

Basic heuristics: remove numerical attributes
– which have little variation, i.e., standard deviation <s

Basic Distinction: Filter vs. Wrapper Methods
Filter methods
Use attribute weighting criterion, e.g., Chi², Information Gain, … -> Select attributes with highest weights
Fast (linear in no. of attributes), but not always optimal

Remove redundant attributes (e.g., temperature in °C and °F, textual features “Barack” and “Obama”)
compute pairwise correlations between attributes -> remove highly correlated attributes

Recap:
Naive Bayes requires independent attributes
Will benefit from removing correlated attributes

Wrapper methods
Use classifier internally -> Run with different feature sets -> Select best feature set

Advantages: Good feature set for given classifier

Disadvantages: Expensive (naively: at least quadratic in number of attributes) & Heuristics can reduce number of classifier runs

Forward selection

Backward elimination

Further approaches

Brute Force search
Evolutionary algorithms

Trade-off

simple heuristics are fast but may not be the most effective
brute-force is most effective but the slowest
forward selection, backward elimination, and evolutionary algorithms are often a good compromise

Overfitting can happen with feature subsect selection, too
Here, name seems to be a useful feature, …but is it?

Remedies: Hard for filtering methods
• e.g., name has highest information gain!

Wrapper methods: use cross validation inside!

Principal Component Analysis (PCA)
PCA creates a (smaller set of) new attributes

artificial linear combinations of existing attributes
as expressive as possible

Idea: transform coordinate system so that each new coordinate (principal component) is as expressive as possible
expressivity: variance of the variable
the 1st, 2nd, 3rd… PC should account for as much variance as possible, further PCs can be neglected

PCA

Principal components are linear combinations of the existing features

General approach
(1) The first component should have as much variance as possible
(2)The subsequent ones should also have as much variance as possible and be perpendicular to the first one

PCA General Approach
PCA Example