Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
In this book, we focus on:
1) summary statistics
2) visualization
3) online analytical processing (OLAP)
UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
1. Summary statistics
1) The mean is very sensitive to outliers, so the median or a trimmed mean is also commonly used.
2) The variance is also sensitive to outliers, because it squares the deviations from the mean.
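Average absolute deviation (AAD): AAD(x) = (1/m) * sum_{i=1..m} |x_i - mean(x)|, where m is the number of values. It does not square the deviations, so it is less affected by outliers than the variance.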
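To make the outlier sensitivity concrete, here is a minimal sketch using only Python's standard library. The sample data, the 10% trimming fraction, and the helper names `trimmed_mean` and `average_absolute_deviation` are assumptions made for illustration; they are not from the book.

```python
from statistics import mean, median, pvariance

def trimmed_mean(values, p):
    """Mean after dropping the lowest and highest fraction p of the values (0 <= p < 0.5)."""
    xs = sorted(values)
    k = int(len(xs) * p)  # number of values dropped from each end
    return mean(xs[k:len(xs) - k])

def average_absolute_deviation(values):
    """Mean of |x - mean(values)| over all x."""
    m = mean(values)
    return mean(abs(x - m) for x in values)

# Hypothetical sample: nine ordinary values plus one extreme outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

stats = {
    "mean": mean(data),                               # pulled up by the outlier
    "median": median(data),                           # barely affected
    "trimmed_mean_10pct": trimmed_mean(data, 0.10),   # outlier is dropped
    "variance": pvariance(data),                      # population variance
    "aad": average_absolute_deviation(data),
}

for name, value in stats.items():
    print(f"{name}: {value}")
```

With this sample the single value 100 moves the mean far from the median, while the trimmed mean stays close to the median.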
2. Visualization
Box plot:
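Parallel coordinates:
There is no single shared vertical axis. The attributes are laid out along the horizontal axis (their order affects how the plot reads). Each sample's value for each attribute is marked at the corresponding height above that attribute's position, and the marks are connected, so each sample is drawn as one line.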
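The construction described above can be sketched without any plotting library by computing the points of each sample's line. The function name `parallel_coordinates`, the sample data, and the min-max scaling of every attribute to [0, 1] are assumptions made for this sketch; scaling is a common choice so that attributes with different units fit in one plot, but the notes do not prescribe it.

```python
def parallel_coordinates(samples, attributes):
    """Return one polyline per sample as a list of (x, y) points.

    x is the position of the attribute along the horizontal axis.
    y is the attribute value min-max scaled to [0, 1] (an assumption of this sketch).
    """
    lo = {a: min(s[a] for s in samples) for a in attributes}
    hi = {a: max(s[a] for s in samples) for a in attributes}
    lines = []
    for s in samples:
        line = []
        for x, a in enumerate(attributes):
            span = hi[a] - lo[a]
            # A constant attribute has no spread, so place it mid-height.
            y = (s[a] - lo[a]) / span if span else 0.5
            line.append((x, y))
        lines.append(line)
    return lines

# Hypothetical samples with three numeric attributes.
samples = [
    {"length": 5.0, "width": 2.0, "weight": 10.0},
    {"length": 7.0, "width": 3.0, "weight": 30.0},
    {"length": 6.0, "width": 4.0, "weight": 20.0},
]

lines = parallel_coordinates(samples, ["length", "width", "weight"])
# Same data, different attribute order: the lines take a different shape.
reordered = parallel_coordinates(samples, ["weight", "length", "width"])
```

Each inner list can be passed to any line-drawing routine. Comparing `lines` with `reordered` shows why the attribute order affects interpretation.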
3. OLAP
OLAP uses a multidimensional array representation.
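As a rough illustration of the multidimensional array view, the sketch below stores a small data cube as nested lists and applies two typical operations: slicing (fixing one dimension) and aggregating over a dimension. The dimensions, the sales figures, and the function names are hypothetical and chosen only for this example.

```python
# Hypothetical dimensions and sales figures, for illustration only.
products = ["pen", "book"]
locations = ["north", "south"]
quarters = ["Q1", "Q2", "Q3"]

# cube[p][l][q] = sales of product p at location l in quarter q
cube = [
    [[10, 12, 11], [7, 8, 9]],       # pen:  north, south
    [[20, 18, 22], [15, 14, 16]],    # book: north, south
]

def slice_product(cube, p):
    """Slice: fix the product dimension, leaving a location x quarter table."""
    return cube[p]

def roll_up_time(cube):
    """Sum over quarters, leaving a product x location table."""
    return [[sum(cell) for cell in row] for row in cube]

def total_by_product(cube):
    """Sum over both location and quarter, leaving one total per product."""
    return [sum(sum(cell) for cell in row) for row in cube]

pen_table = slice_product(cube, products.index("pen"))
by_product_location = roll_up_time(cube)
by_product = total_by_product(cube)

for name, total in zip(products, by_product):
    print(name, total)
```

Each aggregation removes one or more dimensions from the array, which is how a cube yields summary tables at different levels of detail.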