Data Mining (1): Know Your Data

Data Objects and Attribute Types

Record
- Relational records
- Data matrix, e.g., numerical matrix, crosstabs
- Document data: text documents: term-frequency vector
- Transaction data
Graph and network
- World Wide Web
- Social or information networks
- Molecular Structures
Ordered
- Video data: sequence of images
- Temporal data: time-series
- Sequential Data: transaction sequences
- Genetic sequence data
Spatial, image and multimedia
- Spatial data: maps
- Image data
- Video data

Data Objects: Data sets are made up of data objects.
A data object represents an entity.

Attributes
Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
Nominal
Binary
Ordinal
Numeric (quantitative; integer or real-valued)
- Interval-scaled
- Ratio-scaled

Basic Statistical Descriptions of Data

  1. Motivation

    Measuring the Central Tendency
    Mean (algebraic measure) (sample vs. population)
    Median
    Mode
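As a minimal sketch of the three measures above, the snippet below uses Python's standard `statistics` module. The list `values` is a hypothetical sample chosen only for illustration.

```python
import statistics

# Hypothetical sample (illustrative values only).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(values)        # arithmetic mean: sum of values / number of values
median = statistics.median(values)    # middle value; for even n, the average of the two middle values
modes = statistics.multimode(values)  # every value tied for the highest frequency

print(mean, median, modes)
```

Note how the single large value (110) pulls the mean above the median; the median is the more robust measure when the data are skewed.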

    Measuring the Dispersion of Data
    Quartiles, outliers and boxplots
    Variance and standard deviation (sample: s, population: σ)
    Five-number summary
    Boxplot
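The dispersion measures listed above can be sketched with the standard library as follows. The sample `values` is hypothetical, the quartiles use the "inclusive" interpolation method (other quartile conventions give slightly different numbers), and the 1.5 × IQR fences are the usual boxplot rule of thumb for flagging outliers.

```python
import statistics

# Hypothetical sample (illustrative values only).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Quartiles Q1, Q2 (median), Q3; the interpolation method is an assumption.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1                                    # interquartile range

# Five-number summary: min, Q1, median, Q3, max.
five_number = (min(values), q1, q2, q3, max(values))

# Boxplot rule of thumb: points beyond 1.5 * IQR from the quartiles are outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]

sample_var = statistics.variance(values)   # s^2, divides by n - 1
pop_var = statistics.pvariance(values)     # sigma^2, divides by n
s = statistics.stdev(values)               # sample standard deviation
sigma = statistics.pstdev(values)          # population standard deviation

print(five_number, outliers)
```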

Graphic Displays of Basic Statistical Descriptions
Boxplot
Histogram
Quantile plot
Quantile-quantile (q-q) plot
Scatter plot

Data dispersion characteristics
Numerical dimensions
Dispersion analysis on computed measures

Data Visualization

Categorization of visualization methods:
Pixel-oriented visualization techniques
Geometric projection visualization techniques
Icon-based visualization techniques
1. Chernoff Faces
2. Stick Figures
Hierarchical visualization techniques
1. Dimensional Stacking
2. Worlds-within-Worlds
3. Tree-Map
4. InfoCube
5. Three-D Cone Trees
Visualizing complex data and relations

Measuring Data Similarity and Dissimilarity

Similarity
Data matrix
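An n × p matrix: n data objects (rows) described by p attributes (columns); entry x_if is the value of attribute f for object i.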

Dissimilarity
Dissimilarity matrix
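An n × n triangular matrix that stores only the pairwise distances d(i, j), with d(i, i) = 0 and d(i, j) = d(j, i).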
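A small sketch of how a data matrix turns into a dissimilarity matrix. The four 2-attribute objects are hypothetical, and Euclidean distance (`math.dist`) is assumed as the default distance function; any other pairwise measure could be passed in instead.

```python
import math

# Hypothetical data matrix: n = 4 objects (rows), p = 2 numeric attributes (columns).
data_matrix = [
    (1, 2),
    (3, 5),
    (2, 0),
    (4, 5),
]

def dissimilarity_matrix(rows, distance=math.dist):
    """Return the full n x n matrix of pairwise distances d(i, j)."""
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]   # diagonal stays 0: d(i, i) = 0
    for i in range(n):
        for j in range(i):                   # compute each pair once (lower triangle)
            d = distance(rows[i], rows[j])
            matrix[i][j] = d
            matrix[j][i] = d                 # mirror, since d(i, j) = d(j, i)
    return matrix

dm = dissimilarity_matrix(data_matrix)
for row in dm:
    print(["%.2f" % d for d in row])
```

Because the matrix is symmetric with a zero diagonal, only the lower triangle actually needs to be stored.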
Proximity
Simple matching
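d(i, j) = (p - m) / p, where m is the number of attributes on which objects i and j match and p is the total number of attributes.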
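Simple matching for nominal attributes can be sketched in a few lines. The function name and the two example objects are hypothetical; the formula is d(i, j) = (p - m) / p with m the number of matching attributes and p the total number of attributes.

```python
def simple_matching_distance(a, b):
    """Dissimilarity between two objects described by p nominal attributes."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)                                    # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)    # number of matching attributes
    return (p - m) / p

# Hypothetical objects with three nominal attributes each.
obj1 = ("red", "circle", "small")
obj2 = ("red", "square", "small")

print(simple_matching_distance(obj1, obj2))
```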
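A contingency table for binary data (objects i and j described by p binary attributes):
              j = 1    j = 0    sum
    i = 1     q        r        q + r
    i = 0     s        t        s + t
    sum       q + s    r + t    p
Here q counts the attributes where both objects are 1, r where i is 1 and j is 0, s where i is 0 and j is 1, and t where both are 0.
Distance measure for symmetric binary variables
d(i, j) = (r + s) / (q + r + s + t)
Distance measure for asymmetric binary variables (negative matches t are ignored)
d(i, j) = (r + s) / (q + r + s)
Jaccard coefficient (similarity measure for asymmetric binary variables)
sim_Jaccard(i, j) = q / (q + r + s)
Jaccard coefficient is the same as "coherence":
coherence(i, j) = sup(i, j) / (sup(i) + sup(j) - sup(i, j)) = q / ((q + r) + (q + s) - q)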
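The binary measures can be sketched as below, with q, r, s, t as the counts of (1, 1), (1, 0), (0, 1) and (0, 0) attribute pairs for objects i and j. The function names and the two example vectors are hypothetical.

```python
def binary_counts(i, j):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of attributes")
    q = sum(1 for a, b in zip(i, j) if (a, b) == (1, 1))  # both 1
    r = sum(1 for a, b in zip(i, j) if (a, b) == (1, 0))  # i is 1, j is 0
    s = sum(1 for a, b in zip(i, j) if (a, b) == (0, 1))  # i is 0, j is 1
    t = sum(1 for a, b in zip(i, j) if (a, b) == (0, 0))  # both 0
    return q, r, s, t

def symmetric_distance(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)

def asymmetric_distance(i, j):
    q, r, s, _ = binary_counts(i, j)   # negative matches (t) are ignored
    return (r + s) / (q + r + s)       # undefined (ZeroDivisionError) if both vectors are all 0

def jaccard_similarity(i, j):
    q, r, s, _ = binary_counts(i, j)
    return q / (q + r + s)             # equals 1 - asymmetric_distance(i, j)

# Hypothetical objects with six binary attributes each.
obj_a = [1, 0, 1, 0, 0, 0]
obj_b = [1, 0, 1, 0, 1, 0]

print(binary_counts(obj_a, obj_b))
print(symmetric_distance(obj_a, obj_b), asymmetric_distance(obj_a, obj_b))
print(jaccard_similarity(obj_a, obj_b))
```

The asymmetric distance is larger than the symmetric one here because the three shared zeros are not counted as evidence of similarity.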

Standardizing Numeric Data
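Z-score: z = (x - μ) / σ, where x is the raw value to be standardized, μ is the mean of the attribute and σ is its standard deviation.
An alternative that is more robust to outliers replaces σ with the mean absolute deviation s_f = (|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|) / n, where m_f is the mean of attribute f.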
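A minimal sketch of standardization, assuming the z-score z = (x - μ) / σ with the population standard deviation, plus a variant that divides by the mean absolute deviation instead. The function names and the sample are hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize with the mean and the population standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

def mad_scores(values):
    """Standardize with the mean absolute deviation, which dampens outliers less harshly."""
    m = statistics.mean(values)
    s = sum(abs(x - m) for x in values) / len(values)
    return [(x - m) / s for x in values]

# Hypothetical attribute values (mean 5, population standard deviation 2).
values = [2, 4, 4, 4, 5, 5, 7, 9]

print(z_scores(values))
print(mad_scores(values))
```

Both functions fail with a division by zero when all values are identical, since such an attribute carries no spread to standardize.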
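Minkowski distance: a popular distance measure
L-h norm: d(i, j) = (|x_i1 - x_j1|^h + |x_i2 - x_j2|^h + ... + |x_ip - x_jp|^h)^(1/h), where i = (x_i1, ..., x_ip) and j = (x_j1, ..., x_jp) are two p-dimensional data objects and h is the order.
h = 1: Manhattan (city block, L1 norm) distance
h = 2: Euclidean (L2 norm) distance
h → ∞: "supremum" (Lmax norm, L∞ norm) distance, the maximum difference over any single attribute: d(i, j) = max_f |x_if - x_jf|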
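The three special cases can be checked with a short sketch. The function names and the two points are hypothetical; the supremum distance is written separately because it is the limit of the Minkowski distance as h grows, not a value of h that can be plugged in.

```python
def minkowski(x, y, h):
    """L-h norm distance between two equal-length numeric vectors (h >= 1)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def supremum(x, y):
    """Limit of the Minkowski distance as h -> infinity: the largest attribute difference."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical two-dimensional points.
x1 = (1, 2)
x2 = (3, 5)

print(minkowski(x1, x2, 1))   # Manhattan: |1 - 3| + |2 - 5|
print(minkowski(x1, x2, 2))   # Euclidean: sqrt(2^2 + 3^2)
print(supremum(x1, x2))       # supremum: max(2, 3)
```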

Ordinal Variables
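Replace each value x_if by its rank r_if in {1, ..., M_f}, where M_f is the number of ordered states of attribute f, then map the rank onto [0, 1] with z_if = (r_if - 1) / (M_f - 1). Dissimilarity is then computed on z_if as for interval-scaled variables.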
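A sketch of the rank-based mapping for ordinal attributes, z = (r - 1) / (M - 1), where r is the rank of a value and M the number of ordered states. The function name and the three-level scale are hypothetical.

```python
def ordinal_to_unit_interval(value, ordered_levels):
    """Map an ordinal value onto [0, 1] using its rank among the ordered levels."""
    m = len(ordered_levels)
    if m < 2:
        raise ValueError("need at least two ordered levels")
    rank = ordered_levels.index(value) + 1   # ranks run from 1 to M
    return (rank - 1) / (m - 1)

# Hypothetical ordinal scale, listed from lowest to highest.
levels = ["fair", "good", "excellent"]

z = [ordinal_to_unit_interval(v, levels) for v in ["excellent", "fair", "good"]]
print(z)

# Once mapped, values can be compared with any numeric distance, e.g. absolute difference.
print(abs(z[0] - z[1]))
```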

Attributes of Mixed Type
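d(i, j) = Σ_f δ_ij^(f) d_ij^(f) / Σ_f δ_ij^(f), where the sums run over the p attributes, d_ij^(f) is the dissimilarity contributed by attribute f, and the indicator δ_ij^(f) is 0 when the value of f is missing for i or j (or when f is asymmetric binary and both values are 0) and 1 otherwise.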
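One way to sketch a mixed-type dissimilarity is a weighted average of per-attribute dissimilarities, skipping attributes with missing values. The function name, the `kinds` and `ranges` parameters, and the example objects are all hypothetical; only nominal and numeric attributes are handled here, with numeric differences scaled by the attribute's range so each contribution lies in [0, 1].

```python
def mixed_distance(obj_i, obj_j, kinds, ranges):
    """Average per-attribute dissimilarity over attributes present in both objects.

    kinds[f]  : "nominal" or "numeric" (assumed labels)
    ranges[f] : max - min of attribute f, used only for numeric attributes
    """
    total = 0.0
    counted = 0
    for f, kind in enumerate(kinds):
        a, b = obj_i[f], obj_j[f]
        if a is None or b is None:
            continue                         # indicator is 0: attribute is skipped
        if kind == "nominal":
            d = 0.0 if a == b else 1.0       # match / mismatch
        elif kind == "numeric":
            d = abs(a - b) / ranges[f]       # normalized absolute difference
        else:
            raise ValueError("unknown attribute kind: %r" % kind)
        total += d
        counted += 1                         # indicator is 1
    if counted == 0:
        raise ValueError("no attribute is present in both objects")
    return total / counted

# Hypothetical objects: one nominal and one numeric attribute.
kinds = ["nominal", "numeric"]
ranges = [None, 46.0]
obj_i = ("A", 45.0)
obj_j = ("B", 22.0)

print(mixed_distance(obj_i, obj_j, kinds, ranges))
```

Ordinal attributes could be folded in by first mapping them onto [0, 1] by rank and treating them as numeric with range 1.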

Cosine Similarity
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where • is the vector dot product and ||d|| is the Euclidean length of vector d.
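A minimal sketch of cosine similarity on two term-frequency vectors. The function name and the two vectors are hypothetical; each position is assumed to hold the count of one term in a document.

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||) for equal-length numeric vectors."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)

# Hypothetical term-frequency vectors for two documents.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))
```

Because only the angle between the vectors matters, a long and a short document with the same term proportions get a similarity of 1.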
