Data Mining
Chapter Two
Data dispersion characteristics
Center
Mean: $\bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i$, $\mu = \frac{\sum x}{N}$
Weighted Mean: $\bar{x} = \frac{\sum_{i = 1}^n w_i x_i}{\sum_{i = 1}^n w_i}$
Median (for grouped data): $median = L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \times width$
Mode: $mean - mode = 3 \times (mean - median)$
mean > median, positively skewed
mean < median, negatively skewed
Quartiles: $Q_1$ (25th percentile), $Q_3$ (75th percentile)
Inter-quartile range: $IQR = Q_3 - Q_1$
Five number summary: min, $Q_1$, median, $Q_3$, max
Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
Outlier: usually, a value more than 1.5 × IQR above $Q_3$ or below $Q_1$
Variance:
unbiased estimation: $s^2 = \frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2 = \frac{1}{n - 1}\left[\sum_{i = 1}^n x_i^2 - \frac{1}{n}\left(\sum_{i = 1}^n x_i\right)^2\right]$
biased estimation: $\sigma^2 = \frac{1}{n} \sum_{i = 1}^n (x_i - \mu)^2 = \frac{1}{n} \sum_{i = 1}^n x_i^2 - \mu^2$
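As a quick check of the dispersion measures above, here is a minimal sketch, assuming numpy is available; the sample values are made up.

```python
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

mean = x.mean()
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])    # quartiles Q1, Q3
iqr = q3 - q1                          # inter-quartile range
s2_unbiased = x.var(ddof=1)            # 1/(n-1) estimator
sigma2_biased = x.var(ddof=0)          # 1/n estimator

# Outlier rule of thumb: more than 1.5 * IQR outside the quartiles
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(mean, median, (q1, q3), iqr, s2_unbiased, sigma2_biased, outliers)
```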
Pixel-Oriented Visualization Techniques
Similarity and Dissimilarity
|   | 1 | 0 | sum |
|---|---|---|---|
| 1 | q | r | q + r |
| 0 | s | t | s + t |
| sum | q + s | r + t | p |
Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$
Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$
Here, "asymmetric" means the two outcomes do not carry the same cost; in some data sets one outcome forms an absolute majority, so the negative matches $t$ are dropped from the denominator.
Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$
Minkowski distance (L-h norm): $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
Properties:
- $d(i, j) > 0$ if $i \neq j$, and $d(i, i) = 0$ (positive definiteness)
- $d(i, j) = d(j, i)$ (symmetry)
- $d(i, j) \leqslant d(i, k) + d(k, j)$ (triangle inequality)
A distance that satisfies these properties is a metric.
$h = 1$: Manhattan distance $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
$h = 2$: Euclidean distance $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
$h \rightarrow \infty$: supremum distance $d(i, j) = \lim_{h \rightarrow \infty} \left(\sum_{f = 1}^p |x_{if} - x_{jf}|^h\right)^{\frac{1}{h}} = \max_f |x_{if} - x_{jf}|$
Ordinal Variables: $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
Mixed-type attributes: $d(i, j) = \frac{\sum_{f = 1}^p \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f = 1}^p \delta_{ij}^{(f)}}$
Cosine similarity: $cos(d_1, d_2) = \frac{d_1 \cdot d_2}{||d_1|| \cdot ||d_2||}$, used to evaluate the similarity of documents or sentences.
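A minimal sketch, assuming numpy, of the Minkowski family and the cosine similarity defined above, on made-up vectors:

```python
import numpy as np

def minkowski(x, y, h):
    """L-h norm: h = 1 gives Manhattan, h = 2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** h) ** (1.0 / h)

def supremum(x, y):
    """Limit h -> infinity: the largest per-attribute difference."""
    return np.max(np.abs(x - y))

def cosine(d1, d2):
    """Cosine similarity between two vectors."""
    return np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

x = np.array([1.0, 2.0, 0.0])
y = np.array([3.0, 5.0, 1.0])
print(minkowski(x, y, 1), minkowski(x, y, 2), supremum(x, y), cosine(x, y))
```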
Chapter Three
Data Processing
Data cleaning, Data integration, Data reduction, Data transformation and data discretization.
$\chi^2$ (chi-square) test:
$\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$
The larger the $\chi^2$ value, the more likely the variables are related.
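A minimal sketch of the chi-square statistic for a hypothetical 2×2 contingency table; the observed counts are invented for illustration.

```python
import numpy as np

# Hypothetical observed counts for two binary attributes
observed = np.array([[250.0, 200.0],
                     [ 50.0, 1000.0]])

row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row @ col / observed.sum()   # expected counts under independence

chi2 = np.sum((observed - expected) ** 2 / expected)
print(chi2)  # the larger the value, the more likely the attributes are related
```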
Correlation coefficient (Pearson's product moment coefficient):
$r_{A, B} = \frac{\sum_{i = 1}^n (a_i - \bar{A})(b_i - \bar{B})}{(n - 1) \sigma_A \sigma_B} = \frac{\sum_{i = 1}^n (a_i b_i) - n \bar{A} \bar{B}}{(n - 1) \sigma_A \sigma_B}$
$r_{A, B} > 0$ means A and B are positively correlated.
Let ${a_k}' = (a_k - mean(A)) / std(A)$ and ${b_k}' = (b_k - mean(B)) / std(B)$; then $correlation(A, B) = A' \cdot B'$.
Covariance:
$Cov(A, B) = E((A - \bar{A})(B - \bar{B})) = \frac{\sum_{i = 1}^n (a_i - \bar{A})(b_i - \bar{B})}{n}$
$r_{A, B} = \frac{Cov(A, B)}{\sigma_A \sigma_B}$
$Cov(A, B) = E((A - \bar{A})(B - \bar{B})) = E(A \cdot B) - \bar{A} \bar{B}$
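A minimal sketch, assuming numpy and the $1/n$ population convention used above, checking that the two covariance forms agree and recovering the Pearson correlation; the sample values are made up.

```python
import numpy as np

A = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
B = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

cov1 = np.mean((A - A.mean()) * (B - B.mean()))   # E((A - mean A)(B - mean B))
cov2 = np.mean(A * B) - A.mean() * B.mean()       # E(A * B) - mean A * mean B
r_AB = cov1 / (A.std() * B.std())                 # Cov(A, B) / (sigma_A * sigma_B)
print(cov1, cov2, r_AB)
```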
Data reduction
Unsupervised:
- Latent Semantic Indexing (LSI): truncated SVD
- Principal Component Analysis (PCA)
- Independent Component Analysis (ICA)
- Canonical Correlation Analysis (CCA)
Supervised:
- Linear Discriminant Analysis (LDA)
Semi-supervised:
- Semi-supervised Discriminant Analysis (SDA)
Linear:
- Latent Semantic Indexing (LSI): truncated SVD
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Canonical Correlation Analysis (CCA)
Nonlinear:
- Nonlinear feature reduction using kernels
- Manifold learning
Dimensionality reduction (Feature reduction):
- Feature extraction
- Feature selection
Selection: choose a best subset of size d from the available p features.
Extraction: given p features (set X), extract d new features (set Z) by linear or non-linear combination of all the p features.
PCA
Given $\{x_1, ..., x_n\} \in \mathbb{R}^p$, the target is to find $a$ that maximizes $var(z)$, where $z = ax$.
$$\begin{aligned} var(z) &= E((z - \bar{z})^2)\\ &= \frac{1}{n} \sum_{i = 1}^n (ax_i - a\bar{x})^2\\ &= \frac{1}{n} \sum_{i = 1}^n a^T(x_i - \bar{x})(x_i - \bar{x})^T a\\ &= a^T S a\\ S &= \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})(x_i - \bar{x})^T \end{aligned}$$
which means $\max_a a^T S a, \ s.t. \ a^T a = 1$.
We use the Lagrange multiplier method to solve the problem:
$$L = a^T S a - \lambda(a^T a - 1), \qquad \frac{\partial L}{\partial a} = 2Sa - 2\lambda a = 0$$
So $\lambda$ and $a$ form an eigenvalue/eigenvector pair of $S$. Then $var(z) = a^T \lambda a = \lambda$, so the eigenvalues are taken from largest to smallest.
Next, if we want another principal component, solve $\max_{a_2} a_2^T S a_2, \ s.t. \ a_2^T a_2 = 1, \ cov(z^{(2)}, z^{(1)}) = 0$.
$cov(z^{(2)}, z^{(1)}) = a_2^T S a_1 = \lambda a_2^T a_1$, so the constraint forces $a_2^T a_1 = 0$; maximizing $a_2^T S a_2$ over unit vectors orthogonal to $a_1$ again gives $S a_2 = \lambda a_2$, and this $\lambda$ is the second largest eigenvalue.
Dimension reduction: $X \in \mathbb{R}^{p \times n} \rightarrow A^T X \in \mathbb{R}^{d \times n}$
Original data (reconstruction): $A^T X \in \mathbb{R}^{d \times n} \rightarrow \bar{X} = A(A^T X) \in \mathbb{R}^{p \times n}$
Main theoretical result:
The matrix $A$ consisting of the first $d$ eigenvectors of the covariance matrix $S$ solves the following optimization problem:
$\min_{A \in \mathbb{R}^{p \times d}} ||X - AA^TX||_F^2, \ s.t. \ A^TA = I_d$
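A minimal PCA sketch via eigendecomposition of the covariance matrix $S$, assuming the data matrix holds one sample per row (n × p, the transpose of the notation above); the function name and data are illustrative.

```python
import numpy as np

def pca(X, d):
    """X: n x p data matrix (one sample per row); returns A, the reduced data, and a reconstruction."""
    mean = X.mean(axis=0)
    Xc = X - mean                           # center the data
    S = Xc.T @ Xc / X.shape[0]              # covariance matrix, 1/n convention
    eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
    A = eigvecs[:, ::-1][:, :d]             # first d eigenvectors (p x d)
    Z = Xc @ A                              # reduced representation (n x d)
    X_rec = Z @ A.T + mean                  # reconstruction A(A^T X)
    return A, Z, X_rec

X = np.random.default_rng(0).normal(size=(100, 5))
A, Z, X_rec = pca(X, d=2)
print(A.shape, Z.shape, np.linalg.norm(X - X_rec))
```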
LDA(Linear Discriminant Analysis)
Find a transformation $a$ such that $a^T X_1$ and $a^T X_2$ are maximally separated and each class is minimally dispersed (maximum separation).
$\max\ (a(\bar{x_1} - \bar{x_2}))^2, \quad \min\ var(z_1), \quad \min\ var(z_2)$
target: $\max\ J = \frac{(a(\bar{x_1} - \bar{x_2}))^2}{var(z_1) + var(z_2)}$
Suppose there exist two classes $w_1, w_2$ and let $z = a^T x$.
$\tilde{\mu_i} = \frac{1}{n_i} \sum_{z \in w_i} z$
$\mu_i = \frac{1}{n_i} \sum_{x \in w_i} x, \quad \tilde{\mu_i} = a^T \mu_i$
$|\tilde{\mu_1} - \tilde{\mu_2}| = |a^T(\mu_1 - \mu_2)|$
$\tilde{s_i}^2 = \sum_{z \in w_i} (z - \tilde{\mu_i})^2$
$J(a) = \frac{(\tilde{\mu_1} - \tilde{\mu_2})^2}{\tilde{s_1}^2 + \tilde{s_2}^2}$
$\tilde{s_i}^2 = \sum_{y \in w_i} (y - \tilde{\mu_i})^2 = \sum_{x \in w_i} (a^Tx - a^T\mu_i)^2 = \sum_{x \in w_i} (a^Tx - a^T\mu_i)(a^Tx - a^T\mu_i)^T = \sum_{x \in w_i} a^T(x - \mu_i)(x - \mu_i)^T a = a^T S_i a$
within-class scatter matrix: $S_W = S_1 + S_2, \quad \tilde{s_1}^2 + \tilde{s_2}^2 = a^T S_W a$
$(\tilde{\mu_1} - \tilde{\mu_2})^2 = (a^T\mu_1 - a^T\mu_2)^2 = a^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T a = a^T S_B a$
between-class scatter matrix: $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$
$J(a) = \frac{a^T S_B a}{a^T S_W a}$
$S_B a = \lambda S_W a$, i.e. $S_W^{-1} S_B a = \lambda a$
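A minimal two-class LDA sketch following the derivation above; for two classes the leading eigenvector of $S_W^{-1} S_B$ is proportional to $S_W^{-1}(\mu_1 - \mu_2)$, which is what the code solves for (the sample data are made up).

```python
import numpy as np

def lda_direction(X1, X2):
    """Two-class LDA: the direction a maximizing J(a) = a^T S_B a / a^T S_W a."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)          # scatter of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)          # scatter of class 2
    S_W = S1 + S2                           # within-class scatter matrix
    a = np.linalg.solve(S_W, mu1 - mu2)     # proportional to S_W^{-1}(mu1 - mu2)
    return a / np.linalg.norm(a)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(50, 2))
X2 = rng.normal(loc=[3.0, 2.0], size=(50, 2))
a = lda_direction(X1, X2)
print(a, (X1 @ a).mean(), (X2 @ a).mean())
```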
Chapter Four
FP mining
itemset: A set of one or more items
k-itemset: $X = \{x_1, \ldots, x_k\}$
(absolute) support, or support count, of X: frequency or number of occurrences of the itemset $X$;
(relative) support, s: the fraction of transactions that contain $X$ (i.e., the probability that a transaction contains $X$).
An itemset $X$ is frequent if $X$'s support is no less than a minsup threshold.
Find all the rules $X \rightarrow Y$ with minimum support and confidence:
support, s: probability that a transaction contains $X \cup Y$;
confidence, c: conditional probability that a transaction having $X$ also contains $Y$.
closed-patterns and max-patterns
An itemset $X$ is closed if $X$ is frequent and there exists no super-pattern $Y \supset X$ with the same support as $X$;
An itemset $X$ is a max-pattern if $X$ is frequent and there exists no frequent super-pattern $Y \supset X$.
So a max-pattern is a closed-pattern.
Apriori
An important property: **any subset of a frequent itemset must be frequent**.
Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!
Method:
- Initially, scan DB once to get frequent 1-itemset;
- Generate length (k+1) candidate itemsets from length k frequent itemsets;
- Test the candidates against DB;
- Terminate when no frequent or candidate set can be generated;
Pseudo-code:
```
Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return the union of all Lk;
```
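A runnable sketch of the pseudo-code above, assuming transactions are given as sets and `min_support` is an absolute count; the example database is made up.

```python
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    count = lambda c: sum(c <= t for t in transactions)

    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if count(frozenset([i])) >= min_support}
    frequent = set(Lk)
    k = 1
    while Lk:
        # join step: generate (k+1)-candidates from frequent k-itemsets
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = {c for c in Ck1 if count(c) >= min_support}
        frequent |= Lk
        k += 1
    return frequent

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(db, min_support=2))
```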
Major computational challenges:
- Multiple scans of transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates
Improving Apriori: general ideas:
- Reduce passes of transaction database scans
- Shrink number of candidates
- Facilitate support counting of candidates
FP-growth
Here is a linked reference [step 3, page 28], which says: "Recursively mine conditional FP-trees and grow the frequent patterns obtained so far. If the conditional FP-tree contains a single path, simply enumerate all the patterns."
Mining sequential patterns
sequential patterns:
GSP
Chapter Five
Decision Tree
Bayes Classification Methods
Support Vector Machines
Decision Tree
It can be derived from a probability perspective: we compute the probability of every output given the input. If we assume every attribute is conditionally independent, then $P(X|C) = \prod P(X_i|C)$, so $\log P(X|C) = \sum \log P(X_i|C)$, which is why the cost function uses $\log$. To understand it better, we can borrow the concept of entropy from thermodynamics.
$H(Y) = - \sum_{i = 1}^m p_i \log(p_i)$ where $p_i = P(Y = y_i)$
$H(Y|X) = \sum_x p(x) H(Y|X = x)$
$Info(D) = -\sum_{i = 1}^m p_i \log_2(p_i)$
$Info_A(D) = \sum_{j = 1}^v \frac{|D_j|}{|D|} \times Info(D_j)$
$Gain(A) = Info(D) - Info_A(D)$
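A minimal sketch computing $Info(D)$, $Info_A(D)$ and $Gain(A)$ for a categorical attribute; the toy attribute/label lists are made up.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def info_gain(x, y):
    """Gain(A) = Info(D) - Info_A(D) for attribute values x and class labels y."""
    n = len(y)
    gain = entropy(y)                               # Info(D)
    for v in set(x):
        y_v = [yi for xi, yi in zip(x, y) if xi == v]
        gain -= (len(y_v) / n) * entropy(y_v)       # subtract |D_j|/|D| * Info(D_j)
    return gain

x = ["youth", "youth", "middle", "senior", "senior", "middle"]
y = ["no", "no", "yes", "yes", "no", "yes"]
print(info_gain(x, y))
```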
Bayes Classification Methods
First, we know that $P(B) = \sum_{i = 1}^M P(B|A_i)P(A_i)$, and $P(H|X) = \frac{P(X|H)P(H)}{P(X)}$.
Assuming all attributes are conditionally independent, $P(X|C_i) = \prod_{k = 1}^n P(x_k|C_i)$.
Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero.
Use Laplacian correction:
- Adding 1 to each case
- The “corrected” prob. estimates are close to their “uncorrected” counterparts
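A minimal categorical naive Bayes sketch with the Laplacian correction (add 1 to each case); the training data, feature value counts, and helper names are hypothetical.

```python
from collections import Counter, defaultdict

def train(X, y):
    class_counts = Counter(y)
    value_counts = defaultdict(Counter)      # value_counts[(feature, class)][value]
    for xs, c in zip(X, y):
        for f, v in enumerate(xs):
            value_counts[(f, c)][v] += 1
    return class_counts, value_counts

def predict(xs, class_counts, value_counts, n_values):
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, nc in class_counts.items():
        p = nc / total                                        # prior P(C_i)
        for f, v in enumerate(xs):
            # Laplacian correction: add 1 to each count, n_values[f] to the denominator
            p *= (value_counts[(f, c)][v] + 1) / (nc + n_values[f])
        if p > best_p:
            best, best_p = c, p
    return best

X = [["sunny", "hot"], ["sunny", "mild"], ["rain", "mild"], ["rain", "hot"]]
y = ["no", "yes", "yes", "no"]
n_values = [2, 2]                # distinct values per feature
print(predict(["sunny", "mild"], *train(X, y), n_values))
```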
Support Vector Machines
Model Evaluation and Selection
Confusion Matrix:
| Actual class \ Predicted class | $C_1$ | $\neg C_1$ |
|---|---|---|
| $C_1$ | True Positive (TP) | False Negative (FN) |
| $\neg C_1$ | False Positive (FP) | True Negative (TN) |
Accuracy: $\frac{TP + TN}{ALL}$
Error rate: $\frac{FP + FN}{ALL}$
Sensitivity: $\frac{TP}{P}$
Specificity: $\frac{TN}{N}$
Precision: $\frac{TP}{TP + FP}$
Recall: $\frac{TP}{TP + FN}$
F measure: $\frac{2 \times Precision \times Recall}{Precision + Recall}$
F-beta measure: $\frac{(1 + \beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall}$
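A minimal sketch computing the measures above from hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts
TP, FN, FP, TN = 90, 210, 140, 9560
P, N = TP + FN, FP + TN
ALL = P + N

accuracy    = (TP + TN) / ALL
error_rate  = (FP + FN) / ALL
sensitivity = TP / P                 # also called recall
specificity = TN / N
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
f1          = 2 * precision * recall / (precision + recall)
beta        = 2.0
f_beta      = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(accuracy, error_rate, sensitivity, specificity, precision, recall, f1, f_beta)
```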
Holdout method
Cross-validation
Bootstrap
Estimating Confidence Intervals
t-test
ROC curves
Chapter Six
K-means
K-medoids
choose the data point closest to the K-means center as the cluster representative.
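A minimal K-means sketch, assuming numpy, with a K-medoid then chosen as the data point closest to each final center; the generated data are synthetic.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center moves to the mean of its points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)), rng.normal(3.0, 0.5, size=(30, 2))])
centers, labels = kmeans(X, k=2)
# K-medoid choice: the actual data point closest to each center
medoids = [X[labels == j][np.linalg.norm(X[labels == j] - centers[j], axis=1).argmin()]
           for j in range(2)]
print(centers, medoids)
```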