1
The decision tree cannot be simplified.
2
a
$Gini = 1 - (0.5)^2 - (0.5)^2 = 0.5$
b
Each customer ID value leads to a node with $Gini = 0$, so overall: $Gini_{(CustomerID)} = 0$
c
Gender | Male | Female |
---|---|---|
C0 | 6 | 4 |
C1 | 4 | 6 |
Total | 10 | 10 |
$$Gini_{(Male)} = 1 - (0.6)^2 - (0.4)^2 = 0.48$$
$$Gini_{(Female)} = 1 - (0.4)^2 - (0.6)^2 = 0.48$$
So overall:
$$Gini_{(Gender)} = 0.5 \times Gini_{(Male)} + 0.5 \times Gini_{(Female)} = 0.48$$
d
Car type | Family | Sports | Luxury |
---|---|---|---|
C0 | 1 | 8 | 1 |
C1 | 3 | 0 | 7 |
Total | 4 | 8 | 8 |
$$Gini_{(Family)} = 1 - \left(\frac{1}{4}\right)^2 - \left(\frac{3}{4}\right)^2 = 0.375$$
$$Gini_{(Sports)} = 1 - \left(\frac{8}{8}\right)^2 - \left(\frac{0}{8}\right)^2 = 0$$
$$Gini_{(Luxury)} = 1 - \left(\frac{1}{8}\right)^2 - \left(\frac{7}{8}\right)^2 = 0.21875$$
So overall:
$$Gini_{(CarType)} = \frac{4}{20} \times Gini_{(Family)} + \frac{8}{20} \times Gini_{(Sports)} + \frac{8}{20} \times Gini_{(Luxury)} = 0.1625$$
e
Shirt size | Extra large | Large | Medium | Small |
---|---|---|---|---|
C0 | 2 | 2 | 3 | 3 |
C1 | 2 | 2 | 4 | 2 |
Total | 4 | 4 | 7 | 5 |
$$Gini_{(EL)} = 1 - (0.5)^2 - (0.5)^2 = 0.5$$
$$Gini_{(L)} = 1 - (0.5)^2 - (0.5)^2 = 0.5$$
$$Gini_{(M)} = 1 - \left(\frac{3}{7}\right)^2 - \left(\frac{4}{7}\right)^2 = 0.4898$$
$$Gini_{(S)} = 1 - \left(\frac{3}{5}\right)^2 - \left(\frac{2}{5}\right)^2 = 0.48$$
So overall:
$$Gini_{(ShirtSize)} = \frac{4}{20} \times Gini_{(EL)} + \frac{4}{20} \times Gini_{(L)} + \frac{7}{20} \times Gini_{(M)} + \frac{5}{20} \times Gini_{(S)} = 0.4914$$
f
Of Gender, Car type, and Shirt size, Car type is the better attribute, because it has the lowest $Gini$ value.
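The weighted Gini values for the three splits can be re-derived from the class counts in the tables. The following is a minimal sketch; the function names `gini` and `weighted_gini` are illustrative and not part of the original solution.

```python
def gini(counts):
    """Gini index of one node from its per-class record counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(children):
    """Size-weighted Gini of a split; `children` holds one count tuple per child."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


# (C0, C1) counts for each attribute value, copied from the tables above.
gender = [(6, 4), (4, 6)]
car_type = [(1, 3), (8, 0), (1, 7)]
shirt_size = [(2, 2), (2, 2), (3, 4), (3, 2)]

for name, split in [("Gender", gender), ("Car type", car_type),
                    ("Shirt size", shirt_size)]:
    print(f"{name}: {weighted_gini(split):.4f}")
```

The attribute with the smallest printed value is the one preferred by the Gini criterion.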
g
Every new customer is assigned a new ID, so the Customer ID attribute has no predictive power.
3
a
$$Entropy = -\frac{4}{9}\times\log_2\left(\frac{4}{9}\right) - \frac{5}{9}\times\log_2\left(\frac{5}{9}\right) = 0.9911$$
b
- For attribute $a_1$:

$a_1$ | + | - |
---|---|---|
T | 3 | 1 |
F | 1 | 4 |
Total | 4 | 5 |

$$Entropy_{(a_1)} = \frac{4}{9}\times\left[-\frac{1}{4}\log_2\left(\frac{1}{4}\right) - \frac{3}{4}\log_2\left(\frac{3}{4}\right)\right] + \frac{5}{9}\times\left[-\frac{1}{5}\log_2\left(\frac{1}{5}\right) - \frac{4}{5}\log_2\left(\frac{4}{5}\right)\right] = 0.7616$$
So the information gain is: $\Delta = 0.9911 - 0.7616 = 0.2294$
- For attribute $a_2$:

$a_2$ | + | - |
---|---|---|
T | 2 | 3 |
F | 2 | 2 |
Total | 4 | 5 |

$$Entropy_{(a_2)} = \frac{5}{9}\times\left[-\frac{2}{5}\log_2\left(\frac{2}{5}\right) - \frac{3}{5}\log_2\left(\frac{3}{5}\right)\right] + \frac{4}{9}\times\left[-\frac{1}{2}\log_2\left(\frac{1}{2}\right) - \frac{1}{2}\log_2\left(\frac{1}{2}\right)\right] = 0.9839$$
So the information gain is: $\Delta = 0.9911 - 0.9839 = 0.0072$
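As a cross-check on the two gains, the sketch below recomputes them from the (+, -) counts in the $a_1$ and $a_2$ tables. The helper names are illustrative. The gains are computed from unrounded entropies, which is why $a_1$ comes out as 0.2294 rather than the 0.2295 one would get by subtracting the rounded figures.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a node given its per-class record counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    split_entropy = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - split_entropy


parent = (4, 5)                # (+, -) counts over all nine records
a1_split = [(3, 1), (1, 4)]    # branches a1 = T and a1 = F
a2_split = [(2, 3), (2, 2)]    # branches a2 = T and a2 = F

print(f"gain(a1) = {information_gain(parent, a1_split):.4f}")
print(f"gain(a2) = {information_gain(parent, a2_split):.4f}")
```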
c
$a_3$ | Class | Split point | Entropy | Information gain |
---|---|---|---|---|
1.0 | + | 2.0 | 0.8484 | 0.1427 |
3.0 | - | 3.5 | 0.9885 | 0.0026 |
4.0 | + | 4.5 | 0.9183 | 0.0728 |
5.0, 5.0 | -, - | 5.5 | 0.9839 | 0.0072 |
6.0 | + | 6.5 | 0.9728 | 0.0183 |
7.0, 7.0 | +, - | 7.5 | 0.8889 | 0.1022 |

The best split on $a_3$ is at 2.0, which gives the highest information gain (0.1427).
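The candidate split points can be swept programmatically. The sketch below is only an approximation of the exercise data: the table lists eight records, while the overall class counts are 4 positive and 5 negative, so the ninth record `(8.0, '-')` is an assumption added to make the totals match. With that assumption the entropies above are reproduced.

```python
from math import log2

# (a3 value, class) pairs. The first eight come from the table above; the
# ninth, (8.0, '-'), is an ASSUMPTION so the class totals are 4 (+) and 5 (-).
records = [(1.0, '+'), (3.0, '-'), (4.0, '+'), (5.0, '-'), (5.0, '-'),
           (6.0, '+'), (7.0, '+'), (7.0, '-'), (8.0, '-')]


def entropy(labels):
    """Entropy in bits of a non-empty list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)


def gain_at(split):
    """Information gain of the binary test a3 <= split."""
    left = [c for v, c in records if v <= split]
    right = [c for v, c in records if v > split]
    n = len(records)
    after = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy([c for _, c in records]) - after


gains = {s: gain_at(s) for s in (2.0, 3.5, 4.5, 5.5, 6.5, 7.5)}
best = max(gains, key=gains.get)
for s, g in gains.items():
    print(f"split {s}: gain {g:.4f}")
print("best split:", best)
```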
d
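The best split is $a_1$, because it has the highest information gain (0.2294, compared with 0.0072 for $a_2$ and at most 0.1427 for $a_3$).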
e
Classification error rate of $a_1$:
$$error_{(a_1)} = \frac{2}{9}$$
Classification error rate of $a_2$:
$$error_{(a_2)} = \frac{4}{9}$$
So $a_1$ is the best split, because its classification error rate is lower.
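The two error rates follow from letting each branch predict its majority class and counting the records left over. A minimal sketch using the (+, -) counts from part (b); `split_error` is an illustrative name:

```python
from fractions import Fraction


def split_error(children):
    """Error rate of a split when every child node predicts its majority class."""
    n = sum(sum(child) for child in children)
    mistakes = sum(sum(child) - max(child) for child in children)
    return Fraction(mistakes, n)


# (+, -) counts in the T and F branches of each attribute.
a1_split = [(3, 1), (1, 4)]
a2_split = [(2, 3), (2, 2)]

print("error(a1) =", split_error(a1_split))  # 2/9
print("error(a2) =", split_error(a2_split))  # 4/9
```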
f
$$Gini_{(a_1)} = \frac{4}{9}\times\left[1 - \left(\frac{3}{4}\right)^2 - \left(\frac{1}{4}\right)^2\right] + \frac{5}{9}\times\left[1 - \left(\frac{4}{5}\right)^2 - \left(\frac{1}{5}\right)^2\right] = 0.3444$$
$$Gini_{(a_2)} = \frac{5}{9}\times\left[1 - \left(\frac{3}{5}\right)^2 - \left(\frac{2}{5}\right)^2\right] + \frac{4}{9}\times\left[1 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2\right] = 0.4889$$
So $a_1$ is the best split, because its $Gini$ value is lower.
4
a
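No, the rules are not mutually exclusive.
b
Yes, the rule set is exhaustive.
c
Yes, ordering is needed: a test record is usually not determined by the mileage attribute alone, so it can trigger more than one rule.
d
No default class is needed, because every test record triggers at least one rule.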
5
a
- R1: $\frac{4}{5} = 0.8$
- R2: $\frac{30}{40} = 0.75$
- R3: $\frac{100}{190} = 0.526$
- Best rule: R1
- Worst rule: R3
b
FOIL's information gain: $FOIL\ gain = p_1\times\left(\log_2\frac{p_1}{p_1+n_1} - \log_2\frac{p_0}{p_0+n_0}\right)$
- R1: $4\times\left(\log_2\frac{4}{5} - \log_2\frac{100}{500}\right) = 8$
- R2: $30\times\left(\log_2\frac{30}{40} - \log_2\frac{100}{500}\right) = 57.207$
- R3: $100\times\left(\log_2\frac{100}{190} - \log_2\frac{100}{500}\right) = 139.593$
- Best rule: R3
- Worst rule: R1
c
Likelihood ratio statistic: $R = 2\sum\limits_{i=1}^{k} f_i \log_2\left(\frac{f_i}{e_i}\right)$
- R1:
  Expected frequency of the positive class: $e_+ = 5\times\frac{100}{500} = 1$
  Expected frequency of the negative class: $e_- = 5\times\frac{400}{500} = 4$
  $R = 2\times\left(4\times\log_2\frac{4}{1} + 1\times\log_2\frac{1}{4}\right) = 12$
- R2:
  Expected frequency of the positive class: $e_+ = 40\times\frac{100}{500} = 8$
  Expected frequency of the negative class: $e_- = 40\times\frac{400}{500} = 32$
  $R = 2\times\left(30\times\log_2\frac{30}{8} + 10\times\log_2\frac{10}{32}\right) = 80.852$
- R3:
  Expected frequency of the positive class: $e_+ = 190\times\frac{100}{500} = 38$
  Expected frequency of the negative class: $e_- = 190\times\frac{400}{500} = 152$
  $R = 2\times\left(100\times\log_2\frac{100}{38} + 90\times\log_2\frac{90}{152}\right) = 143.092$
- Best rule: R3
- Worst rule: R1
d
Laplace measure: $Laplace = \frac{f_+ + 1}{n + k}$
- R1: $\frac{4+1}{5+2} = 0.714$
- R2: $\frac{30+1}{40+2} = 0.738$
- R3: $\frac{100+1}{190+2} = 0.526$
- Best rule: R2
- Worst rule: R3
e
m-estimate measure: $m\text{-}estimate = \frac{f_+ + k p_+}{n + k}$
- R1: $\frac{4 + 2\times 0.2}{5+2} = 0.629$
- R2: $\frac{30 + 2\times 0.2}{40+2} = 0.724$
- R3: $\frac{100 + 2\times 0.2}{190+2} = 0.523$
- Best rule: R2
- Worst rule: R3
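All five rule-quality measures use the same inputs: the positive and negative counts each rule covers, and the class totals of 100 positive and 400 negative records implied by the $\frac{100}{500}$ and $\frac{400}{500}$ priors above. The sketch below recomputes them side by side; the function names and the `RULES` dictionary are illustrative.

```python
from math import log2

P0, N0 = 100, 400  # positives / negatives in the full data set
RULES = {"R1": (4, 1), "R2": (30, 10), "R3": (100, 90)}  # (pos, neg) covered


def accuracy(p, n):
    return p / (p + n)


def foil_gain(p, n):
    return p * (log2(p / (p + n)) - log2(P0 / (P0 + N0)))


def likelihood_ratio(p, n):
    covered = p + n
    e_pos = covered * P0 / (P0 + N0)  # expected positives under the class prior
    e_neg = covered * N0 / (P0 + N0)  # expected negatives under the class prior
    return 2 * (p * log2(p / e_pos) + n * log2(n / e_neg))


def laplace(p, n, k=2):
    return (p + 1) / (p + n + k)


def m_estimate(p, n, k=2):
    return (p + k * P0 / (P0 + N0)) / (p + n + k)


for name, (p, n) in RULES.items():
    print(name, round(accuracy(p, n), 3), round(foil_gain(p, n), 3),
          round(likelihood_ratio(p, n), 3), round(laplace(p, n), 3),
          round(m_estimate(p, n), 3))
```

Accuracy favours the small, nearly pure rule R1, FOIL's gain and the likelihood ratio favour the high-coverage rule R3, and the two smoothed estimates land on R2.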
6
a
$$P(A=0 \mid +) = \frac{2}{5} = 0.4$$
$$P(A=1 \mid +) = \frac{3}{5} = 0.6$$
$$P(A=0 \mid -) = \frac{3}{5} = 0.6$$
$$P(A=1 \mid -) = \frac{2}{5} = 0.4$$
$$P(B=0 \mid +) = \frac{4}{5} = 0.8$$
$$P(B=1 \mid +) = \frac{1}{5} = 0.2$$
$$P(B=0 \mid -) = \frac{3}{5} = 0.6$$
$$P(B=1 \mid -) = \frac{2}{5} = 0.4$$
$$P(C=0 \mid +) = \frac{3}{5} = 0.6$$
$$P(C=1 \mid +) = \frac{2}{5} = 0.4$$
$$P(C=0 \mid -) = \frac{0}{5} = 0$$
$$P(C=1 \mid -) = \frac{5}{5} = 1$$
b
Let $P(A=0,B=1,C=0) = P_b$.
$$\begin{aligned}&P(+ \mid A=0,B=1,C=0)\\=&\frac{P(A=0,B=1,C=0 \mid +)\times P(+)}{P(A=0,B=1,C=0)}\\=&\frac{P(A=0 \mid +)\cdot P(B=1 \mid +)\cdot P(C=0 \mid +)\times P(+)}{P(A=0,B=1,C=0)}\\=&\frac{0.4\times0.2\times0.6\times0.5}{P_b}\\=&\frac{0.024}{P_b}\end{aligned}$$
$$\begin{aligned}&P(- \mid A=0,B=1,C=0)\\=&\frac{P(A=0,B=1,C=0 \mid -)\times P(-)}{P(A=0,B=1,C=0)}\\=&\frac{P(A=0 \mid -)\cdot P(B=1 \mid -)\cdot P(C=0 \mid -)\times P(-)}{P(A=0,B=1,C=0)}\\=&\frac{0.6\times0.4\times0\times0.5}{P_b}\\=&\frac{0}{P_b}\end{aligned}$$
Since $\frac{0.024}{P_b} > 0$, the test record is predicted to be in class +.
c
Using the m-estimate (with $p = \frac{1}{2}$ and $m = 4$):
$$m\text{-}estimate = \frac{f + \frac{1}{2}\times 4}{5+4} = \frac{f+2}{9}$$
where $f$ is the number of records of the class that have the given attribute value.
$$P(A=0 \mid +) = \frac{4}{9}$$
$$P(A=1 \mid +) = \frac{5}{9}$$
$$P(A=0 \mid -) = \frac{5}{9}$$
$$P(A=1 \mid -) = \frac{4}{9}$$
$$P(B=0 \mid +) = \frac{6}{9}$$
$$P(B=1 \mid +) = \frac{3}{9}$$
$$P(B=0 \mid -) = \frac{5}{9}$$
$$P(B=1 \mid -) = \frac{4}{9}$$
$$P(C=0 \mid +) = \frac{5}{9}$$
$$P(C=1 \mid +) = \frac{4}{9}$$
$$P(C=0 \mid -) = \frac{2}{9}$$
$$P(C=1 \mid -) = \frac{7}{9}$$
d
The computation is the same as in part (b):
$$\begin{aligned}&P(+ \mid A=0,B=1,C=0)\\=&\frac{\frac{4}{9}\times\frac{3}{9}\times\frac{5}{9}\times0.5}{P_b}\\=&\frac{0.0412}{P_b}\end{aligned}$$
$$\begin{aligned}&P(- \mid A=0,B=1,C=0)\\=&\frac{\frac{5}{9}\times\frac{4}{9}\times\frac{2}{9}\times0.5}{P_b}\\=&\frac{0.0274}{P_b}\end{aligned}$$
Since $\frac{0.0412}{P_b} > \frac{0.0274}{P_b}$, the test record is predicted to be in class +.
e
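The m-estimate approach is better.
Reason: conditional probabilities of exactly 0 should be avoided where possible, because a single zero factor forces the whole posterior for that class to 0.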
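Parts (a) to (d) can be reproduced from the per-class counts behind the fractions in part (a). The sketch below is illustrative only: `ZERO_COUNTS`, `cond_prob` and `score` are hypothetical names, and `score` returns the unnormalised posterior (the numerator over $P_b$), which is all that is needed to compare the two classes.

```python
from fractions import Fraction

# Number of records with attribute value 0 in each class (5 records per class),
# read off the fractions in part (a): e.g. P(A=0|+) = 2/5.
ZERO_COUNTS = {"+": {"A": 2, "B": 4, "C": 3},
               "-": {"A": 3, "B": 3, "C": 0}}
CLASS_SIZE = 5
PRIOR = Fraction(1, 2)


def cond_prob(attr, value, label, m=0, p=Fraction(1, 2)):
    """P(attr=value | label); m=0 is the raw frequency, m>0 the m-estimate."""
    zeros = ZERO_COUNTS[label][attr]
    count = zeros if value == 0 else CLASS_SIZE - zeros
    return (count + m * p) / Fraction(CLASS_SIZE + m)


def score(record, label, m=0):
    """Unnormalised posterior: class prior times the product of conditionals."""
    result = PRIOR
    for attr, value in record.items():
        result *= cond_prob(attr, value, label, m)
    return result


test_record = {"A": 0, "B": 1, "C": 0}
for m in (0, 4):
    scores = {label: score(test_record, label, m) for label in ("+", "-")}
    print(f"m={m}:", {k: round(float(v), 4) for k, v in scores.items()})
```

With `m=0` the negative class scores exactly 0 because of the single zero count for $C=0$; with `m=4` both classes receive a non-zero score and the comparison reflects all three attributes.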
7
a
$$P(A=1 \mid +) = \frac{3}{5} = 0.6$$
$$P(B=1 \mid +) = \frac{2}{5} = 0.4$$
$$P(C=1 \mid +) = \frac{4}{5} = 0.8$$
$$P(A=1 \mid -) = \frac{2}{5} = 0.4$$
$$P(B=1 \mid -) = \frac{2}{5} = 0.4$$
$$P(C=1 \mid -) = \frac{1}{5} = 0.2$$
b
Let $P(A=1,B=1,C=1) = P_b$.
$$\begin{aligned}&P(+ \mid A=1,B=1,C=1)\\=&\frac{0.6\times0.4\times0.8\times0.5}{P_b}\\=&\frac{0.096}{P_b}\end{aligned}$$
$$\begin{aligned}&P(- \mid A=1,B=1,C=1)\\=&\frac{0.4\times0.4\times0.2\times0.5}{P_b}\\=&\frac{0.016}{P_b}\end{aligned}$$
Since $\frac{0.096}{P_b} > \frac{0.016}{P_b}$, the test record is predicted to be in class +.
c
$$P(A=1) = \frac{1}{2}$$
$$P(B=1) = \frac{2}{5}$$
$$P(A=1,B=1) = \frac{1}{5}$$
We have: $P(A=1)\times P(B=1) = \frac{1}{2}\times\frac{2}{5} = \frac{1}{5} = P(A=1,B=1)$
So A and B are independent.
d
$$P(A=1) = \frac{1}{2}$$
$$P(B=0) = \frac{3}{5}$$
$$P(A=1,B=0) = \frac{3}{10}$$
We have: $P(A=1)\times P(B=0) = \frac{1}{2}\times\frac{3}{5} = \frac{3}{10} = P(A=1,B=0)$
So A and B are independent.
e
$$P(A=1 \mid +) = \frac{3}{5}$$
$$P(B=1 \mid +) = \frac{2}{5}$$
$$P(A=1,B=1 \mid +) = \frac{1}{5}$$
However, $P(A=1 \mid +)\times P(B=1 \mid +) = \frac{6}{25} \neq \frac{1}{5} = P(A=1,B=1 \mid +)$
So, given class +, A and B are not conditionally independent.
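The three independence checks in parts (c) to (e) all compare a joint probability with the product of the corresponding marginals. A small sketch with exact fractions, so that no floating-point tolerance is needed (`is_independent` is an illustrative helper):

```python
from fractions import Fraction


def is_independent(p_a, p_b, p_ab):
    """True when the joint probability equals the product of the marginals."""
    return p_a * p_b == p_ab


# (c) A = 1 and B = 1, unconditionally
print(is_independent(Fraction(1, 2), Fraction(2, 5), Fraction(1, 5)))   # True
# (d) A = 1 and B = 0, unconditionally
print(is_independent(Fraction(1, 2), Fraction(3, 5), Fraction(3, 10)))  # True
# (e) A = 1 and B = 1, given class +
print(is_independent(Fraction(3, 5), Fraction(2, 5), Fraction(1, 5)))   # False
```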
8
a
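The naive Bayes classifier performs poorly on this data set, because the conditional probabilities of each distinguishing attribute are the same for class A and class B.
b
Yes, because the conditional probabilities differ across the four sub-classes.
c
On the two-class problem the decision tree performs poorly, because splitting on the distinguishing attributes does not reduce the entropy; with four classes its performance improves.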
9
a
- Mileage
  $P(\text{Mileage}=\text{High}) = \frac{20}{40} = 0.5$
  $P(\text{Mileage}=\text{Low}) = \frac{20}{40} = 0.5$
- Air conditioner
  $P(\text{Air conditioner}=\text{Working}) = \frac{25}{40} = 0.625$
  $P(\text{Air conditioner}=\text{Broken}) = \frac{15}{40} = 0.375$
- Engine
  $P(\text{Engine}=\text{Good} \mid \text{Mileage}=\text{High}) = \frac{10}{20} = 0.5$
  $P(\text{Engine}=\text{Bad} \mid \text{Mileage}=\text{High}) = \frac{10}{20} = 0.5$
  $P(\text{Engine}=\text{Good} \mid \text{Mileage}=\text{Low}) = \frac{15}{20} = 0.75$
  $P(\text{Engine}=\text{Bad} \mid \text{Mileage}=\text{Low}) = \frac{5}{20} = 0.25$
- Car value
  $P(\text{Car value}=\text{High} \mid \text{Engine}=\text{Good}, \text{Air conditioner}=\text{Working}) = \frac{12}{16} = 0.75$
  $P(\text{Car value}=\text{Low} \mid \text{Engine}=\text{Good}, \text{Air conditioner}=\text{Working}) = \frac{4}{16} = 0.25$
  $P(\text{Car value}=\text{High} \mid \text{Engine}=\text{Good}, \text{Air conditioner}=\text{Broken}) = \frac{6}{9} = 0.667$
  $P(\text{Car value}=\text{Low} \mid \text{Engine}=\text{Good}, \text{Air conditioner}=\text{Broken}) = \frac{3}{9} = 0.333$
  $P(\text{Car value}=\text{High} \mid \text{Engine}=\text{Bad}, \text{Air conditioner}=\text{Working}) = \frac{2}{9} = 0.222$
  $P(\text{Car value}=\text{Low} \mid \text{Engine}=\text{Bad}, \text{Air conditioner}=\text{Working}) = \frac{7}{9} = 0.778$
  $P(\text{Car value}=\text{High} \mid \text{Engine}=\text{Bad}, \text{Air conditioner}=\text{Broken}) = 0$
  $P(\text{Car value}=\text{Low} \mid \text{Engine}=\text{Bad}, \text{Air conditioner}=\text{Broken}) = 1$
b
$$\begin{aligned}&P(\text{Engine}=\text{Bad}, \text{Air conditioner}=\text{Broken})\\=&P(\text{Engine}=\text{Bad}, \text{Air conditioner}=\text{Broken}, \text{Mileage}=\text{High}, \text{Car value}=\text{High})\\&+P(\text{Engine}=\text{Bad}, \text{Air conditioner}=\text{Broken}, \text{Mileage}=\text{High}, \text{Car value}=\text{Low})\\&+P(\text{Engine}=\text{Bad}, \text{Air conditioner}=\text{Broken}, \text{Mileage}=\text{Low}, \text{Car value}=\text{High})\\&+P(\text{Engine}=\text{Bad}, \text{Air conditioner}=\text{Broken}, \text{Mileage}=\text{Low}, \text{Car value}=\text{Low})\\=&P(\text{Car value}=\text{High} \mid \text{Engine}=\text{Bad}, \text{Air conditioner}=\text{Broken})\\&\times P(\text{Engine}=\text{Bad} \mid \text{Mileage}=\text{High})\times P(\text{Mileage}=\text{High})\times P(\text{Air conditioner}=\text{Broken})\\&+P(\text{Car value}=\text{Low} \mid \text{Engine}=\text{Bad}, \text{Air conditioner}=\text{Broken})\\&\times P(\text{Engine}=\text{Bad} \mid \text{Mileage}=\text{High})\times P(\text{Mileage}=\text{High})\times P(\text{Air conditioner}=\text{Broken})\\&+P(\text{Car value}=\text{High} \mid \text{Engine}=\text{Bad}, \text{Air conditioner}=\text{Broken})\\&\times P(\text{Engine}=\text{Bad} \mid \text{Mileage}=\text{Low})\times P(\text{Mileage}=\text{Low})\times P(\text{Air conditioner}=\text{Broken})\\&+P(\text{Car value}=\text{Low} \mid \text{Engine}=\text{Bad}, \text{Air conditioner}=\text{Broken})\\&\times P(\text{Engine}=\text{Bad} \mid \text{Mileage}=\text{Low})\times P(\text{Mileage}=\text{Low})\times P(\text{Air conditioner}=\text{Broken})\\=&0\times0.5\times0.5\times0.375+1\times0.5\times0.5\times0.375+0\times0.25\times0.5\times0.375+1\times0.25\times0.5\times0.375\\=&0.1406\end{aligned}$$
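The four-term sum can be evaluated by brute-force enumeration over the two variables being summed out. The sketch below plugs in the conditional probabilities from part (a); the dictionary and function names are illustrative. With those values the sum evaluates to $0.5 \times 0.375 \times (0.5 + 0.25) = 0.140625$, that is, approximately 0.1406.

```python
from itertools import product

# Conditional probability tables from part (a).
P_MILEAGE = {"high": 0.5, "low": 0.5}
P_AC = {"working": 0.625, "broken": 0.375}
P_ENGINE_BAD = {"high": 0.5, "low": 0.25}  # P(Engine=bad | Mileage)
P_VALUE_HIGH = {("good", "working"): 0.75, ("good", "broken"): 6 / 9,
                ("bad", "working"): 2 / 9, ("bad", "broken"): 0.0}


def joint(mileage, engine, ac, value):
    """Joint probability from the network factorisation."""
    p_bad = P_ENGINE_BAD[mileage]
    p_engine = p_bad if engine == "bad" else 1 - p_bad
    p_high = P_VALUE_HIGH[(engine, ac)]
    p_value = p_high if value == "high" else 1 - p_high
    return P_MILEAGE[mileage] * P_AC[ac] * p_engine * p_value


# Sum out mileage and car value with Engine=bad, Air conditioner=broken fixed.
answer = sum(joint(m, "bad", "broken", v)
             for m, v in product(P_MILEAGE, ("high", "low")))
print(answer)
```

Summing `joint` over every combination of the four variables should give 1, which is a quick sanity check on the tables.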
10
a
$$\begin{aligned}&P(B=\text{good}, F=\text{empty}, G=\text{empty}, S=\text{yes})\\=&P(B=\text{good})\cdot P(F=\text{empty})\cdot P(G=\text{empty} \mid B=\text{good}, F=\text{empty})\cdot P(S=\text{yes} \mid B=\text{good}, F=\text{empty})\\=&0.9\times0.2\times0.8\times0.2\\=&0.0288\end{aligned}$$
b
$$\begin{aligned}&P(B=\text{bad}, F=\text{empty}, G=\text{not empty}, S=\text{no})\\=&P(B=\text{bad})\cdot P(F=\text{empty})\cdot P(G=\text{not empty} \mid B=\text{bad}, F=\text{empty})\cdot P(S=\text{no} \mid B=\text{bad}, F=\text{empty})\\=&0.1\times0.2\times0.1\times1.0\\=&0.002\end{aligned}$$
c
Because $B$ and $F$ have no parents in the network, they are independent, so $P(F=\alpha \mid B=\text{bad}) = P(F=\alpha)$:
$$\begin{aligned}&P(S=\text{yes} \mid B=\text{bad})\\=&\sum\limits_{\alpha}P(S=\text{yes} \mid B=\text{bad}, F=\alpha)\cdot P(F=\alpha)\\=&0\times0.2+0.1\times0.8\\=&0.08\end{aligned}$$
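The three quantities in this exercise can be checked with a few lines of code. The sketch below contains only the table entries that the answers above actually use, and its names are illustrative. It computes the conditional $P(S=\text{yes} \mid B=\text{bad})$ by summing over the fuel state weighted by $P(F)$ alone, which gives 0.08; multiplying that by $P(B=\text{bad}) = 0.1$ gives the joint $P(S=\text{yes}, B=\text{bad}) = 0.008$.

```python
# Only the network entries used in the answers above; names are illustrative.
P_B = {"good": 0.9, "bad": 0.1}
P_F = {"empty": 0.2, "not empty": 0.8}
P_G_EMPTY = {("good", "empty"): 0.8, ("bad", "empty"): 0.9}  # P(G=empty | B, F)
P_S_YES = {("good", "empty"): 0.2, ("bad", "empty"): 0.0,
           ("bad", "not empty"): 0.1}                          # P(S=yes | B, F)


def joint(b, f, g, s):
    """P(B, F, G, S) = P(B) * P(F) * P(G | B, F) * P(S | B, F)."""
    p_g = P_G_EMPTY[(b, f)] if g == "empty" else 1 - P_G_EMPTY[(b, f)]
    p_s = P_S_YES[(b, f)] if s == "yes" else 1 - P_S_YES[(b, f)]
    return P_B[b] * P_F[f] * p_g * p_s


part_a = joint("good", "empty", "empty", "yes")
part_b = joint("bad", "empty", "not empty", "no")
# B and F are independent root nodes, so conditioning on B leaves P(F) unchanged.
part_c = sum(P_S_YES[("bad", f)] * P_F[f] for f in P_F)
print(round(part_a, 4), round(part_b, 4), round(part_c, 4))
```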
11
Boolean function | Linearly separable? |
---|---|
a | Yes |
b | Yes |
c | Yes |
d | No |
12
a
```python
import matplotlib.pyplot as plt
import numpy as np

# Instance data
true_labels = ['+', '+', '-', '-', '+', '+', '-', '-', '+', '-']
p_m1 = [0.73, 0.69, 0.44, 0.55, 0.67, 0.47, 0.08, 0.15, 0.45, 0.35]
p_m2 = [0.61, 0.03, 0.68, 0.31, 0.45, 0.09, 0.38, 0.05, 0.01, 0.04]


def calculate_roc(true_labels, probabilities):
    """Compute the false positive rates and true positive rates of the ROC curve."""
    sorted_indices = np.argsort(probabilities)[::-1]  # sort by probability, descending
    sorted_labels = [true_labels[i] for i in sorted_indices]
    tpr = [0]  # true positive rates
    fpr = [0]  # false positive rates
    tp = 0  # true positives seen so far
    fp = 0  # false positives seen so far
    pn = sorted_labels.count('-')  # number of negative instances
    pp = sorted_labels.count('+')  # number of positive instances
    # Lower the threshold one instance at a time and record the new point.
    for label in sorted_labels:
        if label == '+':
            tp += 1
        else:
            fp += 1
        tpr.append(tp / pp)
        fpr.append(fp / pn)
    return fpr, tpr


def plot_roc_curve(fpr, tpr, model_name):
    """Draw one ROC curve on the current axes."""
    plt.plot(fpr, tpr, label=model_name)


# ROC points for M1 and M2
fpr_m1, tpr_m1 = calculate_roc(true_labels, p_m1)
fpr_m2, tpr_m2 = calculate_roc(true_labels, p_m2)

# Plot both ROC curves
plot_roc_curve(fpr_m1, tpr_m1, 'M1')
plot_roc_curve(fpr_m2, tpr_m2, 'M2')

# Diagonal reference line (random guessing)
plt.plot([0, 1], [0, 1], 'k--')

# Axis labels and title
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')

# Legend position
plt.legend(loc='lower right')

# Show the figure
plt.show()
```
ROC curve plot: (figure produced by the code above)
Model $M_1$ is better, because the area under its ROC curve is larger than that of $M_2$.
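The area comparison can be made quantitative. The following sketch rebuilds the ROC points without NumPy or matplotlib and integrates them with the trapezoidal rule; `roc_points` and `auc` are illustrative helper names, and the data is the same as in the script above.

```python
def roc_points(labels, scores):
    """ROC points obtained by lowering the threshold one instance at a time."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    pos, neg = labels.count('+'), labels.count('-')
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i] == '+':
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


true_labels = ['+', '+', '-', '-', '+', '+', '-', '-', '+', '-']
p_m1 = [0.73, 0.69, 0.44, 0.55, 0.67, 0.47, 0.08, 0.15, 0.45, 0.35]
p_m2 = [0.61, 0.03, 0.68, 0.31, 0.45, 0.09, 0.38, 0.05, 0.01, 0.04]

auc_m1 = auc(roc_points(true_labels, p_m1))
auc_m2 = auc(roc_points(true_labels, p_m2))
print(f"AUC(M1) = {auc_m1:.2f}, AUC(M2) = {auc_m2:.2f}")
```

On this data the areas come out to about 0.92 for $M_1$ and 0.40 for $M_2$. An area below 0.5 means the $M_2$ curve lies mostly under the diagonal.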
b
For $M_1$ at $t = 0.5$:
Instance | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Outcome | TP | TP | TN | FP | TP | FN | TN | TN | FN | TN |

Actual / Predicted | + | - |
---|---|---|
+ | 3 | 2 |
- | 1 | 4 |

$$Precision = \frac{TP}{TP+FP} = \frac{3}{3+1} = 0.75$$
$$Recall = \frac{TP}{TP+FN} = \frac{3}{3+2} = 0.6$$
$$F\text{-}measure = \frac{2\cdot P\cdot R}{P+R} = \frac{2\times0.75\times0.6}{0.75+0.6} = 0.667$$
c
For $M_2$ at $t = 0.5$:
Instance | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Outcome | TP | FN | FP | TN | FN | FN | TN | TN | FN | TN |

Actual / Predicted | + | - |
---|---|---|
+ | 1 | 4 |
- | 1 | 4 |

$$Precision = \frac{TP}{TP+FP} = \frac{1}{1+1} = 0.5$$
$$Recall = \frac{TP}{TP+FN} = \frac{1}{1+4} = 0.2$$
$$F\text{-}measure = \frac{2\cdot P\cdot R}{P+R} = \frac{2\times0.5\times0.2}{0.5+0.2} = 0.286$$
The F-measure of model $M_1$ is larger than that of $M_2$, so $M_1$ performs better.
This agrees with the conclusion drawn from the ROC curves.
d
For $M_1$ at $t = 0.1$:
Instance | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Outcome | TP | TP | FP | FP | TP | TP | TN | FP | TP | FP |

Actual / Predicted | + | - |
---|---|---|
+ | 5 | 0 |
- | 4 | 1 |

$$Precision = \frac{TP}{TP+FP} = \frac{5}{5+4} = 0.556$$
$$Recall = \frac{TP}{TP+FN} = \frac{5}{5+0} = 1$$
$$F\text{-}measure = \frac{2\cdot P\cdot R}{P+R} = \frac{2\times0.556\times1}{0.556+1} = 0.714$$
By F-measure, the threshold $t = 0.1$ is better (0.714 versus 0.667 at $t = 0.5$).
This does not agree with the ROC curve: $t = 0.1$ gives $(FPR, TPR) = (0.8, 1)$, while $t = 0.5$ gives $(0.2, 0.6)$, which is closer to the ideal point $(0, 1)$.
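The confusion counts and the three metrics in parts (b) to (d) can be recomputed for any threshold. The sketch below assumes an instance is predicted positive when its score is at least the threshold (no score in this data set equals a threshold, so the choice of `>=` over `>` does not matter here); `evaluate` is an illustrative name.

```python
def evaluate(labels, scores, threshold):
    """Confusion counts and precision / recall / F-measure at a score threshold.

    Assumes at least one instance is predicted positive, as is the case here.
    """
    predicted = ['+' if s >= threshold else '-' for s in scores]
    pairs = list(zip(labels, predicted))
    tp = sum(y == '+' and p == '+' for y, p in pairs)
    fp = sum(y == '-' and p == '+' for y, p in pairs)
    fn = sum(y == '+' and p == '-' for y, p in pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"TP": tp, "FP": fp, "FN": fn, "precision": precision,
            "recall": recall, "F": f_measure}


true_labels = ['+', '+', '-', '-', '+', '+', '-', '-', '+', '-']
p_m1 = [0.73, 0.69, 0.44, 0.55, 0.67, 0.47, 0.08, 0.15, 0.45, 0.35]
p_m2 = [0.61, 0.03, 0.68, 0.31, 0.45, 0.09, 0.38, 0.05, 0.01, 0.04]

for name, scores, t in [("M1", p_m1, 0.5), ("M2", p_m2, 0.5), ("M1", p_m1, 0.1)]:
    result = evaluate(true_labels, scores, t)
    print(name, f"t={t}", f"F={result['F']:.3f}")
```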
13
The decision boundary is: $f(x_1, x_2) = x_1 x_2$