Notes for the course “Introduction to Data Science” at RWTH Aachen, winter semester 19/20, Professor van der Aalst, Willibrordus
Event data
Every row is an event. It should include:
- case id: the “thing” used to group events, e.g., a student id
- activity name: a description of the event, which can also be an itemset, e.g., a course name
- timestamp
- other attributes: resource, lifecycle (start, complete, etc.), costs, role, etc.
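As a rough illustration, the rows of such an event table can be grouped by case id and ordered by timestamp to obtain one sequence per case. The sample rows, the function name `to_sequences`, and the choice to merge events with identical timestamps into one itemset are assumptions made for this sketch, not part of the lecture.

```python
from collections import defaultdict

# Hypothetical event rows: (case id, activity name, timestamp).
events = [
    ("s1", "Math", "2019-10-01"),
    ("s2", "Math", "2019-10-01"),
    ("s1", "Stats", "2020-02-01"),
    ("s1", "Logic", "2019-10-01"),
    ("s2", "Logic", "2020-02-01"),
]

def to_sequences(rows):
    """Group events by case id; events sharing a timestamp form one itemset."""
    per_case = defaultdict(lambda: defaultdict(set))
    for case_id, activity, timestamp in rows:
        per_case[case_id][timestamp].add(activity)
    # ISO date strings sort chronologically, so sorted() orders the itemsets.
    return {
        case_id: [frozenset(by_time[t]) for t in sorted(by_time)]
        for case_id, by_time in per_case.items()
    }

sequences = to_sequences(events)
```

Each value in `sequences` is a list of itemsets, which is the input shape used by sequential pattern mining.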
Mining sequential patterns
Input: Multiset of sequences of itemsets
- $\mathcal{J}$ is the set of all items
- An itemset is a nonempty set of items, e.g., $i=\{i_1, i_2, \ldots, i_m\} \subseteq \mathcal{J}$ with $m \geq 1$.
- A sequence is a nonempty sequence of itemsets, e.g., $s=\langle s_1, s_2, \ldots, s_n\rangle \in (\mathcal{P}(\mathcal{J}))^{*}$ with $n \geq 1$.
- A dataset $D$ is a multiset of sequences: $D \in \mathcal{B}((\mathcal{P}(\mathcal{J}))^{*})$.
($\mathcal{B}$ is the multiset operator, $*$ is the sequence operator, and $\mathcal{P}$ is the powerset operator.)
Example1
(Informally, ab(cd)e means: the customer buys a, then b, then c and d together, then e.)
- Informal: $[ab(cd)e,\ ab(cd)e,\ a(cd)e,\ a(bc)(cde)f]$
- Formal: $[\langle\{a\},\{b\},\{c,d\},\{e\}\rangle,\ \langle\{a\},\{b\},\{c,d\},\{e\}\rangle,\ \langle\{a\},\{c,d\},\{e\}\rangle,\ \langle\{a\},\{b,c\},\{c,d,e\},\{f\}\rangle]$
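One possible way to hold such a dataset in code is sketched below: itemsets become `frozenset`s, sequences become tuples, and the multiset becomes a `Counter`. The helper `parse` is a hypothetical name, and it assumes that every item is a single character.

```python
from collections import Counter

def parse(text):
    """Turn informal notation such as 'ab(cd)e' into a tuple of frozensets.

    Assumes every item is a single character; parentheses group an itemset.
    """
    sequence, group = [], None
    for ch in text:
        if ch == "(":
            group = set()                      # start collecting an itemset
        elif ch == ")":
            sequence.append(frozenset(group))  # close the itemset
            group = None
        elif group is not None:
            group.add(ch)                      # item inside parentheses
        else:
            sequence.append(frozenset(ch))     # singleton itemset
    return tuple(sequence)

# The dataset is a multiset, so identical sequences are counted.
D = Counter(parse(s) for s in ["ab(cd)e", "ab(cd)e", "a(cd)e", "a(bc)(cde)f"])
```

Tuples of frozensets are hashable, which is what allows the `Counter` to record that `ab(cd)e` occurs twice.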
Goal: Find frequent sequential patterns
- Sequential patterns have the form $p \in (\mathcal{P}(\mathcal{J}))^{*}$.
- The support of a sequential pattern $p$ is the fraction (or absolute number) of sequences in dataset $D$ that support pattern $p$ (i.e., in which $p$ is contained).
Containment
Let $a=\langle a_1, a_2, \ldots, a_n\rangle \in (\mathcal{P}(\mathcal{J}))^{*}$ and $b=\langle b_1, b_2, \ldots, b_m\rangle \in (\mathcal{P}(\mathcal{J}))^{*}$ be two itemset sequences. $a$ is contained in $b$ if there exist integers $i_1 < i_2 < \ldots < i_n$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$.
Examples
formal notation:
$$\begin{array}{l}{\langle\{a\},\{a,b\},\{b,c\},\{c\}\rangle \sqsubseteq \langle\{a\},\{a\},\{a,b,c\},\{b,c\},\{b,c\},\{a,c\}\rangle} \\ {\langle\{a\},\{a,b\},\{b,c\},\{c\}\rangle \not\sqsubseteq \langle\{a\},\{a,b,c\},\{b,d\},\{b,e\},\{a,c\}\rangle}\end{array}$$
In the second case, no itemset after $\{a,b,c\}$ contains $\{b,c\}$, so the pattern is not contained.
informal notation:
$$\begin{array}{l}{a(ab)(bc)c \sqsubseteq aa(abc)(bc)(bc)(ac)} \\ {a(ab)(bc)c \not\sqsubseteq a(abc)(bd)(be)(ac)}\end{array}$$
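The containment definition and the two examples above can be checked with a short greedy scan, sketched here. The names `is_contained` and `seq` are illustrative. Matching each itemset of `a` to the earliest possible itemset of `b` is sufficient, because an earlier match never rules out a later one.

```python
def is_contained(a, b):
    """Return True if itemset sequence a is contained in itemset sequence b."""
    position = 0
    for itemset in a:
        # Skip itemsets of b that are not supersets of the current itemset.
        while position < len(b) and not itemset <= b[position]:
            position += 1
        if position == len(b):
            return False
        position += 1  # the indices i_1 < i_2 < ... must be strictly increasing
    return True

def seq(*itemsets):
    """Helper: build a sequence from strings of single-character items."""
    return tuple(frozenset(i) for i in itemsets)

a = seq("a", "ab", "bc", "c")
b1 = seq("a", "a", "abc", "bc", "bc", "ac")
b2 = seq("a", "abc", "bd", "be", "ac")

first = is_contained(a, b1)   # the first example
second = is_contained(a, b2)  # the second example
```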
Support
Relative
- $support(p) = |[s \in D \mid p \sqsubseteq s]| / |D|$ (relative)
- Minimum support threshold $min\_sup$: lower bound for $support(p)$
Absolute
- $support\_count(p) = |[s \in D \mid p \sqsubseteq s]|$ (absolute, also called frequency or count)
- Minimum support count threshold: lower bound for $support\_count(p)$.
Example
$D=[abcd,\ (abcd),\ (ab)(cd),\ (ab)(bc)(cd)]$
The $support\_count(p)$ for:
- $p = a$: 4 (contained in every sequence)
- $p = ab$: 2 (contained in $abcd$ and $(ab)(bc)(cd)$)
- $p = (ab)$: 3 (contained in $(abcd)$, $(ab)(cd)$ and $(ab)(bc)(cd)$)
- $p = (ab)c$: 2 (contained in $(ab)(cd)$ and $(ab)(bc)(cd)$)
- $p = (ab)(bd)$: 0
- $p = ab(cd)$: 1 (contained in $(ab)(bc)(cd)$)
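The counts in this example can be reproduced with the following sketch. The function names are illustrative, and `parse` assumes single-character items written in the informal notation.

```python
def parse(text):
    """Informal notation -> tuple of frozensets (single-character items assumed)."""
    sequence, group = [], None
    for ch in text:
        if ch == "(":
            group = set()
        elif ch == ")":
            sequence.append(frozenset(group))
            group = None
        elif group is not None:
            group.add(ch)
        else:
            sequence.append(frozenset(ch))
    return tuple(sequence)

def is_contained(a, b):
    """Each itemset of a must fit into a later itemset of b than the previous one."""
    remaining = iter(b)  # a shared iterator keeps the matched indices increasing
    return all(any(x <= y for y in remaining) for x in a)

def support_count(pattern, dataset):
    """Absolute support: number of sequences that contain the pattern."""
    return sum(1 for s in dataset if is_contained(pattern, s))

def support(pattern, dataset):
    """Relative support: fraction of sequences that contain the pattern."""
    return support_count(pattern, dataset) / len(dataset)

D = [parse(s) for s in ["abcd", "(abcd)", "(ab)(cd)", "(ab)(bc)(cd)"]]
patterns = ["a", "ab", "(ab)", "(ab)c", "(ab)(bd)", "ab(cd)"]
counts = {p: support_count(parse(p), D) for p in patterns}
```

Dividing a count by $|D| = 4$ gives the relative support, for example 0.75 for $p = (ab)$.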
AprioriAll Algorithm
Brute force approach
- Let $k$ be the length of the longest sequence in $D \in \mathcal{B}((\mathcal{P}(\mathcal{J}))^{*})$ and $q$ the size of the largest itemset.
- Generate all sequence patterns 𝒑 having a length ≤ 𝒌 and itemsets of size ≤ 𝒒. This number is finite.
- Compute the support of each candidate pattern
- This is very expensive!
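To get a feeling for how expensive this is, the number of candidates can be counted directly. The sketch below assumes $n$ distinct items; the function name and parameters are chosen for illustration only. There are $\sum_{j=1}^{q}\binom{n}{j}$ possible itemsets, and a pattern of length $l$ picks one itemset per position.

```python
from math import comb

def brute_force_candidates(n_items, k, q):
    """Count patterns of length <= k whose itemsets have size <= q."""
    # Number of nonempty itemsets with at most q of the n_items items.
    itemsets = sum(comb(n_items, size) for size in range(1, q + 1))
    # A pattern of a given length chooses one itemset per position.
    return sum(itemsets ** length for length in range(1, k + 1))

# Four items with k = 4 and q = 4, as in a dataset like [abcd, (abcd), ...].
total = brute_force_candidates(4, 4, 4)
```

Even for four items this gives 54240 candidate patterns, and the number grows exponentially in both $k$ and the number of items.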
A smarter approach based on Apriori
- If 𝒑𝟏 ⊑ 𝒑𝟐 (𝒑𝟏 is a subsequence of 𝒑𝟐), then 𝒑𝟐 cannot be frequent if 𝒑𝟏 is not frequent.
- $support(p_1) \geq support(p_2)$ if $p_1 \sqsubseteq p_2$
- If $p_1 \sqsubseteq p_2$ and $support(p_1) < min\_sup$, then $support(p_2) < min\_sup$.
Step 1: Determine all litemsets L
- $L=\{i \subseteq \mathcal{J} \mid support(\langle i\rangle) \geq min\_sup\}$ are all itemsets that appear in a sufficient number of sequences.
- These itemsets are called litemsets.
- To determine all litemsets, we can use a variant of the original Apriori algorithm.
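A minimal sketch of Step 1 follows. For brevity it enumerates all subsets of every itemset instead of using the Apriori variant mentioned above, so it only suits small examples. The function name `litemsets` and the absolute threshold `min_count` are assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations

def litemsets(dataset, min_count):
    """Return all itemsets i with support_count(<i>) >= min_count (brute force)."""
    counts = Counter()
    for sequence in dataset:
        seen = set()
        for itemset in sequence:
            items = sorted(itemset)
            for size in range(1, len(items) + 1):
                seen.update(frozenset(c) for c in combinations(items, size))
        counts.update(seen)  # a sequence counts at most once per itemset
    return {i for i, n in counts.items() if n >= min_count}

def seq(*itemsets):
    """Helper: build a sequence from strings of single-character items."""
    return tuple(frozenset(i) for i in itemsets)

# D = [abcd, (abcd), (ab)(cd), (ab)(bc)(cd)]
D = [seq("a", "b", "c", "d"), seq("abcd"), seq("ab", "cd"), seq("ab", "bc", "cd")]
L = litemsets(D, min_count=3)
```

Note that support is counted per sequence, not per occurrence: an itemset that appears several times in the same sequence still contributes only one.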
Step 2: Preprocess dataset
- The set $L_1=\{\langle i\rangle \mid i \in L\}$ is the set of all frequent sequence patterns of length 1.
- $L_k \subseteq L^{*}$ is the set of all frequent sequence patterns of length $k$ (to be computed).
Transformation (only for optimization, using formal notation)
- Transform $D$ into $D_T \in \mathcal{B}((\mathcal{P}(L))^{*})$.
- Let $L=\{\{a\},\{b\},\{c\},\{a,b\}\}$.
- $\langle\{a,c\},\{a,b,c\}\rangle$ -> $\langle\{\{a\},\{c\}\},\{\{a\},\{b\},\{c\},\{a,b\}\}\rangle$
- $\langle\{c\},\{a,c\}\rangle$ -> $\langle\{\{c\}\},\{\{a\},\{c\}\}\rangle$
- The new representation makes it very easy to test whether a sequence pattern is supported by a sequence in the data set.
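The transformation can be sketched as follows. The function name `transform` is illustrative, and dropping itemsets that contain no litemset at all is an assumption of this sketch (the notes do not say how that case is handled).

```python
def transform(sequence, L):
    """Replace every itemset by the set of litemsets it contains."""
    transformed = []
    for itemset in sequence:
        contained = frozenset(l for l in L if l <= itemset)
        if contained:  # assumption: itemsets without any litemset are dropped
            transformed.append(contained)
    return tuple(transformed)

a, b, c, ab = frozenset("a"), frozenset("b"), frozenset("c"), frozenset("ab")
L = {a, b, c, ab}

t1 = transform((frozenset("ac"), frozenset("abc")), L)
t2 = transform((frozenset("c"), frozenset("ac")), L)
```

After this step, testing whether a pattern is supported only needs membership tests (is this litemset in that position?) instead of repeated subset tests.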
Step 3: Generate set of candidate sequences
- Assume we have 𝑳𝒌−𝟏, the set of all frequent sequence patterns of length 𝒌 − 𝟏. Recall that 𝑳𝟏 = {⟨𝒊⟩ | 𝒊 ∈ 𝑳}.
- 𝑪𝒌 is the set of all candidate sequences obtained by taking two sequences from 𝑳𝒌−𝟏 whose first 𝒌 − 𝟐 elements are the same, and appending the last element of the second sequence to the first.
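A sketch of this join step is given below. Patterns are tuples whose elements stand for litemsets; plain strings are used as stand-ins here. Allowing a pattern to be joined with itself (so that repetitions such as ⟨a, a⟩ can become candidates) is an assumption of this sketch, since the notes do not specify it.

```python
def generate_candidates(L_prev):
    """Join two length-(k-1) patterns that agree on their first k-2 elements."""
    return {
        p + (q[-1],)          # extend p with the last element of q
        for p in L_prev
        for q in L_prev
        if p[:-1] == q[:-1]   # same prefix of length k-2
    }

# Strings stand in for litemsets in this example.
C2 = generate_candidates({("a",), ("b",)})
C3 = generate_candidates({("a", "b"), ("a", "c")})
```

Because order matters in a sequence, each pair is joined in both directions: ⟨a, b⟩ and ⟨a, c⟩ produce both ⟨a, b, c⟩ and ⟨a, c, b⟩.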
Step 4: Prune the set of candidate sequences
- For all 𝐜 ∈ 𝑪𝒌:
- Consider all subsequences of 𝐜 of length 𝒌 − 1.
- If one of these subsequences is not in 𝑳𝒌−𝟏, then remove 𝐜 from 𝑪𝒌.
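The pruning step can be sketched in a few lines. A subsequence of length 𝒌 − 1 is obtained by deleting exactly one element of the candidate. The integers below are illustrative stand-ins for litemsets.

```python
def prune(candidates, L_prev):
    """Keep a candidate only if all its length-(k-1) subsequences are frequent."""
    return {
        c for c in candidates
        # c[:i] + c[i + 1:] is c with its i-th element removed.
        if all(c[:i] + c[i + 1:] in L_prev for i in range(len(c)))
    }

L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = {(1, 2, 3, 4), (1, 2, 4, 3), (1, 3, 4, 5), (1, 3, 5, 4)}
pruned = prune(C4, L3)
```

Only ⟨1, 2, 3, 4⟩ survives here: for instance, ⟨1, 2, 4, 3⟩ is removed because its subsequence ⟨2, 4, 3⟩ is not in `L3`.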
Step 5: Test all candidate sequences
- For each transformed sequence 𝒔 ∈ 𝑫𝑻: Increment the count of 𝐜 ∈ 𝑪𝒌 if 𝐜 is contained in 𝒔.
- Remove all candidates 𝐜 ∈ 𝑪𝒌 that do not meet the threshold; the result is 𝑳𝒌.
- $L_k=\{c \in C_k \mid support(c) \geq min\_sup\}$.
- $\cup_{k} L_{k}$ is the set of all frequent sequence patterns.
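Steps 3 to 5 can be combined into one loop, sketched below under several assumptions: the dataset is already transformed (each position is a set of litemsets), the threshold is an absolute count, a pattern may be joined with itself, and all function names are illustrative.

```python
def contains(transformed, candidate):
    """True if the litemsets of candidate occur, in order, in transformed."""
    remaining = iter(transformed)  # shared iterator enforces increasing positions
    return all(any(l in step for step in remaining) for l in candidate)

def frequent_sequences(D_T, L, min_count):
    """Repeat join, prune and count until no candidates are left."""
    candidates = {(l,) for l in L}  # candidates of length 1
    result = {}
    while candidates:
        # Step 5: count each candidate and keep the frequent ones (L_k).
        counts = {c: sum(contains(s, c) for s in D_T) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= min_count}
        result.update({c: counts[c] for c in frequent})
        # Step 3: join patterns that share their first k-2 elements.
        joined = {p + (q[-1],) for p in frequent for q in frequent
                  if p[:-1] == q[:-1]}
        # Step 4: drop candidates with an infrequent subsequence.
        candidates = {c for c in joined
                      if all(c[:i] + c[i + 1:] in frequent for i in range(len(c)))}
    return result

a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
D_T = [
    (frozenset({a}), frozenset({b}), frozenset({c})),
    (frozenset({a}), frozenset({c})),
    (frozenset({a, b}), frozenset({c})),
]
result = frequent_sequences(D_T, {a, b, c}, min_count=2)
```

The returned dictionary maps every frequent pattern to its support count, so the union over all $L_k$ and the counts are available together.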
Step 6 (optional): Remove non-maximal patterns
- A sequence 𝒔 is a maximal sequence in data set 𝑫 if 𝒔 is frequent and there is no supersequence 𝒔′ of 𝒔 (𝒔 ⊏ 𝒔′) that is also frequent.
- It is possible to keep only the maximal sequences. However, support information for the subsequences will be lost (these may be more frequent).
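Filtering for maximal patterns can be sketched as a pairwise containment test over the set of frequent patterns. The names below are illustrative; patterns are tuples of frozensets.

```python
def is_contained(a, b):
    """True if itemset sequence a is contained in itemset sequence b."""
    remaining = iter(b)
    return all(any(x <= y for y in remaining) for x in a)

def maximal(patterns):
    """Keep the patterns that are not contained in any other pattern of the set."""
    return {
        p for p in patterns
        if not any(p != q and is_contained(p, q) for q in patterns)
    }

def seq(*itemsets):
    """Helper: build a sequence from strings of single-character items."""
    return tuple(frozenset(i) for i in itemsets)

frequent = {seq("a"), seq("b"), seq("a", "c"), seq("ab")}
kept = maximal(frequent)
```

This compares every pair of patterns, which is quadratic in the number of frequent patterns; that is acceptable for a sketch but may be slow for large result sets.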
Additional constraints
How interesting?
- Item constraints (only consider sequences that include or exclude a set of items).
- Length constraints (only consider patterns of a given size).
- Time constraints (only consider patterns that occur in a short timeframe). This includes gap and duration constraints.
- Regular expression constraints (only consider patterns that satisfy a regular expression or temporal constraint).
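Item and length constraints are easy to apply as a filter on candidate or result patterns, as the sketch below shows. The function `satisfies` and its parameters are hypothetical. Time and regular expression constraints are not covered because they need timestamps or a pattern language that these notes do not define.

```python
def satisfies(pattern, required=frozenset(), forbidden=frozenset(),
              min_length=1, max_length=None):
    """Check item constraints and length constraints for one pattern."""
    items = frozenset().union(*pattern)  # all items used anywhere in the pattern
    if not required <= items or items & forbidden:
        return False
    if len(pattern) < min_length:
        return False
    return max_length is None or len(pattern) <= max_length

# The pattern a(cd) as a tuple of frozensets.
p = (frozenset("a"), frozenset("cd"))
checks = [
    satisfies(p, required=frozenset("a")),
    satisfies(p, forbidden=frozenset("d")),
    satisfies(p, max_length=1),
    satisfies(p, min_length=2, max_length=2),
]
```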
Episode mining
Rather than looking for sequences we look for embedded partial orders.
Example:
Then: