Sequence Mining

Notes from the course “Introduction to Data Science” at RWTH Aachen, winter semester 2019/20, taught by Professor Willibrordus van der Aalst.

Event data

Every row is an event. It should include:

  1. case id: the “thing” used to group events, e.g., student id
  2. activity name: a description of the event; it can also be an itemset, e.g., course name
  3. timestamp
  4. other attributes: resource, lifecycle (start, complete, etc.), costs, role, etc.
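
As a minimal sketch of how such event rows turn into the sequences used below, the following groups events by case id and orders them by timestamp. The rows are made up for illustration, and merging events with the same timestamp into one itemset is an assumption of this sketch, not something the notes prescribe.

```python
from collections import defaultdict

# Hypothetical event log: (case_id, activity, timestamp) rows, made up for illustration.
events = [
    ("s1", "Math", 2),
    ("s2", "Math", 1),
    ("s1", "Programming", 1),
    ("s1", "Statistics", 2),
    ("s2", "Statistics", 3),
]

def to_sequences(rows):
    """Group events by case id, order them by timestamp, and merge
    events that share a timestamp into one itemset (assumption)."""
    by_case = defaultdict(lambda: defaultdict(set))
    for case_id, activity, timestamp in rows:
        by_case[case_id][timestamp].add(activity)
    return {
        case_id: [frozenset(slots[t]) for t in sorted(slots)]
        for case_id, slots in by_case.items()
    }

sequences = to_sequences(events)
```

Each case then carries one sequence of itemsets, which is the input shape assumed in the next section.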

Mining sequential patterns

Input: Multiset of sequences of itemsets

  • $\mathcal{I}$ is the set of all items
  • An itemset is a nonempty set of items, e.g., $i = \{i_1, i_2, \ldots, i_m\} \subseteq \mathcal{I}$ with $m \geq 1$
  • A sequence is a nonempty sequence of itemsets, e.g., $s = \langle s_1, s_2, \ldots, s_n \rangle \in (\mathcal{P}(\mathcal{I}))^{*}$ with $n \geq 1$
  • A dataset $D$ is a multiset of sequences: $D \in \mathcal{B}((\mathcal{P}(\mathcal{I}))^{*})$

($\mathcal{B}$ is the multiset operator, $*$ is the sequence operator, and $\mathcal{P}$ is the powerset operator.)

Example 1

(The customer buys a, then b, then c and d together, then e.)

  • Informal: $[ab(cd)e,\ ab(cd)e,\ a(cd)e,\ a(bc)(cde)f]$
  • Formal: $[\langle\{a\},\{b\},\{c,d\},\{e\}\rangle,\ \langle\{a\},\{b\},\{c,d\},\{e\}\rangle,\ \langle\{a\},\{c,d\},\{e\}\rangle,\ \langle\{a\},\{b,c\},\{c,d,e\},\{f\}\rangle]$
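
The two notations can be related mechanically. The sketch below converts the informal notation into a tuple of frozensets; the function name `parse` and the restriction to single-character item names are assumptions made for illustration.

```python
def parse(informal):
    """Turn informal notation such as 'a(bc)(cde)f' into a tuple of frozensets.
    Assumes single-character item names and no whitespace."""
    sequence, group = [], None
    for ch in informal:
        if ch == "(":
            group = set()            # start collecting a multi-item itemset
        elif ch == ")":
            sequence.append(frozenset(group))
            group = None
        elif group is not None:
            group.add(ch)            # item inside parentheses
        else:
            sequence.append(frozenset(ch))  # bare item is a singleton itemset
    return tuple(sequence)

# The dataset of Example 1 as a multiset (here: a list that may hold duplicates).
D = [parse(s) for s in ["ab(cd)e", "ab(cd)e", "a(cd)e", "a(bc)(cde)f"]]
```

A plain list is used for the multiset so that the two identical sequences are both kept.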

Goal: Find frequent sequential patterns

  • Sequential patterns are of the form $p \in (\mathcal{P}(\mathcal{I}))^{*}$.
  • The support of a sequential pattern $p$ is the fraction (or absolute number) of sequences in dataset $D$ that support $p$, i.e., that contain $p$.
Containment

Let $a = \langle a_1, a_2, \ldots, a_n \rangle \in (\mathcal{P}(\mathcal{I}))^{*}$ and $b = \langle b_1, b_2, \ldots, b_m \rangle \in (\mathcal{P}(\mathcal{I}))^{*}$ be two itemset sequences.
$a$ is contained in $b$ (written $a \sqsubseteq b$) if there exist integers $i_1 < i_2 < \ldots < i_n$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$.

Examples

formal notation:
  • $\langle\{a\},\{a,b\},\{b,c\},\{c\}\rangle \sqsubseteq \langle\{a\},\{a\},\{a,b,c\},\{b,c\},\{b,c\},\{a,c\}\rangle$
  • $\langle\{a\},\{a,b\},\{b,c\},\{c\}\rangle \not\sqsubseteq \langle\{a\},\{a,b,c\},\{b,d\},\{b,e\},\{a,c\}\rangle$
informal notation:
  • $a(ab)(bc)c \sqsubseteq aa(abc)(bc)(bc)(ac)$
  • $a(ab)(bc)c \not\sqsubseteq a(abc)(bd)(be)(ac)$
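
The containment test can be sketched with a greedy left-to-right scan: each itemset of the pattern is matched to the earliest remaining itemset of the sequence that is a superset. The helper names `seq` and `is_contained` are chosen here for illustration.

```python
def seq(*itemsets):
    """Build a sequence from strings, e.g. seq("ab", "c") stands for <{a,b},{c}>."""
    return tuple(frozenset(x) for x in itemsets)

def is_contained(a, b):
    """Return True if sequence a is contained in sequence b.
    Matching each itemset of a as early as possible never rules out a
    later match, so the greedy scan is sufficient."""
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # the matched indices must be strictly increasing
    return True

# The two examples from above.
a = seq("a", "ab", "bc", "c")
b1 = seq("a", "a", "abc", "bc", "bc", "ac")
b2 = seq("a", "abc", "bd", "be", "ac")
first = is_contained(a, b1)
second = is_contained(a, b2)
```

In the second example no itemset after $\{a,b,c\}$ is a superset of $\{b,c\}$, which is why the scan runs off the end of the sequence.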

Support
Relative
  • $support(p) = |[s \in D \mid p \sqsubseteq s]| \,/\, |D|$ (relative)
  • Minimum support threshold $min\_sup$: lower bound for $support(p)$
Absolute
  • $support\_count(p) = |[s \in D \mid p \sqsubseteq s]|$ (absolute, also called frequency or count)
  • Minimum support count threshold: lower bound for $support\_count(p)$.
Example

$D = [abcd,\ (abcd),\ (ab)(cd),\ (ab)(bc)(cd)]$
The $support\_count(p)$ for each pattern $p$, together with the sequences of $D$ that support it:
  • $p = a$: 4 (all four sequences)
  • $p = ab$: 2, supported by $abcd$ and $(ab)(bc)(cd)$
  • $p = (ab)$: 3, supported by $(abcd)$, $(ab)(cd)$, and $(ab)(bc)(cd)$
  • $p = (ab)c$: 2, supported by $(ab)(cd)$ and $(ab)(bc)(cd)$
  • $p = (ab)(bd)$: 0
  • $p = ab(cd)$: 1, supported by $(ab)(bc)(cd)$
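
The counts in this example can be reproduced with a short sketch. The helpers `seq`, `is_contained`, `support_count`, and `support` are names chosen here; the containment check is the greedy scan described in the Containment section.

```python
def seq(*itemsets):
    """Build a sequence from strings, e.g. seq("ab", "c") stands for <{a,b},{c}>."""
    return tuple(frozenset(x) for x in itemsets)

def is_contained(a, b):
    """Greedy check whether sequence a is contained in sequence b."""
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

def support_count(p, D):
    """Absolute support: number of sequences in D that contain p."""
    return sum(1 for s in D if is_contained(p, s))

def support(p, D):
    """Relative support: fraction of sequences in D that contain p."""
    return support_count(p, D) / len(D)

D = [seq("a", "b", "c", "d"), seq("abcd"), seq("ab", "cd"), seq("ab", "bc", "cd")]
patterns = {
    "a": seq("a"),
    "ab": seq("a", "b"),
    "(ab)": seq("ab"),
    "(ab)c": seq("ab", "c"),
    "(ab)(bd)": seq("ab", "bd"),
    "ab(cd)": seq("a", "b", "cd"),
}
counts = {name: support_count(p, D) for name, p in patterns.items()}
```

Each sequence is counted at most once, even if the pattern could be embedded in it in several ways.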

AprioriAll Algorithm

Brute force approach

  • Let $k$ be the length of the longest sequence in $D \in \mathcal{B}((\mathcal{P}(\mathcal{I}))^{*})$ and $q$ the size of the largest itemset.
  • Generate all sequence patterns 𝒑 having a length ≤ 𝒌 and itemsets of size ≤ 𝒒. This number is finite.
  • Compute the support of each candidate pattern
  • This is very expensive!
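
To get a feeling for the cost, the number of candidate patterns can be counted directly. This is a rough sketch: the function name and the example sizes are made up, and it counts every sequence of length at most k built from nonempty itemsets of size at most q.

```python
from math import comb

def num_candidates(n_items, k, q):
    """Number of sequence patterns of length <= k whose itemsets are
    nonempty and have size <= q, over an alphabet of n_items items."""
    itemsets = sum(comb(n_items, j) for j in range(1, q + 1))
    return sum(itemsets ** length for length in range(1, k + 1))

# Hypothetical sizes: 5 items, sequences up to length 4, itemsets up to size 3.
small = num_candidates(5, 4, 3)
```

Even for these tiny sizes there are 406900 candidates, and the number grows exponentially in k.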

A smarter approach based on Apriori

  • If 𝒑𝟏 ⊑ 𝒑𝟐 (𝒑𝟏 is a subsequence of 𝒑𝟐), then 𝒑𝟐 cannot be frequent if 𝒑𝟏 is not frequent.
    • $support(p_1) \geq support(p_2)$ if $p_1 \sqsubseteq p_2$
    • If $p_1 \sqsubseteq p_2$ and $support(p_1) < min\_sup$, then $support(p_2) < min\_sup$
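
The reason behind this property is that containment is transitive. As a short derivation (the index names $i_t$ and $j_t$ are introduced here for illustration): if $p_1 \sqsubseteq p_2$ via indices $i_1 < \ldots < i_n$ and $p_2 \sqsubseteq s$ via indices $j_1 < \ldots < j_m$, then the composed indices witness $p_1 \sqsubseteq s$.

```latex
\begin{align*}
(p_1)_t \subseteq (p_2)_{i_t} \subseteq s_{j_{i_t}}
  \quad &\text{with } j_{i_1} < j_{i_2} < \ldots < j_{i_n}
  &&\Rightarrow\ p_1 \sqsubseteq s \\
[s \in D \mid p_2 \sqsubseteq s] &\subseteq [s \in D \mid p_1 \sqsubseteq s] \\
support(p_2) &\leq support(p_1)
\end{align*}
```

So every sequence that supports the longer pattern also supports the shorter one.
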
Step 1: Determine all litemsets L
  • $L = \{ i \subseteq \mathcal{I} \mid support(\langle i \rangle) \geq min\_sup \}$ is the set of all itemsets that appear in a sufficient number of sequences.
  • These itemsets are called litemsets (large itemsets).
  • To determine all litemsets, we can use a variant of the original Apriori algorithm.
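
A minimal sketch of such a variant is shown below. It works level by level like Apriori, but counts each sequence at most once per itemset. The function name `find_litemsets` and the use of an absolute threshold `min_count` instead of a relative $min\_sup$ are choices made for this sketch.

```python
from itertools import combinations

def find_litemsets(D, min_count):
    """Level-wise search for all itemsets i such that <i> is contained
    in at least min_count sequences of D."""
    def count(itemset):
        # A sequence counts once, however many of its itemsets contain `itemset`.
        return sum(1 for s in D if any(itemset <= t for t in s))

    items = sorted({x for s in D for t in s for x in t})
    level = [frozenset([x]) for x in items if count(frozenset([x])) >= min_count]
    litemsets = list(level)
    while level:
        size = len(level[0]) + 1
        frequent = set(level)
        # Join step: unions of two frequent itemsets that have the next size.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step, then count the survivors against the data.
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))
                 and count(c) >= min_count]
        litemsets.extend(level)
    return set(litemsets)

def seq(*itemsets):
    return tuple(frozenset(x) for x in itemsets)

D = [seq("a", "b", "c", "d"), seq("abcd"), seq("ab", "cd"), seq("ab", "bc", "cd")]
L = find_litemsets(D, 3)
```

With a threshold of 3, the litemsets of this dataset are the four single items plus $\{a,b\}$ and $\{c,d\}$.
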
Step 2: Preprocess dataset
  • The set $L_1 = \{\langle i \rangle \mid i \in L\}$ is the set of all frequent sequence patterns of length 1.
  • $L_k \subseteq L^{*}$ is the set of all frequent sequence patterns of length $k$ (to be computed).
Transformation (only for optimization, using formal notation)
  • Transform $D$ into $D_T \in \mathcal{B}((\mathcal{P}(L))^{*})$: each itemset is replaced by the set of litemsets it contains.
  • Let $L = \{\{a\},\{b\},\{c\},\{a,b\}\}$
    • $\langle\{a,c\},\{a,b,c\}\rangle \rightarrow \langle\{\{a\},\{c\}\},\{\{a\},\{b\},\{c\},\{a,b\}\}\rangle$
    • $\langle\{c\},\{a,c\}\rangle \rightarrow \langle\{\{c\}\},\{\{a\},\{c\}\}\rangle$
  • The new representation makes it very easy to test whether a sequence pattern is supported by a sequence in the data set.
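
The transformation and the simplified support test can be sketched as follows. The names `transform` and `supports` are chosen here, and dropping itemsets that contain no litemset is an assumption of this sketch (such an itemset can never match a pattern element).

```python
L = {frozenset("a"), frozenset("b"), frozenset("c"), frozenset("ab")}

def transform(sequence, litemsets):
    """Replace each itemset by the set of litemsets it contains."""
    transformed = []
    for itemset in sequence:
        contained = frozenset(l for l in litemsets if l <= itemset)
        if contained:  # assumption: itemsets without any litemset are dropped
            transformed.append(contained)
    return tuple(transformed)

def supports(transformed, pattern):
    """A pattern is a sequence of litemsets; after the transformation the
    subset test becomes a plain membership test."""
    j = 0
    for litemset in pattern:
        while j < len(transformed) and litemset not in transformed[j]:
            j += 1
        if j == len(transformed):
            return False
        j += 1
    return True

# The two examples from above.
t1 = transform((frozenset("ac"), frozenset("abc")), L)
t2 = transform((frozenset("c"), frozenset("ac")), L)
```

For instance, `supports(t1, (frozenset("a"), frozenset("ab")))` holds, while `t2` does not support the pattern ⟨{a},{c}⟩ because {a} only occurs in its last element.
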
Step 3: Generate set of candidate sequences
  • Assume we have 𝑳𝒌−𝟏, the set of all frequent sequence patterns of length 𝒌 − 𝟏. Recall that 𝑳𝟏 = { ⟨𝒊⟩ | 𝒊 ∈ 𝑳 }.
  • 𝑪𝒌 is the set of all candidate sequences obtained by combining two sequences from 𝑳𝒌−𝟏 whose first 𝒌 − 𝟐 elements are the same.
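
The join can be sketched as below, with patterns written as tuples of litemset labels (the string labels are hypothetical). Extending the first pattern by the last element of the second is how this sketch combines the two. It also joins a pattern with itself, so that repeated elements such as ⟨a, a⟩ can become candidates; the notes do not say whether the course's version does this, so treat it as an assumption.

```python
def generate_candidates(L_prev):
    """Join L_{k-1} with itself: two patterns that agree on their first
    k-2 elements yield a candidate of length k."""
    return {
        p + (q[-1],)
        for p in L_prev
        for q in L_prev
        if p[:-1] == q[:-1]
    }

# Hypothetical L_2 over the litemset labels "a", "b", "c".
L2 = {("a", "b"), ("a", "c"), ("b", "c")}
C3 = generate_candidates(L2)
```

Here ⟨a, b⟩ and ⟨a, c⟩ share the prefix ⟨a⟩ and produce both ⟨a, b, c⟩ and ⟨a, c, b⟩, since the order of the last two elements matters in a sequence.
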
Step 4: Prune the set of candidate sequences
  • For all $c \in C_k$:
    • Consider all subsequences of $c$ of length $k - 1$.
    • If one of these subsequences is not in 𝑳𝒌−𝟏, then remove 𝐜 from 𝑪𝒌.
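
A minimal sketch of the pruning step, again with patterns as tuples of hypothetical litemset labels: the subsequences of length k − 1 are obtained by deleting one element at a time.

```python
def prune(candidates, L_prev):
    """Keep a candidate only if every subsequence obtained by deleting
    exactly one element is a frequent pattern in L_{k-1}."""
    return {
        c for c in candidates
        if all(c[:i] + c[i + 1:] in L_prev for i in range(len(c)))
    }

# Hypothetical L_2 and candidate set C_3.
L2 = {("a", "b"), ("a", "c"), ("b", "c")}
C3 = {("a", "b", "b"), ("a", "b", "c"), ("a", "c", "b"), ("a", "c", "c"), ("b", "c", "c")}
pruned = prune(C3, L2)
```

Only ⟨a, b, c⟩ survives: for example, ⟨a, c, b⟩ is removed because its subsequence ⟨c, b⟩ is not in the given L2.
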
Step 5: Test all candidate sequences
  • For each transformed sequence 𝒔 ∈ 𝑫𝑻: Increment the count of 𝐜 ∈ 𝑪𝒌 if 𝐜 is contained in 𝒔.
  • Remove all candidates 𝐜 ∈ 𝑪𝒌 that do not meet the threshold; the result is 𝑳𝒌:
  • $L_k = \{ c \in C_k \mid support(c) \geq min\_sup \}$.
  • $\bigcup_{k} L_k$ is the set of all frequent sequence patterns.
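
The counting step can be sketched as one pass over the transformed dataset. The names are chosen here, the data is made up, litemsets are represented by string labels, and an absolute threshold `min_count` stands in for $min\_sup$.

```python
from collections import Counter
from itertools import product

def supports(sequence, pattern):
    """Membership-based containment test on a transformed sequence."""
    j = 0
    for label in pattern:
        while j < len(sequence) and label not in sequence[j]:
            j += 1
        if j == len(sequence):
            return False
        j += 1
    return True

def frequent_candidates(C_k, D_T, min_count):
    """One pass over D_T: count each candidate, then keep the frequent ones."""
    counts = Counter()
    for s in D_T:
        for c in C_k:
            if supports(s, c):
                counts[c] += 1
    return {c for c in C_k if counts[c] >= min_count}

# Hypothetical transformed dataset over the litemset labels "a", "b", "c".
D_T = [({"a"}, {"b"}, {"c"}), ({"a", "b"}, {"c"}), ({"a"}, {"c"})]
C2 = set(product("abc", repeat=2))  # all nine candidates of length 2
L2 = frequent_candidates(C2, D_T, 2)
```

With a threshold of 2, only ⟨a, c⟩ (supported three times) and ⟨b, c⟩ (supported twice) remain.
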
Step 6 (optional): Remove non-maximal patterns
  • A sequence 𝒔 is a maximal sequence in data set 𝑫 if 𝒔 is frequent and there is no frequent proper supersequence 𝒔′ (i.e., no frequent 𝒔′ with 𝒔 ⊏ 𝒔′).
  • It is possible to keep only the maximal sequences. However, support information for the subsequences will be lost (these may be more frequent).
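
Filtering for maximal patterns can be sketched as follows. The helper names are chosen here and the set of frequent patterns is made up.

```python
def seq(*itemsets):
    return tuple(frozenset(x) for x in itemsets)

def is_contained(a, b):
    """Greedy check whether sequence a is contained in sequence b."""
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

def maximal(patterns):
    """Keep the patterns that are not contained in any other pattern of the set."""
    return {
        p for p in patterns
        if not any(p != q and is_contained(p, q) for q in patterns)
    }

# Hypothetical set of frequent patterns.
frequent = {seq("a"), seq("b"), seq("c"), seq("a", "c"), seq("b", "c")}
result = maximal(frequent)
```

The result keeps only ⟨{a},{c}⟩ and ⟨{b},{c}⟩; the supports of the single-element patterns are no longer available afterwards.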

Additional constraints

How interesting?

  • Item constraints (only consider sequences that include or exclude a set of items).
  • Length constraints (only consider patterns of a given size).
  • Time constraints (only consider patterns that occur in a short timeframe). This includes gap and duration constraints.
  • Regular expression constraints (only consider patterns that satisfy a regular expression or temporal constraint).
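
As one illustration of a time constraint, the containment test can be restricted by a maximum gap between consecutive matched itemsets. This is only a sketch under assumptions: sequences are lists of (timestamp, itemset) pairs sorted by time, and the function name and data are made up. A greedy scan is no longer sufficient here, so the sketch backtracks.

```python
def contained_with_max_gap(pattern, sequence, max_gap):
    """Check containment where consecutive matched itemsets may be at
    most max_gap time units apart."""
    def search(p_idx, s_idx, last_time):
        if p_idx == len(pattern):
            return True
        for j in range(s_idx, len(sequence)):
            t, itemset = sequence[j]
            if last_time is not None and t - last_time > max_gap:
                break  # later events are even further away
            if pattern[p_idx] <= itemset and search(p_idx + 1, j + 1, t):
                return True
        return False
    return search(0, 0, None)

# Hypothetical timestamped sequence.
s = [(1, frozenset("a")), (2, frozenset("b")), (10, frozenset("a")), (11, frozenset("c"))]
ac = (frozenset("a"), frozenset("c"))
bc = (frozenset("b"), frozenset("c"))
ac_ok = contained_with_max_gap(ac, s, 2)
bc_ok = contained_with_max_gap(bc, s, 2)
```

The pattern ⟨a, c⟩ is only found by skipping the first a and matching the one at time 10, which is why backtracking is needed.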

Episode mining

Rather than looking for sequences we look for embedded partial orders.
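Example (figure omitted; only its labels remain): the observed events are wine, beer, wodka, beer, wine, wodka.

Then (figure omitted; only its labels remain): an episode over wine, wodka, and beer.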