Sequence Mining


sue_zhou


Original link: https://video.fsmpi.rwth-aachen.de/18ws-ids/13757


Notes on the course “Introduction to Data Science” at RWTH Aachen, winter semester 19/20, Professor Willibrordus van der Aalst.

Event data

Every row is an event. It should include:

case id: the “thing” used to group events, e.g., a student id
activity name: a description of the event, which can also be an itemset, e.g., a course name
timestamp
other attributes: resource, lifecycle (start, complete, etc.), costs, role, etc.
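
As a minimal sketch of how such event rows could be turned into the sequences used below: the rows, the column order and the function name `events_to_sequences` are hypothetical, and merging events that share a timestamp into one itemset is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical event log rows: (case_id, activity, timestamp).
events = [
    ("s1", "Math", 2),
    ("s2", "Math", 1),
    ("s1", "Logic", 1),
    ("s1", "Stats", 2),
    ("s2", "Logic", 3),
]

def events_to_sequences(rows):
    """Group events by case id, order them by timestamp, and merge
    events that share a timestamp into a single itemset (assumption)."""
    by_case = defaultdict(lambda: defaultdict(set))
    for case_id, activity, timestamp in rows:
        by_case[case_id][timestamp].add(activity)
    return {
        case_id: [frozenset(stamps[t]) for t in sorted(stamps)]
        for case_id, stamps in by_case.items()
    }

sequences = events_to_sequences(events)
for case_id, sequence in sorted(sequences.items()):
    print(case_id, [sorted(itemset) for itemset in sequence])
```

Each case id yields one sequence of itemsets, which is the input format expected by sequential pattern mining.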

Mining sequential patterns

Input: Multiset of sequences of itemsets

$\mathcal{I}$ is the set of all items.
An itemset is a nonempty set of items, e.g., $i=\{i_{1}, i_{2}, \ldots, i_{m}\} \subseteq \mathcal{I}$ with $m \geq 1$.
A sequence is a nonempty sequence of itemsets, e.g., $s=\langle s_{1}, s_{2}, \ldots, s_{n}\rangle \in (\mathcal{P}(\mathcal{I}))^{*}$ with $n \geq 1$.
A dataset $D$ is a multiset of sequences: $D \in \mathcal{B}((\mathcal{P}(\mathcal{I}))^{*})$.

($\mathcal{B}$ is the multiset operator, $*$ is the sequence operator, and $\mathcal{P}$ is the powerset operator.)

Example 1

(The first sequence reads: the customer buys a, then b, then c and d together, then e.)

Informal: $[a b (c d) e, a b (c d) e, a (c d) e, a (b c) (c d e) f]$
Formal: $[\langle\{a\},\{b\},\{c,d\},\{e\}\rangle, \langle\{a\},\{b\},\{c,d\},\{e\}\rangle, \langle\{a\},\{c,d\},\{e\}\rangle, \langle\{a\},\{b,c\},\{c,d,e\},\{f\}\rangle]$
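
The informal notation can be mapped to the formal one mechanically. The helper below is a small sketch; the name `parse_sequence` and the restriction to single-character items are assumptions made for illustration. A sequence becomes a tuple of `frozenset`s and the dataset a plain list, so duplicates are kept as in a multiset.

```python
def parse_sequence(text):
    """Parse informal notation such as 'a b (c d) e' into a tuple of
    frozensets. Every letter or digit is one item; parentheses group
    items into one itemset; whitespace is ignored."""
    sequence, group = [], None
    for ch in text:
        if ch == "(":
            group = set()
        elif ch == ")":
            sequence.append(frozenset(group))
            group = None
        elif ch.isalnum():
            if group is None:
                sequence.append(frozenset(ch))  # singleton itemset
            else:
                group.add(ch)
    return tuple(sequence)

# The dataset of Example 1 as a list (multiset) of sequences.
D = [parse_sequence(s) for s in
     ["a b (c d) e", "a b (c d) e", "a (c d) e", "a (b c) (c d e) f"]]

for s in D:
    print([sorted(itemset) for itemset in s])
```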

Goal: Find frequent sequential patterns

A sequential pattern is itself a sequence of itemsets: $p \in (\mathcal{P}(\mathcal{I}))^{*}$.
The support of a sequential pattern $p$ is the fraction (or absolute number) of sequences in dataset $D$ that support $p$, i.e., the sequences in which $p$ is contained.

Containment

Let $a=\langle a_{1}, a_{2}, \ldots, a_{n}\rangle \in (\mathcal{P}(\mathcal{I}))^{*}$ and $b=\langle b_{1}, b_{2}, \ldots, b_{m}\rangle \in (\mathcal{P}(\mathcal{I}))^{*}$ be two itemset sequences.
$a$ is contained in $b$ (written $a \sqsubseteq b$) if there exist integers $1 \leq i_1 < i_2 < \ldots < i_n \leq m$ such that $a_{1} \subseteq b_{i_{1}}, a_{2} \subseteq b_{i_{2}}, \ldots, a_{n} \subseteq b_{i_{n}}$.

Examples

formal notation:
$\begin{array}{l}\langle\{a\},\{a,b\},\{b,c\},\{c\}\rangle \sqsubseteq \langle\{a\},\{a\},\{a,b,c\},\{b,c\},\{b,c\},\{a,c\}\rangle \\ \langle\{a\},\{a,b\},\{b,c\},\{c\}\rangle \not\sqsubseteq \langle\{a\},\{a,b,c\},\{b,d\},\{b,e\},\{a,c\}\rangle\end{array}$
informal notation:
$\begin{array}{l}a(ab)(bc)c \sqsubseteq aa(abc)(bc)(bc)(ac) \\ a(ab)(bc)c \not\sqsubseteq a(abc)(bd)(be)(ac)\end{array}$
In the second case, no itemset after the one matching $\{a,b\}$ contains both $b$ and $c$.
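
The containment test can be sketched with a greedy left-to-right scan: matching each itemset of the pattern to the earliest possible position never rules out a later match, so the greedy choice is enough to decide whether suitable indices exist. The function names `is_contained` and `seq` are illustrative, not taken from the course material.

```python
def is_contained(a, b):
    """Return True if itemset sequence a is contained in b.
    Each a_j is matched to the earliest remaining b_i with a_j <= b_i
    (subset test), which keeps the matched indices strictly increasing."""
    i = 0
    for itemset in a:
        while i < len(b) and not itemset <= b[i]:
            i += 1
        if i == len(b):
            return False  # no remaining position can host this itemset
        i += 1  # the next itemset must match at a later position
    return True

def seq(*itemsets):
    """Build a sequence from strings, e.g. seq('a', 'ab') = <{a},{a,b}>."""
    return tuple(frozenset(s) for s in itemsets)

pattern = seq("a", "ab", "bc", "c")
print(is_contained(pattern, seq("a", "a", "abc", "bc", "bc", "ac")))  # True
print(is_contained(pattern, seq("a", "abc", "bd", "be", "ac")))       # False
```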

Support

Relative

$support(p) = |[s \in D \mid p \sqsubseteq s]| / |D|$ (relative)
Minimum support threshold $min\_sup$: lower bound for $support(p)$

Absolute

$support\_count(p) = |[s \in D \mid p \sqsubseteq s]|$ (absolute, also called frequency or count)
Minimum support count threshold: lower bound for $support\_count(p)$.

Example

$D = [abcd, (abcd), (ab)(cd), (ab)(bc)(cd)]$
The $support\_count(p)$ for each pattern, with the sequences that support it:
$p = a$: 4, supported by $abcd$, $(abcd)$, $(ab)(cd)$, $(ab)(bc)(cd)$
$p = ab$: 2, supported by $abcd$, $(ab)(bc)(cd)$
$p = (ab)$: 3, supported by $(abcd)$, $(ab)(cd)$, $(ab)(bc)(cd)$
$p = (ab)c$: 2, supported by $(ab)(cd)$, $(ab)(bc)(cd)$
$p = (ab)(bd)$: 0
$p = ab(cd)$: 1, supported by $(ab)(bc)(cd)$
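
The counts in this example can be checked with a short script. This is a sketch under the definitions above; `parse`, `is_contained`, `support_count` and `support` are illustrative names, and items are assumed to be single characters.

```python
def parse(text):
    """Parse informal notation like '(ab)(bc)(cd)' into a tuple of frozensets."""
    sequence, group = [], None
    for ch in text:
        if ch == "(":
            group = set()
        elif ch == ")":
            sequence.append(frozenset(group))
            group = None
        elif ch.isalnum():
            if group is None:
                sequence.append(frozenset(ch))
            else:
                group.add(ch)
    return tuple(sequence)

def is_contained(p, s):
    """Greedy containment test: the shared iterator only moves forward,
    so matched positions in s are strictly increasing."""
    it = iter(s)
    return all(any(x <= y for y in it) for x in p)

def support_count(p, dataset):
    """Absolute support: number of sequences that contain p."""
    return sum(1 for s in dataset if is_contained(p, s))

def support(p, dataset):
    """Relative support: fraction of sequences that contain p."""
    return support_count(p, dataset) / len(dataset)

D = [parse(s) for s in ["abcd", "(abcd)", "(ab)(cd)", "(ab)(bc)(cd)"]]
patterns = ["a", "ab", "(ab)", "(ab)c", "(ab)(bd)", "ab(cd)"]
counts = {p: support_count(parse(p), D) for p in patterns}
print(counts)
```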

AprioriAll Algorithm

Brute force approach

Let $k$ be the length of the longest sequence in $D \in \mathcal{B}((\mathcal{P}(\mathcal{I}))^{*})$ and $q$ the size of the largest itemset.
Generate all sequence patterns $p$ having a length $\leq k$ and itemsets of size $\leq q$. The number of such patterns is finite.
Compute the support of each candidate pattern.
This is very expensive!
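
To get a feeling for how expensive the brute force approach is, the number of candidate patterns can be counted directly. The function below is a back-of-the-envelope sketch; its name and parameters are chosen here for illustration. It counts the nonempty itemsets of size at most $q$ and then all sequences of at most $k$ such itemsets.

```python
from math import comb

def brute_force_candidates(num_items, k, q):
    """Number of sequence patterns of length <= k whose itemsets are
    nonempty subsets of size <= q drawn from num_items items."""
    itemsets = sum(comb(num_items, j) for j in range(1, q + 1))
    return sum(itemsets ** length for length in range(1, k + 1))

# 5 items, sequences of length <= 3, itemsets of size <= 2:
# 15 possible itemsets, hence 15 + 15**2 + 15**3 patterns.
print(brute_force_candidates(5, 3, 2))  # 3615
```

The count grows exponentially in $k$, so even modest values of $k$ and $q$ make exhaustive enumeration impractical.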

A smarter approach based on Apriori

If $p_1 \sqsubseteq p_2$ ($p_1$ is a subsequence of $p_2$), then $p_2$ cannot be frequent if $p_1$ is not frequent.
- $support(p_1) \geq support(p_2)$ if $p_1 \sqsubseteq p_2$
- if $p_1 \sqsubseteq p_2$ and $support(p_1) < min\_sup$, then $support(p_2) < min\_sup$
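
One way to see why this monotonicity holds, sketched using only the containment definition: containment is transitive, so every sequence that supports the longer pattern also supports the shorter one (the inclusion below is meant as multisets).

$$
p_1 \sqsubseteq p_2 \;\wedge\; p_2 \sqsubseteq s \;\Rightarrow\; p_1 \sqsubseteq s
\quad\text{hence}\quad
[s \in D \mid p_2 \sqsubseteq s] \subseteq [s \in D \mid p_1 \sqsubseteq s]
\;\Rightarrow\; support(p_2) \leq support(p_1)
$$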

Step 1: Determine all litemsets L

$L=\{i \subseteq \mathcal{I} \mid support(\langle i \rangle) \geq min\_sup\}$ is the set of all itemsets that appear in a sufficient number of sequences.
These itemsets are called litemsets (large itemsets).
To determine all litemsets, we can use a variant of the original Apriori algorithm.
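
The notes do not spell out the Apriori variant, so the following is only one possible sketch. The assumption made here is that an itemset is counted at most once per sequence, namely when some element of the sequence contains it. The function name `litemsets` and the absolute threshold `min_count` are illustrative.

```python
from itertools import combinations

def litemsets(dataset, min_count):
    """Level-wise (Apriori-style) search for itemsets that are contained
    in some element of at least min_count sequences."""
    def count(itemset):
        # Each sequence contributes at most 1, however often it matches.
        return sum(1 for s in dataset if any(itemset <= e for e in s))

    items = sorted({x for s in dataset for e in s for x in e})
    level = [frozenset({x}) for x in items if count(frozenset({x})) >= min_count]
    result = list(level)
    while level:
        frequent = set(level)
        # Join: unions of two frequent itemsets that are exactly one item larger.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Prune by the Apriori property, then count the survivors.
        level = [c for c in candidates
                 if all(c - {x} in frequent for x in c) and count(c) >= min_count]
        result.extend(level)
    return result

D = [tuple(frozenset(e) for e in s)
     for s in (["a", "b", "c", "d"], ["abcd"], ["ab", "cd"], ["ab", "bc", "cd"])]
L = litemsets(D, min_count=2)
print(sorted("".join(sorted(i)) for i in L))
```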

Step 2: Preprocess dataset

The set $\boldsymbol{L}_{1}=\{\langle\boldsymbol{i}\rangle | \boldsymbol{i} \in \boldsymbol{L}\}$ is the set of all frequent sequence patterns of length 1.
$\boldsymbol{L}_{\boldsymbol{k}} \subseteq \boldsymbol{L}^{*}$ is the set of all frequent sequence patterns of length 𝒌 (to be computed).

Transformation (only for optimization, using formal notation)

Transform $D$ into $D_{T} \in \mathcal{B}((\mathcal{P}(L))^{*})$ by replacing each itemset with the set of litemsets it contains.
Let $L=\{\{a\},\{b\},\{c\},\{a,b\}\}$
- $\langle\{a,c\},\{a,b,c\}\rangle$ -> $\langle\{\{a\},\{c\}\},\{\{a\},\{b\},\{c\},\{a,b\}\}\rangle$
- $\langle\{c\},\{a,c\}\rangle$ -> $\langle\{\{c\}\},\{\{a\},\{c\}\}\rangle$
The new representation makes it very easy to test whether a sequence pattern is supported by a sequence in the data set.
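
The transformation and the resulting membership-based test can be sketched as follows. The names `transform` and `supports` are illustrative, and dropping elements that contain no litemset at all is an assumption made here, not something stated in these notes.

```python
def transform(sequence, L):
    """Replace each itemset by the set of litemsets it contains.
    Elements containing no litemset are dropped (assumption)."""
    transformed = []
    for itemset in sequence:
        contained = frozenset(l for l in L if l <= itemset)
        if contained:
            transformed.append(contained)
    return tuple(transformed)

def supports(pattern, transformed):
    """A pattern over litemsets is supported if its litemsets occur, in
    order, as members of elements of the transformed sequence."""
    it = iter(transformed)
    return all(any(l in element for element in it) for l in pattern)

a, b, c, ab = (frozenset(x) for x in ("a", "b", "c", "ab"))
L = [a, b, c, ab]

t1 = transform((frozenset("ac"), frozenset("abc")), L)
t2 = transform((frozenset("c"), frozenset("ac")), L)
print(supports((a, ab), t1))  # True: {a} is in element 1, {a,b} in element 2
print(supports((ab, a), t1))  # False: nothing follows the element holding {a,b}
```

Subset tests on itemsets are replaced by membership tests on precomputed sets of litemsets, which is where the optimization comes from.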

Step 3: Generate set of candidate sequences

Assume we have $L_{k-1}$, the set of all frequent sequence patterns of length $k-1$. Recall that $L_{1} = \{\langle i \rangle \mid i \in L\}$.
$C_{k}$ is the set of all candidate sequences obtained by joining two sequences from $L_{k-1}$ whose first $k-2$ elements are the same.
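
A sketch of this join step, with patterns represented as tuples of litemsets. The function name `generate_candidates` is illustrative. It is assumed here that ordered pairs are used and that a pattern may be joined with itself, since patterns such as $\langle a, a \rangle$ are valid sequences.

```python
from itertools import product

def generate_candidates(prev_level):
    """Join two (k-1)-patterns that agree on their first k-2 elements:
    the candidate is the first pattern extended by the last element of
    the second one."""
    return {p + (q[-1],)
            for p, q in product(prev_level, repeat=2)
            if p[:-1] == q[:-1]}

A, B, C = frozenset("a"), frozenset("b"), frozenset("c")
L2 = {(A, B), (A, C), (B, C)}
C3 = generate_candidates(L2)
for candidate in sorted(C3, key=lambda c: [sorted(e) for e in c]):
    print(["".join(sorted(e)) for e in candidate])
```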

Step 4: Prune the set of candidate sequences

For all $c \in C_{k}$:
- Consider all subsequences of $c$ of length $k-1$.
- If one of these subsequences is not in $L_{k-1}$, then remove $c$ from $C_{k}$.
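
The pruning step can be sketched like this, again with patterns as tuples of litemsets, so that the subsequences of length $k-1$ are obtained by dropping one element at a time. The function name `prune` and the sample data are illustrative.

```python
def prune(candidates, prev_level):
    """Keep only candidates all of whose length-(k-1) subsequences
    (one element dropped) are frequent, i.e. occur in prev_level."""
    def subsequences(c):
        return (c[:i] + c[i + 1:] for i in range(len(c)))
    return {c for c in candidates
            if all(sub in prev_level for sub in subsequences(c))}

A, B, C = frozenset("a"), frozenset("b"), frozenset("c")
L2 = {(A, B), (A, C), (B, C)}
C3 = {(A, B, B), (A, B, C), (A, C, B), (A, C, C), (B, C, C)}
pruned = prune(C3, L2)
print(len(pruned))  # 1: only <a, b, c> has all its 2-subsequences in L2
```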

Step 5: Test all candidate sequences

For each transformed sequence $s \in D_{T}$: increment the count of every $c \in C_{k}$ that is contained in $s$.
Remove all candidates $c \in C_{k}$ that do not meet the threshold; the result is $L_{k}$.
$L_{k}=\{c \in C_{k} \mid support(c) \geq min\_sup\}$.
$\bigcup_{k} L_{k}$ is the set of all frequent sequence patterns.
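
Steps 3 to 5 can be combined into one level-wise loop. The sketch below is a simplified illustration rather than a reference implementation: it takes the litemsets as given, counts support against the original (untransformed) sequences with subset tests, and uses an absolute threshold `min_count`. All names are chosen here for illustration.

```python
from itertools import product

def is_contained(p, s):
    """Greedy containment test; the shared iterator only moves forward."""
    it = iter(s)
    return all(any(x <= y for y in it) for x in p)

def apriori_all(dataset, litemsets, min_count):
    """Return {pattern: support_count} for all frequent sequence
    patterns whose elements are litemsets."""
    def count(p):
        return sum(1 for s in dataset if is_contained(p, s))

    def keep_frequent(patterns):
        counted = {p: count(p) for p in patterns}
        return {p: n for p, n in counted.items() if n >= min_count}

    level = keep_frequent((l,) for l in litemsets)  # L_1
    frequent = dict(level)
    while level:
        # Step 3: join patterns that share their first k-2 elements.
        candidates = {p + (q[-1],) for p, q in product(level, repeat=2)
                      if p[:-1] == q[:-1]}
        # Step 4: prune candidates with an infrequent (k-1)-subsequence.
        candidates = {c for c in candidates
                      if all(c[:i] + c[i + 1:] in level for i in range(len(c)))}
        # Step 5: count support and keep candidates meeting the threshold.
        level = keep_frequent(candidates)
        frequent.update(level)
    return frequent

def seq(*elements):
    return tuple(frozenset(e) for e in elements)

D = [seq("a", "b", "c", "d"), seq("abcd"), seq("ab", "cd"), seq("ab", "bc", "cd")]
L = [frozenset(x) for x in ("a", "b", "c", "d", "ab", "bc", "cd")]
frequent = apriori_all(D, L, min_count=2)
print(frequent[seq("a", "b", "c")])
```

The loop stops once a level is empty, which must happen because no pattern longer than the longest sequence in the dataset can be supported.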

Step 6 (optional): Remove non-maximal patterns

A sequence $s$ is a maximal sequence in data set $D$ if $s$ is frequent and there is no frequent proper supersequence $s'$ (i.e., no frequent $s'$ with $s \sqsubset s'$).
It is possible to keep only the maximal sequences. However, the support information for their subsequences is then lost (these may be more frequent).
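
Filtering for maximal patterns can be sketched as a pairwise containment check over the frequent patterns. The function name `maximal_patterns` and the sample patterns are illustrative; the quadratic comparison is chosen for clarity, not efficiency.

```python
def is_contained(p, s):
    """Greedy containment test; the shared iterator only moves forward."""
    it = iter(s)
    return all(any(x <= y for y in it) for x in p)

def maximal_patterns(frequent):
    """Keep the frequent patterns that are not contained in any other
    frequent pattern."""
    return [p for p in frequent
            if not any(p != q and is_contained(p, q) for q in frequent)]

def seq(*elements):
    return tuple(frozenset(e) for e in elements)

frequent = [seq("a"), seq("b"), seq("a", "b"), seq("ab")]
maximal = maximal_patterns(frequent)
print([["".join(sorted(e)) for e in p] for p in maximal])
```

Here $\langle a \rangle$ and $\langle b \rangle$ are dropped because both are contained in $\langle a, b \rangle$, while $\langle a, b \rangle$ and $\langle \{a,b\} \rangle$ are incomparable and both remain.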

Additional constraints

How interesting?

Item constraints (only consider sequences that include or exclude a set of items).
Length constraints (only consider patterns of a given size).
Time constraints (only consider patterns that occur in a short timeframe). This includes gap and duration constraints.
Regular expression constraints (only consider patterns that satisfy a regular expression or temporal constraint).
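
As an illustration of a gap constraint, containment can be restricted so that consecutive matches are not too far apart. In this sketch the gap is measured in positions within the sequence rather than in real timestamps, which is a simplifying assumption; the name `contained_with_max_gap` is illustrative. Greedy matching is no longer sufficient under a gap constraint, so the function tries all admissible positions.

```python
def contained_with_max_gap(p, s, max_gap):
    """True if p is contained in s using indices i_1 < ... < i_n with
    i_(j+1) - i_j <= max_gap for all consecutive matches."""
    def match(j, start, end):
        if j == len(p):
            return True
        # Try every admissible position for p[j], then recurse with the
        # window of positions allowed for the next element.
        return any(p[j] <= s[i] and match(j + 1, i + 1, i + 1 + max_gap)
                   for i in range(start, min(end, len(s))))
    return match(0, 0, len(s))

def seq(*elements):
    return tuple(frozenset(e) for e in elements)

s = seq("a", "b", "b", "c")
print(contained_with_max_gap(seq("a", "c"), s, max_gap=1))  # False
print(contained_with_max_gap(seq("a", "c"), s, max_gap=3))  # True
```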

Episode mining

Rather than looking for sequences we look for embedded partial orders.
