[PM4Py Lecture 5] About Discovery

北冥有鱼zsp

Last modified 2023-09-12 09:38:52

Categories: Process Mining Knowledge, Python. Tags: pandas, python, data mining, process mining, pm4py

First published 2023-06-10 13:26:44

Article link: https://blog.csdn.net/qq_40420514/article/details/131139756

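This part introduces the discovery functions in pm4py, including discovering a Petri net with the Alpha algorithm, Alpha+, the Heuristics Miner, the ILP Miner, and the Inductive Miner.

1. Function overview

The table below gives an overview of the common discovery functions in pm4py: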

Function	Description
derive_minimum_self_distance(log[, ...])	Derive the minimum self-distance of each activity
discover_batches(log[, merge_distance, ...])	Discover batch activities
discover_bpmn_inductive(log[, ...])	Discover a BPMN model with the Inductive Miner
discover_declare(log[, allowed_templates, ...])	Discover a DECLARE model
discover_dfg(log[, activity_key, ...])	Discover the directly-follows graph (DFG)
discover_directly_follows_graph(log[, ...])	Discover the directly-follows graph
discover_eventually_follows_graph(log[, ...])	Discover the eventually-follows graph
discover_footprints(*args)	Discover the footprints matrix
discover_heuristics_net(log[, ...])	Discover a heuristics net
discover_log_skeleton(log[, ...])	Discover a log skeleton
discover_performance_dfg(log[, ...])	Discover the performance DFG
discover_petri_net_alpha(log[, ...])	Discover a Petri net with the Alpha algorithm
discover_petri_net_alpha_plus(log[, ...])	Discover a Petri net with the Alpha+ algorithm
discover_petri_net_heuristics(log[, ...])	Discover a Petri net with the Heuristics Miner
discover_petri_net_ilp(log[, alpha, ...])	Discover a Petri net with the ILP Miner
discover_petri_net_inductive(log[, ...])	Discover a Petri net with the Inductive Miner
discover_prefix_tree(log[, activity_key, ...])	Discover a prefix tree
discover_process_tree_inductive(log[, ...])	Discover a process tree with the Inductive Miner
discover_temporal_profile(log[, ...])	Discover a temporal profile
discover_transition_system(log[, direction, ...])	Discover a transition system

2. Function details

2.1 Discovering the directly-follows graph

pm4py.discovery.discover_dfg(log: Union[EventLog, DataFrame], activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → Tuple[dict, dict, dict]

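Description: Discovers the directly-follows graph (DFG) from the log. The DFG is a dictionary whose keys are the pairs of activities that directly follow each other in the log and whose values are the frequencies of that relation. It is returned together with the start activities and the end activities.

Input parameters:

log – event log / Pandas dataframe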

activity_key (str) – attribute to be used for the activity

timestamp_key (str) – attribute to be used for the timestamp

case_id_key (str) – attribute to be used as case identifier

Returns:

Tuple[dict, dict, dict]

Example code:

import pm4py

dfg, start_activities, end_activities = pm4py.discover_dfg(dataframe, case_id_key='case:concept:name', activity_key='concept:name', timestamp_key='time:timestamp')
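To make the three returned objects concrete, here is a minimal pure-Python sketch of how directly-follows pairs can be counted. It is not pm4py's implementation; the helper directly_follows and the toy log are hypothetical and only mirror the shape of the result.

```python
from collections import Counter


def directly_follows(traces):
    """Count directly-follows pairs plus start and end activities."""
    dfg = Counter()
    start_activities = Counter()
    end_activities = Counter()
    for trace in traces:
        if not trace:
            continue  # an empty case contributes nothing
        start_activities[trace[0]] += 1
        end_activities[trace[-1]] += 1
        # zip pairs every event with its immediate successor
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dict(dfg), dict(start_activities), dict(end_activities)


# Hypothetical toy log: each inner list is one case, already ordered by timestamp.
toy_log = [["A", "B", "C"], ["A", "C", "B"], ["A", "B", "C"]]
dfg, start_activities, end_activities = directly_follows(toy_log)
```

On this toy log the pair ('A', 'B') is counted twice and ('A', 'C') once, and every case starts with A.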

2.2 Discovering the performance directly-follows graph

pm4py.discovery.discover_performance_dfg(log: Union[EventLog, DataFrame], business_hours: bool = False, business_hour_slots=[(25200, 61200), (111600, 147600), (198000, 234000), (284400, 320400), (370800, 406800)], workcalendar=None, activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → Tuple[dict, dict, dict]

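Description: Discovers the performance directly-follows graph from an event log. The method returns a dictionary whose keys are the pairs of activities that directly follow each other in the log and whose values are the performance of that relation.

Input parameters: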

log – event log / Pandas dataframe

business_hours (bool) – enables/disables the computation based on the business hours (default: False)

business_hour_slots – work schedule of the company, provided as a list of tuples where each tuple represents one time slot of business hours. One slot i.e. one tuple consists of one start and one end time given in seconds since week start, e.g. [(7 * 60 * 60, 17 * 60 * 60), ((24 + 7) * 60 * 60, (24 + 12) * 60 * 60), ((24 + 13) * 60 * 60, (24 + 17) * 60 * 60),] meaning that business hours are Mondays 07:00 - 17:00 and Tuesdays 07:00 - 12:00 and 13:00 - 17:00

activity_key (str) – attribute to be used for the activity

timestamp_key (str) – attribute to be used for the timestamp

case_id_key (str) – attribute to be used as case identifier

Returns:

Tuple[dict, dict, dict]

Example code:

import pm4py

performance_dfg, start_activities, end_activities = pm4py.discover_performance_dfg(dataframe, case_id_key='case:concept:name', activity_key='concept:name', timestamp_key='time:timestamp')
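The business_hour_slots values are seconds since the start of the week, which is easy to get wrong by hand. The sketch below rebuilds the default value from the signature (Monday to Friday, 07:00 to 17:00) and measures how much of an interval falls inside the slots. The helpers weekday_slots and business_seconds are hypothetical, assume the interval lies within a single week, and are not part of pm4py.

```python
HOUR = 60 * 60
DAY = 24 * HOUR


def weekday_slots(start_hour, end_hour, days=range(5)):
    """Build (start, end) slots in seconds since week start (Monday 00:00)."""
    return [(d * DAY + start_hour * HOUR, d * DAY + end_hour * HOUR) for d in days]


def business_seconds(start, end, slots):
    """Seconds of the interval [start, end) that overlap any slot."""
    return sum(max(0, min(end, hi) - max(start, lo)) for lo, hi in slots)


slots = weekday_slots(7, 17)
# Monday 16:00 to Tuesday 08:00: one business hour on each day.
worked = business_seconds(16 * HOUR, DAY + 8 * HOUR, slots)
```

Here slots equals the default list shown in the signature, and worked is 7200 seconds even though 16 hours of wall-clock time pass.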

2.3 Alpha algorithm

pm4py.discovery.discover_petri_net_alpha(log: Union[EventLog, DataFrame], activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → Tuple[PetriNet, Marking, Marking]

Description: Discovers a Petri net using the Alpha Miner.

Input parameters:

log – event log / Pandas dataframe

activity_key (str) – attribute to be used for the activity

timestamp_key (str) – attribute to be used for the timestamp

case_id_key (str) – attribute to be used as case identifier

Output:

Tuple[PetriNet, Marking, Marking]

Example code:

import pm4py

net, im, fm = pm4py.discover_petri_net_alpha(dataframe, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp')

2.4 Inductive Miner

pm4py.discovery.discover_petri_net_inductive(log: Union[EventLog, DataFrame, DirectlyFollowsGraph], multi_processing: bool = False, noise_threshold: float = 0.0, activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → Tuple[PetriNet, Marking, Marking]
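Description: Discovers a Petri net using the Inductive Miner (IM) algorithm. The basic idea of the Inductive Miner is to detect a "cut" in the log (for example a sequence cut, an exclusive-choice cut, a concurrent cut, or a loop cut) and then recurse on the sublogs obtained by applying the cut, until a base case is found. The directly-follows variant avoids the recursion on the sublogs and works on the directly-follows graph instead. Inductive Miner models usually make extensive use of silent transitions, especially to skip or loop over a part of the model. In addition, every visible transition has a unique label (no two transitions in the model share the same label).

Input parameters: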

log – event log / Pandas dataframe / typed DFG

noise_threshold (float) – noise threshold (default: 0.0)

multi_processing (bool) – boolean that enables/disables multiprocessing in inductive miner

activity_key (str) – attribute to be used for the activity

timestamp_key (str) – attribute to be used for the timestamp

case_id_key (str) – attribute to be used as case identifier

Return type:

Tuple[PetriNet, Marking, Marking]

Example code:

import pm4py

net, im, fm = pm4py.discover_petri_net_inductive(dataframe, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp')
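As an illustration of what detecting a cut means, the sketch below finds an exclusive-choice cut by grouping activities into connected components of the directly-follows graph, ignoring edge direction. This is a simplified textbook-style sketch, not pm4py's code; the function name and the toy edges are hypothetical.

```python
def exclusive_choice_cut(activities, dfg_edges):
    """Group activities into connected components of the undirected DFG."""
    neighbours = {a: set() for a in activities}
    for a, b in dfg_edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    groups, seen = [], set()
    for start in sorted(activities):
        if start in seen:
            continue
        # depth-first search collects everything reachable from start
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        groups.append(component)
    return groups


# Hypothetical DFG: A and B never interact with C and D.
groups = exclusive_choice_cut({"A", "B", "C", "D"}, {("A", "B"), ("C", "D")})
has_cut = len(groups) > 1
```

When more than one group is found, the log can be split into one sublog per group and the search for cuts repeated on each of them.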

2.5 Heuristics Miner

pm4py.discovery.discover_petri_net_heuristics(log: Union[EventLog, DataFrame], dependency_threshold: float = 0.5, and_threshold: float = 0.65, loop_two_threshold: float = 0.5, activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → Tuple[PetriNet, Marking, Marking]
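Description: Discovers a Petri net using the Heuristics Miner. The Heuristics Miner is an algorithm that works on the directly-follows graph and provides ways to handle noise and to find common constructs (the dependency between two activities, AND). The output of the Heuristics Miner is a heuristics net, that is, an object containing the activities and the relations between them. The heuristics net can then be converted into a Petri net.

Input parameters: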

log – event log / Pandas dataframe

dependency_threshold (float) – dependency threshold (default: 0.5)

and_threshold (float) – AND threshold (default: 0.65)

loop_two_threshold (float) – loop two threshold (default: 0.5)

activity_key (str) – attribute to be used for the activity

timestamp_key (str) – attribute to be used for the timestamp

case_id_key (str) – attribute to be used as case identifier

Return type:

Tuple[PetriNet, Marking, Marking]

Example code:

import pm4py

net, im, fm = pm4py.discover_petri_net_heuristics(dataframe, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp')
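To give a feel for what dependency_threshold filters, the sketch below uses the dependency measure commonly given for the Heuristics Miner, (|a>b| - |b>a|) / (|a>b| + |b>a| + 1) for two different activities a and b. Whether pm4py computes exactly this value internally is an assumption here; the function dependency and the toy counts are hypothetical.

```python
def dependency(dfg, a, b):
    """Dependency measure between two different activities a and b."""
    ab = dfg.get((a, b), 0)  # how often a is directly followed by b
    ba = dfg.get((b, a), 0)  # how often b is directly followed by a
    return (ab - ba) / (ab + ba + 1)


# Hypothetical directly-follows counts.
toy_dfg = {("A", "B"): 9, ("B", "A"): 0, ("B", "C"): 5, ("C", "B"): 4}

# Keep only the arcs whose dependency reaches the threshold (0.5 as in the signature default).
kept = {pair for pair in toy_dfg if dependency(toy_dfg, *pair) >= 0.5}
```

The arc from A to B is strong (0.9) and survives, while B and C follow each other almost equally often in both directions (0.1 and -0.1), so neither arc passes the threshold.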

2.6 ILP Miner

pm4py.discovery.discover_petri_net_ilp(log: Union[EventLog, DataFrame], alpha: float = 1.0, activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → Tuple[PetriNet, Marking, Marking]
Description: Discovers a Petri net using the ILP Miner.

Input parameters:

log – event log / Pandas dataframe

alpha (float) – noise threshold for the sequence encoding graph (1.0=no filtering, 0.0=greatest filtering)

activity_key (str) – attribute to be used for the activity

timestamp_key (str) – attribute to be used for the timestamp

case_id_key (str) – attribute to be used as case identifier

Return type:

Tuple[PetriNet, Marking, Marking]

Example code:

import pm4py

net, im, fm = pm4py.discover_petri_net_ilp(dataframe, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp')

2.7 Discovering a process tree with the Inductive Miner

pm4py.discovery.discover_process_tree_inductive(log: Union[EventLog, DataFrame, DirectlyFollowsGraph], noise_threshold: float = 0.0, multi_processing: bool = False, activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → ProcessTree

Description: Discovers a process tree using the Inductive Miner (IM) algorithm.

Input parameters:

log – event log / Pandas dataframe / typed DFG

noise_threshold (float) – noise threshold (default: 0.0)

activity_key (str) – attribute to be used for the activity

multi_processing (bool) – boolean that enables/disables multiprocessing in inductive miner

timestamp_key (str) – attribute to be used for the timestamp

case_id_key (str) – attribute to be used as case identifier

Return type:

ProcessTree

Example code:

import pm4py

process_tree = pm4py.discover_process_tree_inductive(dataframe, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp')

2.8 Discovering a heuristics net with the Heuristics Miner

pm4py.discovery.discover_heuristics_net(log: Union[EventLog, DataFrame], dependency_threshold: float = 0.5, and_threshold: float = 0.65, loop_two_threshold: float = 0.5, min_act_count: int = 1, min_dfg_occurrences: int = 1, activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name', decoration: str = 'frequency') → HeuristicsNet

Description: Discovers a heuristics net.

Input parameters:

log – event log / Pandas dataframe

dependency_threshold (float) – dependency threshold (default: 0.5)

and_threshold (float) – AND threshold (default: 0.65)

loop_two_threshold (float) – loop two threshold (default: 0.5)

min_act_count (int) – minimum number of occurrences per activity in order to be included in the discovery

min_dfg_occurrences (int) – minimum number of occurrences per arc in the DFG in order to be included in the discovery

activity_key (str) – attribute to be used for the activity

timestamp_key (str) – attribute to be used for the timestamp

case_id_key (str) – attribute to be used as case identifier

decoration (str) – the decoration that should be used (frequency, performance)

Return type: HeuristicsNet

Example code:

import pm4py

heu_net = pm4py.discover_heuristics_net(dataframe, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp')

2.9 Computing the minimum self-distance

pm4py.discovery.derive_minimum_self_distance(log: Union[DataFrame, EventLog, EventStream], activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → Dict[str, int]

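Description: This algorithm computes the minimum self-distance of every activity observed in the event log. The self-distance of a in <a> is infinity, in <a, a> it is 0, in <a, b, a> it is 1, and so on. The activity key 'concept:name' is used by default.

Input parameters: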

log – event log / Pandas dataframe

activity_key (str) – attribute to be used for the activity

timestamp_key (str) – attribute to be used for the timestamp

case_id_key (str) – attribute to be used as case identifier

Return type:

Dict[str, int]

Example code:

import pm4py

msd = pm4py.derive_minimum_self_distance(dataframe, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp')
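The definition above can be checked on a toy log with a few lines of plain Python. This is a sketch of the definition only, not pm4py's implementation. The function name is hypothetical, and leaving out activities that never repeat (infinite distance) is a choice made for this sketch.

```python
def minimum_self_distance(traces):
    """Smallest number of events between two occurrences of the same activity."""
    msd = {}
    for trace in traces:
        last_index = {}  # position of the latest occurrence of each activity in this case
        for i, activity in enumerate(trace):
            if activity in last_index:
                distance = i - last_index[activity] - 1
                if activity not in msd or distance < msd[activity]:
                    msd[activity] = distance
            last_index[activity] = i
    return msd


# Hypothetical toy log; d occurs only once, so it gets no entry.
toy_msd = minimum_self_distance([["a", "b", "a"], ["b", "c", "c", "b"], ["d"]])
```

Only consecutive occurrences need to be compared, because any wider pair of occurrences is at least as far apart.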

2.10 Discovering the footprints matrix

pm4py.discovery.discover_footprints(*args: Union[EventLog, Tuple[PetriNet, Marking, Marking], ProcessTree]) → Union[List[Dict[str, Any]], Dict[str, Any]]

Description: Discovers the footprints from the provided event log or process model.

Input parameters:
args – event log / process model
Return type:

Union[List[Dict[str, Any]], Dict[str, Any]]

Example:

import pm4py

footprints = pm4py.discover_footprints(dataframe)
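As a rough picture of what a footprints object records, the sketch below derives sequence and parallel relations from a set of directly-follows pairs: two activities are treated as parallel when they follow each other in both directions, and as a sequence otherwise. The function and the dictionary keys are hypothetical names for this illustration, not a guaranteed match for pm4py's output.

```python
def footprints_from_dfg(dfg_pairs):
    """Split directly-follows pairs into sequence and parallel relations."""
    pairs = set(dfg_pairs)
    # a pair is parallel when the reverse pair was observed as well
    parallel = {(a, b) for (a, b) in pairs if (b, a) in pairs}
    sequence = pairs - parallel
    return {"sequence": sequence, "parallel": parallel}


# Hypothetical pairs: B and C occur in both orders between A and D.
toy_pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "B"), ("B", "D"), ("C", "D")]
toy_footprints = footprints_from_dfg(toy_pairs)
```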

2.11 Discovering the eventually-follows graph

pm4py.discovery.discover_eventually_follows_graph(log: Union[EventLog, DataFrame], activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → Dict[Tuple[str, str], int]

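Description: Gets the eventually-follows graph from a log object. The eventually-follows graph is a dictionary that associates every pair of activities that eventually follow each other with the number of occurrences of that relation.

Input parameters: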

        log – event log / Pandas dataframe

        activity_key (str) – attribute to be used for the activity

        timestamp_key (str) – attribute to be used for the timestamp

        case_id_key (str) – attribute to be used as case identifier

Return type:

Dict[Tuple[str, str], int]

Example code:

import pm4py

efg = pm4py.discover_eventually_follows_graph(dataframe, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp')
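The difference from the directly-follows graph is that any later event in the same case counts, not only the next one. A minimal sketch follows, with a hypothetical helper and toy log. The counting convention (every ordered pair of positions is counted once) is an assumption and may differ from pm4py's.

```python
from collections import Counter


def eventually_follows(traces):
    """Count pairs (a, b) where b occurs anywhere after a in the same case."""
    efg = Counter()
    for trace in traces:
        for i, a in enumerate(trace):
            for b in trace[i + 1:]:
                efg[(a, b)] += 1
    return dict(efg)


toy_efg = eventually_follows([["A", "B", "C"], ["A", "C"]])
```

Here ('A', 'C') is counted in both cases, although C directly follows A only in the second one.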

2.12 Discovering a BPMN model with the Inductive Miner

pm4py.discovery.discover_bpmn_inductive(log: Union[EventLog, DataFrame, DirectlyFollowsGraph], noise_threshold: float = 0.0, multi_processing: bool = False, activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → BPMN

Description: Discovers a BPMN model using the Inductive Miner algorithm.

Input parameters:

        log – event log / Pandas dataframe / typed DFG

        noise_threshold (float) – noise threshold (default: 0.0)

        multi_processing (bool) – boolean that enables/disables multiprocessing in inductive miner

        activity_key (str) – attribute to be used for the activity

        timestamp_key (str) – attribute to be used for the timestamp

        case_id_key (str) – attribute to be used as case identifier

Return type:

BPMN

Example code:

import pm4py

bpmn_graph = pm4py.discover_bpmn_inductive(dataframe, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp')

2.13 Discovering a transition system

pm4py.discovery.discover_transition_system(log: Union[EventLog, DataFrame], direction: str = 'forward', window: int = 2, view: str = 'sequence', activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → TransitionSystem

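Description: Discovers a transition system as described in the process mining book "Process Mining: Data Science in Action".

Input parameters: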

        log – event log / Pandas dataframe

        direction (str) – direction in which the transition system is built (forward, backward)

        window (int) – window (2, 3, …)

        view (str) – view to use in the construction of the states (sequence, set, multiset)

        activity_key (str) – attribute to be used for the activity

        timestamp_key (str) – attribute to be used for the timestamp

        case_id_key (str) – attribute to be used as case identifier

Return type:

TransitionSystem

Example code:

import pm4py

transition_system = pm4py.discover_transition_system(dataframe, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp')

2.14 Discovering a prefix tree

pm4py.discovery.discover_prefix_tree(log: Union[EventLog, DataFrame], activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → Trie

Description: Discovers a prefix tree from the provided log object.

Input parameters:

        log – event log / Pandas dataframe

        activity_key (str) – attribute to be used for the activity

        timestamp_key (str) – attribute to be used for the timestamp

        case_id_key (str) – attribute to be used as case identifier

Return type:

Trie

Example code:

import pm4py

prefix_tree = pm4py.discover_prefix_tree(dataframe, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp')
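For readers unfamiliar with prefix trees, the sketch below builds one from a toy log using nested dictionaries. It only illustrates the data structure: pm4py returns its own Trie object, and the helper name and the toy traces here are hypothetical.

```python
def build_prefix_tree(traces):
    """Nested dicts: each key is an activity, each value the subtree that follows it."""
    root = {}
    for trace in traces:
        node = root
        for activity in trace:
            # reuse the child if this prefix was seen before, else create it
            node = node.setdefault(activity, {})
    return root


toy_tree = build_prefix_tree([["A", "B", "C"], ["A", "B", "D"], ["A", "E"]])
```

Cases that share a prefix share the corresponding path, so the three toy cases all pass through the single child A of the root.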

2.15 Discovering a temporal profile

pm4py.discovery.discover_temporal_profile(log: Union[EventLog, DataFrame], activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → Dict[Tuple[str, str], Tuple[float, float]]

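Description: Discovers a temporal profile from a log object. It implements the approach described in Stertz, Florian, Jürgen Mangler, and Stefanie Rinderle-Ma. "Temporal Conformance Checking at Runtime based on Time-infused Process Models." arXiv preprint arXiv:2008.07262 (2020).
The output is a dictionary that contains, for every pair of activities that eventually follow each other in at least one case of the log, the average and the standard deviation of the difference between their timestamps.
For example, if the log has two cases:
A (timestamp: 1980-01) B (timestamp: 1980-03) C (timestamp: 1980-06)
A (timestamp: 1990-01) B (timestamp: 1990-02) D (timestamp: 1990-03)
the returned dictionary will contain: {('A', 'B'): (1.5 months, 0.5 months), ('A', 'C'): …

Input parameters: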

        log – event log / Pandas dataframe

        activity_key (str) – attribute to be used for the activity

        timestamp_key (str) – attribute to be used for the timestamp

        case_id_key (str) – attribute to be used as case identifier

Return type:

Dict[Tuple[str, str], Tuple[float, float]]

Example code:

import pm4py

temporal_profile = pm4py.discover_temporal_profile(dataframe, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp')
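The two-case example from the description can be reproduced with a short sketch. Timestamps are simplified to month numbers within each case, and the population standard deviation is used because it matches the 0.5 in the example. Both are assumptions of this sketch, and the helper below is hypothetical rather than pm4py's code.

```python
from collections import defaultdict
from statistics import mean, pstdev


def temporal_profile_sketch(cases):
    """cases: lists of (activity, time) pairs, each ordered by time."""
    diffs = defaultdict(list)
    for case in cases:
        for i, (a, t_a) in enumerate(case):
            for b, t_b in case[i + 1:]:
                diffs[(a, b)].append(t_b - t_a)
    # average and population standard deviation of the time differences
    return {pair: (mean(v), pstdev(v)) for pair, v in diffs.items()}


# Month numbers taken from the example: 1980-01, 1980-03, 1980-06 and 1990-01, 1990-02, 1990-03.
toy_cases = [[("A", 1), ("B", 3), ("C", 6)], [("A", 1), ("B", 2), ("D", 3)]]
profile = temporal_profile_sketch(toy_cases)
```

For ('A', 'B') the differences are 2 and 1 months, giving an average of 1.5 and a standard deviation of 0.5. Pairs seen in only one case have a standard deviation of 0.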

2.16 Discovering a DECLARE model

pm4py.discovery.discover_declare(log: Union[EventLog, DataFrame], allowed_templates: Optional[Set[str]] = None, considered_activities: Optional[Set[str]] = None, min_support_ratio: Optional[float] = None, min_confidence_ratio: Optional[float] = None, activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → Dict[str, Dict[Any, Dict[str, int]]]

Description: Discovers a DECLARE model from an event log. Reference: F. M. Maggi, A. J. Mooij and W. M. P. van der Aalst, "User-guided discovery of declarative process models," 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Paris, France, 2011, pp. 192-199, doi: 10.1109/CIDM.2011.5949297.

Input parameters:

        log – event log / Pandas dataframe

        allowed_templates – (optional) collection of templates to consider for the discovery

        considered_activities – (optional) collection of activities to consider for the discovery

        min_support_ratio – (optional, decided automatically otherwise) minimum percentage of cases (over the entire set of cases of the log) for which the discovered rules apply

        min_confidence_ratio – (optional, decided automatically otherwise) minimum percentage of cases (over the rule’s support) for which the discovered rules are valid

        activity_key (str) – attribute to be used for the activity

        timestamp_key (str) – attribute to be used for the timestamp

        case_id_key (str) – attribute to be used as case identifier

Return type:

Dict[str, Dict[Any, Dict[str, int]]]

Example code:

import pm4py

declare_model = pm4py.discover_declare(log)

2.17 Discovering a log skeleton

pm4py.discovery.discover_log_skeleton(log: Union[EventLog, DataFrame], noise_threshold: float = 0.0, activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name') → Dict[str, Any]

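Description: Discovers a log skeleton from an event log. A log skeleton is a declarative model that consists of six different constraints:

"directly_follows": specifies, for some activities, strict bounds on the activities that directly follow them. For example, "A should be directly followed by B" and "B should be directly followed by C".

"always_before": specifies that some activities may be executed only if some other activities were executed at some earlier point in the history of the case. For example, "C should always be preceded by A".

"always_after": specifies that some activities should always trigger the execution of some other activities later in the case. For example, "A should always be followed by C".

"equivalence": specifies that a given pair of activities should occur the same number of times within a case. For example, "B and C should always occur the same number of times".

"never_together": specifies that a given pair of activities should never occur together in the history of a case. For example, "there should be no case containing both C and D".

"activ_occurrences": specifies the allowed number of occurrences per activity. For example, A is allowed to be executed 1 or 2 times, and B is allowed to be executed 1, 2, 3, or 4 times.

Reference: Verbeek, H. M. W., and R. Medeiros de Carvalho. "Log skeletons: A classification approach to process discovery." arXiv preprint arXiv:1806.08247 (2018).

Input parameters: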

log – event log / Pandas dataframe

noise_threshold (float) – noise threshold, acting as described in the paper.

activity_key (str) – attribute to be used for the activity

timestamp_key (str) – attribute to be used for the timestamp

case_id_key (str) – attribute to be used as case identifier

Return type:

Dict[str, Any]

Example code:

import pm4py

log_skeleton = pm4py.discover_log_skeleton(dataframe, noise_threshold=0.1, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp')
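Three of the six constraints are simple enough to sketch directly from per-case activity counts. The code below derives equivalence, never_together, and the allowed occurrence counts for a toy log with no noise tolerance. The function name, the return layout, and the toy log are hypothetical and do not reproduce pm4py's data structure.

```python
from collections import Counter
from itertools import combinations


def simple_skeleton(traces):
    """Derive three log-skeleton constraints from per-case activity counts."""
    counts = [Counter(trace) for trace in traces]
    activities = sorted(set().union(*traces))
    equivalence, never_together = set(), set()
    for a, b in combinations(activities, 2):
        # same number of occurrences in every case
        if all(c[a] == c[b] for c in counts):
            equivalence.add((a, b))
        # no case contains both activities
        if not any(c[a] and c[b] for c in counts):
            never_together.add((a, b))
    # set of occurrence counts observed per activity (0 means absent from a case)
    occurrences = {a: {c[a] for c in counts} for a in activities}
    return equivalence, never_together, occurrences


toy_log = [["A", "B", "C"], ["A", "B", "C", "B", "C"], ["A", "D"]]
equivalence, never_together, occurrences = simple_skeleton(toy_log)
```

In this toy log B and C always occur equally often, D never shares a case with B or C, and A occurs exactly once in every case.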

2.18 Discovering batch activities

pm4py.discovery.discover_batches(log: Union[EventLog, DataFrame], merge_distance: int = 900, min_batch_size: int = 2, activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name', resource_key: str = 'org:resource') → List[Tuple[Tuple[str, str], int, Dict[str, Any]]]

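Description: Discovers batch activities from the provided log object. A resource is said to execute an activity in batches when it executes the same activity several times within a short period. Identifying such activities points to parts of the process that could be automated, since the human work involved may be repetitive. The following categories of batches are detected:

- Simultaneous (all the events in the batch have identical start and end timestamps);

- Batching at start (all the events in the batch have identical start timestamp);

- Batching at end (all the events in the batch have identical end timestamp);

- Sequential batching (for all consecutive events, the end of the first is equal to the start of the second);

- Concurrent batching (for all consecutive events that are not sequentially matched).

The method is described in the following paper: "Martin, N., Swennen, M., Depaire, B., Jans, M., Caris, A., & Vanhoof, K. (2015, December). Batch Processing: Definition and Event Log Identification. In SIMPDA (pp. 137-140)."

        The output is a (sorted) list of tuples. Each tuple contains:
        Index 0: the activity-resource pair for which at least one batch has been detected
        Index 1: the number of batches for that activity-resource pair
        Index 2: a list containing all the batches. Each batch is described by:
                the start timestamp of the batch; the complete timestamp of the batch; the list of events executed in the batch

Input parameters: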

        log – event log / Pandas dataframe

        merge_distance (int) – the maximum time distance between non-overlapping intervals in order for them to be considered belonging to the same batch (default: 15*60, i.e. 15 minutes)

        min_batch_size (int) – the minimum number of events for a batch to be considered (default: 2)

        activity_key (str) – attribute to be used for the activity

        timestamp_key (str) – attribute to be used for the timestamp

        case_id_key (str) – attribute to be used as case identifier

        resource_key (str) – attribute to be used as resource

Return type:

List[Tuple[Tuple[str, str], int, Dict[str, Any]]]

Example code:

import pm4py

batches = pm4py.discover_batches(dataframe, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp', resource_key='org:resource')
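The roles of merge_distance and min_batch_size can be illustrated with a small interval-merging sketch for a single activity-resource pair. It is a simplified reading of the parameter descriptions, not pm4py's algorithm. The function name, the dictionary layout, and the toy intervals are hypothetical.

```python
def merge_into_batches(intervals, merge_distance=900, min_batch_size=2):
    """intervals: (start, end) pairs in seconds for one activity-resource pair."""
    batches = []
    for start, end in sorted(intervals):
        if batches and start - batches[-1]["end"] <= merge_distance:
            # close enough to the running batch: extend it
            batch = batches[-1]
            batch["end"] = max(batch["end"], end)
            batch["events"].append((start, end))
        else:
            batches.append({"start": start, "end": end, "events": [(start, end)]})
    # drop groups that are too small to count as a batch
    return [b for b in batches if len(b["events"]) >= min_batch_size]


# Three executions within 15 minutes of each other, and one isolated execution much later.
toy_batches = merge_into_batches([(0, 60), (100, 200), (5000, 5100), (1000, 1100)])
```

The first three intervals are merged into one batch running from 0 to 1100 seconds, while the isolated execution at 5000 seconds is discarded because a batch needs at least two events.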

To learn more, you are welcome to join the process mining discussion group on QQ: 671290481.