PM4PY - Filtering Event Data

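Summary: filtering event data

Notes

trace: a path through the process graph; one walk from start to end counts as one trace.
variant: a class of traces; all traces that follow the same path form one variant.
case: one recorded execution of the process in the event log (it corresponds to a trace).
activity: a step in the process (the name of the step).
event: a record of one execution of an activity, including the activity name, the timestamp, the location and other information.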
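
The relationship between these terms can be sketched with plain Python data. This is a toy illustration, not PM4Py code: the event records and the names `events`, `traces` and `variants` are made up for the example.

```python
from collections import defaultdict

# Each event records one execution of an activity (toy data, not a real log).
events = [
    {"case": "c1", "activity": "register", "time": 1},
    {"case": "c1", "activity": "check", "time": 2},
    {"case": "c1", "activity": "pay", "time": 3},
    {"case": "c2", "activity": "register", "time": 1},
    {"case": "c2", "activity": "reject", "time": 4},
    {"case": "c3", "activity": "register", "time": 2},
    {"case": "c3", "activity": "check", "time": 5},
    {"case": "c3", "activity": "pay", "time": 6},
]

# A trace is the time-ordered sequence of activities of one case.
traces = defaultdict(list)
for event in sorted(events, key=lambda e: e["time"]):
    traces[event["case"]].append(event["activity"])

# A variant groups all cases that share the same activity sequence.
variants = defaultdict(list)
for case_id, activities in traces.items():
    variants[tuple(activities)].append(case_id)

for variant, case_ids in variants.items():
    print(variant, case_ids)
```

Here cases c1 and c3 follow the same path, so they belong to one variant, while c2 forms a variant of its own.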

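Filtering on timeframe (not fully understood)

(Not sure) Use this if you are only interested in the traces within a certain time range, that is, traces contained between a start time and an end time, for example from 2011-03-09 to 2012-01-18. The first snippet applies to a log object and the second to a dataframe object (the same holds for all later code examples).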

# Event log object: keep the traces fully contained in the time range
from pm4py.algo.filtering.log.timestamp import timestamp_filter
filtered_log = timestamp_filter.filter_traces_contained(log, "2011-03-09 00:00:00", "2012-01-18 23:59:59")

# pandas dataframe: same filter, with the column names passed as parameters
from pm4py.algo.filtering.pandas.timestamp import timestamp_filter
df_timest_contained = timestamp_filter.filter_traces_contained(dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59",
                                          parameters={timestamp_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      timestamp_filter.Parameters.TIMESTAMP_KEY: "time:timestamp"})

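(Not sure) Intersecting: my first guess was that a timestamp has to equal one of the two given times, but in that case the function would take a list rather than two time parameters. More likely it keeps the traces whose time span (first event to last event) overlaps the given interval, even if the trace is not fully contained in it. The corresponding code: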

# Event log object: keep the traces intersecting the time range
from pm4py.algo.filtering.log.timestamp import timestamp_filter
filtered_log = timestamp_filter.filter_traces_intersecting(log, "2011-03-09 00:00:00", "2012-01-18 23:59:59")

# pandas dataframe: same filter, with the column names passed as parameters
from pm4py.algo.filtering.pandas.timestamp import timestamp_filter
df_timest_intersecting = timestamp_filter.filter_traces_intersecting(dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59",
                                          parameters={timestamp_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      timestamp_filter.Parameters.TIMESTAMP_KEY: "time:timestamp"})
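
To make the difference between "contained" and "intersecting" concrete, here is a minimal pure-Python sketch. It assumes that "intersecting" means the trace's time span overlaps the interval; the helper names `is_contained` and `is_intersecting` and the sample timestamps are hypothetical and are not PM4Py's implementation.

```python
from datetime import datetime

def is_contained(trace_times, start, end):
    """True if every event timestamp lies inside [start, end]."""
    return start <= min(trace_times) and max(trace_times) <= end

def is_intersecting(trace_times, start, end):
    """True if the span [first event, last event] overlaps [start, end]."""
    return min(trace_times) <= end and max(trace_times) >= start

start = datetime(2011, 3, 9)
end = datetime(2012, 1, 18, 23, 59, 59)

# Three toy traces, each given only by its event timestamps.
inside = [datetime(2011, 5, 1), datetime(2011, 6, 1)]
straddling = [datetime(2011, 1, 1), datetime(2011, 4, 1)]
outside = [datetime(2010, 1, 1), datetime(2010, 2, 1)]

for name, times in [("inside", inside), ("straddling", straddling), ("outside", outside)]:
    print(name, is_contained(times, start, end), is_intersecting(times, start, end))
```

Under this reading, a trace that starts before the interval and ends inside it is rejected by the contained filter but kept by the intersecting one.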

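Filter on start activities

First find out which start activities occur in the log, then filter on them.
log_start is a dictionary whose keys are activity names and whose values are occurrence counts.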

from pm4py.algo.filtering.log.start_activities import start_activities_filter
log_start = start_activities_filter.get_start_activities(log)
filtered_log = start_activities_filter.apply(log, ["S1"]) #suppose "S1" is the start activity you want to filter on
from pm4py.algo.filtering.pandas.start_activities import start_activities_filter
log_start = start_activities_filter.get_start_activities(dataframe)
df_start_activities = start_activities_filter.apply(dataframe, ["S1"],
                                          parameters={start_activities_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      start_activities_filter.Parameters.ACTIVITY_KEY: "concept:name"}) #suppose "S1" is the start activity you want to filter on
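
What these two steps do can be sketched without PM4Py. The following toy version represents each trace as a list of activity names; the data and the helper `filter_by_start` are made up for illustration.

```python
from collections import Counter

# Toy log: each trace is a list of activity names (hypothetical data).
toy_log = [
    ["S1", "a", "b"],
    ["S1", "b"],
    ["S2", "a"],
]

def get_start_activities(traces):
    """Count how often each activity appears as the first event of a trace."""
    return dict(Counter(trace[0] for trace in traces if trace))

def filter_by_start(traces, allowed):
    """Keep only the traces whose first activity is in `allowed`."""
    return [trace for trace in traces if trace and trace[0] in allowed]

start_counts = get_start_activities(toy_log)
kept = filter_by_start(toy_log, ["S1"])
print(start_counts)
print(kept)
```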

Another option is to filter by how frequently the start activities occur. DECREASING_FACTOR defaults to 0.6.

# Event log object
from pm4py.algo.filtering.log.start_activities import start_activities_filter
log_af_sa = start_activities_filter.apply_auto_filter(log, parameters={start_activities_filter.Parameters.DECREASING_FACTOR: 0.6})

# pandas dataframe
from pm4py.algo.filtering.pandas.start_activities import start_activities_filter
df_auto_sa = start_activities_filter.apply_auto_filter(dataframe, parameters={start_activities_filter.Parameters.DECREASING_FACTOR: 0.6})
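
One way to picture a decreasing-factor cutoff is sketched below. This is an assumption about the idea, not PM4Py's actual code: items are ranked by count, the most frequent one is always kept, and each following item is kept only while its count is at least DECREASING_FACTOR times the count of the previously kept item. The function name and the counts are hypothetical.

```python
def auto_filter_by_decreasing_factor(counts, decreasing_factor=0.6):
    """Keep the most frequent items until the count drops too sharply."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    kept = [ranked[0][0]]          # the most frequent item is always kept
    previous = ranked[0][1]
    for name, count in ranked[1:]:
        if count < decreasing_factor * previous:
            break                  # sharp drop: stop here (assumed rule)
        kept.append(name)
        previous = count
    return kept

counts = {"A": 1000, "B": 700, "C": 300, "D": 50}
kept_activities = auto_filter_by_decreasing_factor(counts, 0.6)
print(kept_activities)
```

With these numbers B stays (700 is at least 0.6 × 1000), but C is dropped because 300 is below 0.6 × 700, and everything after C is dropped with it.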

Filter on end activities

Again, first find out the names of the end activities.

from pm4py.algo.filtering.log.end_activities import end_activities_filter
end_activities = end_activities_filter.get_end_activities(log)
filtered_log = end_activities_filter.apply(log, ["pay compensation"])
from pm4py.algo.filtering.pandas.end_activities import end_activities_filter
end_activities = end_activities_filter.get_end_activities(df)
filtered_df = end_activities_filter.apply(df, ["pay compensation"],
                                          parameters={end_activities_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      end_activities_filter.Parameters.ACTIVITY_KEY: "concept:name"})

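Filter on variants

To get the variants contained in a given log, use the code below. The result is a dictionary whose keys are variants and whose values are the lists of cases sharing that variant.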

from pm4py.algo.filtering.log.variants import variants_filter
variants = variants_filter.get_variants(log)
from pm4py.statistics.traces.generic.pandas import case_statistics
variants = case_statistics.get_variants_df(df,
                                          parameters={case_statistics.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      case_statistics.Parameters.ACTIVITY_KEY: "concept:name"})

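To get the number of occurrences of each variant, the following code returns a list of variants with their counts (a list of dictionaries, each holding one variant and its count, which is why it can be sorted by the count key).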

from pm4py.statistics.traces.generic.log import case_statistics
variants_count = case_statistics.get_variant_statistics(log)
variants_count = sorted(variants_count, key=lambda x: x['count'], reverse=True)
from pm4py.statistics.traces.generic.pandas import case_statistics
variants_count = case_statistics.get_variant_statistics(df,
                                          parameters={case_statistics.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      case_statistics.Parameters.ACTIVITY_KEY: "concept:name",
                                                      case_statistics.Parameters.TIMESTAMP_KEY: "time:timestamp"})
variants_count = sorted(variants_count, key=lambda x: x['case:concept:name'], reverse=True)
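
The shape of this result can be reproduced with a small pure-Python sketch. The toy log and the dictionary keys `variant` and `count` mirror the log-object example above; this is an illustration, not PM4Py's implementation.

```python
from collections import Counter

# Toy log: each trace is a list of activity names (hypothetical data).
toy_log = [
    ["register", "check", "pay"],
    ["register", "reject"],
    ["register", "check", "pay"],
    ["register", "check", "pay"],
]

# Count identical activity sequences, then build one dict per variant.
counter = Counter(tuple(trace) for trace in toy_log)
variants_count = [{"variant": v, "count": c} for v, c in counter.items()]
variants_count = sorted(variants_count, key=lambda x: x["count"], reverse=True)

for entry in variants_count:
    print(entry)
```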

To filter on variants, suppose variants is a list in which each element is a variant.

# Event log object: keep only the traces that follow one of the given variants
from pm4py.algo.filtering.log.variants import variants_filter
filtered_log1 = variants_filter.apply(log, variants)

# pandas dataframe
from pm4py.algo.filtering.pandas.variants import variants_filter
filtered_df1 = variants_filter.apply(df, variants,
                                          parameters={variants_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      variants_filter.Parameters.ACTIVITY_KEY: "concept:name"})

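Conversely, to remove the given variants instead of keeping them, set POSITIVE to False. As before, variants is assumed to be a list in which each element is a variant.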

filtered_log2 = variants_filter.apply(log, variants, parameters={variants_filter.Parameters.POSITIVE: False})
filtered_df2 = variants_filter.apply(df, variants,
                                          parameters={variants_filter.Parameters.POSITIVE: False, variants_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      variants_filter.Parameters.ACTIVITY_KEY: "concept:name"})
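
The effect of the POSITIVE flag can be sketched in a few lines of plain Python. The function `filter_variants` and the toy traces are hypothetical; variants are represented here as tuples of activity names.

```python
def filter_variants(traces, variants, positive=True):
    """Keep (positive=True) or drop (positive=False) traces matching the variants."""
    wanted = {tuple(v) for v in variants}
    return [trace for trace in traces if (tuple(trace) in wanted) == positive]

toy_log = [
    ["register", "check", "pay"],
    ["register", "reject"],
    ["register", "check", "pay"],
]
selected = [("register", "reject")]

kept = filter_variants(toy_log, selected, positive=True)
dropped = filter_variants(toy_log, selected, positive=False)
print(kept)
print(dropped)
```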

To automatically keep only the most common variants, use the apply_auto_filter method. It accepts a parameter called DECREASING_FACTOR, which works as in the start activities filter and defaults to 0.6.

auto_filtered_log = variants_filter.apply_auto_filter(log)
auto_filtered_df = variants_filter.apply_auto_filter(df,
                                          parameters={variants_filter.Parameters.POSITIVE: False, variants_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      variants_filter.Parameters.ACTIVITY_KEY: "concept:name"})

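For event log objects, a variants percentage filter is also available. You must specify the percentage of variants to keep as a value between 0 and 1: 0 keeps only the most frequent variant, and 1 keeps all variants.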

from pm4py.algo.filtering.log.variants import variants_filter

filtered_log = variants_filter.filter_log_variants_percentage(log, percentage=0.5)

[Figure: explanation of the variants percentage filter; image not available]
Other variant-based filters:
Top-k filter: keeps only the k most common variants.

import pm4py
log = pm4py.read_xes("tests/input_data/receipt.xes")
k = 2
filtered_log = pm4py.filter_variants_top_k(log, k)
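
A top-k variant filter can be sketched with `collections.Counter`. This is a toy version on lists of activity names, not PM4Py's implementation; the function name `top_k_variants_filter` and the data are made up, and how ties are broken is left to `most_common`.

```python
from collections import Counter

def top_k_variants_filter(traces, k):
    """Keep only the traces belonging to the k most frequent variants."""
    counts = Counter(tuple(trace) for trace in traces)
    top = {variant for variant, _ in counts.most_common(k)}
    return [trace for trace in traces if tuple(trace) in top]

# 3 traces of one variant, 2 of another, 1 of a third (hypothetical data).
toy_log = [["a", "b"]] * 3 + [["a", "c"]] * 2 + [["a", "d"]]
top_two = top_k_variants_filter(toy_log, 2)
print(len(top_two))
```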

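Variants coverage filter: keeps the variants that cover at least the given percentage of cases.
For example, with min_coverage_percentage=0.4 and a log of 1000 cases, of which 500 belong to variant1, 400 to variant2 and 100 to variant3, the filter keeps only variant1 and variant2.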

import pm4py
log = pm4py.read_xes("tests/input_data/receipt.xes")
perc = 0.1
filtered_log = pm4py.filter_variants_by_coverage_percentage(log, perc)
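
The 1000-case example can be worked through in plain Python. This sketch assumes that a variant is kept when its share of cases is at least the threshold; the function name `coverage_filter` and the toy traces are hypothetical.

```python
from collections import Counter

def coverage_filter(traces, min_coverage_percentage):
    """Keep traces whose variant covers at least the given share of all cases."""
    counts = Counter(tuple(trace) for trace in traces)
    total = len(traces)
    kept = {v for v, c in counts.items() if c / total >= min_coverage_percentage}
    return [trace for trace in traces if tuple(trace) in kept]

# 500 cases of variant1, 400 of variant2, 100 of variant3.
toy_log = [["a", "b"]] * 500 + [["a", "c"]] * 400 + [["a", "d"]] * 100
covered = coverage_filter(toy_log, 0.4)
print(len(covered))
```

variant1 covers 50% of the cases and variant2 covers 40%, so both pass the 0.4 threshold; variant3 covers only 10% and is removed.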

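Filter on attribute values

(Not understood yet.)

Filter on numeric attribute values

Between Filter

Identifies, within the current cases, all the sub-cases that lead from the source activity to the target activity, and returns them as an event log.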

import pm4py

log = pm4py.read_xes("tests/input_data/running-example.xes")

filtered_log = pm4py.filter_between(log, "check ticket", "decide")
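
The idea of extracting sub-cases can be sketched as follows. This is a simplified reading, not PM4Py's implementation: a sub-case starts at the most recent occurrence of the source activity and ends at the next occurrence of the target activity. The function name `between_filter` and the toy traces are made up.

```python
def between_filter(traces, source, target):
    """Collect every sub-trace that runs from `source` to the next `target`."""
    sub_cases = []
    for trace in traces:
        current = None                      # no open sub-case yet
        for activity in trace:
            if activity == source:
                current = [activity]        # (re)start at the latest source
            elif current is not None:
                current.append(activity)
                if activity == target:
                    sub_cases.append(current)
                    current = None          # close the sub-case
    return sub_cases

toy_log = [
    ["register", "check ticket", "examine", "decide",
     "reinitiate request", "check ticket", "decide", "pay compensation"],
    ["register", "decide"],
]
sub_cases = between_filter(toy_log, "check ticket", "decide")
print(sub_cases)
```

The first trace yields two sub-cases because the source and target activities occur twice; the second trace yields none because it never contains the source activity.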

Case Size Filter

Keeps the cases whose number of events falls within the given range.

import pm4py

log = pm4py.read_xes("tests/input_data/running-example.xes")

filtered_log = pm4py.filter_case_size(log, 5, 10)
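
In plain Python the same idea is a length check per trace. This sketch assumes both bounds are inclusive; the function name `case_size_filter` and the data are hypothetical.

```python
def case_size_filter(traces, min_size, max_size):
    """Keep traces with between min_size and max_size events (inclusive, assumed)."""
    return [trace for trace in traces if min_size <= len(trace) <= max_size]

toy_log = [
    ["a"] * 3,    # too short
    ["a"] * 5,    # kept
    ["a"] * 10,   # kept
    ["a"] * 12,   # too long
]
sized = case_size_filter(toy_log, 5, 10)
print([len(trace) for trace in sized])
```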

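Rework Filter

Identifies the cases in which an activity is repeated.
In the example below, we look for all cases in which the activity reinitiate request occurs at least twice.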

import pm4py

log = pm4py.read_xes("tests/input_data/running-example.xes")

filtered_log = pm4py.filter_activities_rework(log, "reinitiate request", 2)
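
A minimal sketch of the rework check on lists of activity names follows. The function name `rework_filter` and the toy traces are made up; this is not PM4Py's implementation.

```python
def rework_filter(traces, activity, min_occurrences=2):
    """Keep traces in which `activity` occurs at least `min_occurrences` times."""
    return [trace for trace in traces if trace.count(activity) >= min_occurrences]

toy_log = [
    ["register", "decide", "reinitiate request", "decide", "reinitiate request", "decide"],
    ["register", "decide", "reinitiate request", "decide"],
    ["register", "decide"],
]
reworked = rework_filter(toy_log, "reinitiate request", 2)
print(len(reworked))
```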

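Paths Performance Filter

Identifies the cases in which the path between two given activities takes an amount of time within the given range.
In the example below, we look for the cases in which the path from decide to pay compensation occurs at least once with a duration between 2 and 10 days (the bounds are given in seconds).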

import pm4py

log = pm4py.read_xes("tests/input_data/running-example.xes")

filtered_log = pm4py.filter_paths_performance(log, ("decide", "pay compensation"), 2*86400, 10*86400)
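
The check can be sketched with traces given as lists of (activity, timestamp in seconds) pairs. This sketch assumes a path means that the second activity directly follows the first; the function name `paths_performance_filter` and the data are hypothetical.

```python
DAY = 86400  # seconds per day

def paths_performance_filter(traces, path, min_perf, max_perf):
    """Keep traces where `path` occurs at least once with a duration in range."""
    source, target = path
    kept = []
    for trace in traces:
        # Compare each event with the one directly following it.
        for (a1, t1), (a2, t2) in zip(trace, trace[1:]):
            if a1 == source and a2 == target and min_perf <= t2 - t1 <= max_perf:
                kept.append(trace)
                break
    return kept

toy_log = [
    [("decide", 0), ("pay compensation", 1 * DAY)],    # 1 day: too fast
    [("decide", 0), ("pay compensation", 5 * DAY)],    # 5 days: kept
    [("decide", 0), ("pay compensation", 20 * DAY)],   # 20 days: too slow
]
in_range = paths_performance_filter(toy_log, ("decide", "pay compensation"), 2 * DAY, 10 * DAY)
print(in_range)
```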

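Generic Filtering on Event Log

If none of the filters above meets your needs, PM4Py provides a generic method based on an arbitrary boolean expression.
In the example below, we keep the cases of the event log that contain more than 6 events.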

import pm4py

log = pm4py.read_xes("tests/input_data/running-example.xes")

filtered_log = pm4py.filter_log(lambda x: len(x) > 6, log)
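
Any boolean function of a trace can serve as the predicate. The sketch below uses a plain-Python stand-in for the generic filter (the helper `generic_filter` and the toy traces are made up) to combine two conditions in one lambda.

```python
def generic_filter(predicate, traces):
    """Keep the traces for which `predicate` returns True."""
    return [trace for trace in traces if predicate(trace)]

toy_log = [
    ["register", "check", "decide", "pay"],
    ["register", "check", "examine", "check", "examine", "decide", "reject"],
    ["register", "check", "examine", "check", "examine", "decide", "pay"],
]

# Keep cases with more than 6 events that also end with "pay".
long_paid = generic_filter(lambda t: len(t) > 6 and t[-1] == "pay", toy_log)
print(long_paid)
```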

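*If you have questions about this post (for example, errors in the notes or in the wording), please point them out so that we can learn and improve together.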
