Hybrid Rule-Based Machine Learning with scikit-learn

Latest recommended article published on 2024-08-15 09:26:45

weixin_26632369

Tags: python, machine learning, artificial intelligence

Original article: https://towardsdatascience.com/hybrid-rule-based-machine-learning-with-scikit-learn-9cb9841bebf2

TL;DR scikit-learn does not allow you to add hard-coded rules to your machine learning model, but for many use cases, you should! This article explores how you can leverage domain knowledge and object-oriented programming (OOP) to build hybrid rule-based machine learning models on top of scikit-learn.

Introduction

Supervised machine learning models are great for making predictions under uncertainty; they pick up patterns in past data and accurately extrapolate them into the future. Machine learning has pushed the frontier in fields where determining the most likely outcome, whether a class or specific value, has historically been challenging, prone to error, or too time-consuming or expensive at scale.

Still, there are many domains in which some outcomes are not ambiguous at all but certain by definition. You might encounter them in the form of rules embedded in business-specific processes or regulations. In such an environment, it seems inefficient to have an ML model asymptotically approximate pre-formulated rules through implicit learning. Instead, we want the model to focus its attention on the cases for which no pre-defined rules exist.

In this article, you will learn that incorporating pre-defined domain rules into machine learning models has many benefits. To get more hands-on, we will build a simple wrapper-class for scikit-learn estimators, which takes into account explicit rules and leaves the model to sort out the hard cases.

If you cannot wait, skip to the fully documented implementation in Python.

The Importance of Domain Knowledge

Any good machine learning project starts with the aggregation of domain knowledge — the process of gathering relevant information and expertise about a business problem. Typically, we talk with industry practitioners, research online, and conduct data exploration to expose specific trends, patterns, or hints which facilitate machine learning model building.

Domain knowledge is incredibly helpful for many reasons: it helps us balance stakeholder requirements, understand our target audience, but most importantly, it gives us vital clues about feature engineering. While these clues are self-explanatory when attempting to identify a cat in a photograph, in many sectors of industry such as law, insurance, or medical diagnosis, feature engineering is far from intuitive.

To illustrate this point, suppose your objective is to build an ML model to predict customer churn (plan cancellation rates) for a telecommunications company. Before engineering and iterating on probable features, it certainly helps to gather the opinions of experts in the field regarding factors influencing churn, with the retention department being a logical starting point. Possibly, the retention department can even back up their opinions with data, using customer surveys conducted to expose specific pain points. In any case, data and industry practitioners can point you in the right direction, saving time and potentially unveiling previously unconsidered data sources and feature combinations.

How to Derive Rules from Domain Knowledge

In many domains of industry, you can derive simple deterministic rules from processes that are already in place. For example, in specific legal proceedings, a claim for reputational damages may never be granted because the legal code simply states so. Similarly, an insurance company might not pay out damage claims below $1,000 because, by contract with the insured, they are not liable to do so. If we wanted to predict, say, litigation outcomes or insurance losses, such simple rules could be built directly into machine learning models to enhance performance.

Because machine learning excels at ambiguous and challenging cases, it mostly only makes sense to incorporate deterministic rules into models if they apply in every circumstance and are neither too numerous nor too complex. Other use cases exist, however, so I compiled a list of situations in which you might want to consider deploying hybrid, rule-based models:

1. Deterministic rules for the prediction process already exist: As mentioned before, depending on what you are attempting to predict, deterministic rules might already exist for the evaluation process. If the rule is simple, applicable in every case, and there are not too many rules overall, hard-coding it into your machine learning model would guarantee you can already predict a subset of cases with perfect accuracy.
2. Lack of data for specific types of prediction cases: In cases where data is sparse for certain types of prediction cases, your model might struggle to develop correct implicit rules to classify or estimate the data points correctly. Frequently, this occurs if the model cannot accurately infer the target variable solely from other features. In the legal example from above, there might be infrequent claim categories, such as the reimbursement of expenses, for which only a few data points exist. As the claim category is of critical importance to determine the litigation outcome, it is unlikely the model can correctly predict it without further knowledge of reimbursement claims. In this case, performance might be enhanced by simply predicting the average target variable of all instances of the category, say the mean success rate of all expense claims.
3. High feature cardinality: High feature cardinality (the number of possible values for a feature) is a problem for almost all machine learning models. Especially for categorical data, which needs to be encoded, a high number of unique possible values hampers model performance. Thus, if an appropriate rule of thumb or statistical parameter exists, approximating the target variable might make for an attractive trade-off as it leaves the remaining data with lower cardinality to aid model training.
4. Actively fighting biases in the data: Machine learning predictions are inherently biased because of real-world patterns reflected in our training data. In some cases, we can forestall biased predictions by taking matters into our own hands and hard-coding rules which override model behavior.

How to Hard-Code Deterministic Rules as Logical Formulae

As mentioned before, machine learning models learn rules implicitly. The epitomes of such learning are decision-tree-based algorithms such as scikit-learn’s DecisionTreeClassifier or GradientBoostingRegressor, the latter being an ensemble of decision trees.

Decision-tree-based algorithms attempt to predict the target variable by learning decision rules inferred from the supplied data. The decision rules themselves are incredibly simple; they are a sequence of splits on the data using only the basic comparison operators =, <, >, ≤, ≥. Nevertheless, these learned splits only approximate any explicit rules and consequently may not be as accurate.

We can use the same approach to build simple deterministic rules as logical formulae, which we can then translate into code. For instance, let us again suppose we want to design a predictive model to estimate the total losses of an insurance company, and we know the company rejects claims less than or equal to $1,000. One way of hard-coding this rule would be:

if claim_amount <= 1000:
    pass  # reject claim
else:
    pass  # use machine learning model
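
To make the branch above concrete, here is a minimal sketch of the same rule wrapped in a function. The names predict_payout, CLAIM_THRESHOLD, and toy_model are assumptions made up for illustration; toy_model merely stands in for a trained model and is not part of the original article.

```python
from typing import Callable

CLAIM_THRESHOLD = 1000.0  # contractual limit taken from the example above


def predict_payout(claim_amount: float,
                   model_predict: Callable[[float], float]) -> float:
    """Return 0.0 for claims the rule rejects; otherwise ask the model."""
    if claim_amount <= CLAIM_THRESHOLD:
        return 0.0  # rule applies: the claim is rejected outright
    return model_predict(claim_amount)  # hard case: defer to the model


def toy_model(claim_amount: float) -> float:
    """Hypothetical stand-in for a trained model: pays out 80% of the claim."""
    return 0.8 * claim_amount


print(predict_payout(500.0, toy_model))   # covered by the rule
print(predict_payout(2000.0, toy_model))  # handled by the stand-in model
```

The model is only ever called for claims above the threshold, which is the division of labor the rest of the article builds on.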

Let Us Start Coding

There are many ways in which we can integrate deterministic rules into our machine learning pipeline. Adding rules progressively as data pre-processing steps might seem intuitive, but this would not suit our goal. Preferably, we aim to leverage the concept of abstraction by adopting object-oriented programming (OOP) to generate a novel ML model class. This hybrid model will then encompass all deterministic rules, enabling us to train it like any other machine learning model.

Conveniently, scikit-learn provides a BaseEstimator class which we can inherit to build scikit-learn models ourselves without much effort. The advantage of building a new estimator is that we can blend our rules directly with the model logic while leveraging an underlying machine learning model for all data to which the rules don’t apply.

Let us start by building our new hybrid model class and adding an __init__ method to it. As an underlying model, we will use the scikit-learn implementation of a GradientBoostingClassifier; we will call it the base_model.

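import numpy as np
import pandas as pd
from typing import Dict, Tuple
from sklearn.base import BaseEstimator


class RuleAugmentedGBC(BaseEstimator):

    def __init__(self, base_model: BaseEstimator, rules: Dict, **base_params):
        # Store the rule dictionary and the wrapped model, then forward any
        # extra keyword arguments to the wrapped model as its parameters.
        self.rules = rules
        self.base_model = base_model
        self.base_model.set_params(**base_params)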

We created the RuleAugmentedGBC class which inherits from BaseEstimator. Our class is not complete yet and is still missing some essential methods, but it is now technically a scikit-learn estimator. The __init__ method initializes our estimator utilizing a base_model and a dictionary of rules. We can set additional parameters in the __init__ method which are then directly passed to the underlying base_model. In our case, we will use a GradientBoostingClassifier as the base_model.

A Common Format for Rules

In this article’s implementation, we will supply rules to the model in the following format:

{"House Price": [
    ("<", 1000.0, 0.0),
    (">=", 500000.0, 1.0)
],
 "...": [
    ...
    ...
]}

As illustrated above, we format rules as a Python dictionary. The dictionary keys represent the feature column names to which we want to apply our rules. The values of the dictionary are lists of tuples, each tuple representing a unique rule. The first element of the tuple is the logical operator of the rule, the second the split criterion, and the last object is the value which the model should return if the rule is applicable.

For instance, the first rule in the example above would indicate that if any value in the House Price feature column is less than 1000.0, the model should return the value 0.0.
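
As a rough illustration of how rule tuples in this format can be evaluated, the sketch below maps each operator string to a function from Python's operator module and applies a list of rules to a single value. The names OPERATORS and apply_rules are hypothetical helpers, not part of the article's class. When several rules match, the last one wins, which mirrors how later rules overwrite earlier ones in the predict method shown later.

```python
import operator
from typing import Callable, Dict, List, Optional, Tuple

Rule = Tuple[str, float, float]  # (operator, split criterion, return value)

OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "=": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}


def apply_rules(value: float, rules: List[Rule]) -> Optional[float]:
    """Return the value of the last matching rule, or None if none match."""
    result = None
    for symbol, threshold, return_value in rules:
        if OPERATORS[symbol](value, threshold):
            result = return_value
    return result


house_price_rules = [("<", 1000.0, 0.0), (">=", 500000.0, 1.0)]
print(apply_rules(500.0, house_price_rules))     # first rule matches
print(apply_rules(750000.0, house_price_rules))  # second rule matches
print(apply_rules(250000.0, house_price_rules))  # no rule matches
```

A result of None marks exactly the cases that would be left to the underlying machine learning model.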

The fit Method

We proceed to code a fit method (within our RuleAugmentedGBC class) to allow our model to train on data. What is important to notice here is that we want to use our deterministic rules wherever possible, and train the base_model only on data which is not affected by the rules. We will decompose this step by formulating a private helper method called _get_base_model_data to filter out the data necessary to train our base_model.

def fit(self, X: pd.DataFrame, y: pd.Series, **kwargs):
    # Train the base model only on the rows that no rule covers.
    train_x, train_y = self._get_base_model_data(X, y)
    self.base_model.fit(train_x, train_y, **kwargs)

The fit method is pretty straightforward: it first applies the yet-to-be-written _get_base_model_data method to extract the training features and labels for our underlying base_model and then fits the model to the data. Similar to before, we can set additional parameters which we subsequently pass to the fit method of the base_model. Let us now implement the _get_base_model_data method:

def _get_base_model_data(self, X: pd.DataFrame, y: pd.Series) -> Tuple[pd.DataFrame, pd.Series]:
    train_x = X

    for category, rules in self.rules.items():
        if category not in train_x.columns.values:
            continue
        for rule in rules:
            # Keep only the rows for which the rule does NOT apply.
            if rule[0] == "=":
                train_x = train_x.loc[train_x[category] != rule[1]]
            elif rule[0] == "<":
                train_x = train_x.loc[train_x[category] >= rule[1]]
            elif rule[0] == ">":
                train_x = train_x.loc[train_x[category] <= rule[1]]
            elif rule[0] == "<=":
                train_x = train_x.loc[train_x[category] > rule[1]]
            elif rule[0] == ">=":
                train_x = train_x.loc[train_x[category] < rule[1]]
            else:
                print("Invalid rule detected: {}".format(rule))

    # Select labels by index label (assumes X and y share the same index).
    train_y = y.loc[train_x.index]
    train_x = train_x.reset_index(drop=True)
    train_y = train_y.reset_index(drop=True)
    return train_x, train_y

Our private _get_base_model_data method iterates through the rule dictionary keys and finally through every unique rule. At every rule, depending on the logical operator, it narrows down the train_x pandas dataframe to only include the data points not affected by the rule. Once we have applied all rules, we match the corresponding labels via indices and return the residual data for the base_model.
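
To see the filtering on concrete numbers, the following standalone sketch applies the two House Price rules from the format example to a tiny, made-up dataset. It uses a boolean mask rather than repeated slicing and handles only the two operators that occur in these rules, so it is a simplified illustration and not the method itself. All data values are invented.

```python
import pandas as pd

# Made-up training data: four houses and a binary label.
X = pd.DataFrame({"House Price": [500.0, 250000.0, 750000.0, 120000.0]})
y = pd.Series([0.0, 1.0, 1.0, 0.0])

rules = {"House Price": [("<", 1000.0, 0.0), (">=", 500000.0, 1.0)]}

# Start by keeping every row, then drop the rows each rule already covers.
keep = pd.Series(True, index=X.index)
for column, column_rules in rules.items():
    for symbol, threshold, _ in column_rules:
        if symbol == "<":
            keep &= X[column] >= threshold
        elif symbol == ">=":
            keep &= X[column] < threshold

train_x = X.loc[keep].reset_index(drop=True)
train_y = y.loc[keep].reset_index(drop=True)
print(train_x["House Price"].tolist())
print(train_y.tolist())
```

Only the two mid-priced houses remain for the base model; the cheapest and the most expensive house are already decided by the rules.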

The predict Method

The predict method works much like the fit method. Wherever possible, rules should be applied; if no rules are applicable, the base_model should produce a prediction.

def predict(self, X: pd.DataFrame) -> np.ndarray:
    p_X = X.copy()
    p_X['prediction'] = np.nan
    for category, rules in self.rules.items():
        if category not in p_X.columns.values:
            continue
        for rule in rules:
            if rule[0] == "=":
                p_X.loc[p_X[category] == rule[1], 'prediction'] = rule[2]
            elif rule[0] == "<":
                p_X.loc[p_X[category] < rule[1], 'prediction'] = rule[2]
            elif rule[0] == ">":
                p_X.loc[p_X[category] > rule[1], 'prediction'] = rule[2]
            elif rule[0] == "<=":
                p_X.loc[p_X[category] <= rule[1], 'prediction'] = rule[2]
            elif rule[0] == ">=":
                p_X.loc[p_X[category] >= rule[1], 'prediction'] = rule[2]
            else:
                print("Invalid rule detected: {}".format(rule))

    # Let the base model predict every row that no rule has filled in.
    missing = p_X['prediction'].isna()
    if missing.any():
        base_X = p_X.loc[missing].drop('prediction', axis=1)
        p_X.loc[missing, 'prediction'] = self.base_model.predict(base_X)
    return p_X['prediction'].values

The predict method copies our input pandas dataframe in order not to change the input data. We then add a prediction column in which we gather all our hybrid model’s predictions. Just as in the _get_base_model_data method, we iterate through all rules and, wherever applicable, record the corresponding return value in the prediction column. Once we have applied all rules, we check whether any predictions are still missing. If this is the case, we revert to our base_model to generate the remaining predictions.

Other Required Methods

To get a working model that inherits from the BaseEstimator class, we need to implement two more simple methods — get_params and set_params. These allow us to set and read the parameters of our new model. As these two methods are not integral to the topic of this article, please have a look at the fully documented implementation below if you want to know more.
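
One plausible way to write these two methods is to forward them to the wrapped model, so that reading or setting a parameter on the wrapper acts on the base_model. The sketch below shows that idea on a stripped-down class; ParamForwardingWrapper is a hypothetical name, and this is an assumption about a reasonable design rather than a copy of the article's full implementation.

```python
from typing import Any, Dict

from sklearn.base import BaseEstimator
from sklearn.ensemble import GradientBoostingClassifier


class ParamForwardingWrapper(BaseEstimator):
    """Hypothetical wrapper that exposes the parameters of its base model."""

    def __init__(self, base_model: BaseEstimator):
        self.base_model = base_model

    def get_params(self, deep: bool = True) -> Dict[str, Any]:
        # Report the wrapped model's parameters instead of the wrapper's own.
        return self.base_model.get_params(deep=deep)

    def set_params(self, **params: Any) -> "ParamForwardingWrapper":
        # Pass every parameter straight through to the wrapped model.
        self.base_model.set_params(**params)
        return self


wrapper = ParamForwardingWrapper(GradientBoostingClassifier(n_estimators=50))
wrapper.set_params(n_estimators=10)
print(wrapper.get_params()["n_estimators"])
```

A caveat of this design: because get_params no longer returns the wrapper's own constructor arguments, scikit-learn utilities that rebuild an estimator from its parameters, such as clone, will likely not work with it unchanged.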

Fully Documented Implementation

Below, you will find the full code of the scikit-learn wrapper class that we built in this article, complete with documentation. You might find it useful for one of your use cases.

Hybrid Rule-Based Model Usage Example

Here is a short code snippet to illustrate how you could utilize the RuleAugmentedEstimator wrapper class (called RuleAugmentedGBC in the snippets above) to add rules to a GradientBoostingClassifier. This example assumes you have already initialized the variables rules, train_X, train_y, and test_X. Please refer to the section A Common Format for Rules to examine how any rules should be employed.

from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=50)
hybrid_model = RuleAugmentedEstimator(gbc, rules)
hybrid_model.fit(train_X, train_y)
predictions = hybrid_model.predict(test_X)

Conclusion

Congratulations on making it this far! I hope this article has helped you to leverage domain knowledge and object-oriented programming (OOP) to build hybrid rule-based machine learning models. As you have seen, the concept of abstraction can be incredibly helpful to directly incorporate rules into an ML model while keeping your data pipeline clean.

Writing this article has helped me to explore the topic and its applications in depth. While I attempted to check my work for errors, please let me know should you find any.

I am always happy for feedback and open to discussions about topics in the fields of data science, machine learning, and technology in general. I would love to hear from you, so please feel free to connect with me on LinkedIn.
