Tree-Boosted Mixed Effects Models for Panel Data

This article shows how tree-boosting (sometimes also referred to as “gradient tree-boosting”) can be combined with mixed effects models using the GPBoost algorithm. Background is provided on the methodology as well as on how to apply the GPBoost library using Python. We show how (i) models are trained, (ii) parameters are tuned, (iii) models are interpreted, and (iv) predictions are made. Further, we compare this approach with several alternatives.

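As a preview of these steps, the following is a minimal sketch of the GPBoost workflow in Python (the package is available on PyPI as gpboost). The data is simulated; the variable names, parameter values, and data are illustrative and not taken from this article's own examples, and the exact structure of the object returned by predict can vary across gpboost versions.

    import numpy as np
    import gpboost as gpb

    # Simulated grouped data (illustrative only): 100 groups with 10 samples each
    n_groups, n_per_group = 100, 10
    group = np.repeat(np.arange(n_groups), n_per_group)   # grouping variable
    X = np.random.rand(group.size, 2)                     # predictor variables
    b = np.random.normal(size=n_groups)[group]            # group-level random effects
    y = np.sin(4 * X[:, 0]) + b + 0.1 * np.random.normal(size=group.size)

    # (i) Train: random effects for the grouping variable + tree ensemble for X
    gp_model = gpb.GPModel(group_data=group, likelihood="gaussian")
    data_train = gpb.Dataset(X, label=y)
    params = {"objective": "regression_l2", "learning_rate": 0.05,
              "max_depth": 3, "verbose": 0}
    bst = gpb.train(params=params, train_set=data_train,
                    gp_model=gp_model, num_boost_round=100)
    gp_model.summary()  # estimated variance components of the random effects

    # (iv) Predict for new data: pass both predictor variables and group labels
    X_new = np.random.rand(4, 2)
    group_new = np.array([0, 1, 2, 99])
    pred = bst.predict(data=X_new, group_data_pred=group_new)
    # 'pred' contains the tree-ensemble (fixed effects) part and the predicted
    # random effects; depending on the gpboost version they are returned
    # separately (e.g. pred["fixed_effect"] + pred["random_effect_mean"]) or
    # already summed into a single response prediction.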

Introduction

Tree-boosting, with its well-known implementations such as XGBoost, LightGBM, and CatBoost, is widely used in applied data science. Besides state-of-the-art predictive accuracy, tree-boosting has the following advantages:


  • Automatic modeling of non-linearities, discontinuities, and complex high-order interactions

  • Robustness to outliers in and multicollinearity among predictor variables

  • Scale-invariance to monotone transformations of the predictor variables

  • Automatic handling of missing values in predictor variables


Mixed effects models are a modeling approach for clustered, grouped, longitudinal, or panel data. Among other things, they have the advantage that they allow for more efficient learning of the chosen model for the regression function (e.g. a linear model or a tree ensemble).


As outlined in Sigrist (2020), combining gradient tree-boosting with mixed effects models often performs better than (i) plain vanilla gradient boosting, (ii) standard linear mixed effects models, and (iii) alternative approaches for combining machine learning or statistical models with mixed effects models.

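Concretely, the combined model can be written in the usual grouped random effects notation (the notation below is a common convention, not a quotation from Sigrist (2020)):

    y = F(X) + Zb + ε,   b ~ N(0, Σ),   ε ~ N(0, σ² I)

where F(·) is the nonlinear fixed effects function learned by the tree ensemble, Z is the incidence matrix assigning each sample to its group, b are the random effects, and ε is an independent error term. A standard linear mixed effects model is recovered when F(X) is restricted to a linear function Xβ, and plain gradient boosting is recovered when the random effects part Zb is dropped.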

Modeling grouped data

Grouped data (aka clustered data, longitudinal data, panel data) occurs naturally in many applications when there are multiple measurements of a variable of interest for different units. Examples include:


  • One wants to investigate the impact of certain factors (e.g. learning technique, nutrition, sleep, etc.) on students’ test scores, and each student takes several tests. In this case, the units, i.e. the grouping variable, are the students, and the variable of interest is the test score.

  • A company gathers transaction data about its customers. For every customer, there are several transactions. The units are then the customers and the variable of interest can be any attribute of the transactions such as prices.


Basically, such grouped data can be modeled using four different approaches:


  1. Ignore the grouping structure. This is rarely a good idea since important information is neglected.


  2. Model each group (i.e. each student or each customer) separately. This is also rarely a good idea as the number of measurements per group is often small relative to the number of different groups.


  3. Include the grouping variable (e.g. student or customer ID) in your model of choice and treat it as a categorical variable. While this is a viable approach, it has the following disadvantages. Often, the number of measurements per group (e.g. the number of tests per student or the number of transactions per customer) is relatively small while the number of different groups is large (e.g. the number of students, customers, etc.). In this case, the model needs to learn many parameters (one for every group) from relatively little data, which can make the learning inefficient. Further, for trees, high-cardinality categorical variables can be problematic.

  4. Model the grouping variable using random effects in a mixed effects model. This is the approach followed in this article; a short sketch contrasting approaches 3 and 4 is shown after this list.
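
To make the contrast between approaches 3 and 4 concrete, here is a small sketch in Python. The data is simulated and all names and parameter values are illustrative; only the gpboost calls themselves (Dataset, train, GPModel) are from the library's public interface.

    import numpy as np
    import gpboost as gpb

    # Illustrative data: many groups with only a few observations per group
    n_groups, n_per_group = 500, 4
    group = np.repeat(np.arange(n_groups), n_per_group)
    X = np.random.rand(group.size, 2)
    y = X[:, 0] + np.random.normal(size=n_groups)[group] \
        + 0.1 * np.random.normal(size=group.size)

    params = {"objective": "regression_l2", "learning_rate": 0.05, "verbose": 0}

    # Approach 3: add the group ID as a (high-cardinality) categorical feature
    # (categorical features must be encoded as non-negative integers)
    X_with_id = np.column_stack([X, group])
    data_cat = gpb.Dataset(X_with_id, label=y, categorical_feature=[2])
    bst_cat = gpb.train(params=params, train_set=data_cat, num_boost_round=100)

    # Approach 4: model the group ID with random effects in a mixed effects model
    gp_model = gpb.GPModel(group_data=group, likelihood="gaussian")
    data_re = gpb.Dataset(X, label=y)
    bst_re = gpb.train(params=params, train_set=data_re,
                       gp_model=gp_model, num_boost_round=100)

With approach 3, a separate effect must be learned for every group from only a few observations; with approach 4, the group effects are regularized (shrunk towards zero) via the estimated random effects variance, which typically makes learning more efficient when the number of groups is large.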
