Our Data Science Clearinghouse

Original article: https://medium.com/democratictech/our-data-science-clearinghouse-e9f12fd4a86

The DNC Data Science team builds and manages dozens of models that support a broad range of campaign activities. Campaigns rely on these model scores to optimize contactability, volunteer recruitment, get-out-the-vote, and many other pieces of modern campaigning. One of our responsibilities is to deliver the best available model scores in an accessible, actionable form.

As part of Phoenix, the DNC’s data warehouse, we developed infrastructure that keeps our focus on delivering products to win elections instead of on ever-growing technical complexity. In this post, we’ll walk through the infrastructure that manages over 70 billion (and counting!) model scores for the country’s 200+ million registered voters.

The Challenge

At this point in the 2020 cycle, we’re managing about 20 different models. These models come from a mix of sources (our internal modeling infrastructure and multiple vendor syncs), and are a mix of regression, binary classification, and multi-class classification model types. Across those 20 models, we have around 80 distinct model versions that comprise more than 70 billion point estimates.

So, how do we get from the complexity of mixing so many model sources, model types, and versions-per-model to the clean, accessible set of scores our users can seamlessly integrate into their campaign programs?

A Clearinghouse for Model Scores

Our solution is a pair of carefully designed tables for model versions and model scores, and an accompanying code base to cleanly manage model and score life-cycles.

Together, these tables are a clearinghouse for model score publication. Just as a financial clearinghouse ensures a clean exchange between parties in a transaction, our model score clearinghouse sits between a model’s source data and its downstream pipelines to ensure a clean hand-off from one to the other.

Model Version Ledger

The first table of our clearinghouse, model_versions, keeps metadata on model versions. Vendor-sourced model version metadata is merged to this table as part of loading processes. Models we score in-house have their versions checked against and merged into this table as part of every scoring job. With many models and versions spread across our small data science team, we are thrilled to have this bookkeeping maintained programmatically.

An Authoritative Table of Model Predictions

The second table, scores, holds, well, scores. As we load incoming model scores, we tag both the model version and scoring job that generated them. For multi-class classification, we store scores in a normalized form and note the predicted class label in the score_name column.
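
One reading of "normalized form" is a long layout with one row per voter and class label, rather than one column per class. The sketch below illustrates that reading only; the `normalize_scores` helper, the row shape, the class labels, and the IDs are all hypothetical and not taken from the DNC code base.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical row shape: (external_id, score_name, score_value).
ScoreRow = Tuple[str, str, float]


def normalize_scores(predictions: Dict[str, Dict[str, float]]) -> Iterator[ScoreRow]:
    """Yield one row per (ID, class label) instead of one column per class."""
    for external_id, class_scores in predictions.items():
        for score_name, score_value in class_scores.items():
            yield (external_id, score_name, score_value)


# Made-up three-class predictions for two made-up IDs.
wide = {
    "voter-001": {"class_a": 0.7, "class_b": 0.2, "class_c": 0.1},
    "voter-002": {"class_a": 0.1, "class_b": 0.3, "class_c": 0.6},
}
rows = list(normalize_scores(wide))
```

With this layout, a regression or binary model contributes one row per ID and a multi-class model contributes one row per ID and class, so a single table schema can hold every model type.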

A key part of this table’s architecture is that it’s partitioned by the model version’s date. This keeps all of the scores for a given model version on the same partition, allowing us to query just the few gigabytes of data for the scores we need instead of scanning the entire multi-terabyte table.

[Screenshot: the scores table, the second table of our clearinghouse.]

Once the new scores have passed automated checks, their current_score_flag is flipped to TRUE and their datetime_approved field is set to the current timestamp. In the same step, previous scores of the same model_version that share an external_id with the new scores have their current_score_flag flipped to FALSE and their datetime_deprecated set to the current timestamp.

This operation makes the current_score_flag an authoritative marker for which scores are current within the model_run_id, a key assumption for when it’s time to materialize the scores for downstream use.
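
The approve-and-deprecate step can be sketched with Python's built-in sqlite3 module standing in for the warehouse. This is an illustration under assumptions, not the team's implementation: the reduced column set, the `approve_scoring_run` helper, and the convention that pending rows carry a NULL flag are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE scores (
        external_id TEXT, score_name TEXT, score_value REAL,
        model_run_id TEXT, scoring_run_id TEXT,
        current_score_flag INTEGER,  -- assumption: NULL pending, 1 current, 0 deprecated
        datetime_approved TEXT, datetime_deprecated TEXT
    )""")


def approve_scoring_run(conn, model_run_id, scoring_run_id):
    """Promote one scoring job's rows and retire the rows they supersede."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commit both updates together, or roll back on error
        # Retire current rows of the same model version for overlapping IDs.
        conn.execute("""
            UPDATE scores
               SET current_score_flag = 0, datetime_deprecated = ?
             WHERE model_run_id = ? AND current_score_flag = 1
               AND scoring_run_id <> ?
               AND external_id IN (SELECT external_id FROM scores
                                    WHERE model_run_id = ? AND scoring_run_id = ?)""",
            (now, model_run_id, scoring_run_id, model_run_id, scoring_run_id))
        # Promote the newly checked rows.
        conn.execute("""
            UPDATE scores
               SET current_score_flag = 1, datetime_approved = ?
             WHERE model_run_id = ? AND scoring_run_id = ?""",
            (now, model_run_id, scoring_run_id))


# Made-up data: job-1 scored two IDs earlier; job-2 rescored only one of them.
conn.executemany("INSERT INTO scores VALUES (?, ?, ?, ?, ?, ?, ?, ?)", [
    ("voter-001", "score", 0.40, "run-a", "job-1", 1, "2020-01-01T00:00:00+00:00", None),
    ("voter-002", "score", 0.55, "run-a", "job-1", 1, "2020-01-01T00:00:00+00:00", None),
    ("voter-002", "score", 0.61, "run-a", "job-2", None, None, None),
])
approve_scoring_run(conn, "run-a", "job-2")
current = conn.execute(
    "SELECT external_id, score_value FROM scores "
    "WHERE current_score_flag = 1 ORDER BY external_id").fetchall()
```

In this toy run the ID that job-2 did not rescore keeps its job-1 score as current, while the overlapping ID switches to the job-2 score. That matches the "overlap by external_id" rule described above.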

Finally: Simplified Operations on Model Versions

All of the bookkeeping in the tables above pays huge dividends when it comes time to query our model scores. This short script extracts the current scores for the current production model version of “My Model” with only the name of the model as an input!

DECLARE CURRENT_MODEL_VERSION_DATE DATE;
DECLARE CURRENT_MODEL_RUN_ID STRING;


-- check the model version ledger for the current model metadata
SET (CURRENT_MODEL_VERSION_DATE, CURRENT_MODEL_RUN_ID) = (
 SELECT AS STRUCT model_version_date, model_run_id
   FROM `modeling.model_versions`
  WHERE model_name = "My Model"
    AND current_model_flag is TRUE
);


-- pull the model's current scores
SELECT external_id, score_name, score_value, datetime_approved
  FROM `modeling.scores`
 WHERE model_version_date = CURRENT_MODEL_VERSION_DATE
   AND model_run_id = CURRENT_MODEL_RUN_ID
   AND current_score_flag is TRUE
;

How We Use It

Having a strong, consolidated schema for model versions and scores makes downstream use cases much cleaner and simpler. Here are a few examples of how this infrastructure is used in our work in the 2020 cycle:

Sourcing Publishing Pipelines

The most mission-critical use of this infrastructure is to serve the most current scores of each model to downstream pipelines in a consistent location and format. Our scoring and loading pipelines run a “materialize model” task that writes the results of a query similar to the one above to a dataset holding one “current” table per model.

With our model and score version bookkeeping managed upstream, our downstream code can always find the model’s current scores in the same place. But more importantly, this approach insulates downstream processes from modeling issues: If a scoring job fails for some reason, or if newly loaded scores do not pass automated quality checks, the model version will not be re-materialized with the problematic scores.
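
The insulation described here amounts to a gate in front of the materialize step. The sketch below shows the idea with sqlite3 as a stand-in; the `passes_checks` rule, the `materialize` helper, and the `my_model_current` table name are hypothetical, and the real automated checks are not described in this post.

```python
import sqlite3


def passes_checks(rows):
    """Hypothetical quality gate: non-empty, and every score lies in [0, 1]."""
    return bool(rows) and all(0.0 <= value <= 1.0 for _, value in rows)


def materialize(conn, table_name, rows):
    """Rebuild the per-model 'current' table only when the new rows pass checks.

    table_name is assumed to come from trusted pipeline config, not user input.
    """
    if not passes_checks(rows):
        return False  # leave the previously materialized table untouched
    with conn:
        conn.execute(f'DROP TABLE IF EXISTS "{table_name}"')
        conn.execute(f'CREATE TABLE "{table_name}" (external_id TEXT, score_value REAL)')
        conn.executemany(f'INSERT INTO "{table_name}" VALUES (?, ?)', rows)
    return True


conn = sqlite3.connect(":memory:")
ok_first = materialize(conn, "my_model_current", [("voter-001", 0.40), ("voter-002", 0.55)])
# A made-up bad batch: the score is outside the valid range, so the gate rejects it.
ok_second = materialize(conn, "my_model_current", [("voter-001", 1.70)])
served = conn.execute(
    'SELECT external_id, score_value FROM "my_model_current" ORDER BY external_id').fetchall()
```

After the rejected batch, downstream readers of `my_model_current` still see the last good scores, which is the behavior the paragraph above describes.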

Breadcrumbs for a Model’s Life Cycle

With Election Day right around the corner, we have to respond quickly and confidently to potential problems with our models. We maintain a complete picture of a model and its scores by linking the model_run_id and scoring_run_id fields to the metadata and artifacts we store in an MLflow tracking instance.

This allows us to trace a published score back to its scoring job, its training job, and, through our other metadata systems, the exact training data that built the underlying model. This piece is critical for diagnosing issues and anomalies users encounter in the field.

It’s also a gift to our future selves: when it’s time to revisit our models and improve their future versions, we’ll be working with a complete understanding of their inputs and outputs.

Model Reversions

Sometimes, we detect a problem with a model after it’s been published. When this happens, we can adjust the current_model_flag in the model_versions table to toggle a model version out of production, or simply suppress scores from a problematic scoring job. After adjusting the metadata in the clearinghouse, we can then re-materialize the model for downstream pipelines and address the lingering issues.
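
Toggling a version out of production can be as small as one update on the ledger. The following sqlite3 sketch assumes a reduced model_versions table and a hypothetical `revert_to` helper; the run IDs and dates are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE model_versions (
        model_name TEXT, model_run_id TEXT,
        model_version_date TEXT, current_model_flag INTEGER
    )""")
conn.executemany("INSERT INTO model_versions VALUES (?, ?, ?, ?)", [
    ("My Model", "run-a", "2020-06-01", 0),  # earlier version
    ("My Model", "run-b", "2020-09-01", 1),  # version currently in production
])


def revert_to(conn, model_name, model_run_id):
    """Make model_run_id the only production version of model_name."""
    with conn:
        # The comparison evaluates to 1 for the chosen version and 0 for the rest.
        conn.execute(
            "UPDATE model_versions SET current_model_flag = (model_run_id = ?) "
            "WHERE model_name = ?",
            (model_run_id, model_name))


revert_to(conn, "My Model", "run-a")
production = conn.execute(
    "SELECT model_run_id FROM model_versions "
    "WHERE model_name = 'My Model' AND current_model_flag = 1").fetchall()
```

Because the extraction script shown earlier looks up the production version by its flag, re-running the materialize step after this update would serve the reverted version's scores.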

Point-In-Time Snapshots

Political data folks are fanatics for historical analysis, and model estimates can be a key part of that. With the datetime_approved and datetime_deprecated fields, we can reconstruct a model version’s ‘current’ scores from any point in time just as we materialize ‘current’ tables. This can give us a quick view into a model’s predictions over time, or even a snapshot of predictions as of a past Election Day.
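
A point-in-time reconstruction can be expressed as an interval test on the two timestamp fields. The sketch below uses sqlite3 with a reduced scores table; the `scores_as_of` helper and the sample rows are hypothetical, and a real query would also filter on the model version as in the extraction script earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE scores (
        external_id TEXT, score_value REAL,
        datetime_approved TEXT, datetime_deprecated TEXT
    )""")
# Made-up history: the first score was replaced on 2020-06-01.
conn.executemany("INSERT INTO scores VALUES (?, ?, ?, ?)", [
    ("voter-001", 0.40, "2020-01-01T00:00:00", "2020-06-01T00:00:00"),
    ("voter-001", 0.52, "2020-06-01T00:00:00", None),
])


def scores_as_of(conn, as_of):
    """Return the rows that were 'current' at the ISO-8601 timestamp as_of."""
    return conn.execute("""
        SELECT external_id, score_value FROM scores
         WHERE datetime_approved <= ?
           AND (datetime_deprecated IS NULL OR datetime_deprecated > ?)
         ORDER BY external_id""", (as_of, as_of)).fetchall()


spring = scores_as_of(conn, "2020-03-15T00:00:00")
autumn = scores_as_of(conn, "2020-10-01T00:00:00")
```

A score counts as current at a given moment if it had been approved by then and had not yet been deprecated, so each ID yields at most one row per score name for any timestamp.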

Conclusion

Our investment in this infrastructure has allowed us to develop and deliver data science products that are more transparent and sustainable than ever before, and we’ll continue building on this infrastructure for the 2022 election cycle and beyond.

Interested in making a difference? Join our team.
