Semantic Drift
Sometimes your models fail because the world breaks. Sometimes your models die because your observations of the world break. The distinction is harder to discern than you think.
Understanding the difference and how to conquer semantic drift leads to much more powerful models.
What Is Concept Drift?
Many of the articles I see online about concept drift assume that the reader has a background in statistics. Here’s an alternative approach:
A complex world exists outside of your organization. Change in this external world often breaks your models: think of COVID-19 and our collective response, changes in US Federal Reserve policy, or the rise of TikTok. We call these changes concept drift.
A concept is the relationship between the inputs and outputs of a model, where the outputs are usually the model’s predictions. For example:
Show a prospective customer two sweaters, sweater A and sweater B. Assuming that this prospect is a female between the ages of 30 and 35 in the Northeast USA in a summer month, the prospect chooses sweater A 82% of the time.
All of a sudden, the balance between sweater A and sweater B shifts to 27% sweater A and 73% sweater B. What happened?
In the context of a predictive model, the product choices, demographics, geography, and seasonality are inputs. The output is the predicted likelihood of purchasing each product. This relationship is the concept.
If this shift in purchasing patterns occurs out of the blue, we assume something unseen about the world has changed. Maybe the color is no longer in vogue. Perhaps animal rights campaigns have affected the consumption of wool. This drift is concept drift.
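One way to notice a shift like the sweater example is to compare the share of prospects choosing sweater A before and after. The sketch below uses a standard two-proportion z-test. The sample sizes (1,000 prospects in each period), the function name, and the alert threshold of 3 are assumptions for illustration, not figures from this post. Note that a test like this only says that something shifted; it cannot say whether the cause is concept drift or a change in how the data was captured.

```python
import math


def two_proportion_z(successes_old, n_old, successes_new, n_new):
    """Return the z statistic for the change in a proportion between two samples."""
    p_old = successes_old / n_old
    p_new = successes_new / n_new
    # Pooled proportion under the null hypothesis that nothing changed.
    pooled = (successes_old + successes_new) / (n_old + n_new)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    return (p_new - p_old) / standard_error


# Assumed counts: 82% of 1,000 prospects chose sweater A before, 27% of 1,000 after.
z = two_proportion_z(820, 1000, 270, 1000)

# Assumed alert threshold; a large |z| means the shift is unlikely to be noise.
drifted = abs(z) > 3.0
print(f"z = {z:.1f}, drifted = {drifted}")
```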
Adapting to a Changing World
When the concept behind a model drifts, the model has to be rebuilt. There are two options:
- Use the same model type and parameters (and code), and train it from scratch on new data that represents the latest state of the world. This data should be unpolluted by the old world. An analogy is raising an infant in a world that has always had TikTok.
- Failing the first option, throw away the code and go through the painstaking process of data science research to discover the new model type and parameters that fit the latest state of the world. This is like designing a cyborg optimized for TikTok.
Note that, in both cases, the new model wouldn’t work on the old state of the world. Old world? Old model. New world? New model.
The only time to use the same model on both the new world and the old world is when you seek a model that is stable across both. In other words, you desire a model based on inputs unaffected by the shifting winds of TikTok, colors from Milan, or animal rights campaigns.
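The first option above, same code trained from scratch on new-world data only, can be sketched as follows. This is a toy under stated assumptions: the "model" is just the observed share of sweater A per segment, and the record fields (`segment`, `chose_a`, `observed_on`), the function names, and the cutoff date are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def fit_choice_rates(records):
    """'Train' a trivial model: the observed share of sweater A per segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [times A chosen, total]
    for record in records:
        counts[record["segment"]][0] += record["chose_a"]
        counts[record["segment"]][1] += 1
    return {segment: chose_a / total for segment, (chose_a, total) in counts.items()}


def retrain_on_new_world(records, world_changed_on):
    """Same training code, but only data observed on or after the change."""
    fresh = [r for r in records if r["observed_on"] >= world_changed_on]
    return fit_choice_rates(fresh)


SEGMENT = "F/30-35/Northeast/summer"  # hypothetical segment label
records = [
    {"segment": SEGMENT, "chose_a": 1, "observed_on": date(2020, 1, 10)},
    {"segment": SEGMENT, "chose_a": 1, "observed_on": date(2020, 1, 11)},
    {"segment": SEGMENT, "chose_a": 0, "observed_on": date(2020, 6, 1)},
    {"segment": SEGMENT, "chose_a": 0, "observed_on": date(2020, 6, 2)},
    {"segment": SEGMENT, "chose_a": 1, "observed_on": date(2020, 6, 3)},
]

old_and_new_mixed = fit_choice_rates(records)               # polluted by the old world
new_model = retrain_on_new_world(records, date(2020, 3, 1))  # new world only
```

The difficult part in practice is choosing the cutoff, since the date on which the world changed is rarely known exactly.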
Has the World Changed?
When models break, data scientists often start with the assumption that the underlying concept must have drifted. After all, that is easier than trying to gather some magical inputs that may not exist or may be impossible (or too expensive) to acquire.
There is a crucial caveat to this assumption: Most organizations do the bulk of their data science research on data that has to do with a specific product, service, or platform. Often, the product captures that data itself. And when this is the case, changes in the platform and changes in how it captures data about itself can introduce drift that has nothing to do with the outside world.
This different kind of drift is semantic drift. And it takes a lot of work and discipline to be able to separate this from concept drift.
What Is Semantic Drift?
The term semantic drift has applications outside of statistics and machine learning. Here, we refer to the meaning of data:
- What does a variable or feature mean?
- What does the value of a variable or feature mean?
- What is an observation?
Let’s take gender identity as the variable in question. Upon registration, an online shopping platform asks prospective customers to identify their gender as either male or female.
If the online shopping platform adds a third gender identity option, unspecified, the meaning, or semantics, of the gender identity variable changes. Let’s set the shifting politics of gender identity aside. Perhaps certain people wish to leave their gender marker unspecified to reduce their personal feeling of being surveilled. The point is that gender identity used to have two choices, and it now has three.
A typical data science use case is separating the population of prospects into partitions to treat differently. For example, I may wish to bucket males separately from females for marketing.
When the system changes how it defines gender identity, models behave strangely. Individual prospects who answered female in the old data might have chosen unspecified if given the chance. But they answered that question at registration, and the organization won’t ask them again.
Now, data scientists have to deal with a population of prospects whose value for the gender identity variable means something different depending on when the person registered.
Often data scientists don’t know that their datasets contain multiple semantic versions. And if they do, they have no easy way to tell which prospects are which.
Conquering Semantic Drift
A host of specific best practices for conquering semantic drift exist. This product management and software engineering discipline is rich and is beyond the scope of this post.
The basics boil down to this:
- Products, services, and platforms that generate data must product-manage the data they generate as a first-class citizen.
- Help data scientists by tagging data with the semantic version, which requires code-as-deployed versioning discipline.
- Ensure data scientists can thoroughly understand how data of a specific semantic version was collected by preserving code-as-deployed such that it is identifiable by version.
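As a minimal sketch of what tagging data with its semantic version could look like for the gender identity example, each observation below carries both the semantic version of the variable and an identifier for the code that was deployed when it was captured. The version strings, the `code_revision` field, and the `capture` helper are hypothetical names chosen for illustration, not part of any particular platform.

```python
from dataclasses import dataclass

# Hypothetical registry: which answers the registration form offered in each
# semantic version of the gender identity variable.
GENDER_OPTIONS_BY_VERSION = {
    "1.0.0": ("male", "female"),
    "2.0.0": ("male", "female", "unspecified"),
}


@dataclass(frozen=True)
class Observation:
    user_id: str
    gender: str
    semantic_version: str  # what the value means
    code_revision: str     # which deployed code captured it


def capture(user_id, gender, semantic_version, code_revision):
    """Record an answer, rejecting values the form could not have offered."""
    allowed = GENDER_OPTIONS_BY_VERSION[semantic_version]
    if gender not in allowed:
        raise ValueError(
            f"{gender!r} is not an option in semantic version {semantic_version}"
        )
    return Observation(user_id, gender, semantic_version, code_revision)


early = capture("u1", "female", "1.0.0", "rev-a")
later = capture("u2", "unspecified", "2.0.0", "rev-b")
```

With the version stored on every row, a data scientist can tell which prospects answered under which definition instead of inferring it from registration dates.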
Product managers and product software engineers are not the only ones with work to do. Data scientists and data engineers do as well. The primary task is:
- Data infrastructure should be able to rationalize data sets, such as user journeys, that include observations of differing semantic versions.
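One possible reading of "rationalize" is sketched below: rather than silently mixing versions, derive an explicit column recording whether each respondent was ever offered the unspecified option, and let downstream code partition by version. The row layout, the `unspecified_offered` column, and the assumption that the option arrived in version 2.0.0 are all hypothetical.

```python
from collections import defaultdict

# Assumption: the first semantic version whose form offered "unspecified".
UNSPECIFIED_OFFERED_FROM = (2, 0, 0)


def parse_version(text):
    """Turn '2.0.0' into (2, 0, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def rationalize(observations):
    """Add a column saying whether 'unspecified' was an available answer."""
    rationalized = []
    for obs in observations:
        offered = parse_version(obs["semantic_version"]) >= UNSPECIFIED_OFFERED_FROM
        rationalized.append({**obs, "unspecified_offered": offered})
    return rationalized


def partition_by_version(observations):
    """Group observations so each semantic version can be modeled separately."""
    groups = defaultdict(list)
    for obs in observations:
        groups[obs["semantic_version"]].append(obs)
    return dict(groups)


rows = [
    {"user_id": "u1", "gender": "female", "semantic_version": "1.0.0"},
    {"user_id": "u2", "gender": "female", "semantic_version": "2.0.0"},
    {"user_id": "u3", "gender": "unspecified", "semantic_version": "2.0.0"},
]
clean = rationalize(rows)
by_version = partition_by_version(clean)
```

Here "female" from u1 and "female" from u2 remain distinguishable: only u2 chose it when a third option was available.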
Thoughts and feedback, please!
Translated from: https://medium.com/woodlamp-tech/when-concept-drift-is-semantic-drift-be1ac7e1abf5