Once in a while, when reading papers in the Reinforcement Learning domain, you may stumble across mysterious-sounding phrases such as ‘we deal with a filtered probability space’, ‘the expected value is conditional on a filtration’ or ‘the decision-making policy is ℱₜ-measurable’. Without formal training in measure theory [2,3], it may be difficult to grasp what exactly such a filtration entails. Formal definitions look something like this:
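Paraphrasing the standard textbook definition [2,3]: given a probability space (Ω, ℱ, ℙ), a filtration is a family of σ-algebras {ℱₜ}, t=0,…,T, satisfying ℱₛ ⊆ ℱₜ ⊆ ℱ for all s ≤ t. In words: information available at time s is never lost at a later time t.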
Boilerplate language for those familiar with measure theory, no doubt, but hardly helpful otherwise. Googling for answers likely leads through a maze of σ-algebras, Borel sets, Lebesgue measures and Hausdorff spaces, again presuming that one already knows the basics. Fortunately, only a very basic understanding of a filtration is needed to grasp its implications within the RL domain. This article is far from a full discussion of the topic, but it aims to give a brief and (hopefully) intuitive outline of the core concept.
An example
In RL, we typically define an outcome space Ω that contains all possible outcomes or samples that may occur, with ω being a specific sample path. For the sake of illustration, we will assume that our RL problem relates to a stock with price Sₜ on day t. Of course we’d like to buy low and sell high (the precise decision-making problem is irrelevant here). We might denote the buying/selling decision as xₜ(ω), i.e., the decision is conditional on the price path. We start with price S₀ (a real number) and every day the price goes up or down according to some random process. We may simulate (or mathematically define) such a price path ω=[ω₁,…,ωₜ] up front, before running the episode. However, that does not mean we get to know stock price movements before they actually happen; even Warren Buffett could only dream of having such information! To claim that we base our decisions on ω without being clairvoyant, we state that the outcome space is ‘filtered’ (using the symbol ℱ), meaning we can only observe the sample path up to time t.
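A minimal Python sketch of this idea (the path length and the up/down moves are illustrative assumptions, not part of any formal model): the entire path ω is drawn up front, but at time t we may only act on the prefix revealed so far.

```python
import random

random.seed(0)
T = 3
# Simulate the full sample path omega up front, before running the episode.
omega = [random.choice("ud") for _ in range(T)]

def observable(omega, t):
    """The information a non-clairvoyant agent has at time t:
    only the moves revealed up to and including day t."""
    return omega[:t]

for t in range(T + 1):
    print(f"t={t}: observed so far = {observable(omega, t)}")
# A decision x_t(omega) may depend on omega only through this prefix.
```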
For most RL practitioners, this restriction should sound familiar. Don’t we usually base our decisions on the current state Sₜ? Indeed, we do. In fact, the Markov property implies that the stochastic process is memoryless: we only need the information embedded in the prevailing state Sₜ, and information from the past is irrelevant [5]. As we will shortly see, the filtration is richer and more generic than a state, yet for practical purposes their implications are similar.
Let’s formalize our stock price problem a bit more. We start with a discrete problem setting, in which the price either goes up (u) or down (-d). Considering an episode horizon of three days, the outcome space Ω may be visualized by a binomial lattice [4].
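A small sketch of this outcome space (the starting price S₀ and the factors u and d are hypothetical numbers): it enumerates all 2³ = 8 paths and shows that the lattice recombines, e.g. ‘uud’, ‘udu’ and ‘duu’ all lead to the same price.

```python
from itertools import product

S0, u, d = 100.0, 1.1, 0.9   # hypothetical parameters

# The outcome space Omega for a three-day horizon: all 2^3 = 8 paths.
omega_space = ["".join(p) for p in product("ud", repeat=3)]

def price(path):
    """Stock price after following a path of 'u'/'d' moves."""
    s = S0
    for move in path:
        s *= u if move == "u" else d
    return s

for path in omega_space:
    print(path, round(price(path), 2))
# 'uud', 'udu' and 'duu' end at the same price: the lattice recombines.
```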
Definition of events and filtrations
At this point, we need to define the notion of an ‘event’ A ⊆ Ω. Perhaps stated somewhat abstractly, an event is a subset of the outcome space. Simply put, we can assign a probability to an event and assert whether it has happened or not. As we will soon show, it is not the same as a realization ω though.
A filtration ℱ is a mathematical model that represents partial knowledge about the outcome. In essence, it tells us whether an event has happened or not. The ‘filtration process’ may be visualized as a sequence of filters, with each filter providing us a more detailed view. Concretely, in an RL context the filtration provides us with the information needed to compute the current state Sₜ, without giving any indication of future changes in the process [2]. Indeed, just like the Markov property.
Formally, each element ℱₜ of a filtration is a σ-algebra, and although you don’t need to know the ins and outs, some background is useful. Loosely defined, a σ-algebra is a collection of subsets of the outcome space that contains the outcome space itself and is closed under complements and countable unions. In measure theory this concept has major implications; for the purpose of this article you only need to remember that a σ-algebra is a collection of events.
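For a finite outcome space we can make this definition tangible with a brute-force sketch: start from some generating events and keep adding complements and unions until nothing changes. The function name below is mine, chosen for illustration.

```python
from itertools import combinations

def sigma_algebra(omega, generators):
    """Close a family of subsets of a finite omega under complements
    and unions; the result is the sigma-algebra the generators induce."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        new = {omega - a for a in sets} | {a | b for a, b in combinations(sets, 2)}
        if not new <= sets:
            sets |= new
            changed = True
    return sets

# The trivial sigma-algebra over two outcomes: just the empty set and omega.
print(sigma_algebra({"u", "d"}, []))
```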
Example revisited — discrete case
Back to the example, because the filtration only comes alive when we see it in action. We first need to define the events, using sequences such as ‘udu’ to describe price movements over time. At t=0 we basically don’t know anything: all paths are still possible. Thus, the event set A={uuu, uud, udu, udd, ddd, ddu, dud, duu} contains all possible paths ω ∈ Ω. At t=1, we know a little more: the stock price went either up or down. The corresponding events are defined by Aᵤ={uuu,uud,udu,udd} and Aₔ={ddd,ddu,dud,duu}. If the stock price went up, we can surmise that our sample path ω will be in Aᵤ and not in Aₔ (and vice versa, of course). At t=2, we have four event sets: Aᵤᵤ={uuu,uud}, Aᵤₔ={udu,udd}, Aₔᵤ={duu,dud}, and Aₔₔ={ddu,ddd}. Observe that the information is getting increasingly fine-grained; the sets to which ω might belong are becoming smaller and more numerous. At t=3, we obviously know the exact price path that has been followed.
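A quick sketch of how these event sets arise: group the eight paths by the prefix observed up to time t (the helper below is mine, for illustration only).

```python
from itertools import product
from collections import defaultdict

omega_space = ["".join(p) for p in product("ud", repeat=3)]

def partition_at(t):
    """Paths sharing the first t moves are indistinguishable at time t;
    each group is one event set of the partition."""
    cells = defaultdict(list)
    for path in omega_space:
        cells[path[:t]].append(path)
    return dict(cells)

for t in range(4):
    print(f"t={t}:", partition_at(t))
# t=1 prints A_u and A_d; t=2 prints the four sets A_uu, ..., A_dd.
```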
Having defined the events, we can define the corresponding filtrations for t=0,1,2,3:
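ℱ₀ = {∅, Ω}
ℱ₁ = {∅, Ω, Aᵤ, Aₔ}
ℱ₂ = the σ-algebra generated by {Aᵤᵤ, Aᵤₔ, Aₔᵤ, Aₔₔ}, i.e. all unions of these four sets, including ∅ and Ω (16 sets in total)
ℱ₃ = all 2⁸ = 256 subsets of Ω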
At t=0, every outcome is possible. We initialize the filtration with the empty set ∅ and the outcome space Ω; this pair is also known as the trivial σ-algebra.
At t=1, we can simply add Aᵤ and Aₔ to ℱ₀ to obtain ℱ₁; recall from the definition that each σ-algebra in the filtration includes all elements of its predecessor. We can use the freshly revealed information to compute S₁. We also get a peek into the future (without actually revealing future information!): if the price went up, we cannot reach the lowest possible price at t=3 anymore. The event sets are illustrated below.
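A two-line check of this ‘peek into the future’, assuming (hypothetically) that we observed an up-movement on day one:

```python
from itertools import product

omega_space = ["".join(p) for p in product("ud", repeat=3)]

observed = "u"   # hypothetical: the price went up on day one
A_u = [w for w in omega_space if w.startswith(observed)]
print(A_u)              # ['uuu', 'uud', 'udu', 'udd']
print("ddd" in A_u)     # False: the lowest t=3 price is out of reach
```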
At t=2, we may distinguish between four events depending on the price paths revealed so far. Here things get a bit more involved, as we also need to add the unions and complements (in line with the requirements of the σ-algebra). This was not necessary for ℱ₁, as the union of Aᵤ and Aₔ equals the outcome space and Aᵤ is the complement of Aₔ. From an RL perspective, you might note that we have more information than strictly needed. For instance, an up-movement followed by a down-movement yields the same price as the reverse. In RL applications we would typically not store such redundant information, yet you can probably recognize the mathematical appeal.
At t=3, we already have 256 sets, using the same procedure as before. You can see that filtrations quickly become extremely large. A filtration always contains all elements of the preceding step: our filtration gets richer and more fine-grained with the passing of time. All this means is that we can more precisely pinpoint the events to which our sample price path may or may not belong.
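The growth is easy to verify by counting: a σ-algebra generated by a partition with k cells contains 2ᵏ sets, one for every union of cells. A short sketch:

```python
from itertools import product

omega_space = ["".join(p) for p in product("ud", repeat=3)]

for t in range(4):
    cells = {path[:t] for path in omega_space}   # distinct prefixes at time t
    print(f"t={t}: {len(cells)} cells -> {2 ** len(cells)} sets in F_t")
# t=0: 1 -> 2, t=1: 2 -> 4, t=2: 4 -> 16, t=3: 8 -> 256
```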
A continuous example
We are almost there, but we would be remiss if we treated only discrete problems. In reality, stock prices do not only go ‘up’ or ‘down’; they change within a continuous domain. The same holds for many other RL problems. Although the concept is the same as in the discrete case, providing explicit descriptions of filtrations in continuous settings is difficult. Again, some illustrations might help more than formal definitions.
Suppose that at every time step, we simulate a return from the real domain [-d,u]. Depending on the time we look ahead, we may then define an interval in which the future stock price will fall, say [329,335] at a given point in time. We can then define intervals within this domain. Any arbitrary interval may constitute an event, for instance:
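A = {ω : Sₜ(ω) ∈ [330, 331]} (the endpoints are hypothetical numbers chosen inside the [329, 335] range above)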
The complement of an interval could look like:
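Aᶜ = {ω : Sₜ(ω) ∈ [329, 330) ∪ (331, 335]} (the complement of the interval [330, 331] within [329, 335])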
Furthermore, a plethora of unions may be defined, such as:
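{ω : Sₜ(ω) ∈ [329.5, 330] ∪ [332, 333.5]} (the union of two disjoint intervals, again with hypothetical endpoints)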
As you may have guessed, there are infinitely many such events in all shapes and sizes, yet each of them can be built from countable unions and complements of intervals, and we can assign a probability to each of them [2,3].
The further we look into the future, the more we can deviate from our current stock price. We might visualize this with a cone shape that expands over time (displayed below for t=50 and t=80). Within the cone, we can define infinitely many intervals. As before, we acquire a more detailed view once more time has passed.
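A simulation sketch of the widening cone (the return bounds d and u and the starting price are hypothetical): sample many paths and report the realized price range at t=50 and t=80.

```python
import random

random.seed(42)
S0, d, u = 330.0, 0.01, 0.01   # hypothetical starting price and return bounds

def simulate(horizon):
    """One sample path: apply a uniform daily return from [-d, u]."""
    s = S0
    for _ in range(horizon):
        s *= 1.0 + random.uniform(-d, u)
    return s

for t in (50, 80):
    finals = [simulate(t) for _ in range(10_000)]
    print(f"t={t}: price range [{min(finals):.1f}, {max(finals):.1f}]")
# The range at t=80 is wider than at t=50: the cone expands over time.
```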
Wrapping things up
When you encounter filtrations in an RL paper, the basics treated in this article should suffice. Essentially, the only purpose of introducing filtrations ℱₜ is to ensure that decisions xₜ(ω) do not utilize information that has not yet been revealed. When the Markov property holds, a decision xₜ(Sₜ) that operates on the current state serves the same purpose. The filtration provides a rich description of the past, yet we do not need this information in memoryless problems. Nevertheless, from a mathematical perspective it is an elegant solution with many interesting applications. The reinforcement learning community consists of many researchers and engineers from different backgrounds working in a variety of domains; not everyone speaks the same language. Sometimes it goes a long way to learn another language, even if only a few words.
[This article is partially based on my arXiv article ‘A Gentle Lecture Note on Filtrations in Reinforcement Learning’ [1].]
[1] Van Heeswijk, W.J.A. (2020). A Gentle Lecture Note on Filtrations in Reinforcement Learning. arXiv preprint arXiv:2008.02622
[2] Shreve, S. E. (2004). Stochastic Calculus for Finance II: Continuous-Time Models, Volume 11. Springer Science & Business Media.
[3] Shiryaev, A. N. (1996). Probability. Springer New York-Heidelberg.
[4] Luenberger, D. G. (1997). Investment Science. Oxford University Press.
[5] Powell, W. B. (2020). On State Variables, Bandit Problems and POMDPs. arXiv preprint arXiv:2002.06238