LookAheadEnumerator: Implementing Backtracking
Introduction
Complex parsers often need to support backtracking, which is a way to revisit items you've already encountered. The trick is twofold: doing it efficiently, and doing it transparently. The LookAheadEnumerator<T> class provides both.
Update: Fixed a bug, made the class more robust, added doc comments, and "slangified" about 90% of the code (removed switch, nameof, and the like) so that Slang can cook it and I can use it in my generated parser code.
Update: LookAheadEnumerator is now used extensively in my Parsley parser generator, which can parse C# (an example of parsing a large subset is at the link).
Background
It's often desirable in parser code to use some sort of "streaming" interface for input, such as a TextReader or a class implementing IEnumerator<T>. I prefer enumerators because of their ubiquity and simplicity. However, it can be difficult to backtrack over a streaming source without preloading it into memory before parsing. Preloading is fine for small text, but not for reams of, say, bulk JSON.
Normally with an enumerator, all you can do is MoveNext(), and sometimes Reset() if you're lucky. There is no way to seek back to a particular previous position, and even if there were, it probably wouldn't work on a true streaming source, like an HTTP response or console input.
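To make the Reset() caveat concrete, here is a small sketch. The names ResetDemo, Stream, and CanRewind are made up for illustration; the iterator method merely stands in for a forward-only source. Enumerators generated by the C# compiler from iterator methods throw NotSupportedException from Reset(), while the enumerator over a string happens to support it.

```csharp
using System;
using System.Collections.Generic;

static class ResetDemo
{
    // A compiler-generated iterator standing in for a forward-only source.
    public static IEnumerator<char> Stream()
    {
        yield return 'a';
        yield return 'b';
        yield return 'c';
    }

    // Advances once, then reports whether the enumerator can be rewound.
    public static bool CanRewind(IEnumerator<char> e)
    {
        e.MoveNext();
        try
        {
            e.Reset();
            return true;
        }
        catch (NotSupportedException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(CanRewind(Stream()));              // iterator: no Reset()
        Console.WriteLine(CanRewind("abc".GetEnumerator())); // string: Reset() works
    }
}
```

Even where Reset() works, it only returns to the very start, which is not the same as returning to a bookmarked position.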
A backtracking parser, on the other hand, needs to "bookmark" its current position before trying several alternatives until it finds one that parses. That means revisiting the same sequence of text several times.
Backtracking parsers are inherently less efficient but far more flexible than non-backtracking parsers. I've made a fair effort to optimize this class to make it as efficient as possible for this purpose.
Conceptualizing This Mess
I've embedded an array-backed queue into this class, which it uses to back the lookahead buffer. The queue starts with 16 elements and grows as needed, depending on how much lookahead is required. It almost doubles in size each time to avoid too many reallocations, since heap in .NET is cheaper than CPU. When a LookAheadEnumeratorEnumerator<T> (the lookahead cursor) is advanced, it often requires the primary class to read more data into the queue in order to satisfy it. When the main cursor is advanced, it discards items from the queue by simply incrementing _queueHead, which is really fast. It's not a good idea to advance or reset the main cursor while using the lookahead cursor. The results in that case are undefined, as I haven't implemented versioning in these enumerators.
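The queue described above can be pictured with a rough sketch like the following. This is not the class's actual code: RingQueue<T> and its members are hypothetical names, and the growth policy is assumed to be plain doubling rather than the "almost doubling" the real class uses. It only illustrates how an array-backed queue can discard items by bumping a head index and grow when lookahead outruns capacity.

```csharp
using System;

// Hypothetical stand-in for the embedded queue; not the article's code.
sealed class RingQueue<T>
{
    T[] _items = new T[16]; // initial capacity of 16, as described above
    int _head;              // index of the oldest buffered item
    int _count;             // number of buffered items

    public int Count { get { return _count; } }
    public int Capacity { get { return _items.Length; } }

    // Appends an item, growing the array first if it is full.
    public void Enqueue(T item)
    {
        if (_count == _items.Length)
        {
            // Assumed growth policy: plain doubling.
            var grown = new T[_items.Length * 2];
            for (int i = 0; i < _count; i++)
                grown[i] = _items[(_head + i) % _items.Length];
            _items = grown;
            _head = 0;
        }
        _items[(_head + _count) % _items.Length] = item;
        _count++;
    }

    // Reads the item 'offset' positions past the head without removing it.
    public T Peek(int offset)
    {
        if (offset < 0 || offset >= _count)
            throw new ArgumentOutOfRangeException("offset");
        return _items[(_head + offset) % _items.Length];
    }

    // Discards the oldest item by bumping the head index; nothing is copied.
    public void Dequeue()
    {
        if (_count == 0)
            throw new InvalidOperationException("The queue is empty.");
        _items[_head] = default(T); // drop the reference so it can be collected
        _head = (_head + 1) % _items.Length;
        _count--;
    }
}

static class RingQueueDemo
{
    static void Main()
    {
        var q = new RingQueue<int>();
        for (int i = 0; i < 20; i++) q.Enqueue(i); // 20 items force one growth
        q.Dequeue();                               // discards the item 0
        Console.WriteLine(q.Peek(0) + " " + q.Count + " " + q.Capacity); // 1 19 32
    }
}
```

In this picture, advancing the main cursor corresponds to Dequeue(), and advancing the lookahead cursor corresponds to Peek() at increasing offsets, with Enqueue() called whenever the offset runs past what has been buffered.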
Using This Mess
You use the class like a standard IEnumerator<T> with an additional property, LookAhead, that allows you to foreach from your current position without advancing the cursor. There's also Peek() and TryPeek(), which look ahead a specified number of positions, and DiscardLookAhead(), which simply moves the cursor to the physical position and clears the buffer.
var text = "fubarfoobaz";
var la = new LookAheadEnumerator<char>(text.GetEnumerator());
la.MoveNext(); // can't look ahead until we're already over the position
               // we want to start looking at.
foreach (var ch in la.LookAhead)
    Console.Write(ch);
Console.WriteLine();
while (la.MoveNext())
{
    foreach (var ch in la.LookAhead)
        Console.Write(ch);
    Console.WriteLine();
}
This prints the following to the console:
fubarfoobaz
ubarfoobaz
barfoobaz
arfoobaz
rfoobaz
foobaz
oobaz
obaz
baz
az
z
As you can see, each iteration advances the primary cursor by one and then enumerates over LookAhead from there. Enumerating over LookAhead does not affect the primary cursor*.
* The underlying physical read cursor is advanced, as it must be, but a facade hides this: a queue buffers the input that has already been read and presents it as the next input.
Typically in a parser, you'll use it over tokens, as in LookAheadEnumerator<Token>, and each time you need to backtrack, you parse along LookAhead instead of the primary cursor. When you find a match, you have to discard all the tokens you matched, either by reparsing along the primary cursor or, if you know how many tokens you parsed, by counting and advancing. If you're only parsing one alternative to the main parse, you can simply call DiscardLookAhead() once you've matched the alternate.
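The "parse along LookAhead, then count and advance" pattern might look roughly like this. To keep the sketch self-contained it does not use the real class: MiniLookAhead<T> is a hypothetical, heavily simplified stand-in that buffers in a List<T> and offers only MoveNext(), Current, and LookAhead, and it works over characters rather than tokens. TryMatch and ParseOneOf are likewise made-up helper names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for LookAheadEnumerator<T>.
sealed class MiniLookAhead<T>
{
    readonly IEnumerator<T> _inner;
    readonly List<T> _buffer = new List<T>(); // _buffer[0] is Current once started
    bool _started;

    public MiniLookAhead(IEnumerator<T> inner) { _inner = inner; }

    public T Current { get { return _buffer[0]; } }

    // Advances the primary cursor, dropping the item it leaves behind.
    public bool MoveNext()
    {
        if (_started && _buffer.Count > 0) _buffer.RemoveAt(0);
        _started = true;
        return Fill(1);
    }

    // Enumerates from the primary cursor without moving it.
    public IEnumerable<T> LookAhead
    {
        get
        {
            for (int i = 0; Fill(i + 1); i++)
                yield return _buffer[i];
        }
    }

    // Reads from the source until 'count' items are buffered, if possible.
    bool Fill(int count)
    {
        while (_buffer.Count < count)
        {
            if (!_inner.MoveNext()) return false;
            _buffer.Add(_inner.Current);
        }
        return true;
    }
}

static class BacktrackDemo
{
    // Matches 'word' along the lookahead cursor. Returns the number of items
    // matched, or -1 on failure. The primary cursor is never moved here.
    public static int TryMatch(MiniLookAhead<char> input, string word)
    {
        int n = 0;
        foreach (char ch in input.LookAhead)
        {
            if (n == word.Length) break;
            if (ch != word[n]) return -1;
            n++;
        }
        return n == word.Length ? n : -1;
    }

    // Returns the first alternative that matches at the current position and
    // advances the primary cursor past it; returns null if none match.
    public static string ParseOneOf(MiniLookAhead<char> input, params string[] alternatives)
    {
        foreach (string alt in alternatives)
        {
            int matched = TryMatch(input, alt);
            if (matched < 0) continue;        // backtrack: try the next alternative
            for (int i = 0; i < matched; i++) // commit: count and advance
                input.MoveNext();
            return alt;
        }
        return null;
    }

    static void Main()
    {
        var input = new MiniLookAhead<char>("foobaz".GetEnumerator());
        input.MoveNext(); // position over the first character
        Console.WriteLine(ParseOneOf(input, "fubar", "foobar", "foo")); // foo
        Console.WriteLine(ParseOneOf(input, "bar", "baz"));             // baz
    }
}
```

Failed alternatives cost nothing to undo because they only ever read through LookAhead. The primary cursor moves once, after an alternative has fully matched, and each lookahead enumeration finishes before the primary cursor is advanced.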
That's about all. This is an extremely specialized class, but when you need it, you really need it.
Translated from: https://www.codeproject.com/Tips/5254600/LookAheadEnumerator-Implement-Backtracking-in-Your