Beyond regular expressions: An introduction to parsing context-free grammars

by Christopher Diggins

An important and useful tool that is already a part of most programmers’ arsenals is the trusty regular expression. But beyond that lie context-free grammars. This is a simple concept with a fancy name.

A regular expression is a method of validating and finding patterns in text. The kinds of patterns (aka grammars) that can be described and detected using a regular expression are called regular languages. Regular languages are the simplest of formal languages in the Chomsky hierarchy.

Regular expressions are great for finding or validating many types of simple patterns, for example phone numbers, email addresses, and URLs. However, they fall short when applied to patterns that can have a recursive structure, such as:

  • HTML / XML open/close tags
  • open/close braces {/} in programming languages
  • open/close parentheses in arithmetical expressions

To parse these types of patterns, we need something more powerful. We can move to the next level of formal grammars, called context-free grammars (CFGs).

Parsing mathematical expressions

Parsing the set of all mathematical expressions is beyond the power of a true regular expression. The reason is that these can contain arbitrarily deep nested pairs of parentheses.

For example, consider the expression: (2 + (3 * (7 - 4)))

Notice that the structure of the arithmetical expression is effectively a tree:

    +
   / \
  2   *
     / \
    3   -
       / \
      7   4

The tree structure generated as the result of running a CFG parser is called a parse tree.
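
As a rough illustration, the tree above could be represented in TypeScript along the following lines. This is only a sketch: the `ExprNode` type and the `num`, `op`, `show`, and `evaluate` helpers are names invented for this example and do not come from any particular parsing library.

```typescript
// A node is either a number leaf or a binary operator with two children.
type Operator = "+" | "-" | "*" | "/";
type ExprNode =
  | { kind: "num"; value: number }
  | { kind: "op"; op: Operator; left: ExprNode; right: ExprNode };

// Small helper constructors (hypothetical names, for readability only).
const num = (value: number): ExprNode => ({ kind: "num", value });
const op = (operator: Operator, left: ExprNode, right: ExprNode): ExprNode =>
  ({ kind: "op", op: operator, left, right });

// The tree for (2 + (3 * (7 - 4)))
const tree: ExprNode = op("+", num(2), op("*", num(3), op("-", num(7), num(4))));

// Turn the tree back into a fully parenthesized string.
function show(node: ExprNode): string {
  if (node.kind === "num") return String(node.value);
  return `(${show(node.left)} ${node.op} ${show(node.right)})`;
}

// Evaluate the tree bottom-up: children first, then the operator.
function evaluate(node: ExprNode): number {
  if (node.kind === "num") return node.value;
  const left = evaluate(node.left);
  const right = evaluate(node.right);
  switch (node.op) {
    case "+": return left + right;
    case "-": return left - right;
    case "*": return left * right;
    case "/": return left / right;
  }
}

console.log(show(tree), "=", evaluate(tree));
```

Once the nesting is captured as a tree, operations such as printing or evaluating become short recursive functions, which is the main reason parsers produce trees rather than flat lists of matches.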

Describing context-free grammars

There are two popular methods of expressing CFGs:

  1. Extended Backus-Naur Form (EBNF): describes a CFG in terms of production rules. These are rules that, when applied, can generate all possible legal phrases in the language.

  2. Parsing Expression Grammar (PEG): describes a CFG in terms of recognition rules. These are rules that can be used to match valid phrases in the language.

The PEG formalism has the advantage over EBNF that the mapping to a parser is unambiguous, and can be easily automated.

The following is a simple PEG, lifted from the Wikipedia page on PEGs, that describes mathematical formulas applying the four basic operations to non-negative integers.

Expr    ← Sum
Sum     ← Product (('+' / '-') Product)*
Product ← Value (('*' / '/') Value)*
Value   ← [0-9]+ / '(' Expr ')'

In plain English, we can read this as:

  • Expr is a Sum

  • Sum is a Product followed by zero or more sub-patterns that consist of a “+” or “-” followed by a Product

  • Product is a Value followed by zero or more sub-patterns that consist of a “*” or “/” followed by a Value

  • Value is either one or more members of the character set {0, ..., 9}, or it is an open parenthesis “(” followed by an Expr and a closing parenthesis “)”.

Parser generators versus parsing libraries

Assuming you aren’t the type of person who likes to reinvent the wheel (not that there is anything wrong with that), there are generally two options for creating a parser:

1. Use a parser generator — a tool that generates the source code for a parser from an abstract definition of the parser. Some popular examples in JavaScript include Jison, PEG.js, nearley, and ANTLR.

2. Use a parsing library — a library that allows the expression of the parse rules as an API. Some examples in JavaScript include Myna, Parsimmon, and Chevrotain.

My preference is to use parsing libraries, because they are easier to understand, debug, maintain, and customize.

Writing parsers in TypeScript / JavaScript using the Myna Parsing Library

Recently, a project I was working on (the Heron language) required a parsing library that could run in the browser. I found the complexity and overhead of existing libraries too great. Given I had previous experience in writing parsing libraries in C++ and C#, I decided to write a parser library called Myna using TypeScript.

Myna uses fluent syntax (method chaining) to make it easy to define a parser as a set of rules (sub-parsers) that resemble a PEG grammar.
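
To give a feel for what a fluent, PEG-like rule definition can look like, here is a small self-contained sketch. It is emphatically not Myna's API: the `Rule` class and the `then`, `or`, `zeroOrMore`, `oneOrMore`, `char`, and `delay` names are assumptions made up for this illustration, and a real library offers far more (error reporting, tree building, and so on). Each rule here is only a recognizer that returns the position after a match, or `null` on failure.

```typescript
// A matcher takes the input and a start position and returns the end
// position of the match, or null if the rule does not match there.
type Matcher = (input: string, pos: number) => number | null;

class Rule {
  match: Matcher;

  constructor(match: Matcher) {
    this.match = match;
  }

  // Sequence: this rule followed by another rule.
  then(other: Rule): Rule {
    return new Rule((s, i) => {
      const mid = this.match(s, i);
      return mid === null ? null : other.match(s, mid);
    });
  }

  // PEG ordered choice: try this rule first, then the alternative.
  or(other: Rule): Rule {
    return new Rule((s, i) => {
      const first = this.match(s, i);
      return first !== null ? first : other.match(s, i);
    });
  }

  // Greedy repetition; stops on failure or on an empty match.
  get zeroOrMore(): Rule {
    return new Rule((s, i) => {
      let pos = i;
      for (;;) {
        const next = this.match(s, pos);
        if (next === null || next === pos) return pos;
        pos = next;
      }
    });
  }

  get oneOrMore(): Rule {
    return this.then(this.zeroOrMore);
  }
}

// Matches any single character contained in `chars`.
const char = (chars: string): Rule =>
  new Rule((s, i) =>
    i < s.length && chars.indexOf(s.charAt(i)) !== -1 ? i + 1 : null);

// Defers rule lookup until match time, which permits recursive grammars.
const delay = (fn: () => Rule): Rule => new Rule((s, i) => fn().match(s, i));

// The arithmetic grammar, written with method chaining.
const value: Rule = char("0123456789").oneOrMore
  .or(char("(").then(delay(() => expr)).then(char(")")));
const product: Rule = value.then(char("*/").then(value).zeroOrMore);
const sum: Rule = product.then(char("+-").then(product).zeroOrMore);
const expr: Rule = sum;

// True only if the whole input is a valid expression.
const matches = (input: string): boolean => expr.match(input, 0) === input.length;

console.log(matches("(2+(3*(7-4)))"), matches("(2+"));
```

The grammar definitions at the bottom read much like the PEG rules themselves, which is the appeal of expressing a grammar through a library API rather than a separate grammar file.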

The following example is from the Myna GitHub repo:

From concrete syntax tree (CST) to abstract syntax tree (AST)

When a parser processes the input, each successfully matched rule (aka grammar production) can be mapped to a node in the parse tree. This literal mapping of production rules to nodes in a tree is a concrete syntax tree (CST).

In some cases, the CST is of limited use as it contains a lot of syntactic clutter, for example comments in the source code, or whether a string literal has double quotes or single quotes. It may contain results from rules that are created to make the grammar easier to use, but don’t represent the intended tree structure for analysis.

The simplest thing to do is to only create nodes in the output tree for specific rules and to skip other rules. This simplified version of the parse tree is called an abstract syntax tree (AST). There may be multiple passes performed on an AST to transform it into alternative AST representations, to simplify later processing steps.

In Myna, an AST is generated by creating nodes from rules labeled with the ast property. Technically, this property returns a new rule that has an internal property set that tells the parser to generate a parse node in the parse tree.
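
The idea of keeping nodes only for selected rules can be sketched independently of any library. The example below is an assumption-laden illustration rather than Myna's implementation: it prunes an already-built CST after the fact, whereas a library can skip unwanted nodes while parsing. The `TreeNode` interface, the `toAst` function, and the rule names are all hypothetical.

```typescript
interface TreeNode {
  rule: string;         // name of the grammar rule that matched
  text: string;         // the matched input text
  children: TreeNode[];
}

// Keeps only nodes whose rule name is listed in `keep`. Children of a
// skipped node are hoisted up to the nearest kept ancestor.
function toAst(node: TreeNode, keep: readonly string[]): TreeNode[] {
  const children: TreeNode[] = [];
  for (const child of node.children) {
    for (const kept of toAst(child, keep)) children.push(kept);
  }
  if (keep.indexOf(node.rule) === -1) return children;
  return [{ rule: node.rule, text: node.text, children }];
}

const leaf = (rule: string, text: string): TreeNode =>
  ({ rule, text, children: [] });

// A CST for the input "2+3": every matched rule has its own node.
const cst: TreeNode = {
  rule: "Expr", text: "2+3", children: [{
    rule: "Sum", text: "2+3", children: [
      { rule: "Product", text: "2", children: [leaf("Value", "2")] },
      leaf("Op", "+"),
      { rule: "Product", text: "3", children: [leaf("Value", "3")] },
    ],
  }],
};

// Keep only the rules that matter for later analysis.
const ast = toAst(cst, ["Sum", "Op", "Value"]);
console.log(JSON.stringify(ast, null, 2));
```

In this sketch the wrapper nodes `Expr` and `Product` vanish from the result, leaving a single `Sum` node whose children are the two values and the operator.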

Using the generated Myna abstract syntax tree

Here is an example of using a Myna-defined parser in Node.js to evaluate an arithmetical expression:

Final words

If you are interested in learning more about creating and using parsers, whether or not the Myna library meets your specific needs, I encourage you to take a bit of time to read through the source code of the Myna parsing library.

Myna was written in TypeScript (which has a familiar syntax for most programmers), is contained in a single file with no dependencies, and is less than 1200 lines including detailed documentation.

If you are interested in seeing Myna applied to a more complex scenario, take a look at the Chickadee programming language. It is implemented entirely in TypeScript and depends only on the Myna parsing library. Chickadee is a tiny programming language designed specifically to help people learn about techniques for implementing programming languages.

If you liked this article please let me know, and consider sharing it with your friends and colleagues.

Source: https://www.freecodecamp.org/news/beyond-regular-expressions-an-introduction-to-parsing-context-free-grammars-ee77bdab5a92/
