Regex 历史 / 规范 / 流派 | JavaScript 匹配 emoji

注:本文为几篇 regex 相关合辑。机翻,未校,未整理。


Regex History and How-To

Crystal Villanueva

Jan 14, 2021

A regular expression, also known as regex or regexp, is a special string that presents itself repeatedly in a search pattern; today, programmers use regex to create a regular pattern to look for within their code to capture, validate, find and replace and insert. Regex may seem intimidating, however with a basic breakdown of what regex is and how to apply it, it isn’t so terrifying after all. This post is just a beginners dip into the world of regex, not a comprehensive guide of what each symbol means.
正则表达式(也称为 regex 或 regexp)是一种特殊字符串,它在搜索模式中重复出现;如今,程序员使用 Regex 创建常规模式,以便在其代码中查找以捕获、验证、查找、替换和插入。正则表达式可能看起来令人生畏,但是对于正则表达式是什么以及如何应用它的基本分解,它毕竟并不那么可怕。这篇文章只是初学者对正则表达式世界的深入了解,而不是每个符号含义的全面指南。

History 历史

In the 1950’s a mathematician named Stephan Cole Kleene submitted a paper on automata theory and regular expressions. While the automata theory meant that the simplest machine has bounds or limits in its memory; “the memory size is independent of the input length”, his focus on regular expressions stemmed from McCulloch and Pitts’ neural calculus in their investigation of behavioral activation and activity of a neuron: with the computation of neuron activity in his biophysics focus, Kleene denoted that neuronal activation sequences “[bring a given net…] to a particular state after they have been completely processed, and discovered interesting regularities among them”(McCulloch and Pitts’ neural logical calculus. (n.d.)). Kleene distinguished these as regular events’, later called regular expressions’ in lexical analysis (Hopcroft, Motwani, and Ullman 2014). From his mathematical contributions, regular expressions were later implemented into machines (i.e., the SNOBOL programming language in the 60’s and the Unix editor systems in the late 60’s/early 70’s ) to compile and search for expressions occurring more than once. Regex is used through out search engines to find and replace input, editing, formatting, and the output of text.
在 1950 年代,一位名叫 Stephan Cole Kleene 的数学家提交了一篇关于自动机理论和正则表达式的论文。虽然自动机理论意味着最简单的机器在其内存中有边界或限制;“内存大小与输入长度无关”,他对正则表达式的关注源于 McCulloch 和 Pitts 在研究神经元的行为激活和活动时的神经演算:在他的生物物理学重点中,神经元活动的计算表明神经元激活序列“[带来给定的网络…]到特定状态,并在它们之间发现了有趣的规律“(McCulloch 和 Pitts 的神经逻辑演算。(日期不详))。Kleene 将这些区分为“常规事件”,后来在词汇分析中称为“常规表达式”(Hopcroft、Motwani 和 Ullman 2014)。根据他的数学贡献,正则表达式后来被实现到机器中(即 60 年代的 SNOBOL 编程语言和 60 年代末/70 年代初的 Unix 编辑器系统),以编译和搜索多次出现的表达式。正则表达式在整个搜索引擎中用于查找和替换输入、编辑、格式化和文本输出。

Javascript and Regex

Javascript 和 Regex

In Javascript, ‘.test()’, ‘.match()’, ‘.matchAll()’, ‘.replaceAll()’, ‘.search()’, ‘.split()’or ‘.replace()’ are all methods that a programmer could implement into their function body to search for a regular expression within their code. The method ‘.test()’ returns true if found, and false if it cannot find the pattern within the code. The method ‘.match()’ searches throughout the code to find it’s regex match. Lastly, ‘.replace()’ takes in two arguments: the regex that the method is searching for, and what it is trying to replace:
在 Javascript 中,‘.test()’、‘.match()’、‘.matchAll()’、‘.replaceAll()’、‘.search()’、‘.split()‘或 ‘.replace()’ 都是程序员可以在其函数体中实现的方法,以在其代码中搜索正则表达式。方法 ‘.test()’ 如果找到,则返回 true,如果在代码中找不到模式,则返回 false。方法 ‘.match()’ 在整个代码中搜索以查找其正则表达式匹配项。最后,’.replace()’ 接受两个参数:方法正在搜索的正则表达式,以及它试图替换的内容:

let command = "G()()()()(al)"
let interpret = function(command) {
 return command.replace(/\(\)/g, "o").replace(/\(al\)/g, "al")
};// output => "Gooooal"

Above, the method ‘.replace()’ is featured. The first argument of ‘.replace()’ is to find all instances of ‘()’ which is syntactically placed for regex form, and to replace it with ‘o’. Another ‘.replace()’ is chained onto the function body to find all instances of ‘(al)’ and replace it with ‘al’.
上面,方法 ‘.replace()’ 是特色。‘.replace()’ 的第一个参数是找到 ‘()’ 的所有实例,它在语法上是针对正则表达式形式的,并将其替换为 ‘o’。另一个 ‘.replace()’ 被链接到函数体上,以查找 ‘(al)’ 的所有实例并将其替换为 ‘al’。

What does the filler in the first part of the argument in the ‘.replace()’ mean? It can be confusing, but here is a short guide of slowing down and understanding this example of regex. The first argument in the first ‘.replace()’, ‘/\(\)/g’, is deciphered as the following:
‘.replace()’ 中参数第一部分的填充物是什么意思?这可能会令人困惑,但这里有一个简短的指南,可以放慢速度并理解这个正则表达式示例。第一个 ‘.replace()’ 中的第一个参数 ‘/()/g’ 被解读为:

在这里插入图片描述

The ‘/ /’ is standard format in regex; a pattern is enclosed in slashes. Most times you will see it in slashes, sometimes in the constructor format.
‘/ /’ 是正则表达式中的标准格式;模式用斜杠括起来。大多数时候,您会以斜杠形式看到它,有时是构造函数格式。

The ‘g’ flag in regex means ‘global’; the ‘g’ flag finds all possible matches throughout the string.
正则表达式中的“g”标志表示“全局”;‘g’ 标志查找整个字符串中所有可能的匹配项。

The backslash in the box escapes the special character; to find one of the parenthesis, you need the backslash.
框中的反斜杠转义特殊字符;要找到其中一个括号,您需要反斜杠。

The slash in the box escapes the special character; to find one of the parenthesis, you need the backslash.
框中的斜杠转义特殊字符;要找到其中一个括号,您需要反斜杠。

Put it all together and this is the result! Were searching for the ‘()’ globally throughout the code in the correct format.
把它们放在一起,这就是结果!在整个代码中以正确的格式全局搜索 ‘()’。

note: to escape or escaping a character means a character has special meaning *inside* a regular expression.
注: To escapeing or estruing a character 表示字符在正则表达式中具有特殊含义。

Congratulations, now you understand a simple regex! Regex is universal throughout all programming languages. No matter what language you are using, regex will be used to find, replace, split, match and test your code. While this example is rudimentary, hopefully it keens your interest into the world of regex.
恭喜,现在您了解了一个简单的正则表达式!Regex 在所有编程语言中都是通用的。无论您使用哪种语言,正则表达式都将用于查找、替换、拆分、匹配和测试您的代码。虽然这个例子很初级,但希望它能激发您对正则表达式世界的兴趣。

Resources:

Hopcroft, J. E., Motwani, R., & Ullman, J. D. (2014). Introduction to automata theory, languages, and computation. Harlow: Pearson Education.

Leung, H. (2010, September 16). Regular Languages and Finite Automata. Retrieved January 12, 2021, from https://www.cs.nmsu.edu/historical-projects/Projects/kleene.9.16.10.pdf

McCulloch and Pitts’ neural logical calculus. (n.d.). Retrieved January 12, 2021, from https://www.dlsi.ua.es/~mlf/nnafmc/pbook/node10.html


Introduction and History of Regular Expression

Gurkirat Singh

Posted on 2023,08,01

Updated on 2023,09,03

Hello World! Welcome to the “Regular Expressions” series, where we tackle the intimidating syntax that has spawned numerous memes among developers. Don’t worry, though! As we move forward, I assure you that you’ll gain the confidence to craft your own elegant regular expressions by the end of this journey.
世界您好!欢迎来到“正则表达式”系列,在这里,我们将讨论在开发人员中催生了大量模因的令人生畏的语法。不过别担心!随着我们向前发展,我向你保证,在这段旅程结束时,你将有信心制作你自己的优雅正则表达式。

In this series, I will utilise https://regex101.com to share patterns with you, avoiding the constraint of a specific programming language that could potentially create barriers for some learners.
在本系列中,我将利用 https://regex101.com 与您分享模式,避免特定编程语言的限制,这可能会给某些学习者带来障碍。

Note →→ If anywhere coding is required, I will be using C++.
注意 → → 如果需要编码,我将使用 C++。

What are Regular Expressions?

什么是正则表达式?

Regular expressions are a specific type of text pattern that are used while programming a logic.
正则表达式是编写逻辑时使用的一种特定类型的文本模式。

I couldn’t think of a single modern application that doesn’t make use of it, either directly or indirectly. If you went to a website and entered any gibberish text in the email field, you would have received an invalid email format message.
我想不出任何一个现代应用程序不直接或间接地使用它。如果您访问某个网站并在电子邮件字段中输入任何乱码文本,您将收到无效的电子邮件格式消息。

invalid email

Under the hood, your input text is being verified by the following regex pattern.

在后台,您的输入文本由以下正则表达式模式进行验证。

[a-z0-9.-]@[a-z0-9]{2,}\.[a-z]{2,}

You can see how this regex works here. This is a minimal version of matching simple email addresses. Please bear with me if this appears overwhelming. Such expressions can be created in less than 10 seconds.
你可以在这里看到这个正则表达式是如何工作的。这是匹配简单电子邮件地址的最小版本。如果这看起来让人不知所措,请多多包涵。可以在 10 秒内创建此类表达式。

Why this name after all?

为什么叫这个名字呢?

The name “regular expression” originates from the mathematical concept of regular languages, which were first studied by mathematicians in the field of formal language theory. Regular expressions are a way to describe and match patterns in strings of characters, hence the name “regular” expressions.
“正则表达式”这个名字源于规则语言的数学概念,最早是由形式语言理论领域的数学家研究的。正则表达式是一种描述和匹配字符串模式的方法,因此称为 “regular” 表达式。

Problem that Led Development of RegExp

导致 RegExp 开发的问题

It was created in the 1950s (far before most of us were born) to help with text processing tasks, such as searching and editing. Since then, regular expressions have become a standard feature of many programming languages.
它创建于 1950 年代(远在我们大多数人出生之前),用于帮助完成文本处理任务,例如搜索和编辑。从那时起,正则表达式已成为许多编程语言的标准功能。

In the early 1960s, Ken Thompson implemented regular expressions in the QED text editor. This was the first time that regular expressions were used in a practical application.
在 1960 年代初期,Ken Thompson 在 QED 文本编辑器中实现了正则表达式。这是正则表达式首次在实际应用中使用。

Why You Should Learn Regular Expressions?

为什么你应该学习正则表达式?

Learning regular expressions offers numerous benefits in various fields such as
学习正则表达式在各个领域都有许多好处,例如:

  • data analysis 数据分析
  • software development 软件开发
  • content management 内容管理
  • scientific research 科研

Their versatility allows you to define complex patterns for finding specific words or phrases, extracting data from structured text, and performing advanced search and replace operations.
它们的多功能性允许您定义复杂的模式,用于查找特定单词或短语、从结构化文本中提取数据以及执行高级搜索和替换操作。

Additionally, mastering regular expressions helps prevent costly mistakes by providing a precise and controlled approach to text processing. With a solid understanding of regular expressions, you can confidently handle challenging text manipulation tasks, ensuring accuracy, reliability, and improved productivity.
此外,掌握正则表达式通过提供精确且受控的文本处理方法来帮助防止代价高昂的错误。凭借对正则表达式的深入理解,您可以自信地处理具有挑战性的文本操作任务,确保准确性、可靠性和提高生产力。

Since you’re here, I’m assuming you’re ready to understand and use those unwieldy strings of brackets and question marks in your code.
既然你在这里,我假设你已经准备好在代码中理解和使用这些笨拙的括号和问号字符串。

Flavours of RegExp

RegExp 的风格

There is no established standard that specifies which text patterns are and are not regular expressions. There are numerous languages on the market whose creators have various ideas about how regular expressions should look. So we’re now stuck with a whole spectrum of regular expression flavours (implementation of RegExp in the programming language).
没有确定的标准来指定哪些文本模式是正则表达式,哪些不是正则表达式。市场上有许多语言,其创建者对正则表达式的外观有各种想法。因此,我们现在被一整套正则表达式风格所困(在编程语言中实现 RegExp)。

But why reinvent the wheel? Instead, every modern regular expression engine may be traced back to the Perl programming language.
但为什么要重新发明轮子呢?相反,每个现代正则表达式引擎都可以追溯到 Perl 编程语言。

Regular expression flavours are commonly integrated into scripting languages, while other programming languages rely on dedicated libraries for regex support. JavaScript offers built-in support for regular expressions using the syntax /expr/ or the RegExp object. On the other hand, Python implements regular expressions through its standard library re.
正则表达式风格通常集成到脚本语言中,而其他编程语言则依赖于专用库来支持正则表达式。JavaScript 为使用语法 /expr/RegExp 对象的正则表达式提供内置支持。另一方面,Python 通过其标准库 re 实现正则表达式。

References

Automata theory

Automata theory is the study of abstract machines and automata, as well as the computational problems that can be solved using them. It is a theory in theoretical computer science with close connections to mathematical logic. The word automata comes from the Greek word αὐτόματος, which means “self-acting, self-willed, self-moving”. An automaton is an abstract self-propelled computing device which follows a predetermined sequence of operations automatically. An automaton with a finite number of states is called a finite automaton (FA) or finite-state machine (FSM). The figure on the right illustrates a finite-state machine, which is a well-known type of automaton. This automaton consists of states and transitions. As the automaton sees a symbol of input, it makes a transition to another state, according to its transition function, which takes the previous state and current input symbol as its arguments.
自动机理论是对抽象机器和自动机的研究,以及可以使用它们解决的计算问题。它是理论计算机科学中的一种理论,与数理逻辑密切相关。自动机一词来自希腊语 αὐτόματος,意思是“自我行动、任性、自我移动”。自动机是一种抽象的自走式计算设备,它会自动遵循预定的操作顺序。具有有限状态数的自动机称为有限自动机 (FA) 或有限状态机 (FSM)。下图说明了有限状态机,这是一种众所周知的自动机类型。此自动机由状态和过渡组成。当自动机看到 input 符号时,它会根据其 transition 函数过渡到另一个状态,该函数将前一个状态和当前 input 符号作为其参数。

在这里插入图片描述

Regular expression

A regular expression, sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.
正则表达式(有时称为有理表达式)是指定文本中的匹配模式的字符序列。通常,字符串搜索算法使用此类模式对字符串执行“查找”或“查找和替换”操作,或用于输入验证。正则表达式技术是在理论计算机科学和形式语言理论中发展起来的。

Formal language

In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules called a formal grammar.
在逻辑学、数学、计算机科学和语言学中,正式语言由单词组成,这些单词的字母取自字母表,并根据一组称为正式语法的特定规则格式正确。


Regexes Got Good: The History And Future Of Regular Expressions In JavaScript

正则表达式变得好起来:JavaScript 中正则表达式的历史和未来

Steven Levithan

Aug 20, 2024

Although JavaScript regexes used to be underpowered compared to other modern flavors, numerous improvements in recent years mean that’s no longer true. Steven Levithan evaluates the history and present state of regular expressions in JavaScript with tips to make your regexes more readable, maintainable, and resilient.

尽管与其他现代风格相比,JavaScript 正则表达式曾经功能不足,但近年来的大量改进意味着情况不再如此。Steven Levithan 评估了 JavaScript 中正则表达式的历史和现状,并提供了一些技巧,以使您的正则表达式更具可读性、可维护性和弹性。

Modern JavaScript regular expressions have come a long way compared to what you might be familiar with. Regexes can be an amazing tool for searching and replacing text, but they have a longstanding reputation (perhaps outdated, as I’ll show) for being difficult to write and understand.
与你可能熟悉的相比,现代 JavaScript 正则表达式已经取得了长足的进步。正则表达式可以成为搜索和替换文本的出色工具,但它们因难以编写和理解而长期以来的声誉(可能已经过时了,我将展示)。

This is especially true in JavaScript-land, where regexes languished for many years, comparatively underpowered compared to their more modern counterparts in PCRE, Perl, .NET, Java, Ruby, C++, and Python. Those days are over.
在 JavaScript 领域尤其如此,正则表达式多年来一直萎靡不振,与 PCRE、Perl、.NET、Java、Ruby、C++ 和 Python 中更现代的对应项相比,其功能相对不足。那些日子已经过去了。

In this article, I’ll recount the history of improvements to JavaScript regexes (spoiler: ES2018 and ES2024 changed the game), show examples of modern regex features in action, introduce you to a lightweight JavaScript library that makes JavaScript stand alongside or surpass other modern regex flavors, and end with a preview of active proposals that will continue to improve regexes in future versions of JavaScript (with some of them already working in your browser today).
在本文中,我将回顾 JavaScript 正则表达式改进的历史(剧透:ES2018 和 ES2024 改变了游戏规则),展示现代正则表达式功能的实际示例,向您介绍一个轻量级的 JavaScript 库,它使 JavaScript 与其他现代正则表达式风格并驾齐驱或超越其他现代正则表达式风格,最后预览将继续改进未来 JavaScript 版本中的正则表达式的活动提案(其中一些已经在您的浏览器中运行)。

The History of Regular Expressions in JavaScript

JavaScript 中正则表达式的历史

ECMAScript 3, standardized in 1999, introduced Perl-inspired regular expressions to the JavaScript language. Although it got enough things right to make regexes pretty useful (and mostly compatible with other Perl-inspired flavors), there were some big omissions, even then. And while JavaScript waited 10 years for its next standardized version with ES5, other programming languages and regex implementations added useful new features that made their regexes more powerful and readable.
ECMAScript 3 于 1999 年标准化,将受 Perl 启发的正则表达式引入 JavaScript 语言。尽管它做对了足够多的事情使正则表达式非常有用(并且主要与其他受 Perl 启发的风格兼容),但即使在那时,也有一些很大的遗漏。虽然 JavaScript 等待了 10 年才推出 ES5 的下一个标准化版本,但其他编程语言和正则表达式实现添加了有用的新功能,使它们的正则表达式更加强大和可读。

But that was then. 但那是当时的事了。

Did you know that nearly every new version of JavaScript has made at least minor improvements to regular expressions?
您知道吗,几乎每个新版本的 JavaScript 都至少对正则表达式进行了微小的改进?

Let’s take a look at them.
让我们来看看它们。

Don’t worry if it’s hard to understand what some of the following features mean — we’ll look more closely at several of the key features afterward.
如果难以理解以下某些功能的含义,请不要担心 — 我们稍后将更仔细地研究几个关键功能。

  • ES5 (2009) fixed unintuitive behavior by creating a new object every time regex literals are evaluated and allowed regex literals to use unescaped forward slashes within character classes (/[/]/).
    ES5 (2009) 修复了不直观的行为,方法是在每次计算正则表达式文字时创建一个新对象,并允许正则表达式文字在字符类 (/[/]/) 中使用未转义的正斜杠。
  • ES6/ES2015 added two new regex flags: y (sticky), which made it easier to use regexes in parsers, and u (unicode), which added several significant Unicode-related improvements along with strict errors. It also added the RegExp.prototype.flags getter, support for subclassing RegExp, and the ability to copy a regex while changing its flags.
    ES6/ES2015 添加了两个新的正则表达式标志:ysticky),这使得在解析器中使用正则表达式变得更加容易,以及 uunicode),它添加了一些与 Unicode 相关的重要改进以及严格的错误。它还添加了 RegExp.prototype.flags getter,对子类化 RegExp 的支持,以及在更改其标志时复制正则表达式的能力。
  • ES2018 was the edition that finally made JavaScript regexes pretty good. It added the s (dotAll) flag, lookbehind, named capture, and Unicode properties (via \p{...} and \P{...}, which require ES6’s flag u). All of these are extremely useful features, as we’ll see.
    ES2018 是最终使 JavaScript 正则表达式变得相当好的版本。它添加了 sdotAll) 标志、后视、命名 capture 和 Unicode 属性(通过 \p{...}\P{...},它们需要 ES6 的标志 u)。正如我们将看到的,所有这些都是非常有用的功能。
  • ES2020 added the string method matchAll, which we’ll also see more of shortly.
    ES2020 添加了字符串方法 matchAll,我们很快也会看到更多。
  • ES2022 added flag d (hasIndices), which provides start and end indices for matched substrings.
    ES2022 添加了标志 dhasIndices),它为匹配的子字符串提供开始和结束索引。
  • And finally, ES2024 added flag v (unicodeSets) as an upgrade to ES6’s flag u. The v flag adds a set of multicharacter “properties of strings” to \p{...}, multicharacter elements within character classes via \p{...} and \q{...}, nested character classes, set subtraction [A--B] and intersection [A&&B], and different escaping rules within character classes. It also fixed case-insensitive matching for Unicode properties within negated sets [^...].
    最后,ES2024 添加了标志 vunicodeSets) 作为 ES6 标志 u 的升级。v 标志向 \p{...} 添加了一组多字符“字符串属性”,通过 \p{...}\q{...} 在字符类中添加多字符元素,嵌套字符类,集合减法 [A--B] 和交集 [A&&B],以及字符类中的不同转义规则。它还修复了否定集 [^...] 中 Unicode 属性的不区分大小写的匹配。

As for whether you can safely use these features in your code today, the answer is yes! The latest of these features, flag v, is supported in Node.js 20 and 2023-era browsers. The rest are supported in 2021-era browsers or earlier.
至于您现在是否可以在代码中安全地使用这些功能,答案是肯定的!这些功能中的最新功能 flag v 在 Node.js 20 和 2023 时代的浏览器中受支持。其余的在 2021 年版或更早版本中受支持。

Each edition from ES2019 to ES2023 also added additional Unicode properties that can be used via \p{...} and \P{...}. And to be a completionist, ES2021 added string method replaceAll — although, when given a regex, the only difference from ES3’s replace is that it throws if not using flag g.
从 ES2019 到 ES2023 的每个版本还添加了额外的 Unicode 属性,可以通过 \p{...}\P{...} 使用。为了成为完成主义者,ES2021 添加了字符串方法 replaceAll ——尽管,当给定正则表达式时,与 ES3 的 replace 的唯一区别是,如果不使用标志 g ,它会抛出。

Aside: What Makes a Regex Flavor Good?

题外话:是什么让 REGEX 风格好?

With all of these changes, how do JavaScript regular expressions now stack up against other flavors? There are multiple ways to think about this, but here are a few key aspects:
有了所有这些变化,JavaScript 正则表达式现在如何与其他风格相提并论呢?有多种方法可以考虑这个问题,但以下是几个关键方面:

  • Performance.
    This is an important aspect but probably not the main one since mature regex implementations are generally pretty fast. JavaScript is strong on regex performance (at least considering V8’s Irregexp engine, used by Node.js, Chromium-based browsers, and even Firefox; and JavaScriptCore, used by Safari), but it uses a backtracking engine that is missing any syntax for backtracking control — a major limitation that makes ReDoS vulnerability more common.
    性能。 这是一个重要的方面,但可能不是主要方面,因为成熟的 regex 实现通常非常快。JavaScript 在正则表达式性能方面很强(至少考虑到 V8 的 Irregexp 引擎,Node.js、基于 Chromium 的浏览器甚至 Firefox;以及 Safari 使用的 JavaScriptCore),但它使用的回溯引擎缺少任何回溯控制语法——这是使 ReDoS 漏洞更常见的主要限制。
  • Support for advanced features that handle common or important use cases.
    Here, JavaScript stepped up its game with ES2018 and ES2024. JavaScript is now best in class for some features like lookbehind (with its infinite-length support) and Unicode properties (with multicharacter “properties of strings,” set subtraction and intersection, and script extensions). These features are either not supported or not as robust in many other flavors.
    支持处理常见或重要使用案例的高级功能。 在这里,JavaScript 通过 ES2018 和 ES2024 加强了它的游戏。JavaScript 现在是某些功能的最佳版本,例如后瞻(具有无限长度支持)和 Unicode 属性(具有多字符“字符串属性”、集合减法和交集以及脚本扩展)。这些功能要么不受支持,要么在许多其他风格中不那么健壮。
  • Ability to write readable and maintainable patterns.
    Here, native JavaScript has long been the worst of the major flavors since it lacks the x (“extended”) flag that allows insignificant whitespace and comments. Additionally, it lacks regex subroutines and subroutine definition groups (from PCRE and Perl), a powerful set of features that enable writing grammatical regexes that build up complex patterns via composition.
    能够编写可读和可维护的模式。在这里,原生 JavaScript 长期以来一直是最糟糕的主要风格,因为它缺少 x (“extended”) 标志,该标志允许无关紧要的空格和注释。此外,它还缺少正则表达式子例程和子例程定义组(来自 PCRE 和 Perl),这是一组强大的功能,可以编写语法正则表达式,通过组合构建复杂的模式。

So, it’s a bit of a mixed bag.
所以,这有点好坏参半。

The good news is that all of these holes can be filled by a JavaScript library, which we’ll see later in this article.
好消息是,所有这些漏洞都可以由 JavaScript 库填补,我们将在本文后面看到。

Using JavaScript’s Modern Regex Features

使用 JavaScript 的现代正则表达式功能

Let’s look at a few of the more useful modern regex features that you might be less familiar with. You should know in advance that this is a moderately advanced guide. If you’re relatively new to regex, here are some excellent tutorials you might want to start with:
让我们看看一些您可能不太熟悉的更有用的现代正则表达式功能。您应该提前知道这是一个中等高级的指南。如果您对正则表达式相对较新,这里有一些优秀的教程,您可能希望从以下开始:

  • RegexLearn and RegexOne are interactive tutorials that include practice problems.
    RegexLearn 和 RegexOne 是包含练习题的交互式教程。
  • JavaScript.info’s regular expressions chapter is a detailed and JavaScript-specific guide.
    JavaScript.info 的正则表达式章节是一份详细的 JavaScript 特定指南。
  • Demystifying Regular Expressions (video) is an excellent presentation for beginners by Lea Verou at HolyJS 2017.
    揭开正则表达式的神秘面纱(视频)是 Lea Verou 在 HolyJS 2017 上为初学者提供的精彩演示。
  • Learn Regular Expressions In 20 Minutes (video) is a live syntax walkthrough in a regex tester.
    在 20 分钟内学习正则表达式(视频)是正则表达式测试器中的实时语法演练。

Named Capture

命名捕获

Often, you want to do more than just check whether a regex matches — you want to extract substrings from the match and do something with them in your code. Named capturing groups allow you to do this in a way that makes your regexes and code more readable and self-documenting.
通常,你想要做的不仅仅是检查正则表达式是否匹配 - 你想要从匹配中提取子字符串并在代码中对它们执行一些操作。命名捕获组允许您以一种使正则表达式和代码更具可读性和自文档性的方式执行此操作。

The following example matches a record with two date fields and captures the values:
以下示例匹配具有两个日期字段的记录并捕获值:

const record = 'Admitted: 2024-01-01\nReleased: 2024-01-03';
const re = /^Admitted: (?<admitted>\d{4}-\d{2}-\d{2})\nReleased: (?<released>\d{4}-\d{2}-\d{2})$/;
const match = record.match(re);
console.log(match.groups);
/* → {
 admitted: '2024-01-01',
 released: '2024-01-03'
} */

Don’t worry — although this regex might be challenging to understand, later, we’ll look at a way to make it much more readable. The key things here are that named capturing groups use the syntax (?<name>...), and their results are stored on the groups object of matches.
别担心 — 尽管这个正则表达式可能很难理解,但稍后,我们将寻找一种方法来使其更具可读性。这里的关键是命名捕获组使用语法 (?<name>...),并且它们的结果存储在匹配的 groups 对象上。

You can also use named backreferences to rematch whatever a named capturing group matched via \k<name>, and you can use the values within search and replace as follows:
您还可以使用命名反向引用来重新匹配通过 \k<name> 匹配的任何命名捕获组,并且可以在 search 和 replace 中使用值,如下所示:

// Change 'FirstName LastName' to 'LastName, FirstName'
const name = 'Shaquille Oatmeal';
name.replace(/(?<first>\w+) (?<last>\w+)/, '$<last>, $<first>');
// → 'Oatmeal, Shaquille'

For advanced regexers who want to use named backreferences within a replacement callback function, the groups object is provided as the last argument. Here’s a fancy example:
对于想要在替换回调函数中使用命名反向引用的高级 regexers,groups 对象作为最后一个参数提供。这里有一个花哨的例子:

function fahrenheitToCelsius(str) {
 const re = /(?<degrees>-?\d+(\.\d+)?)F\b/g;
 return str.replace(re, (...args) => {
  const groups = args.at(-1);
  return Math.round((groups.degrees - 32) * 5/9) + 'C';
 });
}
fahrenheitToCelsius('98.6F');
// → '37C'
fahrenheitToCelsius('May 9 high is 40F and low is 21F');
// → 'May 9 high is 4C and low is -6C'

Lookbehind

回顾

Lookbehind (introduced in ES2018) is the complement to lookahead, which has always been supported by JavaScript regexes. Lookahead and lookbehind are assertions (similar to ^ for the start of a string or \b for word boundaries) that don’t consume any characters as part of the match. Lookbehinds succeed or fail based on whether their subpattern can be found immediately before the current match position.
Lookbehind(在 ES2018 中引入)是 lookahead 的补充,JavaScript 正则表达式一直支持它。Lookahead 和 lookbehind 是断言(类似于字符串开头的 ^ 或单词边界的 \b),它们在匹配过程中不使用任何字符。后视是否成功或失败取决于是否可以在当前匹配位置之前立即找到其子模式。

For example, the following regex uses a lookbehind (?<=...) to match the word “cat” (only the word “cat”) if it’s preceded by “fat ”:
例如,如果单词前面有 “fat ”,则以下正则表达式使用后视 (?<=...) 来匹配单词 “cat” (仅单词 “cat”) :

const re = /(?<=fat )cat/g;
'cat, fat cat, brat cat'.replace(re, 'pigeon');
// → 'cat, fat pigeon, brat cat'

You can also use negative lookbehind — written as (?<!...) — to invert the assertion. That would make the regex match any instance of “cat” that’s not preceded by “fat ”.
你也可以使用否定 lookbehind — 写成 (?<!...) — 来反转断言。这将使正则表达式匹配任何前面没有 “fat ” 的 “cat” 实例。

const re = /(?<!fat )cat/g;
'cat, fat cat, brat cat'.replace(re, 'pigeon');
// → 'pigeon, fat cat, brat pigeon'

JavaScript’s implementation of lookbehind is one of the very best (matched only by .NET). Whereas other regex flavors have inconsistent and complex rules for when and whether they allow variable-length patterns inside lookbehind, JavaScript allows you to look behind for any subpattern.
JavaScript 的后视实现是最好的实现之一(仅与 .NET 相媲美)。其他正则表达式风格对于何时以及是否允许在 lookbehind 中使用可变长度模式具有不一致且复杂的规则,而 JavaScript 允许你向后查看任何子模式。

The matchAll Method

matchAll 方法

JavaScript’s String.prototype.matchAll was added in ES2020 and makes it easier to operate on regex matches in a loop when you need extended match details. Although other solutions were possible before, matchAll is often easier, and it avoids gotchas, such as the need to guard against infinite loops when looping over the results of regexes that might return zero-length matches.
JavaScript 的 String.prototype.matchAll 是在 ES2020 中添加的,当你需要扩展匹配细节时,可以更轻松地在循环中对正则表达式匹配进行操作。尽管以前还有其他解决方案,但 matchAll 通常更容易,并且它避免了陷阱,例如在循环可能返回零长度匹配的正则表达式结果时需要防止无限循环。

Since matchAll returns an iterator (rather than an array), it’s easy to use it in a for...of loop.
由于 matchAll 返回一个迭代器(而不是数组),因此很容易在 for...of 循环中使用它。

const re = /(?<char1>\w)(?<char2>\w)/g;
for (const match of str.matchAll(re)) {
 const {char1, char2} = match.groups;
 // Print each complete match and matched subpatterns
 console.log(`Matched "${match[0]}" with "${char1}" and "${char2}"`);
}

Note: matchAll requires its regexes to use flag g (global). Also, as with other iterators, you can get all of its results as an array using Array.from or array spreading.
注意:matchAll 要求其正则表达式使用标志 g(全局)。此外,与其他迭代器一样,您可以使用 Array.from 或 array spread 将其所有结果作为数组获取。

const matches = [...str.matchAll(/./g)];

Unicode Properties

UNICODE 属性

Unicode properties (added in ES2018) give you powerful control over multilingual text, using the syntax \p{...} and its negated version \P{...}. There are hundreds of different properties you can match, which cover a wide variety of Unicode categories, scripts, script extensions, and binary properties.
Unicode 属性(在 ES2018 中添加)使用语法 \p{...} 及其否定版本 \P{...} 为您提供对多语言文本的强大控制。您可以匹配数百种不同的属性,这些属性涵盖各种 Unicode 类别、脚本、脚本扩展名和二进制属性。

Note: For more details, check out the documentation on MDN.
注意:有关更多详细信息,请查看 MDN 上的文档。

Unicode properties require using the flag u (unicode) or v (unicodeSets).
Unicode 属性需要使用标志 uunicode) 或 vunicodeSets)。

Flag v

标志 v

Flag v (unicodeSets) was added in ES2024 and is an upgrade to flag u — you can’t use both at the same time. It’s a best practice to always use one of these flags to avoid silently introducing bugs via the default Unicode-unaware mode. The decision on which to use is fairly straightforward. If you’re okay with only supporting environments with flag v (Node.js 20 and 2023-era browsers), then use flag v; otherwise, use flag u.
标志 vunicodeSets) 是在 ES2024 中添加的,是标志 u 的升级 — 您不能同时使用两者。最佳做法是始终使用这些标志之一,以避免通过默认的 Unicode uncode-unaware 模式静默引入错误。决定使用哪个相当简单。如果您只支持带有标志 v 的环境(Node.js 20 和 2023 时代的浏览器),请使用标志 v;否则,使用标志 u

Flag v adds support for several new regex features, with the coolest probably being set subtraction and intersection. This allows using A--B (within character classes) to match strings in A but not in B or using A&&B to match strings in both A and B. For example:
标志 v 添加了对几个新正则表达式功能的支持,其中最酷的可能是设置减法和交集。这允许使用 A--B (在字符类中) 匹配 A 中的字符串但不匹配 B 中的字符串,或使用 A&&B 匹配 A 和 B 中的字符串。例如:

// Matches all Greek symbols except the letter 'π'
/[\p{Script_Extensions=Greek}--π]/v

// Matches only Greek letters
/[\p{Script_Extensions=Greek}&&\p{Letter}]/v

For more details about flag v, including its other new features, check out this explainer from the Google Chrome team.
有关标记 v 的更多详细信息,包括其其他新功能,请查看 Google Chrome 团队的此解释器。

A Word on Matching Emoji

关于匹配 emoji 的一句话

Emoji are 🤩🔥😎👌, but how emoji get encoded in text is complicated. If you’re trying to match them with a regex, it’s important to be aware that a single emoji can be composed of one or many individual Unicode code points. Many people (and libraries!) who roll their own emoji regexes miss this point (or implement it poorly) and end up with bugs.
表情符号是 🤩🔥😎👌 ,但表情符号如何在文本中编码很复杂。如果您尝试将它们与正则表达式匹配,请务必注意单个表情符号可以由一个或多个单独的 Unicode 码位组成。许多使用自己的表情符号正则表达式的人(和库!)没有注意到这一点(或者实现得很糟糕),最终导致 bug。

The following details for the emoji “👩🏻‍🏫” (Woman Teacher: Light Skin Tone) show just how complicated emoji can be:
表情符号 “ 👩🏻 🏫 ”(女教师:浅肤色)的以下细节显示了表情符号的复杂程度:

// Code unit length
'👩🏻‍🏫'.length;
// → 7
// Each astral code point (above \uFFFF) is divided into high and low surrogates

// Code point length
[...'👩🏻‍🏫'].length;
// → 4
// These four code points are: \u{1F469} \u{1F3FB} \u{200D} \u{1F3EB}
// \u{1F469} combined with \u{1F3FB} is '👩🏻'
// \u{200D} is a Zero-Width Joiner
// \u{1F3EB} is '🏫'

// Grapheme cluster length (user-perceived characters)
[...new Intl.Segmenter().segment('👩🏻‍🏫')].length;
// → 1

Fortunately, JavaScript added an easy way to match any individual, complete emoji via \p{RGI_Emoji}. Since this is a fancy “property of strings” that can match more than one code point at a time, it requires ES2024’s flag v.
幸运的是,JavaScript 添加了一种简单的方法,可以通过 \p{RGI_Emoji} 匹配任何单独的完整表情符号。由于这是一个花哨的“字符串属性”,可以一次匹配多个代码点,因此它需要 ES2024 的标志 v

If you want to match emojis in environments without v support, check out the excellent libraries emoji-regex and emoji-regex-xs.
如果你想在没有 v 支持的环境中匹配表情符号,请查看优秀的库 emoji-regex 和 emoji-regex-xs。

Making Your Regexes More Readable, Maintainable, and Resilient

使您的正则表达式更具可读性、可维护性和弹性

Despite the improvements to regex features over the years, native JavaScript regexes of sufficient complexity can still be outrageously hard to read and maintain.
尽管多年来对正则表达式功能进行了改进,但具有足够复杂性的原生 JavaScript 正则表达式仍然难以阅读和维护。

Regular Expressions are SO EASY!!!! pic.twitter.com/q4GSpbJRbZ
正则表达式非常简单!!!pic.twitter.com/q4GSpbJRbZ
— Garabato Kid (@garabatokid) July 5, 2019

ES2018’s named capture was a great addition that made regexes more self-documenting, and ES6’s String.raw tag allows you to avoid escaping all your backslashes when using the RegExp constructor. But for the most part, that’s it in terms of readability.
ES2018 的命名捕获是一个很好的补充,它使正则表达式更加自文档化,ES6 的 String.raw 标签允许您在使用 RegExp 构造函数时避免转义所有反斜杠。但在大多数情况下,就可读性而言,这就是它。

However, there’s a lightweight and high-performance JavaScript library named regex (by yours truly) that makes regexes dramatically more readable. It does this by adding key missing features from Perl-Compatible Regular Expressions (PCRE) and outputting native JavaScript regexes. You can also use it as a Babel plugin, which means that regex calls are transpiled at build time, so you get a better developer experience without users paying any runtime cost.
但是,有一个名为 regex 的轻量级高性能 JavaScript 库(确实是你的),它使正则表达式的可读性大大提高。它通过添加 Perl-Compatible Regular Expressions (PCRE) 中缺少的关键功能并输出本机 JavaScript 正则表达式来实现这一点。你也可以将它用作 Babel 插件,这意味着 regex 调用在构建时被转译,因此你可以获得更好的开发人员体验,而无需用户支付任何运行时成本。

PCRE is a popular C library used by PHP for its regex support, and it’s available in countless other programming languages and tools.
PCRE 是 PHP 用于其正则表达式支持的常用 C 库,它可用于无数其他编程语言和工具。

Let’s briefly look at some of the ways the regex library, which provides a template tag named regex, can help you write complex regexes that are actually understandable and maintainable by mortals. Note that all of the new syntax described below works identically in PCRE.
让我们简要地看一下 regex 库(提供名为 regex 的模板标签)可以帮助您编写复杂的正则表达式,这些正则表达式实际上是普通人可以理解和可维护的。请注意,下面描述的所有新语法在 PCRE 中的工作方式相同。

Insignificant Whitespace and Comments

无关紧要的空格和注释

By default, regex allows you to freely add whitespace and line comments (starting with #) to your regexes for readability.
默认情况下,regex 允许您自由地向正则表达式添加空格和行注释(以 # 开头)以提高可读性。

import {regex} from 'regex';
const date = regex`
 
Match a date in YYYY-MM-DD format
 (?<year> \d{4}) -
Year part
 (?<month> \d{2}) -
Month part
 (?<day>  \d{2}) 
Day part
`;

This is equivalent to using PCRE’s xx flag.
这相当于使用 PCRE 的 xx 标志。

Subroutines and Subroutine Definition Groups

子例程和子例程定义组

Subroutines are written as \g<name> (where name refers to a named group), and they treat the referenced group as an independent subpattern that they try to match at the current position. This enables subpattern composition and reuse, which improves readability and maintainability.
子例程写为 \g<name> (其中 name 引用命名组),它们将引用的组视为它们尝试在当前位置匹配的独立子模式。这支持子模式组合和重用,从而提高可读性和可维护性。

For example, the following regex matches an IPv4 address such as “192.168.12.123”:
例如,以下正则表达式匹配 IPv4 地址,例如“192.168.12.123”:

import {regex} from 'regex';
const ipv4 = regex`\b
 (?<byte> 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d)
 
Match the remaining 3 dot-separated bytes
 (\. \g<byte>){3}
\b`;

You can take this even further by defining subpatterns for use by reference only via subroutine definition groups. Here’s an example that improves the regex for admittance records that we saw earlier in this article:
您可以通过子例程定义组定义仅供引用使用的子模式来更进一步。下面是一个示例,该示例改进了我们在本文前面看到的准入记录的正则表达式:

const record = 'Admitted: 2024-01-01\nReleased: 2024-01-03';
const re = regex`
 ^ Admitted:\ (?<admitted> \g<date>) \n
  Released:\ (?<released> \g<date>) $

 (?(DEFINE)
  (?<date> \g<year>-\g<month>-\g<day>)
  (?<year> \d{4})
  (?<month> \d{2})
  (?<day>  \d{2})
 )
`;
const match = record.match(re);
console.log(match.groups);
/* → {
 admitted: '2024-01-01',
 released: '2024-01-03'
} */

A Modern Regex Baseline

现代正则表达式基线

regex includes the v flag by default, so you never forget to turn it on. And in environments without native v, it automatically switches to flag u while applying v’s escaping rules, so your regexes are forward and backward-compatible.
regex 默认包含 v 标志,因此您永远不会忘记打开它。在没有原生 v 的环境中,它会在应用 v 的转义规则时自动切换到标志 u,因此你的正则表达式是向前和向后兼容的。

It also implicitly enables the emulated flags x (insignificant whitespace and comments) and n (“named capture only” mode) by default, so you don’t have to continually opt into their superior modes. And since it’s a raw string template tag, you don’t have to escape your backslashes \\\\ like with the RegExp constructor.
默认情况下,它还隐式启用模拟标志 x (无关紧要的空格和注释) 和 n (“仅命名捕获” 模式),因此您不必不断选择使用它们的高级模式。而且由于它是一个原始字符串模板标签,你不必像使用 RegExp 构造函数那样转义反斜杠 \\\\

Atomic Groups and Possessive Quantifiers Can Prevent Catastrophic Backtracking

原子群和所有格量词可以防止灾难性的回溯

Atomic groups and possessive quantifiers are another powerful set of features added by the regex library. Although they’re primarily about performance and resilience against catastrophic backtracking (also known as ReDoS or “regular expression denial of service,” a serious issue where certain regexes can take forever when searching particular, not-quite-matching strings), they can also help with readability by allowing you to write simpler patterns.
原子群和所有格量词是 regex 库添加的另一组强大的功能。虽然它们主要关乎性能和抵御灾难性回溯(也称为 ReDoS 或“正则表达式拒绝服务”,这是一个严重的问题,在搜索特定的、不完全匹配的字符串时,某些正则表达式可能需要很长时间),但它们还可以通过允许您编写更简单的模式来帮助提高可读性。

Note: You can learn more in the regex documentation.
注意:您可以在 regex 文档中了解更多信息。

What’s Next? Upcoming JavaScript Regex Improvements

下一步是什么?即将推出的 JavaScript 正则表达式改进

There are a variety of active proposals for improving regexes in JavaScript. Below, we’ll look at the three that are well on their way to being included in future editions of the language.
有各种改进 JavaScript 中正则表达式的积极建议。下面,我们将看看这三个正在顺利包含在该语言的未来版本中。

Duplicate Named Capturing Groups

复制命名捕获组

This is a Stage 3 (nearly finalized) proposal. Even better is that, as of recently, it works in all major browsers.
这是一个第 3 阶段(即将完成)的提案。更好的是,截至最近,它可以在所有主要浏览器中运行。

When named capturing was first introduced, it required that all (?<name>...) captures use unique names. However, there are cases when you have multiple alternate paths through a regex, and it would simplify your code to reuse the same group names in each alternative.
首次引入命名捕获时,它要求所有 (?<name>...) 捕获都使用唯一名称。但是,在某些情况下,您有多个通过正则表达式的备用路径,这将简化您的代码,以便在每个替代路径中重复使用相同的组名称。

For example: 例如:

/(?<year>\d{4})-\d\d|\d\d-(?<year>\d{4})/

This proposal enables exactly this, preventing a “duplicate capture group name” error with this example. Note that names must still be unique within each alternative path.
此提案正好实现了这一点,从而防止此示例中出现 “duplicate capture group name” 错误。请注意,名称在每个替代路径中仍必须是唯一的。

Pattern Modifiers (aka Flag Groups)

模式修饰符(又名标志组)

This is another Stage 3 proposal. It’s already supported in Chrome/Edge 125 and Opera 111, and it’s coming soon for Firefox. No word yet on Safari.
这是另一个第 3 阶段提案。Chrome/Edge 125 和 Opera 111 已经支持它,并且很快将在 Firefox 中推出。目前还没有关于 Safari 的消息。

Pattern modifiers use (?ims:...), (?-ims:...), or (?im-s:...) to turn the flags i, m, and s on or off for only certain parts of a regex.
模式修饰符使用 (?ims:...)(?-ims:...)(?im-s:...) 来打开或关闭正则表达式某些部分的标志 ims

For example: 例如:

/hello-(?i:world)/
// Matches 'hello-WORLD' but not 'HELLO-WORLD'

Escape Regex Special Characters with RegExp.escape

使用 RegExp.escape 转义正则表达式特殊字符

This proposal recently reached Stage 3 and has been a long time coming. It isn’t yet supported in any major browsers. The proposal does what it says on the tin, providing the function RegExp.escape(str), which returns the string with all regex special characters escaped so you can match them literally.
该提案最近进入了第 3 阶段,并且已经等待了很长时间。任何主流浏览器尚不支持它。该提案按照它在 TIN 上所说的去做,提供函数 RegExp.escape(str),它返回所有正则表达式特殊字符都转义的字符串,以便您可以按字面匹配它们。

If you need this functionality today, the most widely-used package (with more than 500 million monthly npm downloads) is escape-string-regexp, an ultra-lightweight, single-purpose utility that does minimal escaping. That’s great for most cases, but if you need assurance that your escaped string can safely be used at any arbitrary position within a regex, escape-string-regexp recommends the regex library that we’ve already looked at in this article. The regex library uses interpolation to escape embedded strings in a context-aware way.
如果您现在需要此功能,那么使用最广泛的包(每月 npm 下载量超过 5 亿次)是 escape-string-regexp,这是一个超轻量级、单一用途的实用程序,可进行最少的转义。这在大多数情况下都很好,但如果你需要保证你的转义字符串可以安全地用于正则表达式中的任何位置,escape-string-regexp 建议使用我们在本文中已经介绍过的 regex 库。regex 库使用插值以上下文感知的方式转义嵌入的字符串。

Conclusion

结论

So there you have it: the past, present, and future of JavaScript regular expressions.
所以你有它:JavaScript 正则表达式的过去、现在和未来。

If you want to journey even deeper into the lands of regex, check out Awesome Regex for a list of the best regex testers, tutorials, libraries, and other resources. And for a fun regex crossword puzzle, try your hand at regexle.
如果您想更深入地了解正则表达式的领域,请查看 Awesome Regex 以获取最佳正则表达式测试器、教程、库和其他资源的列表。对于一个有趣的正则表达式填字游戏,请尝试使用 regexle。

May your parsing be prosperous and your regexes be readable.
愿您的解析成功繁荣,您的正则表达式可读。


正则表达式的历史与几大流派

一、介绍

正则表达式(英语:Regular Expression,常简写为 regex、regexp 或 RE )

此处的 Regular 即是规则、规律的意思,Regular Expression 即“描述某种规则的表达式”之意。

二、历史

最初的正则表达式出现于理论计算机科学的自动控制理论和形式化语言理论中。在这些领域中有对计算(自动控制)的模型和对形式化语言描述与分类的研究。

1950年代,数学家斯蒂芬·科尔·克莱尼利用称之为“正则集合”的数学符号来描述此模型。肯·汤普逊将此符号系统引入编辑器 QED,随后是 Unix 上的编辑器 ed,并最终引入 grep。自此以后,正则表达式被广泛地应用于各种 Unix 或类 Unix 系统的工具中。此时出现了 POSIX 规范的正则表达式

Perl 正则表达式源自于 Henry Spencer 于1986年1月19日发布的 regex。它后来演化成了 PCRE(Perl 兼容正则表达式,Perl Compatible Regular Expressions),一个由 Philip Hazel 于1997年夏天开始编写开发的,用 C 编写的,为很多现代工具所使用的库。 PCRE 的语法比 POSIX 正则表达式和许多其他正则表达式库都强大和灵活。许多著名的开源程序(例如 Apache 和 Nginx HTTP 服务器以及 PHP 和 R 脚本语言)都包含 PCRE 库。

编程语言之间关于正则表达式的集成,虽开发进展得很差,但也在逐步推行。我所接触过的,Python 中的正则比较完善,JavaScript 中的正则相对比较弱。

三、规范 / 流派

1、 POSIX 规范的正则表达式

POSIX( Portable Operating System Interface for unix )规范又可分为两种流派(flavor):

  • 基本的正则表达式(Basic Regular Expression,又叫 Basic RegEx,简称 BRE

  • 扩展的正则表达式(Extended Regular Expression,又叫 Extended RegEx,简称 ERE

应用范围:

  • Unix/Linux 下的工具
  • 一些数据库系统
(1)BRE

在 Linux/Unix 常用工具中,grep、vi、sed 都属于 BRE 这一派。

后来 GNU 也对 BRE 做了扩展。这样,GNU 的 grep 等工具虽然名义上属于 BRE 流,但更确切的名称是 GNU BRE

(2)ERE

注意:虽然 BRE 名为“基本”而 ERE 名为“扩展”,但 ERE 并不要求兼容 BRE 的语法,而是自成一体。

在 Linux/Unix 常用工具中,grep -E、sed -r、awk 都属于 ERE 这一派。

后来 GNU 也对 ERE 做了扩展。这样,GNU 的 egrep 等工具虽然名义上属于 ERE 流,但更确切的名称是 GNU ERE

(3)BRE 和 ERE 的区别

在 GNU 的拓展后,现在的 BRE 和 ERE 在功能上已经没有什么区别了,主要的差异是在元字符的转义上。

BRE需要对一些特殊字符进行转义,例如在ERE中可以使用'|'、'+'、'?',而在BRE中则需要转义才能正常使用,例如'\|'、'\+'、'\?'

BRE 之所以这么麻烦的要转义,是因为这些工具的诞生时间很早,有历史遗留问题。

2、Perl 正则表达式

Perl 正则表达式(Perl Regular Expression,又叫 Perl RegEx,简称 PRE

现在常见的正则表达式记法,其实大多都源于 Perl。例如『\d』、『\w』、『\s』之类的记法。

应用范围:

  • (高级)编程语言

Linux/Unix 工具与正则表达式的 POSIX 规范

作者 余晟

2011-07-11

对正则表达式有基本了解的读者,一定不会陌生『\d』、『[a-z]+』之类的表达式,前者匹配一个数字字符,后者匹配一个以上的小写英文字母。但是如果你用过vi、grep、awk、sed之类Linux/Unix下的工具或许会发现,这些工具虽然支持正则表达式,语法却很不一样,照通常习惯的办法写的『\d』、『[a-z]+』之类的正则表达式,往往不是无法识别就是匹配错误。而且,这些工具自身之间也存在差异,同样的结构,有时需要转义有时不需要转义。这,究竟是为什么呢?

原因在于,Unix/Linux下的工具大多采用 POSIX 规范,同时, POSIX 规范又可分为两种流派(flavor)。所以,首先有必要了解一下 POSIX 规范。

POSIX 规范

常见的正则表达式记法,其实都源于Perl,实际上,正则表达式从Perl衍生出一个显赫的流派,叫做PCRE(Perl Compatible Regular Expression),『\d』、『\w』、『\s』之类的记法,就是这个流派的特征。但是在PCRE之外,正则表达式还有其它流派,比如下面要介绍的 POSIX 规范的正则表达式。

POSIX 的全称是Portable Operating System Interface for uniX,它由一系列规范构成,定义了UNIX操作系统应当支持的功能,所以“ POSIX 规范的正则表达式”其实只是“关于正则表达式的 POSIX 规范”,它定义了BRE(Basic Regular Expression,基本型正则表达式)和ERE(Extended Regular Express,扩展型正则表达式)两大流派。在兼容 POSIX 的UNIX系统上,grep和egrep之类的工具都遵循 POSIX 规范,一些数据库系统中的正则表达式也符合 POSIX 规范。

BRE

在Linux/Unix常用工具中,grep、vi、sed都属于BRE这一派,它的语法看起来比较奇怪,元字符『(』、『)』、『{』、『}』必须转义之后才具有特殊含义,所以正则表达式『(a)b』只能匹配字符串 (a)b而不是字符串ab;正则表达式『a{1,2}』只能匹配字符串a{1,2},正则表达式『a{1,2}』才能匹配字符串a或者aa。

之所以这么麻烦,是因为这些工具的诞生时间很早,正则表达式的许多功能却是逐步发展演化出来的,之前这些元字符可能并没有特殊的含义;为保证向后兼容,就只能使用转义。而且有些功能甚至根本就不支持,比如BRE就不支持『+』和『?』量词,也不支持多选结构『(…|…)』和反向引用『\1』、『\2』…。

不过今天,纯粹的BRE已经很少见了,毕竟大家已经认为正则表达式“理所应当”支持多选结构和反向引用等功能,没有确实太不方便。所以虽然vi属于BRE流派,但提供了这些功能。GNU也对BRE做了扩展,支持『+』、『?』、『|』,只是使用时必须写成『+』、『?』、『|』,而且也支持『\1』、『\2』之类反向引用。这样,GNU的grep等工具虽然名义上属于BRE流,但更确切的名称是GNU BRE。

ERE

在Linux/Unix常用工具中,egrep、awk则属于ERE这一派,。虽然BRE名为“基本”而ERE名为“扩展”,但ERE并不要求兼容BRE的语法,而是自成一体。因此其中的元字符不用转义(在元字符之前添加反斜线会取消其特殊含义),所以『(ab|cd)』就可以匹配字符串ab或者cd,量词『+』、『?』、『{n,m}』可以直接使用。ERE并没有明确规定支持反向引用,但是不少工具都支持『\1』、『\2』之类的反向引用。

GNU出品的egrep等工具就属于ERE流(更准确的名字是GNU ERE),但因为GNU已经对BRE做了不少扩展,所谓的GNU ERE其实只是个说法而已,它有的功能GNU BRE都有了,只是元字符不需要转义而已。

下面的表格简要说明了几种 POSIX 流派的区别[[1]](http://www.infoq.com/cn/news/2011/07/regular-expressions-6- POSIX #_ftn1_6165)(其实,现在的BRE和ERE在功能上并没有什么区别,主要的差异是在元字符的转义上)。

几种 POSIX 流派的说明

流派说明工具
BRE(、)、{、}都必须转义使用,不支持+、?、|grep、sed、vi(但vi支持这些多选结构和反向引用)
GNU BRE(、)、{、}、+、?、|都必须转义使用GNU grep、GNU sed
ERE元字符不必转义,+、?、(、)、{、}、|可以直接使用,\1、\2的支持不确定egrep、awk
GNU ERE元字符不必转义,+、?、(、)、{、}、|可以直接使用,支持\1、\2grep –E、GNU awk

为了方便查阅,下面再用一张表格列出基本的正则功能在常用工具中的表示法,其中的工具GNU的版本为准。

常用Linux/Unix工具中的表示法

PCRE记法vi/vimgrepawksed
*****
+++++
?=???
{m,n}{m,n}{m,n}{m,n}{m,n}
\b *< >< >< >\y < >
(…|…)(…|…)(…|…)(…|…)(…|…)
(…)(…)(…)(…)(…)
\1 \2\1 \2\1 \2不支持\1 \2

注:PCRE中常用\b来表示“单词的起始或结束位置”,但Linux/Unix的工具中,通常用<来匹配“单词的起始位置”,用>来匹配“单词的结束位置”,sed中的\y可以同时匹配这两个位置。

POSIX 字符组

在某些文档中,你还会发现类似『[:digit:]』、『[:lower:]』之类的表示法,它们看起来不难理解(digit就是“数字”,lower就是“小写”),但又很奇怪,这就是 POSIX 字符组。不仅在Linux/Unix的常见工具中,甚至一些变成语言中都出现了这些字符组,为避免困惑,这里有必要简要介绍它们。

在 POSIX 规范中,『[a-z]』、『[aeiou]』之类的记法仍然是合法的,其意义与PCRE中的字符组也没有区别,只是这类记法的准确名称是 POSIX 方括号表达式(bracket expression),它主要用在Unix/Linux系统中。 POSIX 方括号表示法与PCRE字符组的最主要差别在于: POSIX 字符组中,反斜线\不是用来转义的。所以 POSIX 方括号表示法『[\d]』只能匹配\和d两个字符,而不是『[0-9]』对应的数字字符。

为了解决字符组中特殊意义字符的转义问题, POSIX 方括号表示法规定,如果要在字符组中表达字符](而不是作为字符组的结束标记),应当让它紧跟在字符组的开方括号之后,所以 POSIX 中,正则表达式『[]a]』能匹配的字符就是]和 a;如果要在 POSIX 方括号表示法中表达字符-(而不是范围表示法),必须将它紧挨在闭方括号]之前,所以『[a-]』能匹配的字符就是 a 和-。

POSIX 规范也定义了 POSIX 字符组,它近似等价于于PCRE的字符组简记法,用一个有直观意义的名字来表示某一组字符,比如digit表示“数字字符”,alpha 表示“字母字符”。

不过, POSIX 中还有一个值得注意的概念:locale(通常翻译为“语言环境”)。它是一组与语言和文化相关的设定,包括日期格式、货币币值、字符编码等等。 POSIX 字符组的意义会根据locale的变化而变化,下面的表格介绍了常见的 POSIX 字符组在ASCII语言环境与Unicode语言环境下的意义,供大家参考。

POSIX 字符组

POSIX 字符组说明ASCII语言环境Unicode语言环境
[:alnum:]*字母字符和数字字符[a-zA-Z0-9][\p{L&}\p{Nd}]
[:alpha:]字母[a-zA-Z]\p{L&}
[:ascii:]ASCII字符[\x00-\x7F]\p{InBasicLatin}
[:blank:]空格字符和制表符[ \t][\p{Zs}\t]
[:cntrl:]控制字符[\x00-\x1F\x7F]\p{Cc}
[:digit:]数字字符[0-9]\p{Nd}
[:graph:]空白字符之外的字符[\x21-\x7E][^\p{Z}\p{C}]
[:lower:]小写字母字符[a-z]\p{Ll}
[:print:]类似[:graph:],但包括空白字符[\x20-\x7E]\P{C}
[:punct:]标点符号[][!"#$%&'()*+,./:;<=>?@^_`{|}~-][\p{P}\p{S}]
[:space:]空白字符[ \t\r\n\v\f][\p{Z}\t\r\n\v\f]
[:upper:]大写字母字符[A-Z]\p{Lu}
[:word:]*字母字符[A-Za-z0-9_][\p{L}\p{N}\p{Pc}]
[:xdigit:]十六进制字符[A-Fa-f0-9][A-Fa-f0-9]

注1:标记*的字符组简记法并不是 POSIX 规范中的,但使用很多,一般语言中都提供,文档中也会出现。

注2:对应的 Unicode 属性请参考本系列文章已经刊发过的关于Unicode的部分。

POSIX 字符组的使用有所不同。主要区别在于,PCRE字符组简记法可以脱离方括号直接出现,而 POSIX 字符组必须出现在方括号内,所以同样是匹配数字字符,单独出现时,PCRE中可以直接写『\d』,而 POSIX 字符组就必须写成『[[:digit:]]』。

Linux/Unix下的工具中,一般都可以直接使用 POSIX 字符组,而PCRE的字符组简记法『\w』、『\d』等则大多不支持,所以如果你看到『[[:space:]]』而不是『\s』,一定不要感到奇怪。

不过,在常用的编程语言中,Java、PHP、Ruby也支持使用 POSIX 字符组。其中Java和PHP中的 POSIX 字符组都是按照 ASCII 语言环境进行匹配;Ruby的情况则要复杂一点,Ruby 1.8按照 ASCII 语言环境进行匹配,而且不支持『[:word:]』和『[:alnum:]』,Ruby 1.9 按照 Unicode 语言环境进行匹配,同时支持『[:word:]』和『[:alnum:]』。

说明:关于正则表达式的系列文章到此即告一段落,作者最近已经完成了一本关于正则表达式的书籍,其中更详细也更全面地讲解了正则表达式使用中的各种问题。该书暂定名《正则导引》,预计近期上市,有兴趣的读者敬请关注。


[[1]](http://www.infoq.com/cn/news/2011/07/regular-expressions-6- POSIX #_ftnref1_6165) 关于 ERE 和 BRE 的详细规范,可以参考 http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html

关于作者

余晟,程序员,曾任抓虾网高级顾问,现就职于盛大创新院,感兴趣的方向包括搜索和分布式算法等。翻译爱好者,译有《精通正则表达式》(第三版)和《技术领导之路》。


via:

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值