正则表达式的历史以及语法

最新推荐文章于 2021-12-07 18:00:00 发布

EJoft

最新推荐文章于 2021-12-07 18:00:00 发布

阅读量660

点赞数

分类专栏： Java 文章标签： java

本文链接：https://blog.csdn.net/EJoft/article/details/105983014

版权

Java 专栏收录该内容

35 篇文章 0 订阅

订阅专栏

欢迎跳转到本文原文链接 Backend_Notes

A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that define a search pattern. Usually such patterns are used by string searching algorithms for “find” or “find and replace” operations on strings, or for input validation. It is a technique developed in theoretical computer science(理论计算机科学) and formal language(形式化语言) theory.

The concept arose in the 1950s when the American mathematician Stephen Cole Kleene formalized the description of a regular language. The concept came into common use with Unix text-processing utilities. Different syntaxes for writing regular expressions have existed since the 1980s, one being the POSIX standard and another, widely used, being the Perl syntax.

Regular expressions are used in search engines, search and replace dialogs of word processors and text editors, in text processing utilities such as sed and AWK and in lexical analysis. Many programming languages provide regex capabilities either built-in or via libraries.^[1]

Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in the set. They can be used to search, edit, or manipulate text and data. ^[2]

简单来说正则表达式是一种用于模式匹配的替换和规范，可以对字符串进行查找、提取、分割和替换等操作。

正则表达式的历史

1951年，正则表达式起源于数学家Stephen Cole Kleene用他的数学标记（regular events）描述了正则语言（regular languages），这些出现在理论计算机科学中的自动机理论（automata theory）子领域以及形式语言（ formal languages）的描述和分类中。
1968年，正则表达式开始在两个领域中变得流行，分别是文本编辑器的模式匹配(pattern matching)以及编译器中的词法分析(lexical analysis)。正则表达式第一次在编程中出现是在 Ken Thompson 在 QED 中构建了Kleene’s notation，为了执行速度，Thompson 使用JIT 来实现正则表达式。在 Thompson 开发 QED 的同一时期，包括 Douglas T. Ross 在内的一群研究人员给予正则表达式开发了编译器中词法分析的工具。
20世纪80年代，从 Henry Spencer 编写的正则表达式库中派生出了在 Perl 语言中兼容性更强的正则表达式。
1997年，Philip Hazel 开发了 PCRE (Perl Compatible Regular Expressions)，PCRE 被很多现代开发工具所使用，包括 PHP 和 Apache HTTP Server。
现在，正则表达式已经广泛被各大编程语言，文本处理程序以及文本编辑器所支持

正则表达式语法

测试代码

import java.io.Console;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpression2 {
    public static void main(String[] args) {
        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {
            Pattern pattern = Pattern.compile(console.readLine("%nEnter your regex: "));
            Matcher matcher = pattern.matcher(console.readLine("Enter input string to search: "));

            boolean found = false;
            while (matcher.find()) {
                console.format("I fount the text" +
                                " \"%s\" starting at " +
                                "index %d and ending at index %d %n",
                        matcher.group(),
                        matcher.start(),
                        matcher.end());
                found = true;
            }

            if (!found) {
                console.format("No match found %n");
            }
        }
    }
}

基本匹配（String literal)

// input
regex: the
strings: The fat cat sat on the mat.

// output
I fount the text "the" starting at index 19 and ending at index 22

使用字符串字面量是正则表达式最简单的使用方法，会在字符串中寻找与 pattern 完全一致的字符串，注意这里打印的 startIndex 和 endIndex 遵循字符串左闭右开 的原则。

元字符（Metacharacters）

The metacharacters supported by java.util.regex are: <([{\^-=$!|]})?*+.>

Note: In certain situations the special characters listed above will not be treated as metacharacters. You’ll encounter this as you learn more about how regular expressions are constructed. You can, however, use this list to check whether or not a specific character will ever be considered a metacharacter. For example, the characters @ and # never carry a special meaning.

There are two ways to force a metacharacter to be treated as an ordinary character:

precede the metacharacter with a backslash, or
enclose it within \Q (which starts the quote) and \E (which ends it).

When using this technique, the \Q and \E can be placed at any location within the expression, provided that the \Q comes first.

元字符	说明
^	匹配一行的开头，要匹配 `^` 本身，可以使用 `\^`
$	匹配一行的结尾，要匹配 `$` 本身，可以使用 `\$`
.	匹配除了换行符(\n)以外的所有字符，要匹配 `.` 本身，可以使用 `\.`
[]	`[]` 中的字符为字符集，字符集不关心顺序，如`[Tt]he` 匹配 `the` 和 `The`，要匹配 `[]` 本身，可以使用 `\[` 和 `\]`
*	匹配前面字符出现次数 >= 0 次，要匹配 `` 本身，可以使用 `\`，使用 `.*` 可以匹配整个字符串
+	匹配前面字符出现次数 >= 1 次，要匹配 `+` 本身，可以使用 `\+`
？	匹配前面字符出现次数为0或1，要匹配 `?` 本身，可以使用 `\?`
{}	用来限定一个或一组字符可以重复出现的次数,要匹配 `{}` 本身，可以使用 `\{` 和 `\}`
{	特征标群是一组写在 `(...)` 中的子模式。`(...)` 中包含的内容将会被看成一个整体，和数学中小括号（）的作用相同。
\|	或运算符就表示或，用作判断条件。例如 `(T\|t)he\|car` 匹配 `(T\|t)he` 或者`car`
\	反斜线 `\` 在表达式中用于转码紧跟其后的字符。用于指定 `{ } [ ] / \ + * . $ ^\|?` 这些特殊字符

Java中预定义字符

简写	描述
.	除换行符外的所有字符
\w	匹配所有字母数字，等同于 `[a-zA-Z0-9_]`
\W	匹配所有非字母数字，即符号，等同于： `[^\w]`
\d	匹配数字： `[0-9]`
\D	匹配非数字： `[^\d]`
\s	匹配所有空格字符，等同于： `[\t\n\f\r\p{Z}]`
\S	匹配所有非空格字符： `[^\s]`
\h	A horizontal whitespace character: [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]
\H	A non-horizontal whitespace character: `[^\h]`
\v	A vertical whitespace character: `[\n\x0B\f\r\x85\u2028\u2029]`
\V	A non-vertical whitespace character: `[^\v]`