A Java Self-Learner's Fourth Study Note

新时代农民1

Tags: java

Article link: https://blog.csdn.net/qq_53324408/article/details/118582252

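This note covers regular expressions, a topic that pays off in everyday programming.

A grammar defines a set of strings. Suppose we write a grammar for uniform resource locators (URLs): that grammar then represents the set of legal URLs in the HTTP protocol.

The literal strings in a grammar are called terminals. They have this name because they cannot be expanded any further. We usually write them in single quotes.

A grammar is described by a set of productions, each of which defines one nonterminal. A production has the following form: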

nonterminal ::= expression of terminals, nonterminals, and operators

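Exactly one nonterminal of a grammar is designated as the root. Every string the grammar contains is generated from the root.

Grammar operators

1. Repetition (closure), written *: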

x ::= y*        // x matches zero or more y

2. Concatenation, written as just a space:

x ::= y z       // x matches y followed by z

3. Union, written |:

x ::= y | z     // x matches either y or z

4. Optional (0 or 1 occurrence), written ?:

x ::= y?       // an x is a y or is the empty string

5. One or more occurrences, written +:

x ::= y+       // an x is one or more y
               //    equivalent to  x ::= y y*

6. An exact number of occurrences, or a range of occurrences:

x ::= y{3}     // an x is three y
               // equivalent to x ::= y y y 

x ::= y{1,3}   // an x is between one and three y
               // equivalent to x ::= y | y y | y y y

x ::= y{,4}    // an x is at most four y
               // equivalent to x ::=   | y | y y | y y y | y y y y
               //                     ^--- note the empty string here, so this can match zero y's

x ::= y{2,}    // an x is two or more y
               // equivalent to x ::= y y y*

7. Character classes (which characters are allowed):

① character class

x ::= [aeiou]  // equivalent to  x ::= 'a' | 'e' | 'i' | 'o' | 'u'
x ::= [a-ckx-z]    // equivalent to  x ::= 'a' | 'b' | 'c' | 'k' | 'x' | 'y' | 'z'

② inverted character class

x ::= [^a-c]  // equivalent to  x ::= 'd' | 'e' | 'f' | ... | '0' | '1' | '2' | ... | '!' | '@'
              //                          | ... (all other possible characters)
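
The operators above carry over almost directly to Java regexes. The sketch below is only an illustration: the class name GrammarOperatorsDemo and the helper whole are made up for this note. Two differences from the grammar notation are assumed here: a regex writes concatenation with no space at all, and since Java's Pattern documents the bounded forms X{n}, X{n,} and X{n,m}, "at most four" is written y{0,4} rather than y{,4}.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: each check pairs one grammar operator with a Java regex.
class GrammarOperatorsDemo {
    // Pattern.matches succeeds only if the entire input matches the regex.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(whole("y*", ""));          // true: zero y's is allowed
        System.out.println(whole("yz", "yz"));        // true: y followed by z
        System.out.println(whole("y|z", "z"));        // true: either branch works
        System.out.println(whole("y?", "yy"));        // false: at most one y
        System.out.println(whole("y+", ""));          // false: at least one y needed
        System.out.println(whole("y{3}", "yyy"));     // true: exactly three
        System.out.println(whole("y{1,3}", "yyyy"));  // false: four is too many
        System.out.println(whole("y{0,4}", ""));      // true: zero up to four
        System.out.println(whole("y{2,}", "y"));      // false: two or more needed
        System.out.println(whole("[aeiou]", "e"));    // true: e is in the class
        System.out.println(whole("[a-ckx-z]", "d"));  // false: d is in neither range
        System.out.println(whole("[^a-c]", "@"));     // true: anything except a, b, c
    }
}
```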

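Recursion in grammars and parse trees work exactly as they do in formal language theory, so I will not repeat them here.

Regular expressions can be called regexes for short. A regex squeezes the grammar into a single compact expression with the nonterminals substituted away, so it is less readable than the original grammar. For example:

// original grammar
url ::= 'http://' hostname (':' port)? '/'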
hostname ::= word '.' hostname | word '.' word
port ::= [0-9]+
word ::= [a-z]+

// regex
url ::= 'http://' ([a-z]+ '.')+ [a-z]+ (':' [0-9]+)? '/'
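
One way to get some of the grammar's readability back in Java is to keep one string constant per production and concatenate them. This is only a sketch: the class UrlGrammarDemo and its constant names are hypothetical, and "\\." is simply the Java string-literal spelling of the regex \. (a literal period).

```java
// Hypothetical demo class: one constant per production of the URL grammar.
class UrlGrammarDemo {
    static final String WORD = "[a-z]+";                           // word ::= [a-z]+
    static final String PORT = "[0-9]+";                           // port ::= [0-9]+
    static final String HOSTNAME = "(" + WORD + "\\.)+" + WORD;    // one or more "word." then a word
    static final String URL = "http://" + HOSTNAME + "(:" + PORT + ")?/";

    // True only when the whole string is a URL according to the grammar.
    static boolean isUrl(String s) {
        return s.matches(URL);
    }

    public static void main(String[] args) {
        System.out.println(URL);                          // the assembled single-line regex
        System.out.println(isUrl("http://mit.edu/"));     // true
        System.out.println(isUrl("http://mit.edu:80/"));  // true: the port is optional
        System.out.println(isUrl("http://mit/"));         // false: hostname needs at least one '.'
    }
}
```

The assembled string is character for character the compact regex, so nothing is lost by writing it in pieces.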

Regexes also have some more specialized symbols:

.   // matches any single character (but sometimes excluding newline, depending on the regex library)

\d  // matches any digit, same as [0-9]
\s  // matches any whitespace character, including space, tab, newline
\w  // matches any word character including underscore, same as [a-zA-Z_0-9]
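
A small sketch of how these shorthand classes behave in Java (the class name ShorthandClassesDemo and its helper methods are made up for illustration). In Java's regex library the dot does not match a newline by default; compiling the pattern with the Pattern.DOTALL flag changes that.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the shorthand classes \d, \s, \w and the dot.
class ShorthandClassesDemo {
    static boolean allDigits(String s)     { return s.matches("\\d+"); }
    static boolean allWhitespace(String s) { return s.matches("\\s+"); }
    static boolean allWordChars(String s)  { return s.matches("\\w+"); }

    // With DOTALL, '.' also matches line terminators such as "\n".
    static boolean dotMatches(String s, boolean dotAll) {
        Pattern p = dotAll ? Pattern.compile(".", Pattern.DOTALL) : Pattern.compile(".");
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(allDigits("2021"));         // true
        System.out.println(allWhitespace(" \t\n"));    // true: space, tab, newline
        System.out.println(allWordChars("a_1"));       // true: underscore counts
        System.out.println(allWordChars("a-1"));       // false: '-' is not a word character
        System.out.println(dotMatches("\n", false));   // false: default dot skips newline
        System.out.println(dotMatches("\n", true));    // true with DOTALL
    }
}
```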

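The backslash (\) is the escape character: it removes the special meaning of the character that follows it, so that character is matched literally against the text.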

\.  \(  \)  \*  \+  \|  \[  \]  \\

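Note: when you paste one of the backslash sequences above into a Java string literal in IDEA, the IDE adds another backslash in front of it. Do not delete it! The MIT reading explains why (using \. as the example):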

We want to match a literal period ., so we have to first escape it as \. to protect it from being interpreted as the regex match-any-character operator, and then we have to further escape it as \\. to protect the backslash from being interpreted as a Java string escape character. The frequent necessity for double-backslash escapes makes regexes still less readable.
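
The two layers of escaping can be seen in a short sketch (EscapeDemo and matchesLiterally are names invented for this note). The Java literal "\\." is a two-character string, backslash then period, which the regex engine reads as \. and treats as a literal period. When a whole string should be taken literally, Pattern.quote avoids escaping each operator by hand.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the double-backslash rule.
class EscapeDemo {
    static final String ANY_CHAR = ".";        // regex .  : any single character
    static final String LITERAL_DOT = "\\.";   // regex \. : only a period

    // Treats every character of 'literal' as plain text, not as a regex operator.
    static boolean matchesLiterally(String literal, String input) {
        return input.matches(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        System.out.println("a".matches(ANY_CHAR));             // true
        System.out.println("a".matches(LITERAL_DOT));          // false
        System.out.println(".".matches(LITERAL_DOT));          // true
        System.out.println(LITERAL_DOT.length());              // 2: backslash and period
        System.out.println("1+1=2".matches("1+1=2"));          // false: + means repetition here
        System.out.println(matchesLiterally("1+1=2", "1+1=2")); // true
    }
}
```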

In Java, you can use regular expressions to operate on strings.

Replace every run of consecutive spaces in the string s with a single space:

String singleSpacedString = s.replaceAll(" +", " ");
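
Wrapped in a runnable class, the one-liner can be tried on a few inputs. The class SpaceCollapseDemo and the method names are hypothetical. The second method is a variant that is not in the original note: it assumes you also want tabs and newlines collapsed, which " +" alone leaves untouched.

```java
// Hypothetical demo class around the replaceAll one-liner.
class SpaceCollapseDemo {
    // Collapses runs of the space character only.
    static String collapseSpaces(String s) {
        return s.replaceAll(" +", " ");
    }

    // Variant (an assumption, not from the note): collapses any whitespace run.
    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(collapseSpaces("a   b  c"));          // a b c
        System.out.println(collapseWhitespace("a \t\n b"));      // a b
    }
}
```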

Match a URL:

if (s.matches("http://([a-z]+\\.)+[a-z]+(:[0-9]+)?/")) {
    // then s is a url
}
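
String.matches succeeds only when the entire string matches, so it cannot pick a URL out of a longer sentence. A sketch of the difference, using the same regex with Matcher.find (the class UrlMatchDemo and its methods are illustrative names, not part of any library):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class contrasting whole-string matching with searching.
class UrlMatchDemo {
    // Compiled once and reused, instead of recompiling on every call.
    static final Pattern URL = Pattern.compile("http://([a-z]+\\.)+[a-z]+(:[0-9]+)?/");

    // Whole-string check, equivalent to s.matches(...).
    static boolean isUrl(String s) {
        return URL.matcher(s).matches();
    }

    // Returns the first URL found anywhere in the text, or null if there is none.
    static String firstUrl(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "see http://web.mit.edu:8080/ for details";
        System.out.println(isUrl(text));                        // false: extra text around the URL
        System.out.println(firstUrl(text));                     // http://web.mit.edu:8080/
        System.out.println(isUrl("http://web.mit.edu:8080/"));  // true
    }
}
```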

Extract the fields of a date such as "2021-07-09":

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "2021-07-09";
Pattern regex = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");
Matcher m = regex.matcher(s);
if (m.matches()) {
    String year = m.group("year");
    String month = m.group("month");
    String day = m.group("day");
    // Matcher.group(name) returns the part of s that matched (?<name>...)
}
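
The same named groups also work when searching a longer text for several dates. The following is a minimal sketch under that assumption; DateExtractDemo and findDates are hypothetical names. Keep in mind that the regex checks only the shape of the text, so a string such as "2021-13-45" would also be accepted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: pulls every yyyy-mm-dd date out of a text.
class DateExtractDemo {
    static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

    // Each result is {year, month, day}; no calendar validation is performed.
    static List<int[]> findDates(String text) {
        List<int[]> dates = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {  // find() advances to the next match in the text
            dates.add(new int[] {
                Integer.parseInt(m.group("year")),
                Integer.parseInt(m.group("month")),
                Integer.parseInt(m.group("day"))
            });
        }
        return dates;
    }

    public static void main(String[] args) {
        for (int[] d : findDates("Posted 2021-07-09, updated 2024-08-20.")) {
            System.out.println(d[0] + " / " + d[1] + " / " + d[2]);
        }
    }
}
```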