Java Regular Expression Match Modes Explained

三劫散仙

Article link: https://blog.csdn.net/u010454030/article/details/131217547

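For Java's regular expression syntax, see: Pattern (Java Platform SE 8)

Difference between matches and find

matches: the entire input string must match the regular expression, much like comparing strings for equality: "b".equals("b");

find: the input only needs to contain a substring that matches the regular expression, much like a substring test: "b".contains("b");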

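        // requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
        String word = "my number is 188"; // matches=false, find=true
        String word1 = "188999"; // matches=false, find=true
        String word2 = "8"; // matches=true, find=true
        Pattern p = Pattern.compile("\\d");
        Matcher m = p.matcher(word2);
        System.out.println(m.matches());
        m.reset(); // a successful matches() consumes the input; reset before calling find()
        System.out.println(m.find());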
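
One pitfall when comparing the two methods is that a Matcher is stateful: after a successful matches(), a find() on the same matcher continues from the end of that match instead of from the start. The sketch below illustrates this; the class and method names are made up for the illustration and are not from the original article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchVsFind {
    // Calls matches() and then find() on the same Matcher.
    // When resetBetween is true, the matcher is reset before find().
    static boolean findAfterMatches(String regex, String input, boolean resetBetween) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.matches();        // a successful full match leaves the position at the end
        if (resetBetween) {
            m.reset();      // search from the beginning of the input again
        }
        return m.find();
    }

    public static void main(String[] args) {
        System.out.println(findAfterMatches("\\d", "8", false)); // false
        System.out.println(findAfterMatches("\\d", "8", true));  // true
    }
}
```

Using a separate Matcher for each check works equally well.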

Default matching example

       String word = "ac\nab";
       Pattern p = Pattern.compile("^a.*");
       Matcher m = p.matcher(word);
       while (m.find()){
           System.out.println(m.group());
       }
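// Output:
// ac

Only ac is printed; ab is not. By default, ^ and $ match only at the beginning and end of the entire input, not at the line terminators inside it, and . does not match a line terminator, so ^a.* can only match within the first line.

Without ^ and $, the pattern a.* finds a match on every line. But what if we want to keep the strict ^ and $ anchors and still match line by line? That is what multiline mode is for.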
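
The difference between the anchored and the unanchored pattern can be checked with a small sketch. The helper findAll is a hypothetical name introduced here for illustration; it uses the default flags only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DefaultAnchors {
    // Collects every match of regex in input, using the default flags.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll("^a.*", "ac\nab")); // [ac]
        System.out.println(findAll("a.*", "ac\nab"));  // [ac, ab]
    }
}
```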

Multiline mode: MULTILINE

Multiline mode can be enabled in two ways.

First, with the embedded flag expression (?m):

       String word = "ac\nab";
       Pattern p = Pattern.compile("(?m)^a.*"); 
       Matcher m = p.matcher(word);
       while (m.find()){
           System.out.println(m.group());
       }
// Output:
// ac
// ab

Second, by passing the flag argument Pattern.MULTILINE:

       String word = "ac\nab";
       Pattern p = Pattern.compile("^a.*", Pattern.MULTILINE);
       Matcher m = p.matcher(word);
       while (m.find()){
           System.out.println(m.group());
       }
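
Multiline mode affects $ as well as ^. As a minimal sketch (the class name, method name, and sample text are illustrative assumptions), the pattern ^\w+$ below picks out only the lines that consist of a single word:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineAnchors {
    // Returns the lines of text that consist of exactly one word.
    static List<String> singleWordLines(String text) {
        List<String> out = new ArrayList<>();
        // With MULTILINE, ^ and $ match at the start and end of every line.
        Matcher m = Pattern.compile("^\\w+$", Pattern.MULTILINE).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(singleWordLines("ac\nab cd\nxy")); // [ac, xy]
    }
}
```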

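Dot-matches-all mode: DOTALL

In Java regex syntax, the metacharacter . matches any character except line terminators. But what if we want to match content that spans line breaks?

Multiline mode does not help here, because it changes only the behavior of ^ and $, not that of .

Pattern.DOTALL, or the embedded flag expression (?s), makes . match every character, including line terminators: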

       String word = "run\nhad\noop";
//       Pattern p = Pattern.compile("h.*p", Pattern.DOTALL);
       Pattern p = Pattern.compile("(?s)h.*p");
       Matcher m = p.matcher(word);
       while (m.find()){
           System.out.println(m.group());
       }
// Output (a single match that contains a line break):
// had
// oop
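
To see why multiline mode is not enough here, the same pattern can be run under each flag. This is a sketch with a hypothetical helper named findAll; line breaks inside a match are printed as \n so that a single match stays on one line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallVsMultiline {
    // Collects every match of regex in input under the given flags.
    static List<String> findAll(String regex, int flags, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String word = "run\nhad\noop";
        // MULTILINE: . still stops at the line break, so nothing matches.
        System.out.println(findAll("h.*p", Pattern.MULTILINE, word)); // []
        // DOTALL: . crosses the line break, giving one match.
        for (String s : findAll("h.*p", Pattern.DOTALL, word)) {
            System.out.println(s.replace("\n", "\\n")); // had\noop
        }
    }
}
```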

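Combining modes: MULTILINE & DOTALL

Sometimes the matching rule is more complex and several modes have to be used together.

Suppose the requirement is to ignore line breaks and match only text that starts with h at the beginning of a line and ends with p at the end of a line. Let's analyze:

MULTILINE alone will not work, because there is a line break between h and p, and . does not match it.

DOTALL alone will not work either: the input is not split into lines but treated as one big string, so ^ matches only at the very start of the input.

So the two modes, MULTILINE + DOTALL, have to be combined: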

       String word = "run\nhad\noop\nhi\nspx";
//     Pattern p = Pattern.compile("(?ms)^h.*p$"); // embedded flag expression
       Pattern p = Pattern.compile("^h.*p$", Pattern.DOTALL | Pattern.MULTILINE);
       Matcher m = p.matcher(word);
       while (m.find()){
           System.out.println(m.group());
       }
// Output (a single match that contains a line break):
// had
// oop
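
The analysis above can be verified by counting matches under each flag combination. This is only a sketch; countMatches is a hypothetical helper, not part of the original article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CombinedFlags {
    // Counts how many times regex matches input under the given flags.
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String word = "run\nhad\noop\nhi\nspx";
        String regex = "^h.*p$";
        System.out.println(countMatches(regex, 0, word));                 // 0
        System.out.println(countMatches(regex, Pattern.MULTILINE, word)); // 0
        System.out.println(countMatches(regex, Pattern.DOTALL, word));    // 0
        System.out.println(countMatches(regex,
                Pattern.MULTILINE | Pattern.DOTALL, word));               // 1
    }
}
```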

Case-insensitive matching: CASE_INSENSITIVE

       String word = "cAt";
       Pattern p = Pattern.compile("(?i)cat"); // embedded flag expression
//       Pattern p = Pattern.compile("cat", Pattern.CASE_INSENSITIVE);
       Matcher m = p.matcher(word);
       while (m.find()){
           System.out.println(m.group());
       }
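
The embedded flag can also be limited to part of a pattern with the (?i:X) group form. The following sketch (the helper name is an assumption made for illustration) compares a case-sensitive match, a fully case-insensitive match, and a pattern where only the middle letter ignores case:

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // True if the whole input matches regex under the given flags.
    static boolean matchesWhole(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("cat", 0, "cAt"));                        // false
        System.out.println(matchesWhole("cat", Pattern.CASE_INSENSITIVE, "cAt")); // true
        // (?i:a) applies case-insensitivity to the letter a only.
        System.out.println(matchesWhole("c(?i:a)t", 0, "cAt"));                   // true
        System.out.println(matchesWhole("c(?i:a)t", 0, "CAt"));                   // false
    }
}
```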

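Unix line terminators: UNIX_LINES

In the default mode, \r, \n, and the sequence \r\n (as well as \u0085, \u2028, and \u2029) are all recognized as line terminators: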

       String input= "This is the first line\r"
               + "This is the second line\r"
               + "This is the third line\r";
       Pattern p = Pattern.compile("^T.*e");
       Matcher m = p.matcher(input);
       while (m.find()){
           System.out.println("["+m.group()+"]");
       }
// Output:
// [This is the first line]

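When UNIX_LINES is specified, only \n is recognized as a line terminator in the behavior of ., ^, and $; every other line-terminator character is treated as an ordinary character.

       String input= "This is the first line\r"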
               + "This is the second line\r"
               + "This is the third line\r";
       Pattern p = Pattern.compile("^T.*e", Pattern.UNIX_LINES);
       Matcher m = p.matcher(input);
       while (m.find()){
           System.out.println("["+m.group()+"]");
       }
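// Output (as displayed in a terminal):
// This is the third line]

Note that \r is a carriage return: when printed to a terminal it moves the cursor back to the start of the line, so later text overwrites earlier text. The single match actually spans all three segments, but only the last one remains visible.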
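
To inspect what was really matched, the carriage returns can be made visible before printing. This is a minimal sketch; firstMatch and visible are hypothetical helper names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnixLinesDemo {
    // Returns the first match of ^T.*e under the given flags, or null.
    static String firstMatch(int flags, String input) {
        Matcher m = Pattern.compile("^T.*e", flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Replaces each carriage return with the two visible characters \r.
    static String visible(String s) {
        return s.replace("\r", "\\r");
    }

    public static void main(String[] args) {
        String input = "This is the first line\r"
                + "This is the second line\r"
                + "This is the third line\r";
        // Default: . stops at \r, so only the first segment is matched.
        System.out.println(visible(firstMatch(0, input)));
        // UNIX_LINES: \r is an ordinary character, so .* runs across it.
        System.out.println(visible(firstMatch(Pattern.UNIX_LINES, input)));
    }
}
```

The second line printed contains all three segments joined by a visible \r.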

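Adding comments: COMMENTS

Comments can be embedded in the regular expression itself: whitespace in the pattern is ignored, and text from # to the end of the line is treated as a comment.

       String input= "abc\nbbc";
       Pattern p = Pattern.compile("a.*c # find text that starts with a and ends with c", Pattern.COMMENTS);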
       Matcher m = p.matcher(input);
       while (m.find()){
           System.out.println("["+m.group()+"]");
       }
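
COMMENTS mode is most useful when a longer pattern is split over several lines. The date pattern below is an illustrative assumption, not an example from the original article:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedRegex {
    // Each line holds one piece of the pattern plus a comment.
    // Whitespace and everything after # are ignored by the regex engine.
    static final Pattern DATE = Pattern.compile(
            "(\\d{4})   # year\n"
          + "-          # separator\n"
          + "(\\d{2})   # month\n"
          + "-          # separator\n"
          + "(\\d{2})   # day",
            Pattern.COMMENTS);

    // Returns {year, month, day}, or null if s is not in yyyy-MM-dd form.
    static String[] parts(String s) {
        Matcher m = DATE.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        String[] p = parts("2023-06-15");
        System.out.println(p[0] + " / " + p[1] + " / " + p[2]); // 2023 / 06 / 15
    }
}
```

Because whitespace is ignored in this mode, a literal space in the pattern has to be escaped or placed in a character class.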

Literal parsing mode: LITERAL

       String input= "abc\nbbc";
       // Only CASE_INSENSITIVE and UNICODE_CASE keep their effect when combined with LITERAL
       Pattern p = Pattern.compile("a.*", Pattern.LITERAL);
       Matcher m = p.matcher(input);
       while (m.find()){
           System.out.println("["+m.group()+"]");
       }
// No output: metacharacters are treated as ordinary characters, and the input does not contain the literal text a.*
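
As a rough sketch of what literal mode does, the pattern a.* matches only when the input contains those three characters verbatim. Pattern.quote is shown alongside as another way to get the same literal treatment without the flag; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

class LiteralDemo {
    // Searches for the literal text using the LITERAL flag.
    static boolean containsLiteral(String literal, String input) {
        return Pattern.compile(literal, Pattern.LITERAL).matcher(input).find();
    }

    // Searches for the literal text by quoting it instead.
    static boolean containsQuoted(String literal, String input) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsLiteral("a.*", "abc\nbbc"));     // false
        System.out.println(containsLiteral("a.*", "use a.* here")); // true
        System.out.println(containsQuoted("a.*", "use a.* here"));  // true
    }
}
```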

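Case-insensitive matching beyond ASCII: UNICODE_CASE

By default, case-insensitive matching covers only ASCII characters. For non-ASCII characters, UNICODE_CASE has to be combined with CASE_INSENSITIVE to take effect.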

       String input= "À";
       Pattern p = Pattern.compile("à", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
       Matcher m = p.matcher(input);
       while (m.find()){
           System.out.println("["+m.group()+"]");
       }
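
The effect of the extra flag can be seen by running the same match with and without it. In this sketch the characters are written as Unicode escapes to avoid source-encoding problems, and matchesWhole is a hypothetical helper:

```java
import java.util.regex.Pattern;

class UnicodeCaseDemo {
    // True if the whole input matches regex under the given flags.
    static boolean matchesWhole(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00E0"; // LATIN SMALL LETTER A WITH GRAVE
        String upper = "\u00C0"; // LATIN CAPITAL LETTER A WITH GRAVE
        // ASCII-only case folding: no match.
        System.out.println(matchesWhole(lower, Pattern.CASE_INSENSITIVE, upper)); // false
        // Unicode-aware case folding: match.
        System.out.println(matchesWhole(lower,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, upper));         // true
        // Same thing with embedded flags (?iu).
        System.out.println(matchesWhole("(?iu)" + lower, 0, upper));              // true
    }
}
```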

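UNICODE_CHARACTER_CLASS mode

With this mode enabled, the predefined character classes and POSIX character classes follow the Unicode definitions listed below:

Classes	Matches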
\p{Lower}	A lowercase character:\p{IsLowercase}
\p{Upper}	An uppercase character:\p{IsUppercase}
\p{ASCII}	All ASCII:[\x00-\x7F]
\p{Alpha}	An alphabetic character:\p{IsAlphabetic}
\p{Digit}	A decimal digit character:\p{IsDigit}
\p{Alnum}	An alphanumeric character:[\p{IsAlphabetic}\p{IsDigit}]
\p{Punct}	A punctuation character:\p{IsPunctuation}
\p{Graph}	A visible character: [^\p{IsWhite_Space}\p{gc=Cc}\p{gc=Cs}\p{gc=Cn}]
\p{Print}	A printable character: [\p{Graph}\p{Blank}&&[^\p{Cntrl}]]
\p{Blank}	A space or a tab: [\p{IsWhite_Space}&&[^\p{gc=Zl}\p{gc=Zp}\x0a\x0b\x0c\x0d\x85]]
\p{Cntrl}	A control character: \p{gc=Cc}
\p{XDigit}	A hexadecimal digit: [\p{gc=Nd}\p{IsHex_Digit}]
\p{Space}	A whitespace character:\p{IsWhite_Space}
\d	A digit: \p{IsDigit}
\D	A non-digit: [^\d]
\s	A whitespace character: \p{IsWhite_Space}
\S	A non-whitespace character: [^\s]
\w	A word character: [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}]
\W	A non-word character: [^\w]

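The matching here relies on Unicode character properties; see UTS #18: Unicode Regular Expressions

Searching that document for the keyword punctuation shows how Unicode defines punctuation. Under the default POSIX definition, \p{Punct} covers the ASCII symbols on the keyboard that are neither digits nor letters: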

       Pattern p = Pattern.compile("\\p{Punct}");
       Matcher m = p.matcher("`");
       System.out.println(m.matches()); // returns true
       
       Pattern p1 = Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);
       Matcher m1 = p1.matcher("`");
       System.out.println(m1.matches()); // returns false

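Note that the second match fails: once UNICODE_CHARACTER_CLASS is enabled, \p{Punct} means the Unicode Punctuation property, and Unicode classifies ` as a modifier symbol (Sk) rather than as punctuation.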
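
The same flag also changes \w and \d, which are ASCII-only by default. The following sketch uses sample strings chosen for illustration and a hypothetical helper; (?U) is the embedded form of the flag.

```java
import java.util.regex.Pattern;

class UnicodeClassDemo {
    // True if the whole input matches regex under the given flags.
    static boolean matchesWhole(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String word = "caf\u00E9"; // "cafe" with an acute accent on the e
        String digit = "\u0663";   // ARABIC-INDIC DIGIT THREE
        int ucc = Pattern.UNICODE_CHARACTER_CLASS;

        System.out.println(matchesWhole("\\w+", 0, word));     // false
        System.out.println(matchesWhole("\\w+", ucc, word));   // true
        System.out.println(matchesWhole("\\d", 0, digit));     // false
        System.out.println(matchesWhole("\\d", ucc, digit));   // true
        System.out.println(matchesWhole("(?U)\\w+", 0, word)); // true
    }
}
```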

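Unicode canonical equivalence: CANON_EQ mode

This mode is generally used with Unicode text. For example:

“◌̇” U+0307 Combining Dot Above Unicode Character

The Unicode character U+0307 represents a dot placed above the preceding letter, as in ḃ

Combining b + \u0307 produces ḃ, and ḃ also has its own precomposed Unicode character: \u1E03

In other words, b\u0307 = \u1E03

For a Java regular expression to treat these two forms as equivalent, CANON_EQ mode must be enabled:

        String regex = "b\u0307";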
        System.out.println(regex);
        System.out.println("\u1E03");
        Pattern pattern = Pattern.compile(regex, Pattern.CANON_EQ);
        Matcher matcher = pattern.matcher("\u1E03");
        if(matcher.matches()) {
            System.out.println("Match found");
        } else {
            System.out.println("Match not found");
        }
// Output:
// ḃ
// ḃ
// Match found
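
For comparison, the sketch below runs the same match without the flag, and also shows an alternative that is not from the original article: normalizing both strings to NFC with java.text.Normalizer before comparing them. The helper names are hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class CanonEqDemo {
    // True if the whole input matches regex under the given flags.
    static boolean matchesWhole(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Compares two strings after normalizing both to NFC.
    static boolean equalAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String decomposed = "b\u0307"; // b followed by COMBINING DOT ABOVE
        String composed = "\u1E03";    // precomposed b with dot above

        System.out.println(matchesWhole(decomposed, 0, composed));                // false
        System.out.println(matchesWhole(decomposed, Pattern.CANON_EQ, composed)); // true
        System.out.println(equalAfterNfc(decomposed, composed));                  // true
    }
}
```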
