Regular Expressions Study Guide (11): Quantifiers (Repetition)

Repetition with Star and Plus

I already introduced one repetition operator, or quantifier: the question mark. It tells the engine to attempt to match the preceding token zero times or once, in effect making it optional.

The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more.

<[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. The angle brackets are literals. The first character class matches a letter. The second character class matches a letter or digit. The star repeats the second character class. Because we used the star, it's OK if the second character class matches nothing. So our regex will match a tag like <B>. When matching <HTML>, the first character class will match H. The star will cause the second character class to be repeated three times, matching T, M and L with each step.

I could also have used <[A-Za-z0-9]+>. I did not, because this regex would match <1>, which is not a valid HTML tag. But this regex may be sufficient if you know the string you are searching through does not contain any such invalid tags.
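The difference between the two patterns can be checked with a short sketch. It uses Python's `re` module, which is an assumption of this example (the tutorial itself is not tied to a language), and the sample string is made up for illustration.

```python
import re

# Strict version: a letter first, then zero or more letters or digits.
strict = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")
# Looser version: one or more letters or digits, in any order.
loose = re.compile(r"<[A-Za-z0-9]+>")

# Hypothetical sample text containing one invalid tag, <1>.
sample = "<B> <HTML> <1> <h1>"

strict_matches = strict.findall(sample)
loose_matches = loose.findall(sample)

print(strict_matches)  # ['<B>', '<HTML>', '<h1>']
print(loose_matches)   # ['<B>', '<HTML>', '<1>', '<h1>']
```

The star lets <B> match with zero repetitions of the second class, while the plus in the looser pattern accepts <1> because nothing forces the first character to be a letter.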

Limiting Repetition

Modern regex flavors, like those discussed in this tutorial, have an additional repetition operator that allows you to specify how many times a token can be repeated. The syntax is {min,max}, where min is zero or a positive integer indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is unlimited. So {0,} is the same as *, and {1,} is the same as +. Omitting both the comma and max tells the engine to repeat the token exactly min times.
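As a rough check of these equivalences, the sketch below compares which strings each pattern accepts in full. It assumes Python's `re` module; the helper name `accepted` and the sample strings are invented for this example.

```python
import re

# Strings of zero to four 'a' characters.
samples = ["", "a", "aa", "aaa", "aaaa"]

def accepted(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(accepted(r"a*") == accepted(r"a{0,}"))  # True: {0,} behaves like *
print(accepted(r"a+") == accepted(r"a{1,}"))  # True: {1,} behaves like +
print(accepted(r"a{2}"))                      # ['aa']
print(accepted(r"a{2,3}"))                    # ['aa', 'aaa']
```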

You could use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. Notice the use of the word boundaries.
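A small sketch shows both patterns at work, and what happens when the word boundaries are left out. It assumes Python's `re` module, and the test string is made up for illustration.

```python
import re

four_digits = re.compile(r"\b[1-9][0-9]{3}\b")        # 1000 to 9999
three_to_five = re.compile(r"\b[1-9][0-9]{2,4}\b")    # 100 to 99999

# Hypothetical input mixing numbers that are in range and out of range.
text = "42 100 999 1000 9999 12345 0123 123456"

in_four = four_digits.findall(text)
in_three_to_five = three_to_five.findall(text)
print(in_four)           # ['1000', '9999']
print(in_three_to_five)  # ['100', '999', '1000', '9999', '12345']

# Without word boundaries the pattern matches inside a longer number.
unbounded = re.findall(r"[1-9][0-9]{3}", "123456")
print(unbounded)         # ['1234']
```

The boundaries are what keep 12345 and 123456 from producing a partial four-digit match.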

Watch Out for The Greediness!

Suppose you want to use a regex to match an HTML tag. You know that the input will be a valid HTML file, so the regular expression does not need to exclude any invalid use of angle brackets. If it sits between angle brackets, it is an HTML tag.

Most people new to regular expressions will attempt to use <.+>. They will be surprised when they test it on a string like This is a <EM>first</EM> test. You might expect the regex to match <EM> and, when continuing after that match, </EM>.

But it does not. The regex will match <EM>first</EM>. Obviously not what we wanted. The reason is that the plus is greedy. That is, the plus causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail will the regex engine backtrack. That is, it will go back to the plus, make it give up the last iteration, and proceed with the remainder of the regex. Let's take a look inside the regex engine to see in detail how this works and why this causes our regex to fail. After that, I will present you with two possible solutions.

Like the plus, the star and the repetition using curly braces are greedy.
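The greedy behavior is easy to reproduce. The sketch below assumes Python's `re` module and uses the test string from the text; the variable names are only for illustration.

```python
import re

text = "This is a <EM>first</EM> test"

# All three quantifiers are greedy, so each one runs to the last '>'.
greedy_plus = re.findall(r"<.+>", text)
greedy_star = re.findall(r"<.*>", text)
greedy_braces = re.findall(r"<.{1,}>", text)

print(greedy_plus)    # ['<EM>first</EM>']
print(greedy_star)    # ['<EM>first</EM>']
print(greedy_braces)  # ['<EM>first</EM>']
```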

Looking Inside The Regex Engine

The first token in the regex is <. This is a literal. As we already know, the first place where it will match is the first < in the string. The next token is the dot, which matches any character except newlines. The dot is repeated by the plus. The plus is greedy. Therefore, the engine will repeat the dot as many times as it can. The dot matches E, so the regex continues to try to match the dot with the next character. M is matched, and the dot is repeated once more. The next character is the >. You should see the problem by now. The dot matches the >, and the engine continues repeating the dot. The dot will match all remaining characters in the string. The dot fails when the engine has reached the void after the end of the string. Only at this point does the regex engine continue with the next token: >.

So far, <.+ has matched <EM>first</EM> test and the engine has arrived at the end of the string. > cannot match here. The engine remembers that the plus has repeated the dot more often than is required. (Remember that the plus requires the dot to match only once.) Rather than admitting failure, the engine will backtrack. It will reduce the repetition of the plus by one, and then continue trying the remainder of the regex.

So the match of .+ is reduced to EM>first</EM> tes. The next token in the regex is still >. But now the next character in the string is the last t. Again, these cannot match, causing the engine to backtrack further. The total match so far is reduced to <EM>first</EM> te. But > still cannot match. So the engine continues backtracking until the match of .+ is reduced to EM>first</EM. Now, > can match the next character in the string. The last token in the regex has been matched. The engine reports that <EM>first</EM> has been successfully matched.
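The walk-through above can be imitated with a hand-rolled trace. This is a simplified sketch for this one pattern, not how a real regex engine is implemented: the function `trace_greedy_dot_plus` is hypothetical, it handles only <.+> from a known starting position, and it assumes the text contains no newlines. Python is used here as an assumption of the example.

```python
import re

def trace_greedy_dot_plus(text, start):
    """Trace <.+> starting at the '<' found at index start.

    Returns (match_or_None, spans), where spans lists what '<.+' covered
    each time the final '>' was attempted.
    Simplification: text is assumed to contain no newlines.
    """
    if text[start] != "<":
        return None, []
    end = len(text)          # greedy: the dot first runs to the end of the string
    spans = []
    while end > start + 1:   # the plus needs the dot to match at least once
        spans.append(text[start:end])
        if end < len(text) and text[end] == ">":
            return text[start:end + 1], spans
        end -= 1             # backtrack: give up the last repetition
    return None, spans

text = "This is a <EM>first</EM> test"
match, spans = trace_greedy_dot_plus(text, text.index("<"))
for span in spans:
    print(repr(span))        # shrinks by one character per backtracking step
print("match:", match)
print(match == re.search(r"<.+>", text).group())  # agrees with the real engine
```

The printed spans start at '<EM>first</EM> test' and lose one character at a time until '<EM>first</EM', at which point the following > can finally match.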

Remember that the regex engine is eager to return a match. It will not continue backtracking further to see if there is another possible match. It will report the first valid match it finds. Because of greediness, this is the leftmost longest match.

Laziness Instead of Greediness

The quick fix to this problem is to make the plus lazy instead of greedy. Lazy quantifiers are sometimes also called "ungreedy" or "reluctant". You can do that by putting a question mark after the plus in the regex. You can do the same with the star, the curly braces and the question mark itself. So our example becomes <.+?>. Let's have another look inside the regex engine.

Again, < matches the first < in the string. The next token is the dot, this time repeated by a lazy plus. This tells the regex engine to repeat the dot as few times as possible. The minimum is one. So the engine matches the dot with E. The requirement has been met, and the engine continues with > and M. This fails. Again, the engine will backtrack. But this time, the backtracking will force the lazy plus to expand rather than reduce its reach. So the match of .+ is expanded to EM, and the engine tries again to continue with >. Now, > is matched successfully. The last token in the regex has been matched. The engine reports that <EM> has been successfully matched. That's more like it.
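The lazy versions can be tried in the same way. This sketch assumes Python's `re` module, which uses the same trailing question mark syntax; the variable names are only for illustration.

```python
import re

text = "This is a <EM>first</EM> test"

# A question mark after the quantifier makes it lazy.
first_lazy = re.search(r"<.+?>", text).group()
lazy_plus = re.findall(r"<.+?>", text)
lazy_star = re.findall(r"<.*?>", text)
lazy_braces = re.findall(r"<.{1,}?>", text)

print(first_lazy)   # <EM>
print(lazy_plus)    # ['<EM>', '</EM>']
print(lazy_star)    # ['<EM>', '</EM>']
print(lazy_braces)  # ['<EM>', '</EM>']
```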

An Alternative to Laziness

In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class: <[^>]+>. This is better because of the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code. Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when using such a regex repeatedly in a tight loop in a script that you are writing, or perhaps in a custom syntax coloring scheme for EditPad Pro.
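On the test string, the negated character class finds the same tags as the lazy plus. The sketch below assumes Python's `re` module and only compares results; it does not measure speed.

```python
import re

text = "This is a <EM>first</EM> test"

# Greedy plus, but the class cannot run past the closing '>'.
negated = re.findall(r"<[^>]+>", text)
lazy = re.findall(r"<.+?>", text)

print(negated)          # ['<EM>', '</EM>']
print(negated == lazy)  # True for this input
```

To see the performance difference on your own data, timing both patterns with the standard `timeit` module is one option; the outcome depends on the input and the engine.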

Finally, remember that this tutorial only talks about regex-directed engines. Text-directed engines do not backtrack. They do not get the speed penalty, but they also do not support lazy repetition operators.

Repeating \Q...\E Escape Sequences

The \Q...\E sequence escapes a string of characters, matching them as literal characters. The escaped characters are treated as individual characters. If you place a quantifier after the \E, it is applied only to the last character. For example, if you apply \Q*\d+*\E+ to *\d+**\d+*, the match will be *\d+**. Only the asterisk is repeated. Java 4 and 5 have a bug that causes the whole \Q...\E sequence to be repeated, yielding the whole subject string as the match. This was fixed in Java 6.
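Python's `re` module does not support \Q...\E, so the sketch below swaps in `re.escape` as a stand-in. It is an analogy under that assumption, not a demonstration of \Q...\E itself, but the quantifier behaves the same way: it binds only to the last escaped character unless the literal is grouped.

```python
import re

literal = r"*\d+*"            # the text between \Q and \E in the example
subject = r"*\d+**\d+*"

# Quantifier placed directly after the escaped literal:
# it applies to the final asterisk only.
last_char_repeated = re.search(re.escape(literal) + "+", subject).group()
print(last_char_repeated)     # *\d+**

# Grouping the literal repeats the whole sequence instead.
whole_repeated = re.search("(?:" + re.escape(literal) + ")+", subject).group()
print(whole_repeated)         # *\d+**\d+*
```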

