正则表达式学习指南(十一)----Quantifiers(Repetition)

Repetition with Star and Plus

I already introduced one repetition operator or quantifier: the question mark. It tells the engine to attempt match the preceding token zero times or once, in effect making it optional.

The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more.<[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. The sharp brackets areliterals. The first character class matches a letter. The second character class matches a letter or digit. The star repeats the second character class. Because we used the star, it's OK if the second character class matches nothing. So our regex will match a tag like <B>. When matching <HTML>, the first character class will match H. The star will cause the second character class to be repeated three times, matchingT, M and L with each step.

I could also have used <[A-Za-z0-9]+>. I did not, because this regex would match<1>, which is not a valid HTML tag. But this regex may be sufficient if you know the string you are searching through does not contain any such invalid tags.

Limiting Repetition

Modern regex flavors, like those discussed in this tutorial, have an additional repetition operator that allows you to specify how many times a token can be repeated. The syntax is{min,max}, where min is a positive integer number indicating the minimum number of matches, andmax is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present butmax is omitted, the maximum number of matches is infinite. So {0,} is the same as *, and {1,} is the same as+. Omitting both the comma and max tells the engine to repeat the token exactlymin times.

You could use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999.\b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. Notice the use of theword boundaries.

Watch Out for The Greediness!

Suppose you want to use a regex to match an HTML tag. You know that the input will be a valid HTML file, so the regular expression does not need to exclude any invalid use of sharp brackets. If it sits between sharp brackets, it is an HTML tag.

Most people new to regular expressions will attempt to use <.+>. They will be surprised when they test it on a string likeThis is a <EM>first</EM> test. You might expect the regex to match<EM> and when continuing after that match, </EM>.

But it does not. The regex will match <EM>first</EM>. Obviously not what we wanted. The reason is that the plus isgreedy. That is, the plus causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail, will the regex enginebacktrack. That is, it will go back to the plus, make it give up the last iteration, and proceed with the remainder of the regex. Let's take a look inside the regex engine to see in detail how this works and why this causes our regex to fail. After that, I will present you with two possible solutions.

Like the plus, the star and the repetition using curly braces are greedy.

Looking Inside The Regex Engine

The first token in the regex is <. This is a literal. As we already know, the first place where it will match is the first< in the string. The next token is the dot, which matches any character except newlines. The dot is repeated by the plus. The plus isgreedy. Therefore, the engine will repeat the dot as many times as it can. The dot matchesE, so the regex continues to try to match the dot with the next character.M is matched, and the dot is repeated once more. The next character is the>. You should see the problem by now. The dot matches the>, and the engine continues repeating the dot. The dot will match all remaining characters in the string. The dot fails when the engine has reached the void after the end of the string. Only at this point does the regex engine continue with the next token: >.

So far, <.+ has matched <EM>first</EM> test and the engine has arrived at the end of the string.> cannot match here. The engine remembers that the plus has repeated the dot more often than is required. (Remember that the plusrequires the dot to match only once.) Rather than admitting failure, the engine willbacktrack. It will reduce the repetition of the plus by one, and then continue trying the remainder of the regex.

So the match of .+ is reduced to EM>first</EM> tes. The next token in the regex is still>. But now the next character in the string is the last t. Again, these cannot match, causing the engine to backtrack further. The total match so far is reduced to<EM>first</EM> te. But > still cannot match. So the engine continues backtracking until the match of.+ is reduced to EM>first</EM. Now,> can match the next character in the string. The last token in the regex has been matched. The engine reports that<EM>first</EM> has been successfully matched.

Remember that the regex engine is eager to return a match. It will not continue backtracking further to see if there is another possible match. It will report the first valid match it finds. Because of greediness, this is the leftmost longest match.

Laziness Instead of Greediness

The quick fix to this problem is to make the plus lazy instead of greedy. Lazy quantifiers are sometimes also called "ungreedy" or "reluctant". You can do that by putting a question mark behind the plus in the regex. You can do the same with the star, the curly braces and the question mark itself. So our example becomes <.+?>. Let's have another look inside the regex engine.

Again, < matches the first < in the string. The next token is the dot, this time repeated by a lazy plus. This tells the regex engine to repeat the dot as few times as possible. The minimum is one. So the engine matches the dot with E. The requirement has been met, and the engine continues with> and M. This fails. Again, the engine willbacktrack. But this time, the backtracking will force the lazy plus to expand rather than reduce its reach. So the match of.+ is expanded to EM, and the engine tries again to continue with>. Now, > is matched successfully. The last token in the regex has been matched. The engine reports that<EM> has been successfully matched. That's more like it.

An Alternative to Laziness

In this case, there is a better option than making the plus lazy. We can use a greedy plus and anegated character class: <[^>]+>. The reason why this is better is because of the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code. Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when using such a regex repeatedly in a tight loop in a script that you are writing, or perhaps in a custom syntax coloring scheme forEditPad Pro.

Finally, remember that this tutorial only talks about regex-directed engines. Text-directed engines do not backtrack. They do not get the speed penalty, but they also do not support lazy repetition operators.

Repeating \Q...\E Escape Sequences

The \Q...\E sequence escapes a string of characters, matching them as literal characters. The escaped characters are treated as individual characters. If you place a quantifier after the\E, it will only be applied to the last character. E.g. if you apply \Q*\d+*\E+ to *\d+**\d+*, the match will be *\d+**. Only the asterisk is repeated. Java 4 and 5 have a bug that causes the whole \Q..\E sequence to be repeated, yielding the whole subject string as the match. This was fixed in Java 6.

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值