Using ? to Prevent Over Matching

Preventing Over Matching

? matches are limited in scope (zero or one match only), and so are interval matches when using exact amounts or ranges. But the other forms of repetition described in this lesson can match an unlimited number of characters, sometimes too many.

All the examples thus far were carefully chosen so as not to run into over matching, but consider this next example. The text that follows is part of a Web page and contains text with embedded HTML <B> tags. The regular expression needs to match any text within <B> tags (perhaps so as to be able to replace the formatting). Here's the example:

Text
This offer is not available to customers
living in <B>AK</B> and <B>HI</B>.

RegEx
<[Bb]>.*</[Bb]>

Result
This offer is not available to customers
living in <B>AK</B> and <B>HI</B>.



Analysis

<[Bb]> matches the opening <B> tag (in either uppercase or lowercase), and </[Bb]> matches the closing </B> tag (also in either uppercase or lowercase). But instead of two matches, only one was found; the .* matched everything after the first <B> until the last </B> so that the text AK</B> and <B>HI was matched. This includes the text we wanted matched, but also other instances of the tags as well.

The reason for this is that metacharacters such as * and + are greedy; that is, they look for the greatest possible match as opposed to the smallest. It is almost as if the matching starts from the end of the text and works backward until a match is found, in contrast to starting from the beginning. This is deliberate and by design; quantifiers are greedy.
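The greedy behavior can be reproduced in a few lines. This is a minimal sketch that assumes Python's standard `re` module (the lesson itself is not tied to any language); the variable names are illustrative only.

```python
import re

# The sample text from the lesson; all of the tags sit on the second line.
text = (
    "This offer is not available to customers\n"
    "living in <B>AK</B> and <B>HI</B>."
)

# Greedy .* runs to the last </B> it can reach, so only one match comes back.
greedy_matches = re.findall(r"<[Bb]>.*</[Bb]>", text)

print(len(greedy_matches))  # 1
print(greedy_matches[0])    # <B>AK</B> and <B>HI</B>
```

The single result spans both tag pairs, which is exactly the over matching described above.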

But what if you don't want greedy matching? The solution is to use lazy versions of these quantifiers (they are referred to as being lazy because they match the fewest characters instead of the most). Lazy quantifiers are defined by appending a ? to the quantifier being used, and each of the greedy quantifiers has a lazy equivalent as listed in Table 5.1.

Table 5.1. Greedy and Lazy Quantifiers

Greedy    Lazy
*         *?
+         +?
{n,}      {n,}?
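To see each pair from Table 5.1 side by side, the following sketch applies the greedy and the lazy form to the same short string. It assumes Python's `re` module, and the sample string and the choice of n = 2 for the interval are made up for illustration.

```python
import re

sample = "aaaa"

# (greedy pattern, lazy pattern) pairs mirroring Table 5.1, with n = 2.
pairs = [
    ("a*", "a*?"),
    ("a+", "a+?"),
    ("a{2,}", "a{2,}?"),
]

results = {}
for greedy, lazy in pairs:
    # re.match anchors at the start of the string, so each call shows
    # how many characters that quantifier consumes from position 0.
    results[greedy] = re.match(greedy, sample).group()
    results[lazy] = re.match(lazy, sample).group()
    print(f"{greedy!r} -> {results[greedy]!r}   {lazy!r} -> {results[lazy]!r}")
```

Each greedy form consumes all four characters, while the lazy forms stop at the minimum they are allowed: nothing for `*?`, one character for `+?`, and two for `{2,}?`.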

*? is the lazy version of *, so let's revisit our example, this time using *?:

Text
This offer is not available to customers
living in <B>AK</B> and <B>HI</B>.

RegEx
<[Bb]>.*?</[Bb]>

Result
This offer is not available to customers
living in <B>AK</B> and <B>HI</B>.



Analysis

That worked. By using the lazy *?, only <B>AK</B> was matched in the first match, allowing <B>HI</B> to be matched independently.
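The lazy pattern can be checked the same way. Again this is only a sketch assuming Python's `re` module; the capture group in the second pattern is an addition for illustration and is not part of the lesson's regular expression.

```python
import re

text = (
    "This offer is not available to customers\n"
    "living in <B>AK</B> and <B>HI</B>."
)

# Lazy .*? stops at the first </B> it can reach, so each tag pair
# becomes its own match.
lazy_matches = re.findall(r"<[Bb]>.*?</[Bb]>", text)
print(lazy_matches)  # ['<B>AK</B>', '<B>HI</B>']

# Illustrative variation: a capture group returns only the text inside the tags.
inner_text = re.findall(r"<[Bb]>(.*?)</[Bb]>", text)
print(inner_text)  # ['AK', 'HI']
```

With two separate matches available, each bolded item can be replaced or reformatted on its own.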
