Python regex matching aabb: matching multiple times at each index of the search string

I'm looking for a way to make the finditer function of Python's re module, or of the newer regex module, match all possible variations of a particular pattern, overlapping or otherwise. I am aware of using lookaheads to get matches without consuming the search string, but I still get only one match per index, where there could be more than one.

The regex I am using is something like this:

(?=A{2}[BA]{1,6}A{2})

so in the string:

AABAABBAA

it should be able to match:

AABAA

AABAABBAA

AABBAA

but currently it will only match the last two of these. I realise it is to do with the greediness of the [BA]{1,6}. Is there a way to make the regex match everything from the laziest to the greediest possible pattern?

Solution

"I realise it is to do with the greediness of the [BA]{1,6}. Is there a way to make the regex match everything from the laziest to the greediest possible pattern?"

The problem is twofold.

1. A regex engine reports at most one match starting at any given character position.

2. There is no quantifier mode between lazy and greedy; it is one or the other.

Setting problem 1 aside for the moment:

Problem 2:

A range such as {1,6} can be satisfied by 1, 2, 3, 4, 5 or 6 repetitions of a construct (character) at a given position. To get each of those reported, you have to specify {1}, {2}, {3}, {4}, {5} and {6} as independent, optional alternatives at that position. A single range {1,6} is clearly not going to do that.

A range can be told to take the minimum number of repetitions by adding the lazy modifier, as in {1,6}?. But this finds only the smallest count that lets the overall match succeed, no more, no less.
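As a quick illustration of that limit, here is a minimal sketch using Python's standard re module and the sample string from the question. Anchoring both forms at position 0 shows that each quantifier mode yields exactly one result there:

```python
import re

text = "AABAABBAA"

# Lazy range: stops at the first repetition count that lets the tail A{2} match.
lazy = re.match(r"A{2}[BA]{1,6}?A{2}", text).group()

# Greedy range: takes as many repetitions as possible, backtracking as needed.
greedy = re.match(r"A{2}[BA]{1,6}A{2}", text).group()

print(lazy)    # AABAA
print(greedy)  # AABAABBAA
```

Any repetition count strictly between those two that would also match is never reported by either form.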

Finally,

Problem 1:

When a regex engine matches, it always resumes at the end of that match, i.e. it advances the current position by the length of the last match. After a zero-length match, such as a bare assertion, it artificially bumps the position one character forward so that it does not match at the same spot forever.
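A minimal sketch of that behaviour in Python, assuming the question's pattern is wrapped in a capture group so the text can be read back from the zero-length match. finditer visits each position once, so the greedy range reports only its longest variant there:

```python
import re

text = "AABAABBAA"

# The lookahead consumes nothing; the capture group inside it holds the text.
pattern = re.compile(r"(?=(A{2}[BA]{1,6}A{2}))")

found = [(m.start(), m.group(1)) for m in pattern.finditer(text)]
print(found)  # [(0, 'AABAABBAA'), (3, 'AABBAA')]
```

The shorter AABAA at position 0 is skipped because the engine has already moved on to position 1.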

So, given these two problems, one can use these strengths and weaknesses to come up with a workaround, and live with some side effects.

Workarounds:

Put every possible alternative at a position into its own optional lookahead assertion with a capture group. Each match at a position then carries a list of groups, one per variation. So, if 3 of 6 possible variant groups matched, the groups that hold values are the variants found.

If none of the groups has a value, no variants were found at that position. That can happen because all of the assertions are optional. To avoid reporting a match at such positions, a final conditional can be appended (e.g. (?(1)|(?(2)|(?!))), extended to cover every group).

Let's use your range as an example. We will use the conditional at the end to verify that a group matched, but it could be done without it.

Note that with this range example, the final match has both groups holding the identical value. The approach does not ensure unique matches at a position (the example following this one shows how to avoid that).

# (?=(A{2}[BA]{1,6}?A{2}))?(?=(A{2}[BA]{1,6}A{2}))?(?(1)|(?(2)|(?!)))

(?=
     (                        # (1 start)
          A{2}
          [BA]{1,6}?
          A{2}
     )                        # (1 end)
)?
(?=
     (                        # (2 start)
          A{2}
          [BA]{1,6}
          A{2}
     )                        # (2 end)
)?
(?(1)
  |  (?(2)
       |  (?!)
     )
)

Output:

** Grp 1 - ( pos 0 , len 5 )
AABAA
** Grp 2 - ( pos 0 , len 9 )
AABAABBAA

-------------

** Grp 1 - ( pos 3 , len 6 )
AABBAA
** Grp 2 - ( pos 3 , len 6 )
AABBAA
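A rough sketch of running this pattern from Python, on the assumption that the standard re module accepts the optional lookaheads and the (?(1)...) conditional exactly as written above. Each match object then exposes both groups for one position:

```python
import re

text = "AABAABBAA"

# Lazy and greedy variants as optional lookaheads, followed by the
# conditional that fails when neither group captured anything.
pattern = re.compile(
    r"(?=(A{2}[BA]{1,6}?A{2}))?"
    r"(?=(A{2}[BA]{1,6}A{2}))?"
    r"(?(1)|(?(2)|(?!)))"
)

# One line per reported position: start index, then (group 1, group 2).
for m in pattern.finditer(text):
    print(m.start(), m.groups())
```

If the assumption holds, the printed groups should correspond to the positions and values in the output listed above, including the duplicate at position 3.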

The same approach, but without the range problem. Here, each repetition count is explicitly defined as its own unique construct. Note the unique values at each position.

# (?=(A{2}[BA]{1}A{2}))?(?=(A{2}[BA]{2}A{2}))?(?=(A{2}[BA]{3}A{2}))?(?=(A{2}[BA]{4}A{2}))?(?=(A{2}[BA]{5}A{2}))?(?=(A{2}[BA]{6}A{2}))?(?(1)|(?(2)|(?(3)|(?(4)|(?(5)|(?(6)|(?!)))))))

(?=
     (                        # (1 start)
          A{2}
          [BA]{1}
          A{2}
     )                        # (1 end)
)?
(?=
     (                        # (2 start)
          A{2}
          [BA]{2}
          A{2}
     )                        # (2 end)
)?
(?=
     (                        # (3 start)
          A{2}
          [BA]{3}
          A{2}
     )                        # (3 end)
)?
(?=
     (                        # (4 start)
          A{2}
          [BA]{4}
          A{2}
     )                        # (4 end)
)?
(?=
     (                        # (5 start)
          A{2}
          [BA]{5}
          A{2}
     )                        # (5 end)
)?
(?=
     (                        # (6 start)
          A{2}
          [BA]{6}
          A{2}
     )                        # (6 end)
)?
(?(1)|(?(2)|(?(3)|(?(4)|(?(5)|(?(6)|(?!)))))))

Output:

** Grp 1 - ( pos 0 , len 5 )
AABAA
** Grp 2 - NULL
** Grp 3 - NULL
** Grp 4 - NULL
** Grp 5 - ( pos 0 , len 9 )
AABAABBAA
** Grp 6 - NULL

------------------

** Grp 1 - NULL
** Grp 2 - ( pos 3 , len 6 )
AABBAA
** Grp 3 - NULL
** Grp 4 - NULL
** Grp 5 - NULL
** Grp 6 - NULL

Finally, on each match, all you need to do is grab the capture groups that have values and put them into an array.
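That last step might look something like the sketch below in Python. The helper names build_pattern and variants_by_position are made up for illustration, and the pattern is generated in a loop rather than typed out. The trailing conditional is left off here; filtering out empty positions in code plays the same role. It again assumes that re accepts a quantified lookahead.

```python
import re

def build_pattern(min_len=1, max_len=6):
    """One optional lookahead with one capture group per repetition count."""
    return re.compile("".join(
        rf"(?=(A{{2}}[BA]{{{n}}}A{{2}}))?" for n in range(min_len, max_len + 1)
    ))

def variants_by_position(text, pattern):
    """Map each start index to the list of variants captured there."""
    result = {}
    for m in pattern.finditer(text):
        # Groups whose lookahead did not match are None; keep the rest.
        hits = [g for g in m.groups() if g is not None]
        if hits:  # skip positions where no variant was found
            result[m.start()] = hits
    return result

print(variants_by_position("AABAABBAA", build_pattern()))
```

For the sample string this should collect AABAA and AABAABBAA at index 0 and AABBAA at index 3, which are the three matches the question asked for.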
