Python regular expression matching aabb: matching multiple times at each index of the search string

Latest recommended article published 2022-08-03 16:39:29

weixin_39897267

Article tags: Python regular expression matching aabb

I'm looking for a way to make the finditer function of the Python re module, or of the newer regex module, match all possible variations of a particular pattern, overlapping or otherwise. I am aware of using lookaheads to get matches without consuming the search string, but I still only get one match per index, where I could get more than one.

The regex I am using is something like this:

(?=A{2}[BA]{1,6}A{2})

so in the string:

AABAABBAA

it should be able to match:

AABAA

AABAABBAA

AABBAA

but currently it will only match the last two of these. I realise it is to do with the greediness of the [BA]{1,6}. Is there a way to make the regex match everything from the laziest to the greediest possible pattern?

Solution

"I realise it is to do with the greediness of the [BA]{1,6}. Is there a way to make the regex match everything from the laziest to the greediest possible pattern?"

The problem is twofold.

1. A regex engine reports at most one match per starting position.

2. There is no quantifier mode between lazy and greedy; it is either one or the other.
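The first point can be reproduced with the string from the question. This is a minimal sketch using the standard re module; the capture group inside the lookahead is an addition (not in the original pattern) so that the text seen by the zero-width assertion can be read back.

```python
import re

text = "AABAABBAA"

# The capture group inside the lookahead is added here only to expose
# what the assertion matched; the lookahead itself consumes nothing.
pattern = re.compile(r"(?=(A{2}[BA]{1,6}A{2}))")

# One result per starting index at most, and always the greedy one.
matches = [(m.start(), m.group(1)) for m in pattern.finditer(text)]
print(matches)  # [(0, 'AABAABBAA'), (3, 'AABBAA')]
```

The shorter AABAA at index 0 never shows up, because the engine stops at the first successful (greedy) alternative for that index.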

Skipping problem 1 for the moment:

Problem 2:

With a range such as {1,6}, the construct (here a character class) could match 1, 2, 3, 4, 5, or 6 times at a given position.

To capture every one of those, you'd have to specify independent {1}, {2}, {3}, {4}, {5}, {6} quantifiers as optional alternatives at that position.

Clearly, a single range {1,6} is not going to work.

A range can be told to find the minimum amount by adding the lazy modifier, as in {1,6}?, but this will only find the smallest amount it can, no more, no less.
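The idea of testing each exact repeat count separately can also be sketched outside a single regex. The following is not the single-pattern approach described in this answer but a plain Python loop swapped in for illustration; the helper name all_variants and its lo/hi parameters are hypothetical.

```python
import re

def all_variants(text, lo=1, hi=6):
    """Collect (start, substring) for every exact repeat count lo..hi."""
    # One pattern per exact count: A{2}[BA]{1}A{2}, A{2}[BA]{2}A{2}, ...
    patterns = [re.compile(r"A{2}[BA]{%d}A{2}" % n) for n in range(lo, hi + 1)]
    found = []
    for pos in range(len(text)):
        for pat in patterns:
            m = pat.match(text, pos)  # match() is anchored at pos
            if m:
                found.append((pos, m.group(0)))
    return found

text = "AABAABBAA"
variants = all_variants(text)
print(variants)  # [(0, 'AABAA'), (0, 'AABAABBAA'), (3, 'AABBAA')]

# For comparison, a lazy or greedy range yields only one extreme each.
lazy = re.match(r"A{2}[BA]{1,6}?A{2}", text).group(0)
greedy = re.match(r"A{2}[BA]{1,6}A{2}", text).group(0)
print(lazy, greedy)  # AABAA AABAABBAA
```

Because each pattern has a fixed length, no two patterns can return the same substring at the same position, so the results are unique without extra filtering.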

Finally,

Problem 1:

When a regex engine matches, it always advances the current position by the length of the last match.

For a zero-length match, such as a pattern made only of assertions, it artificially bumps the position one character forward.
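Both behaviours are easy to observe with re.finditer. The strings below are made up for the illustration and are not from the question.

```python
import re

# A consuming match moves the scan position past the matched text,
# so the overlapping "AA" at index 1 is never reported.
consuming = [m.span() for m in re.finditer(r"AA", "AAAA")]
print(consuming)  # [(0, 2), (2, 4)]

# A lookahead matches with zero length, so the engine steps forward
# by one character itself and tries again at the next index.
zero_width = [m.start() for m in re.finditer(r"(?=AA)", "AAAA")]
print(zero_width)  # [0, 1, 2]
```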

So, given these two problems, one can use these strengths and weaknesses to come up with a workaround, and live with some side effects.

Workaround:

Put all the possible alternatives at a position into optional lookahead assertions, each with its own capture group.

Each match at a position then carries a list of groups, and each group that has a value holds one variation.

So, if you've matched 3 variations out of 6 possible variant groups, the groups with values are the variants.

If none of the groups have values, no variants were found at that position.

This can happen because all of the assertions are optional.

To avoid reporting matches at those positions, a final conditional can be used to reject them, e.g. (?(1)|(?(2)|(?!))), and so on.

Let's use your range as an example.

We will use the conditional at the end to verify that a group matched, but it could be done without it.

Note that using this range example causes an overlap with identical values in the final match. It does not ensure unique matches at a position (the example following this one shows how to avoid that).

# (?=(A{2}[BA]{1,6}?A{2}))?(?=(A{2}[BA]{1,6}A{2}))?(?(1)|(?(2)|(?!)))

(?=
  (                 # (1 start)
    A{2}
    [BA]{1,6}?
    A{2}
  )                 # (1 end)
)?
(?=
  (                 # (2 start)
    A{2}
    [BA]{1,6}
    A{2}
  )                 # (2 end)
)?
(?(1)
  | (?(2)
      | (?!)
    )
)

Output:

** Grp 1 - ( pos 0 , len 5 )

AABAA

** Grp 2 - ( pos 0 , len 9 )

AABAABBAA

-------------

** Grp 1 - ( pos 3 , len 6 )

AABBAA

** Grp 2 - ( pos 3 , len 6 )

AABBAA
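The same two-group result can be reproduced with Python's re module. This sketch is a simplified variant, not the exact pattern above: it assumes that the optional quantifiers and the trailing conditional can be dropped, since the lazy and greedy lookaheads accept exactly the same set of strings and therefore succeed or fail together.

```python
import re

pattern = re.compile(
    r"""
    (?= ( A{2} [BA]{1,6}? A{2} ) )   # group 1: lazy, shortest variant
    (?= ( A{2} [BA]{1,6}  A{2} ) )   # group 2: greedy, longest variant
    """,
    re.VERBOSE,
)

results = [
    (m.start(), m.group(1), m.group(2))
    for m in pattern.finditer("AABAABBAA")
]
print(results)
# [(0, 'AABAA', 'AABAABBAA'), (3, 'AABBAA', 'AABBAA')]

# Collapse the identical values that the range version produces.
unique = sorted({(pos, s) for pos, a, b in results for s in (a, b)})
print(unique)
# [(0, 'AABAA'), (0, 'AABAABBAA'), (3, 'AABBAA')]
```

Note that this only recovers the two extremes at each position; any in-between lengths still need the explicit per-count groups.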

Same, but without the range problem.

Here, we explicitly define unique constructs.

Note the unique values at each position.

# (?=(A{2}[BA]{1}A{2}))?(?=(A{2}[BA]{2}A{2}))?(?=(A{2}[BA]{3}A{2}))?(?=(A{2}[BA]{4}A{2}))?(?=(A{2}[BA]{5}A{2}))?(?=(A{2}[BA]{6}A{2}))?(?(1)|(?(2)|(?(3)|(?(4)|(?(5)|(?(6)|(?!)))))))

(?=

( # (1 start)

A{2}

[BA]{1}

A{2}

) # (1 end)

)?

(?=

( # (2 start)

A{2}

[BA]{2}

A{2}

) # (2 end)