How to extract the substring between two markers?

Translated from: How to extract the substring between two markers?

Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.

I only know the few characters directly before (AAA) and directly after (ZZZ) the part I am interested in (1234).

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

How can I do the same thing in Python?


Reply #1

Reference: https://stackoom.com/question/jA5L/如何提取两个标记之间的子字符串


Reply #2

Just in case somebody has to do the same thing that I did: I had to extract everything inside the parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:

import re

# Capture the text inside the parentheses; `line` holds the input line.
regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'

I.e., you need to escape the parentheses with a backslash (\). Though this is more a question about regular expressions than about Python.

Also, in some cases you may see an 'r' prefix before the regex definition. If there is no r prefix, you need to double the backslashes, as with escape characters in C. The original answer links to more discussion on that.
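
To make the point about the r prefix concrete, here is a minimal sketch comparing the two spellings of the same pattern. The variable names (plain, raw, same_pattern, name) are only illustrative and are not from the original answer.

```python
import re

line = 'US president (Barack Obama) met with ...'

# Non-raw literal: each regex backslash must be doubled in the Python source.
plain = '\\((.*?)\\)'
# Raw literal: backslashes reach the regex engine unchanged.
raw = r'\((.*?)\)'

# Both literals produce the same pattern string.
same_pattern = (plain == raw)
name = re.search(raw, line).group(1)
print(same_pattern, name)
```

Both forms compile to the same regular expression; the raw form is usually easier to read.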


Reply #3

>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
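
Note that the chained split raises IndexError when the first marker is missing, and silently returns the whole remainder when the second one is missing. As a hedged alternative, the sketch below uses str.partition instead; the helper name between is hypothetical and not part of the original answer.

```python
def between(text, start_marker, end_marker):
    """Return the text between the two markers, or None if either is missing."""
    _, found_start, rest = text.partition(start_marker)
    if not found_start:
        return None
    middle, found_end, _ = rest.partition(end_marker)
    if not found_end:
        return None
    return middle

result = between('/tmp/10508.constantstring', '/tmp/', '.constantstring')
print(result)
```

str.partition splits on the first occurrence only, so the helper returns the text between the first start marker and the first end marker that follows it.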

Reply #4

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

You could do the same with the re.sub function, using the same regex.

>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'

In basic sed, capturing groups are written as \(..\), but in Python they are written as (..).
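
One caveat worth illustrating: when the pattern does not match, re.sub returns the input string unchanged rather than signalling a failure. The sketch below shows both cases; the second input string is made up for the example.

```python
import re

pattern = r'.*AAA(.*)ZZZ.*'

# Markers present: the whole string is replaced by the captured group.
hit = re.sub(pattern, r'\1', 'gfgfdAAA1234ZZZuijjk')
# Markers absent: nothing is replaced, so the input comes back as-is.
miss = re.sub(pattern, r'\1', 'no markers here')
print(hit)
print(miss)
```

If you need to tell "no match" apart from a real result, re.search may be the safer choice, since it returns None when nothing matches.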


Reply #5

>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'

Then you can use regexps with the re module as well, if you want, but that's not necessary in your case.
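
The find-based snippet assumes both markers are present. str.find returns -1 when a marker is missing, so the slice would then silently yield the wrong text. A minimal guarded sketch follows; the function name extract_between is hypothetical.

```python
def extract_between(s, start_marker, end_marker):
    """Return the substring between the markers, or None if either is absent."""
    start = s.find(start_marker)
    if start == -1:
        return None
    # Skip past the start marker itself, whatever its length.
    start += len(start_marker)
    end = s.find(end_marker, start)
    if end == -1:
        return None
    return s[start:end]

print(extract_between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))
print(extract_between('gfgfd1234ZZZuijjk', 'AAA', 'ZZZ'))
```

Using len(start_marker) instead of the hard-coded 3 keeps the helper correct for markers of any length.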


Reply #6

import re
print(re.search(r'AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))
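
Two details are worth guarding against here: re.search returns None when the markers are absent (so .group(1) would raise AttributeError), and markers containing regex metacharacters need escaping. The sketch below handles both with re.escape; the helper name search_between is hypothetical.

```python
import re

def search_between(text, start_marker, end_marker):
    """Return the shortest text between the markers, or None if not found."""
    # re.escape makes the markers match literally, even if they contain
    # characters such as '(' or '.' that are special in a regex.
    pattern = re.escape(start_marker) + r'(.*?)' + re.escape(end_marker)
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(search_between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))
print(search_between('US president (Barack Obama) met with ...', '(', ')'))
```

The non-greedy (.*?) stops at the first end marker after the start marker, which matches the behaviour of the one-liner above.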