Translated from: How to extract the substring between two markers?
Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.
I only know the few characters directly before (AAA) and directly after (ZZZ) the part I am interested in (1234).
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
How can I do the same thing in Python?
#1
Reference: https://stackoom.com/question/jA5L/如何提取两个标记之间的子字符串
#2
Just in case somebody has to do the same thing that I did: I had to extract everything inside parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:
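import re
line = 'US president (Barack Obama) met with ...'  # sample input from the text above
regex = r'.*\((.*?)\).*'  # raw string; \( and \) match literal parentheses
matches = re.search(regex, line)
line = matches.group(1) + '\n'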
That is, you need to escape the parentheses with a backslash (\). That said, this is more a question about regular expressions than about Python.
Also, in some cases you may see an 'r' prefix before the regex definition. Without the r prefix, backslashes in the pattern have to be escaped themselves, as in C string literals.
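To make the point about the r prefix concrete, here is a small sketch. The variable names are only illustrative, and the pattern is a shortened form of the one above that keeps just the parenthesized group. It shows that a raw string and a manually escaped string produce the same pattern:

```python
import re

# Raw string: backslashes reach the regex engine unchanged.
raw_pattern = r'\((.*?)\)'

# Ordinary string: each backslash has to be doubled to survive
# Python's own string escaping, much like in a C string literal.
escaped_pattern = '\\((.*?)\\)'

# Both spellings denote the same five-part pattern text.
same = raw_pattern == escaped_pattern

line = 'US president (Barack Obama) met with ...'
name = re.search(raw_pattern, line).group(1)
print(same, name)
```

The raw form is usually easier to read, which is presumably why it shows up so often in regex examples.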
#3
>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
'10508'
#4
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
You could do the same with the re.sub function, using the same regex.
>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'
In basic sed, a capturing group is written as \(..\), but in Python it is written as (..).
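One thing worth knowing about the substitution approach: when the pattern does not match, re.sub returns the input unchanged rather than signalling a failure. A short sketch of that behaviour, where the second input string is made up for illustration:

```python
import re

pattern = r'.*AAA(.*)ZZZ.*'

# Markers present: the whole string is replaced by group 1.
hit = re.sub(pattern, r'\1', 'gfgfdAAA1234ZZZuijjk')

# Markers absent: nothing is substituted, so the original string
# comes back as-is. The caller cannot tell this apart from a match
# without an extra check.
miss = re.sub(pattern, r'\1', 'no markers here')

print(repr(hit), repr(miss))
```

If you need to distinguish "not found" from a real result, re.search with an explicit check on the returned match object may be a better fit.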
#5
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'
You can use regular expressions with the re module as well, if you want, but that's not necessary in your case.
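The find-based approach can be wrapped in a small helper. This is only a sketch: the function name between and the choice to return None for a missing marker are assumptions, not part of the original answer. The explicit checks matter because str.find returns -1 when the marker is absent, and the slice would then quietly produce the wrong substring.

```python
def between(s, start_marker, end_marker):
    """Return the text between the two markers, or None if either is missing.

    Hypothetical helper for illustration.
    """
    i = s.find(start_marker)
    if i == -1:
        return None  # start marker not found
    start = i + len(start_marker)
    end = s.find(end_marker, start)  # search only after the start marker
    if end == -1:
        return None  # end marker not found
    return s[start:end]


print(between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))
print(between('gfgfd1234ZZZuijjk', 'AAA', 'ZZZ'))
```

Using len(start_marker) instead of a hard-coded 3 keeps the helper correct for markers of any length.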
#6
import re
# print() is a function in Python 3; (.*?) is a non-greedy capture
print(re.search(r'AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))
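The ? in (.*?) makes the capture non-greedy, which only matters when the markers occur more than once. The sketch below uses a made-up input with two marker pairs to show the difference, and guards against re.search returning None when there is no match:

```python
import re

text = 'xxAAA12ZZZyyAAA34ZZZ'  # hypothetical input with two marker pairs

# Non-greedy: stops at the first ZZZ after the first AAA.
lazy = re.search(r'AAA(.*?)ZZZ', text).group(1)

# Greedy: runs on to the last ZZZ in the string.
greedy = re.search(r'AAA(.*)ZZZ', text).group(1)

# findall collects every non-overlapping match of the group.
every = re.findall(r'AAA(.*?)ZZZ', text)

# re.search returns None when nothing matches, so check before
# calling .group() to avoid an AttributeError.
m = re.search(r'AAA(.*?)ZZZ', 'no markers here')
missing = m.group(1) if m else None

print(lazy, greedy, every, missing)
```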