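You could run a date parser on every substring of the text and pick the first date it finds. Of course, such a solution will either catch things that are not dates, or miss things that are dates, or most likely both.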
Let me provide an example that uses dateutil.parser to catch anything that looks like a date:

from itertools import chain
import re

import dateutil.parser

# Tokens that carry no date information on their own.
# Add more strings that confuse the parser to the list.
UNINTERESTING = set(chain(dateutil.parser.parserinfo.JUMP,
                          dateutil.parser.parserinfo.PERTAIN,
                          ['a']))


def _get_date(tokens):
    """Return (end, date) for the longest parseable prefix of tokens, or None."""
    for end in range(len(tokens), 0, -1):
        region = tokens[:end]
        # Skip regions made up only of whitespace and filler words.
        if all(token.isspace() or token in UNINTERESTING
               for token in region):
            continue
        text = ''.join(region)
        try:
            date = dateutil.parser.parse(text)
            return end, date
        except (ValueError, OverflowError):
            pass
    return None


def find_dates(text, max_tokens=50, allow_overlapping=False):
    # Split into runs of non-whitespace and runs of whitespace, dropping empty strings.
    tokens = [token for token in re.split(r'(\S+|\W+)', text) if token]
    skip_dates_ending_before = 0
    for start in range(len(tokens)):
        region = tokens[start:start + max_tokens]
        result = _get_date(region)
        if result is not None:
            end, date = result
            # _get_date returns an offset into region; convert it to an
            # index into tokens before comparing against earlier matches.
            absolute_end = start + end
            if allow_overlapping or absolute_end > skip_dates_ending_before:
                skip_dates_ending_before = absolute_end
                yield date


test = """Adelaide was born in Finchley, North London on 12 May 1999. She was a
child during the Daleks' abduction and invasion of Earth in 2009.
On 1st July 2058, Bowie Base One became the first Human colony on Mars. It
was commanded by Captain Adelaide Brooke, and initially seemed to prove that
it was possible for Humans to live long term on Mars."""

print("With no overlapping:")
for date in find_dates(test, allow_overlapping=False):
    print(date)

print("With overlapping:")
for date in find_dates(test, allow_overlapping=True):
    print(date)

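The result of the code is, unsurprisingly, garbage, whether or not you allow overlapping. If overlapping is allowed, you get a lot of dates that appear nowhere in the text; if it is not allowed, you miss an important date in the text.

With no overlapping: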
1999-05-12 00:00:00
2009-07-01 20:58:00
With overlapping:
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-03 00:00:00
1999-05-03 00:00:00
1999-07-03 00:00:00
1999-07-03 00:00:00
2009-07-01 20:58:00
2009-07-01 20:58:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00
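
Essentially, if overlapping is allowed:

"12 May 1999" is parsed as 1999-05-12 00:00:00
"May 1999" is parsed as 1999-05-03 00:00:00 (because today is the 3rd day of the month, and dateutil fills in missing fields from the current date)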
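
The day-of-month effect comes from how dateutil handles fields that are missing from the input: they are copied from the `default` argument of `dateutil.parser.parse`, which falls back to today's date. A minimal sketch of this, where `pretend_today` is a hypothetical stand-in for the day the script was run (any date on the 3rd of a month gives the same day field):

```python
from datetime import datetime

import dateutil.parser

# Hypothetical "today"; passed explicitly so the result is reproducible.
pretend_today = datetime(2000, 1, 3)

# All three fields are present, so nothing is taken from the default.
full = dateutil.parser.parse("12 May 1999", default=pretend_today)

# The day is missing, so it is copied from the default (the 3rd).
partial = dateutil.parser.parse("May 1999", default=pretend_today)

print(full)     # 1999-05-12 00:00:00
print(partial)  # 1999-05-03 00:00:00
```

This is why every shorter substring of a real date still "succeeds" and produces a plausible-looking but wrong result.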

If overlapping is not allowed, however, "2009. On 1st July 2058" is parsed as 2009-07-01 20:58:00, and no attempt is made to parse the date after the period.
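
The token stream makes it easier to see how the two sentences get glued together. The sketch below reuses the regular expression from `find_dates`; the helper name `tokenize` and the shortened sample string are introduced here only for illustration:

```python
import re


def tokenize(text):
    # Same split as in find_dates: runs of non-whitespace and runs of
    # whitespace become separate tokens; empty strings are dropped.
    return [token for token in re.split(r'(\S+|\W+)', text) if token]


sample = "Earth in 2009.\nOn 1st July 2058, Bowie"
tokens = tokenize(sample)
print(tokens)
# ['Earth', ' ', 'in', ' ', '2009.', '\n', 'On', ' ', '1st', ' ', 'July', ' ', '2058,', ' ', 'Bowie']

# The region that starts at '2009.' and ends at '2058,' spans both sentences.
merged = ''.join(tokens[4:13])
print(repr(merged))
# '2009.\nOn 1st July 2058,'
```

Punctuation stays attached to the preceding word and the line break is just another whitespace token, so nothing in the token stream marks the sentence boundary. Every shorter date candidate inside that region ends at the same token, which is why the non-overlapping mode discards them.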