Python regular expressions explained with pandas: extracting dates in different formats with regular expressions and sorting them

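I believe this is one of the assignments from a text mining course. In that case you can get to a solution with regular expressions and str.extract.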
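As a minimal sketch of what str.extract does with a single capture group, the snippet below runs a simplified numeric-date pattern over a few made-up lines. The sample strings and the simplified pattern are assumptions for illustration only; they are not taken from dates.txt. Passing expand=False is also a choice made here so that the result is a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample lines; the real input would come from dates.txt
sample = pd.Series([
    "seen on 03/25/93 for follow-up",
    "no date in this line",
    "admitted 6/18/85 overnight",
])

# One capture group plus expand=False yields a Series aligned with the input;
# rows where the pattern does not match become NaN
numeric = sample.str.extract(r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})', expand=False)
print(numeric)
```

Because the result keeps the original index, several such Series can later be combined row by row with fillna.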

Reading dates.txt, i.e.:

import pandas as pd

# Read every line of dates.txt into a Series, one line per row
doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)

def date_sorter():
    # Dates with the month written as a word; expand=False returns a Series
    one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})', expand=False)
    # Dates written purely as numbers
    two = df.str.extract(r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))', expand=False)
    # Dates without a day, i.e. only month and year, or a year alone
    three = df.str.extract(r'((?:\d{1,2}(?:-|\/))?\d{4})', expand=False)
    # Fill the NaNs in one from two and then from three, fix the misspelled
    # month names found in the text file, and convert to datetime.
    # pandas 2.x may need format='mixed' here because the strings mix formats.
    dates = pd.to_datetime(one.fillna(two).fillna(three).replace('Decemeber', 'December', regex=True).replace('Janaury', 'January', regex=True))
    return pd.Series(dates.sort_values())

date_sorter()

Output:

9      1971-04-10
84     1971-05-18
2      1971-07-08
53     1971-07-11
28     1971-09-12
474    1972-01-01
153    1972-01-13
13     1972-01-26
129    1972-05-06
98     1972-05-13
111    1972-06-10
225    1972-06-15
31     1972-07-20
171    1972-10-04
191    1972-11-30
486    1973-01-01
335    1973-02-01
415    1973-02-01
36     1973-02-14
405    1973-03-01
323    1973-03-01
422    1973-04-01
375    1973-06-01
380    1973-07-01
345    1973-10-01
57     1973-12-01
481    1974-01-01
436    1974-02-01
104    1974-02-24
299    1974-03-01

If you only want the index, return pd.Series(dates.sort_values().index) instead.
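The order of one.fillna(two).fillna(three) matters, because the month/year pattern is loose enough to match part of a full numeric date. The following is a rough sketch of the same fallback order using only the standard re module, so the patterns can be tried without dates.txt. The helper first_date and the sample lines are hypothetical and exist only for this illustration.

```python
import re

# The three patterns from date_sorter(), tried in the same order as
# one.fillna(two).fillna(three)
PATTERNS = [
    r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})',
    r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))',
    r'((?:\d{1,2}(?:-|\/))?\d{4})',
]

def first_date(text):
    """Return the match of the first pattern that finds one, else None."""
    for pattern in PATTERNS:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None

# Hypothetical sample lines, not taken from dates.txt
samples = [
    "Lab: 24 Jan 2001 normal",
    "March 7, 1974 visit",
    "seen 03/25/93 in clinic",
    "onset 6/1998 per notes",
    "history since 1984",
    "no date here",
]
results = [first_date(s) for s in samples]
for s, r in zip(samples, results):
    print(repr(s), '->', repr(r))

# Applied on its own, the month/year pattern misreads a full numeric date
misread = re.search(PATTERNS[2], "03/25/1993").group(1)
print(misread)
```

The last line prints 25/1993, which is why the month/year pattern is only used for rows that the first two patterns left empty.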

Breaking down the first regular expression

# ?: marks a non-capturing group
((?:\d{,2}\s)? # The optional day: zero to two digits followed by whitespace. `?` applies to the preceding token or group, here the whole group, so it occurs once or not at all.
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* # A month abbreviation followed by lowercase letters (`[a-z]`) occurring any number of times (`*`), so both Jan and January match.
(?:-|\.|\s|,) # Exactly one separator: hyphen, period, whitespace, or comma.
\s? # Optional whitespace (`?` here applies only to the preceding `\s`).
\d{,2}[a-z]* # Zero to two digits followed by any number of letters (`*`), e.g. 1st, 13th, 22nd.
(?:-|,|\s)? # An optional separator: hyphen, comma, or whitespace. The trailing `?` means it occurs once or not at all.
\s? # Optional whitespace, at most one character (`?` here applies only to the preceding `\s`).
\d{2,4}) # The year: two to four digits.
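The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace and allows # comments inside the pattern. This is only a sketch: the name WORD_DATE and the test strings are made up for illustration, and the pattern is unchanged from the one discussed above.

```python
import re

# The first pattern, written with re.VERBOSE so the comments live inside it.
# Literal whitespace is ignored in this mode, which is safe here because
# every space in the pattern is matched with \s.
WORD_DATE = re.compile(r'''
    (
      (?:\d{,2}\s)?      # optional day: zero to two digits, then whitespace
      (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*  # month name
      (?:-|\.|\s|,)      # one separator: hyphen, period, whitespace or comma
      \s?                # optional whitespace
      \d{,2}[a-z]*       # optional day with ordinal suffix, e.g. 1st
      (?:-|,|\s)?        # optional separator
      \s?                # optional whitespace
      \d{2,4}            # year: two to four digits
    )
''', re.VERBOSE)

# Hypothetical test strings covering the shapes described above
tests = ["Jan 24, 1986", "December 1st, 1998", "14 Oct 1981", "July 1990"]
matched = [WORD_DATE.search(t).group(1) for t in tests]
print(matched)

# A purely numeric date contains no month name, so this pattern skips it
no_match = WORD_DATE.search("03/25/93")
print(no_match)
```

Each of the four test strings is matched in full, while the numeric date is left for the second pattern.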

Hope this helps.
