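Python Regular Expression Study Notes

Regular Expressions

I am working as a research assistant (RA) for a professor on research that involves text matching, so I am teaching myself regular expressions in Python and keeping these notes. A regular expression is a special sequence of characters that makes it easy to check whether a string matches a given pattern.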
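As a minimal sketch of "checking whether a string matches a pattern", the snippet below uses the standard-library re module. The pattern and the second sentence are illustrative assumptions, not part of the course material.

```python
import re

# Illustrative pattern (an assumption): one or two digits, a colon, two digits.
time_pattern = re.compile(r'\d?\d:\d\d')

sentences = [
    "Monday: The doctor's appointment is at 2:45pm.",
    "No time is mentioned here.",  # made-up sentence without a time
]

# re.search scans the whole string and returns a match object, or None
# when the pattern does not occur anywhere in the string.
has_time = [time_pattern.search(s) is not None for s in sentences]
print(has_time)  # [True, False]
```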

Regular Expression Symbols

The symbol reference tables were screenshots (not reproduced here) taken from: http://c.biancheng.net/view/7768.html
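Since the screenshots are not available here, the following is a rough sketch of the handful of symbols that the rest of these notes rely on, tried out with the standard-library re module on one of the example sentences. It is not a substitute for the full reference tables, and the variable names are my own.

```python
import re

text = "Friday: Take the train at 08:10 am, arrive at 09:00am."

digits = re.findall(r'\d', text)           # \d   : any single digit
first_words = re.findall(r'\w+', text)[:3] # \w+  : one or more word characters
periods = re.findall(r'[ap]m', text)       # [ap] : either 'a' or 'p', then 'm'
times = re.findall(r'\d?\d:\d\d', text)    # \d?  : an optional leading digit
day_words = re.findall(r'\w+day\b', text)  # \b   : a word boundary after 'day'

print(digits)       # ['0', '8', '1', '0', '0', '9', '0', '0']
print(first_words)  # ['Friday', 'Take', 'the']
print(periods)      # ['am', 'am']
print(times)        # ['08:10', '09:00']
print(day_words)    # ['Friday']
```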

A Worked Example with a pandas DataFrame

This example comes from the University of Michigan course "Applied Text Mining in Python" on Coursera.

import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

Output

	text
0	Monday: The doctor's appointment is at 2:45pm.
1	Tuesday: The dentist's appointment is at 11:30...
2	Wednesday: At 7:00pm, there is a basketball game!
3	Thursday: Be back home by 11:15 pm at the latest.
4	Friday: Take the train at 08:10 am, arrive at ...

Find the number of characters for each string in df['text']

df['text'].str.len()

Output

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

Find the number of tokens for each string in df['text']

df['text'].str.split().str.len()

Output

0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

Find which entries contain the word 'appointment'

df['text'].str.contains('appointment')

Output

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

Find how many times a digit occurs in each string

df['text'].str.count(r'\d')

Output

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

Find all occurrences of the digits

df['text'].str.findall(r'\d')

Output

0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

Group and find the hours and minutes

df['text'].str.findall(r'(\d?\d):(\d\d)')

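Parentheses mark the start and end of a subexpression (a capturing group). (\d?\d) is the first group and matches either one or two digits; (\d\d) is the second group and matches exactly two digits.

Output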

0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object
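To see what the parentheses change, here is a small sketch using re.findall from the standard library on a single sentence. Without groups, each result is the whole matched string; with groups, each result is a tuple of the captured pieces, which is the shape shown in the output above.

```python
import re

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# No capturing groups: findall returns the full text of every match.
without_groups = re.findall(r'\d?\d:\d\d', sentence)

# Two capturing groups: findall returns one (hour, minute) tuple per match.
with_groups = re.findall(r'(\d?\d):(\d\d)', sentence)

print(without_groups)  # ['08:10', '09:00']
print(with_groups)     # [('08', '10'), ('09', '00')]
```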

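Replace weekdays with '???'

df['text'].str.replace(r'\w+day\b', '???', regex=True)

This regular expression matches any word ending in "day" and replaces it with "???". The regex=True argument is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default.

Output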

0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
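One caveat worth a quick sketch: \w+day\b matches every word that ends in "day", not only weekday names. The sentence and the stricter pattern below are my own illustration (assumptions, not from the course), using re.sub from the standard library.

```python
import re

# Made-up sentence that contains a non-weekday word ending in "day".
sentence = "Monday: Call me today or on Sunday."

# Loose pattern from the notes: any word ending in "day".
loose = re.sub(r'\w+day\b', '???', sentence)

# Stricter alternative (an assumption): list the weekday stems explicitly.
weekday = r'\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b'
strict = re.sub(weekday, '???', sentence)

print(loose)   # ???: Call me ??? or on ???.
print(strict)  # ???: Call me today or on ???.
```

For the five example sentences in these notes the two patterns give the same result, so the loose pattern is fine there.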

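Replace weekdays with 3-letter abbreviations

df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)

The argument x passed to the lambda is the match object for each match, not the matched string itself. Its groups() method returns all captured groups as a tuple; here the tuple has only one element, so [0] selects it. Slicing with [:3] keeps the first three characters, which then replace the matched word in the original string. A callable replacement requires regex=True.

Output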

0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
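The same idea can be sketched with plain re.sub, which also accepts a function as the replacement. The helper name abbreviate is hypothetical; it does what the lambda above does, written out step by step.

```python
import re

def abbreviate(match):
    # match is an re.Match object; group(1) is the text captured by the
    # first (and only) pair of parentheses in the pattern.
    return match.group(1)[:3]

sentence = "Tuesday: The dentist's appointment is at 11:30 am."

# groups() returns every captured group as a tuple.
captured = re.search(r'(\w+day\b)', sentence).groups()
print(captured)  # ('Tuesday',)

# re.sub calls abbreviate once per match and inserts its return value.
result = re.sub(r'(\w+day\b)', abbreviate, sentence)
print(result)  # Tue: The dentist's appointment is at 11:30 am.
```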

Create new columns from first match of extracted groups

df['text'].str.extract(r'(\d?\d):(\d\d)')

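extract is a text-extraction function that returns a DataFrame with one column per group. It keeps only the first match in each string, which is why row 4 shows 08:10 but not 09:00.

Output

    0   1
0   2  45
1  11  30
2   7  00
3  11  15
4  08  10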
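A small sketch of the two edge cases of str.extract: a string with more than one match, and a string with no match at all. The second row is a made-up sentence added for illustration, and the example assumes pandas is installed.

```python
import pandas as pd

s = pd.Series([
    "Friday: Take the train at 08:10 am, arrive at 09:00am.",  # two times
    "Saturday: Nothing is scheduled.",  # hypothetical row with no time
])

# One column per capturing group, one row per input string.
first_match = s.str.extract(r'(\d?\d):(\d\d)')

# Row 0 keeps only the first time found; row 1 is filled with missing values.
print(first_match.iloc[0].tolist())      # ['08', '10']
print(first_match.iloc[1].isna().all())  # True
```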

Extract the entire time, the hours, the minutes, and the period

df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

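This regular expression has four groups, each delimited by (). Groups are numbered by the position of their opening parenthesis, so the outermost pair, which wraps the whole expression, is group 0 in the output, followed by the hours (1), the minutes (2), and the period (3). [ap]m matches either "am" or "pm".
extractall returns every match as a row of a DataFrame, with one column per group.
The ":" has no special meaning here; it simply matches a colon.
" ?" is a space followed by a question mark. Some times in the text have a space before am/pm and some do not, so the optional space ensures that both forms are found.

Output

             0   1   2   3
  match
0 0     2:45pm   2  45  pm
1 0   11:30 am  11  30  am
2 0     7:00pm   7  00  pm
3 0   11:15 pm  11  15  pm
4 0   08:10 am  08  10  am
  1    09:00am  09  00  am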
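The group ordering and the effect of the optional space can be checked on short strings with the standard-library re module. This is only a sketch; the fragments and variable names are my own.

```python
import re

pattern = re.compile(r'((\d?\d):(\d\d) ?([ap]m))')

# groups() lists the groups in order of their opening parenthesis:
# whole time, hour, minute, period.
groups_with_space = pattern.search("at 11:30 am.").groups()
groups_without_space = pattern.search("at 2:45pm.").groups()
print(groups_with_space)     # ('11:30 am', '11', '30', 'am')
print(groups_without_space)  # ('2:45pm', '2', '45', 'pm')

# Without the " ?", a time written with a space is not found at all.
no_optional_space = re.search(r'\d?\d:\d\d[ap]m', "at 11:30 am.")
print(no_optional_space)  # None
```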

Extract the entire time, the hours, the minutes, and the period with group names

df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

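Named groups make the output easier to read: the ?P<name> syntax gives each group a name, which is used as its column label in place of a number.

Output

          time  hour  minute  period
  match
0 0     2:45pm     2      45      pm
1 0   11:30 am    11      30      am
2 0     7:00pm     7      00      pm
3 0   11:15 pm    11      15      pm
4 0   08:10 am    08      10      am
  1    09:00am    09      00      am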
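Named groups work the same way in the standard-library re module, where a match can be read by name or turned into a dictionary. A minimal sketch, reusing the pattern above on one of the example sentences:

```python
import re

pattern = re.compile(
    r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))'
)

m = pattern.search("Thursday: Be back home by 11:15 pm at the latest.")

# Individual groups can be looked up by name instead of by number.
hour = m.group('hour')
print(hour)  # 11

# groupdict() maps every group name to the text it captured.
fields = m.groupdict()
print(fields)
# {'time': '11:15 pm', 'hour': '11', 'minute': '15', 'period': 'pm'}
```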
Evening of February 17