Python Regular Expressions: Study Notes

JokerJL

Article link: https://blog.csdn.net/JokerJL/article/details/113837947

Regular expressions
- Regular expression symbols
- Python DataFrame worked example

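Regular expressions

I am working as a research assistant (RA) for a professor and need to do research involving text matching, so I am teaching myself Python regular expressions and recording my notes here. A regular expression is a special sequence of characters that makes it easy to check whether a string matches a given pattern.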
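As a minimal illustration of "checking whether a string matches a pattern", the sketch below uses the standard-library `re` module. The pattern and the two sample strings are made up for illustration; they are not taken from the course material.

```python
import re

# Hypothetical pattern: a clock time such as "2:45" or "11:30"
# (one or two digits, a colon, then exactly two digits).
time_pattern = re.compile(r'\d?\d:\d\d')

# re.search scans the whole string and returns a match object,
# or None when the pattern does not occur anywhere.
has_time = time_pattern.search("The appointment is at 2:45pm.") is not None
no_time = time_pattern.search("No time is mentioned here.") is not None

print(has_time)  # True
print(no_time)   # False
```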

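Regular expression symbols

[Two screenshots of a regular expression symbol table; the images are not reproduced here.]

Screenshots from: http://c.biancheng.net/view/7768.html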
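The symbol table itself was an image, so as a small sketch here are only the symbols that the examples in these notes rely on. This is not a substitute for the full table at the link above, and the sample sentence is reused from the DataFrame example purely for convenience.

```python
import re

text = "Friday: Take the train at 08:10 am"

# \d matches a single digit.
digits = re.findall(r'\d', text)            # ['0', '8', '1', '0']

# \w+ matches one or more word characters; \b is a word boundary.
day_words = re.findall(r'\w+day\b', text)   # ['Friday']

# [ap] matches exactly one character, either 'a' or 'p'.
periods = re.findall(r'[ap]m', text)        # ['am']

# ? makes the preceding token optional (zero or one occurrence).
hours = re.findall(r'\d?\d:', text)         # ['08:']

print(digits, day_words, periods, hours)
```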

Python DataFrame worked example

This example comes from the course "Applied Text Mining in Python", offered by the University of Michigan on Coursera.

import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

Output

	text
0	Monday: The doctor's appointment is at 2:45pm.
1	Tuesday: The dentist's appointment is at 11:30...
2	Wednesday: At 7:00pm, there is a basketball game!
3	Thursday: Be back home by 11:15 pm at the latest.
4	Friday: Take the train at 08:10 am, arrive at ...

Find the number of characters for each string in df['text']

df['text'].str.len()

Output

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

Find the number of tokens for each string in df['text']

df['text'].str.split().str.len()

Output

0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64
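For a single string, the two pandas calls above correspond roughly to plain Python `len()` and `str.split()`. The sketch below checks the first row by hand; it is only meant to show where the numbers 46 and 7 come from.

```python
sentence = "Monday: The doctor's appointment is at 2:45pm."

# Number of characters, including spaces and punctuation.
n_chars = len(sentence)

# split() with no argument splits on runs of whitespace, so punctuation
# stays attached to its word ("2:45pm." counts as one token).
tokens = sentence.split()
n_tokens = len(tokens)

print(n_chars)   # 46
print(n_tokens)  # 7
print(tokens[-1])  # 2:45pm.
```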

Find which entries contain the word 'appointment'

df['text'].str.contains('appointment')

Output

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool
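Note that `str.contains` treats its argument as a regular expression by default, so it behaves roughly like `re.search` on each row. The sketch below uses `re.search` directly; the string "Two appointments today" is a made-up example showing that a plain substring pattern also matches longer words, while `\b` restricts the match to the whole word.

```python
import re

sentences = [
    "Monday: The doctor's appointment is at 2:45pm.",
    "Wednesday: At 7:00pm, there is a basketball game!",
]

# Substring-style search, one boolean per sentence.
contains = [re.search('appointment', s) is not None for s in sentences]

# Hypothetical extra string: the plural form contains the substring,
# but does not match when word boundaries are required.
plural = "Two appointments today"
loose = re.search('appointment', plural) is not None
strict = re.search(r'\bappointment\b', plural) is not None

print(contains)       # [True, False]
print(loose, strict)  # True False
```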

Find how many times a digit occurs in each string

df['text'].str.count(r'\d')

Output

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

Find all occurrences of the digits

df['text'].str.findall(r'\d')

Output

0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object
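The count from `str.count(r'\d')` is simply the length of the list returned by `str.findall(r'\d')`. A small sketch with the standard-library `re` module on the second sentence; the `\d+` variant is an addition of mine to show how to get whole numbers instead of single digits.

```python
import re

sentence = "Tuesday: The dentist's appointment is at 11:30 am."

# Every single digit, one list element per digit.
digits = re.findall(r'\d', sentence)
n_digits = len(digits)

# \d+ groups consecutive digits into one match.
numbers = re.findall(r'\d+', sentence)

print(digits)    # ['1', '1', '3', '0']
print(n_digits)  # 4
print(numbers)   # ['11', '30']
```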

Group and find the hours and minutes

df['text'].str.findall(r'(\d?\d):(\d\d)')

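Parentheses () mark the start and end of a subexpression (a capture group). (\d?\d) is the first group and matches either a one-digit or a two-digit number; (\d\d) is the second group and matches exactly two digits.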

0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object
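To see what the parentheses change, the sketch below runs the same pattern with and without groups using `re.findall` on the last sentence. Without groups each match comes back as one string; with two groups each match comes back as an (hour, minute) tuple, which is what the pandas output above shows.

```python
import re

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# No groups: each match is returned as a single string.
whole = re.findall(r'\d?\d:\d\d', sentence)

# Two groups: each match is returned as a tuple (hour, minute).
parts = re.findall(r'(\d?\d):(\d\d)', sentence)

print(whole)  # ['08:10', '09:00']
print(parts)  # [('08', '10'), ('09', '00')]
```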

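Replace weekdays with '???'

df['text'].str.replace(r'\w+day\b', '???', regex=True)

This regular expression matches any word that ends in "day" (one or more word characters, then "day", then a word boundary) and replaces it with "???". In pandas 2.0 and later, regex=True must be passed explicitly, because str.replace treats the pattern as a literal string by default.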

0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
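One caveat worth noting: `\w+day\b` matches any word ending in "day", not only weekday names. The sketch below shows this with `re.sub` on a made-up sentence, and then a stricter pattern that lists the weekday prefixes explicitly. The stricter pattern is my own suggestion, not part of the course example.

```python
import re

loose_pattern = r'\w+day\b'
# Hypothetical stricter alternative: only the seven weekday names.
strict_pattern = r'\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b'

original = "Monday: The doctor's appointment is at 2:45pm."
tricky = "Sunday is a holiday, not a workday"  # made-up test sentence

replaced = re.sub(loose_pattern, '???', original)
loose_result = re.sub(loose_pattern, '???', tricky)
strict_result = re.sub(strict_pattern, '???', tricky)

print(replaced)       # ???: The doctor's appointment is at 2:45pm.
print(loose_result)   # ??? is a ???, not a ???
print(strict_result)  # ??? is a holiday, not a workday
```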

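Replace weekdays with 3 letter abbreviations

df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)

Here the lambda receives x, the match object for each match (not the matched string itself). groups() returns a tuple of all captured groups; this pattern has only one group, so [0] selects it. Slicing with [:3] keeps the first three characters, and that result replaces the match in the original string. A callable replacement only works with regex=True, which must be passed explicitly in pandas 2.0 and later.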

0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
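The same replacement can be traced step by step with `re.sub`, which also accepts a function that receives a match object. In this sketch the helper name `abbreviate` is my own. The second call shows that the parentheses are not strictly required, since `group(0)` always holds the whole match.

```python
import re

def abbreviate(match):
    # match is an re.Match object, not a string.
    # match.groups() is a tuple of captured groups, e.g. ('Monday',).
    first_group = match.groups()[0]
    return first_group[:3]  # keep the first three characters

with_group = re.sub(r'(\w+day\b)', abbreviate,
                    "Monday: The doctor's appointment is at 2:45pm.")

# Without a capture group, group(0) gives the full matched text.
without_group = re.sub(r'\w+day\b', lambda m: m.group(0)[:3],
                       "Tuesday: The dentist's appointment is at 11:30 am.")

print(with_group)     # Mon: The doctor's appointment is at 2:45pm.
print(without_group)  # Tue: The dentist's appointment is at 11:30 am.
```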

Create new columns from first match of extracted groups

df['text'].str.extract(r'(\d?\d):(\d\d)')

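extract is the text-extraction method. It returns a DataFrame with one column per capture group, filled from the first match in each row.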
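The post does not show the output of `extract`. As a rough standard-library analogy (a way to picture the behavior, not a description of how pandas implements it), `extract` acts like `re.search` on each row and keeps only the first match, while `extractall` acts like `re.finditer` and keeps every match. The last sentence, which contains two times, makes the difference visible.

```python
import re

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."
pattern = re.compile(r'(\d?\d):(\d\d)')

# First match only, comparable to one row of str.extract.
first = pattern.search(sentence)
first_groups = first.groups() if first else (None, None)

# Every match, comparable to the rows str.extractall would produce.
all_groups = [m.groups() for m in pattern.finditer(sentence)]

print(first_groups)  # ('08', '10')
print(all_groups)    # [('08', '10'), ('09', '00')]
```

For a row with no match, `extract` fills the columns with NaN rather than raising an error.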

Extract the entire time, the hours, the minutes, and the period

df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

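This regular expression has four groups, each delimited by (). The outermost () wraps the entire expression and captures the full time string. Inside it, (\d?\d) captures the hour, (\d\d) captures the minutes, and ([ap]m) captures the period, matching either "am" or "pm". Groups are numbered by the position of their opening parenthesis, so the outermost group becomes column 0 in the output.
extractall returns every match as a row of a DataFrame, with one column per () group.
The ":" has no special meaning here; it simply matches a literal colon.
" ?" is a space followed by a question mark, which makes the space optional. Some times in the text have a space before am/pm and some do not, so writing it this way matches both forms and ensures every time is found.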

                0   1   2   3
  match
0 0        2:45pm   2  45  pm
1 0      11:30 am  11  30  am
2 0        7:00pm   7  00  pm
3 0      11:15 pm  11  15  pm
4 0      08:10 am  08  10  am
  1       09:00am  09  00  am
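The group numbering and the optional space can be checked on two short strings with `re.search`. The input strings below are shortened fragments of the sentences above. Note that `re` numbers groups from 1, so `groups()[0]` is the outermost group, which pandas shows as column 0.

```python
import re

pattern = re.compile(r'((\d?\d):(\d\d) ?([ap]m))')

# No space before "pm": the optional space matches zero times.
no_space = pattern.search("appointment is at 2:45pm.").groups()

# One space before "am": the optional space matches once.
with_space = pattern.search("appointment is at 11:30 am.").groups()

# Order: (whole time, hour, minute, period), following the
# position of each opening parenthesis from left to right.
print(no_space)    # ('2:45pm', '2', '45', 'pm')
print(with_space)  # ('11:30 am', '11', '30', 'am')
```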

Extract the entire time, the hours, the minutes, and the period with group names

df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

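To give the output meaningful column names, put ?P<name> at the start of each group; the group names then become the DataFrame column names.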

             time hour minute period
  match
0 0        2:45pm    2     45     pm
1 0      11:30 am   11     30     am
2 0        7:00pm    7     00     pm
3 0      11:15 pm   11     15     pm
4 0      08:10 am   08     10     am
  1       09:00am   09     00     am
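Named groups work the same way in the standard-library `re` module, where `groupdict()` returns a name-to-value mapping for each match. The sketch below builds one dictionary per match for the last sentence, which mirrors the two rows (match 0 and match 1) that pandas shows for row 4.

```python
import re

pattern = re.compile(
    r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))'
)

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# One dict per match; keys are the group names from the pattern.
rows = [m.groupdict() for m in pattern.finditer(sentence)]

for row in rows:
    print(row)
# {'time': '08:10 am', 'hour': '08', 'minute': '10', 'period': 'am'}
# {'time': '09:00am', 'hour': '09', 'minute': '00', 'period': 'am'}
```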

Evening of February 17
