Python regular expressions

scgillian

Article link: https://blog.csdn.net/zhanglei0107/article/details/8742224

Basic patterns that match a single character

Quoted from Google's Python Class:

https://developers.google.com/edu/python/regular-expressions?hl=zh-CN

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
. (a period) -- matches any single character except newline '\n'
\w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
\b -- boundary between word and non-word
\s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
\t, \n, \r -- tab, newline, return
\d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
^ = start, $ = end -- match the start or end of the string
\ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.

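import re

url = 'http://www.google.com'
# \w+ stops at ':' because ':' is not a word character
match = re.search(r'^ht\w+', url)
print(match.group())  # http
# the bracket set also allows @ / : . so the match runs to the final 'm'
match = re.search(r'^h[\w@/:.]+m$', url)
print(match.group())  # http://www.google.com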

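To match a string that starts with a given character, use ^: for example, r'^p' matches a string starting with p. To match a string that ends with a given character, use $: for example, r'm$' matches a string ending in m.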
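The other single-character patterns in the list can be tried the same way. The following is a small sketch, not part of the quoted notes; the sample strings and variable names are made up for illustration.

```python
import re

text = 'Order 66 shipped to a.b@example.com at 9am'

# \d matches one decimal digit; the leftmost digit in the text is found first.
first_digit = re.search(r'\d', text).group()

# \w matches a single word character, so \w\w\w matches three in a row.
three_word_chars = re.search(r'\w\w\w', text).group()

# \s matches one whitespace character; here it is the space after "Order".
first_space = re.search(r'\s', text).group()

# An unescaped dot matches any character; an escaped dot only a literal period.
any_char = re.search(r'a.b', 'axb a.b').group()
literal_dot = re.search(r'a\.b', 'axb a.b').group()

# \b marks a word boundary, so the "at" inside that/cat/sat is skipped.
standalone_at = re.search(r'\bat\b', 'that cat sat at home').start()

print(first_digit, three_word_chars, repr(first_space))  # 6 Ord ' '
print(any_char, literal_dot, standalone_at)              # axb a.b 13
```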

Repetition

Things get more interesting when you use +, * and ? to specify repetition in the pattern:

+ -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
* -- 0 or more occurrences of the pattern to its left
? -- match 0 or 1 occurrences of the pattern to its left
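A short sketch of the three repetition operators, using made-up sample strings:

```python
import re

# + needs at least one 'i'; here it takes the whole run of three.
plus_match = re.search(r'pi+', 'piiig').group()

# * also accepts zero occurrences, so a bare 'p' is enough.
star_match = re.search(r'pi*', 'pg').group()

# ? makes the 'u' optional, so both spellings match.
american = re.search(r'colou?r', 'color').group()
british = re.search(r'colou?r', 'colour').group()

print(plus_match)         # piii
print(star_match)         # p
print(american, british)  # color colour
```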

Leftmost & Largest

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").

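Two points to note: first, the search returns the leftmost match for the pattern; second, it tries to match as long a string as possible.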
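Both rules can be checked with a few one-line searches. This is an illustrative sketch with made-up input strings:

```python
import re

# Leftmost: the search settles on the first place the pattern can match,
# so the short run 'ii' wins even though a longer run 'iiii' comes later.
leftmost = re.search(r'i+', 'piigiiii').group()

# Largest: once started, \d+ takes the entire run of digits.
greedy = re.search(r'\d+', 'abc12345def').group()

# A consequence of "leftmost": \d* may match zero digits, so at position 0
# of 'abc123' it succeeds immediately with an empty match.
empty = re.search(r'\d*', 'abc123').group()

print(leftmost)     # ii
print(greedy)       # 12345
print(repr(empty))  # ''
```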

Square Brackets

Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'.

The characters inside square brackets are alternatives: [abc] matches the character a, or b, or c.

import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'[\w-]+@[\w.]+', text)
if match:
    print(match.group())  # alice-b@google.com

(More square-bracket features) You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.

If ^ appears at the start of a square-bracket set, it inverts the whole set. So [^ab] stands for any character other than a and b.
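Ranges, a literal dash, and an inverted set can be compared side by side. A minimal sketch with made-up sample strings:

```python
import re

text = 'code: B-12 ok'

# [a-z] is a range: any lowercase letter.
lower_run = re.search(r'[a-z]+', text).group()

# A dash placed last is a literal dash, so this set is
# "uppercase letter, digit, or dash".
part_number = re.search(r'[A-Z0-9-]+', text).group()

# A leading ^ inverts the set: the first character that is neither a nor b.
not_a_or_b = re.search(r'[^ab]', 'abbacab').group()

print(lower_run)    # code
print(part_number)  # B-12
print(not_a_or_b)   # c
```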

Group Extraction

The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'.

Group extraction: put parentheses around the parts of the pattern that you want to extract.
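A sketch of how the groups are read back from the match object, reusing the email string from the square-bracket example. The variable names are chosen here for illustration.

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    whole = match.group()      # the entire matched text
    username = match.group(1)  # text matched by the first ( )
    host = match.group(2)      # text matched by the second ( )
    print(whole)     # alice-b@google.com
    print(username)  # alice-b
    print(host)      # google.com
```

The parentheses do not change what the pattern matches; they only mark which parts of the match can be retrieved separately.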

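. (dot)

The dot matches any single character except newline.

^ (at the start of a square-bracket set)

[^0-9] matches anything except a digit.
[^a] matches any character that is not a. (Outside square brackets, ^a instead means an a at the start of the string.)
[^ab] matches anything that is not a and also not b.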

re.sub

re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.

re.sub returns the string that results from the replacement.

For example, the following replaces every occurrence of mo in the string with mi:

import re

p = re.compile(r'mo')   # compile the pattern once
s = 'molmormo'
s1 = p.sub('mi', s)     # same result as re.sub(r'mo', 'mi', s)
print(s1)

The output is:

milmirmi
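The count parameter in the signature and group references in the replacement are worth trying as well. A brief sketch: the first call reuses the string from the example above, while the email text and the example.com host are made up for illustration.

```python
import re

s = 'molmormo'

# count=1 replaces only the leftmost occurrence; the default 0 replaces all.
first_only = re.sub(r'mo', 'mi', s, count=1)

# In the replacement, \1 refers to the text captured by the first group,
# so the username is kept while the host is swapped out.
text = 'purple alice-b@google.com monkey dishwasher'
new_host = re.sub(r'([\w.-]+)@([\w.-]+)', r'\1@example.com', text)

# If the pattern is not found, the string comes back unchanged.
unchanged = re.sub(r'xyz', 'mi', s)

print(first_only)  # milmormo
print(new_host)    # purple alice-b@example.com monkey dishwasher
print(unchanged)   # molmormo
```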