Python regular expressions

Basic patterns that match a single character

Quoted from Google's Python Class:

https://developers.google.com/edu/python/regular-expressions?hl=zh-CN

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

  • a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
  • . (a period) -- matches any single character except newline '\n'
  • \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
  • \b -- boundary between word and non-word
  • \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
  • \t, \n, \r -- tab, newline, return
  • \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
  • ^ = start, $ = end -- match the start or end of the string
  • \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.

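import re

url = 'http://www.google.com'  # named url rather than str to avoid shadowing the built-in
match = re.search(r'^ht\w+', url)
print(match.group())  # http
match = re.search(r'^h[\w@/:.]+m$', url)
print(match.group())  # http://www.google.com

To match text that starts with a given character, use ^: for example, r'^p' matches a string starting with p. To match text that ends with a given character, use $: for example, r'm$' matches a string ending with m.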
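The single-character patterns listed above can be tried one at a time. The following is a minimal sketch; the sample string and variable names are made up for illustration and do not come from the Google class.

```python
import re

sample = 'cat 42\tdog.'  # illustrative string, not from the original notes

# \d matches a single decimal digit; search returns the leftmost one.
first_digit = re.search(r'\d', sample).group()
# \w followed by \s: a word character directly followed by whitespace.
char_then_space = re.search(r'\w\s', sample).group()
# \. is an escaped period, so it matches only a literal '.'.
literal_dot = re.search(r'\.', sample).group()
# An unescaped dot matches any character except newline.
any_three = re.search(r'...', sample).group()
# \b marks a word/non-word boundary on each side of 'dog'.
whole_word = re.search(r'\bdog\b', sample)
# There is no boundary between 'd' and 'o', so this finds nothing.
partial_word = re.search(r'\bog\b', sample)

print(repr(first_digit))      # '4'
print(repr(char_then_space))  # 't '
print(repr(literal_dot))      # '.'
print(repr(any_three))        # 'cat'
print(whole_word is not None) # True
print(partial_word)           # None
```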

Repetition

Things get more interesting when you use + and * to specify repetition in the pattern

  • + -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
  • * -- 0 or more occurrences of the pattern to its left
  • ? -- match 0 or 1 occurrences of the pattern to its left
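A short sketch of the three repetition operators in action. The test strings are illustrative choices, and the variable names are only for this example.

```python
import re

# + requires at least one 'i' after the 'p'.
one_or_more = re.search(r'pi+', 'piiig').group()
# * also accepts zero 'i' characters, so a bare 'p' is enough.
zero_or_more = re.search(r'pi*', 'pg').group()
# With +, the same string fails because no 'i' follows the 'p'.
needs_one = re.search(r'pi+', 'pg')
# ? makes the 'u' optional, so both spellings match.
without_u = re.search(r'colou?r', 'color').group()
with_u = re.search(r'colou?r', 'colour').group()
# \s* allows any amount of whitespace (including none) between digits.
spaced_digits = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx').group()

print(one_or_more)    # piii
print(zero_or_more)   # p
print(needs_one)      # None
print(without_u)      # color
print(with_u)         # colour
print(spaced_digits)  # 1 2   3
```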

Leftmost & Largest

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").

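In other words: first, the search settles on the leftmost place where the pattern matches; second, from that position it tries to match as long a stretch of the string as possible.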
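Both rules can be seen in a small experiment. This is a sketch with made-up input strings: the first search shows that the leftmost match wins even when a longer one exists later, and the second shows a greedy .* running as far as it can.

```python
import re

# Leftmost: the first run of i's is returned, even though a longer run comes later.
text = 'piigiiii'
m = re.search(r'i+', text)
leftmost = m.group()
leftmost_start = m.start()

# Largest: .* is greedy, so it runs to the last '>' rather than stopping at the first.
markup = '<b>foo</b> and <i>bar</i>'
greedy = re.search(r'<.*>', markup).group()

print(leftmost, leftmost_start)  # ii 1
print(greedy)                    # <b>foo</b> and <i>bar</i>
```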

Square Brackets

Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. 
The characters inside square brackets are alternatives: [abc] matches the character a, or b, or c.

import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'[\w-]+@[\w.]+', text)
if match:
    print(match.group())  ## 'alice-b@google.com'


(More square-bracket features) You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
If ^ appears at the start of a square-bracket set, it negates the whole set, so [^ab] matches any character other than a and b.
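Ranges, a trailing dash, and negated sets can be compared side by side. A minimal sketch, using illustrative input strings that are not part of the original notes:

```python
import re

# [a-z] is a range: lowercase letters only, so the match stops at the dash.
letters = re.search(r'[a-z]+', 'room-42b').group()
# Putting the dash last makes it a literal member of the set.
letters_and_dash = re.search(r'[a-z-]+', 'room-42b').group()
# A leading ^ inverts the set: any characters except 'a' and 'b'.
not_a_or_b = re.search(r'[^ab]+', 'abbacdab').group()
# The same idea with a range: anything that is not a digit.
not_digits = re.search(r'[^0-9]+', '2024year').group()

print(letters)           # room
print(letters_and_dash)  # room-
print(not_a_or_b)        # cd
print(not_digits)        # year
```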

Group Extraction

The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'.

Group extraction: put parentheses around the parts of the pattern that you want to extract.
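A sketch of how the parenthesized pattern can be used, reusing the sample sentence from the square-brackets example. match.group() with no argument returns the whole match, while group(1) and group(2) return the text captured by the first and second pair of parentheses. The variable names are only for illustration.

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)

whole = username = host = None
if match:
    whole = match.group()      # the entire matched text
    username = match.group(1)  # first parenthesized group
    host = match.group(2)      # second parenthesized group

print(whole)     # alice-b@google.com
print(username)  # alice-b
print(host)      # google.com
```

Checking `if match:` before calling group() matters here, because re.search returns None when nothing matches.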

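.  (dot)
The dot matches any character except newline.

^
At the start of a square-bracket set, ^ negates the set: [^0-9] matches anything except a digit.
[^a] matches any character that is not a. (Outside square brackets, ^a means "a at the start of the string".)
[^ab]
Matches any character that is not a and also not b.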




re.sub
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
re.sub returns the string that results from the replacement.

For example, the code below replaces every occurrence of mo in the string with mi:
import re

s = 'molmormo'
s1 = re.sub(r'mo', 'mi', s)
print(s1)
Output:
milmirmi
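The count parameter from the signature above, and group references in the replacement, can be sketched as follows. The host example.com and the variable names are placeholders chosen for this illustration; \1 in repl refers to the text captured by the first group of the pattern.

```python
import re

s = 'molmormo'

# count=1 limits the replacement to the leftmost occurrence only.
first_only = re.sub(r'mo', 'mi', s, count=1)
# When the pattern is not found, the string comes back unchanged.
unchanged = re.sub(r'xyz', 'mi', s)
# \1 in the replacement reuses the first captured group (the username),
# so only the host part is swapped for the placeholder example.com.
text = 'purple alice-b@google.com monkey'
new_host = re.sub(r'([\w.-]+)@([\w.-]+)', r'\1@example.com', text)

print(first_only)  # milmormo
print(unchanged)   # molmormo
print(new_host)    # purple alice-b@example.com monkey
```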




