python笔记-RE

Basic Patterns

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

  • a, X, 9, < – ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
  • . (a period) – matches any single character except newline ‘\n’
  • \w – (lowercase w) matches a “word” character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although “word” is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
  • \b – boundary between word and non-word
  • \s – (lowercase s) matches a single whitespace character – space, newline, return, tab, form [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
  • \t, \n, \r – tab, newline, return
  • \d – decimal digit [0-9] (some older regex utilities do not support but \d, but they all support \w and \s)
  • ^ = start, $ = end – match the start or end of the string
  • \ – inhibit the “specialness” of a character. So, for example, use . to match a period or \ to match a slash. If you are unsure if a character has special meaning, such as ‘@’, you can put a slash in front of it, \@, to make sure it is treated just as a character.

Repetition

Things get more interesting when you use + and * to specify repetition in the pattern

  • + – 1 or more occurrences of the pattern to its left, e.g. ‘i+’ = one or more i’s
  • * – 0 or more occurrences of the pattern to its left
    ? – match 0 or 1 occurrences of the pattern to its left

findall With Files

For files, you may be in the habit of writing a loop to iterate over the lines of the file, and you could then call findall() on each line. Instead, let findall() do the iteration for you – much better! Just feed the whole file text into findall() and let it return a list of all the matches in a single step (recall that f.read() returns the whole text of a file in a single string):

  # Open file
  f = open('test.txt', 'r')
  # Feed the file text into findall(); it returns a list of all the found strings
  strings = re.findall(r'some pattern', f.read())

Options

The re functions take options to modify the behavior of the pattern match. The option flag is added as an extra argument to the search() or findall() etc., e.g. re.search(pat, str, re.IGNORECASE).

  • IGNORECASE – ignore upper/lowercase differences for matching, so ‘a’ matches both ‘a’ and ‘A’.
  • DOTALL – allow dot (.) to match newline – normally it matches anything but newline. This can trip you up – you think .* matches everything, but by default it does not go past the end of a line. Note that \s (whitespace) includes newlines, so if you want to match a run of whitespace that may include a newline, you can just use \s*
  • MULTILINE – Within a string made of many lines, allow ^ and $ to match the start and end of each line. Normally ^/$ would just match the start and end of the whole string.

Substitution

The re.sub(pat, replacement, str) function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include ‘\1’, ‘\2’ which refer to the text from group(1), group(2), and so on from the original matching text.

Here’s an example which searches for all the email addresses, and changes them to keep the user (\1) but have yo-yo-dyne.com as the host.

  str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
  ## re.sub(pat, replacement, str) -- returns new string with all replacements,
  ## \1 is group(1), \2 group(2) in the replacement
  print re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', str)
  ## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值