Python regular expressions

Basic patterns for matching single characters

Quoted from Google's Python Class:

https://developers.google.com/edu/python/regular-expressions?hl=zh-CN

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

  • a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
  • . (a period) -- matches any single character except newline '\n'
  • \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
  • \b -- boundary between word and non-word
  • \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
  • \t, \n, \r -- tab, newline, return
  • \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
  • ^ = start, $ = end -- match the start or end of the string
  • \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.

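import re

url = 'http://www.google.com'
# ^ anchors at the start; \w+ stops at ':' because ':' is not a word character.
match = re.search(r'^ht\w+', url)
print(match.group())  # http
# The bracket set also allows @ / : . so the match can run to the final 'm'.
match = re.search(r'^h[\w@/:.]+m$', url)
print(match.group())  # http://www.google.com

To match text that starts with a given character, use ^: for example, r'^p' matches text starting with p. To match text that ends with a given character, use $: for example, r'm$' matches text ending with m.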
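
The single-character patterns listed above can be tried out directly. The following is a minimal sketch assuming Python 3; the sample string and the variable names are made up for illustration and do not come from the Google class.

```python
import re

# Hypothetical sample text, chosen only to exercise the basic patterns.
text = 'Order 66: cost $4.50 total'

digits = re.search(r'\d\d', text).group()            # two consecutive digits
word_then_space = re.search(r'\w\s', text).group()   # word char followed by whitespace
literal_dot = re.search(r'\d\.\d', text).group()     # \. matches a real period
any_char = re.search(r'4.5', text).group()           # . matches any single character
whole_word = re.search(r'\bcost\b', text).group()    # \b marks the word boundaries
starts = re.search(r'^Order', text) is not None      # ^ anchors at the start
ends = re.search(r'total$', text) is not None        # $ anchors at the end

print(digits, repr(word_then_space), literal_dot, any_char, whole_word, starts, ends)
```

Note that \w\s returns only two characters ('r' plus the space after "Order"), which shows that \w matches a single word character and not a whole word.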

Repetition

Things get more interesting when you use +, * and ? to specify repetition in the pattern.

  • + -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
  • * -- 0 or more occurrences of the pattern to its left
  • ? -- match 0 or 1 occurrences of the pattern to its left
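
The three repetition operators can be compared side by side. This is a small sketch assuming Python 3; the test strings are illustrative choices, not part of the original notes.

```python
import re

# + needs at least one 'i' and takes the whole run it finds.
plus = re.search(r'pi+', 'piiig').group()

# * allows zero occurrences, so 'pg' with no 'i' at all still matches.
star = re.search(r'pi*g', 'pg').group()

# ? makes the character to its left optional (0 or 1 occurrences).
optional = [re.search(r'colou?r', s).group() for s in ('color', 'colour')]

# \s* between digits: zero or more whitespace characters are tolerated.
spaced = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx').group()

print(plus, star, optional, repr(spaced))
```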

Leftmost & Largest

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").

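In short: first, the search returns the leftmost match for the pattern; second, it extends that match to cover as much of the string as possible.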
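
Both rules can be observed in a short experiment. The sketch below assumes Python 3, and the strings are made-up examples.

```python
import re

# Leftmost: the first run of i's wins, even though a longer run comes later.
leftmost = re.search(r'i+', 'piigiiii').group()

# Largest: .* is greedy, so it runs on to the last '>' it can reach
# instead of stopping at the first one.
html = '<b>foo</b> and <i>so on</i>'
greedy = re.search(r'<.*>', html).group()

print(leftmost)
print(greedy)
```

The second result is the whole string and not just '<b>', which is a common surprise when a greedy .* is used to pick out a single tag.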

Square Brackets

Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. 
The characters inside square brackets are alternatives: [abc] matches the character a, or b, or c.
import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'[\w-]+@[\w.]+', text)
if match:
    print(match.group())  # alice-b@google.com


(More square-bracket features) You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
If ^ appears at the start of a square-bracket set, it negates the whole set, so [^ab] matches any character other than a and b.
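
Ranges, a literal dash, and an inverted set can each be checked with a one-line search. This is a sketch assuming Python 3; the sample text is invented for illustration, and re.findall is used here only to list every match at once.

```python
import re

# Made-up sample text.
text = 'call 555-0199 or mail bob-smith@example.com'

# [a-z] is a range: a run of lowercase letters.
lowercase_run = re.search(r'[a-z]+', text).group()

# A dash placed last in the set is a literal dash, not a range.
digits_and_dash = re.search(r'[0-9-]+', text).group()

# [^ ] inverts the set: any character that is not a space.
around_at = re.search(r'[^ ]+@[^ ]+', text).group()

# [^ab] matches any single character except 'a' or 'b'.
not_ab = re.findall(r'[^ab]', 'abcabd')

print(lowercase_run, digits_and_dash, around_at, not_ab)
```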

Group Extraction

The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'.

Group extraction: wrap the parts of the pattern that you want to extract in parentheses.
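
Applied to the sample string used earlier in these notes, the grouped pattern might be used as follows. This is a sketch assuming Python 3; the variable names are illustrative.

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    whole = match.group()      # the entire matched text
    username = match.group(1)  # text matched by the first pair of parentheses
    host = match.group(2)      # text matched by the second pair of parentheses
    print(whole, username, host)
```

The parentheses do not change what the pattern matches; they only mark which parts can be read back with group(1) and group(2).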

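.  dot sign
The dot matches any single character except newline.

^
Outside square brackets, ^ anchors the match at the start of the string, so ^a matches an a at the start of the string. As the first character inside square brackets it negates the set: [^0-9] matches anything except a digit, and [^a] matches any character that is not a.
[^ab]
Anything that is not a and also not b.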



re.sub
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
re.sub returns the string that results from the replacement.

For example, the following replaces every occurrence of mo in the string with mi:
import re

s = 'molmormo'
s1 = re.sub(r'mo', 'mi', s)
print(s1)
The output is:
milmirmi
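
The count and flags parameters in the signature above, and the "returned unchanged" case, can be illustrated on the same string. This is a sketch assuming Python 3; the variable names and the mixed-case input are illustrative.

```python
import re

s = 'molmormo'

# count=1 replaces only the leftmost occurrence.
first_only = re.sub(r'mo', 'mi', s, count=1)

# If the pattern is not found, the string comes back unchanged.
unchanged = re.sub(r'xy', 'mi', s)

# flags=re.IGNORECASE lets 'mo' also match 'Mo'.
ignore_case = re.sub(r'mo', 'mi', 'MolMor', flags=re.IGNORECASE)

print(first_only, unchanged, ignore_case)
```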





