1.在一段字符中找出tip 或 top
import re
st = "top tip taq twp tep"
res = r"t[io]p"
print re.findall(res,st)
输出:[‘top’, ‘tip’]
2.在一段字符在找出‘t?p’(‘?’表示除了’i’或’o’以外的任意字符)
import re
st = "top tip taq twp tep"
res = r"t[^io]p"
print re.findall(res,st)
输出:[‘twp’, ‘tep’]
3.判断字符串s是否是以hello开头
import re
s = "hello world,hello boy"
r = r"^hello"
print re.findall(r,s)
输出为:[‘hello’]
4.判断字符串s是否是以boy开头
import re
s = "hello world,hello boy"
r = r"boy$"
print re.findall(r,s)
输出为:[‘boy’]
5.匹配电话010-开头后面跟着八个数字
import re
s = "010-77189021"
r = r"^010-\d{8}$"
print re.findall(r,s)
输出为:[‘010-77189021’]
6.a后面至少一个b
import re
s = "abbbbbbb"
s1 = "a"
r = r"ab+"
print re.findall(r,s)
输出为:
[‘abbbbbbb’]
[]
7.a后面至少零个b
import re
s = "abbbbbbb"
s1 = "a"
r = r"ab*"
print re.findall(r,s)
print re.findall(r,s1)
[‘abbbbbbb’]
[‘a’]
8.a后面有一个或没有b
import re
s = "abbbbbbb"
s1 = "a"
r = r"ab?"
print re.findall(r,s)
print re.findall(r,s1)
输出为:
[‘ab’]
[‘a’]
9.a后面b的个数非贪婪(如果多个之匹配一个)
import re
s = "abbbbbbb"
r = r"ab+?"
print re.findall(r,s)
输出:[‘ab’]
闲来无聊,附加一个爬虫。
匹配百度主页的所有汉字:
import re
import urllib
import urllib2
def get_html(url):
request = urllib2.Request(url)
response = urllib2.urlopen(request)
html = response.read()
return html
def get_china(url):
html = unicode(get_html(url),'utf8')
r = ur'[\u4e00-\u9fa5]+' #ur,u表示unicode编码,r表示原始字符没有变化
china = re.findall(r,html)
return china
china = get_china("http://www.baidu.com")
for c in china:
print c