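Regular expressions are worth learning well. Libraries such as BeautifulSoup are simple and convenient,
but some extraction problems are beyond them, and that is when regular expressions become necessary.
The examples below walk through the main functions of Python's re module: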
re.match
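Matches from the start of the string.
The most conventional kind of match:
    import re

    content = "Hello 123 4567 World_This is a Regex Demo"
    # Raw string (r'...') so that \s, \d and \w reach the regex engine unchanged
    result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{8}.*Demo$', content)
    print(result.group())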
Loose matching with .*:
    import re

    content = "Hello 123 4567 World_This is a Regex Demo"
    # .* stands in for everything between "Hello" and "Demo"
    result = re.match(r'^Hello.*Demo$', content)
    print(result.group())
Capturing a target:
    import re

    content = "Hello 1234567 World_This is a Regex Demo"
    # The parentheses capture the digits as group 1
    result = re.match(r'^Hello\s(\d+)\sWorld.*Demo$', content)
    print(result.group())   # Hello 1234567 World_This is a Regex Demo
    print(result.group(1))  # 1234567
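When a pattern has more than one group, the match object can also return all groups at once and report where a group was found. The following is a small sketch of that; the two-group pattern and the variable names are chosen here for illustration and are not from the original examples.

```python
import re

content = "Hello 1234567 World_This is a Regex Demo"
# Two capturing groups: the digits and the word that follows them
result = re.match(r'^Hello\s(\d+)\s(\w+)', content)

all_groups = result.groups()   # every captured group as a tuple
first_span = result.span(1)    # (start, end) indices of group 1

print(all_groups)  # ('1234567', 'World_This')
print(first_span)  # (6, 13)
```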
Greedy matching (.* matches as many characters as possible, so it swallows 123456 and leaves only 7 for the group):
    import re

    content = "Hello 1234567 World_This is a Regex Demo"
    result = re.match(r'^He.*(\d+).*Demo$', content)
    print(result.group())   # Hello 1234567 World_This is a Regex Demo
    print(result.group(1))  # 7
Non-greedy matching (.*? matches as few characters as possible):
    import re

    content = "Hello 1234567 World_This is a Regex Demo"
    result = re.match(r'^He.*?(\d+).*Demo$', content)
    print(result.group())   # Hello 1234567 World_This is a Regex Demo
    print(result.group(1))  # 1234567
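One caveat with the non-greedy form, sketched here as an illustration rather than as part of the original examples: when .*? sits at the very end of a pattern, "as few characters as possible" means none at all. The variable names lazy_tail and greedy_tail are made up for this demo.

```python
import re

content = "Hello 1234567 World_This is a Regex Demo"

# Lazy group at the end of the pattern: nothing forces it to grow, so it stays empty
lazy_tail = re.match(r'^He.*?(\d+)\s(.*?)', content).group(2)

# Greedy group at the end of the pattern: it takes the rest of the string
greedy_tail = re.match(r'^He.*?(\d+)\s(.*)', content).group(2)

print(repr(lazy_tail))    # ''
print(repr(greedy_tail))  # 'World_This is a Regex Demo'
```

So a non-greedy quantifier is a good default in the middle of a pattern, but a trailing capture usually wants the greedy form or an explicit end marker.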
Match flags:
Note that . does not match a newline by default, so (.*) stops at a line break:
    import re

    content = "Hello 1234567 World_This \n is a Regex Demo"
    result = re.match(r'^He.*?(\d+).*?Demo$', content)
    print(result)  # None
Passing a match flag solves this:
    import re

    content = "Hello 1234567 World_This \n is a Regex Demo"
    # re.S (re.DOTALL) makes . match newlines as well
    result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)
    print(result.group(1))  # 1234567
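Escaping:
Trying to match a string like the one below fails, because $ and . are special characters in a pattern:
    import re

    content = "price is $5.00"
    # Unescaped, $ is the end-of-string anchor, so the match fails
    result = re.match('price is $5.00', content)
    print(result)  # None
Escaping the special characters with a backslash fixes it:
    import re

    content = "price is $5.00"
    result = re.match(r'price is \$5\.00', content)
    print(result.group())  # price is $5.00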
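When the text to match is a fixed literal, the standard library's re.escape can add the backslashes automatically. This is an alternative to the hand-escaped pattern above, shown as a minimal sketch; the variable names are illustrative.

```python
import re

content = "price is $5.00"

# re.escape backslash-escapes the characters that are special in a pattern
pattern = re.escape("price is $5.00")

result = re.match(pattern, content)
print(result.group())  # price is $5.00

# The escaped dot is now a literal dot, so other characters no longer match it
mismatch = re.match(pattern, "price is $5x00")
print(mismatch)  # None
```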
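Summary: prefer loose patterns with .*, use () to capture the target, prefer the non-greedy form, and pass re.S when the text contains newlines.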
re.search
Scans the whole string and returns the first successful match.
    import re

    content = "Hello 1234567 World_This \n is a Regex Demo"
    result = re.match(r'llo.*?(\d+).*?Demo$', content, re.S)
    print(result)  # None
match fails whenever the pattern does not begin at the first character of the string.
With search instead:
    import re

    content = "Hello 1234567 World_This is a Regex Demo"
    result = re.search(r'llo.*?(\d+).*?Demo$', content, re.S)
    print(result.group())  # llo 1234567 World_This is a Regex Demo
So, for convenience, use search rather than match whenever you can.
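Both match and search return None when nothing matches, so calling .group() directly on the result raises AttributeError in that case. A minimal sketch of a guard follows; the helper name first_number is hypothetical.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    result = re.search(r'\d+', text)
    if result is None:
        # No match: avoid calling .group() on None
        return None
    return result.group()

print(first_number("Hello 1234567 World_This"))  # 1234567
print(first_number("no digits here"))            # None
```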
An example using a fragment of HTML:
    import re

    html = ('<span id="js_msgvoice_reading_title"></span> '
            ' <span class="ic_voice">你好</span>正在朗读文字'
            '<span class="icon_more"></span>')
    result = re.search(r'<span.*?ice">(.*?)</span>(.*?)<span', html, re.S)
    print(result.group(1))  # 你好
    print(result.group(2))  # 正在朗读文字
There is also a findall method, which returns every match. Continuing with the HTML above:
    import re

    html = ('<span id="js_msgvoice_reading_title"></span> '
            ' <span class="ic_voice">你好</span>正在朗读文字'
            '<span class="icon_more"></span>'
            '<span id="js_msgvoice_reading_title"></span> '
            ' <span class="ic_voice">我好</span>正在学习'
            '<span class="icon_more"></span>'
            '<span id="js_msgvoice_reading_title"></span> '
            ' <span class="ic_voice">他好</span>正在玩游戏'
            '<span class="icon_more"></span>')
    result = re.findall(r'<span.*?ice">(.*?)</span>(.*?)<span', html, re.S)
    print(result)
    # [('你好', '正在朗读文字'), ('我好', '正在学习'), ('他好', '正在玩游戏')]
    for item in result:
        print(item[1])
    # 正在朗读文字
    # 正在学习
    # 正在玩游戏
Make sure you understand the quantifiers: * means zero or more, ? means zero or one, and *? is the non-greedy form of *. They can be combined into more complex regular expressions.
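The difference between these quantifiers is easiest to see on tiny inputs. The sample strings and variable names below are made up for this sketch.

```python
import re

text = "a ab abbb"

# * : the preceding item may appear zero or more times
star = re.findall(r'ab*', text)
# ? : the preceding item may appear zero times or once
optional = re.findall(r'ab?', text)

tags = "<b>x</b>"
# .* is greedy: it runs to the last '>' in the string
greedy = re.findall(r'<.*>', tags)
# .*? is non-greedy: it stops at the first '>' it can
lazy = re.findall(r'<.*?>', tags)

print(star)      # ['a', 'ab', 'abbb']
print(optional)  # ['a', 'ab', 'ab']
print(greedy)    # ['<b>x</b>']
print(lazy)      # ['<b>', '</b>']
```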
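re.sub
String replacement:
    import re

    content = "Hello World 6666123 re.sub"
    # Replace every run of digits with "hey"
    content = re.sub(r"\d+", "hey", content)
    print(content)  # Hello World hey re.sub
Going a little further, the replacement can refer back to a captured group:
    import re

    content = "Hello World 6666123 re.sub"
    # \1 in the replacement stands for whatever group 1 matched
    content = re.sub(r"(\d+)", r"\1 hey", content)
    print(content)  # Hello World 6666123 hey re.sub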
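One of the handiest uses of re.sub is replacing the redundant tags in a complicated piece of HTML with empty strings,
which makes a later findall much simpler to write.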
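A minimal sketch of that clean-then-extract idea, assuming a made-up HTML fragment in which stray <b> and <i> tags get in the way of the link text. The fragment, the patterns, and the variable names are illustrative only.

```python
import re

# Hypothetical fragment: inline formatting tags are scattered through the link text
html = (
    '<li><a href="/a"><b>Regex</b> basics</a></li>'
    '<li><a href="/b">Advanced <i>patterns</i></a></li>'
)

# Step 1: replace the unwanted tags with empty strings
cleaned = re.sub(r'</?(?:b|i)>', '', html)

# Step 2: a simple findall now captures each link and its plain text
items = re.findall(r'<a href="(.*?)">(.*?)</a>', cleaned)

print(items)  # [('/a', 'Regex basics'), ('/b', 'Advanced patterns')]
```

Without the first step, the second group would still contain the <b> and <i> markup and would need further cleanup.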
re.compile
    import re

    content = "Hello World 6666123 re.sub"
    # Compile the pattern and its flags once into a reusable pattern object
    pattern = re.compile("He.*sub", re.S)
    result = re.match(pattern, content)
    print(result.group())  # Hello World 6666123 re.sub
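A compiled pattern object also has its own findall, sub, match and search methods, which is convenient when the same pattern is applied to many strings. A short sketch; the list of sample lines is made up for illustration.

```python
import re

# Compile once, reuse for every line
number = re.compile(r'\d+')

lines = ["Hello World 6666123 re.sub", "price is 5 dollars", "no digits"]

found = [number.findall(line) for line in lines]
replaced = number.sub("hey", lines[0])

print(found)     # [['6666123'], ['5'], []]
print(replaced)  # Hello World hey re.sub
```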
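Next, a practical example:
Fetching book information from Douban Books:
This uses the requests library:
    import re
    import requests

    content = requests.get("https://book.douban.com/").text
    # The pattern follows the page markup as it was when this was written;
    # re.S lets .*? run across line breaks in the HTML
    pattern = re.compile(r'<li.*?cover.*?href="(.*?)".*?title="(.*?)".*?more-meta.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>', re.S)
    results = re.findall(pattern, content)
    for result in results:
        url, name, author, date = result
        # Strip all whitespace from the captured author and date
        author = re.sub(r"\s", "", author)
        date = re.sub(r"\s", "", date)
        print(url, name, author, date)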
The output looks like this (excerpt):