Regular expression usage notes:
import re
print(re.search(r'FishC', 'I love FishC.com!'))  # Output: <re.Match object; span=(7, 12), match='FishC'>
# The substring was found starting at index 7 and ending just before index 12.
print(re.search(r'\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d','192.168.111.123'))
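# Output: <re.Match object; span=(0, 15), match='192.168.111.123'>
# Some addresses use fewer digits per field, e.g. 192.168.1.1, which this pattern cannot handle.
# Character classes, written as below, give finer control over each field: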
print(re.search(r'[0-1]\d\d|2[0-4]\d|25[0-5]','188'))
# Matching repetitions
print(re.search(r'ab{3}c','abbbc'))
print(re.search(r'ab{3,10}c','abbbbbbc'))  # anywhere from three to ten b's is accepted
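The two repetition forms above can be compared side by side. This is only a minimal sketch: the sample strings and the names `exact`, `ranged`, `exact_hits` and `ranged_hits` are made up for illustration, and `fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

# {3} means exactly three b's; {3,10} means between three and ten b's.
exact = re.compile(r'ab{3}c')
ranged = re.compile(r'ab{3,10}c')

# Illustrative inputs with 2, 3 and 6 b's respectively.
samples = ['abbc', 'abbbc', 'abbbbbbc']

# Keep only the samples that each pattern accepts in full.
exact_hits = [s for s in samples if exact.fullmatch(s)]
ranged_hits = [s for s in samples if ranged.fullmatch(s)]

print(exact_hits)
print(ranged_hits)
```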
# Try matching an IP address
print(re.search(r'(([0-1]\d\d|2[0-4]\d|25[0-5])\.){3}([0-1]\d\d|2[0-4]\d|25[0-5])','192.168.1.1'))
# The result is None because every field is forced to be three digits: 1.1 would only match if written as 001.001.
# Change it to:
print(re.search(r'(([0-1]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])\.){3}([0-1]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])','192.168.1.1'))
# Success!
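A pattern like the one above can be turned into a whole-string check. The following is a sketch under stated assumptions, not the tutorial's own code: the names `OCTET`, `IP_PATTERN` and `is_ipv4` are hypothetical, and `re.fullmatch` replaces `re.search` so that trailing characters are rejected. The longer alternatives are listed first because `re.search` with the shorter form first would stop at '25' inside '255'.

```python
import re

# One field, 0-255. Longer alternatives come first so that a plain search
# would not settle for '25' when the text says '255'.
OCTET = r'(?:25[0-5]|2[0-4]\d|[01]?\d?\d)'

# Three "field + dot" groups followed by a final field.
IP_PATTERN = re.compile(rf'(?:{OCTET}\.){{3}}{OCTET}')

def is_ipv4(text):
    """Return True if the entire string is a dotted-quad IPv4 address."""
    return IP_PATTERN.fullmatch(text) is not None

print(is_ipv4('192.168.1.1'))
print(is_ipv4('192.168.1.255'))
print(is_ipv4('256.1.1.1'))
print(is_ipv4('192,168.1.1'))
```

Note that this sketch still accepts zero-padded fields such as 001, just as the pattern above does.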
Using regular expressions (because of Tieba's anti-scraping measures this code no longer works; it is kept only as an example):
import urllib.request
import re

def open_url(url):
    # Send a browser-like User-Agent so the request is not rejected outright.
    req = urllib.request.Request(url)
    req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
    page = urllib.request.urlopen(req)
    html = page.read().decode('utf-8')
    return html

def get_imag(html):
    # Capture the src attribute (not "scr") of every post image ending in .jpg.
    p = r'<img class="BDE_Image" src="([^"]+\.jpg)"'
    imglist = re.findall(p,html)
    for each in imglist:
        # Use the last path component of the URL as the local filename.
        filename = each.split("/")[-1]
        urllib.request.urlretrieve(each,filename,None)

if __name__ == '__main__':
    url = 'https://tieba.baidu.com/p/3823765471'
    get_imag(open_url(url))
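Since the page above can no longer be fetched, the extraction step can be tried offline. This is a minimal sketch assuming an image tag layout like the one the pattern expects: the `html` fragment and its example.com URLs are placeholders invented for illustration, not real Tieba markup.

```python
import re

# Hypothetical page fragment; the URLs are placeholders, not real images.
html = (
    '<img class="BDE_Image" src="https://example.com/pic/a1.jpg" width="560">'
    '<img class="other" src="https://example.com/pic/skip.jpg">'
    '<img class="BDE_Image" src="https://example.com/pic/b2.jpg">'
)

# With one capture group, findall returns only the captured part (the URL).
pattern = r'<img class="BDE_Image" src="([^"]+\.jpg)"'
urls = re.findall(pattern, html)

# The last path component of each URL serves as a local filename.
filenames = [u.split('/')[-1] for u in urls]

print(urls)
print(filenames)
```

The tag with class "other" is skipped because the pattern requires the exact `class="BDE_Image"` attribute immediately after `<img`.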
From 小甲鱼. Please support the original!