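First, analyze the page source of the zimuzu (字幕组) site.
An image appears in the source as <img src="http://tu.jstucdn.com/ftp/2018/1113/1e9afeab694d5fb5061fcb618c28b138.jpg">
so every image link has the form src="balabala.jpg"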
Define a regular expression for that form and compile it:
reg = r'src="(.+?\.jpg)"'
reg_img = re.compile(reg)
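As a quick check of the pattern, here is a minimal sketch that applies it to the sample tag from the page source. The non-greedy `.+?` stops at the first `.jpg"` after `src="`, and the parentheses capture only the URL. The names `sample` and `links` are just for illustration.

```python
import re

reg = r'src="(.+?\.jpg)"'
reg_img = re.compile(reg)

# The sample <img> tag taken from the page source
sample = '<img src="http://tu.jstucdn.com/ftp/2018/1113/1e9afeab694d5fb5061fcb618c28b138.jpg">'

# findall returns only the captured group, i.e. the URL between the quotes
links = reg_img.findall(sample)
print(links)
```

Because only the group is returned, each item in the result can be passed straight to a download call.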
Finally, find every link of this form in the page source and download it.
The full source:
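# Fetch the page and save its images
import os
import re
from urllib import request


def get_imgs(url, path):
    # Open the page and read the raw bytes
    page = request.urlopen(url)
    htmlcode = page.read()
    # Compile the regular expression that matches .jpg links
    reg = r'src="(.+?\.jpg)"'
    reg_img = re.compile(reg)
    # Decode the bytes as UTF-8 so the str pattern can be applied
    html = htmlcode.decode('utf-8')
    imgs = reg_img.findall(html)
    for i, img_url in enumerate(imgs):
        try:
            # Save each image as <index>.jpg inside the target folder
            request.urlretrieve(img_url, os.path.join(path, '%s.jpg' % i))
        except Exception as err:
            print(img_url, 'failed to save:', err)


urlstr = 'http://www.zimuzu.tv/'
path = r'E:\Workspace Pycharm\spyder\spyderfiles\zimuzu'
get_imgs(urlstr, path)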
One image failed to save.
You can see that 43.jpg is the one that was not saved...
I ran into a few errors along the way:
TypeError: write() argument must be str, not bytes
pageFile = open('E:\\WorkSpace Spyder\\Spyderfile\\%s.txt'%filename,'w')
Fix: change the file mode to wb+ so the data is written in binary mode.

TypeError: cannot use a string pattern on a bytes-like object
Fix: decode the bytes before matching: html=html.decode('utf-8')  # Python 3, decode as UTF-8

AttributeError: module 'urllib' has no attribute 'urlopen'
Fix: change urllib to urllib.request.
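
The first two errors both come from mixing bytes and str in Python 3. Here is a minimal sketch of the two fixes. It assumes a hard-coded bytes value in place of a real `page.read()` result and a temporary file in place of the original path; both are stand-ins for illustration only.

```python
import os
import re
import tempfile

# Stand-in for page.read(), which returns bytes in Python 3
htmlcode = b'<img src="a.jpg"><img src="b.jpg">'

# Writing bytes needs a binary mode such as 'wb'; text mode 'w' raises TypeError
tmp_path = os.path.join(tempfile.mkdtemp(), 'page.txt')
with open(tmp_path, 'wb') as page_file:
    page_file.write(htmlcode)

# A str pattern cannot be applied to bytes, so decode first
html = htmlcode.decode('utf-8')
imgs = re.findall(r'src="(.+?\.jpg)"', html)
print(imgs)
```

The third error goes away once `urlopen` is called through `urllib.request`, as the `from urllib import request` line in the script does.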