Happy New Year, everyone! Wishing you all great fortune in the Year of the Ox!
I received a lot of New Year greetings from friends over the holiday and couldn't reply to each one, so please forgive me!
It has been a long time since I wrote a scraper and I am out of practice, so I picked a practice site mentioned on the 52pojie forum: wallhaven, a wallpaper site outside China. Using its toplist of popular images as an example, let's revisit Python image scraping. If you are interested, feel free to try it yourself!
Target URL: https://wallhaven.cc/toplist
A quick look at the site makes the pagination scheme obvious:
https://wallhaven.cc/toplist?page=1
https://wallhaven.cc/toplist?page=2
https://wallhaven.cc/toplist?page=3
Here we can use Python string formatting to construct the list-page URLs:
f"https://wallhaven.cc/toplist?page={pagenum}"
where pagenum is the page number.
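Put together, building the first few list-page URLs is a one-liner:

```python
# Build the list-page URLs for the first few pages of the toplist.
base = "https://wallhaven.cc/toplist"
urls = [f"{base}?page={pagenum}" for pagenum in range(1, 4)]
for u in urls:
    print(u)
```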
Looking further at the image data:
Thumbnail URL: https://th.wallhaven.cc/small/rd/rddgwm.jpg
Full-size URL: https://w.wallhaven.cc/full/rd/wallhaven-rddgwm.jpg
Similarly, we can use Python string operations to construct the full-size image URL (limiting the replacements to the subdomain and the path segment, so an image ID that happens to contain "th" or "small" is not corrupted):
img = imgsrc.replace("th.", "w.", 1).replace("/small/", "/full/", 1)
imgs = img.split('/')
imgurl = f"{'/'.join(imgs[:-1])}/wallhaven-{imgs[-1]}"
One possible bug here: the thumbnail may use the .jpg extension while the full-size image is actually a .png. If you then request the full image with the .jpg extension, the download will fail. My workaround is crude and may still have bugs; the foolproof method would be to visit each detail page and grab the real full-image URL from there.
If you have a better approach, feel free to share!
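One way to paper over the extension mismatch without visiting detail pages is to derive both candidate full-image URLs up front and try them in order at download time. This is only a sketch: `full_image_candidates` is a name I made up, and it assumes the URL patterns observed above hold.

```python
from urllib.parse import urlsplit

def full_image_candidates(thumb_url):
    """Given a small-thumbnail URL, return the full-size URL with both
    possible extensions, .jpg first (hypothetical helper)."""
    parts = urlsplit(thumb_url)
    # th.wallhaven.cc -> w.wallhaven.cc, /small/ -> /full/
    host = parts.netloc.replace("th.", "w.", 1)
    path = parts.path.replace("/small/", "/full/", 1)
    folder, name = path.rsplit("/", 1)
    stem = name.rsplit(".", 1)[0]          # drop the extension
    base = f"{parts.scheme}://{host}{folder}/wallhaven-{stem}"
    return [f"{base}.jpg", f"{base}.png"]

print(full_image_candidates("https://th.wallhaven.cc/small/rd/rddgwm.jpg"))
```

At download time you would request the first candidate and fall back to the second on a 404.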
First, the basic version:
# Scrape and download wallhaven toplist images
# author WeChat: huguo00289
# -*- coding: utf-8 -*-
import requests
from lxml import etree
from fake_useragent import UserAgent

url = "https://wallhaven.cc/toplist?page=1"
ua = UserAgent().random
html = requests.get(url=url, headers={'user-agent': ua}, timeout=6).content.decode('utf-8')
tree = etree.HTML(html)
imgsrcs = tree.xpath('//ul/li/figure/img/@src')
print(len(imgsrcs))
print(imgsrcs)
i = 1
for imgsrc in imgsrcs:
    # th.wallhaven.cc/small/... -> w.wallhaven.cc/full/...
    img = imgsrc.replace("th.", "w.", 1).replace("/small/", "/full/", 1)
    imgs = img.split('/')
    imgurl = f"{'/'.join(imgs[:-1])}/wallhaven-{imgs[-1]}"
    print(imgurl)
    try:
        r = requests.get(url=imgurl, headers={'user-agent': ua}, timeout=6)
        r.raise_for_status()  # a wrong extension yields a 404, which triggers the fallback
        with open(f'{i}.jpg', 'wb') as f:
            f.write(r.content)
        print(f"Saved image {i}.jpg!")
    except Exception as e:
        print(f"Image download failed, error: {e}")
        # Fall back to the .png extension
        imgurl = imgurl.replace('.jpg', '.png')
        r = requests.get(url=imgurl, headers={'user-agent': ua}, timeout=6)
        with open(f'{i}.png', 'wb') as f:
            f.write(r.content)
        print(f"Saved image {i}.png!")
    i = i + 1
The optimized version adds a class, multithreading, and timeout retry handling:
# Scrape and download wallhaven toplist images
# author WeChat: huguo00289
# -*- coding: utf-8 -*-
import requests
from lxml import etree
from fake_useragent import UserAgent
import time
from requests.adapters import HTTPAdapter
import threading


class Top(object):
    def __init__(self):
        self.ua = UserAgent().random
        self.url = "https://wallhaven.cc/toplist?page="

    def get_response(self, url):
        response = requests.get(url=url, headers={'user-agent': self.ua}, timeout=6)
        return response

    def get_third(self, url, num):
        # Session that retries failed connections up to num times
        s = requests.Session()
        s.mount('http://', HTTPAdapter(max_retries=num))
        s.mount('https://', HTTPAdapter(max_retries=num))
        print(time.strftime('%Y-%m-%d %H:%M:%S'))
        try:
            r = s.get(url=url, headers={'user-agent': self.ua}, timeout=5)
            return r
        except requests.exceptions.RequestException as e:
            print(e)
            print(time.strftime('%Y-%m-%d %H:%M:%S'))

    def get_html(self, response):
        html = response.content.decode('utf-8')
        tree = etree.HTML(html)
        return tree

    def parse(self, tree):
        imgsrcs = tree.xpath('//ul/li/figure/img/@src')
        print(len(imgsrcs))
        return imgsrcs

    def get_imgurl(self, imgsrc):
        # th.wallhaven.cc/small/... -> w.wallhaven.cc/full/...
        img = imgsrc.replace("th.", "w.", 1).replace("/small/", "/full/", 1)
        imgs = img.split('/')
        imgurl = f"{'/'.join(imgs[:-1])}/wallhaven-{imgs[-1]}"
        print(imgurl)
        return imgurl

    def down(self, imgurl, imgname):
        r = self.get_third(imgurl, 3)
        if r is None:
            raise requests.exceptions.RequestException(f"request failed: {imgurl}")
        r.raise_for_status()  # a wrong extension yields a 404, handled by the caller
        with open(f'{imgname}', 'wb') as f:
            f.write(r.content)
        print(f"Saved image {imgname}!")
        time.sleep(2)

    def downimg(self, imgsrc, pagenum, i):
        imgurl = self.get_imgurl(imgsrc)
        imgname = f'{pagenum}-{i}{imgurl[-4:]}'
        try:
            self.down(imgurl, imgname)
        except Exception as e:
            print(f"Image download failed, error: {e}")
            # Retry with the other extension
            if imgname.endswith(".jpg"):
                ximgname = f'{pagenum}-{i}.png'
                imgurl = imgurl.replace('.jpg', '.png')
            else:
                ximgname = f'{pagenum}-{i}.jpg'
                imgurl = imgurl.replace('.png', '.jpg')
            self.down(imgurl, ximgname)

    def get_topimg(self, pagenum):
        # Single-threaded download of one list page
        url = f'{self.url}{pagenum}'
        print(url)
        response = self.get_response(url)
        tree = self.get_html(response)
        imgsrcs = self.parse(tree)
        i = 1
        for imgsrc in imgsrcs:
            self.downimg(imgsrc, pagenum, i)
            i = i + 1

    def get_topimgs(self, pagenum):
        # Multithreaded download of one list page: one thread per image
        url = f'{self.url}{pagenum}'
        print(url)
        response = self.get_response(url)
        tree = self.get_html(response)
        imgsrcs = self.parse(tree)
        i = 1
        threadings = []
        for imgsrc in imgsrcs:
            t = threading.Thread(target=self.downimg, args=(imgsrc, pagenum, i))
            i = i + 1
            threadings.append(t)
            t.start()
        for x in threadings:
            x.join()
        print("Multithreaded image download complete")

    def main(self):
        num = 3
        for pagenum in range(1, num + 1):
            print(f">> Scraping page {pagenum} image data..")
            self.get_topimgs(pagenum)


if __name__ == '__main__':
    spider = Top()
    spider.main()
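As an aside, the manual thread bookkeeping in get_topimgs can be replaced with the standard library's concurrent.futures, which caps the number of worker threads instead of starting one thread per image. A minimal sketch, with a made-up `fake_download` standing in for the real downimg call:

```python
# Bounded thread pool instead of manual threading.Thread bookkeeping.
from concurrent.futures import ThreadPoolExecutor

def fake_download(imgurl):
    # Placeholder for a real requests.get + file write.
    return f"saved {imgurl.rsplit('/', 1)[-1]}"

urls = [
    "https://w.wallhaven.cc/full/rd/wallhaven-rddgwm.jpg",
    "https://w.wallhaven.cc/full/ab/wallhaven-abcdef.png",
]
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order and joins all workers on exit
    results = list(pool.map(fake_download, urls))
print(results)
```

With max_workers=4 at most four images download at once, which is gentler on the site than one thread per image.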
Scraping and download results:
Bonus: the source code is packaged up, along with two more multithreaded versions and one multiprocessing version. If you are interested, especially in studying multithreading, reply "多线程" in the official account backend to get it!
·················END·················
Hi, I'm 二大爷 (official account ID: eryeji).