Python | Scraping Wallhaven Wallpapers with Python and Uploading Them to Baidu Netdisk
Let me recommend a fantastic wallpaper site: wallhaven.
When I first discovered it, I was blown away — it felt like stumbling onto a treasure trove. The browsing experience is silky smooth, and every wallpaper is free to download. Compared with similar domestic wallpaper sites, it is remarkably generous.
With this many wallpapers, Python is the natural way to download them.
Where to store them? Local disk space runs out fast, so a cloud drive it is.
How to keep crawling continuously? Deploy to a server.
Writing the crawler
See my earlier blog post.
Uploading to Baidu Netdisk
Since the wallpapers need to go to Baidu Netdisk, add the following code:
```python
from bypy import ByPy


class Adapter:
    """
    Adapter around bypy.
    Prerequisite: `bypy info` has been run once and login succeeded.
    """
    def __init__(self):
        self._bp = ByPy()

    def upload(self, localpath, remotepath, **kwargs):
        """
        Upload a local file to Baidu Netdisk.
        :param localpath: path of the local file
        :param remotepath: e.g. /videos, which maps to /apps/bypy/videos on the drive
        """
        self._bp.upload(localpath=localpath, remotepath=remotepath, **kwargs)
```
!!! Note: this code only works after `bypy info` has been run successfully once (its first run walks you through the OAuth login).
Also modify the function down_pic(image_url):

```python
def down_pic(image_url):
    try:
        # image_title is a module-level variable set in the download loop
        path = 'temporary data/{}'.format(image_title.split('/')[-1] + image_url.split('/')[-1])
        print(path)
        opener = request.build_opener()
        opener.addheaders = [('User-Agent',
                              'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36')]
        request.install_opener(opener)
        request.urlretrieve(image_url, path)                            # download the image locally
        adapter.upload(localpath=path, remotepath='image/wallhaven/')   # push it to Baidu Netdisk
        os.remove(path)                                                 # free local disk space
    except Exception as m:
        print(m)
```
- !!! Note: create the folder `temporary data` in the program's working directory beforehand.
- **!!! Note:** create the folder `image/wallhaven/` in Baidu Netdisk beforehand.
- os.remove(): once a wallpaper has been uploaded, the local copy can go (if you have enough local storage, just delete this line).
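To make the naming scheme concrete, here is a small sketch of how `down_pic` builds the local path — the title and URL below are made-up sample values, not real Wallhaven data:

```python
# hypothetical sample values, just to illustrate the filename format
image_title = 'General 1920x1080 mountains landscape'
image_url = 'https://w.wallhaven.cc/full/x8/wallhaven-x8g1gz.jpg'

# same expression as in down_pic: title + last URL segment
path = 'temporary data/{}'.format(image_title.split('/')[-1] + image_url.split('/')[-1])
print(path)  # temporary data/General 1920x1080 mountains landscapewallhaven-x8g1gz.jpg
```

The title usually contains no `/`, so `split('/')[-1]` leaves it intact; the last URL segment contributes the unique `wallhaven-<id>.jpg` part, which keeps filenames from colliding.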
The final code
```python
from requests_html import HTMLSession  # handles both requests and HTML parsing; more concise than separate libraries
from urllib import request             # only used here to download and save images
import os
from bypy import ByPy


class Adapter:
    """
    Adapter around bypy.
    Prerequisite: `bypy info` has been run once and login succeeded.
    """
    def __init__(self):
        self._bp = ByPy()

    def upload(self, localpath, remotepath, **kwargs):
        """
        Upload a local file to Baidu Netdisk.
        :param localpath: path of the local file
        :param remotepath: e.g. /videos, which maps to /apps/bypy/videos on the drive
        """
        self._bp.upload(localpath=localpath, remotepath=remotepath, **kwargs)


# request headers, used to get past basic anti-scraping checks
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}

session = HTMLSession()
urls = []
num_int = 2  # crawl listing pages 1 .. num_int - 1
for i in range(1, num_int):
    # r = session.get('https://wallhaven.cc/toplist?page={}'.format(i))
    try:
        r = session.get('https://wallhaven.cc/search?categories=110&purity=100&topRange=1y&sorting=toplist&order=desc&page={}'.format(i))
        urls.extend(list(r.html.links))
        print(i, len(list(r.html.links)))
    except Exception as m:
        print(m)
print(len(urls))

adapter = Adapter()


def down_pic(image_url):
    try:
        # image_title is a module-level variable set in the loop below
        path = 'temporary data/{}'.format(image_title.split('/')[-1] + image_url.split('/')[-1])
        print(path)
        opener = request.build_opener()
        opener.addheaders = [('User-Agent',
                              'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36')]
        request.install_opener(opener)
        request.urlretrieve(image_url, path)                            # download the image locally
        adapter.upload(localpath=path, remotepath='image/wallhaven/')   # push it to Baidu Netdisk
        os.remove(path)                                                 # free local disk space
    except Exception as m:
        print(m)


for url in urls:
    try:
        session1 = HTMLSession()
        r1 = session1.get(url)
        sr = r1.html.find('img#wallpaper', first=True)  # None on non-wallpaper pages, which raises below
        image_url = sr.attrs['src']
        image_title = sr.attrs['alt']
        print(image_url)
        print(image_title)
        down_pic(image_url)
    except BaseException as e:
        print(e)
```
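One rough edge in the code above: `r.html.links` returns every link on a listing page — tags, pagination, uploader profiles — not just wallpapers, and non-wallpaper pages are only weeded out later when `img#wallpaper` comes back `None`. A small sketch of filtering the list up front, assuming wallpaper detail URLs keep their current `/w/<id>` shape (the sample links are made up):

```python
import re

# hypothetical mix of links, like what r.html.links might return
links = [
    'https://wallhaven.cc/w/x8g1gz',
    'https://wallhaven.cc/tag/323',
    'https://wallhaven.cc/search?page=2',
    'https://wallhaven.cc/w/j5m2vy',
]

# keep only wallpaper detail pages: .../w/<alphanumeric id>
wallpaper_links = [u for u in links if re.search(r'wallhaven\.cc/w/\w+$', u)]
print(wallpaper_links)  # ['https://wallhaven.cc/w/x8g1gz', 'https://wallhaven.cc/w/j5m2vy']
```

Filtering this way saves one wasted request per non-wallpaper link, which adds up across thousands of pages.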
Deploying to a server
- Log in to the server
- Upload the script
- Create the folder `temporary data`
- Run `nohup python3 your_script.py &`
- Go to sleep in style, and wake up to a drive full of wallpapers
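The steps above boil down to a short shell session once you are logged in; `wallhaven_crawler.py` is a placeholder name, and redirecting output to a log file makes progress easy to check later:

```shell
mkdir -p "temporary data"                               # folder the script downloads into
nohup python3 wallhaven_crawler.py > crawl.log 2>&1 &   # keeps running after you log out
tail -n 20 crawl.log                                    # peek at progress any time
```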
The results
Download the scraped wallpapers here:
Link (extraction code: 7p8q)
Over two thousand wallpapers scraped in one night, and the crawler is still going.