Not much to say here: it's just a crawler. If you want to use it, go install the scrapy third-party library first.
It scrapes wallpapers from wallhaven. I've checked them myself and the quality is excellent. The spider below crawls the site's toplist, so whenever you're short on wallpapers, give it another run; then off it goes into your bookmarks to gather dust.
Code first.
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
Set this to False in settings.py; nothing much to explain.
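One thing worth adding: the custom pipeline shown further down only runs if it is registered, and ImagesPipeline also expects a storage directory. A minimal sketch of the extra settings.py entries I'd expect; the module path assumes the Scrapy project is named wallhaven, and the delay value is just an example:

# settings.py (sketch, assuming a project called "wallhaven")
# Register the custom pipeline so Scrapy actually calls it for every item.
ITEM_PIPELINES = {
    "wallhaven.pipelines.WallhavenPipeline": 300,
}

# ImagesPipeline refuses to start without a storage directory.
IMAGES_STORE = "./images"

# Optional: let Scrapy space out requests instead of time.sleep() in the spider.
DOWNLOAD_DELAY = 1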
import scrapy

class WallhavenItem(scrapy.Item):
    # single field that holds the wallpaper URL
    img = scrapy.Field()
Next comes items.py: just one field to hold the wallpaper URL. The name doesn't really matter.
import time

import scrapy
from scrapy import Request
from bs4 import BeautifulSoup

from ..items import WallhavenItem

class WallhavenSpiderSpider(scrapy.Spider):
    name = 'wallhaven_spider'
    allowed_domains = ['wallhaven.cc']
    # first 24 pages of the toplist
    start_urls = [f"https://wallhaven.cc/toplist?page={i}" for i in range(1, 25)]

    def parse(self, response):
        # each .thumb block links to one wallpaper's detail page
        soup = BeautifulSoup(response.text, "html.parser")
        rick = soup.select(".thumb")
        for i in rick:
            morty = i.select_one("a").get("href")
            yield Request(url=morty, callback=self.PortalGun)

    def PortalGun(self, response):
        # the last <img> on the detail page is the full-size wallpaper
        soup = BeautifulSoup(response.text, "html.parser")
        meeseeks = soup.select("img")[-1]
        item = WallhavenItem()
        item['img'] = meeseeks.get("src")
        time.sleep(1)  # crude throttle; DOWNLOAD_DELAY in settings.py is the cleaner option
        yield item
Easy to follow: parse the page, grab the target URL, download it, and let pipelines.py handle the output.
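As a side note, the same extraction can be done with Scrapy's built-in selectors, which saves the BeautifulSoup dependency. A rough sketch of drop-in replacements for the two callbacks inside WallhavenSpiderSpider; the CSS selectors mirror the parsing above and are an assumption, not something I've re-verified against the site:

    # Drop-in replacements for the two callbacks, using response.css instead of BeautifulSoup.
    def parse(self, response):
        # each .thumb block holds an <a> linking to the wallpaper detail page
        for thumb in response.css(".thumb"):
            href = thumb.css("a::attr(href)").get()
            if href:
                yield Request(url=href, callback=self.PortalGun)

    def PortalGun(self, response):
        item = WallhavenItem()
        # assumption: the last <img> on the detail page is the full-size image,
        # mirroring what the BeautifulSoup version picks out
        item["img"] = response.css("img::attr(src)").getall()[-1]
        yield item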
Then comes the stuff in pipelines.py.
import requests
import scrapy
from scrapy.pipelines.images import ImagesPipeline

class WallhavenPipeline(ImagesPipeline):
    count = 0

    # send a request for each media resource;
    # item is the item yielded by the spider
    def get_media_requests(self, item, info):
        r = requests.get(item['img'], stream=True)
        print(r.status_code)  # print the status code
        if r.status_code == 200:
            self.count += 1
            # write the response body straight to disk as <count>.png
            with open(f"./{self.count}.png", "wb") as f:
                f.write(r.content)
Honestly, there's nothing fancy here, so feel free to copy it to your heart's content.
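One last optional bit: if you'd rather launch everything from a plain Python script instead of the scrapy crawl command line, something like this should do. The spider name matches the one defined above; the rest is standard Scrapy boilerplate and assumes you run it from inside the project directory:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load the project's settings.py and run the spider by name
process = CrawlerProcess(get_project_settings())
process.crawl("wallhaven_spider")
process.start()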
QWQ