Python from Scratch: Multithreaded Scraping of HD Wallpapers from the Honor of Kings (王者荣耀) Official Website
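1. Goals
- Quickly scrape Honor of Kings HD wallpapers using multiple threads and queues
- Structure the program on the producer-consumer pattern, with data buffered in two queues
- Save the wallpaper files in directories named after each hero
- Allow files that failed to download to be downloaded again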
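The two-queue producer-consumer layout can be tried without touching the network. The following is a minimal sketch under stated assumptions: `fake_fetch` is a hypothetical stand-in for the HTTP request, and `produce`, `consume`, and `run_pipeline` are illustrative names that do not appear in the script. It also stops the consumers with a `None` sentinel, which is a different shutdown technique from the timeout used in the script.

```python
import queue
import threading


def fake_fetch(page):
    # Hypothetical stand-in for an HTTP request: each page yields two items
    return ['page%d-img%d' % (page, i) for i in range(2)]


def produce(page_queue, image_queue):
    # Producer: drain page_queue and push the extracted items to image_queue
    while True:
        try:
            page = page_queue.get_nowait()
        except queue.Empty:
            return
        for item in fake_fetch(page):
            image_queue.put(item)


def consume(image_queue, results, lock):
    # Consumer: handle items until it receives the None sentinel
    while True:
        item = image_queue.get()
        if item is None:
            return
        with lock:
            results.append(item)


def run_pipeline(pages, n_producers=2, n_consumers=3):
    page_queue = queue.Queue()
    image_queue = queue.Queue(10)  # bounded buffer between the two stages
    for page in pages:
        page_queue.put(page)
    results, lock = [], threading.Lock()
    producers = [threading.Thread(target=produce, args=(page_queue, image_queue))
                 for _ in range(n_producers)]
    consumers = [threading.Thread(target=consume, args=(image_queue, results, lock))
                 for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    # All producers are done: send one sentinel per consumer to stop it
    for _ in consumers:
        image_queue.put(None)
    for t in consumers:
        t.join()
    return sorted(results)


if __name__ == '__main__':
    print(run_pipeline(range(3)))
```

The first queue holds the work still to be fetched and the second holds the work still to be saved, so the two stages can run at different speeds.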
2. Code example
import os
import queue
import threading
from urllib import parse, request

import requests

# Browser-like User-Agent sent with the requests to the JSON API
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}


class Producer(threading.Thread):
    """Fetches list pages and queues one download task per image."""

    def __init__(self, page_queue, image_queue, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.image_queue = image_queue

    def run(self) -> None:
        while True:
            try:
                # Checking empty() first is racy: another producer may take
                # the last page in between, so rely on queue.Empty instead
                page_url = self.page_queue.get_nowait()
            except queue.Empty:
                break
            resp = requests.get(page_url, headers=headers)
            result = resp.json()
            datas = result['List']
            for data in datas:
                image_urls = extract_images(data)
                # The name is URL-encoded; '1:1' is removed because ':' is
                # not a valid character in Windows file names
                name = parse.unquote(data['sProdName']).replace('1:1', '').strip()
                dir_path = os.path.join('image', name)
                if not os.path.exists(dir_path):
                    # exist_ok avoids an error if another producer creates it first
                    os.makedirs(dir_path, exist_ok=True)
                for index, image_url in enumerate(image_urls):
                    self.image_queue.put({
                        'image_url': image_url,
                        'image_path': os.path.join(dir_path, '%d.jpg' % (index + 1)),
                    })
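

class Consumer(threading.Thread):
    """Takes download tasks from image_queue and saves the files."""

    def __init__(self, image_queue, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.image_queue = image_queue

    def run(self) -> None:
        while True:
            try:
                # Stop once no task has arrived for 5 seconds
                image_obj = self.image_queue.get(timeout=5)
            except queue.Empty:
                break
            image_url = image_obj.get('image_url')
            image_path = image_obj.get('image_path')
            # Skip files that already exist, so a rerun only fetches what is missing
            if not os.path.exists(image_path):
                try:
                    request.urlretrieve(image_url, image_path)
                    print(image_path, 'downloaded!')
                except Exception:
                    # Remove any partial file so that a rerun retries this image
                    if os.path.exists(image_path):
                        os.remove(image_path)
                    print(image_path, 'download failed!')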
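

def extract_images(data):
    """Collect the image URLs stored in sProdImgNo_1 .. sProdImgNo_8."""
    image_urls = []
    for x in range(1, 9):
        # The links are URL-encoded; swapping '200' for '0' is meant to request
        # the full-size image (note that it replaces every '200' in the URL)
        image_url = parse.unquote(data['sProdImgNo_%d' % x]).replace('200', '0')
        image_urls.append(image_url)
    return image_urls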


def main():
    if not os.path.exists('./image'):
        os.mkdir('./image')
    page_queue = queue.Queue(23)     # one slot per list page
    image_queue = queue.Queue(1000)  # bounded: put() blocks while it is full
    for x in range(23):
        page_url = 'https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page={page}&iOrder=0&iSortNumClose=1&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1597325146548'.format(page=x)
        page_queue.put(page_url)
    # Start 3 producers and 5 consumers
    for x in range(3):
        p = Producer(page_queue, image_queue)
        p.start()
    for x in range(5):
        c = Consumer(image_queue)
        c.start()


if __name__ == '__main__':
    main()
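
The script only reports a failed download; the image is fetched again the next time the script runs, because files that already exist are skipped. If an immediate retry is preferred, a small helper along the following lines could wrap the download call. This is a sketch under assumptions, not part of the original script: `download_with_retry`, its `retries` and `delay` parameters, and the injectable `fetch` argument are hypothetical names introduced here. `fetch` defaults to `urllib.request.urlretrieve`, which the script already uses.

```python
import os
import time
from urllib import request


def download_with_retry(url, path, retries=3, delay=1.0, fetch=request.urlretrieve):
    """Try to download url to path up to `retries` times; return True on success."""
    for attempt in range(1, retries + 1):
        try:
            fetch(url, path)
            return True
        except Exception as exc:
            # Remove a partially written file so it is not mistaken for a finished one
            if os.path.exists(path):
                os.remove(path)
            print('%s failed (attempt %d/%d): %s' % (path, attempt, retries, exc))
            if attempt < retries:
                time.sleep(delay)  # short pause before the next attempt
    return False
```

In `Consumer.run`, the call `request.urlretrieve(image_url, image_path)` could then be replaced by `download_with_retry(image_url, image_path)`, with the return value deciding which message to print.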
3. Notes
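- super() calls the parent class's version of a method (here threading.Thread.__init__), so the subclass reuses that code instead of duplicating it. The parent is resolved dynamically, and self does not have to be passed explicitly. In Python 3, super().__init__(...) is equivalent to super(Producer, self).__init__(...).
- When the arguments are not known in advance, accept *args and **kwargs and forward them, as the constructors here do for the arguments of threading.Thread.
- Characters that cause problems in file names are handled with replace() (removing '1:1') and strip() (trimming surrounding whitespace).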
- Exception handling:
  - os.path.exists() is used to skip directories and files that already exist
  - try/except is used to handle errors and to break out of the loop
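- queue.Queue(n) creates a queue with a capacity of n items.
- put() blocks while the queue is full; passing timeout= makes it raise queue.Full after that many seconds instead of blocking indefinitely.
- A for x in range(n): loop starts n producers or consumers.
- The number of producers and consumers should match the actual workload; more is not always better.
  - Beyond a certain number, the more threads there are, the more they contend with one another and the lower the efficiency.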
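
The notes on super(), *args/**kwargs, and bounded queues can be tried in isolation. The following sketch is illustrative only: the `Worker` class and the `try_put` helper are hypothetical names, not part of the script above.

```python
import queue
import threading


class Worker(threading.Thread):
    def __init__(self, task_queue, *args, **kwargs):
        # Forward everything else (e.g. name=, daemon=) to threading.Thread
        super().__init__(*args, **kwargs)
        self.task_queue = task_queue
        self.done = []

    def run(self):
        while True:
            try:
                item = self.task_queue.get(timeout=0.2)
            except queue.Empty:
                break  # nothing arrived within the timeout: stop the thread
            self.done.append(item)


def try_put(q, item, timeout=0.1):
    """Return False instead of blocking forever when the queue stays full."""
    try:
        q.put(item, timeout=timeout)
        return True
    except queue.Full:
        return False


if __name__ == '__main__':
    q = queue.Queue(2)  # capacity of 2 items
    # The third put finds the queue full and gives up after the timeout
    print(try_put(q, 'a'), try_put(q, 'b'), try_put(q, 'c'))
    w = Worker(q, name='worker-1')
    w.start()
    w.join()
    print(w.name, w.done)
```

Because `Worker.__init__` forwards its extra arguments, `name='worker-1'` reaches `threading.Thread` without `Worker` having to declare it.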
4. References