Multithreaded Batch Download of Clear Preview Images from nipic.com in Python

Scott0902

Published 2022-09-29 16:20:35


Article link: https://blog.csdn.net/Scott0902/article/details/127108376


This is an exercise from when I was first learning Python: downloading clear preview images from nipic.com (昵图网) with multiple threads.

At the time of writing, nipic.com does not restrict crawlers, so requests can be used to open pages and download images quickly.

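Note: this article only demonstrates multithreaded downloading of the relatively clear preview images, whose longest side is at most 1024 pixels. It does not download members-only resources or the original design files.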

For example, a thumbnail link has this format: https://pic1.ntimg.cn/pic/20220719/4244141_213528444101_4.jpg

The corresponding clear preview image link is: https://pic.ntimg.cn/file/20220719/4244141_213528444101_2.jpg

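To generate the preview link, split the thumbnail link into segments with split('/'), then concatenate the date folder and the file name (with its trailing 4.jpg changed to 2.jpg) onto the preview host and path.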
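As a minimal sketch of that idea, the helper below rebuilds the preview link from a thumbnail link. The function name `thumbnail_to_preview` is hypothetical, and the sketch assumes every thumbnail link has the same layout as the example above: the date folder is the fifth segment and the file name ends in `4.jpg`.

```python
def thumbnail_to_preview(thumb_url):
    """Build the preview-image URL from a thumbnail URL (hypothetical helper)."""
    parts = thumb_url.split('/')
    # parts: ['https:', '', 'pic1.ntimg.cn', 'pic', '20220719', '4244141_213528444101_4.jpg']
    date_dir = parts[4]
    filename = parts[5]
    # Assumption: the thumbnail name ends in '4.jpg' and the preview name in '2.jpg'
    preview_name = filename[:-5] + '2.jpg'
    return 'https://pic.ntimg.cn/file/' + date_dir + '/' + preview_name


thumb = 'https://pic1.ntimg.cn/pic/20220719/4244141_213528444101_4.jpg'
print(thumbnail_to_preview(thumb))
# prints https://pic.ntimg.cn/file/20220719/4244141_213528444101_2.jpg
```

Indexing by position keeps the code short, but it would break if the site ever changed its URL layout, so a real script might want to check the number of segments first.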

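The code below downloads preview images of Qixi-themed (七夕) material from nipic.com. It covers pages 1 to 10 and starts 10 threads; each thread opens one thumbnail page and then downloads that page's preview images one by one.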
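To make the per-page step concrete, here is a small sketch of how the page links can be built and how thumbnail links can be pulled out of one page. The HTML fragment and the second image URL are made up for illustration, and the real page markup may differ; only the `data-original` attribute and the search URL come from the script itself. `find_thumbnails` is a hypothetical helper name.

```python
import re

# Hypothetical fragment of a search-result page; the real markup may differ
sample_html = '''
<img class="lazy" data-original="https://pic1.ntimg.cn/pic/20220719/4244141_213528444101_4.jpg">
<img class="lazy" data-original="https://pic1.ntimg.cn/pic/20220720/1234567_000000000000_4.jpg">
'''

def find_thumbnails(html):
    """Return every URL stored in a data-original attribute."""
    return re.findall(r'data-original="(.*?)"', html)

# Page links for pages 1..10, built the same way as in the script
start_url = 'https://soso.nipic.com/?q=%E4%B8%83%E5%A4%95&g=1&or=0&y=60&page='
page_links = [start_url + str(page) for page in range(1, 11)]

print(len(page_links))               # number of pages to visit
print(find_thumbnails(sample_html))  # thumbnail links found in the fragment
```

The non-greedy `(.*?)` stops at the first closing quote, so each match is a single attribute value rather than everything up to the last quote on the page.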

The multithreading code is adapted from the runoob.com tutorial "Python3 多线程" (Python3 Multithreading).

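## Thread priority queues (Queue)
## Python's queue module provides synchronized, thread-safe queue classes,
## including the FIFO (first in, first out) queue Queue, the LIFO (last in, first out) queue LifoQueue,
## and the priority queue PriorityQueue.
## These queues all implement the locking they need, so they can be used directly from multiple threads,
## and they can be used to synchronize work between threads.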

import queue
import threading
import requests
import re
import os

exitFlag = 0

class myThread (threading.Thread):
    def __init__(self, threadID, name, q):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.q = q
    def run(self):
        print (f"Starting thread: {self.name}")
        process_data(self.name, self.q)
        print (f"Exiting thread: {self.name}")

def process_data(threadName, q):
    while not exitFlag:
        queueLock.acquire()
        if not q.empty():
            url = q.get()
            queueLock.release()

            # Open the thumbnail page; each thread downloads all the large images of one page
            res=se.get(url,headers=headers).text
            # Find the thumbnail links
            small_list=re.findall('data-original="(.*?)"',res)
            print (f"Thread {threadName} is downloading {len(small_list)} images...")
            # Count the files that already exist
            repeat_file=0
            for i in small_list:
                split_list=i.split('/')
                # e.g. 4244141_213528444101_4.jpg -> 4244141_213528444101_2.jpg
                filename=split_list[5][:-5]+'2.jpg'
                # URL of the clear preview image
                big_url='https://pic.ntimg.cn/file/'+split_list[4]+'/'+filename
                filepath=os.path.join(outputpath,filename)
                # Skip files that were already downloaded
                if os.path.lexists(filepath):
                    repeat_file+=1
                    continue
                # Fetch first, so a failed request does not leave an empty file behind
                c=se.get(big_url,headers=headers)
                with open(filepath,'wb') as f:
                    f.write(c.content)
            if repeat_file>0: print (f"Thread {threadName} found {repeat_file} duplicate images.")
        else:
            queueLock.release()

# Main program
# Set the output directory
outputpath=r'e:\temp\七夕'
# makedirs also creates missing parent directories
os.makedirs(outputpath, exist_ok=True)

# Set the request headers and the session
headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',}
se = requests.session()

# Link to the Qixi-themed material on nipic.com
start_url='https://soso.nipic.com/?q=%E4%B8%83%E5%A4%95&g=1&or=0&y=60&page='
# Set the first and last page
startpage=1
lastpage=10
# Define 10 threads
threadList = range(10)

# Add the thumbnail page links to linkList
linkList=[]
for pages in range(startpage,lastpage+1):
    linkList.append(start_url+str(pages))
queueLock = threading.Lock()
# Unbounded queue: with a maxsize, put() could block while queueLock is held
workQueue = queue.Queue()
threads = []
threadID = 1

# Create the new threads
for tName in threadList:
    thread = myThread(threadID, tName, workQueue)
    thread.start()
    threads.append(thread)
    threadID += 1

# Fill the queue with the thumbnail page links
queueLock.acquire()
for k in linkList:
    workQueue.put(k)
queueLock.release()

# Wait for the queue to empty (busy-waits until every link has been taken)
while not workQueue.empty():
    pass

# Tell the threads it is time to exit
exitFlag = 1

# Wait for all threads to finish
for t in threads:
    t.join()
print ("Exiting main thread")

Screenshot of the run results:
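The queue comments at the top of the script note that these queues handle their own locking. As a hedged sketch of what that allows (this is not the approach the script takes), a worker can call `get()` directly without a separate lock, and `None` sentinels can stand in for the exit flag and the busy-wait loop. `handle`, `worker`, and `run` are hypothetical names, and `handle` is a stand-in that does no network access.

```python
import queue
import threading

def handle(url):
    """Stand-in for the real page download; just tags the URL (hypothetical)."""
    return 'done:' + url

def worker(q, results):
    while True:
        url = q.get()            # blocks until an item is available
        if url is None:          # sentinel: no more work for this thread
            q.task_done()
            break
        results.append(handle(url))   # list.append is atomic in CPython
        q.task_done()

def run(urls, num_threads=4):
    q = queue.Queue()
    results = []
    threads = [threading.Thread(target=worker, args=(q, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for url in urls:
        q.put(url)
    for _ in threads:
        q.put(None)              # one sentinel per worker
    q.join()                     # returns once every item has been marked done
    for t in threads:
        t.join()
    return results

print(sorted(run(['page1', 'page2', 'page3'])))
# prints ['done:page1', 'done:page2', 'done:page3']
```

Because `get()` blocks while the queue is empty, idle workers sleep instead of spinning, and the main thread waits in `q.join()` rather than polling `empty()`.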