Multithreaded image scraping in Python

weixin_40938312

Published 2024-09-18 14:45:38

Article link: https://blog.csdn.net/weixin_40938312/article/details/142332941

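A lock protects the shared variable i.
i is the page number of the site, starting at 0.
Once both threads are running, thread 1 takes the lock, increments i and gets page 1; thread 2 then takes the lock and gets page 2.
The lock has to be acquired at the start of each pass, before i is read or incremented; otherwise both threads can end up with page 1.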
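The locking idea can be tried on its own, without any network access. The following is a minimal sketch, not the article's code: `claim_page`, `worker`, `TOTAL_PAGES` and the `claimed` dictionary are hypothetical names introduced for illustration, and the page limit of 7 is taken from the listing further down.

```python
import threading

TOTAL_PAGES = 7                 # assumed limit, same value as in the article's code
next_page = 0                   # shared counter, plays the role of i
lock = threading.Lock()
claimed = {"t0": [], "t1": []}  # pages taken by each worker (illustration only)


def claim_page():
    """Return the next unclaimed page number, or None when none are left."""
    global next_page
    # The limit check and the increment happen under the same lock,
    # so two threads can never receive the same page number.
    with lock:
        if next_page >= TOTAL_PAGES:
            return None
        next_page += 1
        return next_page


def worker(name):
    while True:
        page = claim_page()
        if page is None:
            break
        claimed[name].append(page)


threads = [threading.Thread(target=worker, args=(name,)) for name in claimed]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_pages = sorted(claimed["t0"] + claimed["t1"])
print(all_pages)  # [1, 2, 3, 4, 5, 6, 7]
```

How the pages are split between the two workers varies from run to run, but every page is claimed exactly once.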
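t0=threading.Thread(target=p,args=("t0",c1))
If the target function takes parameters, pass them through args. Writing target=p("t0",c1) does not work, because it calls the function immediately instead of handing it to the thread.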
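The difference between the two forms is easy to see in a small experiment. This sketch uses a stand-in `p` that only records which thread ran it; the `calls` list and the thread names `worker-0` and `worker-1` are assumptions made for the demonstration.

```python
import threading

calls = []


def p(n, c):
    # Record the arguments and the name of the thread that ran the function
    calls.append((n, c, threading.current_thread().name))


main_name = threading.current_thread().name

# Correct: pass the function object and its arguments separately
t0 = threading.Thread(target=p, args=("t0", 0), name="worker-0")
t0.start()
t0.join()

# Mistake: p("t1", 10) runs right here in the calling thread, and its
# return value (None) is what ends up as the thread's target
t1 = threading.Thread(target=p("t1", 10), name="worker-1")
t1.start()  # a thread whose target is None does nothing
t1.join()

print(calls)
```

The first entry in `calls` carries the name `worker-0`, while the second carries the name of the thread that built `t1`, which shows that the function never ran in the new thread.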

import requests
from bs4 import BeautifulSoup
import threading
import time
from PIL import Image
import os

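i=0    # shared page counter, protected by lock
c1=0   # starting file index for thread t0
c2=10  # starting file index for thread t1
lock=threading.Lock()
def p(n,c):
    # n: thread name printed in the log; c: running file index for this thread
    global i
    cc=0  # page number claimed by this thread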
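    for xun in range(7):
        # Check the limit and claim the next page in one locked step,
        # so the two threads can never take the same page number.
        with lock:
            has_page = i < 7
            if has_page:
                i = i + 1
                cc = i
        if has_page:
            # Build the URL from cc, this thread's own page number;
            # i may already have been changed by the other thread.
            url="https://.html/"+str(cc)+""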
            headers={
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 '
                              '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
            }
            r = requests.get(url, headers=headers)
            # Set the encoding explicitly before reading r.text
            # (use 'gbk' instead if the target page is GBK-encoded)
            r.encoding = 'utf-8'
            html = r.text
            #print(html)
            soup=BeautifulSoup(html,"html.parser")

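            '''
            This is where the BeautifulSoup library comes in: the content returned by the previous function is parsed with the html.parser parser.
            BeautifulSoup's find_all function then looks up the tags we need; printing the length of the result is a quick way to check that the search worked.
            Because several tags are found, they have to be handled one by one. On this site the picture URL is stored in the img tag's data-original attribute,
            not in the src attribute. Downloading from src gives a blank, all-black image;
            only the data-original attribute gives the correct one.
            The reason lies in the page's JavaScript, which is not explained here.
            The value of data-original also lacks the leading https:, so it has to be added in front before the URL is passed on to the next function.

            Author: _WJL_
            Link: https://www.jianshu.com/p/532c9a463201
            Source: Jianshu
            Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, credit the source.
            '''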
            tags=soup.find("p",class_="").find_all("img")
            print(tags)
            for each in tags:
                time.sleep(1)  # pause between downloads to go easy on the server
                # The image URL is read from the data-wpfc-original-src attribute, not from src
                tu=each.get('data-wpfc-original-src')
                if not tu:
                    continue  # skip img tags that do not carry the attribute
                print(tu, "tu",each)
                print(n)
                print(i,cc)

                # Save the raw download as a .webp file first
                path=r'e:\p'
                file=os.path.join(path, str(c)+str(cc)+'.webp')

                req = requests.get(tu, headers=headers).content
                with open(file,'wb') as f:
                    f.write(req)

                # Load the WebP image and save a copy in PNG format
                with Image.open(file) as webp_image:
                    webp_image.save(os.path.join(path, str(c)+"+"+str(cc)+".png"), 'PNG')
                c = c + 1
        else:
            # All 7 pages have already been claimed, so this thread can stop
            print(n, "no pages left")
            break

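# Arguments go through args; target=p("t0",c1) would call p immediately
t0=threading.Thread(target=p,args=("t0",c1))
t1=threading.Thread(target=p,args=("t1",c2))
t0.start()
t1.start()
# Wait for both workers to finish
t0.join()
t1.join()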
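
The quoted note inside the listing says that the image URL sits in a lazy-load attribute rather than in src, and that it may lack the leading https:. That step can be sketched in isolation. To keep the example free of third-party packages it uses the standard library's `html.parser` instead of the BeautifulSoup calls from the article; `LazyImageCollector`, `sample_html` and the `img.example.com` address are made up for the illustration.

```python
from html.parser import HTMLParser


class LazyImageCollector(HTMLParser):
    """Collect image URLs from a lazy-load attribute instead of src."""

    def __init__(self, attr_name):
        super().__init__()
        self.attr_name = attr_name
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        value = dict(attrs).get(self.attr_name)
        if not value:
            return  # this img tag has no lazy-load attribute
        # A protocol-relative URL ("//host/a.jpg") needs a scheme before it can be fetched
        if value.startswith("//"):
            value = "https:" + value
        self.urls.append(value)


# Hypothetical markup: the first tag has a placeholder in src and the real URL in data-original
sample_html = (
    '<p><img src="blank.gif" data-original="//img.example.com/a.jpg">'
    '<img src="spacer.gif"></p>'
)

collector = LazyImageCollector("data-original")
collector.feed(sample_html)
print(collector.urls)  # ['https://img.example.com/a.jpg']
```

For the site scraped in this article the attribute name would be data-wpfc-original-src, as used in the listing.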