Scraping and saving images from a dynamically loaded web page with Python


见习程序员小张


Article link: https://blog.csdn.net/smart_boy_/article/details/86560785


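First, distinguish between static and dynamically loaded content. Ordinary crawler code can fetch the static parts of a page, but for the dynamically loaded parts you can try the approach below.

1. Analyzing the page structure

Open Baidu Images and press F12 to open the developer tools.
Search for images (this article uses "狗", meaning "dog", as the example) and confirm the search.
In the Network panel, select the XHR filter. The requests whose names start with acjson carry the dynamically loaded content. Inspecting the query parameters of such a request shows that queryWord and word hold the search term and that pn controls paging; the remaining parameters are relatively fixed and need no change. Note that pn is an offset that advances in steps of 30: one page holds 30 images (rn=30), and each scroll of the mouse wheel loads another 30 images, so pn changes accordingly.
The resulting request URL is:
https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E7%8B%97&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E7%8B%97&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=30&rn=30&gsm=1e&1547957073143=
The search term appears in this URL percent-encoded as %E7%8B%97. The URL returns JSON, which can be decoded with the json module from Python's standard library.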
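As a minimal sketch of the parameters that actually vary, the snippet below builds only the keyword and paging part of the query string. The helper name build_query and its page_index argument are hypothetical; the parameter names queryWord, word, pn and rn are taken from the URL above. Whether the server accepts a request with only these four parameters is not verified here, so in practice they would be merged into the full parameter list.

```python
from urllib.parse import urlencode


def build_query(keyword, page_index, page_size=30):
    """Build the varying part of the query string (hypothetical helper).

    pn is the offset of the first result, so it grows by page_size per page.
    """
    params = {
        "queryWord": keyword,
        "word": keyword,
        "pn": page_index * page_size,
        "rn": page_size,
    }
    # urlencode percent-encodes non-ASCII text as UTF-8 by default
    return urlencode(params)


if __name__ == "__main__":
    for page_index in range(3):
        print(build_query("狗", page_index))
```

For page_index 1 this yields pn=30 and the same %E7%8B%97 encoding of the keyword that appears in the captured URL.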

2. Implementation

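import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Request template copied from the XHR entry; only the keyword and pn vary.
URL = ("https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord={kw}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word={kw}&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn={pn}&rn=30&gsm=3c&1547900131393=")
# Assumption: a browser-like User-Agent, in case the default urllib one is rejected.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch(url):
    """Return the raw bytes of the response for url."""
    with urlopen(Request(url, headers=HEADERS), timeout=10) as resp:
        return resp.read()


def down(jd, page):
    """Save the thumbnail of every entry in one batch of results."""
    for i, item in enumerate(jd.get('data', [])):
        thumb_url = item.get('thumbURL')
        if not thumb_url:  # skip entries that carry no thumbnail URL
            continue
        print(thumb_url)
        with open("logo" + str(page) + "-" + str(i) + ".jpg", "wb") as f:
            f.write(fetch(thumb_url))


if __name__ == '__main__':
    keyword = quote('狗')  # a non-ASCII keyword must be percent-encoded in the URL

    try:
        for i in range(5):
            page = i * 30  # pn advances by 30 per batch
            # The response is JSON, so it is decoded directly; no HTML parser is needed.
            jd = json.loads(fetch(URL.format(kw=keyword, pn=page)).decode('utf-8'))
            down(jd, i)
    except (OSError, ValueError) as e:  # network or file errors, malformed JSON
        print("Error:", e)

    print("Done")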

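The code saves up to 30 × 5 = 150 images in the current working directory, that is, the directory the script is run from.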
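To check the parsing and file-naming logic without touching the network, the sketch below runs on a made-up payload. The field names data and thumbURL come from the code above; the example.com URLs, the trailing empty entry and the helper name plan_downloads are hypothetical and only illustrate how entries without a thumbnail can be skipped.

```python
import json

# Hypothetical payload: only the keys `data` and `thumbURL` mirror the real response.
SAMPLE = (
    '{"data": ['
    '{"thumbURL": "https://example.com/a.jpg"}, '
    '{"thumbURL": "https://example.com/b.jpg"}, '
    '{}'
    ']}'
)


def plan_downloads(text, page):
    """Return (thumbnail URL, file name) pairs for one batch of results."""
    jd = json.loads(text)
    plan = []
    for i, item in enumerate(jd.get("data", [])):
        url = item.get("thumbURL")
        if url:  # entries without a thumbnail URL are skipped
            plan.append((url, "logo{}-{}.jpg".format(page, i)))
    return plan


if __name__ == "__main__":
    for url, name in plan_downloads(SAMPLE, 0):
        print(name, "<-", url)
```

File names follow the logo<page>-<index>.jpg pattern used by the script, so five batches of 30 entries give at most 150 distinct files.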