Scraping Baidu Images with a Python Crawler

Latest recommended article published on 2025-05-21 16:35:50

有时间也不简史

Licensed under CC 4.0 BY-SA

Tags: python, crawler, Baidu Images

Article link: https://blog.csdn.net/u010050735/article/details/78157583

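I am new to Python crawlers and wanted to try scraping the images from Baidu image search. Simply fetching the page and filtering the image links with a regular expression works on JD.com or Dangdang pages, but it does not work on Baidu. The code is as follows: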
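import re
import urllib.error
import urllib.request

def craw(url,page):
    # Download the listing page and decode the bytes into text
    html=urllib.request.urlopen(url).read()
    html=html.decode("utf-8",errors="ignore")
    # Block of the page that holds the product images
    div='<div class="p-img">.+?<div class="p-scroll"'
    # Capture the image address that follows "//" in the src attribute
    img=r'<img width="220" height="282" class="err-product" data-img="1" src="//(.+?\.jpg)" />'
    # re.S lets "." match newlines, since the block spans several lines
    result1=re.compile(div,re.S).findall(html)
    if not result1:
        print("no matching block found")
        return
    # findall returns a list, so search inside its first element
    imglist=re.compile(img).findall(result1[0])
    print(imglist)
    x=1
    for imageurl in imglist:
        imgname="C:/Users/scorpion/Desktop/mm/"+str(page)+"_"+str(x)+".jpg"
        imageurl="http://"+imageurl
        try:
            urllib.request.urlretrieve(imageurl,filename=imgname)
        except urllib.error.URLError as e:
            # Report the failure and continue with the next image
            if hasattr(e,"code"):
                print(e.code)
            if hasattr(e,"reason"):
                print(e.reason)
        x+=1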

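The code above scrapes images from JD.com. Using the same code on Baidu Images fails. If you download the Baidu results page to disk through its URL, you will find that none of the images are present in the saved page. Why? Because Baidu Images stores the image information in JSON. You need to request that JSON and then filter the image links out of it.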
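To make the idea of "filtering links out of JSON" concrete, here is a minimal sketch using the standard `json` module. The sample payload is hypothetical and heavily trimmed. The field names `thumbURL` and `middleURL` are the ones the crawler in this post matches against, while the enclosing `data` list and the `example.com` addresses are assumptions made for illustration only.

```python
import json

# Hypothetical, trimmed-down payload; a real response carries many more fields.
SAMPLE = (
    '{"data": ['
    '{"thumbURL": "https://example.com/a.jpg", "middleURL": "https://example.com/a_m.jpg"},'
    '{"thumbURL": "https://example.com/b.jpg"},'
    '{}'
    ']}'
)

def extract_thumb_urls(payload):
    """Return every non-empty thumbURL found in the (assumed) "data" list."""
    doc = json.loads(payload)
    return [item["thumbURL"] for item in doc.get("data", []) if item.get("thumbURL")]

print(extract_thumb_urls(SAMPLE))
```

Parsing the JSON is somewhat more tolerant than a regular expression if the field order changes, but it only works when the response is valid JSON, which is why a regex fallback can still be useful.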

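In Chrome, open Baidu image search, press F12, and type what you want to search for into the search box. In the developer tools select Network -> XHR, then scroll down the page so that more images load. New request entries appear in the list each time the page loads another batch.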

Click one of these entries and look at its header information. The Request URL shown there is the link that requests the JSON.

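Now let's analyze this link. If you click several entries and compare them, you will find that between the different JSON requests only "pn=", "gsm=" and the last field of the URL change. "pn=" increases in multiples of 30. "gsm=" is a number written in hexadecimal that starts at '1e'. With each newly loaded JSON, the low hex digit drops by 2 and the higher digit rises by 2, which amounts to adding 0x1e (30 in decimal) each time. The sequence looks like this: 1e->3c->5a->78->96->b4->d2->f0->10e, that is, 30, 60, 90, ... up to 270 written in hexadecimal.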
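The pattern can be checked with a few lines of Python. This is only a sketch: the helper name `page_params` is made up here, and pairing each `gsm` value with the same-numbered `pn` is an assumption based on the observed sequence rather than on any documented behaviour.

```python
def page_params(page_index):
    """Return (pn, gsm) for a zero-based page index.

    pn is the next multiple of 30; gsm is that same number in lowercase hex.
    """
    pn = 30 * (page_index + 1)
    return pn, format(pn, "x")

# Reproduce the first nine gsm values listed in the text.
gsm_values = [page_params(i)[1] for i in range(9)]
print("->".join(gsm_values))
```

Seen this way, the jump from f0 to 10e is an ordinary hexadecimal carry and needs no special handling.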

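The next step is to simplify the URL. I removed some of the fields and found that the last field does not affect the request, so the address can be reduced to the following form:

https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&fp=result&queryWord=<URL-encoded query>&word=<URL-encoded query>&pn=<multiple of 30>&rn=30&gsm=<hex code>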
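As an alternative to concatenating the query string by hand, the simplified address can be assembled with `urllib.parse.urlencode`. The sketch below assumes exactly the parameters shown in the simplified URL. The function name `build_url` is made up for illustration, and `quote_via=quote` is chosen so that the keyword is percent-encoded the same way as with `quote`.

```python
from urllib.parse import quote, urlencode

BASE = "https://image.baidu.com/search/acjson"

def build_url(keyword, pn, gsm):
    """Assemble the simplified acjson URL for one page of results."""
    params = [
        ("tn", "resultjson_com"),
        ("ipn", "rj"),
        ("fp", "result"),
        ("queryWord", keyword),
        ("word", keyword),
        ("pn", pn),
        ("rn", 30),
        ("gsm", gsm),
    ]
    # quote_via=quote percent-encodes spaces as %20 instead of "+"
    return BASE + "?" + urlencode(params, quote_via=quote)

print(build_url("cat", 30, "1e"))
```

A list of pairs is used instead of a dict only to make the parameter order explicit and identical to the URL in the text.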

With all of that worked out, we can write the crawler that downloads the images. Here is the code:

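import re
import urllib.error
import urllib.parse
import urllib.request

def craw(content,imgnum,localadr):
    # content: search keyword; imgnum: number of JSON pages to fetch (30 images each);
    # localadr: existing directory that receives the images
    try:
        x=1
        # URL-encode the keyword so non-ASCII text can go into the query string
        content=urllib.parse.quote(content)
        for n in range(0,imgnum):
            print('get gsm code...')
            code=getgsm(n)
            print(code)
            url='https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&fp=result&queryWord='+content+'&word='+content+'&pn='+str((n+1)*30)+'&rn=30&gsm='+code
            print(url)
            print('get html data....')
            data=urllib.request.urlopen(url,timeout=100).read().decode('utf-8',errors='ignore')
            print('html data gotten')
            # Each result carries a thumbURL field followed by middleURL
            thumbUrls=r'"thumbURL":"(.+?\.jpg)","middleURL"'
            imgs=re.compile(thumbUrls).findall(data)
            print('img num:'+str(len(imgs)))
            for img in imgs:
                imgname=localadr+'/'+str(x)+'.jpg'
                print('save img to '+imgname)
                urllib.request.urlretrieve(img,filename=imgname)
                x+=1
    except urllib.error.URLError as e:
        # HTTPError has both code and reason; a plain URLError only has reason
        if hasattr(e,"code"):
            print(e.code)
        if hasattr(e,"reason"):
            print(e.reason)
    except Exception as e:
        print("exception:"+str(e))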

# Compute the gsm parameter for a zero-based page index
def getgsm(page):
    print('page:'+str(page))
    # gsm is the page offset (30, 60, 90, ...) written in hexadecimal;
    # adding 30 per page handles the carry correctly (f0 -> 10e)
    value=(page+1)*30
    code=''
    while value>0:
        # take the lowest 4 bits and prepend the matching hex digit
        code=gethex(value&0x0f)+code
        value>>=4
    return code

# Convert a number from 0 to 15 into its hexadecimal character
def gethex(number):
    if 0<=number<=15:
        return '0123456789abcdef'[number]
    # Fail loudly instead of silently returning None
    raise ValueError('number outside 0x0-0xf: '+str(number))

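Of course, remember to also set the request headers (for example User-Agent) so that the request looks like it comes from your browser.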
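One way to do this with the standard library is to wrap the URL in a `urllib.request.Request` that carries the headers. This is a minimal sketch: the header values below are placeholders (copy the real ones from the Request Headers panel in your own browser), and the helper names `make_request` and `fetch` are made up for illustration.

```python
import urllib.request

# Assumed header values; replace them with the ones your browser actually sends.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://image.baidu.com/",
}

def make_request(url):
    """Build a Request object that sends browser-like headers."""
    return urllib.request.Request(url, headers=HEADERS)

def fetch(url, timeout=10):
    """Download the URL with the headers above and return the raw bytes."""
    with urllib.request.urlopen(make_request(url), timeout=timeout) as resp:
        return resp.read()
```

`urllib.request.urlretrieve` has no headers argument. If the image downloads need the same headers, one option is to create an opener with `urllib.request.build_opener()`, set its `addheaders` list, and register it with `urllib.request.install_opener()` before calling `urlretrieve`.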