Python Web Scraping in Practice | (8) Scraping Baidu Images

CoreJT

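In this post we scrape Baidu Images: given a search term, we download the images related to it.

First open Baidu Images at http://image.baidu.com and search for, say, "美女" ("beautiful women"). The URL at that point is: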

https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&hs=0&xthttps=111111&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E7%BE%8E%E5%A5%B3&oq=%E7%BE%8E%E5%A5%B3&rsp=-1

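To scrape using the URL alone, the URL has to carry both the search term and the page information, so we use the URL below instead (I am not yet sure how this URL was originally obtained, so for now I am copying it as-is):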

http://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord={}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&word={}&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=1&fr=&expermode=&force=&pn={}&rn=30&gsm=1e&1553692825521=

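The first two {} placeholders (queryWord and word) are filled with the search term; the last one (pn) carries the page information. In the code below pn is set to i*30, the offset of the first result on page i, since rn=30 requests 30 results per page.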
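As a minimal sketch of how the three placeholders can be filled, the helper below formats the same template and percent-encodes the search term with urllib.parse.quote. The names TEMPLATE, PER_PAGE and build_url are assumptions made for illustration and do not appear in the original code.

```python
from urllib.parse import quote

# Template copied from the article; the three {} are queryWord, word and pn.
TEMPLATE = ('http://image.baidu.com/search/acjson?'
            'tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&'
            'queryWord={}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z='
            '&ic=&hd=&latest=&copyright=&word={}&s=&se=&tab=&width=&height=&face='
            '&istype=&qc=&nc=1&fr=&expermode=&force=&pn={}&rn=30&gsm=1e&1553692825521=')

PER_PAGE = 30  # matches rn=30 in the template


def build_url(word, page):
    """Hypothetical helper: fill the template for a zero-based page index."""
    encoded = quote(word)  # percent-encode the search term as UTF-8
    return TEMPLATE.format(encoded, encoded, page * PER_PAGE)


# URLs for the first three pages: pn = 0, 30, 60
urls = [build_url('美女', page) for page in range(3)]
for u in urls:
    print(u)
```

The encoded term matches the word=%E7%BE%8E%E5%A5%B3 fragment visible in the browser URL shown earlier.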

First, set up the skeleton of the program:

import time
import requests
import os
from requests import RequestException
import json

def get_page(url):
    pass


def parse_page(html, count, word):
    pass


if __name__ == '__main__':
    word = '美女'  # search keyword
    page = 1  # number of pages to scrape
    count = 0  # number of images saved so far

    if not os.path.exists(word):
        os.makedirs(word)  # create the output directory

    for i in range(page):
        url = 'http://image.baidu.com/search/acjson?' \
              'tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&' \
              'queryWord={}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=' \
              '&ic=&hd=&latest=&copyright=&word={}&s=&se=&tab=&width=&height=&face=' \
              '&istype=&qc=&nc=1&fr=&expermode=&force=&pn={}&rn=30&gsm=1e&1553692825521='.format(word, word, i * 30)
        # send the request and get the response
        html = get_page(url)
        # parse the response and store the data
        count = parse_page(html, count, word)

        time.sleep(1)

Next, send the request and get the response by writing the get_page(url) function:

def get_page(url):
    try:
        # Add a User-Agent to the headers to pose as a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        # timeout keeps a stalled connection from hanging the crawl
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            response.encoding = response.apparent_encoding
            return response
        return None
    except RequestException:
        return None

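Note that, unlike in the earlier posts, this function returns only the response object itself. When parsing the search results we need response.text, whereas when fetching an image URL to save the picture we need response.content. Returning the response lets both kinds of request share this one function.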

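Opening the acjson URL above in a browser shows that it returns data in JSON format.

All of the image information sits under the data key. Each entry of data (highlighted in blue in the original screenshot) describes one image and consists of key-value pairs. The field we care about is middleURL, whose value is the actual link to the image. So we first extract each image's middleURL, then fetch the image and save it.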
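To make the structure concrete, here is a small sketch that pulls the middleURL values out of a payload of this shape. The sample string is hypothetical and heavily trimmed (the example.com links, the width field, and the empty trailing entry are assumptions for illustration), and extract_middle_urls is not a function from the original code.

```python
import json

# Hypothetical, heavily trimmed payload; real responses carry many more fields.
sample = '''
{
  "data": [
    {"middleURL": "http://example.com/a.jpg", "width": 500},
    {"width": 300},
    {"middleURL": "http://example.com/b.jpg"},
    {}
  ]
}
'''


def extract_middle_urls(payload):
    """Return the middleURL of every entry under data that has one."""
    entries = json.loads(payload).get('data', [])
    return [e['middleURL'] for e in entries if e.get('middleURL')]


print(extract_middle_urls(sample))
```

Filtering with e.get('middleURL') skips entries that lack the field instead of raising KeyError.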

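Parse the response: decode the JSON, extract and save each middleURL, then fetch each middleURL and save the image:

def parse_page(html, count, word):
    # get_page returns None on failure, so guard before touching .text
    if html is None or not html.text:
        return count
    p = json.loads(html.text)['data']  # parse the JSON and take the data field
    print(len(p))  # number of entries returned
    for i in p[:-1]:  # all but the last entry; use p[0:5] for only the first 5 images
        url = i.get('middleURL')
        if not url:  # skip entries without a middleURL
            continue
        print(url)
        count = count + 1
        # save the image URL
        with open(word + '/' + word + '_url.txt', 'a', encoding='utf-8') as f:
            f.write(url + '\n')
        # download the image and save its raw bytes
        pic = get_page(url)
        if pic:
            with open(word + '/' + str(count) + '.jpg', 'wb') as f:
                f.write(pic.content)
        time.sleep(1)
    return count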
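
parse_page always saves images with a .jpg extension. As an optional variation, the sketch below derives the extension from the URL path and falls back to .jpg when none is recognised. The helpers image_path and save_image are hypothetical names, not part of the original code, and the bytes written here are placeholders rather than a real download.

```python
import os
import tempfile
from urllib.parse import urlparse


def image_path(folder, index, url, default_ext='.jpg'):
    """Hypothetical helper: build folder/<index><ext> from the image URL."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext not in ('.jpg', '.jpeg', '.png', '.gif', '.webp'):
        ext = default_ext  # unknown or missing extension
    return os.path.join(folder, str(index) + ext)


def save_image(folder, index, url, data):
    """Write the raw bytes (e.g. response.content) and return the file path."""
    os.makedirs(folder, exist_ok=True)
    path = image_path(folder, index, url)
    with open(path, 'wb') as f:
        f.write(data)
    return path


# Usage with placeholder bytes instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_image(tmp, 1, 'http://example.com/pic/cat.PNG?x=1', b'fake-bytes')
    saved_name = os.path.basename(saved)
    saved_size = os.path.getsize(saved)
```

In parse_page, the call would take the place of the inner with open(...) block, passing pic.content as data.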

Complete code:

import time
import requests
import os
from requests import RequestException
import json

def get_page(url):
    try:
        # Add a User-Agent to the headers to pose as a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        # timeout keeps a stalled connection from hanging the crawl
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            response.encoding = response.apparent_encoding
            return response
        return None
    except RequestException:
        return None

def parse_page(html, count, word):
    # get_page returns None on failure, so guard before touching .text
    if html is None or not html.text:
        return count
    p = json.loads(html.text)['data']  # parse the JSON and take the data field
    print(len(p))  # number of entries returned
    for i in p[:-1]:  # all but the last entry; use p[0:5] for only the first 5 images
        url = i.get('middleURL')
        if not url:  # skip entries without a middleURL
            continue
        print(url)
        count = count + 1
        # save the image URL
        with open(word + '/' + word + '_url.txt', 'a', encoding='utf-8') as f:
            f.write(url + '\n')
        # download the image and save its raw bytes
        pic = get_page(url)
        if pic:
            with open(word + '/' + str(count) + '.jpg', 'wb') as f:
                f.write(pic.content)
        time.sleep(1)
    return count


if __name__ == '__main__':
    word = '美女'  # search keyword
    page = 1  # number of pages to scrape
    count = 0  # number of images saved so far

    if not os.path.exists(word):
        os.makedirs(word)  # create the output directory

    for i in range(page):
        url = 'http://image.baidu.com/search/acjson?' \
              'tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&' \
              'queryWord={}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=' \
              '&ic=&hd=&latest=&copyright=&word={}&s=&se=&tab=&width=&height=&face=' \
              '&istype=&qc=&nc=1&fr=&expermode=&force=&pn={}&rn=30&gsm=1e&1553692825521='.format(word, word, i * 30)
        # send the request and get the response
        html = get_page(url)
        # parse the response
        count = parse_page(html, count, word)

        time.sleep(1)
