Learning Python (6): Web Scraping with requests and BeautifulSoup


name_s_Jimmy


Category: Python. Tags: python, web crawler, BeautifulSoup4, Requests, saving images

Article link: https://blog.csdn.net/qq_32166627/article/details/60345731


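Preface:

The Requests library serves much the same purpose as urllib: both send requests and handle responses and pages over HTTP.

People say Requests is easier to use than urllib, though I have not yet seen where the advantage lies.

One thing about urllib does annoy me, however: when saving certain image links to disk, urllib.request.urlretrieve(url, localPath) fails with HTTPError: 403 Forbidden.

Why does this error occur? Most explanations online blame the request headers, but adding a complete set of headers did not fix it for me.

This example replaces urllib with Requests, and replaces urllib.request.urlretrieve(url, localPath) with open().write().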
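As a minimal sketch of that replacement, assuming the requests package is installed, the helper below fetches a URL and writes the response body to disk with open().write(). The name `save_image` and its `session` parameter are my own assumptions for illustration, not part of the program later in this post; `session` only needs a requests-style get() method, which makes the helper easy to try without touching the network.

```python
import os


def save_image(url, path, session=None, headers=None):
    """Fetch url and write the raw response bytes to path.

    `session` is a hypothetical parameter: any object with a
    requests-style get() method. If omitted, a requests.Session is used.
    """
    if session is None:
        import requests  # imported lazily; only needed for real downloads
        session = requests.Session()

    response = session.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raise on 403 and other HTTP error codes

    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)  # create the target folder if missing

    with open(path, 'wb') as f:
        f.write(response.content)  # the open().write() approach
    return len(response.content)  # number of bytes written
```

A call might look like `save_image(img_url, 'img/0.jpg', headers={'User-Agent': 'Mozilla/5.0'})`. Whether a given site still answers 403 depends on what it checks, so treat the header as something to experiment with rather than a guaranteed fix.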

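Main text:

1. Installing the Requests library

pip3 install requests

After installing, start Python and import the module to check that the installation succeeded:

import requests

If no error is raised, the installation succeeded.

For how to use Requests, see the official Chinese documentation: http://cn.python-requests.org/zh_CN/latest/

2. An image crawler combining Requests and BeautifulSoup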

'''
    Image crawler using requests and bs4
'''

import os
import requests
from bs4 import BeautifulSoup

def getHtmlCode(url):  # Takes a URL and returns the HTML source of that page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0'
    }

    r = requests.get(url, headers=headers)
    r.encoding = 'UTF-8'
    page = r.text
    return page

def getImg(page, localPath):  # Takes the HTML source, extracts the img tags, and saves each image locally
    os.makedirs(localPath, exist_ok=True)  # Create the folder (and any parents) if missing
    soup = BeautifulSoup(page, 'html.parser')  # Parse the page as HTML
    imgList = soup.find_all('img')  # List of all img tags
    x = 0
    for img in imgList:  # Loop over the tags
        src = img.get('src')
        if not src:  # Skip img tags that have no src attribute
            continue
        print('Downloading: %s' % src)
        ir = requests.get(src)

        # open().write() is primitive but effective; the with block closes the file
        with open(os.path.join(localPath, '%d.jpg' % x), 'wb') as f:
            f.write(ir.content)
        x += 1


if __name__ == '__main__':
    url = 'http://www.zhangzishi.cc/20160712mz.html'
    localPath = 'e:/pythonSpiderFile/img8/'
    page = getHtmlCode(url)
    getImg(page, localPath)
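
One gap in the program above: requests.get() needs an absolute URL, while the src of an img tag is often relative or protocol-relative, and every file is saved as .jpg whatever its real type. The sketch below shows one way to handle both using only the standard library. The helper names `resolve_img_url` and `local_name` are hypothetical and do not appear in the original program.

```python
import os
from urllib.parse import urljoin, urlparse


def resolve_img_url(page_url, src):
    """Turn an img src (absolute, relative, or protocol-relative) into a full URL."""
    return urljoin(page_url, src)


def local_name(img_url, index):
    """Build a file name from a counter, keeping the URL's extension if it has one."""
    ext = os.path.splitext(urlparse(img_url).path)[1]
    return '%d%s' % (index, ext or '.jpg')  # fall back to .jpg as the program above does
```

To use these, getImg would also need the page URL: call `resolve_img_url(url, src)` before requests.get() and `local_name(full_url, x)` when building the output path.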
