The Python requests library: scraping images, explained and applied

Latest recommended article published on 2024-07-22 17:25:13

DoCki

Article tags: python requests lxml

Article link: https://blog.csdn.net/Love_ProgramingKi/article/details/82695663

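The requests library is widely used in Python web applications and in many small crawlers. Without further ado, here is a requests script that scrapes images from Qiushibaike (糗事百科).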

# coding:utf-8
import requests
from lxml import etree
import webbrowser
import os


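def requests_view(response):
    """
    Save the response to a temporary HTML file and open it in the default browser.
    :param response: the response returned by the target site
    :return: None
    """
    # Add a <base> tag so relative links in the saved copy resolve against the original URL.
    base_tag = '<head><base href="%s">' % response.url
    content = response.content.replace(b'<head>', base_tag.encode())
    with open('temp.html', 'wb') as temp_html:
        temp_html.write(content)
    webbrowser.open_new_tab('temp.html')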

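def get_links(url):
    """
    Download the images on one page, then follow the "下一页" (next page) link recursively.
    :param url: the page to crawl
    :return: None
    """
    print(url)
    response = requests.get(url)
    html = etree.HTML(response.content)
    # Image URLs inside every "ui-module" block of the page.
    imgs = html.xpath('//*[@id="wrapper"]//div[@class="ui-module"]//img/@src')
    # href of the pager link whose text contains "下一页" (next page).
    links = html.xpath('//*[@id="wrapper"]/div/div[1]/div[9]/div/a[contains(text(),"下一页")]/@href')
    if response.status_code == 200:
        download_imgs(imgs)
    else:
        print(response.status_code, url)
    if len(links) > 0:
        host = 'http://www.qiubaichengren.net/'
        # Strip any leading slash so the joined URL does not contain "//".
        next_url = host + links[-1].lstrip('/')
        get_links(next_url)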


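def download_imgs(imgs):
    """
    Download images.
    :param imgs: a collection of image URLs
    :return: None
    """
    imgs_dir = r"C:\Users\Administrator\Desktop\pics"
    # Create the target directory if needed; open() fails when it does not exist.
    if not os.path.isdir(imgs_dir):
        os.makedirs(imgs_dir)
    for i in imgs:
        # Handle errors per image so one failed download does not abort the rest.
        try:
            img_name = i.split('/')[-1]
            img_path = os.path.join(imgs_dir, img_name)
            content = requests.get(i).content
            with open(img_path, 'wb') as f:
                f.write(content)
        except Exception as e:
            print("download_imgs err: %s" % e)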

if __name__ == "__main__":
    host_name = 'http://www.qiubaichengren.net'
    # Optional: uncomment to preview the home page in the default browser.
    # requests_view(requests.get(host_name))
    get_links(host_name)

Explanation:

webbrowser is a module in Python's standard library; calling webbrowser.open_new_tab(url) opens the given url in the system's default browser.
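The <base> tag specifies the default address or default target for all links on a page. Normally the browser fills in the missing parts of a relative URL from the URL of the current document. The <base> tag changes this: the browser stops using the current document's URL and resolves every relative URL against the specified base URL instead.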
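To make the effect of <base> concrete, here is a minimal offline sketch. The helper name inject_base, the sample page, and the example.com URL are assumptions made for illustration; the replacement itself mirrors what requests_view does, and urljoin from the standard library is used to show how a relative URL resolves against a base.

```python
from urllib.parse import urljoin


def inject_base(content, base_url):
    """Insert a <base> tag right after the first <head> tag (hypothetical helper)."""
    tag = '<head><base href="%s">' % base_url
    # Only the first <head> is replaced; a <head> tag with attributes would not match.
    return content.replace(b'<head>', tag.encode(), 1)


# Hypothetical page with a relative image path.
page = b'<html><head><title>t</title></head><body><img src="img/a.jpg"></body></html>'
base = 'http://example.com/list/'

patched = inject_base(page, base)
print(patched.decode())

# urljoin resolves the relative path against the base URL,
# which is what the browser does once <base> is present.
resolved = urljoin(base, 'img/a.jpg')
print(resolved)
```

Without the <base> tag, the saved temp.html would resolve img/a.jpg against its own local file path, so the images would not load.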
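etree.HTML() parses the content of the requests response into an HTML element tree. XPath can then be applied to that tree to extract what we want, in this case the image URLs.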
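The parse-then-query step can be tried without touching the network. The sketch below runs the script's image XPath against a small hand-written HTML string; the markup and the example.com URLs are assumptions, not the real structure of the target site, and the pager XPath is simplified.

```python
from lxml import etree

# Hypothetical markup that loosely imitates the structure the XPath expects.
sample = '''
<html><body>
  <div id="wrapper">
    <div class="ui-module"><img src="http://example.com/a.jpg"></div>
    <div class="ui-module"><img src="http://example.com/b.jpg"></div>
    <div class="pager"><a href="/page2.html">下一页</a></div>
  </div>
</body></html>
'''

tree = etree.HTML(sample)

# Same image expression as in get_links: every img/@src under a "ui-module" div.
imgs = tree.xpath('//*[@id="wrapper"]//div[@class="ui-module"]//img/@src')

# Simplified pager expression: any link whose text contains "下一页" (next page).
links = tree.xpath('//a[contains(text(),"下一页")]/@href')

print(imgs)
print(links)
```

Both xpath() calls return plain lists of strings, which is why get_links can pass imgs straight to download_imgs and test len(links).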
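This example crawls page after page by recursion. Each page adds one level of recursion, so a long crawl can exceed Python's default recursion limit (typically 1000); sys.setrecursionlimit() can raise that limit. Interested readers can rewrite the example in a non-recursive form.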
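One possible non-recursive version replaces the self-call with a while loop. This is only a sketch under assumptions: crawl_iteratively and fetch_page are hypothetical names, and fetch_page stands in for the "request, parse, download" work of get_links, returning the image URLs and the next-page href (or None on the last page). A dictionary plays the role of the site so the loop can run offline.

```python
from urllib.parse import urljoin


def crawl_iteratively(start_url, fetch_page, max_pages=1000):
    """Visit pages with a loop instead of recursion.

    fetch_page(url) is a hypothetical callable returning
    (image_urls, next_href); next_href is None on the last page.
    """
    visited = []
    seen = set()
    url = start_url
    # Stop on the last page, on a repeated URL, or at the page cap.
    while url and url not in seen and len(visited) < max_pages:
        seen.add(url)
        image_urls, next_href = fetch_page(url)
        visited.append((url, image_urls))
        # Resolve the next link against the current page URL.
        url = urljoin(url, next_href) if next_href else None
    return visited


# Offline stand-in for a three-page site.
fake_site = {
    'http://example.com/': (['a.jpg'], 'page2.html'),
    'http://example.com/page2.html': (['b.jpg'], 'page3.html'),
    'http://example.com/page3.html': (['c.jpg'], None),
}

result = crawl_iteratively('http://example.com/', fake_site.__getitem__)
print([page_url for page_url, _ in result])
```

The loop keeps a constant call depth no matter how many pages exist, so the recursion limit never comes into play, and the seen set guards against a pager that links back to an earlier page.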
This example can also be implemented with multiple threads.
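As one way to approach the multithreaded variant, the sketch below uses ThreadPoolExecutor from the standard library to run downloads concurrently. The names download_all and fake_download are hypothetical; in the real script the worker would be a function that fetches one image with requests.get and writes it to disk.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(img_urls, download_one, max_workers=4):
    """Run download_one(url) for every URL on a thread pool.

    Returns a dict mapping each URL to its result, or to the
    exception it raised, so one failure does not stop the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(download_one, url) for url in img_urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as e:
                results[url] = e
    return results


def fake_download(url):
    # Stand-in worker so the sketch runs offline (an assumption, not the real download).
    if url.endswith('bad.jpg'):
        raise OSError('simulated failure')
    return len(url)


urls = ['http://example.com/a.jpg', 'http://example.com/bad.jpg']
out = download_all(urls, fake_download)
print(out)
```

Threads suit this task because image downloads spend most of their time waiting on the network rather than on the CPU.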
