Python Crawler Example (6): Scraping Book Data in XML Format from the XX Website (an XPath Application)

Latest recommended article published on 2024-05-27 13:16:19

199铱


Categories: Crawlers, Python Crawler Column. Tags: Python crawler, XPath plugin, etree, requests, html

Article link: https://blog.csdn.net/linzhjbtx/article/details/87696558


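Building on the earlier exercises in this series, this article completes one project goal: scraping the XML-format data for economics books from the XX website.

Project approach

Send a GET request and obtain the response, use etree.HTML and the xpath method to extract the wanted content, and save it to a local html file; then read the data back from that local file for processing or analysis.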

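New tools and methods used in this project:

1. Chrome's XPath plugin: download an XPath plugin and add it to Chrome. Once it is added, an icon appears in the upper-right corner of the browser (the original post shows a screenshot of it). Clicking the icon opens a black query bar at the top of the page, which helps find the path to the information you need. XPath describes a location in an XML document with a path that resembles a directory tree. XPath syntax is not covered in detail here; please consult other references.

2. The html.xpath() method, which extracts the content at an XPath path. If the path does not end in /text(), it returns a list of element objects; if it ends in /text(), it returns a list of text strings (an empty list when nothing matches).

3. The lxml.etree.HTML() method, which parses a string into an element object that can then be queried with xpath().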
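To make the difference between the two kinds of xpath() results concrete, here is a minimal sketch. The markup is hypothetical (it only imitates the shape of a book listing and is not the real page), and the variable names are chosen for illustration.

```python
from lxml import etree

# Hypothetical markup, loosely shaped like a book listing (not the real site)
page = """
<html><body>
  <ul>
    <li class="subject-item"><a title="Book A" href="/a">Book A</a></li>
    <li class="subject-item"><a title="Book B" href="/b">Book B</a></li>
  </ul>
</body></html>
"""

html = etree.HTML(page)  # str -> root element of the parsed tree

# No /text() at the end: a list of element objects
items = html.xpath("//li[@class='subject-item']")
# /text() at the end: a list of strings
texts = html.xpath("//li[@class='subject-item']/a/text()")
# /@attr at the end: a list of attribute values (also strings)
titles = html.xpath("//li[@class='subject-item']/a/@title")
# Nothing matches: an empty list, not None
missing = html.xpath("//li[@class='no-such-class']/text()")

print(len(items), texts, titles, missing)
```

Because every result is a list, code that wants a single value has to index into it and decide what to do when the list is empty.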

Import the required modules at the top of the project

# Load the required modules
import requests
from lxml import etree
import os
import json

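Define html_abs_path as an absolute XPath path used to group the fetched content, splitting the 20 books on each page into 20 records. (In the screenshot in the original post, each li element contains one book's information, and html_abs_path is the path leading to those elements.)

After the GET request returns, the content is parsed. If the page shows "没有找到符合条件的图书" ("no matching books found") after paging forward, the main loop stops (note that although the site displays 200+ pages, an actual search does not return that many). Otherwise, the fields of each grouped item are extracted: an empty dictionary dict_str holds one book's information, and each dictionary is appended as an element to the initially empty list s_1.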

class DoubanEconomyBookSpider:

    def __init__(self):
        self.tmp_url = 'https://book.douban.com/tag/%E7%BB%8F%E6%B5%8E%E5%AD%A6?start={}&type=T'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
        self.html_abs_path = "//li[@class='subject-item']"
        self.file_name = 'DoubanEconomyBookSpider.html'

    def web_request(self, url, headers):
        response = requests.get(url, headers=headers)
        res_dec = response.content.decode()
        return res_dec

    def parse_date(self, html_abs_path, html):
        s_1 = []  # collects one dict per book on this page
        item_list = html.xpath(html_abs_path)
        # xpath() returns a list: take the first match, or "" when nothing matched
        list_is_none = lambda x: x[0] if x else ""
        # This message appears once the start offset runs past the real results
        if html.xpath("//div[@id='content']//p[@class='pl2']/text()") == ["没有找到符合条件的图书"]:
            pages = -1  # makes the while condition in run() false, ending the loop
        else:
            # Page count shown by the paginator, read from the link whose href is hard-coded here
            pages = int(html.xpath("//div[@class='paginator']//a[@href='/tag/经济学?start=4140&type=T']/text()")[0])
            for content in item_list:
                dict_str = {}  # one book record
                dict_str["title"] = list_is_none(content.xpath(".//div[@class='info']//a/@title"))
                dict_str["rating_num"] = list_is_none(content.xpath(".//div[@class='star clearfix']/span[@class='rating_nums']/text()"))
                dict_str["comment_num"] = list_is_none(content.xpath(".//span[@class='pl']/text()")).strip().strip('()')
                dict_str["link_src"] = list_is_none(content.xpath(".//a[@class='nbg']/@href"))
                dict_str["pub"] = list_is_none(content.xpath(".//div[@class='pub']/text()")).strip()
                s_1.append(dict_str)
        return s_1, pages
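The grouping step can be tried without any network access. The sketch below runs the same pattern (one absolute path to select the items, then relative paths starting with ".//" inside each item, with an empty-string fallback) on a small hypothetical snippet. The markup, the sample values, and the helper name first_or_empty are assumptions for illustration, not the real page or the author's code.

```python
from lxml import etree

# Hypothetical snippet: the second item deliberately lacks rating and comment count
page = """
<ul>
  <li class="subject-item">
    <div class="info"><a title="Book A" href="/a">Book A</a></div>
    <span class="rating_nums">8.9</span>
    <span class="pl"> (1234人评价) </span>
  </li>
  <li class="subject-item">
    <div class="info"><a title="Book B" href="/b">Book B</a></div>
  </li>
</ul>
"""

def first_or_empty(matches):
    """Return the first xpath match, or "" when the list is empty."""
    return matches[0] if matches else ""

html = etree.HTML(page)
records = []
# Absolute path: one element per book
for item in html.xpath("//li[@class='subject-item']"):
    # Relative paths (leading ".") search only inside the current item
    records.append({
        "title": first_or_empty(item.xpath(".//div[@class='info']//a/@title")),
        "rating_num": first_or_empty(item.xpath(".//span[@class='rating_nums']/text()")),
        "comment_num": first_or_empty(item.xpath(".//span[@class='pl']/text()")).strip().strip("()"),
    })

print(records)
```

The fallback matters because a missing field would otherwise raise IndexError on [0] and abort the whole page; with it, the record simply carries an empty string for that field.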

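Define the save_data function, which saves the parsed data to a local file. Before saving, some irregular data from the web page needs to be cleaned up. The helper file_exit_dec deletes any file left over from a previous run.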

    def file_exit_dec(self, file_name):
        # Remove output left over from a previous run, since save_data appends
        try:
            os.remove(file_name)
        except FileNotFoundError:
            print('File does not exist, now you can rewrite the file.')

    def save_data(self, file, data):
        if data:
            # Strip the outer [], put each record on its own line, and end each page's data with ","
            data_tmp = str(data).strip('[]').replace("}, ", "},\n") + ",\n"
            # Work around irregular data that would break the later quote conversion
            data_w = data_tmp.replace('\"[美]', '[美]').replace("\"China's Economy\"", "\'China s Economy\'")
            # print(data_w)
            with open(file, 'a', encoding='utf-8') as f:
                f.write(data_w)
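The special-case replacements above are needed because the file stores Python dict representations rather than JSON. As an alternative (not the approach used in this article), each record could be written as one JSON object per line, which handles quotes and non-ASCII text without manual patching. The sketch below assumes hypothetical helper names save_jsonl and load_jsonl, made-up sample records, and a temporary file path.

```python
import json
import os
import tempfile

def save_jsonl(path, records):
    """Append each record as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps Chinese text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up records containing the kinds of characters that needed special handling
sample = [
    {"title": "\"China's Economy\"", "pub": "[美] Some Author / 2018"},
    {"title": "经济学图书", "pub": "Some Press"},
]

path = os.path.join(tempfile.mkdtemp(), "books.jsonl")
save_jsonl(path, sample[:1])   # first "page"
save_jsonl(path, sample[1:])   # second "page", appended to the same file
loaded = load_jsonl(path)
print(loaded == sample)
```

With this format the reading side needs no quote conversion, since json.dumps already escapes whatever the titles contain.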

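In the main method run(), a while loop uses the start variable as the index of the first record on each page and builds a new url from it, then sends a new request, until the loop ends.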

    def run(self):

        # 0. Prepare the data
        headers = self.headers
        start = 0
        count = 20
        pages = 3
        html_abs_path = self.html_abs_path
        file = self.file_name
        self.file_exit_dec(file)

        while start < count * pages:
            # 1. Prepare the url
            url = self.tmp_url.format(start)

            # 2. Send the request and get the response
            res_dec = self.web_request(url, headers)

            # 3. Extract the data (pages is updated from the parsed page)
            html = etree.HTML(res_dec)
            dict_str, pages = self.parse_date(html_abs_path, html)
            # pages = 5

            # 4. Save the data
            self.save_data(file, dict_str)

            # 5. Prepare the next url
            start += count

        self.extract_data(file, 'title')
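To see which addresses the loop visits, the offsets can be generated on their own. This sketch reuses the url template from __init__ and assumes pages stays fixed at 3, whereas the real loop overwrites pages with the value returned by parse_date on every iteration.

```python
tmp_url = 'https://book.douban.com/tag/%E7%BB%8F%E6%B5%8E%E5%AD%A6?start={}&type=T'
count = 20   # books per page
pages = 3    # assumed fixed here; run() updates it from the parsed page

# start takes the values 0, 20, 40: the index of the first book on each page
urls = [tmp_url.format(start) for start in range(0, count * pages, count)]

for url in urls:
    print(url)
```

The condition start < count * pages in run() covers the same range, with the difference that pages = -1 from the parser makes the condition false and ends the loop early.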

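Once the loop ends (the data is now saved in the local html file), the data can be processed and analyzed. extract_data reads the wanted ele_name field back out of the html file for later processing (for example, loading the data into a DataFrame for processing, analysis, or modeling; this article omits that step).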

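    def extract_data(self, file, ele_name):
        # Read with the same encoding that save_data used when writing
        with open(file, 'r', encoding='utf-8') as a:
            # Convert single quotes to double quotes, drop the trailing ",\n", then split on ",\n"
            list_1 = a.read().replace('\'', '\"').rstrip(",\n").split(",\n")
        for cont in list_1:
            list_2 = json.loads(cont)[ele_name]
            print(list_2)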

if __name__ == '__main__':
    example = DoubanEconomyBookSpider()
    example.run()
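Converting every single quote to a double quote breaks as soon as a title contains an apostrophe, which is why save_data patches specific titles by hand. One possible alternative, shown here as a sketch rather than the article's method, is to parse each saved line as a Python literal with ast.literal_eval from the standard library. The saved text below is made up to match the shape that save_data writes.

```python
import ast

# Made-up text in the shape save_data writes: one dict repr per line, each followed by ","
saved = (
    "{'title': 'Book A', 'rating_num': '8.9'},\n"
    "{'title': \"China's Economy\", 'rating_num': '7.5'},\n"
)

# Same splitting as extract_data, but no quote conversion is needed
chunks = saved.rstrip(",\n").split(",\n")
records = [ast.literal_eval(chunk) for chunk in chunks]
titles = [record["title"] for record in records]

print(titles)
```

literal_eval only evaluates literals (strings, numbers, dicts, lists, and so on), so it does not execute arbitrary code from the file. A field value that itself contains ",\n" would still break the splitting step, as it does in extract_data.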

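Figure 1 in the original post shows the data saved to the html file: each book corresponds to one dictionary, for 998 records in total. Figure 2 shows the book titles extracted by calling extract_data(); the console output confirms that the wanted data was successfully scraped, saved, and extracted.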
