# Example: scraping a novel thread from Baidu Tieba ("小说吧")
import scrapy
import re
class QingrenSpider(scrapy.Spider):
name = 'qingren'
allowed_domains = ['tieba.baidu.com']
start_urls = ['https://tieba.baidu.com/p/5820130343']
f = open('走不出你.txt','a',encoding='utf-8')
def parse(self, response):
# 获取小说楼主的名字以及小说内容
div_list = response.xpath('//div[@class="l_post l_post_bright j_l_post clearfix "]')
# print(div_list)
# Approach 1: select the container tag, iterate over all child tags, and extract each tag's text:
for div in div_list:
author = div.xpath('.//div[@class="louzhubiaoshi_wrap"]').extract()
# print(author)
if len(author) != 0:
content_list = div.xpath('.//div[@class=