Scrapy can't parse a page with XPath? Extracting information from unusual web pages (1): Baidu Tieba

Kosmoo

Article link: https://blog.csdn.net/zwq912318834/article/details/79884640

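While crawling Baidu Tieba with Scrapy, XPath failed to match any page elements. Inspecting the page source showed that the content is nested inside a <code> tag. The fix is to extract the content of the <code> tag with a regular expression, process it with lxml's XPath, and take care when extracting the JSON attributes. This post shares that hands-on experience.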

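1. Background

Recently, while using Scrapy to crawl thread content from Baidu Tieba, I found that XPath could not match any page elements. Yet the XPath Helper browser extension clearly showed that the XPath expression itself was fine.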

articleList = response.xpath("//li[contains(@class, 'j_thread_list')]")
print(f"articleList = {articleList}")
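# Output: articleList = []

An earlier post (https://blog.csdn.net/zwq912318834/article/details/78738362) also pointed out that what you see in Inspect Element is the rendered page, whereas what Scrapy actually downloads is the content shown in the Preview tab. Out of curiosity, I printed that content: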

print(f"text = {response.text}")

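Skimming through it, the content looked essentially the same as the page in Inspect Element, so this did not look like the usual pattern of a page skeleton plus JavaScript rendering. I was completely at a loss as to why XPath could not extract the content.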

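2. Environment

Python 3.6.1
OS: Windows 7
IDE: PyCharm
Framework: Scrapy

3. Analysis

I rewrote the XPath expression many times in different ways, but nothing worked. Then I tried extracting every div element: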

articleList = response.xpath("//div")
print(f"articleList = {articleList}")
# Output: articleList = [XXXXXX.........]

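The result contained very few elements. Still, some elements were extracted, which shows that XPath in Scrapy does work. The problem therefore had to lie in the page content itself. At that point I could think of two possible causes:
- First, the XPath expression is wrong because the page content hides a trap; the previous post covered some of these.
- Second, the page structure is malformed and missing some tags, so XPath cannot recognize it.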
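One way to narrow down hypotheses like these is to compare a crude substring count on the raw text with the tags an HTML parser actually reports as elements. The sketch below is only an illustration: it uses the standard library's html.parser as a stand-in for Scrapy's selectors so that it runs without third-party packages, and SAMPLE, TagCounter, and compare are made-up names and data, not part of the real Tieba response or of Scrapy.

```python
from html.parser import HTMLParser

# Made-up miniature page (an assumption, not the real Tieba response),
# shaped to reproduce the symptom: markup that is visible in the text
# but is not reported as elements.
SAMPLE = """
<html><body>
<div id="head">header</div>
<code style="display:none;"><!--
<ul><li class="j_thread_list"><div class="t_con">post</div></li></ul>
--></code>
</body></html>
"""


class TagCounter(HTMLParser):
    """Counts start tags that the parser really treats as elements."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def compare(raw, tag):
    """Return (occurrences of '<tag' in the raw text, parsed start tags)."""
    counter = TagCounter()
    counter.feed(raw)
    counter.close()
    return raw.count("<" + tag), counter.counts.get(tag, 0)


if __name__ == "__main__":
    for name in ("div", "li"):
        in_text, parsed = compare(SAMPLE, name)
        print(f"<{name}>: {in_text} in raw text, {parsed} parsed as elements")
```

If the raw-text count is clearly higher than the parsed count, the XPath expression is probably not the culprit; some of the markup is simply not reaching the parser as elements. The substring count is deliberately crude (it would also match tags such as `<link` when counting `<li`), so treat it as a hint rather than a measurement.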
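After carefully analyzing the page source, I finally spotted it: the part of the page I wanted to extract was displayed in green (the color used for strings), while tags were displayed in purple.
It turned out that the core page content is embedded as a string, wrapped in an HTML comment, inside a <code> tag. Very, very well hidden! (My guess is that whoever wrote this page plays a rogue.)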
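To make the finding concrete, here is a minimal sketch of pulling the hidden markup out of the <code> tag with a regular expression and then parsing it as ordinary HTML. It is not the author's code: it relies on the standard library's html.parser instead of Scrapy or lxml XPath so that it runs anywhere, and SAMPLE, CODE_COMMENT, ThreadListParser, and extract_threads are hypothetical names and data introduced only for illustration.

```python
import json
import re
from html.parser import HTMLParser

# Made-up miniature page (an assumption), shaped like the excerpt in this post.
SAMPLE = """
<code class="pagelet_html" id="pagelet_html_frs-list/pagelet/thread_list" style="display:none;"><!--
<ul id="thread_list">
  <li class=" j_thread_list clearfix" data-field='{&quot;id&quot;:1795630165,&quot;reply_num&quot;:8}'>
    <a href="/p/1795630165" class="j_th_tit ">demo title</a>
  </li>
</ul>
--></code>
"""

# Hypothetical pattern: capture the comment body of the thread-list <code> tag.
CODE_COMMENT = re.compile(
    r'<code[^>]*id="pagelet_html_frs-list/pagelet/thread_list"[^>]*>'
    r"\s*<!--(.*?)-->\s*</code>",
    re.S,
)


class ThreadListParser(HTMLParser):
    """Collects the decoded data-field of every <li> with class j_thread_list."""

    def __init__(self):
        super().__init__()
        self.threads = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "j_thread_list" in (attrs.get("class") or "").split():
            # html.parser has already unescaped &quot;, so the value is plain JSON.
            self.threads.append(json.loads(attrs.get("data-field") or "{}"))


def extract_threads(page):
    """Pull the hidden markup out of the comment, then parse it as real HTML."""
    match = CODE_COMMENT.search(page)
    if match is None:
        return []
    parser = ThreadListParser()
    parser.feed(match.group(1))
    parser.close()
    return parser.threads


if __name__ == "__main__":
    print(extract_threads(SAMPLE))
```

In a Scrapy project the same idea should carry over by handing the captured string to a selector, for example `scrapy.Selector(text=...)`, and running the original XPath against that. Note that the data-field attribute holds JSON with `&quot;` entities; an HTML parser unescapes them, but if the attribute is cut out with a regular expression instead, it would need `html.unescape` before `json.loads`.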
The relevant part of the page source looks like this:

<code class="pagelet_html" id="pagelet_html_frs-list/pagelet/thread_list" style="display:none;">

<!--
<ul id="thread_list" class="threadlist_bright j_threadlist_bright">
    <li class=" j_thread_list clearfix" data-field='{&quot;id&quot;:1795630165,&quot;author_name&quot;:&quot;\u7f8e\u7eaf\u8f76&quot;,&quot;first_post_id&quot;:23158535830,&quot;reply_num&quot;:8,&quot;is_bakan&quot;:null,&quot;vid&quot;:&quot;&quot;,&quot;is_good&quot;:true,&quot;is_top&quot;:null,&quot;is_protal&quot;:null,&quot;is_membertop&quot;:null,&quot;is_multi_forum&quot;:null,&quot;frs_tpoint&quot;:null}' >
            <div class="t_con cleafix">
                            <div class="col2_left j_threadlist_li_left">

                        <span class="threadlist_rep_num center_text"
                            title="回复">8</span>
                            </div>
                <div class="col2_right j_threadlist_li_right ">
            <div class="threadlist_lz clearfix">
                <div class="threadlist_title pull_left j_th_tit 
">


    <a rel="noreferrer"  href="/p/1795630165" title="去北京看了一场球赛" target="_blank" class="j_th_tit ">去北京看了一场球赛</a>
</div><div class="threadlist_author pull_right">
    <span class="tb_icon_author "
          title="主题作者: 美纯轶"
          data-field='{&quot;user_id&quot;:3464099}' ><i class="icon_author"></i><span class="frs-author-name-wrap"><a rel="noreferrer"  data-field='{&quot;un&quot;:&quot;\u7f8e\u7eaf\u8f76&quot;}' class="frs-author-name j_user_card " href="/home/main/?un=%E7%BE%8E%E7%BA%AF%E8%BD%B6&ie=utf-8&fr=frs" target="_blank">美纯轶</a></span><span class="icon_wrap  icon_wrap_theme1 frs_bright_icons "></span>    </span>
    <span class="pull-right is_show_create_time" title="创建时间">2013-10</span>
</div>
            </div>
                            <div class="threadlist_detail clearfix">
                    <div class="threadlist_text pull_left">
                                <div class=