Help: why does the suffix of the links my spider crawls change?

wswwsw_123

Published 2020-12-03 15:21:16

Tags: python html

Article link: https://blog.csdn.net/wswwsw_123/article/details/110532491

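While crawling a website with Scrapy, links that appear as .htm on the page show up as .html in the crawl results, even though no .html links are visible on the page. The cause may be a URL redirect or dynamically loaded page content. Looking for a solution.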

Summary generated automatically by CSDN

I am using Scrapy to crawl image links, with the following rules:

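from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BizhixiazaiSpider(CrawlSpider):  # class name assumed; the post only shows the attributes
    name = 'bizhixiazai'
    allowed_domains = ['netbian.com']
    start_urls = ['http://www.netbian.com']

    rules = (
        # Follow the pagination links.
        Rule(LinkExtractor(allow=r'/index.+htm', restrict_xpaths=['//div[@class="page"]//a']), follow=True),
        # Send links from the listing to parse_detail (not shown in the post).
        Rule(LinkExtractor(allow=r'.+htm', restrict_xpaths=['//div[@class="list"]//a']), callback='parse_detail', follow=False),
    )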
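
One thing worth noting about the `allow` patterns in these rules: as far as I know, LinkExtractor applies them with a regular-expression search rather than a full match, so neither pattern is anchored at the end and both would also accept URLs ending in `.html`. Here is a minimal sketch with the standard `re` module. The `.htm` sample URL, the `matches` helper, and `STRICT_PATTERN` are made up for illustration and are only a suggestion. This would explain why `.html` URLs are not filtered out, but not where they come from.

```python
import re

# The two allow patterns taken from the rules.
PAGE_PATTERN = r'/index.+htm'
DETAIL_PATTERN = r'.+htm'

# Hypothetical stricter variant: accept only URLs ending in exactly ".htm".
STRICT_PATTERN = r'/index.+\.htm$'

urls = [
    'http://www.netbian.com/index_10.htm',
    'http://www.netbian.com/index_10.html',
]


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL (search semantics)."""
    return re.search(pattern, url) is not None


for url in urls:
    # Prints the URL, then whether the original and the strict pattern accept it.
    print(url, matches(PAGE_PATTERN, url), matches(STRICT_PATTERN, url))
```

With search semantics the original pattern accepts both sample URLs, while the anchored variant accepts only the one ending in `.htm`.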

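I have already restricted what should be crawled, so why does the suffix of the crawled paths change from htm to html? The links shown on the page are .htm links, and I could not find any .html links there. I am a beginner, so any help is appreciated.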
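
One way to check whether `.html` links are present in the HTML the spider actually receives (which can differ from what the browser shows, for example if the page is modified by JavaScript or served differently to different clients) is to tally the link suffixes in the raw response body. Below is a rough sketch using only the standard library. `SAMPLE_HTML`, `HrefCollector`, and `count_suffixes` are made up for illustration. In a real check the HTML would come from `response.text` inside the spider or in `scrapy shell`.

```python
from collections import Counter
from html.parser import HTMLParser
from os.path import splitext
from urllib.parse import urlparse


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def count_suffixes(html):
    """Return a Counter of the file extensions found in the link paths."""
    collector = HrefCollector()
    collector.feed(html)
    return Counter(splitext(urlparse(h).path)[1] for h in collector.hrefs)


# Made-up sample standing in for the raw page source.
SAMPLE_HTML = '''
<div class="page">
  <a href="/index_2.htm">2</a>
  <a href="/index_3.html">3</a>
</div>
'''

print(count_suffixes(SAMPLE_HTML))
```

If the tally for the real response contains `.html` entries, the links are already in the served HTML and the spider is not rewriting them.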

This is the crawl output:
2020-12-03 14:55:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.netbian.com/index_1260.html> (referer: http://www.netbian.com)
2020-12-03 14:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.netbian.com/index_10.html> (referer: http://www.netbian.com)
2020-12-03 14:5