I'm using Scrapy to crawl image links, with the following rules:
name = 'bizhixiazai'
allowed_domains = ['netbian.com']
start_urls = ['http://www.netbian.com']
rules = (
    # Follow pagination links found inside the "page" div.
    Rule(LinkExtractor(allow=r'/index.+htm', restrict_xpaths=['//div[@class="page"]//a']), follow=True),
    # Send links found inside the "list" div to parse_detail.
    Rule(LinkExtractor(allow=r'.+htm', restrict_xpaths=['//div[@class="list"]//a']), callback='parse_detail', follow=False),
)
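I've already restricted the crawl scope, so why do the crawled URLs end in .html instead of .htm? The links shown on the page are .htm links, and I couldn't find any .html links. I'm a beginner, so any help is appreciated.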
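One thing that may be worth checking: the `allow` patterns in `LinkExtractor` are regular expressions that are searched for anywhere in the URL, not matched against the whole URL. So `r'/index.+htm'` does not exclude `.html` links; it accepts any URL that contains `htm` after `/index`. The sketch below uses only Python's `re` module to illustrate this. The `.htm` URL and the stricter `anchored` pattern are hypothetical, added purely for comparison.

```python
import re

# The pagination pattern from the rules above, compiled as a plain regex.
allow_page = re.compile(r'/index.+htm')

# Hypothetical stricter pattern: escaped dot and end-of-string anchor,
# so only URLs that end in ".htm" are accepted.
anchored = re.compile(r'/index_\d+\.htm$')

urls = [
    'http://www.netbian.com/index_10.htm',   # hypothetical .htm variant
    'http://www.netbian.com/index_10.html',  # URL seen in the crawl log
]

def matches(pattern, url):
    # An unanchored search succeeds if the pattern occurs anywhere in the URL.
    return pattern.search(url) is not None

unanchored_hits = [u for u in urls if matches(allow_page, u)]
anchored_hits = [u for u in urls if matches(anchored, u)]

print(unanchored_hits)
print(anchored_hits)
```

If the unanchored pattern accepts both URLs, the rules themselves are not what turns `.htm` into `.html`; the `.html` links would have to be present in the page's actual HTML source, which can differ from what the browser displays.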
Here is the crawl output:
、、、
2020-12-03 14:55:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.netbian.com/index_1260.html> (referer: http://www.netbian.com)
2020-12-03 14:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.netbian.com/index_10.html> (referer: http://www.netbian.com)
2020-12-03 14:5