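Today, while crawling a web page with the Scrapy framework, I used an XPath expression that looked correct, yet the result was an empty list. The code is as follows: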
# -*- coding: utf-8 -*-
import scrapy


class HaodfSpider(scrapy.Spider):
    name = 'haodf'
    start_urls = ['http://bbs.tnbz.com/forum-6-2.html']

    def parse(self, response):
        # Iterate over the thread rows of the forum listing, skipping the separator row.
        for item in response.xpath(r'//table[@summary="forum_6"]/tbody[not(contains(@id,"separatorline"))]'):
            url_s = item.xpath('./tr/th/a[3]/@href').get()
            if url_s:
                # urljoin keeps the request valid if the href is relative.
                yield scrapy.Request(response.urljoin(url_s), callback=self.parse_s)

    def parse_s(self, response):
        # This XPath contains a tbody step and prints an empty list.
        print(response.xpath('//div/table[@class="plhin"]/tbody/tr/td//div[@class="t_fsz"]//td').extract())
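I later found that removing tbody from the XPath fixes it. The reason is that browsers normalize the HTML document: when parsing a table they insert a <tbody> element even if the page source has none, so an XPath copied from the browser's developer tools contains a tbody step that does not exist in the raw HTML that Scrapy downloads.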
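
The mismatch can be reproduced without any network access. The following is a minimal sketch, not the actual page: the markup in RAW_HTML is a hypothetical, simplified stand-in for the raw source, and it uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors so that it runs without extra dependencies. The point it illustrates is only that a path with a tbody step matches nothing when the source has no <tbody>.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the raw page source.
# Note that the outer table has no <tbody>, unlike what browser devtools display.
RAW_HTML = """
<div>
  <table class="plhin">
    <tr>
      <td><div class="t_fsz"><table><tr><td>post text</td></tr></table></div></td>
    </tr>
  </table>
</div>
"""

root = ET.fromstring(RAW_HTML)

# Path as it would be copied from the browser, including the inserted tbody step.
with_tbody = root.findall(".//table[@class='plhin']/tbody/tr/td")

# Same path with the tbody step removed, matching the raw source.
without_tbody = root.findall(".//table[@class='plhin']/tr/td")

print(len(with_tbody))     # 0
print(len(without_tbody))  # 1
```

When in doubt, it helps to check the expression against the downloaded source (for example in scrapy shell, or via "view page source" in the browser) rather than against the element tree shown in the developer tools.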