Can Beautiful Soup's html5lib parser replace Scrapy's default lxml parser?

Question: Is there a way to integrate Beautiful Soup's html5lib parser into a Scrapy project, in place of Scrapy's default lxml parser?

Scrapy's parser fails on certain elements of some scraped pages. This only happens on about 2 of every 20 pages.

As a fix, I added Beautiful Soup's parser to the project (it works).

That said, I feel like I'm doubling the work with conditionals and multiple parsers... at some point, what's the reason for using Scrapy's parser at all?

The code does work...

I'm not an expert, so is there a more elegant way to do this?

Thanks in advance.

Update:

Adding a middleware class to Scrapy (taken from the Python package scrapy-beautifulsoup) worked like a charm. Apparently, lxml as used by Scrapy is not as robust as lxml as used by BeautifulSoup. I didn't have to resort to the html5lib parser, which is more than 30x slower.

from bs4 import BeautifulSoup


class BeautifulSoupMiddleware(object):

    def __init__(self, crawler):
        super(BeautifulSoupMiddleware, self).__init__()
        # Parser name is configurable via the BEAUTIFULSOUP_PARSER setting.
        self.parser = crawler.settings.get('BEAUTIFULSOUP_PARSER', "html.parser")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        """Overridden process_response would "pipe" response.body through BeautifulSoup."""
        return response.replace(body=str(BeautifulSoup(response.body, self.parser)))
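
To activate it, the middleware has to be registered in the project's settings. A minimal sketch, assuming the class lives in a module importable as myproject.middlewares (the module path and the priority value 543 are placeholders to adapt to your project):

    # settings.py
    # Run every response body through BeautifulSoup before it reaches the spider.
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.BeautifulSoupMiddleware': 543,
    }

    # Read by BeautifulSoupMiddleware.__init__ via crawler.settings.
    # 'lxml' was robust enough here; 'html5lib' would also work, just far slower.
    BEAUTIFULSOUP_PARSER = 'lxml'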

Original:

import scrapy
from scrapy.item import Item, Field
from scrapy.loader.processors import TakeFirst, MapCompose
from scrapy import Selector
from scrapy.loader import ItemLoader
from w3lib.html import remove_tags
from bs4 import BeautifulSoup


class SimpleSpider(scrapy.Spider):
    name = 'SimpleSpider'
    allowed_domains = ['totally-above-board.com']
    start_urls = [
        'https://totally-above-board.com/nefarious-scrape-page.html'
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            'crawler.spiders.simple_spider.Pipeline': 400
        }
    }

    def parse(self, response):
        yield from self.parse_company_info(response)
        yield from self.parse_reviews(response)

    def parse_company_info(self, response):
        print('parse_company_info')
        print('==================')

        # CompanyItem (and ReviewItem below) are defined elsewhere in the project.
        loader = ItemLoader(CompanyItem(), response=response)
        loader.add_xpath('company_name',
                         '//h1[contains(@class,"sp-company-name")]//span//text()')

        yield loader.load_item()

    def parse_reviews(self, response):
        print('parse_reviews')
        print('=============')

        selector = Selector(response)

        # Total review count shown on the page, e.g. 49
        search = '//span[contains(@itemprop,"reviewCount")]//text()'
        review_count = selector.xpath(search).get()
        review_count = int(float(review_count))

        # Number of review elements Scrapy's lxml could actually find, e.g. 0
        search = '//div[@itemprop="review"]'
        review_element_count = len(selector.xpath(search))

        # Use Scrapy or Beautiful Soup?
        if review_count > review_element_count:
            # Try Beautiful Soup
            soup = BeautifulSoup(response.text, "lxml")
            root = soup.findAll("div", {"itemprop": "review"})
            for review in root:
                loader = ItemLoader(ReviewItem(), selector=review)

                review_text = review.find("span", {"itemprop": "reviewBody"}).text
                loader.add_value('review_text', review_text)

                author = review.find("span", {"itemprop": "author"}).text
                loader.add_value('author', author)

                yield loader.load_item()
        else:
            # Try Scrapy
            review_list_xpath = '//div[@itemprop="review"]'
            selector = Selector(response)
            for review in selector.xpath(review_list_xpath):
                loader = ItemLoader(ReviewItem(), selector=review)
                loader.add_xpath('review_text',
                                 './/span[@itemprop="reviewBody"]//text()')
                loader.add_xpath('author',
                                 './/span[@itemprop="author"]//text()')
                yield loader.load_item()

        yield from self.paginate_reviews(response)

    def paginate_reviews(self, response):
        print('paginate_reviews')
        print('================')

        # Try Scrapy
        selector = Selector(response)
        search = '''//span[contains(@class,"item-next")]
                    //a[@class="next"]/@href
                 '''
        next_reviews_link = selector.xpath(search).get()

        # Try Beautiful Soup
        if next_reviews_link is None:
            soup = BeautifulSoup(response.text, "lxml")
            try:
                next_reviews_link = soup.find("a", {"class": "next"})['href']
            except Exception:
                pass

        if next_reviews_link:
            yield response.follow(next_reviews_link, self.parse_reviews)
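
For comparison, once the middleware above pipes every response through BeautifulSoup, the dual-parser branching in parse_reviews can collapse into a single Scrapy-only path. A minimal sketch of what the method could shrink to (same XPaths as above; ReviewItem is still assumed to be defined elsewhere, as in the original):

    def parse_reviews(self, response):
        # The middleware has already re-serialized the body through BeautifulSoup,
        # so Scrapy's selectors see the same markup BeautifulSoup would produce.
        for review in response.xpath('//div[@itemprop="review"]'):
            loader = ItemLoader(ReviewItem(), selector=review)
            loader.add_xpath('review_text', './/span[@itemprop="reviewBody"]//text()')
            loader.add_xpath('author', './/span[@itemprop="author"]//text()')
            yield loader.load_item()

        next_reviews_link = response.xpath(
            '//span[contains(@class,"item-next")]//a[@class="next"]/@href').get()
        if next_reviews_link:
            yield response.follow(next_reviews_link, self.parse_reviews)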
