Getting the HTML between h3/h2 tags using XPath/BeautifulSoup

You can use both XPath and CSS selectors in Scrapy.

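Here is an example solution in an IPython session (I only changed #1 and #2 in the second block to #3 and #4 to make the grouping more obvious):

In [1]: import scrapy

In [2]: selector = scrapy.Selector(text="""<h3>First title</h3>
   ...: <ul>
   ...:   <li>Text for the first title... li #1</li>
   ...:   <li>Text for the first title... li #2</li>
   ...: </ul>
   ...: <h3>Second title</h3>
   ...: <ul>
   ...:   <li>Text for the second title... li #3</li>
   ...:   <li>Text for the second title... li #4</li>
   ...: </ul>
   ...: """)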

In [3]: for title_list in selector.css('h3 + ul'):
   ...:     print(title_list.xpath('./li/text()').extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']

In [4]: for title_list in selector.css('h3 + ul'):
   ...:     print(title_list.css('li::text').extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']

Edit, following the OP's question in the comments:

"Every <li> tag is enclosed in its own <ul> (...) Is there any way to extend that to make it look for all the ul tags below the h3 tag?"
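
The selector 'h3 + ul' matches only a ul that immediately follows an h3, so when every li sits in its own ul it returns just the first item of each group. Here is a minimal sketch of that behavior. It uses the standard library's xml.etree.ElementTree rather than Scrapy and emulates the adjacent-sibling combinator by hand; the markup and the function name adjacent_ul_items are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed markup matching the comment: every li is wrapped in its own ul.
HTML = """
<h3>First title</h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3>Second title</h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
"""


def adjacent_ul_items(fragment):
    """Emulate 'h3 + ul': keep only a ul whose previous sibling is an h3."""
    # Wrap the fragment in a single root element so ElementTree can parse it.
    root = ET.fromstring("<root>%s</root>" % fragment)
    children = list(root)
    matches = []
    for prev, cur in zip(children, children[1:]):
        if prev.tag == "h3" and cur.tag == "ul":
            matches.append([li.text for li in cur.findall("li")])
    return matches


print(adjacent_ul_items(HTML))
# [['Text for the first title... li #1'], ['Text for the second title... li #3']]
```

Items #2 and #4 are missed because their ul elements follow another ul, not an h3.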

If the h3 and ul elements are all siblings, one way to select the ul elements that come before the next h3 is to count their preceding h3 siblings.

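Consider the following input HTML fragment:

<h3>First title</h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3>Second title</h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>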

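The first <ul><li> line has 1 preceding h3 sibling, and the third <ul><li> line has 2 preceding h3 siblings.

So for each h3, you want the following ul siblings whose number of preceding h3 siblings is exactly the number of h3 elements you have seen so far.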

For the first h3:

following-sibling::ul[count(preceding-sibling::h3)=1]

then, for the second:

following-sibling::ul[count(preceding-sibling::h3)=2]

and so on.

Here is this idea implemented with the help of enumerate() over the h3 selection (remember that XPath positions start at 1, not 0):

In [1]: import scrapy

In [2]: selector = scrapy.Selector(text="""
<h3>First title</h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3>Second title</h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
""")

In [3]: for cnt, title in enumerate(selector.css('h3'), start=1):
   ...:     print(title.xpath('following-sibling::ul[count(preceding-sibling::h3)=%d]/li/text()' % cnt).extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']
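
The counting idea can also be sketched without Scrapy, as a single pass over the siblings using only the standard library. This is an illustration under assumptions, not part of the original answer: it relies on the fragment being well-formed enough for xml.etree.ElementTree, and the function name group_items_by_heading is made up for the example. The running number of h3 elements seen so far plays the role of count(preceding-sibling::h3).

```python
import xml.etree.ElementTree as ET

# Assumed markup: h3 and ul elements are siblings, one li per ul.
HTML = """
<h3>First title</h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3>Second title</h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
"""


def group_items_by_heading(fragment):
    """Pair each h3 text with the li texts of the ul siblings up to the next h3."""
    # Wrap the fragment in a single root element so ElementTree can parse it.
    root = ET.fromstring("<root>%s</root>" % fragment)
    groups = []  # list of (heading text, [li texts]) in document order
    for child in root:
        if child.tag == "h3":
            # A new heading starts a new group; len(groups) is now the
            # number of h3 siblings seen so far.
            groups.append((child.text, []))
        elif child.tag == "ul" and groups:
            # This ul has exactly len(groups) preceding h3 siblings,
            # so it belongs to the most recent heading.
            groups[-1][1].extend(li.text for li in child.findall("li"))
    return groups


for heading, items in group_items_by_heading(HTML):
    print(heading, items)
# First title ['Text for the first title... li #1', 'Text for the first title... li #2']
# Second title ['Text for the second title... li #3', 'Text for the second title... li #4']
```

Real-world HTML is often not well-formed XML, in which case a tolerant parser such as the one behind Scrapy's selectors is the safer choice.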
