h2 tags in HTML: using XPath/BeautifulSoup to get the HTML between h3/h2 tags

卖家胖蝌蚪

You can use both XPath and CSS selectors in Scrapy.

Here is a sample solution (in an IPython session; I only changed #1 and #2 in the second block to #3 and #4 so the result is easier to follow):

In [1]: import scrapy

In [2]: selector = scrapy.Selector(text="""
   ...: <h3>First title</h3>
   ...: <ul>
   ...: <li>Text for the first title... li #1</li>
   ...: <li>Text for the first title... li #2</li>
   ...: </ul>
   ...: <h3>Second title</h3>
   ...: <ul>
   ...: <li>Text for the second title... li #3</li>
   ...: <li>Text for the second title... li #4</li>
   ...: </ul>
   ...: """)

In [3]: for title_list in selector.css('h3 + ul'):
   ...:     print(title_list.xpath('./li/text()').extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']

In [4]: for title_list in selector.css('h3 + ul'):
   ...:     print(title_list.css('li::text').extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']

Edit, following the OP's question in the comments:

Every <li> tag is enclosed in its own <ul> (...) Is there any way to extend that to make it look for all the ul tags below the h3 tag?

If the h3 and ul elements are all siblings, one way to select the ul elements that come before the next h3 is to count their preceding <h3> siblings.

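Consider the following input HTML fragment:

<h3>First title</h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3>Second title</h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>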

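The first <ul> line has 1 preceding h3 sibling, and the third <ul> line has 2 preceding h3 siblings.

So, for each h3, you want the following ul siblings whose count of preceding h3 siblings equals the number of h3 elements seen so far.

For the first h3:

following-sibling::ul[count(preceding-sibling::h3)=1]

then, for the second:

following-sibling::ul[count(preceding-sibling::h3)=2]

and so on.
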
Here is that idea implemented with the help of enumerate() over the h3 selection (remember that XPath positions start at 1, not 0):

In [1]: import scrapy

In [2]: selector = scrapy.Selector(text="""
   ...: <h3>First title</h3>
   ...: <ul><li>Text for the first title... li #1</li></ul>
   ...: <ul><li>Text for the first title... li #2</li></ul>
   ...: <h3>Second title</h3>
   ...: <ul><li>Text for the second title... li #3</li></ul>
   ...: <ul><li>Text for the second title... li #4</li></ul>
   ...: """)

In [3]: for cnt, title in enumerate(selector.css('h3'), start=1):
   ...:     print(title.xpath('following-sibling::ul[count(preceding-sibling::h3)=%d]/li/text()' % cnt).extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']
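
To check the counting idea without XPath, the same grouping can be sketched in plain Python with the standard library. This is an illustration under stated assumptions, not part of the original answer: the function name group_items_by_heading is made up, the fragment is wrapped in a <div> so that it has a single root, and xml.etree.ElementTree only accepts well-formed markup, so real-world HTML would need an HTML parser instead.

```python
import xml.etree.ElementTree as ET

# The fragment from the example, wrapped in a <div> to give it one root element.
FRAGMENT = """
<div>
<h3>First title</h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3>Second title</h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
</div>
"""


def group_items_by_heading(fragment):
    """Return one list of <li> texts per <h3>, in document order."""
    root = ET.fromstring(fragment.strip())
    groups = []
    for child in root:  # direct children of the root are siblings of each other
        if child.tag == "h3":
            groups.append([])  # each heading opens a new group
        elif child.tag == "ul" and groups:
            # At this point len(groups) is the number of <h3> siblings that
            # precede this <ul>, i.e. count(preceding-sibling::h3).
            groups[-1].extend(li.text for li in child.findall("li"))
    return groups


groups = group_items_by_heading(FRAGMENT)
for items in groups:
    print(items)
```

Each ul lands in the group of the most recent h3, which is the same condition the XPath predicate expresses with count(preceding-sibling::h3).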
