You can use both XPath and CSS selectors in Scrapy.
Here is an example solution (in an IPython session; I only changed #1 and #2 in the second block to #3 and #4 to make it more obvious):
In [1]: import scrapy
In [2]: selector = scrapy.Selector(text="""
   ...: <h3>First title</h3>
   ...: <ul>
   ...: <li>Text for the first title... li #1</li>
   ...: <li>Text for the first title... li #2</li>
   ...: </ul>
   ...: <h3>Second title</h3>
   ...: <ul>
   ...: <li>Text for the second title... li #3</li>
   ...: <li>Text for the second title... li #4</li>
   ...: </ul>
   ...: """)
In [3]: for title_list in selector.css('h3 + ul'):
   ...:     print(title_list.xpath('./li/text()').extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']
In [4]: for title_list in selector.css('h3 + ul'):
   ...:     print(title_list.css('li::text').extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']
Edit, following OP's question in the comments:
"Every <li> tag is enclosed in its own <ul> (...) Is there any way to extend that to make it look for all the ul tags below the h3 tag?"
If the h3 and ul elements are all siblings, one way to select the ul elements that come before the next h3 is to count their preceding h3 siblings.
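Consider the following input HTML fragment:
<h3>First title</h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3>Second title</h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
The first <ul><li> line has 1 preceding h3 sibling; the third <ul><li> line has 2 preceding h3 siblings.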
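So, for each h3, you want the following ul siblings whose number of preceding h3 siblings is exactly the number of h3 elements you have seen so far.
For the first h3:
following-sibling::ul[count(preceding-sibling::h3)=1]
then, for the second:
following-sibling::ul[count(preceding-sibling::h3)=2]
and so on.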
Here is that idea implemented with the help of enumerate() on the h3 selection (remember that XPath positions start at 1, not 0):
In [1]: import scrapy
In [2]: selector = scrapy.Selector(text="""
   ...: <h3>First title</h3>
   ...: <ul><li>Text for the first title... li #1</li></ul>
   ...: <ul><li>Text for the first title... li #2</li></ul>
   ...: <h3>Second title</h3>
   ...: <ul><li>Text for the second title... li #3</li></ul>
   ...: <ul><li>Text for the second title... li #4</li></ul>
   ...: """)
In [3]: for cnt, title in enumerate(selector.css('h3'), start=1):
   ...:     print(title.xpath('following-sibling::ul[count(preceding-sibling::h3)=%d]/li/text()' % cnt).extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']
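To see why counting preceding h3 siblings works, the same grouping can be sketched without XPath, using only the standard library. This is a minimal illustration, not Scrapy's mechanism: `SectionGrouper`, `group_items` and `SAMPLE` are hypothetical names, and the sketch assumes a flat document in which every h3 and ul is a sibling, so a running count of h3 tags in document order plays the role of count(preceding-sibling::h3).

```python
from html.parser import HTMLParser


class SectionGrouper(HTMLParser):
    """Hypothetical helper: group <li> text by the number of <h3> tags seen so far."""

    def __init__(self):
        super().__init__()
        self.h3_seen = 0      # running count, stands in for count(preceding-sibling::h3)
        self.groups = {}      # h3 count -> list of <li> texts
        self._in_li = False   # True while the parser is inside an <li>

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.h3_seen += 1
            self.groups[self.h3_seen] = []
        elif tag == "li":
            self._in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        # Keep only non-empty text inside an <li> that follows at least one <h3>.
        text = data.strip()
        if self._in_li and self.h3_seen and text:
            self.groups[self.h3_seen].append(text)


def group_items(html):
    """Return {h3 position (1-based): [li texts that follow that h3]}."""
    parser = SectionGrouper()
    parser.feed(html)
    parser.close()
    return parser.groups


# Assumed sample input: each <li> sits in its own <ul>, as in the OP's comment.
SAMPLE = """
<h3>First title</h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3>Second title</h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
"""

groups = group_items(SAMPLE)
for count in sorted(groups):
    print(count, groups[count])
```

Each key in the resulting dict corresponds to the `cnt` value that enumerate() supplies in the XPath version. Unlike the XPath expression, this sketch ignores nesting, so it is only equivalent when the headings and lists really are siblings.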