Position of h2 tags in HTML: python - using XPath / BeautifulSoup to get the HTML between h3 / h2 tags

Latest recommended article published on 2023-10-10 15:59:53

Paris李晶


I am using Scrapy for a project and I get the following HTML:

<h3>First title</h3>
<ul>
<li>Text for the first title... li #1</li>
<li>Text for the first title... li #2</li>
</ul>
<h3>Second title</h3>
<ul>
<li>Text for the second title... li #1</li>
<li>Text for the second title... li #2</li>
</ul>

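Now, when I use response.xpath(".//ul/li/text()").extract(), it does work and gives me ["Text for the first title... li #1", "Text for the first title... li #2", "Text for the second title... li #1", "Text for the second title... li #2"], but this is only partly what I want.

I want two lists, one for the first title and another for the second title.

That way, the result would be: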

first_title = ["Text for the first title... li #1", "Text for the first title... li #2"]

second_title = ["Text for the second title... li #1", "Text for the second title... li #2"]

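I still have no idea how to achieve this. I am currently using Scrapy to fetch the HTML, so a solution combining XPath with plain Python would be ideal for me. Still, I have a feeling that BeautifulSoup would be useful for this kind of task.

Do you have any ideas on how to do this in Python?

Solution:

You can use both XPath and CSS selectors in Scrapy.

Here is a sample solution (in an ipython session; I only changed #1 and #2 in the second block to #3 and #4 to make the grouping more obvious):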

In [1]: import scrapy

In [2]: selector = scrapy.Selector(text="""<h3>First title</h3>
   ...: <ul>
   ...: <li>Text for the first title... li #1</li>
   ...: <li>Text for the first title... li #2</li>
   ...: </ul>
   ...: <h3>Second title</h3>
   ...: <ul>
   ...: <li>Text for the second title... li #3</li>
   ...: <li>Text for the second title... li #4</li>
   ...: </ul>""")

In [3]: for title_list in selector.css('h3 + ul'):
   ...:     # each <ul> directly following an <h3>, queried with a relative XPath
   ...:     print(title_list.xpath('./li/text()').extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']

In [4]: for title_list in selector.css('h3 + ul'):
   ...:     # same result, using a CSS selector for the <li> text
   ...:     print(title_list.css('li::text').extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']

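Edit after the OP's follow-up question:

Every <li> tag is enclosed in its own <ul> (…) Is there any way to extend that to make it look for all the ul tags below the h3 tag?

If the h3 and ul elements are all siblings, one way to select the ul elements that come before the next h3 is to count the preceding h3 siblings.

Consider the following input HTML snippet: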

<h3>First title</h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3>Second title</h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>

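The first <ul><li> line has 1 preceding h3 sibling; the third <ul><li> line has 2 preceding h3 siblings.

So, for each h3, you want the following ul siblings whose number of preceding h3 siblings equals the number of h3 elements you have seen so far.

First:

following-sibling::ul[count(preceding-sibling::h3)=1]

Then:

following-sibling::ul[count(preceding-sibling::h3)=2]

And so on.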

Here is this idea in action, using enumerate() on the h3 selection (remember that XPath positions start at 1, not 0):

In [1]: import scrapy

In [2]: selector = scrapy.Selector(text="""<h3>First title</h3>
   ...: <ul><li>Text for the first title... li #1</li></ul>
   ...: <ul><li>Text for the first title... li #2</li></ul>
   ...: <h3>Second title</h3>
   ...: <ul><li>Text for the second title... li #3</li></ul>
   ...: <ul><li>Text for the second title... li #4</li></ul>""")

In [3]: for cnt, title in enumerate(selector.css('h3'), start=1):
   ...:     # keep only the following <ul> siblings that have exactly cnt preceding <h3> siblings
   ...:     print(title.xpath('following-sibling::ul[count(preceding-sibling::h3)=%d]/li/text()' % cnt).extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']
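Since the question also mentions BeautifulSoup, here is a rough equivalent as a hedged sketch, not taken from the original answer. It assumes the bs4 package is installed, that the h3 and ul elements are siblings, and that titles are unique; the function name group_items_by_heading is made up for illustration. Instead of counting, it walks the siblings after each h3 and stops at the next h3.

```python
from bs4 import BeautifulSoup

# Assumed markup: every <li> is wrapped in its own <ul>, all siblings of the <h3>s.
HTML = """
<h3>First title</h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3>Second title</h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
"""


def group_items_by_heading(html):
    """Map each <h3> title to the <li> texts found before the next <h3>."""
    soup = BeautifulSoup(html, "html.parser")
    groups = {}
    for heading in soup.find_all("h3"):
        items = []
        # find_next_siblings() yields the tags after this <h3> in document order.
        for sibling in heading.find_next_siblings():
            if sibling.name == "h3":
                break  # the next title starts a new group
            if sibling.name == "ul":
                items.extend(li.get_text(strip=True) for li in sibling.find_all("li"))
        groups[heading.get_text(strip=True)] = items
    return groups


if __name__ == "__main__":
    for title, items in group_items_by_heading(HTML).items():
        print(title, items)
```

This version does not depend on the position of each h3 among its siblings, at the cost of a Python loop per heading.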

Tags: beautifulsoup, xpath, scrapy, html, python

Source: https://codeday.me/bug/20191119/2039944.html
