HTML heading extraction: extracting all text between two HTML headings with lxml

Latest recommended article published on 2022-09-05 16:47:05

勤劳课代表


Tags: HTML heading extraction

I am trying to parse an HTML page with lxml in Python.

The HTML has the following structure:

Some text with other tags.

More text.

More text[2].

Description.

Description[1].

Description[2].

***

and so on...

***

I need to parse this HTML into the following JSON:

^{pr2}$

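I can read all the h5 tags that hold the titles and write them to JSON with the following code:

    import io
    import json

    # tree is assumed to be the already-parsed lxml document
    array = []
    for title in tree.xpath('//h5/text()'):
        data = {
            "title": title,
            "text": ""  # still empty: the text between titles is not read yet
        }
        array.append(data)

    with io.open('data.json', 'w', encoding='utf8') as outfile:
        str_ = json.dumps(array,
                          indent=4, sort_keys=True,
                          separators=(',', ' : '), ensure_ascii=False)
        # json.dumps already returns str on Python 3, so no conversion is needed
        outfile.write(str_)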

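The problem is that I don't know how to read the text between two h5 titles so that it can go into the "text" field.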
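One possible way to collect the text between consecutive h5 titles is to walk the siblings that follow each heading and stop at the next heading. The sketch below is only an illustration under stated assumptions: the original markup did not survive, so the sample HTML (h5 titles followed by sibling p elements under one parent) and the helper name sections_between_headings are hypothetical. It relies on lxml's itersiblings() and text_content().

```python
import json

from lxml import html

# Hypothetical sample markup: the real page's tags are unknown, so this
# layout (h5 titles followed by sibling <p> elements) is an assumption.
SAMPLE = """
<div>
  <h5>First title</h5>
  <p>Some text with <b>other</b> tags.</p>
  <p>More text.</p>
  <h5>Second title</h5>
  <p>Description.</p>
</div>
"""


def sections_between_headings(tree, tag="h5"):
    """Return one {"title", "text"} dict per heading (hypothetical helper)."""
    sections = []
    for heading in tree.iter(tag):
        parts = []
        # Walk the following siblings until the next heading with the same tag.
        for sibling in heading.itersiblings():
            if not isinstance(sibling.tag, str):
                continue  # skip comments and processing instructions
            if sibling.tag == tag:
                break
            # text_content() flattens nested tags such as <b> into plain text.
            text = sibling.text_content().strip()
            if text:
                parts.append(text)
        sections.append({
            "title": heading.text_content().strip(),
            "text": "\n".join(parts),
        })
    return sections


tree = html.fromstring(SAMPLE)
result = sections_between_headings(tree)
print(json.dumps(result, indent=4, ensure_ascii=False))
```

This version only looks at element siblings, so bare text that sits directly between tags without a wrapping element is ignored, and headings nested at different depths would need a different traversal.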
