Exact HTML element position: getting the HTML element/node/tag at an exact position in Python

Latest recommended article published on 2024-05-08 16:31:25

刘为龙


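You can read the content of the whole HTML document as a string. Then you can insert a marker (an HTML anchor element with a unique id) at the position of interest and parse the modified string with xml.etree.ElementTree, as if the marker had been part of the original document. Next, you can find the marker's parent element with XPath and remove the auxiliary marker. The result has the same structure as if the original document had been parsed, but now you also know which element contains the text at that position.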
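The effect of the marker on the parsed tree can be seen on a tiny inline document. This is only a minimal sketch: the sample string and the marker id "marker" are assumptions made for illustration, and the input has to be well-formed XML for ET.fromstring to accept it.

```python
import xml.etree.ElementTree as ET

content = "<p>Hello world</p>"
pos = content.index("world")      # character offset of interest
marker = '<a id="marker" />'      # hypothetical marker id

# Insert the marker at the offset and parse the modified string.
root = ET.fromstring(content[:pos] + marker + content[pos:])
anchor = root.find('.//a[@id="marker"]')

print(repr(root.text))    # text before the marker stays on the parent
print(repr(anchor.tail))  # text after the marker becomes the marker's tail
```

The text before the marker remains the parent's text, while the text after it becomes the tail of the marker element, so both pieces have to be joined again when the marker is removed.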

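Warning: you must know whether the position is a byte position or an abstract character position. (Consider multi-byte encodings, or encodings in which some characters take a variable number of bytes. Also consider line endings, which may be one or two bytes long.)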
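The difference between the two kinds of position can be sketched as follows. The helper name byte_to_char_offset and the sample text are assumptions made for illustration. Note also that Python's open() in text mode translates "\r\n" to "\n" by default when reading, which shifts character positions again unless the file is opened with newline=''.

```python
def byte_to_char_offset(data: bytes, byte_pos: int, encoding: str = "utf-8") -> int:
    """Convert a byte offset into a character offset in the decoded text."""
    # Decoding the prefix raises UnicodeDecodeError if byte_pos falls
    # in the middle of a multi-byte sequence.
    return len(data[:byte_pos].decode(encoding))


text = "<p>é\r\nx</p>"            # 'é' takes two bytes in UTF-8
data = text.encode("utf-8")

byte_pos = data.index(b"x")       # offset counted in bytes
char_pos = byte_to_char_offset(data, byte_pos)

# The same text after newline translation ("\r\n" -> "\n").
normalized = text.replace("\r\n", "\n")

print(byte_pos, char_pos, normalized.index("x"))
```

The three numbers differ for the same character, so the offset has to be interpreted in the same way the content string was produced.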

Try the following example, in which the sample from your question is stored in doc.html with Windows line endings:

#!python3

import xml.etree.ElementTree as ET

fname = 'doc.html'
pos = 64

with open(fname, encoding='utf-8') as f:
    content = f.read()

# position_id is used in the XPath expressions below. It is kept in
# its own variable only for readability when building position_anchor.
# The position anchor has the form of an HTML element so that it can
# be found easily using an XPath expression.
position_id = 'my_unique_position_{}'.format(pos)
position_anchor = '<a id="{}" />'.format(position_id)

# The modified content has one extra anchor as the position marker.
modified_content = content[:pos] + position_anchor + content[pos:]

# ET.fromstring() requires the modified content to be well-formed XML.
root = ET.fromstring(modified_content)
ET.dump(root)
print(' ')

# Now some examples of getting the information around the point.
# '.' = from here; '//' = anywhere below; 'a[@id=...]' = anchor (a)
# element whose id attribute has the given value.
# This element is not used later; it is shown only for demonstration.
anchor_element = root.find('.//a[@id="{}"]'.format(position_id))
ET.dump(anchor_element)
print(' ')

# The text at the original position became the tail of the
# marker element.
print(repr(anchor_element.tail))
print('================')

# Now, from scratch, get the nearest parent of the position.
parent = root.find('.//a[@id="{}"]/..'.format(position_id))
ET.dump(parent)
print(' ')

# ... and the anchor element (again) as the child of that parent
# with the matching id attribute.
anchor = parent.find('./a[@id="{}"]'.format(position_id))
ET.dump(anchor)
print(' ')

# If the marker split the text, part of the text belongs to
# the parent and part is the tail of the anchor marker.
print(repr(parent.text))
print(repr(anchor.tail))
print(' ')

# Modify the parent to remove the anchor element and get back the
# original structure without the marker. Do not forget that the text
# after the marker became the tail of the marker element.
# Appending the tail to parent.text is correct when the marker is the
# first child of the parent; either piece may be None, hence the guards.
parent.remove(anchor)
parent.text = (parent.text or '') + (anchor.tail or '')
ET.dump(parent)
print(' ')

# The structure of the whole document no longer contains
# the added anchor marker, and you have the reference
# to the nearest parent.
ET.dump(root)
print(' ')

It prints the following:

[output listing not preserved]
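Since the script above depends on a local file, the same steps can also be sketched as a self-contained function run on an inline sample. The function name locate_parent, the default marker id, and the sample document are assumptions made for illustration. Unlike the script, this sketch also handles the case where the marker is not the first child of its parent, by giving the marker's tail back to the previous sibling.

```python
import xml.etree.ElementTree as ET


def locate_parent(content, pos, marker_id="my_unique_position"):
    """Return (root, parent): the parsed tree and the element containing offset pos.

    marker_id is a hypothetical default; it must not clash with an id
    that is already present in the document.
    """
    marker = '<a id="{}" />'.format(marker_id)
    # Raises ET.ParseError if pos falls inside a tag or outside the root element.
    root = ET.fromstring(content[:pos] + marker + content[pos:])
    parent = root.find('.//a[@id="{}"]/..'.format(marker_id))
    anchor = parent.find('./a[@id="{}"]'.format(marker_id))

    # Give the marker's tail back to whatever preceded the marker:
    # the parent's text if the marker is the first child, otherwise
    # the tail of the previous sibling.
    siblings = list(parent)
    index = siblings.index(anchor)
    tail = anchor.tail or ''
    if index == 0:
        parent.text = (parent.text or '') + tail
    else:
        previous = siblings[index - 1]
        previous.tail = (previous.tail or '') + tail
    parent.remove(anchor)
    return root, parent


sample = '<html><body><p>Hello <b>bold</b> world</p></body></html>'
root, parent = locate_parent(sample, sample.index('world') + 2)
print(parent.tag)
print(ET.tostring(root, encoding='unicode') == sample)
```

Letting ET.ParseError propagate is a deliberate choice in this sketch: an offset inside a tag has no containing text node, so the caller should decide how to handle it.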
