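You can read the content of the whole HTML document as a string. Then you can build a modified string with a marker inserted at the position (an HTML anchor element with a unique id) and parse it with xml.etree.ElementTree as if the marker had been part of the original document. Note that ElementTree only parses well-formed XML, so the document has to be XHTML-like. You can then use XPath to find the parent element of the marker and remove the auxiliary marker. The result has the same structure as if the original document had been parsed, but now you know which element contains the text.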
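Warning: you must know whether the position is a byte position or an abstract character position. (Think about multi-byte encodings, or encodings that use variable-length sequences for some characters. Also think about line endings, which may take one or two bytes.)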
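If the position you were given is a byte offset, it has to be converted to a character offset before slicing the decoded string. The following is a minimal sketch of one way to do that; the helper name `byte_to_char_offset` is hypothetical, and it assumes the raw bytes of the file are available and the encoding is known.

```python
def byte_to_char_offset(data: bytes, byte_pos: int, encoding: str = 'utf-8') -> int:
    """Return the character index that corresponds to a byte offset in data.

    Hypothetical helper, not part of any library.
    """
    # Decoding only the prefix counts the characters before the position.
    # A UnicodeDecodeError is raised if byte_pos falls inside a
    # multi-byte sequence.
    return len(data[:byte_pos].decode(encoding))


raw = '<p>é</p>\r\n'.encode('utf-8')
# '<p>' takes 3 bytes and 'é' takes 2 bytes in UTF-8.
print(byte_to_char_offset(raw, 5))
```

Keep in mind that Python's `open()` in text mode translates `'\r\n'` to `'\n'` by default, so a character offset computed from the raw bytes can still be off by one per preceding line ending unless the file is opened with `newline=''`.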
Try the following example with the sample from the question stored in data.html using Windows line endings:
#!python3
import xml.etree.ElementTree as ET
fname = 'data.html'
pos = 64
with open(fname, encoding='utf-8') as f:
    content = f.read()
# The position_id will be used in XPath, the position_anchor
# uses the variable only for readability. The position anchor
# has the form of an HTML element to be found easily using
# the XPath expression.
position_id = 'my_unique_position_{}'.format(pos)
position_anchor = '<a id="{}" />'.format(position_id)
# The modified content has one extra anchor as the position marker.
modified_content = content[:pos] + position_anchor + content[pos:]
root = ET.fromstring(modified_content)
ET.dump(root)
print(' ')
# Now some examples for getting the info around the point.
# '.' = from here; '//' = wherever; 'a[@id=...]' = anchor (a) element
# with the attribute id with the value.
# The result is not used later; it is here only for demonstration.
anchor_element = root.find('.//a[@id="{}"]'.format(position_id))
ET.dump(anchor_element)
print(' ')
# The text that followed the original position became the tail
# of the anchor element.
print(repr(anchor_element.tail))
print('================')
# Now, from scratch, get the nearest parent from the position.
parent = root.find('.//a[@id="{}"]/..'.format(position_id))
ET.dump(parent)
print(' ')
# ... and the anchor element (again) as the nearest child
# with the attributes.
anchor = parent.find('./a[@id="{}"]'.format(position_id))
ET.dump(anchor)
print(' ')
# If the marker split the text, part of the text belongs to
# the parent, part is the tail of the anchor marker.
print(repr(parent.text))
print(repr(anchor.tail))
print(' ')
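# Modify the parent to remove the anchor element (to get
# the original structure without the marker). Do not forget
# that the text became part of the marker element as its tail.
# This assumes the marker is the first child of its parent;
# either text may be None, hence the "or ''".
parent.remove(anchor)
parent.text = (parent.text or '') + (anchor.tail or '')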
ET.dump(parent)
print(' ')
# The structure of the whole document now does not contain
# the added anchor marker, and you get the reference
# to the nearest parent.
ET.dump(root)
print(' ')
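When run, it prints the parsed tree with the marker, the marker element and its tail, the parent element with the marker, the two text fragments around the marker, and finally the parent and the whole document with the marker removed.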
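The marker removal in the script above appends the marker's tail to the parent's text, which is only right when the marker is the first child of its parent. As a rough sketch, the whole procedure could be wrapped in a function that also handles a marker placed after a sibling element. The function name `element_at` and the default marker id are hypothetical, and the sketch assumes that the input is well-formed XML, that `pos` is a character offset inside text content, and that the marker id does not already occur in the document.

```python
import xml.etree.ElementTree as ET


def element_at(content, pos, marker_id='my_unique_position_marker'):
    """Return (root, parent) for the character position pos.

    root is the parsed tree with the marker removed again; parent is
    the element whose text contains the position. Hypothetical helper.
    """
    marker = '<a id="{}" />'.format(marker_id)
    root = ET.fromstring(content[:pos] + marker + content[pos:])

    parent = root.find('.//a[@id="{}"]/..'.format(marker_id))
    if parent is None:
        raise ValueError('marker not found below the root element')
    anchor = parent.find('./a[@id="{}"]'.format(marker_id))

    # Give the marker's tail back to whatever preceded the marker:
    # the parent's text, or the tail of the previous sibling.
    children = list(parent)
    index = children.index(anchor)
    tail = anchor.tail or ''
    if index == 0:
        parent.text = (parent.text or '') + tail
    else:
        previous = children[index - 1]
        previous.tail = (previous.tail or '') + tail
    parent.remove(anchor)
    return root, parent


doc = '<html><body><p>Hello <b>bold</b> world</p></body></html>'
root, parent = element_at(doc, doc.index('world'))
print(parent.tag)
print(ET.tostring(root, encoding='unicode'))
```

A position inside a tag or outside the root element makes the modified string ill-formed, so `ET.fromstring` raises `ET.ParseError` in those cases.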