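I'm currently trying to extract HTML elements that directly contain text of their own, and wrap that text in a special marker.
For example, my HTML looks like this: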
This text still has children
Simple Text
Hello
World
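I'm trying to wrap the markers around the text only, so that I can parse it further later on. In other words, I'm trying to make it look like this: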
^{pr2}$
My script gets part of the way there, but I can't pin down where it goes wrong:

from bs4 import BeautifulSoup as bs  # assumed: bs is the BeautifulSoup class

# extractTags and extractText are helper functions defined elsewhere (not shown)
def parseSection(node):
    b = str(node)
    changes = set()
    tag_start, tag_end = extractTags(b)
    # index 0 is the element itself
    for cell in node.findChildren()[1:]:
        if cell.findChildren():
            cell = parseSection(cell)
        else:
            # safe to extract with regular expressions, only 1 standardized tag created by BeautifulSoup
            subtag_start, subtag_end = extractTags(str(cell))
            changes.add((
                str(cell),
                "[/EditableText]{0}[EditableText]{1}[/EditableText]{2}[EditableText]".format(
                    subtag_start, str(cell.text), subtag_end),
            ))
    text = extractText(b)
    for change in changes:
        text = text.replace(change[0], change[1])
    return bs("{0}[EditableText]{1}[/EditableText]{2}".format(tag_start, text, tag_end), "html.parser")
The script produces the following:
[EditableText]
This text still has children
[/EditableText]
[EditableText]
Simple Text
[/EditableText]
[EditableText]
Hello [/EditableText]
[EditableText][/EditableText]
[EditableText]
World
[/EditableText]
How can I track this down and fix it? I'd appreciate any answer.
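
One direction that might avoid the string replacement entirely is to work on BeautifulSoup's text nodes directly, so that empty `[EditableText][/EditableText]` pairs never get generated. The following is only a minimal sketch under assumptions: `wrap_text_nodes` is a hypothetical helper name, the sample markup is made up (the real tag names are not shown above), and surrounding whitespace in each text node is stripped, which may or may not be acceptable.

```python
from bs4 import BeautifulSoup


def wrap_text_nodes(html, marker="EditableText"):
    """Wrap every non-blank text node in [marker]...[/marker] (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Copy the result into a list first, because replace_with mutates the tree.
    for text_node in list(soup.find_all(string=True)):
        stripped = text_node.strip()
        if not stripped:
            continue  # skip whitespace-only nodes between tags
        text_node.replace_with("[{0}]{1}[/{0}]".format(marker, stripped))
    return str(soup)


# Made-up sample markup; the tag names are assumptions, not the original HTML.
sample = "<div>This text still has children<p>Simple Text</p><span>Hello</span></div>"
print(wrap_text_nodes(sample))
```

Because each text node is replaced in place, the tags themselves are never re-serialized or matched with regular expressions, and nested elements need no explicit recursion.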