lxml html all text: getting all the text inside a tag in lxml

Latest recommended article published on 2023-11-20 17:43:25

知飞翀

Article tags: lxml html all text

Try:

    def stringify_children(node):
        from lxml.etree import tostring
        from itertools import chain
        # Note: on Python 3, tostring() returns bytes unless encoding='unicode' is passed.
        parts = ([node.text] +
                 list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
                 [node.tail])
        # filter removes possible Nones in texts and tails
        return ''.join(filter(None, parts))

Example:

    from lxml import etree
    node = etree.fromstring("""<content>
    Text outside tag <div>Text inside tag</div>
    </content>""")
    stringify_children(node)

This is meant to produce:

    '\nText outside tag <div>Text inside tag</div>\n'

Does text_content() do what you need?

Simply use the node.itertext() method, like this:

    "".join([x for x in node.itertext()])

A version of albertov's stringify_children that fixes the bug reported by hoju:

    def stringify_children(node):
        from lxml.etree import tostring
        from itertools import chain
        # with_tail=False keeps each child's tail out of its serialization,
        # so the tail is added exactly once below.
        parts = ([node.text] +
                 list(chain(*([tostring(c, with_tail=False), c.tail] for c in node.getchildren()))) +
                 [node.tail])
        # filter removes possible Nones in texts and tails
        return ''.join(filter(None, parts))

The following code uses a Python generator, works well, and is very efficient:

    ''.join(node.itertext()).strip()
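As a rough illustration of what itertext() yields, the sketch below runs it over a small assumed document (the markup is made up for this example). It walks the element and all its descendants in document order, yielding text and tail strings.

```python
from lxml import etree

# Assumed sample document, used only for this illustration.
node = etree.fromstring(
    "<content>\nText outside tag <div>Text <em>inside</em> tag</div>\n</content>"
)

# itertext() yields every text and tail string below node, in document order.
pieces = list(node.itertext())
text = "".join(pieces).strip()

print(pieces)
print(text)
```

Note that this returns text only; the tags themselves are dropped, which is the main difference from the tostring()-based approaches.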

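    from urllib.request import urlopen  # urllib2 on Python 2
    from lxml import etree

    url = 'some_url'

    # fetch the URL
    test = urlopen(url)
    page = test.read()

    # parse the full HTML, including the table tags
    tree = etree.HTML(page)

    # XPath selector; xpath() returns a list, so take the first match
    table = tree.xpath("xpath_here")[0]
    res = etree.tostring(table)

res is the table's HTML code; this did the job for me.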

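So you can extract a tag's text content with xpath_text(), and tags together with their content with tostring():

    div = tree.xpath("//div")[0]
    div_res = etree.tostring(div)

    # xpath_text() is not part of lxml's own API; the text() form below is standard XPath.
    text = tree.xpath_text("//content")

or

    text = tree.xpath("//content/text()")

    div_3 = tree.xpath("//content")[0]
    div_3_res = etree.tostring(div_3, encoding='unicode').strip('<content>').rstrip('</')

The last line, which relies on strip, is not pretty, but it works. It is fragile because strip removes any of the listed characters rather than a literal prefix or suffix, so content that begins or ends with one of those characters is trimmed as well.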
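For comparison, here is a small sketch of two XPath forms that lxml supports directly: text() returns only the text nodes that are direct children, while the XPath string() function concatenates all descendant text. The sample document is an assumption made up for this example.

```python
from lxml import etree

# Assumed sample document, used only for this illustration.
tree = etree.fromstring("<root><content>Hello <b>big</b> world</content></root>")

# Direct text nodes of <content> only: the text inside <b> is skipped.
direct = tree.xpath("//content/text()")

# string() flattens the first match, including text nested in child tags.
all_text = tree.xpath("string(//content)")

print(direct)
print(all_text)
```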

In response to @Richard's comment: if you patch stringify_children as follows:

      parts = ([node.text] +
    --         list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
    ++         list(chain(*([tostring(c)] for c in node.getchildren()))) +
               [node.tail])

it seems to avoid the duplication he refers to.

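It may be less complicated to define stringify_children this way:

    from lxml import etree

    def stringify_children(node):
        s = node.text
        if s is None:
            s = ''
        for child in node:
            # tostring() includes the child's tail by default
            s += etree.tostring(child, encoding='unicode')
        return s

Or in one line:

    def stringify_children(node):
        return (node.text if node.text is not None else '') + ''.join(
            etree.tostring(child, encoding='unicode') for child in node)

The rationale is the same as in the answer above: leave the serialization of child nodes to lxml. The tail of node itself is not relevant in this case, since it comes "after" the end tag. Note that the encoding argument can be changed as needed.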

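Another possible solution is to serialize the node itself and then strip the start and end tags:

    def stringify_children(node):
        s = etree.tostring(node, encoding='unicode', with_tail=False)
        # skip past "<tag>" at the start and cut "</tag>" at the end
        return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

This is somewhat horrible. The code is correct only if node has no attributes, and I don't think anyone would want to use it.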
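The attribute caveat is easy to see in a small experiment. The sketch below uses the same slicing under a hypothetical name, strip_outer_tags, and two assumed sample elements.

```python
from lxml import etree


def strip_outer_tags(node):
    # Hypothetical name; same slicing logic as the approach described above.
    s = etree.tostring(node, encoding="unicode", with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]


# Assumed sample elements, used only for this illustration.
plain = etree.fromstring("<div>Hello <b>you</b></div>")
with_attr = etree.fromstring('<div class="x">Hello</div>')

plain_result = strip_outer_tags(plain)     # works: no attributes on the tag
attr_result = strip_outer_tags(with_attr)  # attribute text leaks into the result

print(repr(plain_result))
print(repr(attr_result))
```

With an attribute present, the slice starts inside the opening tag, so part of the tag ends up in the returned string.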

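I know this is an old question, but it is a common problem, and I have a solution that seems simpler than the ones suggested so far:

    import lxml.html

    def stringify_children(node):
        """Given an lxml tag, return its contents as a string.

        >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
        >>> node = lxml.html.fragment_fromstring(html)
        >>> stringify_children(node)
        '<strong>Sample sentence</strong> with tags.'
        """
        if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
            return ""
        node.attrib.clear()  # drop attributes so the opening tag is exactly <tag>
        opening_tag = len(node.tag) + 2     # length of "<tag>"
        closing_tag = -(len(node.tag) + 3)  # length of "</tag>", counted from the end
        return lxml.html.tostring(node, encoding='unicode')[opening_tag:closing_tag]

Unlike some of the other answers to this question, this solution preserves all the tags contained within the node, and it attacks the problem from a different angle than the other working solutions.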

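    etree.tostring(html, method="text")

Here html is the node/tag whose full text you are trying to read. Note, however, that this does not get rid of the contents of script and style tags.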
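The script caveat can be checked with a short sketch. The sample markup is an assumption, and removing the script element with etree.strip_elements() is one possible workaround, not something the answer above proposes.

```python
from lxml import etree, html

# Assumed sample fragment, used only for this illustration.
doc = html.fromstring(
    "<div><p>Hello <b>world</b></p><script>var x = 1;</script></div>"
)

# method="text" keeps every text node, including the script body.
with_script = etree.tostring(doc, method="text", encoding="unicode")

# One possible workaround: drop the script elements first (this mutates doc).
etree.strip_elements(doc, "script", with_tail=False)
without_script = etree.tostring(doc, method="text", encoding="unicode")

print(repr(with_script))
print(repr(without_script))
```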

Here is a working solution. We can get the content together with the parent tag, then cut the parent tag off the output.

    import re
    from lxml import etree

    def _tostr_with_tags(parent_element, html_entities=False):
        RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
        # tostring() emits ASCII bytes by default, with non-ASCII characters as &#NNN;
        content_with_parent = etree.tostring(parent_element).decode('ascii')

        def _replace_html_entities(s):
            RE_ENTITY = r'&#(\d+);'

            def repl(m):
                return chr(int(m.group(1)))  # unichr on Python 2

            replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)
            return replaced

        if not html_entities:
            content_with_parent = _replace_html_entities(content_with_parent)
        content_with_parent = content_with_parent.strip()  # remove 'white' characters on margins

        start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]

        if start_tag != end_tag:
            raise Exception('Start tag does not match to end tag while getting content with tags.')

        return content_without_parent

parent_element must be of type Element.

Note: set the html_entities parameter to False if you want the text content (with numeric entities decoded) rather than the HTML-escaped text.
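The numeric entities handled above come from the default ASCII output of tostring(). As a hedged alternative, which is not part of the answer above, asking lxml for a unicode string avoids the entity replacement step. The sample element is an assumption.

```python
from lxml import etree

# Assumed sample element containing one non-ASCII character.
node = etree.fromstring("<p>caf\u00e9 <b>ok</b></p>")

# Default output: ASCII bytes, with non-ASCII characters as numeric references.
escaped = etree.tostring(node).decode("ascii")

# encoding="unicode" returns a str with the characters left as they are.
plain = etree.tostring(node, encoding="unicode")

print(escaped)
print(plain)
```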

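lxml has a method for this (available on lxml.html elements):

    node.text_content()

If this is an a tag, you can try the following, which returns the element's attribute values:

    node.values()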
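A minimal sketch of both calls on an assumed a element (the markup is made up for this example): text_content() returns the text of the element and its descendants, and values() returns the attribute values in document order.

```python
import lxml.html

# Assumed sample link, used only for this illustration.
node = lxml.html.fragment_fromstring(
    '<a href="/docs" class="nav">Read the <em>docs</em></a>'
)

text = node.text_content()   # text of the element and all its descendants
attr_values = node.values()  # attribute values, e.g. the href

print(text)
print(attr_values)
```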

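    import re
    from lxml import etree

    node = etree.fromstring("""<content>
    Text before inner tag
    <div>Text inside tag</div>
    Text after inner tag
    </content>""")

    # match the outermost opening and closing tags and keep what lies between them
    print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node, encoding='unicode'), re.DOTALL).group(1))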
