Simple uses of etree in Python lxml, part 2

Author: 锅炉房刘大爷

Category: python2.7. Tags: lxml.etree

Article link: https://blog.csdn.net/u012067766/article/details/79904797

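The first article in this series showed that etree.HTML() and etree.tostring() make it easy to convert between string objects and _Element objects. This post covers how to use etree to filter HTML source.

Suppose we have the following HTML: "<h1>12345</h1><script>js</script><h2>67890</h2>"

If we want to strip every <script> tag from this HTML, the usual approach is a regular expression. Here is how to do the same filtering with etree.

1. Convert the HTML source into an _Element object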

data = etree.HTML(html)

2. Find the script tags with the XPath expression "//script"

data.xpath('//script')

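3. Delete all the script tags

Note that a script tag cannot delete itself from the tree. What we do know is that every item in the list returned by data.xpath('//script') is an _Element object.

For an _Element object, we can call getparent() to get its parent, and then remove the element through the parent.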
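The three steps above can be tried out one at a time. The following is a minimal sketch, written against Python 3 and assuming lxml is installed; the variable names (scripts, first, parent, remaining) are only for illustration.

```python
from lxml import etree

html = '<h1>12345</h1><script>js</script><h2>67890</h2>'

# Step 1: parse the string into an _Element tree
data = etree.HTML(html)

# Step 2: xpath() returns a plain list of _Element objects
scripts = data.xpath('//script')
first = scripts[0]
print(len(scripts), first.tag, isinstance(first, etree._Element))

# Step 3: the element is removed through its parent
parent = first.getparent()
parent.remove(first)

# Running the same query again now finds nothing
remaining = data.xpath('//script')
print(len(remaining))
```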

Wrapped up as a function, the idea above looks like this:

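def clean(html, xpath_expr):
    # Parse the HTML string into an _Element tree
    data = etree.HTML(html)
    # Collect every element matched by the XPath expression
    matches = data.xpath(xpath_expr)
    for item in matches:
        # An element is removed through its parent
        item.getparent().remove(item)
    # Serialize the tree back into an HTML string
    return etree.tostring(data, method='html')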

In the previous article we passed "text" as the method argument of tostring. Here we pass "html", which serializes the _Element object as an HTML document string.
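To make the difference between the two method values concrete, here is a small sketch that serializes the same tree both ways. It assumes Python 3 and passes encoding='unicode' so that tostring() returns str instead of bytes; that argument is an addition of this sketch and is not used in the code of this article.

```python
from lxml import etree

data = etree.HTML('<h1>12345</h1><h2>67890</h2>')

# method='html' keeps the markup
as_html = etree.tostring(data, method='html', encoding='unicode')

# method='text' keeps only the text content of the tree
as_text = etree.tostring(data, method='text', encoding='unicode')

print(as_html)
print(as_text)
```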

The complete code:

# encoding=utf8

from lxml import etree

def clean(html, xpath_expr):
    # Remove every element matched by xpath_expr and return the HTML
    data = etree.HTML(html)
    matches = data.xpath(xpath_expr)
    for item in matches:
        item.getparent().remove(item)
    return etree.tostring(data, method='html')

html = '''<h1>12345</h1><script>js</script><h2>67890</h2>'''
_html = clean(html, '//script')
# On Python 3, tostring() returns bytes here, so this prints a b'...' literal
print(_html)

Output:

<html><body><h1>12345</h1><h2>67890</h2></body></html>

The output shows that the script tag was successfully removed from the HTML.
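One caveat with this approach: in lxml, the text that directly follows an element (its tail) belongs to that element, so remove() drops it as well. The sample HTML here has no such text, so the result is unaffected. The sketch below shows the effect and two possible ways around it. remove_keep_tail is a hypothetical helper written for this illustration, while etree.strip_elements is lxml's own function. The code assumes Python 3.

```python
from lxml import etree


def remove_keep_tail(element):
    """Hypothetical helper: remove element but keep the text after it."""
    parent = element.getparent()
    if parent is None:
        return  # the root element has no parent to detach from
    if element.tail:
        previous = element.getprevious()
        if previous is not None:
            # Append the tail to the preceding sibling's tail
            previous.tail = (previous.tail or '') + element.tail
        else:
            # No preceding sibling: append the tail to the parent's text
            parent.text = (parent.text or '') + element.tail
    parent.remove(element)


html = '<p>before<script>js</script>after</p>'

# Plain remove(): the tail text "after" goes away with the element
plain = etree.HTML(html)
for node in plain.xpath('//script'):
    node.getparent().remove(node)
plain_result = etree.tostring(plain, method='html', encoding='unicode')

# The helper moves the tail text before removing the element
kept = etree.HTML(html)
for node in kept.xpath('//script'):
    remove_keep_tail(node)
kept_result = etree.tostring(kept, method='html', encoding='unicode')

# strip_elements() selects by tag name; with_tail=False keeps the tail text
stripped = etree.HTML(html)
etree.strip_elements(stripped, 'script', with_tail=False)
stripped_result = etree.tostring(stripped, method='html', encoding='unicode')

print(plain_result)
print(kept_result)
print(stripped_result)
```

The helper works with any XPath result, whereas strip_elements only selects by tag name.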

Besides the approach above, there is a more convenient way to remove script tags: the Cleaner class in lxml.html.clean. Here is the code:

# encoding=utf8

# In lxml 5.2 and later this module is shipped separately as lxml_html_clean
from lxml.html.clean import Cleaner

cleaner = Cleaner()
html = '''<h1>12345</h1><script>js</script><h2>67890</h2>'''
print(cleaner.clean_html(html))

Output:

<div><h1>12345</h1><h2>67890</h2></div>

The full help text for Cleaner:

class Cleaner(__builtin__.object)
 |  Instances cleans the document of each of the possible offending
 |  elements.  The cleaning is controlled by attributes; you can
 |  override attributes in a subclass, or set them in the constructor.
 |  
 |  ``scripts``:
 |      Removes any ``<script>`` tags.
 |  
 |  ``javascript``:
 |      Removes any Javascript, like an ``onclick`` attribute. Also removes stylesheets
 |      as they could contain Javascript.
 |  
 |  ``comments``:
 |      Removes any comments.
 |  
 |  ``style``:
 |      Removes any style tags.
 |  
 |  ``inline_style``
 |      Removes any style attributes.  Defaults to the value of the ``style`` option.
 |  
 |  ``links``:
 |      Removes any ``<link>`` tags
 |  
 |  ``meta``:
 |      Removes any ``<meta>`` tags
 |  
 |  ``page_structure``:
 |      Structural parts of a page: ``<head>``, ``<html>``, ``<title>``.
 |  
 |  ``processing_instructions``:
 |      Removes any processing instructions.
 |  
 |  ``embedded``:
 |      Removes any embedded objects (flash, iframes)
 |  
 |  ``frames``:
 |      Removes any frame-related tags
 |  
 |  ``forms``:
 |      Removes any form tags
 |  
 |  ``annoying_tags``:
 |      Tags that aren't *wrong*, but are annoying.  ``<blink>`` and ``<marquee>``
 |  
 |  ``remove_tags``:
 |      A list of tags to remove.  Only the tags will be removed,
 |      their content will get pulled up into the parent tag.
 |  
 |  ``kill_tags``:
 |      A list of tags to kill.  Killing also removes the tag's content,
 |      i.e. the whole subtree, not just the tag itself.
 |  
 |  ``allow_tags``:
 |      A list of tags to include (default include all).
 |  
 |  ``remove_unknown_tags``:
 |      Remove any tags that aren't standard parts of HTML.
 |  
 |  ``safe_attrs_only``:
 |      If true, only include 'safe' attributes (specifically the list
 |      from the feedparser HTML sanitisation web site).
 |  
 |  ``safe_attrs``:
 |      A set of attribute names to override the default list of attributes
 |      considered 'safe' (when safe_attrs_only=True).
 |  
 |  ``add_nofollow``:
 |      If true, then any <a> tags will have ``rel="nofollow"`` added to them.
 |  
 |  ``host_whitelist``:
 |      A list or set of hosts that you can use for embedded content
 |      (for content like ``<object>``, ``<link rel="stylesheet">``, etc).
 |      You can also implement/override the method
 |      ``allow_embedded_url(el, url)`` or ``allow_element(el)`` to
 |      implement more complex rules for what can be embedded.
 |      Anything that passes this test will be shown, regardless of
 |      the value of (for instance) ``embedded``.
 |  
 |      Note that this parameter might not work as intended if you do not
 |      make the links absolute before doing the cleaning.
 |  
 |      Note that you may also need to set ``whitelist_tags``.
 |  
 |  ``whitelist_tags``:
 |      A set of tags that can be included with ``host_whitelist``.
 |      The default is ``iframe`` and ``embed``; you may wish to
 |      include other tags like ``script``, or you may want to
 |      implement ``allow_embedded_url`` for more control.  Set to None to
 |      include all tags.
 |  
 |  This modifies the document *in place*.
 |  
 |  Methods defined here:
 |  
 |  __call__(...)
 |      Cleans the document.
 |  
 |  __init__(...)
 |  
 |  allow_element(...)
 |  
 |  allow_embedded_url(...)
 |  
 |  allow_follow(...)
 |      Override to suppress rel="nofollow" on some anchors.
 |  
 |  clean_html(...)
 |  
 |  kill_conditional_comments(...)
 |      IE conditional comments basically embed HTML that the parser
 |      doesn't normally see.  We can't allow anything like that, so
 |      we'll kill any comments that could be conditional.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __qualname__ = 'Cleaner'
 |  
 |  add_nofollow = False
 |  
 |  allow_tags = None
 |  
 |  annoying_tags = True
 |  
 |  comments = True
 |  
 |  embedded = True
 |  
 |  forms = True
 |  
 |  frames = True
 |  
 |  host_whitelist = ()
 |  
 |  inline_style = None
 |  
 |  javascript = True
 |  
 |  kill_tags = None
 |  
 |  links = True
 |  
 |  meta = True
 |  
 |  page_structure = True
 |  
 |  processing_instructions = True
 |  
 |  remove_tags = None
 |  
 |  remove_unknown_tags = True
 |  
 |  safe_attrs = frozenset(['abbr', 'accept', 'accept-charset', 'accesskey...
 |  
 |  safe_attrs_only = True
 |  
 |  scripts = True
 |  
 |  style = False
 |  
 |  whitelist_tags = set(['embed', 'iframe'])
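As the help text says, these attributes can be set in the constructor. The following sketch contrasts kill_tags and remove_tags on the sample HTML from this article. It assumes Python 3, and the try/except import is an addition of this sketch, there only because newer lxml releases ship the cleaner as the separate lxml_html_clean package.

```python
try:
    # lxml 5.2 and later: the cleaner lives in its own package
    from lxml_html_clean import Cleaner
except ImportError:
    # older lxml releases
    from lxml.html.clean import Cleaner

html = '<h1>12345</h1><script>js</script><h2>67890</h2>'

# kill_tags drops the element together with everything inside it
killed = Cleaner(kill_tags=['h2']).clean_html(html)

# remove_tags drops only the tag; its content is pulled up into the parent
unwrapped = Cleaner(remove_tags=['h1']).clean_html(html)

print(killed)
print(unwrapped)
```

In both cases the script tag is removed as well, because scripts defaults to True.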

Please forgive my laziness; it really is getting late...