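From the previous post, "A simple application of etree in Python lxml (part 1)", we know that etree.HTML() and etree.tostring() make it easy to convert between string objects and _Element objects. Next, let's look at how to use etree to filter HTML source.
Suppose we have the following HTML: "<h1>12345</h1><script>js</script><h2>67890</h2>"
How can we remove every <script> tag from the HTML above? A common answer is to filter with regular expressions; here is the idea for doing it with etree instead.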
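1. Convert the HTML source into an _Element object
data = etree.HTML(html)
2. Find the script tags with the XPath expression "//script"
data.xpath('//script')
3. Remove all the script tags
Note that we cannot delete a script tag directly. We do know, however, that every item in the list returned by data.xpath('//script') is an _Element object.
For an _Element object, we can get its parent with getparent() and then remove the element through that parent.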
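The three steps above can be tried out interactively before wrapping them in a function. The following is only a minimal sketch; the variable names scripts, script, parent and remaining are chosen here for illustration and do not come from the original post.

```python
from lxml import etree

html = '<h1>12345</h1><script>js</script><h2>67890</h2>'
data = etree.HTML(html)

# xpath() returns a list of the matching _Element objects
scripts = data.xpath('//script')
script = scripts[0]

# etree.HTML() wraps the fragment in <html><body>, so the parent here is <body>
parent = script.getparent()
print(len(scripts), script.tag, parent.tag)

# Removal always goes through the parent element
parent.remove(script)
remaining = [child.tag for child in parent]
print(remaining)
```

After the removal, iterating over the parent only yields the h1 and h2 elements.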
Wrapping the idea above into a function gives:
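def clean(html, filter):
    # Parse the HTML string into an _Element tree
    data = etree.HTML(html)
    # Collect every element matched by the XPath expression
    trashs = data.xpath(filter)
    for item in trashs:
        # An element can only be removed through its parent
        item.getparent().remove(item)
    return etree.tostring(data, method='html')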
In the previous post we passed "text" as the method argument of tostring; here we pass "html", which serializes the _Element object into an HTML-document-style string.
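To make the difference between the two method values concrete, here is a small sketch that serializes the same tree both ways. The encoding='unicode' argument is an addition for this example (it makes tostring return a str rather than bytes on Python 3); the original post does not use it.

```python
from lxml import etree

data = etree.HTML('<h1>12345</h1><h2>67890</h2>')

# method='html' keeps the markup
as_html = etree.tostring(data, method='html', encoding='unicode')
# method='text' keeps only the text content of the tree
as_text = etree.tostring(data, method='text', encoding='unicode')

print(as_html)
print(as_text)
```

With method='html' the tags survive, while method='text' returns only the concatenated text of the two headings.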
The complete code is as follows:
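# encoding=utf8
from lxml import etree


def clean(html, filter):
    # Parse the HTML, remove every element matched by the XPath
    # expression, then serialize the tree back to HTML
    data = etree.HTML(html)
    trashs = data.xpath(filter)
    for item in trashs:
        item.getparent().remove(item)
    return etree.tostring(data, method='html')


html = '''<h1>12345</h1><script>js</script><h2>67890</h2>'''
_html = clean(html, '//script')
# tostring() returns bytes on Python 3, so decode before printing
print(_html.decode('utf-8'))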
Output:
<html><body><h1>12345</h1><h2>67890</h2></body></html>
As the result shows, we have successfully removed the script tag from the HTML.
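One caveat with this approach: lxml stores the text that follows an element's closing tag as that element's tail, and remove() takes the tail away together with the element. The sample HTML above has no such text, so it is unaffected. The following is a minimal sketch of a tail-preserving variant; the helper name remove_keep_tail and the sample string are hypothetical and not part of the original post.

```python
from lxml import etree


def remove_keep_tail(element):
    """Remove element from its parent but keep the text that follows it."""
    parent = element.getparent()
    if parent is None:
        return
    tail = element.tail
    if tail:
        previous = element.getprevious()
        if previous is not None:
            # Append the tail to the preceding sibling's tail
            previous.tail = (previous.tail or '') + tail
        else:
            # No preceding sibling: the tail becomes part of the parent's text
            parent.text = (parent.text or '') + tail
    parent.remove(element)


sample = '<p>before<script>js</script>after</p>'

# Plain remove(): the text "after" is lost along with the script element
naive = etree.HTML(sample)
for item in naive.xpath('//script'):
    item.getparent().remove(item)
naive_result = etree.tostring(naive, method='html', encoding='unicode')

# Tail-preserving removal keeps "after"
kept = etree.HTML(sample)
for item in kept.xpath('//script'):
    remove_keep_tail(item)
kept_result = etree.tostring(kept, method='html', encoding='unicode')

print(naive_result)
print(kept_result)
```

Whether the trailing text should survive depends on the use case, so treat this as one possible design rather than a required fix.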
Besides the approach above, there is a more convenient way to remove script tags: the Cleaner class from lxml.html.clean. Here is the code:
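# encoding=utf8
# Note: from lxml 5.2 on, this module is provided by the separate
# lxml_html_clean package, which may need to be installed
from lxml.html.clean import Cleaner

cleaner = Cleaner()
html = '''<h1>12345</h1><script>js</script><h2>67890</h2>'''
print(cleaner.clean_html(html))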
Output:
<div><h1>12345</h1><h2>67890</h2></div>
The detailed documentation for Cleaner is as follows:
class Cleaner(__builtin__.object)
| Instances cleans the document of each of the possible offending
| elements. The cleaning is controlled by attributes; you can
| override attributes in a subclass, or set them in the constructor.
|
| ``scripts``:
| Removes any ``<script>`` tags.
|
| ``javascript``:
| Removes any Javascript, like an ``onclick`` attribute. Also removes stylesheets
| as they could contain Javascript.
|
| ``comments``:
| Removes any comments.
|
| ``style``:
| Removes any style tags.
|
| ``inline_style``
| Removes any style attributes. Defaults to the value of the ``style`` option.
|
| ``links``:
| Removes any ``<link>`` tags
|
| ``meta``:
| Removes any ``<meta>`` tags
|
| ``page_structure``:
| Structural parts of a page: ``<head>``, ``<html>``, ``<title>``.
|
| ``processing_instructions``:
| Removes any processing instructions.
|
| ``embedded``:
| Removes any embedded objects (flash, iframes)
|
| ``frames``:
| Removes any frame-related tags
|
| ``forms``:
| Removes any form tags
|
| ``annoying_tags``:
| Tags that aren't *wrong*, but are annoying. ``<blink>`` and ``<marquee>``
|
| ``remove_tags``:
| A list of tags to remove. Only the tags will be removed,
| their content will get pulled up into the parent tag.
|
| ``kill_tags``:
| A list of tags to kill. Killing also removes the tag's content,
| i.e. the whole subtree, not just the tag itself.
|
| ``allow_tags``:
| A list of tags to include (default include all).
|
| ``remove_unknown_tags``:
| Remove any tags that aren't standard parts of HTML.
|
| ``safe_attrs_only``:
| If true, only include 'safe' attributes (specifically the list
| from the feedparser HTML sanitisation web site).
|
| ``safe_attrs``:
| A set of attribute names to override the default list of attributes
| considered 'safe' (when safe_attrs_only=True).
|
| ``add_nofollow``:
| If true, then any <a> tags will have ``rel="nofollow"`` added to them.
|
| ``host_whitelist``:
| A list or set of hosts that you can use for embedded content
| (for content like ``<object>``, ``<link rel="stylesheet">``, etc).
| You can also implement/override the method
| ``allow_embedded_url(el, url)`` or ``allow_element(el)`` to
| implement more complex rules for what can be embedded.
| Anything that passes this test will be shown, regardless of
| the value of (for instance) ``embedded``.
|
| Note that this parameter might not work as intended if you do not
| make the links absolute before doing the cleaning.
|
| Note that you may also need to set ``whitelist_tags``.
|
| ``whitelist_tags``:
| A set of tags that can be included with ``host_whitelist``.
| The default is ``iframe`` and ``embed``; you may wish to
| include other tags like ``script``, or you may want to
| implement ``allow_embedded_url`` for more control. Set to None to
| include all tags.
|
| This modifies the document *in place*.
|
| Methods defined here:
|
| __call__(...)
| Cleans the document.
|
| __init__(...)
|
| allow_element(...)
|
| allow_embedded_url(...)
|
| allow_follow(...)
| Override to suppress rel="nofollow" on some anchors.
|
| clean_html(...)
|
| kill_conditional_comments(...)
| IE conditional comments basically embed HTML that the parser
| doesn't normally see. We can't allow anything like that, so
| we'll kill any comments that could be conditional.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __qualname__ = 'Cleaner'
|
| add_nofollow = False
|
| allow_tags = None
|
| annoying_tags = True
|
| comments = True
|
| embedded = True
|
| forms = True
|
| frames = True
|
| host_whitelist = ()
|
| inline_style = None
|
| javascript = True
|
| kill_tags = None
|
| links = True
|
| meta = True
|
| page_structure = True
|
| processing_instructions = True
|
| remove_tags = None
|
| remove_unknown_tags = True
|
| safe_attrs = frozenset(['abbr', 'accept', 'accept-charset', 'accesskey...
|
| safe_attrs_only = True
|
| scripts = True
|
| style = False
|
| whitelist_tags = set(['embed', 'iframe'])
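The help text above distinguishes remove_tags (drop the tag, keep its content) from kill_tags (drop the tag and its whole subtree). The sketch below illustrates that difference under a few assumptions: the sample string and the choice of the b and i tags are made up for this example, and the try/except import is a guess at the installed setup, since recent lxml releases ship the cleaner in the separate lxml_html_clean package.

```python
try:
    # Recent lxml releases: the cleaner lives in its own package
    from lxml_html_clean import Cleaner
except ImportError:
    # Older lxml releases bundle it
    from lxml.html.clean import Cleaner

sample = '<p>keep <b>bold</b> and <i>italic</i> text</p>'

# remove_tags: only the <b> tag goes away, its text is pulled up into <p>
# kill_tags: the <i> element is dropped together with its content
cleaner = Cleaner(remove_tags=['b'], kill_tags=['i'])
cleaned = cleaner.clean_html(sample)
print(cleaned)
```

The word "bold" should remain in the output without its tag, while "italic" disappears entirely.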
Please forgive my laziness here; it really is getting rather late...