A Simple Application of etree in Python lxml, Part 2

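As shown in "A Simple Application of etree in Python lxml, Part 1", etree.HTML() and etree.tostring() make it easy to convert between string objects and _Element objects. This post covers how to use etree to filter HTML source.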

Suppose we have the following HTML: "<h1>12345</h1><script>js</script><h2>67890</h2>"

Suppose we want to remove every <script> tag from the HTML above. This is commonly done with regular expressions; here is how to do the same filtering with etree.

1. Convert the HTML source into an _Element object

data = etree.HTML(html)

2. Find the script tags with the XPath expression "//script"

data.xpath('//script')

3. Remove all the script tags

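Note that a script tag cannot be deleted directly. However, every item in the list returned by data.xpath('//script') is an _Element object.

For an _Element object, we can get its parent with getparent() and then remove the element through that parent.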
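
To make step 3 concrete, here is a small sketch that inspects what the XPath query returns and what getparent() gives back for each match. The variable names scripts, el and parent are only illustrative and do not come from the original post.

```python
from lxml import etree

html = '<h1>12345</h1><script>js</script><h2>67890</h2>'
data = etree.HTML(html)

# xpath() returns a plain Python list of matching elements
scripts = data.xpath('//script')

for el in scripts:
    # etree.HTML() wraps the fragment in <html><body>, so the parent here is <body>
    parent = el.getparent()
    print(type(el).__name__, el.tag, el.text, parent.tag)
```

Because each match knows its parent, calling parent.remove(el) is enough to detach it from the tree.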

 

Wrapping the steps above in a function gives:

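def clean(html, xpath_expr):
	# Parse the HTML string into an _Element tree
	data = etree.HTML(html)
	# Collect every element matched by the XPath expression
	trashs = data.xpath(xpath_expr)
	for item in trashs:
		# An element is removed through its parent
		item.getparent().remove(item)
	# encoding='unicode' makes tostring return str instead of bytes
	return etree.tostring(data, method='html', encoding='unicode')

In the previous post we passed "text" as the method argument of tostring; here we pass "html", which serializes the _Element object as an HTML document string.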
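
The difference between the two method values can be seen side by side in the short sketch below. The input string is a made-up example, and encoding='unicode' is my addition so that tostring returns str rather than bytes.

```python
from lxml import etree

data = etree.HTML('<h1>12345</h1><h2>67890</h2>')

# method='text' keeps only the text content and drops all tags
as_text = etree.tostring(data, method='text', encoding='unicode')

# method='html' serializes the whole tree as HTML markup
as_html = etree.tostring(data, method='html', encoding='unicode')

print(as_text)
print(as_html)
```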

The complete code:

# encoding=utf8

from lxml import etree

def clean(html, xpath_expr):
	data = etree.HTML(html)
	trashs = data.xpath(xpath_expr)
	for item in trashs:
		# Remove each matched element through its parent
		item.getparent().remove(item)
	# encoding='unicode' makes tostring return str instead of bytes
	return etree.tostring(data, method='html', encoding='unicode')

html = '''<h1>12345</h1><script>js</script><h2>67890</h2>'''
_html = clean(html, '//script')
print(_html)

Output:

<html><body><h1>12345</h1><h2>67890</h2></body></html>

The result shows that the script tag has been removed from the HTML.
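
One caveat worth knowing: in lxml, the text that follows an element's closing tag is stored on that element as its tail, so remove() discards it together with the element. The example above has no such text, so it is unaffected. The following is a sketch of a variant that keeps the tail; the name clean_keep_tail and the sample input are hypothetical and not part of the original post.

```python
from lxml import etree

def clean_keep_tail(html, xpath_expr):
    data = etree.HTML(html)
    for item in data.xpath(xpath_expr):
        parent = item.getparent()
        if parent is None:
            # The root element has no parent and cannot be removed this way
            continue
        if item.tail:
            prev = item.getprevious()
            if prev is not None:
                # Attach the trailing text to the previous sibling
                prev.tail = (prev.tail or '') + item.tail
            else:
                # No previous sibling: append it to the parent's own text
                parent.text = (parent.text or '') + item.tail
        parent.remove(item)
    return etree.tostring(data, method='html', encoding='unicode')

html = '<div>before<script>js</script>after</div>'
result = clean_keep_tail(html, '//script')
print(result)
```

With the plain remove() approach the word "after" would be lost along with the script element; here it is moved onto the parent's text before the element is removed.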

Besides the approach above, there is a more convenient way to remove script tags: the Cleaner class in lxml.html.clean. Here is the code:

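# encoding=utf8

# Recent lxml releases ship this module as the separate lxml_html_clean
# package, which may need to be installed alongside lxml
from lxml.html.clean import Cleaner

cleaner = Cleaner()
html = '''<h1>12345</h1><script>js</script><h2>67890</h2>'''
print(cleaner.clean_html(html))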

Output:

<div><h1>12345</h1><h2>67890</h2></div>

The full documentation for Cleaner is as follows:

class Cleaner(__builtin__.object)
 |  Instances cleans the document of each of the possible offending
 |  elements.  The cleaning is controlled by attributes; you can
 |  override attributes in a subclass, or set them in the constructor.
 |  
 |  ``scripts``:
 |      Removes any ``<script>`` tags.
 |  
 |  ``javascript``:
 |      Removes any Javascript, like an ``onclick`` attribute. Also removes stylesheets
 |      as they could contain Javascript.
 |  
 |  ``comments``:
 |      Removes any comments.
 |  
 |  ``style``:
 |      Removes any style tags.
 |  
 |  ``inline_style``
 |      Removes any style attributes.  Defaults to the value of the ``style`` option.
 |  
 |  ``links``:
 |      Removes any ``<link>`` tags
 |  
 |  ``meta``:
 |      Removes any ``<meta>`` tags
 |  
 |  ``page_structure``:
 |      Structural parts of a page: ``<head>``, ``<html>``, ``<title>``.
 |  
 |  ``processing_instructions``:
 |      Removes any processing instructions.
 |  
 |  ``embedded``:
 |      Removes any embedded objects (flash, iframes)
 |  
 |  ``frames``:
 |      Removes any frame-related tags
 |  
 |  ``forms``:
 |      Removes any form tags
 |  
 |  ``annoying_tags``:
 |      Tags that aren't *wrong*, but are annoying.  ``<blink>`` and ``<marquee>``
 |  
 |  ``remove_tags``:
 |      A list of tags to remove.  Only the tags will be removed,
 |      their content will get pulled up into the parent tag.
 |  
 |  ``kill_tags``:
 |      A list of tags to kill.  Killing also removes the tag's content,
 |      i.e. the whole subtree, not just the tag itself.
 |  
 |  ``allow_tags``:
 |      A list of tags to include (default include all).
 |  
 |  ``remove_unknown_tags``:
 |      Remove any tags that aren't standard parts of HTML.
 |  
 |  ``safe_attrs_only``:
 |      If true, only include 'safe' attributes (specifically the list
 |      from the feedparser HTML sanitisation web site).
 |  
 |  ``safe_attrs``:
 |      A set of attribute names to override the default list of attributes
 |      considered 'safe' (when safe_attrs_only=True).
 |  
 |  ``add_nofollow``:
 |      If true, then any <a> tags will have ``rel="nofollow"`` added to them.
 |  
 |  ``host_whitelist``:
 |      A list or set of hosts that you can use for embedded content
 |      (for content like ``<object>``, ``<link rel="stylesheet">``, etc).
 |      You can also implement/override the method
 |      ``allow_embedded_url(el, url)`` or ``allow_element(el)`` to
 |      implement more complex rules for what can be embedded.
 |      Anything that passes this test will be shown, regardless of
 |      the value of (for instance) ``embedded``.
 |  
 |      Note that this parameter might not work as intended if you do not
 |      make the links absolute before doing the cleaning.
 |  
 |      Note that you may also need to set ``whitelist_tags``.
 |  
 |  ``whitelist_tags``:
 |      A set of tags that can be included with ``host_whitelist``.
 |      The default is ``iframe`` and ``embed``; you may wish to
 |      include other tags like ``script``, or you may want to
 |      implement ``allow_embedded_url`` for more control.  Set to None to
 |      include all tags.
 |  
 |  This modifies the document *in place*.
 |  
 |  Methods defined here:
 |  
 |  __call__(...)
 |      Cleans the document.
 |  
 |  __init__(...)
 |  
 |  allow_element(...)
 |  
 |  allow_embedded_url(...)
 |  
 |  allow_follow(...)
 |      Override to suppress rel="nofollow" on some anchors.
 |  
 |  clean_html(...)
 |  
 |  kill_conditional_comments(...)
 |      IE conditional comments basically embed HTML that the parser
 |      doesn't normally see.  We can't allow anything like that, so
 |      we'll kill any comments that could be conditional.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __qualname__ = 'Cleaner'
 |  
 |  add_nofollow = False
 |  
 |  allow_tags = None
 |  
 |  annoying_tags = True
 |  
 |  comments = True
 |  
 |  embedded = True
 |  
 |  forms = True
 |  
 |  frames = True
 |  
 |  host_whitelist = ()
 |  
 |  inline_style = None
 |  
 |  javascript = True
 |  
 |  kill_tags = None
 |  
 |  links = True
 |  
 |  meta = True
 |  
 |  page_structure = True
 |  
 |  processing_instructions = True
 |  
 |  remove_tags = None
 |  
 |  remove_unknown_tags = True
 |  
 |  safe_attrs = frozenset(['abbr', 'accept', 'accept-charset', 'accesskey...
 |  
 |  safe_attrs_only = True
 |  
 |  scripts = True
 |  
 |  style = False
 |  
 |  whitelist_tags = set(['embed', 'iframe'])
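
As the documentation above says, the options can be set in the constructor. The sketch below tries two of them, remove_tags and kill_tags, on a made-up input to show how they differ. It assumes lxml.html.clean is importable in your environment (on recent lxml versions this may require the separate lxml_html_clean package).

```python
from lxml.html.clean import Cleaner

html = ('<div><h1>title</h1><p>keep <b>bold</b> text</p>'
        '<table><tr><td>cell</td></tr></table></div>')

# remove_tags drops only the <b> tag itself and keeps its content;
# kill_tags drops <table> together with everything inside it
cleaner = Cleaner(remove_tags=['b'], kill_tags=['table'])
result = cleaner.clean_html(html)
print(result)
```

The word "bold" stays in the paragraph because only its tag was removed, while the table and its cell text disappear entirely.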

Please forgive my laziness; it really is getting late...
