python之htmlParser入门教程

最新推荐文章于 2024-04-24 13:35:29 发布

萧萧九宸

最新推荐文章于 2024-04-24 13:35:29 发布

阅读量1w

点赞数 2

分类专栏： python 文章标签： python htmlParser 教程使用

python 专栏收录该内容

4 篇文章 0 订阅

订阅专栏

来源《TextProcessing in Python》和python官方 tutorial

HTMLParser.HTMLParser()
htmlParser模块包含了类HTMLParser 这个类本身很有用.因为当产生事件时，本身并不做任何工作。对HTMLParser.HTMLParser() 的利用需要实现其子类，并且编写处理你感兴趣事件的方法

这里插入一段来自python 官网的htmlparser介绍，可以更清晰的了解htmlparser的使用方法

HTMLPaser模块定义一个类HTMLParser ，可以用作解析html和xhtml 的基础.和htmllib中的parser不同，这个parser并不是基于sgmllib实现

class HTMLParser. HTMLParser

当给htmlParser中填充html数据时（调用feed()方法），parser会在遇到开始标签（如<html><head><p>）结束标签如</html></head></p>时调用相应的处理方法.使用者应该实现htmlparser的子类，并且重写(重载)这些处理方法，来实现特定的行为（ call the end-tag handler for elements which are closed implicitly by closing an outer element.

HTMLParser不需要参数就可以实例化。

和htmllib不同。HTMLParser不会检查结束标签是否和开始标签是否匹配，也不会对隐式闭合的元素调用end-tag handler方法，这里的隐式闭合是指只闭合外层元素（ does not call the end-tag handler for elements which are closed implicitly by closing an outer element）
。如 title可以被看做一个隐式闭合的标签

<html><head><title>Test</head>

这里的隐式闭合(close implicityly)是个人理解,之前简单google没有找到满意答案

一个简单htmlparser 使用样例

from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data  :", data

# instantiate the parser and fed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

输出结果

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html

下面内容继续来自《

TextProcessing in Python
》

If it is important to keep track of the structural position of the current event within the document, you will need to maintain a data structure with this information. If you are certain that the document you are processing is well-formed XHTML, a stack suffices. For example:

如果要记录当前标签在整个html文档中的结构位置，则需要维护一个记录位置信息的数据结构。如果你可以确定要处理html文档是严格遵循xhtml标准的，一个栈结构就足够了。

使用栈结构进行html标签匹配的思想，如果不理解可以参考括号匹配内容-----来源《数据结构》

#!/usr/bin/env python
import HTMLParser
html = """<html><head><title>Advice</title></head><body>
<p>The <a href="http://ietf.org">IETF admonishes:
   <i>Be strict in what you <b>send</b>.</i></a></p>
</body></html>
"""
tagstack = []
class ShowStructure(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs): tagstack.append(tag)
    def handle_endtag(self, tag): tagstack.pop()
    def handle_data(self, data):
        if data.strip():
            for tag in tagstack: sys.stdout.write('/'+tag)
            sys.stdout.write(' >> %s\n' % data[:40].strip())
ShowStructure().feed(html)

运行结果

% ./HTMLParser_stack.py
/html/head/title >> Advice
/html/body/p >> The
/html/body/p/a >> IETF admonishes:
/html/body/p/a/i >> Be strict in what you
/html/body/p/a/i/b >> send
/html/body/p/a/i >> .

如果要处理的数据不那么良好，就需要实现一个更复杂的栈，我们可以定义一个新的对象，这个对象可以删除和endtag相对应的最近的一个starttag,同时还可以避免没有被闭合的<p> 和<blockquote>嵌套在其中。你可以为一个应用，顺着这种方式做更多的工作，这里的TagStack是一个很好的例子，可以作为开端

class TagStack:
    def __init__(self, lst=[]): self.lst  = lst
    def __getitem__(self, pos): return self.lst[pos]
    def append(self, tag):
        #当遇到一个<p>标签时，删除tagstack中之前存在的p标签，因为p中不会嵌套标签，所以遇到<p>标签时，tagstack中的<p>标签都是没有闭合的，所以删除
        if tag.lower() in ('p','blockquote'):
            self.lst = [t for t in self.lst
                          if t not in ('p','blockquote')]
        self.lst.append(tag)
    def pop(self, tag):
        # "Pop" by tag from nearest pos, not only last item
        self.lst.reverse()
        try:
            pos = self.lst.index(tag)
        except ValueError:
            raise HTMLParser.HTMLParseError, "Tag not on stack"
        del self.lst[pos]
        self.lst.reverse()
tagstack = TagStack()

对pop方法的一点简单说明，因为刚开始学习python ，这里曾产生困惑:
pop操作首先对lst进行反转，然后self.lst.index(tag)，注意，index()方法返回的是第一个匹配查找目标的位置，所以这里可以获得与endtag相匹配的最近的一个starttag的位置

萧萧九宸

关注

2
点赞
踩
12

收藏

觉得还不错? 一键收藏
0
评论
python之htmlParser入门教程

来源《TextProcessing in Python》和python官方 tutorial HTMLParser.HTMLParser()htmlParser模块包含了类HTMLParser 这个类本身很有用.因为当产生事件时，本身并不做任何工作。对HTMLParser.HTMLParser() 的利用需要实现其子类，并且编写处理你感兴趣事件的方法这里插入一段来
复制链接

扫一扫