Python XML SAX example

zzllabcd

Article link: https://blog.csdn.net/zzllabcd/article/details/3070120

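I recently needed an XML parser to analyze some XML files, so I looked through the documentation on the official Python site and downloaded a PDF.
Link: ftp://ftp.logilab.org/pub/talks/python-uk-2002.pdf
It compares SAX and DOM: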
SAX vs DOM
A common question is about choosing which API to use when dealing
with XML documents
Use DOM when:
• read-write access to the document is required
• the processing requires random access to the document
Use SAX when:
• dealing with big documents (>1MB)
• looking for a precise information in the document
• instantiating custom objects from the document
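
To make the DOM side of that comparison concrete, here is a minimal sketch using the standard library's xml.dom.minidom. It is only an illustration: the helper names (element_text, find_mobile_phone_dom) and the inline sample data are made up for this example and do not come from the slides. The point is that the whole document is built as a tree in memory first, which is what gives you random access, and also why it gets expensive for big files.

```python
from xml.dom.minidom import parseString

# Small inline sample (hypothetical data shaped like an address book).
SAMPLE = """<addressbook>
<person>
    <name>Eric Idle</name>
    <phone type='fix'>999-999-999</phone>
    <phone type='mobile'>555-555-555</phone>
</person>
</addressbook>"""


def element_text(node):
    """Concatenate the text nodes directly under an element."""
    return ''.join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)


def find_mobile_phone_dom(xml_text, wanted_name):
    """Return the mobile number of wanted_name, or None (illustrative helper)."""
    doc = parseString(xml_text)  # builds the full tree in memory
    for person in doc.getElementsByTagName('person'):
        names = person.getElementsByTagName('name')
        if not names or element_text(names[0]).strip() != wanted_name:
            continue
        # Random access: we can walk this person's subtree in any order.
        for phone in person.getElementsByTagName('phone'):
            if phone.getAttribute('type') == 'mobile':
                return element_text(phone).strip()
    return None


print(find_mobile_phone_dom(SAMPLE, 'Eric Idle'))
```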

That is clear enough: for XML files larger than 1MB, use SAX.

It goes on:
Using a SAX parser
Using a SAX parser is generally much more work than using a DOM
implementation
• write Handler classes that will receive callbacks from the parser,
and use these callbacks to maintain a state of the parsing being
done.
• Possible handlers include the ContentHandler and the
ErrorHandler
• instantiate a parser and connect the various Handlers
• launch the parsing, and finally get the results
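
The steps in that list can be sketched roughly as follows. This is a minimal illustration, not code from the slides: TagCounter, StrictErrors and count_tags are hypothetical names, and the handler just counts element names so that there is some state to maintain between callbacks.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class TagCounter(ContentHandler):
    """Hypothetical handler: counts how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}  # state maintained across callbacks

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


class StrictErrors(ErrorHandler):
    """Hypothetical error handler: re-raise fatal errors.

    This matches the default ErrorHandler behaviour; it is spelled out
    here only to show where an ErrorHandler plugs in.
    """

    def fatalError(self, exception):
        raise exception


def count_tags(xml_bytes):
    handler = TagCounter()                   # 1. the handler classes
    parser = xml.sax.make_parser()           # 2. instantiate a parser
    parser.setContentHandler(handler)        #    and connect the handlers
    parser.setErrorHandler(StrictErrors())
    parser.parse(io.BytesIO(xml_bytes))      # 3. launch the parsing
    return handler.counts                    # 4. get the results


print(count_tags(b"<a><b/><b/></a>"))
```

parse() also accepts a file name, so for a real file you would pass the path instead of wrapping bytes in io.BytesIO.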

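What does that mean? It tells us how to use SAX.
The explanation is in English, but put simply: write a handler class that inherits from ContentHandler to process the XML file you are interested in.
Below is an example from that document.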
<addressbook>
<person>
    <name>Eric Idle</name>
    <phone type='fix'>999-999-999</phone>
    <phone type='mobile'>555-555-555</phone>
    <address>
       <street>12, spam road</street>
       <city>London</city>
       <zip>H4B 1X3</zip>
    </address>
</person>
<person>
    <name>Terry Gilliam</name>
    <phone type='mobile'>555-555-554</phone>
    <phone type='fix'>999-999-998</phone>
    <address>
       <street>3, Brazil Lane</street>
       <city>Leeds</city>
       <zip>F2A 2S5</zip>
    </address>
</person>
</addressbook>

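from xml.sax import make_parser, SAXException
from xml.sax.handler import ContentHandler


class PhoneContentHandler(ContentHandler):  # define a handler class
    def __init__(self, name):
        ContentHandler.__init__(self)
        self.look_for = name
        self.is_name, self.is_mobile = False, False  # two state flags
        self.current_name = None  # name of the <person> being read
        self.mobile = None        # result, set once a match is found
        self.buffer = ''

    def startElement(self, name, attrs):
        # Check for a <phone> tag whose 'type' attribute is 'mobile'.
        # (Note: according to the 2.52 reference, attrs is an object of the
        # Attributes abstract class, so attrs.getValue('type') == 'mobile' is
        # the documented spelling. attrs.get('type') also works, because
        # Attributes objects implement part of the mapping protocol, and it
        # returns None instead of raising KeyError when 'type' is missing.)
        if name == 'phone' and attrs.get('type') == 'mobile':
            self.is_mobile = True
        elif name == 'name':
            self.is_name = True

    def endElement(self, name):
        if self.is_name:  # end of a <name> tag
            # Text inside the tag. This is a Unicode string; use encode()
            # if you need it in a particular character set.
            self.current_name = self.buffer.strip()
            self.buffer = ''
            self.is_name = False
        elif self.is_mobile:  # end of a mobile <phone> tag
            if self.current_name == self.look_for:
                self.mobile = self.buffer
                raise SAXException('Found mobile phone')  # stop parsing
            # Not the person we want: reset, otherwise the buffer keeps
            # growing and corrupts the next <name>.
            self.buffer = ''
            self.is_mobile = False

    def characters(self, chars):
        if self.is_name or self.is_mobile:
            self.buffer += chars


def find_mobile_phone(name):
    handler = PhoneContentHandler(name)
    parser = make_parser()
    parser.setContentHandler(handler)
    try:
        parser.parse('addressbook.xml')  # parse() accepts a file name
    except SAXException:
        if handler.mobile is None:
            raise  # a genuine parse error, not our stop signal
    return handler.mobile


if __name__ == '__main__':
    import sys
    name = ' '.join(sys.argv[1:])
    phone = find_mobile_phone(name)
    if phone:
        print('Mobile phone is', phone)
    else:
        print('No mobile phone found for', name)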

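That is the basic way to use SAX. The files I process are all fairly large, and SAX is the only approach my machine can cope with; anything else would take down both the machine and the program.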
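
As a rough illustration of why SAX copes with large files, the sketch below feeds the parser in fixed-size chunks, so only one chunk plus the handler's own state is in memory at a time. It relies on the feed() and close() methods that the standard library's incremental parsers provide. The names MobileDirectory and read_mobiles, the chunk size, and the inline sample are assumptions made for this example. It also shows the "instantiating custom objects" use case: the result is a plain dict built while parsing.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

# Inline sample (hypothetical data shaped like an address book).
SAMPLE = b"""<addressbook>
<person>
    <name>Eric Idle</name>
    <phone type='fix'>999-999-999</phone>
    <phone type='mobile'>555-555-555</phone>
</person>
<person>
    <name>Terry Gilliam</name>
    <phone type='mobile'>555-555-554</phone>
    <phone type='fix'>999-999-998</phone>
</person>
</addressbook>"""


class MobileDirectory(ContentHandler):
    """Hypothetical handler: maps each person's name to a mobile number."""

    def __init__(self):
        super().__init__()
        self.mobiles = {}
        self.current_name = None
        self.capture = None  # 'name', 'mobile', or None
        self.buffer = []

    def startElement(self, name, attrs):
        if name == 'name':
            self.capture, self.buffer = 'name', []
        elif name == 'phone' and attrs.get('type') == 'mobile':
            self.capture, self.buffer = 'mobile', []

    def characters(self, content):
        if self.capture:
            self.buffer.append(content)  # text may arrive in several pieces

    def endElement(self, name):
        text = ''.join(self.buffer).strip()
        if self.capture == 'name' and name == 'name':
            self.current_name = text
        elif self.capture == 'mobile' and name == 'phone':
            self.mobiles[self.current_name] = text
        self.capture = None
        self.buffer = []


def read_mobiles(stream, chunk_size=64 * 1024):
    """Parse a binary stream chunk by chunk and return {name: mobile}."""
    handler = MobileDirectory()
    parser = make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)  # the parser handles tags split across chunks
    parser.close()
    return handler.mobiles


print(read_mobiles(io.BytesIO(SAMPLE)))
```

For a real file, something like open('addressbook.xml', 'rb') would replace the io.BytesIO wrapper.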