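Recently I needed an XML parser to analyze some XML files, so I looked through the documentation on the Python website and downloaded a PDF.
Link: ftp://ftp.logilab.org/pub/talks/python-uk-2002.pdf
It compares SAX and DOM: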
SAX vs DOM
A common question is about choosing which API to use when dealing
with XML documents
Use DOM when:
• read-write access to the document is required
• the processing requires random access to the document
Use SAX when:
• dealing with big documents (>1MB)
• looking for a precise information in the document
• instantiating custom objects from the document
That is clear enough: for XML files larger than 1 MB, use SAX.
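To make the trade-off concrete, here is a minimal sketch of both styles, written for Python 3. The helper names (dom_change_phone, NameCollector, sax_names) and the inline sample data are my own assumptions for illustration, not taken from the slides. DOM loads the whole tree so a node can be found and edited in place; SAX only sees a stream of events and keeps whatever the handler chooses to keep.

```python
import xml.sax
from xml.dom import minidom

# Small inline sample (an assumption for this sketch, not the real file).
XML = (b"<addressbook><person><name>Eric Idle</name>"
       b"<phone type='mobile'>555-555-555</phone></person></addressbook>")


def dom_change_phone(data, new_number):
    """DOM: the whole document is in memory, so we can jump to a node and edit it."""
    doc = minidom.parseString(data)
    phone = doc.getElementsByTagName('phone')[0]
    phone.firstChild.data = new_number  # firstChild is the Text node
    return doc.documentElement.toxml()


class NameCollector(xml.sax.ContentHandler):
    """SAX: react to events as they stream past and keep only the names."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._parts = []

    def startElement(self, name, attrs):
        if name == 'name':
            self._in_name = True
            self._parts = []

    def characters(self, content):
        if self._in_name:
            self._parts.append(content)

    def endElement(self, name):
        if name == 'name':
            self.names.append(''.join(self._parts).strip())
            self._in_name = False


def sax_names(data):
    handler = NameCollector()
    xml.sax.parseString(data, handler)
    return handler.names
```

With the sample above, sax_names(XML) returns ['Eric Idle'], and dom_change_phone(XML, '999-000-111') returns the serialized tree with the new number in place of the old one.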
Then there is this:
Using a SAX parser
Using a SAX parser is generally much more work than using a DOM
implementation
• write Handler classes that will receive callbacks from the parser,
and use these callbacks to maintain a state of the parsing being
done.
• Possible handlers include the ContentHandler and the
ErrorHandler
• instantiate a parser and connect the various Handlers
• launch the parsing, and finally get the results
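What is this saying? It tells us how to use SAX.
That was the English explanation. Put simply, you write a handler class that inherits from ContentHandler and use it to process the XML file you care about.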
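The four steps in the list above can be sketched roughly as follows, for Python 3. The class and function names (TagCounter, CollectingErrorHandler, count_tags) are hypothetical and only serve the illustration; the parser calls themselves (make_parser, setContentHandler, setErrorHandler, parse) are the standard xml.sax ones.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, ErrorHandler


class TagCounter(ContentHandler):
    """Step 1: a handler that keeps state across callbacks (a count per tag)."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


class CollectingErrorHandler(ErrorHandler):
    """Step 2: an ErrorHandler that records problems reported by the parser."""

    def __init__(self):
        self.errors = []

    def warning(self, exception):
        self.errors.append(exception)

    def error(self, exception):
        self.errors.append(exception)

    def fatalError(self, exception):
        self.errors.append(exception)
        raise exception  # a fatal error means parsing cannot continue


def count_tags(xml_bytes):
    content = TagCounter()
    errors = CollectingErrorHandler()
    parser = make_parser()               # step 3: instantiate a parser...
    parser.setContentHandler(content)    # ...and connect the handlers
    parser.setErrorHandler(errors)
    parser.parse(io.BytesIO(xml_bytes))  # step 4: launch the parsing
    return content.counts, errors.errors  # ...and get the results
```

For example, count_tags(b"<a><b/><b/></a>") returns ({'a': 1, 'b': 2}, []). Malformed input such as b"<a><b></a>" reaches fatalError and the exception propagates to the caller.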
Below is an example from it. First the XML file (addressbook.xml), then the code:
<addressbook>
  <person>
    <name>Eric Idle</name>
    <phone type='fix'>999-999-999</phone>
    <phone type='mobile'>555-555-555</phone>
    <address>
      <street>12, spam road</street>
      <city>London</city>
      <zip>H4B 1X3</zip>
    </address>
  </person>
  <person>
    <name>Terry Gilliam</name>
    <phone type='mobile'>555-555-554</phone>
    <phone type='fix'>999-999-998</phone>
    <address>
      <street>3, Brazil Lane</street>
      <city>Leeds</city>
      <zip>F2A 2S5</zip>
    </address>
  </person>
</addressbook>
from xml.sax import make_parser, SAXException
from xml.sax.handler import ContentHandler


class PhoneContentHandler(ContentHandler):  # define a handler class
    def __init__(self, name):
        ContentHandler.__init__(self)
        self.look_for = name
        self.is_name, self.is_mobile = False, False  # two state flags
        self.current_name = None
        self.mobile = None
        self.buffer = ''

    def startElement(self, name, attrs):
        # Check for a <phone> tag whose 'type' attribute is 'mobile'.
        # Note: the Python 2.5.2 reference describes attrs as an Attributes
        # object, whose accessor is attrs.getValue('type'). get() also works
        # and returns None instead of raising KeyError if the attribute is absent.
        if name == 'phone' and attrs.get('type') == 'mobile':
            self.is_mobile = True
        elif name == 'name':
            self.is_name = True

    def endElement(self, name):
        if self.is_name:  # we are at the end of a <name> tag
            # The text inside the tag. In Python 2 this is a unicode string;
            # call encode() on it if you need a specific character set.
            self.current_name = self.buffer.strip()
            self.buffer = ''
            self.is_name = False
        elif self.is_mobile:
            if self.current_name == self.look_for:
                self.mobile = self.buffer
                raise SAXException('Found mobile phone')  # stop parsing
            # Someone else's phone: reset, or later text piles up in the buffer.
            self.buffer = ''
            self.is_mobile = False

    def characters(self, chars):
        if self.is_name or self.is_mobile:
            self.buffer += chars


def find_mobile_phone(name):
    handler = PhoneContentHandler(name)
    parser = make_parser()
    parser.setContentHandler(handler)
    try:
        parser.parse('addressbook.xml')
    except SAXException:
        # Set if the handler stopped the parse; still None on a parse error.
        return handler.mobile
    return None


if __name__ == '__main__':
    import sys
    name = ' '.join(sys.argv[1:])
    phone = find_mobile_phone(name)
    if phone:
        print('Mobile phone is %s' % phone)
    else:
        print('No mobile phone found for %s' % name)
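That is the basic usage of SAX. The files I process are all fairly large, so this is the only approach my machine can cope with; otherwise both the machine and the program would go down together.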
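One thing to watch with the "raise SAXException to stop parsing" trick: a malformed file also raises a SAXException subclass (SAXParseException), so the caller cannot tell "found it" from "the file is broken". A possible variation, sketched here for Python 3 and not taken from the slides, is to raise a dedicated subclass. The names StopParsing, FirstElementText and first_text are hypothetical.

```python
import io
from xml.sax import make_parser, SAXException
from xml.sax.handler import ContentHandler


class StopParsing(SAXException):
    """Hypothetical marker exception: raised only to end the parse early."""


class FirstElementText(ContentHandler):
    """Capture the text of the first element with the given tag, then stop."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.text = None
        self._parts = None  # a list while we are inside the wanted element

    def startElement(self, name, attrs):
        if name == self.tag:
            self._parts = []

    def characters(self, content):
        if self._parts is not None:
            self._parts.append(content)

    def endElement(self, name):
        if name == self.tag and self._parts is not None:
            self.text = ''.join(self._parts)
            raise StopParsing('found <%s>' % name)  # skip the rest of the file


def first_text(xml_bytes, tag):
    handler = FirstElementText(tag)
    parser = make_parser()
    parser.setContentHandler(handler)
    try:
        parser.parse(io.BytesIO(xml_bytes))
    except StopParsing:
        pass  # expected: the handler found what it wanted
    return handler.text  # real parse errors are not caught and reach the caller
```

For a big file this keeps the main benefit of the original example: once the wanted element has been seen, the rest of the document is never processed.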