Python Project 3: XML for All Occasions

Latest recommended article published on 2018-05-11 18:18:35

ranky2009


Article link: https://blog.csdn.net/ranky2009/article/details/46759743


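This post walks through Project 3, "XML for All Occasions", from Beginning Python (2nd edition).
The project reads data from an XML file and generates several HTML pages. It uses the xml.sax library that ships with Python, and the code contains very little logic.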

Code: https://code.csdn.net/ranky2009/pythonsmallproject

The XML file, website.xml, looks like this:

<website>
  <page name="index" title="Home Page">
    <h1>Welcome to My Home Page</h1>
    
    <p>Hi, there. My name is Mr.Gumby, and this is my home page. Here
    are some of my interests:</p>
    
    <ul>
      <li><a href="interests/shouting.html">Shouting</a></li>
      <li><a href="interests/sleeping.html">Sleeping</a></li>
      <li><a href="interests/eating.html">Eating</a></li>
    </ul>
  </page>
  <directory name="interests">
    <page name="shouting" title="Shouting">
      <h1>Mr. Gumby's Shouting Page</h1>
      
      <p>...</p>
    </page>
    <page name="sleeping" title="Sleeping">
      <h1>Mr. Gumby's Sleeping Page</h1>
      
      <p>...</p>
    </page>
    <page name="eating" title="Eating">
      <h1>Mr. Gumby's Eating Page</h1>
      
      <p>...</p>
    </page>
  </directory>
</website>

The code is as follows:

from xml.sax.handler import ContentHandler
from xml.sax import parse
import os
 
class WebsiteConstructor(ContentHandler):
    passthrough = False  # pass-through flag: True while inside a page element, False otherwise
    
    def dispatch(self, prefix, name, attrs=None):
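        '''
        Dispatch method: picks the handler to call based on its arguments.
        It plays the role of a switch statement, but once written it never
        needs a separate branch for each different condition.
        '''

        # Name of the handler to call; capitalize() upper-cases the first character
        mname = prefix + name.capitalize()
        dname = 'default' + prefix.capitalize()  # name of the default handler
        method = getattr(self, mname, None)
        myargs = []
        if not callable(method):  # this class has no handler with that name
            method = getattr(self, dname, None)  # fall back to defaultStart or defaultEnd
            myargs.append(name)  # default handlers need the element name
        if prefix == 'start': myargs.append(attrs)  # startXXX handlers need attrs
        if callable(method): method(*myargs)  # call the handler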
    
    def startElement(self, name, attrs):
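        '''
        Element-start callback; overrides the method in the parent class ContentHandler.
        While parse runs, it is called automatically each time an XML element opens.
        '''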
        self.dispatch('start', name, attrs)
     
    def endElement(self, name):
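        '''
        Element-end callback; overrides the method in the parent class ContentHandler.
        While parse runs, it is called automatically each time an XML element closes.
        '''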
        self.dispatch('end', name)
    
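    def __init__(self, directory):  # constructor
        super().__init__()  # let ContentHandler set up its own state
        self.directory = [directory]
        self.ensureDirectory()

    def ensureDirectory(self):  # create the current directory if it does not exist
        print(self.directory)
        path = os.path.join(*self.directory)
        if not os.path.isdir(path): os.makedirs(path)
        
    def characters(self, chars):  # receives the text between an element's start and end tags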
        if self.passthrough: self.out.write(chars)
        
    def defaultStart(self, name, attrs):  # default element-start handler
        if self.passthrough:
            self.out.write('<' + name)
            for key, val in attrs.items():
                self.out.write(' %s="%s"' % (key, val))
            self.out.write('>')
            
    def defaultEnd(self, name):  # default element-end handler
        if self.passthrough:
            self.out.write('</%s>' % name)
            
    def startDirectory(self, attrs):  # start handler for the "directory" element
        self.directory.append(attrs['name'])
        print('start dir %s' % str(self.directory))
        self.ensureDirectory()
        
    def endDirectory(self):
        self.directory.pop()
    
    def startPage(self, attrs):  # start handler for the "page" element
        #print(self.directory + [attrs['name'] + '.html'])
        # Build the path of the output file
        filename = os.path.join(*self.directory + [attrs['name'] + '.html'])
        #print(filename)
        self.out = open(filename, 'w')
        self.writeHeader(attrs['title'])
        self.passthrough = True
        
    def endPage(self):  # end handler for the "page" element
        self.passthrough = False
        self.writeFooter()
        self.out.close()
        
    def writeHeader(self, title):
        self.out.write('<html>\n <head>\n   <title>')
        self.out.write(title)
        self.out.write('</title>\n </head>\n    <body>\n')
        
    def writeFooter(self):
        self.out.write('\n </body>\n</html>\n')  # close the body tag opened in writeHeader
        
parse('website.xml', WebsiteConstructor('public_html'))
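
The dispatch method is the least obvious part of the listing, so here is a minimal sketch of the same getattr-based lookup in isolation. The Dispatcher class and its log attribute are hypothetical names made up for this illustration; only the lookup logic mirrors the code above.

```python
class Dispatcher:
    """Hypothetical stand-in for WebsiteConstructor that keeps only the dispatch logic."""

    def __init__(self):
        self.log = []  # records which handler ran, for illustration only

    def dispatch(self, prefix, name, attrs=None):
        mname = prefix + name.capitalize()       # e.g. 'startPage'
        dname = 'default' + prefix.capitalize()  # e.g. 'defaultStart'
        method = getattr(self, mname, None)
        args = []
        if not callable(method):
            # No specific handler: fall back to the default one,
            # which also needs the element name.
            method = getattr(self, dname, None)
            args.append(name)
        if prefix == 'start':
            args.append(attrs)
        if callable(method):
            method(*args)

    def startPage(self, attrs):
        self.log.append(('startPage', attrs['name']))

    def defaultStart(self, name, attrs):
        self.log.append(('defaultStart', name))

    def defaultEnd(self, name):
        self.log.append(('defaultEnd', name))


d = Dispatcher()
d.dispatch('start', 'page', {'name': 'index'})  # a specific handler exists
d.dispatch('start', 'h1', {})                   # no startH1, so defaultStart runs
d.dispatch('end', 'h1')                         # no endH1, so defaultEnd runs
print(d.log)
# [('startPage', 'index'), ('defaultStart', 'h1'), ('defaultEnd', 'h1')]
```

Supporting a new element only requires adding a startXxx or endXxx method; dispatch itself stays unchanged.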

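The entry point of the program is the call to parse.

parse comes from xml.sax and takes the following parameters:

xml.sax.parse(filename_or_stream, handler[, error_handler])

filename_or_stream: the name of the XML file, or an open file object (stream)

handler: must be a ContentHandler object. In this example WebsiteConstructor inherits from ContentHandler, so an instance of it can be passed as this argument

error_handler: optional. When it is given, it must be a SAX ErrorHandler object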
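
As a rough illustration of the three parameters, the sketch below passes a stream instead of a file name and supplies a custom error handler. TagCounter and CollectingErrorHandler are hypothetical classes written for this example; parse, ContentHandler, ErrorHandler and SAXParseException are the real xml.sax names.

```python
import io
from xml.sax import parse, SAXParseException
from xml.sax.handler import ContentHandler, ErrorHandler


class TagCounter(ContentHandler):
    """Hypothetical handler that just counts start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


class CollectingErrorHandler(ErrorHandler):
    """Hypothetical error handler that records fatal errors before re-raising."""

    def __init__(self):
        self.messages = []

    def fatalError(self, exception):
        self.messages.append(exception.getMessage())
        raise exception  # same behaviour as the default handler


# Well-formed input, passed as a byte stream rather than a file name.
counter = TagCounter()
parse(io.BytesIO(b'<website><page name="index"/></website>'), counter)
print(counter.count)  # 2

# Malformed input: <page> is never closed, so the parser reports a fatal error.
collector = CollectingErrorHandler()
try:
    parse(io.BytesIO(b'<website><page></website>'), TagCounter(), collector)
except SAXParseException:
    pass
print(collector.messages)
```

Without the third argument, a fatal error is simply raised as a SAXParseException.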

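The main methods of the ContentHandler class are:

characters(content)

content is the text found outside of tags; its value falls into 4 cases.

startDocument() is called when the document starts

endDocument() is called when the document ends

startElement(name, attrs) is called when a tag opens; name is the tag name and attrs holds the tag's attributes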
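
To see the order in which these callbacks fire, here is a minimal sketch that records every event. EventRecorder is a hypothetical class written for this illustration, and the XML snippet is a shortened variant of website.xml. It also collects the text passed to characters in a list, because a SAX parser is allowed to deliver one run of text in several chunks.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class EventRecorder(ContentHandler):
    """Hypothetical handler that records each callback in the order it is called."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.text = []

    def startDocument(self):
        self.events.append('startDocument')

    def endDocument(self):
        self.events.append('endDocument')

    def startElement(self, name, attrs):
        # attrs behaves like a read-only mapping; copy it into a plain dict.
        self.events.append(('startElement', name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(('endElement', name))

    def characters(self, content):
        # Text may arrive in more than one call, so collect and join later.
        self.text.append(content)


recorder = EventRecorder()
parseString(b'<page name="index"><h1>Tom &amp; Jerry</h1></page>', recorder)
for event in recorder.events:
    print(event)
print(''.join(recorder.text))  # Tom & Jerry
```

The recorded list begins with startDocument and ends with endDocument, with the element events nested in between in document order.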