Python offers many ways to parse XML; the most popular are SAX, DOM, and ElementTree. Here is a brief comparison of the three:
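| Method | Characteristics |
| --- | --- |
| SAX | SAX parses XML as a stream, firing events (start_element, char_data, end_element) as it goes and calling user-defined callbacks to process the file. |
| DOM | DOM parses the XML data into a tree in memory and manipulates the XML through operations on that tree. It uses a lot of memory and parses relatively slowly; the advantage is that any node of the tree can be visited freely. |
| ElementTree | ElementTree is similar to a lightweight DOM. |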
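To make the comparison concrete, here is a minimal sketch that extracts the same attribute with each of the three standard-library APIs. The `SAMPLE` string, the `RuleNameHandler` class, and the three `names_with_*` helpers are hypothetical names invented for this illustration; they are not part of the crawler project.

```python
import xml.sax
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document used only for this comparison.
SAMPLE = '<rules><rule name="iaaf"/><rule name="mp"/></rules>'


class RuleNameHandler(xml.sax.ContentHandler):
    """SAX: collect rule names as start-element events stream by."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        if name == 'rule':
            self.names.append(attrs.get('name'))


def names_with_sax(text):
    handler = RuleNameHandler()
    xml.sax.parseString(text.encode('utf-8'), handler)
    return handler.names


def names_with_dom(text):
    # DOM: build the whole tree in memory, then query it.
    document = xml.dom.minidom.parseString(text)
    return [node.getAttribute('name')
            for node in document.getElementsByTagName('rule')]


def names_with_elementtree(text):
    # ElementTree: also an in-memory tree, with a smaller API.
    root = ET.fromstring(text)
    return [rule.get('name') for rule in root.findall('rule')]


for extract in (names_with_sax, names_with_dom, names_with_elementtree):
    print(extract.__name__, extract(SAMPLE))
```

All three return the same list; the difference is that the SAX version never holds the whole document in memory, while the other two build a tree first.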
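The rest of this post focuses on ElementTree. The module implements a simple and efficient API for parsing and creating XML data. Yes, it can not only parse XML but also modify it!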
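As a quick taste of the "modify" part, the sketch below changes an attribute, appends a child element, and serializes the result. The one-rule document is a hypothetical, cut-down stand-in for a real rules file.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-rule document, much smaller than a real rules file.
root = ET.fromstring('<rules><rule name="iaaf" type="base"/></rules>')

rule = root.find('rule')
rule.set('type', 'crawl')                # change an existing attribute
host = ET.SubElement(rule, 'host_url')   # append a new child element
host.text = 'https://www.iaaf.org'

# Serialize the modified tree back to a string.
modified = ET.tostring(root, encoding='unicode')
print(modified)
```

To persist the change to disk, wrap the root in `ET.ElementTree(root)` and call its `write()` method.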
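- What is XML?
XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. xml.etree.ElementTree has two classes for this purpose: ElementTree represents the whole XML document as a tree, and Element represents a single node in that tree. Interactions with the whole document are done at the ElementTree level; interactions with a single XML element and its sub-elements are done at the Element level.
- Parsing XML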
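Before working with a real file, here is a small sketch of the two levels just described. It assumes an in-memory `io.StringIO` document in place of a file on disk; the variable names are illustrative only.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory document standing in for a file on disk.
source = io.StringIO('<rules><rule name="iaaf"/></rules>')

tree = ET.parse(source)      # ElementTree: wraps the whole document
root = tree.getroot()        # Element: the top-level <rules> node

print(type(tree).__name__)
print(root.tag)

# Document-level work (e.g. writing everything out) goes through the tree.
buffer = io.BytesIO()
tree.write(buffer, encoding='utf-8', xml_declaration=True)

# Element-level work (children, attributes, text) goes through Element objects.
first_rule = root[0]
print(first_rule.get('name'))
```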
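Even the best cook can't make a meal without rice, so first let's prepare an XML document as sample data (this is an XML file from one of my crawlers; I'm using it as-is for the demonstration, so feel free to ignore what the values actually mean):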
    <?xml version="1.0" encoding="UTF-8"?>
    <rules>
        <rule type="base" name="iaaf" allow_domains="iaaf.org">
            <urls>
                <host_url>https://www.iaaf.org</host_url>
                <start_urls>https://www.iaaf.org/news</start_urls>
                <urllist pagestep="12" pages="1000" regex=" \'Articleurl\':\'(.*?)\'">
                    <next_page>https://www.iaaf.org/data/news/typegroup/?take=12&amp;skip={}</next_page>
                </urllist>
            </urls>
            <xpath>
                <title>//div[@id="news"]/div/div/h1[@itemprop="name"]/text()</title>
                <time>//div[@id="news"]/div/div/span/span[@itemprop="datePublished"]/text()</time>
                <type>//div[@id="news"]/div/div/span/span[@class="_label type"]/a/text()</type>
                <publish>//div[@id="news"]/div[1]/div[3]/span/span[@class="_label location"]/text()</publish>
                <html regexexclude="&lt;ul class='col-md-12 prev-next'.*?&lt;/ul&gt;">//div[@id="news"</html>
                <text>//div[@id="news"]/div/div/article[@itemprop="articleBody"]/p/text()</text>
                <imglink>//div[@id="news"]/div/div/ul/li/picture/img/@src | //div[@id="news"]/div/div/ul/li/picture/source/@srcset</imglink>
                <filelink></filelink>
            </xpath>
        </rule>
        <rule type="base" name="mp" allow_domains="kuaizhan.com">
            <urls>
                <host_url>https://482809.kuaizhan.com</host_url>
                <start_urls>https://482809.kuaizhan.com/</start_urls>
                <urllist pagestep="1" pages="1000" regex="href=\'(.*?)\'">
                    <next_page>https://www.kuaizhan.com/post/ajax-postlist?site_id=4216466368&amp;param=a891b9bfac46d41ebace9eccf88f5bbb&amp;cur_page={}</next_page>
                </urllist>
            </urls>
            <xpath>
                <title>//div[@id="page-content"]/div/div[@class="mod-title t0 "]/h2/text()</title>
                <time>//div[@id="page-content"]/div/div/span[@class="time"]/text()</time>
                <type>//div[@id="news"]/div/div/span/span[@class="_label type"]/a/text()</type>
                <publish>/html/body/div/div/div[@class="cell site-title"]/div/a/p/text()</publish>
                <html>//div[@id="page-content"]/div[@class="mod mod-layout_floor article-hd"] | //div[@id="page-content"]/div/div/div[@class="mod mod-html"]</html>
                <text>//div[@id="page-content"]/div/div/div/div[@class="mp-content"]/p/span/text()</text>
                <imglink>//div[@id="page-content"]/div/div/div/div[@class="mp-content"]/p/img/@src</imglink>
                <filelink></filelink>
            </xpath>
        </rule>
        <rule type="crawl" name="athletics" allow_domains="athletics.org.cn">
            <urls>
                <host_url/>
                <start_urls>http://www.athletics.org.cn</start_urls>
                <urllist list_url=".*/list.html" allow_url=".*/[0-9]{4}-[0-9]{2}-[0-9]{2}/[0-9]*?\.html">
                    <next_page>//div[@class="nav styfff fl clear"]/ul/li/a | //div[@class="wjxz styff"]/ul/li/a</next_page>
                </urllist>
            </urls>
            <xpath>
                <title>//div[@class="main"]/div[@class="atitle"]/text() | //div[@class="main"]/div[@class="atitle"]/font/text()</title>
                <time>//div[@class="main"]/div[@class="a01 sty999"]/span/text()</time>
                <type>//div[@class="wei"]/a[2]/text()</type>
                <publish>//div[@class="main"]/div[@class="a01 sty999"]/a/text()</publish>
                <html>//div[@class="main"]</html>
                <text>//div[@class="main"]/div[@class="atext"]/p/text()</text>
                <imglink>//div[@class="main"]/div[@class="atext"]/p/img/@src</imglink>
                <filelink>//div[@class="main"]/div[@class="atext"]/p/a/@href</filelink>
            </xpath>
        </rule>
    </rules>
    <!--
        &lt;    <    less-than sign
        &gt;    >    greater-than sign
        &amp;   &    ampersand
        &apos;  '    single quote
        &quot;  "    double quote
    -->
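The comment that closes the sample file lists the five predefined XML entities. As a rough sketch of how those escapes come about in practice, the example below uses the standard-library helpers `escape` and `quoteattr` from `xml.sax.saxutils`, and shows that ElementTree escapes on output and unescapes on input by itself. The `raw_url` variable is just a value borrowed from the sample file.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

raw_url = 'https://www.iaaf.org/data/news/typegroup/?take=12&skip={}'

# escape() handles &, < and > for element text.
print(escape(raw_url))

# quoteattr() escapes the value and wraps it in suitable quotes for an attribute.
print(quoteattr("<ul class='col-md-12 prev-next'.*?</ul>"))

# ElementTree escapes special characters automatically when serializing ...
element = ET.Element('next_page')
element.text = raw_url
serialized = ET.tostring(element, encoding='unicode')
print(serialized)

# ... and reverses that when parsing, so code always sees the raw characters.
round_tripped = ET.fromstring(serialized).text
print(round_tripped == raw_url)
```

This is why hand-written XML needs `&amp;` in URLs, while values read back through ElementTree contain a plain `&`.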
With the XML file in hand, we first need to load its data from the file:
    import xml.etree.ElementTree as ET
    import logging

    def parsexml(xmlpath):
        """Parse the XML file at xmlpath; return the ElementTree, or None on failure."""
        try:
            tree = ET.parse(xmlpath)
        except Exception as e:
            logging.error('cannot parse file %s, error: %s', xmlpath, e)
            return None
        return tree
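The data can also be loaded directly from a string:

    import xml.etree.ElementTree as ET
    import logging

    def parsexml(xmlstr):
        """Parse XML from a string; return the root Element, or None on failure."""
        try:
            root = ET.fromstring(xmlstr)
        except Exception as e:
            logging.error('cannot parse string, error: %s', e)
            return None
        return root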
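One detail worth noting about the two loaders: `ET.parse` returns an ElementTree (call `getroot()` to reach the root), while `ET.fromstring` returns the root Element directly. Malformed input raises `ET.ParseError`, which can be caught more narrowly than `Exception`. The sketch below assumes a hypothetical helper named `load_root` and two made-up input strings.

```python
import xml.etree.ElementTree as ET


def load_root(xml_text):
    """Return the root Element, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as error:
        # error.position is a (line, column) tuple pointing at the problem.
        line, column = error.position
        print('malformed XML at line %d, column %d: %s' % (line, column, error))
        return None


good = load_root('<rules><rule name="iaaf"/></rules>')
bad = load_root('<rules><rule name="iaaf"></rules>')  # <rule> is never closed

print(good.tag)
print(bad)
```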
Next, let's use ET to walk through the XML shown above and print its data. The code is as follows:
    import xml.etree.ElementTree as ET
    import os
    import logging

    def parsexml(xmlpath):
        try:
            tree = ET.parse(xmlpath)
            root = tree.getroot()
            # List every direct child of the root with its attributes.
            for child in root:
                print(child.tag, child.attrib)
            print('-----------------------------------------------')
            for rule in root.findall('rule'):
                # Attributes of the <rule> element itself.
                rule_type = rule.get('type')
                name = rule.get('name')
                allow_domains = rule.get('allow_domains')
                print(rule_type, name, allow_domains)
                # <urls>: .text is None for an empty element such as <host_url/>.
                urls = rule.find('urls')
                host_url = urls.find('host_url').text
                start_urls = urls.find('start_urls').text
                print(host_url, start_urls)
                # <urllist>: get() returns None when an attribute is missing.
                urllist = urls.find('urllist')
                pagestep = urllist.get('pagestep')
                pages = urllist.get('pages')
                regex = urllist.get('regex')
                next_page = urllist.find('next_page').text
                print(pagestep, pages, regex, next_page)
                # <xpath>: the extraction expression for each field.
                xpath = rule.find('xpath')
                title = xpath.find('title').text
                publish_time = xpath.find('time').text
                content_type = xpath.find('type').text
                print(title, publish_time, content_type)
        except Exception as e:
            logging.error('Error: cannot parse file: %s', e)

    if __name__ == '__main__':
        xmlpath = os.path.join(os.getcwd(), 'rules.xml')
        parsexml(xmlpath)
Running it prints:

    rule {'type': 'base', 'name': 'iaaf', 'allow_domains': 'iaaf.org'}
    rule {'type': 'base', 'name': 'mp', 'allow_domains': 'kuaizhan.com'}
    rule {'type': 'crawl', 'name': 'athletics', 'allow_domains': 'athletics.org.cn'}
    -----------------------------------------------
    base iaaf iaaf.org
    https://www.iaaf.org https://www.iaaf.org/news
    12 1000 \'Articleurl\':\'(.*?)\' https://www.iaaf.org/data/news/typegroup/?take=12&skip={}
    //div[@id="news"]/div/div/h1[@itemprop="name"]/text() //div[@id="news"]/div/div/span/span[@itemprop="datePublished"]/text() //div[@id="news"]/div/div/span/span[@class="_label type"]/a/text()
    base mp kuaizhan.com
    https://482809.kuaizhan.com https://482809.kuaizhan.com/
    1 1000 href=\'(.*?)\' https://www.kuaizhan.com/post/ajax-postlist?site_id=4216466368&param=a891b9bfac46d41ebace9eccf88f5bbb&cur_page={}
    //div[@id="page-content"]/div/div[@class="mod-title t0 "]/h2/text() //div[@id="page-content"]/div/div/span[@class="time"]/text() //div[@id="news"]/div/div/span/span[@class="_label type"]/a/text()
    crawl athletics athletics.org.cn
    None http://www.athletics.org.cn
    None None None //div[@class="nav styfff fl clear"]/ul/li/a | //div[@class="wjxz styff"]/ul/li/a
    //div[@class="main"]/div[@class="atitle"]/text() | //div[@class="main"]/div[@class="atitle"]/font/text() //div[@class="main"]/div[@class="a01 sty999"]/span/text() //div[@class="wei"]/a[2]/text()
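The `None` values in the last rule come from the empty `<host_url/>` element and from attributes that the `crawl` rule does not define. One possible way to handle such gaps, sketched here on a hypothetical cut-down copy of the rules file, is `findtext()` with a path and a default, plus an attribute predicate in `findall()`:

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-down version of the rules file shown earlier.
SAMPLE = '''
<rules>
  <rule type="base" name="iaaf">
    <urls><host_url>https://www.iaaf.org</host_url></urls>
  </rule>
  <rule type="crawl" name="athletics">
    <urls><host_url/></urls>
  </rule>
</rules>
'''

root = ET.fromstring(SAMPLE)

hosts = {}
for rule in root.iter('rule'):
    # A path reaches nested elements in one call. findtext() returns the
    # default for a missing element and '' for an element with no text.
    host = rule.findtext('urls/host_url', default='')
    hosts[rule.get('name')] = host or '(no host_url)'

print(hosts)

# An attribute predicate selects rules without an explicit if-statement.
crawl_names = [rule.get('name') for rule in root.findall("rule[@type='crawl']")]
print(crawl_names)
```

Compared with chaining `find(...).text`, this avoids an `AttributeError` when an element is absent, at the cost of no longer distinguishing "missing" from "empty".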