# coding: utf-8
# BeautifulSoup can use lxml as its default parser, and lxml can also be used on its own.
# Comparing BeautifulSoup and lxml:
# (1) BeautifulSoup is DOM-based: it loads the whole document and parses the entire DOM
#     tree, which costs more memory and time. lxml queries and processes HTML/XML
#     documents with XPath and only traverses the parts it needs, so it is faster.
#     BeautifulSoup can now use lxml as its default parsing library.
# (2) BeautifulSoup is simpler, with a very friendly API and CSS selector support.
#     lxml's XPath is more cumbersome, so development is slower than with BeautifulSoup.

# Parsing a page with lxml, an example:
from lxml import etree

html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elseie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2"><!-- Lacie --></a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

html = etree.HTML(html_str)
result = etree.tostring(html)
print(result)

# lxml also repairs broken HTML automatically: html_str above is missing its closing
# </body></html> tags, and etree.HTML adds them back.

# Besides reading strings, lxml can read an HTML file directly.
# Save html_str as index.html and parse it with the parse method (an explicit
# HTMLParser is needed here, since parse defaults to the stricter XML parser):
html = etree.parse('index.html', etree.HTMLParser())
result = etree.tostring(html, pretty_print=True)
print(result)

# Extract all the URLs with XPath syntax:
html = etree.HTML(html_str)
urls = html.xpath(".//*[@class='sister']/@href")
print(urls)
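The URL extraction above can be pushed a little further: besides attribute values (`/@href`), XPath can select text nodes and filter on one attribute while reading another. A minimal sketch against the same fragment (the variable names here are illustrative, not from the original):

```python
from lxml import etree

# Same fragment as in the article: the first two links wrap HTML comments,
# only the third contains real text.
html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elseie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" class="sister" id="link2"><!-- Lacie --></a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
html = etree.HTML(html_str)

# /@id collects attribute values, just like /@href in the article.
ids = html.xpath("//a[@class='sister']/@id")

# text() selects text-node children; comments are not text nodes,
# so only the third link contributes a result.
names = html.xpath("//a[@class='sister']/text()")

# A predicate on one attribute can be used to read a different one.
lacie = html.xpath("//a[@id='link2']/@href")

print(ids)    # ['link1', 'link2', 'link3']
print(names)  # ['Tillie']
print(lacie)  # ['http://example.com/lacie']
```

Note that `xpath()` always returns a list, even for a single match, so results usually need indexing or iteration.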
HTML Parsing, Part 5: XPath Parsing with lxml
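For comparison with lxml, the standard library's `xml.etree.ElementTree` also understands a small XPath subset. It cannot repair broken HTML and does not support the `/@attr` step, so this hedged sketch uses a well-formed version of the fragment and reads the attribute off each matched element instead:

```python
import xml.etree.ElementTree as ET

# A well-formed version of the article's fragment: unlike lxml's etree.HTML,
# ElementTree requires the closing tags to be present.
html_str = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>"""

root = ET.fromstring(html_str)

# Attribute predicates work, but the /@href selection step does not,
# so .get('href') is called on each matched element.
urls = [a.get('href') for a in root.findall(".//*[@class='sister']")]
print(urls)  # ['http://example.com/lacie', 'http://example.com/tillie']
```

This mirrors the trade-off the article describes: the stricter, more limited standard tool versus lxml's faster, more forgiving parser with full XPath.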