# coding: utf-8
# BeautifulSoup can use lxml as its default parser, and lxml can also be used on its own.
# Comparing BeautifulSoup and lxml:
# (1) BeautifulSoup is DOM-based: it loads the whole document and parses the entire DOM
#     tree, which costs more memory and time. lxml queries and processes HTML/XML
#     documents with XPath and only traverses the parts it needs, so it is faster.
#     BeautifulSoup can now use lxml as its default parsing library.
# (2) BeautifulSoup is simpler, with a very friendly API and CSS selector support.
#     lxml's XPath is more cumbersome, so development is slower than with BeautifulSoup.

# Parsing a page with lxml, an example:
from lxml import etree

html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elseie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2"><!-- Lacie --></a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

html = etree.HTML(html_str)
result = etree.tostring(html)
print(result)

# lxml also repairs broken HTML automatically: html_str above is missing its closing
# </body></html> tags, and etree.HTML adds them back.

# Besides reading strings, lxml can read an HTML file directly.
# Save html_str as index.html and parse it with the parse method (an explicit
# HTMLParser is needed here, since parse defaults to the stricter XML parser):
html = etree.parse('index.html', etree.HTMLParser())
result = etree.tostring(html, pretty_print=True)
print(result)

# Extract all the URLs with XPath syntax:
html = etree.HTML(html_str)
urls = html.xpath(".//*[@class='sister']/@href")
print(urls)
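The URL extraction above can be pushed a little further: besides attribute values (`/@href`), XPath can select text nodes and filter on one attribute while reading another. A minimal sketch against the same fragment (the variable names here are illustrative, not from the original):

```python
from lxml import etree

# Same fragment as in the article: the first two links wrap HTML comments,
# only the third contains real text.
html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elseie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" class="sister" id="link2"><!-- Lacie --></a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
html = etree.HTML(html_str)

# /@id collects attribute values, just like /@href in the article.
ids = html.xpath("//a[@class='sister']/@id")

# text() selects text-node children; comments are not text nodes,
# so only the third link contributes a result.
names = html.xpath("//a[@class='sister']/text()")

# A predicate on one attribute can be used to read a different one.
lacie = html.xpath("//a[@id='link2']/@href")

print(ids)    # ['link1', 'link2', 'link3']
print(names)  # ['Tillie']
print(lacie)  # ['http://example.com/lacie']
```

Note that `xpath()` always returns a list, even for a single match, so results usually need indexing or iteration.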
HTML Parsing, Part 5: XPath Parsing with lxml
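For comparison with lxml, the standard library's `xml.etree.ElementTree` also understands a small XPath subset. It cannot repair broken HTML and does not support the `/@attr` step, so this hedged sketch uses a well-formed version of the fragment and reads the attribute off each matched element instead:

```python
import xml.etree.ElementTree as ET

# A well-formed version of the article's fragment: unlike lxml's etree.HTML,
# ElementTree requires the closing tags to be present.
html_str = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>"""

root = ET.fromstring(html_str)

# Attribute predicates work, but the /@href selection step does not,
# so .get('href') is called on each matched element.
urls = [a.get('href') for a in root.findall(".//*[@class='sister']")]
print(urls)  # ['http://example.com/lacie', 'http://example.com/tillie']
```

This mirrors the trade-off the article describes: the stricter, more limited standard tool versus lxml's faster, more forgiving parser with full XPath.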