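XPath parsing
XPath parsing is the most commonly used parsing approach. It is convenient, efficient, and general-purpose, since the same expressions work across tools and languages.
How XPath parsing works
Instantiate an etree object and load the source of the page to be parsed into it.
Call the etree object's xpath method with an XPath expression to locate tags and capture their content.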
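The two steps above can be sketched as follows. The HTML string and the class name "song" are made up for illustration; a real page source would come from a file or an HTTP response.

```python
from lxml import etree

# Hypothetical page source, standing in for real downloaded HTML.
page_text = """
<html><body>
  <div class="song">
    <p>first</p>
    <p>second</p>
  </div>
</body></html>
"""

# Step 1: load the page source into an etree object.
tree = etree.HTML(page_text)

# Step 2: call xpath() with an expression to locate tags and capture content.
p_elements = tree.xpath('//div[@class="song"]/p')      # list of Element objects
p_texts = tree.xpath('//div[@class="song"]/p/text()')  # list of strings

print(len(p_elements))  # 2
print(p_texts)          # ['first', 'second']
```

Note that xpath() always returns a list, even when only one node matches.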
Installing the environment
pip install lxml
Import etree before instantiating an etree object: from lxml import etree
To load the source of a local HTML document into an etree object:
etree.parse(filePath)
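To load source data fetched from the internet into an etree object (page_text is the variable holding the response text, not a string literal):
etree.HTML(page_text)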
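A minimal sketch comparing the two loading paths. The HTML content and the temporary file are assumptions made so the example runs on its own. One detail worth knowing: etree.parse() uses a strict XML parser by default, so passing an etree.HTMLParser() is a common way to tolerate HTML that is not well-formed XML.

```python
import os
import tempfile
from lxml import etree

# Hypothetical HTML; the <p> tag is deliberately left unclosed.
html_source = "<html><body><h1>Title</h1><p>unclosed paragraph</body></html>"

# Write it to a temporary file to stand in for a local HTML document.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf-8") as f:
    f.write(html_source)
    file_path = f.name

try:
    # Local file: an HTML parser makes etree.parse() lenient about markup errors.
    parser = etree.HTMLParser(encoding="utf-8")
    tree_from_file = etree.parse(file_path, parser)
finally:
    os.remove(file_path)

# In-memory string (for example, the text of an HTTP response).
tree_from_text = etree.HTML(html_source)

title_from_file = tree_from_file.xpath("//h1/text()")
title_from_text = tree_from_text.xpath("//h1/text()")
print(title_from_file, title_from_text)
```

Both objects expose the same xpath() method, so the rest of the parsing code does not need to care where the source came from.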
XPath expressions
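The most frequently used expression forms can be sketched against a small snippet. The markup, the class name "tang", and the link targets are hypothetical; only the expression syntax matters here.

```python
from lxml import etree

# Hypothetical markup used only to demonstrate expression syntax.
html = """
<html><body>
  <div class="tang">
    <ul>
      <li><a href="/a" title="first">Alpha</a></li>
      <li><a href="/b" title="second">Beta<b>bold</b></a></li>
    </ul>
  </div>
</body></html>
"""
tree = etree.HTML(html)

# "/" walks one level at a time, starting from the root.
by_path = tree.xpath('/html/body/div')

# "//" matches at any depth.
all_li = tree.xpath('//li')

# [@attr="value"] filters by attribute.
by_class = tree.xpath('//div[@class="tang"]')

# [n] selects by position; positions start at 1, not 0.
second_li = tree.xpath('//div[@class="tang"]/ul/li[2]')

# /text() returns only the text directly inside the tag.
direct_text = tree.xpath('//li[2]/a/text()')

# //text() returns the text of the tag and all of its descendants.
all_text = tree.xpath('//li[2]/a//text()')

# /@attr returns attribute values.
hrefs = tree.xpath('//li/a/@href')

print(direct_text)  # ['Beta']
print(all_text)     # ['Beta', 'bold']
print(hrefs)        # ['/a', '/b']
```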
Example:
import requests
from lxml import etree

if __name__ == '__main__':
    url = "https://www.qiushibaike.com/imgrank/"
    # Send a browser-like User-Agent so the request is not rejected as a bot.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"
    }
    resp = requests.get(url=url, headers=headers)
    resp.encoding = "utf-8"
    page_text = resp.text
    # Load the page source into an etree object.
    tree = etree.HTML(page_text)
    # Prints a list of Element objects, one per matching div.
    print(tree.xpath('//div[@class="thumb"]'))
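The example above prints Element objects rather than data. A sketch of the usual next step, running a relative expression (starting with "./") on each matched element, is shown below. The inner structure of the "thumb" div (an a tag wrapping an img) and the pic.example.com URLs are assumptions for illustration, not the site's confirmed markup.

```python
from lxml import etree

# Hypothetical markup; the real page structure may differ.
page_text = """
<div id="content">
  <div class="thumb"><a href="/article/1"><img src="//pic.example.com/1.jpg" alt="one"></a></div>
  <div class="thumb"><a href="/article/2"><img src="//pic.example.com/2.jpg" alt="two"></a></div>
</div>
"""
tree = etree.HTML(page_text)

images = []
for div in tree.xpath('//div[@class="thumb"]'):
    # "./" makes the expression relative to the current div, not the whole page.
    src = div.xpath('./a/img/@src')[0]
    alt = div.xpath('./a/img/@alt')[0]
    # Protocol-relative URLs need a scheme before they can be requested.
    images.append((alt, "https:" + src))

print(images)
# [('one', 'https://pic.example.com/1.jpg'), ('two', 'https://pic.example.com/2.jpg')]
```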
Scraping the hot cities and all cities
import requests
from lxml import etree

if __name__ == '__main__':
    url = "https://www.aqistudy.cn/historydata/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"
    }
    resp = requests.get(url=url, headers=headers)
    resp.encoding = "utf-8"
    page_text = resp.text
    tree = etree.HTML(page_text)
    # Hot cities: locate each li first, then read the link text relative to it.
    hot_li_list = tree.xpath("//div[@class='bottom']/ul/li")
    hot_city = []
    for hot_li in hot_li_list:
        hot_city.append(hot_li.xpath("./a/text()")[0])
    print(hot_city)
    # All cities: a single expression returns the link texts directly.
    all_li_list = tree.xpath("//div[@class='bottom']/ul/div/li/a/text()")
    print(all_li_list)
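Both lists can also be collected with one call by joining the two expressions with the XPath union operator "|". The sketch below runs against a hand-written snippet that only mimics the nesting implied by the expressions above; the city names are placeholders, not data from the site.

```python
from lxml import etree

# Hypothetical snippet mirroring the structure the expressions assume.
page_text = """
<div class="bottom"><ul>
  <li><a href="#">Beijing</a></li>
  <li><a href="#">Shanghai</a></li>
</ul></div>
<div class="bottom"><ul>
  <div><li><a href="#">Anshan</a></li><li><a href="#">Beijing</a></li></div>
</ul></div>
"""
tree = etree.HTML(page_text)

# "|" returns the nodes matched by either expression.
names = tree.xpath(
    "//div[@class='bottom']/ul/li/a/text()"
    " | //div[@class='bottom']/ul/div/li/a/text()"
)

# A hot city also appears in the full list, so drop repeats but keep order.
unique_names = list(dict.fromkeys(names))

print(len(names))
print(unique_names)
```

The union saves a loop, at the cost of no longer knowing which list each name came from, so the two-query version is still preferable when the groups must stay separate.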