Using XPath in Python

logoai

Category: python. Tags: python, html, xpath

Article link: https://blog.csdn.net/logoai/article/details/104847327

In my opinion, XPath is the handiest tool for web scraping in Python.

Get all tr tags
Get all div tags with class=para
Get the href links of a tags
Get the plain text inside tags

HTML code

<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		<title></title>
	</head>
	<body>
		<table>
			<tr>
				<td><a href="http://www.baidu.com">百度</a></td>
				<td><a href="https://www.360.cn/">360</a></td>
				<td><a href="https://www.iqiyi.com/">爱奇艺</a></td>
				<td><a href="https://www.qq.com/">腾讯</a></td>
				<td><a href="https://www.huawei.com/cn/?ic_medium=direct&ic_source=surlent">华为</a></td>
			</tr>
		</table>
		<div class="para">
			<div class="parse">
				<p>Hello, HTML. You will change the world</p>
				<p>Hello, python. You will change the world</p>
			</div>
			<div class="parse">
				<ul>
					<li><a href="http://www.baidu.com">百度</a></li>
					<li><a href="http://www.360.com/">360</a></li>
					<li><a href="http://www.qq.com">腾讯</a></li>
				</ul>
			</div>

		</div>
		<div class="para">
			<a href="https://www.qq.com/">腾讯</a>
			<a href="https://www.iqiyi.com/">爱奇艺</a>
			<a href="https://www.huawei.com/cn/?ic_medium=direct&ic_source=surlent">华为</a>
		</div>
		<div class="para">

		</div>
	</body>
</html>

import requests
from lxml import etree
# lxml must be installed first (pip install lxml)
url = "xxxxxx"
header = {"User-Agent": "xxxxxx"}
text = requests.get(url, headers=header).text
html = etree.HTML(text)
# html is now the parsed element tree of the page
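
If you want to try the queries without hitting a real site, you can feed etree.HTML a string directly. The snippet below is only a sketch: SAMPLE is a made-up, cut-down stand-in for the page text that requests.get(...).text would return, not the author's actual target page.

```python
from lxml import etree

# Hypothetical inline sample; it stands in for requests.get(url, headers=header).text
SAMPLE = """
<html>
  <body>
    <table>
      <tr><td><a href="http://www.baidu.com">Baidu</a></td></tr>
    </table>
    <div class="para"><p>Hello, HTML.</p></div>
  </body>
</html>
"""

html = etree.HTML(SAMPLE)       # returns the root <html> element
print(html.tag)                 # tag name of the root element
print(len(html.xpath("//a")))   # number of a tags found anywhere in the tree
```

Every xpath() call in the rest of this post can be run against an html object built this way.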

Get all tr tags

trs = html.xpath("//tr")
# xpath() returns a list of elements
for tr in trs:
	print(tr)
	# serialize the element to UTF-8 bytes, then decode them into a readable string
	print(etree.tostring(tr, encoding="utf-8").decode("utf-8"))
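
As a small variation on the loop above, tostring can also hand back a str directly, which skips the decode step. This is a sketch on a made-up one-row table, not the page from the post; with_tail=False is used so that whitespace following the element is not included in the output.

```python
from lxml import etree

# Hypothetical one-row table used only for illustration
html = etree.HTML(
    "<table><tr><td><a href='http://www.baidu.com'>Baidu</a></td></tr></table>"
)

trs = html.xpath("//tr")
for tr in trs:
    # encoding="unicode" makes tostring return str instead of bytes
    markup = etree.tostring(tr, encoding="unicode", with_tail=False)
    print(markup)
```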

Get all div tags with class=para

divs = html.xpath("//div[@class='para']")
for div in divs:
	print(div)
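
One thing to watch for: @class='para' compares the whole attribute string, so a div with class="para highlight" would be skipped. The sketch below uses made-up markup to show the difference; the concat/normalize-space expression is a common XPath 1.0 idiom for matching a single class token, not something from the original post.

```python
from lxml import etree

# Hypothetical markup: one exact match, one multi-class div, one unrelated div
html = etree.HTML("""
<div class="para">one</div>
<div class="para highlight">two</div>
<div class="other">three</div>
""")

# Exact match on the full attribute value
exact = html.xpath("//div[@class='para']")

# Match 'para' as one whitespace-separated token of the class attribute
token = html.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' para ')]"
)

print([d.text for d in exact])   # ['one']
print([d.text for d in token])   # ['one', 'two']
```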

Get the href of a tags

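/a/@href returns the href values themselves, while /a[@href='xxx'] returns the a tags whose href equals xxx.
An href can point to resources such as images and videos, which you can then download or analyze.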

hrefs = html.xpath("//a/@href")
for href in hrefs:
	print(href)
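
To make the difference between the two expressions concrete, here is a minimal sketch on a made-up two-link list: the first query returns strings, the second returns elements that you can keep working with.

```python
from lxml import etree

# Hypothetical two-link list used only for illustration
html = etree.HTML("""
<ul>
  <li><a href="http://www.baidu.com">Baidu</a></li>
  <li><a href="http://www.qq.com">Tencent</a></li>
</ul>
""")

# //a/@href selects the attribute values: a list of strings
hrefs = html.xpath("//a/@href")

# //a[@href='...'] selects the a elements whose href matches
links = html.xpath("//a[@href='http://www.qq.com']")

print(hrefs)                      # ['http://www.baidu.com', 'http://www.qq.com']
print([a.text for a in links])    # ['Tencent']
print(links[0].get("href"))       # read the attribute back from the element
```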

Get all plain text

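divs = html.xpath("//div[position()>3]")
# position() acts as a filter. It is 1-based and counts div siblings under the same parent,
# so this selects every div that is the 4th or later div child of its parent
# (it is not an index into the overall result list)
for div in divs:
	texts = div.xpath(".//a/text()")	# the leading . restricts the search to this div
	print(texts)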