XPath Notes

Latest recommended article published on 2022-08-23 22:49:03

Uridis


Article link: https://blog.csdn.net/Uridis/article/details/86826415


Python web scraping: XPath

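Installing XPath support
What is XPath?
Common path expressions
Installing the XPath plugin in Chrome
Opening and closing the plugin
Locating by attribute
Locating by hierarchy
Locating by index
Locating with logical operators
Fuzzy matching
Extracting text
Extracting attributes
Using XPath in code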

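Installing XPath support

The code in this note uses XPath through lxml (from lxml import etree), so that is the package to install:

pip install lxml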

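What is XPath?

XML is used to store and transmit data.
It differs from HTML in two ways:
(1) HTML is for displaying data; XML is for transmitting data
(2) HTML tags are predefined; XML tags are user-defined
XPath is a path expression language for finding specific elements in XML and, through lxml, in parsed HTML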

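Common path expressions

Common path expressions:

/	    Select starting from the root node
//	    Select matching nodes anywhere in the document, regardless of position
./	    Search downward from the current node
../	    Search from the parent of the current node
@       Select attributes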

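Examples:

Expression	Meaning
/bookstore/book	All book elements that are direct children of the root node bookstore
//book	All book elements, regardless of where they are in the document
bookstore//book	All book elements anywhere below bookstore
//@lang	All attributes named lang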
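
The expressions in the table above can be tried against a small document. The following is a minimal sketch using lxml; the sample XML and the variable names are made up for illustration and are not from the original note.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
xml = """
<bookstore>
  <book><title lang="en">Harry Potter</title></book>
  <book><title lang="zh">Journey to the West</title></book>
  <shelf><book><title lang="en">Learning XML</title></book></shelf>
</bookstore>
"""
root = etree.fromstring(xml)

# /bookstore/book: only book elements that are direct children of the root.
direct_books = root.xpath('/bookstore/book')

# //book: every book element, however deeply nested.
all_books = root.xpath('//book')

# /bookstore//book: every book element anywhere below bookstore.
books_below_root = root.xpath('/bookstore//book')

# //@lang: the value of every lang attribute, in document order.
langs = root.xpath('//@lang')

print(len(direct_books), len(all_books), len(books_below_root), langs)
```

The nested book inside shelf is what separates the single-slash result from the double-slash results.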

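Predicate syntax	Meaning
/bookstore/book[1]	The first book child of bookstore
/bookstore/book[last()]	The last book child of bookstore
/bookstore/book[position() < 3]	The first two book children of bookstore
//title[@lang]	All title nodes that have a lang attribute
//title[@lang='en']	All title nodes whose lang attribute equals en
*	Matches any element node
@*	Matches any attribute node
node()	Matches a node of any type
/bookstore/*	All child elements of bookstore
//*	All elements in the document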
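
To see how the predicates behave, here is a short sketch with lxml. The three-book document and the single-letter titles are assumptions chosen so that each result is easy to read.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
xml = """
<bookstore>
  <book><title lang="en">A</title></book>
  <book><title lang="zh">B</title></book>
  <book><title>C</title></book>
</bookstore>
"""
root = etree.fromstring(xml)

first = root.xpath('/bookstore/book[1]/title/text()')                   # first book
last = root.xpath('/bookstore/book[last()]/title/text()')               # last book
first_two = root.xpath('/bookstore/book[position() < 3]/title/text()')  # books 1 and 2
with_lang = root.xpath('//title[@lang]/text()')                         # titles that have lang
english = root.xpath("//title[@lang='en']/text()")                      # titles with lang == en
children = root.xpath('/bookstore/*')                                   # all child elements

print(first, last, first_two, with_lang, english, len(children))
```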

Installing the XPath plugin in Chrome

	Drag the XPath plugin file onto Chrome's extensions page to install it

Opening and closing the plugin

	ctrl + shift + x

Locating by attribute

	//input[@id="kw"]
	//input[@class="btn self-btn bg s_btn"]
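
A minimal sketch of attribute locating with lxml follows. The HTML fragment is a made-up stand-in for a search page, not the real page the expressions were written against. Note that @class="..." compares the whole attribute value, so every class name must be listed exactly as it appears.

```python
from lxml import etree

# Hypothetical fragment, invented for this illustration.
html = '''
<form>
  <input id="kw" class="s_ipt" name="wd">
  <input id="su" class="btn self-btn bg s_btn" type="submit" value="Search">
</form>
'''
tree = etree.HTML(html)

# Match on the id attribute.
box = tree.xpath('//input[@id="kw"]')

# Match on the full class attribute value (exact string comparison).
button = tree.xpath('//input[@class="btn self-btn bg s_btn"]')

# A partial class value does not match with "=".
partial = tree.xpath('//input[@class="s_btn"]')

print(box[0].get('name'), button[0].get('value'), len(partial))
```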

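Locating by hierarchy

	//div[@id="head"]/div[@id="u_sp"]/a[@id="s_username_top"]
	//div[@id="head"]//a[@class="s-user-name-top"]
	[Note] A single slash / steps to direct children only; a double slash // matches all a nodes below, regardless of depth

Locating by index

	//div[@id="head"]/div[@id="u_sp"]/a[1]
	[Note] Indexes start at 1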
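
The difference between stepping level by level, skipping levels with //, and picking by index can be checked with a small sketch. The fragment below is hypothetical; only the id and class names are borrowed from the expressions in this note.

```python
from lxml import etree

# Hypothetical fragment, invented for this illustration.
html = '''
<div id="head">
  <div id="u_sp">
    <a class="mnav" href="/news">News</a>
    <a class="mnav" href="/map">Map</a>
    <a id="s_username_top" class="s-user-name-top" href="/user">User</a>
  </div>
</div>
'''
tree = etree.HTML(html)

# Every level spelled out with single slashes.
by_steps = tree.xpath('//div[@id="head"]/div[@id="u_sp"]/a[@id="s_username_top"]')

# Intermediate levels skipped with a double slash.
by_descendant = tree.xpath('//div[@id="head"]//a[@class="s-user-name-top"]')

# Index predicate: a[1] is the first a child, because XPath counts from 1.
first_link = tree.xpath('//div[@id="u_sp"]/a[1]/text()')

print(by_steps[0].get('href'), by_descendant[0].get('href'), first_link)
```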

Locating with logical operators

	//input[@class="s_ipt" and @name="wd"]
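
As a quick check of how and narrows a match, here is a sketch with three made-up inputs. The or variant is included for contrast; it is standard XPath but does not appear in the original note.

```python
from lxml import etree

# Hypothetical fragment, invented for this illustration.
html = '''
<form>
  <input class="s_ipt" name="wd">
  <input class="s_ipt" name="other">
  <input class="x" name="wd">
</form>
'''
tree = etree.HTML(html)

# and: both conditions must hold on the same node.
both = tree.xpath('//input[@class="s_ipt" and @name="wd"]')

# or: either condition is enough (shown for contrast).
either = tree.xpath('//input[@class="s_ipt" or @name="wd"]')

print(len(both), len(either))
```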

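Fuzzy matching

	contains
		//input[contains(@class, "s_i")]
		All input nodes that have a class attribute whose value contains s_i
		ret=tree.xpath('//li[contains(text(),"爱")]/text()')
		The text of all li nodes whose text contains 爱
	starts-with
		//input[starts-with(@class, "s")]
		All input nodes that have a class attribute whose value starts with s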
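
A small sketch of contains and starts-with with lxml follows. The fragment, the name attributes, and the English substring "love" are assumptions made so the results are easy to verify.

```python
from lxml import etree

# Hypothetical fragment, invented for this illustration.
html = '''
<div>
  <input class="s_ipt" name="wd">
  <input class="btn s_i2" name="go">
  <input class="bg" name="other">
  <ul><li>I love XPath</li><li>hello</li></ul>
</div>
'''
tree = etree.HTML(html)

# contains: the substring may appear anywhere in the attribute value.
has_s_i = [n.get('name') for n in tree.xpath('//input[contains(@class, "s_i")]')]

# starts-with: the attribute value must begin with the given prefix.
starts_s = [n.get('name') for n in tree.xpath('//input[starts-with(@class, "s")]')]

# contains also works on text nodes.
love = tree.xpath('//li[contains(text(), "love")]/text()')

print(has_s_i, starts_s, love)
```

The second input shows the difference: its class value contains s_i but starts with btn, so only contains matches it.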

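Extracting text

	//div[@id="u_sp"]//a[@class="mnav"][last()-1]/text()        Gets the text directly inside the node
	//div[@id="u_sp"]//text()
	Gets every text node inside the node, at any depth, without the tags
	# string(.) concatenates all of the text and returns it as a single string
	ret=tree.xpath('//div[@class="song"]')
	string=ret[0].xpath('string(.)')
	print(string.replace('\n','').replace('\t',''))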
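
The three ways of getting text return different shapes of result, which the following sketch tries to make visible. The fragment is made up, and the whitespace cleanup with split and join is one possible approach rather than the one used in the note.

```python
from lxml import etree

# Hypothetical fragment, invented for this illustration.
html = '''
<div class="song">
  Intro
  <a href="/a">First</a>
  <p>Second <b>bold</b></p>
</div>
'''
tree = etree.HTML(html)
div = tree.xpath('//div[@class="song"]')[0]

# ./text(): only text nodes that are direct children of the div (a list).
direct = div.xpath('./text()')

# .//text(): text nodes at any depth below the div (a list).
every = div.xpath('.//text()')

# string(.): all descendant text concatenated into one string.
joined = div.xpath('string(.)')

# Remove all whitespace, including newlines and indentation.
clean = ''.join(joined.split())

print(direct, every, clean)
```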

Extracting attributes

	//div[@id="u_sp"]//a[@class="mnav"][last()-1]/@href
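
Here is a minimal sketch of reading an attribute with /@href, combined with the last()-1 predicate. The three links are invented; with three matching a nodes, last()-1 picks the second one.

```python
from lxml import etree

# Hypothetical fragment, invented for this illustration.
html = '''
<div id="u_sp">
  <a class="mnav" href="/news">News</a>
  <a class="mnav" href="/map">Map</a>
  <a class="mnav" href="/video">Video</a>
</div>
'''
tree = etree.HTML(html)

# Second-to-last matching link: take its href attribute value.
href = tree.xpath('//div[@id="u_sp"]//a[@class="mnav"][last()-1]/@href')

# Same node, but take its text instead of an attribute.
text = tree.xpath('//div[@id="u_sp"]//a[@class="mnav"][last()-1]/text()')

print(href, text)
```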

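Using XPath in code

Import
	from lxml import etree
	There are two ways to use it; both turn the HTML document into an object, and you then call that object's xpath method to find the nodes you want
	(1) Local file
		tree = etree.parse('filename')
		[Note] etree.parse uses an XML parser by default; for HTML that is not well-formed XML, pass etree.HTMLParser() as the second argument
	(2) Content fetched from the network
		tree = etree.HTML(page_source_string)
		
	ret = tree.xpath(path_expression)
	[Note] ret is a list
	ret=tree.xpath('//div[@class="tang"]/ul/li//a/@href')
	#ret=tree.xpath('//div[@class="tang"]/ul/li[@class="love" and @name="yang"]')
	#ret=tree.xpath('//div[@class="tang"]/ul/li[contains(@class,"l")]')
	ret=tree.xpath('//li[contains(text(),"爱")]/text()')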
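
Both loading paths can be put side by side in one runnable sketch. The HTML content, the example.com URLs, and the temporary file are assumptions for illustration; in real use the string would come from an HTTP response and the file would already exist on disk.

```python
import os
import tempfile

from lxml import etree

# Hypothetical page content, invented for this illustration.
html = '''<div class="tang"><ul>
<li class="love" name="yang"><a href="http://example.com/1">one</a></li>
<li class="like"><a href="http://example.com/2">two</a></li>
</ul></div>'''

# (2) From a string, for example the body of a downloaded page.
tree = etree.HTML(html)
hrefs = tree.xpath('//div[@class="tang"]/ul/li//a/@href')
loved = tree.xpath('//div[@class="tang"]/ul/li[@class="love" and @name="yang"]')

# (1) From a local file. An HTMLParser is passed explicitly so that
# HTML which is not well-formed XML can still be parsed.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(html)
    path = f.name
try:
    file_tree = etree.parse(path, etree.HTMLParser())
    file_hrefs = file_tree.xpath('//div[@class="tang"]/ul/li//a/@href')
finally:
    os.remove(path)  # clean up the temporary file

print(hrefs, len(loved), file_hrefs)
```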