Advanced Python: using lxml

自学AI的鲨鱼儿

Article tags: python_进阶

Article link: https://blog.csdn.net/qq_16555103/article/details/84198546

This section works with the following file, named webhtml.html:

<!DOCTYPE html>
<html>
<head>
	<title>漏斗图</title>
	<script type="text/javascript" src="./echarts.js"></script>
</head>
<body>
	<div id="main" style="width: 800px;height: 600px">1111</div>
	<article id="main2" style="width: 800px;height: 600px">
		<span>
			logo
			<a href="http://www.baidu.com" style="font-size:15px;">taobao</a>
			<b>hahaha<em>3333</em></b>
			<a href="www.baidu.com">taobao2</a>
		</span>
	</article>
	<div id="last">last... ...</div>
	<div class="one">11111111111111111111111</div>
	<div class="one two" name="sec" data-foo="value">22222222222222222222222</div>
	<div id="left">
		<a href="http://www.taobao1.com">11111</a>
		<a href="http://www.taobao2.com">333333</a>
		<a href="http://www.taobao3.com">4444</a>
		<a href="http://www.taobao4.com">55555</a>
	</div>
	<script type="text/javascript">
		var myChart=echarts.init(document.getElementById('main'))
		var option={
			title:{
				text:"你的附近哪家自助货架比较多",
				subtext:"数据地区:上海",
			},
			tooltip:{
				// trigger:'item'   //not axis
			},
			legend:{
				orient:"vertical",
				left:"left",
				top:"center",
				data:['猩便利','小u货架','友宝','峰小柜','小e微店']
				// the names in data must match the name fields in series.data
			},
			toolbox:{
				// show:true,
				feature:{              // feature takes an object, not true
					// mark:{
					// 	show:true
					// },
					dataView:{
						show:true,
						readOnly:true
					},
					restore:{
						show:true
					},
					saveAsImage:{
						show:true
					}
				}
			},
			series:[{
				name:"货架详情",
				type:"funnel",
				left:"30%",
				max:100,
				min:0,
				data:[
					{
						value:100,
						name:"猩便利",
					},{
						value:80,
						name:"友宝"
					},{
						value:60,
						name:"峰小柜"
					},{
						name:"小u货架",
						value:20
					},{
						name:"小e微店",
						value:40
					}
				]
			}]
		}
		myChart.setOption(option)
 
	</script>
</body>
</html>

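I. lxml basics:

① An XPath expression can be checked in the browser's developer tools before it is used in code.

② string() returns a str, whereas /text() returns a list.

③ /@attribute_name also returns a list.

⑤ .xpath() can also be called on the elements returned by an earlier query on the etree object: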

# Python_tree is an etree object built beforehand, e.g. with etree.HTML(...)
Python_tree_list = Python_tree.xpath('//div[@class="article-intro"]/ul/li/a')  # [<Element a at 0x25efed7f4c8>, ...]
for i in Python_tree_list:
    print(i.xpath('text()'))    # ['Python 练习实例1']  (a list)
    print(i.xpath('string()'))  # Python 练习实例1  (a str)
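
The same pattern can be tried against the sample file. The block below is a minimal sketch that assumes a trimmed copy of the `<div id="left">` block from webhtml.html; the helper name `collect_links` is hypothetical and only there for illustration.

```python
from lxml import etree

# Trimmed copy of the <div id="left"> block from webhtml.html
HTML_SNIPPET = (
    '<div id="left">'
    '<a href="http://www.taobao1.com">11111</a>'
    '<a href="http://www.taobao2.com">333333</a>'
    '</div>'
)

def collect_links(html_text):
    """Return (href, text) pairs for every <a> directly under div#left."""
    tree = etree.HTML(html_text)
    pairs = []
    # First query: select the <a> elements themselves
    for a in tree.xpath('//div[@id="left"]/a'):
        hrefs = a.xpath('@href')      # second query on the element: a list
        label = a.xpath('string()')   # second query on the element: a str
        pairs.append((hrefs[0], label))
    return pairs

links = collect_links(HTML_SNIPPET)
print(links)
# [('http://www.taobao1.com', '11111'), ('http://www.taobao2.com', '333333')]
```

Queries made on an element are relative to that element, which is why `@href` and `string()` need no leading path here.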

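1. Creating an lxml object:

(1) From the content of a requests response:

from lxml import etree
import requests

# Fetch the page and decode the response body to text
response1 = requests.get('https://www.baidu.com').content.decode('utf-8')
html_lxml = etree.HTML(response1)    # create the lxml object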

(2) From a local file:
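
One common way to load a local file is `etree.parse` together with `etree.HTMLParser`. The sketch below is not taken from the original article: the helper `load_html` is a hypothetical name, and the block writes a tiny stand-in for webhtml.html into a temporary directory so that it runs anywhere.

```python
import os
import tempfile
from lxml import etree

def load_html(path):
    """Parse an HTML file from disk; returns an ElementTree."""
    # HTMLParser tolerates markup that a strict XML parser would reject
    parser = etree.HTMLParser(encoding='utf-8')
    return etree.parse(path, parser)

# Write a small stand-in for webhtml.html (assumption: the real file is UTF-8)
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'webhtml.html')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('<html><head><title>漏斗图</title></head>'
            '<body><div id="last">last... ...</div></body></html>')

tree = load_html(demo_path)
print(tree.xpath('//title/text()'))            # ['漏斗图']
print(tree.xpath('//div[@id="last"]/text()'))  # ['last... ...']
```

Passing the encoding explicitly avoids garbled Chinese text when the file carries no charset declaration.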

2. Serializing an lxml object:

result = etree.tostring(html_lxml,pretty_print=True,encoding='utf-8').decode('utf-8')
print(result)
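
A version that needs no network access may be easier to experiment with. The following sketch assumes a one-line fragment taken from webhtml.html; it also shows that `etree.HTML` wraps a bare fragment in `<html>` and `<body>`, and that `encoding='unicode'` returns a str directly, so no `.decode()` is needed.

```python
from lxml import etree

fragment = '<div id="last">last... ...</div>'
root = etree.HTML(fragment)   # the parser adds the missing <html> and <body>

# Same call as above: bytes out, then decode to str
as_bytes = etree.tostring(root, pretty_print=True, encoding='utf-8')
serialized = as_bytes.decode('utf-8')

# Alternative: ask lxml for a str directly
as_text = etree.tostring(root, pretty_print=True, encoding='unicode')

print(serialized)
```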

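II. XPath syntax:

1. Selecting nodes:

# --------------- Note the difference between these three ---------------
# html is an etree object created beforehand, e.g. with etree.HTML(...)
last_div = html.xpath("//div[@class='bottom']")[0]  # find the div whose class is "bottom"

print(last_div.xpath("span/text()"))     # span elements that are direct children of last_div
print(last_div.xpath("//span/text()"))   # every span in the whole document (global search)
print(last_div.xpath(".//span/text()"))  # every span anywhere under last_div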
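
The three forms can be compared on a small document. The markup below is hypothetical; only the class name "bottom" comes from the snippet above.

```python
from lxml import etree

# Hypothetical markup built to make the three queries return different results
DOC = (
    '<html><body>'
    '<div class="top"><span>outside</span></div>'
    '<div class="bottom"><span>direct</span><p><span>nested</span></p></div>'
    '</body></html>'
)

html = etree.HTML(DOC)
last_div = html.xpath("//div[@class='bottom']")[0]

children_only = last_div.xpath("span/text()")     # direct children of last_div
whole_document = last_div.xpath("//span/text()")  # starts from the document root
descendants = last_div.xpath(".//span/text()")    # anywhere below last_div

print(children_only)   # ['direct']
print(whole_document)  # ['outside', 'direct', 'nested']
print(descendants)     # ['direct', 'nested']
```

A path that begins with `//` ignores the element it is called on, so a leading `.` is needed whenever the search should stay inside the current element.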

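2. Predicates:

//input[contains(@name,'na')] selects every input element whose name attribute contains the substring na; contains() tests for substring containment.

etree_obj.xpath("//input[contains(@name,'na')]")  # input elements whose name attribute contains "na"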
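
The difference between an exact match and `contains()` shows up on the two `class="one"` divs of webhtml.html. The sketch below assumes a trimmed copy of those two lines with shortened text.

```python
from lxml import etree

# Adapted from the two "one" divs in webhtml.html (text shortened)
DOC = (
    '<div class="one">1111</div>'
    '<div class="one two" name="sec" data-foo="value">2222</div>'
)
tree = etree.HTML(DOC)

# Exact comparison: the attribute value must be exactly "one"
exact = tree.xpath("//div[@class='one']/text()")
# Substring test: "one two" also contains "one"
partial = tree.xpath("//div[contains(@class,'one')]/text()")
# contains() works on any attribute
by_name = tree.xpath("//div[contains(@name,'se')]/@name")

print(exact)    # ['1111']
print(partial)  # ['1111', '2222']
print(by_name)  # ['sec']
```

Because `contains()` is a plain substring test, it would also match a class such as "someone"; that is usually acceptable for scraping but worth keeping in mind.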

3. XPath wildcards:
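
As a minimal sketch of the two most common XPath 1.0 wildcards, `*` (any element) and `@*` (any attribute), the block below uses markup adapted from webhtml.html; the `<b>bold</b>` element is a hypothetical addition so that `*` has two different tags to match.

```python
from lxml import etree

# Adapted from webhtml.html; <b>bold</b> is added for illustration only
DOC = (
    '<div class="one two" name="sec" data-foo="value">2222</div>'
    '<div id="left"><a href="http://www.taobao1.com">11111</a><b>bold</b></div>'
)
tree = etree.HTML(DOC)

# * matches any element: every child of div#left, whatever its tag
child_tags = [el.tag for el in tree.xpath('//div[@id="left"]/*')]

# @* matches any attribute: all attribute values of the div named "sec"
attr_values = sorted(tree.xpath('//div[@name="sec"]/@*'))

# [@*] as a predicate: divs that carry at least one attribute
divs_with_attrs = tree.xpath('//div[@*]')

print(child_tags)            # ['a', 'b']
print(attr_values)           # ['one two', 'sec', 'value']
print(len(divs_with_attrs))  # 2
```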

4. Examples:

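5. XPath operators:

Of these, the union operator (|) is the one used most often.

Comparison operators such as < and >= compare an element's content, converted to a number, against a value.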
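
A short sketch of a numeric comparison, assuming a copy of the `<div id="left">` block from webhtml.html, whose links happen to have numeric text. In XPath 1.0 the text is converted to a number before comparing; non-numeric text becomes NaN, for which every comparison is false.

```python
from lxml import etree

# Copy of the <div id="left"> block from webhtml.html
LEFT_DIV = (
    '<div id="left">'
    '<a href="http://www.taobao1.com">11111</a>'
    '<a href="http://www.taobao2.com">333333</a>'
    '<a href="http://www.taobao3.com">4444</a>'
    '<a href="http://www.taobao4.com">55555</a>'
    '</div>'
)
tree = etree.HTML(LEFT_DIV)

# Links whose text, read as a number, is at least 10000
large = tree.xpath('//div[@id="left"]/a[number(.) >= 10000]/text()')
# Links whose text is below 10000 (the conversion is implicit here)
small = tree.xpath('//div[@id="left"]/a[. < 10000]/text()')

print(large)  # ['11111', '333333', '55555']
print(small)  # ['4444']
```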

Difference between |, and, and or:
# ----------------- and / or -------------------
print(html.xpath("//li[@class='first' or @class='last']/span/text()"))   # or: either condition inside a predicate may hold
print(html.xpath("//ul[@class='ul_list']/li[position()>2 and position()<6]/text()"))  # and: both conditions must hold
# ----------------- | ------------------------
print(html.xpath("//div[@class='top']/ul/li | //ul[@class='ul_list']/li"))  # |: the union of two result sets
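
The three queries can be run against a small document. The markup below is hypothetical; only the class names (top, ul_list, first, last) are taken from the queries above.

```python
from lxml import etree

# Hypothetical markup shaped to fit the three queries above
DOC = (
    '<div class="top"><ul><li>t1</li><li>t2</li></ul></div>'
    '<ul class="ul_list">'
    '<li class="first"><span>a</span></li>'
    '<li>b</li><li>c</li><li>d</li><li>e</li>'
    '<li class="last"><span>f</span></li>'
    '</ul>'
)
html = etree.HTML(DOC)

# or: one predicate, either class value is accepted
either = html.xpath("//li[@class='first' or @class='last']/span/text()")

# and: both position tests must hold, so items 3, 4 and 5 are kept
middle = html.xpath("//ul[@class='ul_list']/li[position()>2 and position()<6]/text()")

# |: two complete paths, merged into one result in document order
merged = html.xpath("//div[@class='top']/ul/li | //ul[@class='ul_list']/li")
merged_text = [li.xpath('string()') for li in merged]

print(either)       # ['a', 'f']
print(middle)       # ['c', 'd', 'e']
print(merged_text)  # ['t1', 't2', 'a', 'b', 'c', 'd', 'e', 'f']
```

In short, `and` and `or` combine conditions inside one predicate, while `|` combines the results of two separate paths.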

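6. Getting element attributes and content with XPath:

These return the content, not the element itself.

① /text() returns the text directly inside each matched node, excluding text inside child elements; the result is a list.
② /@attribute_name returns the attribute value of each matched element; the result is also a list.
③ string() returns all the text under the first matched node, including text inside its descendants; the result is a str.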

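# Characteristics of /text() in XPath
# 1. If an element holds only child elements and no text of its own, /text() returns an empty list,
#    because it does not descend into child elements. To get a child's text, select the child
#    itself and apply /text() there. The result is a list.
print(xpath_obj.xpath('//div[@class="actcont-auto"]/text()'))   # [] when the div has no text of its own
print(xpath_obj.xpath('//div[@class="actcont-auto"]')[0].xpath('.//a[@href="author_11549103"]/text()'))
# 2. Characteristics of string() in XPath: it returns all the text under the first matched element,
#    descendants included, as a str.
print(xpath_obj.xpath('//div[@class="actcont-auto"]')[0].xpath('string()'))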
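
The three rules can be checked on the `<article id="main2">` block of webhtml.html. The sketch below assumes a copy of that block with whitespace and style attributes removed, so that the results contain no stray newlines.

```python
from lxml import etree

# The <article id="main2"> block from webhtml.html, whitespace and styles removed
DOC = (
    '<article id="main2"><span>logo'
    '<a href="http://www.baidu.com">taobao</a>'
    '<b>hahaha<em>3333</em></b>'
    '<a href="www.baidu.com">taobao2</a>'
    '</span></article>'
)
tree = etree.HTML(DOC)
span = tree.xpath('//article[@id="main2"]/span')[0]

own_text = span.xpath('text()')      # only the text directly inside <span>
all_text = span.xpath('.//text()')   # every text node below <span>, as a list
joined = span.xpath('string()')      # the same text, concatenated into one str
hrefs = span.xpath('a/@href')        # attribute values, as a list
first_a = tree.xpath('string(//a)')  # several <a> match; only the first is used

print(own_text)  # ['logo']
print(all_text)  # ['logo', 'taobao', 'hahaha', '3333', 'taobao2']
print(joined)    # logotaobaohahaha3333taobao2
print(hrefs)     # ['http://www.baidu.com', 'www.baidu.com']
print(first_a)   # taobao
```

When the text of every descendant is needed as separate items, `.//text()` is often the more convenient choice, since `string()` concatenates everything without separators.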

7. Examples:
