Web Scraping Development 02 -- Data Parsing -- XPath (Preferred)

nikeylee

Category: Web Scraping

Article link: https://blog.csdn.net/nikeylee/article/details/109305117

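I. How XPath parsing works

1. Getting the HTML content

1) Loading a local HTML file: etree.parse(local_file_path, parser=parser)

2) Loading the HTML returned by a network request: etree.HTML(response_text)

3) Serializing an etree object: etree.tostring()

1> All nodes named node that have a class attribute: //node[@class]

2> All nodes named node that do not have a class attribute: //node[not(@class)]

3> All nodes named node that have both a class attribute and an id attribute: //node[@class and @id]

4> All nodes named node with a specific id attribute value: //node[@id="myid"]

5> All nodes named node whose text contains substring: //node[contains(text(), substring)]

6> All nodes named node whose text() equals a specific value: //node[text()="mytext"]

7> All nodes with a given tag name: //*[name()="tagname"], equivalent to //node

8> Nodes with a given number of child nodes (here, exactly two p children): //*[count(p)=2]

9> Several tag names at once, with the results combined as a union: //node1 | //node2

10> Other

II. XPath (case studies)

1. Scraping listing information from 58 second-hand housing (total price, unit price, listing name)

2. Parsing and downloading 4K image data

3. Scraping the names of cities nationwide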

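I. How XPath parsing works

Instantiate an etree object and load the source of the page to be parsed into it;
Call the xpath method of the etree object with an XPath expression to locate tags and capture their content;
Installation: pip install --user lxml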
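
The two steps above can be sketched as follows. This is a minimal illustration, not a complete scraper: the HTML string and the variable names are made up for the example, and only `etree.HTML` and `xpath` come from lxml.

```python
from lxml import etree

# Hypothetical page source, used only for illustration.
html_text = """
<html>
  <body>
    <div id="main">
      <p class="intro">hello</p>
      <p>world</p>
    </div>
  </body>
</html>
"""

# Step 1: load the page source into an etree object.
tree = etree.HTML(html_text)

# Step 2: call xpath() with an XPath expression to locate tags
# and capture their text or attribute values.
paragraphs = tree.xpath('//div[@id="main"]/p/text()')
intro_class = tree.xpath('//p[1]/@class')

print(paragraphs)   # ['hello', 'world']
print(intro_class)  # ['intro']
```

Note that `xpath` always returns a list, even when only one node matches.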

1. Getting the HTML content

1) Loading a local HTML file: etree.parse(local_file_path, parser=parser)

# Load a local HTML file
from lxml import etree

local_file_path = './test3.html'
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(local_file_path, parser=parser)  # load the local HTML content into an etree object
print(type(tree))  # <class 'lxml.etree._ElementTree'>
print(tree)  # <lxml.etree._ElementTree object at 0x0000019C52F16DC0>

Output: as shown in the trailing comments in the code; the object address differs from run to run.
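
The snippet above depends on a `./test3.html` file being present. As a self-contained sketch of the same call, the example below first writes a small, made-up HTML file to a temporary directory and then parses it. The file name `sample.html` and its content are assumptions for illustration only.

```python
import os
import tempfile

from lxml import etree

# Hypothetical HTML content standing in for a real local file.
html_text = "<html><body><div id='div1'><p id='p1'>text1</p></div></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    local_file_path = os.path.join(tmp_dir, "sample.html")
    with open(local_file_path, "w", encoding="utf-8") as f:
        f.write(html_text)

    # Parse the file with an HTML parser, as in the snippet above.
    parser = etree.HTMLParser(encoding="utf-8")
    tree = etree.parse(local_file_path, parser=parser)

# etree.parse returns an _ElementTree; getroot() gives the root _Element.
root = tree.getroot()
texts = tree.xpath('//p[@id="p1"]/text()')

print(type(tree).__name__)  # _ElementTree
print(root.tag)             # html
print(texts)                # ['text1']
```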

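2) Loading the HTML returned by a network request: etree.HTML(response_text)

# Load the HTML returned by a network request
from lxml import etree
import requests

get_url = 'http://www.baidu.com'
page_text = requests.get(get_url).text
print(type(page_text))  # <class 'str'>, the response body is a string
tree = etree.HTML(page_text)
print(type(tree))  # <class 'lxml.etree._Element'>
print(tree)  # <Element html at 0x22c6cc34a00>

Output: as shown in the trailing comments in the code; the object address differs from run to run.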

3) Serializing an etree object: etree.tostring()

from lxml import etree

local_file_path = './test3.html'
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(local_file_path, parser=parser)  # load the local HTML content into an etree object

result = etree.tostring(tree, pretty_print=True, encoding="utf-8")
print(result)  # bytes, so non-ASCII characters are printed as \x escapes

Output:

b'<!DOCTYPE html>\n<html>&#13;\n<head>&#13;\n<meta charset="utf-8"/>&#13;\n<title>my html head title content</title>&#13;\n</head>&#13;\n<body>&#13;\n    <h1>\xe6\x88\x91\xe7\x9a\x84\xe7\xac\xac\xe4\xb8\x80\xe4\xb8\xaa\xe6\xa0\x87\xe9\xa2\x98</h1>&#13;\n    <p>\xe6\x88\x91\xe7\x9a\x84\xe7\xac\xac\xe4\xb8\x80\xe4\xb8\xaa\xe6\xae\xb5\xe8\x90\xbd</p>&#13;\n    <a href="default.htm">\xe6\x88\x91\xe7\x9a\x84\xe8\xb6\x85\xe9\x93\xbe\xe6\x8e\xa5\xe6\x96\x87\xe5\xad\x97</a>&#13;\n    <div id="div1">&#13;\n        <p id="p1">text1</p>&#13;\n        <p id="p2">text2</p>&#13;\n        <p id="p3">text3</p>&#13;\n    </div>&#13;\n    <div class="div2">&#13;\n        <div id="div2-div01">&#13;\n        </div>&#13;\n    </div>&#13;\n    <h2>&#13;\n        <div id="div3">&#13;\n            <div id="div3-div01"/>&#13;\n        </div>&#13;\n    </h2>&#13;\n</body>&#13;\n</html>\n'
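
The output above is hard to read because `etree.tostring` returns bytes when a byte encoding such as "utf-8" is given. A small sketch of two ways to get a readable string follows. The one-line HTML document is made up for the example; its heading reuses the heading text from the output above.

```python
from lxml import etree

# Minimal made-up document; the heading text means "my first heading".
tree = etree.HTML("<html><body><h1>我的第一个标题</h1></body></html>")

# With a byte encoding, tostring() returns bytes.
raw = etree.tostring(tree, encoding="utf-8")

# Option 1: decode the bytes yourself.
text = raw.decode("utf-8")

# Option 2: ask lxml for a str directly with encoding="unicode".
as_str = etree.tostring(tree, encoding="unicode")

print(type(raw).__name__)     # bytes
print(type(as_str).__name__)  # str
print(text)
```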

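2. XPath expressions

The expressions in this section are illustrated against the following sample XML document: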

<?xml version="1.0" encoding="UTF-8"?>
 
<bookstore>
 
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
 
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
 
</bookstore>

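1) Path expressions

Expression	Example	Description
nodename	bookstore	Selects the child elements of the current node that are named nodename. Example: selects the bookstore child elements of the current node.
/	/bookstore	Selects from the root node. Example: selects the root element bookstore. Note: a path that starts with a forward slash ( / ) is always an absolute path to an element.
/	bookstore/book	Between two steps, / selects direct children. Example: selects all book elements that are children of bookstore.
//	//book	Selects matching nodes in the document regardless of their position. Example: selects all book elements, wherever they are in the document.
//	bookstore//book	Selects all book elements that are descendants of bookstore, however deep below bookstore they sit.
.		Selects the current node.
..		Selects the parent of the current node.
@	//@lang	Selects attributes. Example: selects all attributes named lang.
@	//div[@class="my"]/img/@src	Gets the value of the src attribute of the img tag.
text()	/text()	Gets the text directly inside a tag.
text()	li//text()	Gets all text inside the li tag, including the text of its descendant tags.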
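
As a rough check of the table above, the sketch below runs several of these path expressions against the sample bookstore document with lxml. The variable names are chosen for this example only.

```python
from lxml import etree

# The sample bookstore document, passed as bytes because lxml rejects
# str input that carries an XML encoding declaration.
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = etree.fromstring(xml_bytes)  # root is the <bookstore> element

root_matches = root.xpath('/bookstore')       # absolute path to the root element
child_books = root.xpath('/bookstore/book')   # book children of bookstore
all_books = root.xpath('//book')              # book elements anywhere
lang_values = root.xpath('//@lang')           # values of every lang attribute
titles = root.xpath('//title/text()')         # direct text of every title

# Relative expressions are evaluated from the node xpath() is called on.
first_book = root.xpath('book')[0]            # nodename: book children of root
parent_tag = first_book.xpath('..')[0].tag    # ..: parent of the current node
first_title = first_book.xpath('title/text()')

print(len(child_books), len(all_books))  # 2 2
print(lang_values)                       # ['eng', 'eng']
print(titles)                            # ['Harry Potter', 'Learning XML']
print(parent_tag, first_title)           # bookstore ['Harry Potter']
```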

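2) Path expressions with predicates

A predicate is used to find a specific node or a node that contains a specific value.
Predicates are written inside square brackets.

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]//title	Selects all title elements inside the book elements of bookstore whose price element has a value greater than 35.00.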
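
The predicate expressions in the table can be tried against the sample bookstore document as in the sketch below. Variable names are illustrative. Since the sample has only two books, `last()-1` happens to select the first one.

```python
from lxml import etree

xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = etree.fromstring(xml_bytes)

# Positional predicates (XPath positions start at 1, not 0).
first = root.xpath('/bookstore/book[1]/title/text()')
last = root.xpath('/bookstore/book[last()]/title/text()')
second_to_last = root.xpath('/bookstore/book[last()-1]/title/text()')
first_two = root.xpath('/bookstore/book[position()<3]')

# Attribute predicates.
with_lang = root.xpath('//title[@lang]')
lang_eng = root.xpath("//title[@lang='eng']")

# Value predicate: only books whose price is greater than 35.00.
expensive_titles = root.xpath('/bookstore/book[price>35.00]//title/text()')

print(first)             # ['Harry Potter']
print(last)              # ['Learning XML']
print(second_to_last)    # ['Harry Potter']
print(len(first_two))    # 2
print(expensive_titles)  # ['Learning XML']
```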

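3) Selecting unknown nodes

XPath wildcards can be used to select XML elements whose names are not known in advance.

Wildcard	Example	Description
*	/bookstore/*	Matches any element node. Example: selects all child elements of bookstore.
*	//*	Selects all elements in the document.
@*	//title[@*]	Matches any attribute node. Example: selects all title elements that have at least one attribute.
node()		Matches a node of any kind.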
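
A short sketch of the wildcards, again against the sample bookstore document. The variable names are illustrative. The `node()` query also returns the whitespace text nodes between the tags, which is why it yields more items than the element-only `*` query.

```python
from lxml import etree

xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = etree.fromstring(xml_bytes)

# *: any element node.
store_children = [el.tag for el in root.xpath('/bookstore/*')]
all_elements = root.xpath('//*')

# @*: any attribute node; here, title elements with at least one attribute.
titles_with_attrs = root.xpath('//title[@*]')

# node(): any kind of node, so text nodes are returned alongside elements.
book_nodes = root.xpath('/bookstore/book[1]/node()')
book_elements = root.xpath('/bookstore/book[1]/*')

print(store_children)          # ['book', 'book']
print(len(all_elements))       # 7
print(len(titles_with_attrs))  # 2
print(len(book_nodes) > len(book_elements))  # True
```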

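4) Selecting several paths

Using the "|" operator in a path expression, you can select several paths at once.

Path expression	Result
//book/title | //book/price	Selects all title and price elements of book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of the book elements of bookstore, plus all price elements in the document.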
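
The union operator can be tried as in the sketch below, once more using the sample bookstore document. The variable names are illustrative. The tags are sorted before printing so the example does not rely on the order in which lxml returns the combined nodes.

```python
from lxml import etree

xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = etree.fromstring(xml_bytes)

# Each query combines two paths; a node matched by both paths appears once.
book_fields = root.xpath('//book/title | //book/price')
any_fields = root.xpath('//title | //price')
mixed = root.xpath('/bookstore/book/title | //price')

field_tags = sorted(el.tag for el in book_fields)

print(field_tags)                   # ['price', 'price', 'title', 'title']
print(len(any_fields), len(mixed))  # 4 4
```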