XPath, the XML Path Language, is a language for addressing parts of an XML document.
In Python crawlers, XPath is a common, efficient, and convenient way to extract information from a page.
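Installing lxml (XPath support)
XPath support comes from the lxml package. To install it through the Tsinghua mirror in China: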
pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple
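A quick way to confirm the installation worked is to import the package and print its version. This is a minimal check, assuming the command above finished without errors; the version tuple printed depends on what pip installed.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release
print(etree.LXML_VERSION)
```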
Basics
In XPath, each HTML tag corresponds to an element, for example:
html ---> <html> ...</html>
div ---> <div> ...</div>
a ---> <a> ...</a>
The idea behind XPath is to locate nodes through path expressions. Nodes include elements, attributes, and text content.
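Path expression | Meaning |
---|---|
/ | The root node when at the start of a path; otherwise the separator between steps |
// | Matches nodes at any depth |
. | The current node |
@ | Selects an attribute |
nodename | Selects the child elements named nodename |
nodename[@attrib='value'] | Selects the elements whose attribute has the given value. For example, div[@class='cell'] selects all div elements whose class attribute equals cell |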
text() is commonly used to get the text of a node.
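The expressions in the table can be tried on a small document. The following is a minimal sketch, assuming lxml is installed; the HTML snippet and the variable names are made up for illustration.

```python
from lxml import etree

# Made-up snippet, for illustration only
doc = etree.HTML(
    '<div class="cell"><a href="/x">first</a></div>'
    '<div class="other"><a href="/y">second</a></div>'
)

# // matches at any depth; [@class="cell"] filters on an attribute value
cells = doc.xpath('//div[@class="cell"]')

# . starts from the current node, / steps down to a child, text() selects its text
texts = cells[0].xpath('./a/text()')

# @href selects the attribute values themselves
hrefs = doc.xpath('//a/@href')

print(len(cells), texts, hrefs)  # 1 ['first'] ['/x', '/y']
```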
Usage
Create an etree object and load the page source to be parsed into it. There are two ways to do this:
Loading the source from a local file
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())  # ./test.html is the path to a local HTML file
html.xpath('XPath expression')
Loading the source from the internet
from lxml import etree  # import the module
html = etree.HTML(response.text)  # response.text is the page source returned by the request
html.xpath('XPath expression')
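The two loading styles return slightly different objects, although both accept the same XPath queries. The sketch below illustrates this with an in-memory string; `StringIO` stands in for a local file here, and the sample markup is made up.

```python
from io import StringIO
from lxml import etree

source = '<html><body><p>hello</p></body></html>'

# etree.HTML parses a string and returns the root element
root = etree.HTML(source)

# etree.parse accepts a path or a file-like object and returns an ElementTree
tree = etree.parse(StringIO(source), etree.HTMLParser())

from_string = root.xpath('//p/text()')
from_file = tree.xpath('//p/text()')
root_tag = tree.getroot().tag

print(from_string, from_file, root_tag)  # ['hello'] ['hello'] html
```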
XPath examples
The following HTML is used for the demonstration:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>测试</title>
</head>
<body>
<div class="big">
<ul>
<li><a href="https://www.baidu.com/">百度</a></li>
<li><a href="https://weibo.com/">微博</a></li>
<li><a href="https://www.tmall.com/">天猫</a></li>
<p>test1</p>
</ul>
<div>
<a id="aa" href="https://www.iqiyi.com/">爱奇艺</a>
<a id="bb" href="https://v.qq.com/">腾讯视频</a>
<p>test2</p>
</div>
</div>
</body>
</html>
1. Locating by attribute
result1 = html.xpath('//a[@id="aa"]')
print(result1)  # Output: [<Element a at 0x2718236fc00>]
2. Getting text
In XPath, text() extracts the text of a node
result2 = html.xpath('/html/body/div/ul/li[1]/a/text()')  # text() gets the text
print(result2)  # Output: ['百度']
# The result is a list; index it to get the string inside
result3 = html.xpath('/html/body/div/ul/li[1]/a/text()')[0]
print(result3)  # Output: 百度
3. Getting attributes
To get an attribute of a tag, for example the href of an a tag:
result4 = html.xpath('//a[@id="bb"]/@href')  # use the @attribute expression
print(result4)  # Output: ['https://v.qq.com/']
# Again the result is a list; index it to get the string inside
result5 = html.xpath('//a[@id="bb"]/@href')[0]  # use the @attribute expression
print(result5)  # Output: https://v.qq.com/
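Text and attribute extraction are often combined, one element at a time. The following sketch uses a shortened copy of the demo HTML and queries relative to each a element; the empty-result guard is an addition for illustration, not part of the examples above.

```python
from lxml import etree

# Shortened copy of the demo HTML
page = '''
<div class="big">
  <ul>
    <li><a href="https://www.baidu.com/">百度</a></li>
    <li><a href="https://weibo.com/">微博</a></li>
  </ul>
</div>
'''
html = etree.HTML(page)

links = []
for a in html.xpath('//ul/li/a'):
    # Expressions starting with . are evaluated relative to the current <a>
    text = a.xpath('./text()')
    href = a.xpath('./@href')
    # Guard against empty results instead of indexing [0] blindly
    links.append((text[0] if text else '', href[0] if href else ''))

print(links)
# [('百度', 'https://www.baidu.com/'), ('微博', 'https://weibo.com/')]
```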
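Limitations of XPath parsing
XPath only sees the data present in the page source. If a page loads its data dynamically through Ajax, an XPath expression run against that source cannot extract it.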
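The effect is easy to reproduce. In the made-up page below, the list items would only be created by JavaScript in a browser; lxml parses the markup but never runs the script, so the query finds nothing.

```python
from lxml import etree

# Made-up page: the list is meant to be filled in by JavaScript after loading
source = '''
<html><body>
  <ul id="items"></ul>
  <script>
    fetch('/api/items').then(function (r) { return r.json(); }).then(render);
  </script>
</body></html>
'''
html = etree.HTML(source)

# The script is never executed, so the <ul> has no <li> children
items = html.xpath('//ul[@id="items"]/li')
print(items)  # []
```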
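Hands-on crawler
The goal is to collect the IP list from Kuaidaili (快代理) and save it to a CSV file.
From the table on that page we collect these columns: IP, port, anonymity, type, location, response speed, last verified time, and payment type.
The script also lets the user choose how many pages to crawl.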
# Import the required libraries
import csv
import time

import requests
from lxml import etree


# Fetch the IP list from the Kuaidaili site
def get_proxy():
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
        "Referer": "https://www.kuaidaili.com/"
    }
    num = int(input("How many pages do you want to crawl: "))
    # Header row and data rows for the CSV file
    csv_headers = ['IP', 'Port', 'Anonymity', 'Type', 'Location', 'Response speed', 'Last verified', 'Payment type']
    csv_data = []
    for n in range(num):
        # Build the page URL with an f-string
        url = f'https://www.kuaidaili.com/free/inha/{n+1}'
        # Retry with a longer timeout if the request fails or times out
        resp = None
        for attempt in range(5):
            try:
                resp = requests.get(url, headers=headers, timeout=5 if attempt == 0 else 20)
                if resp.status_code == 200:
                    break
            except requests.RequestException:
                resp = None
        if resp is None or resp.status_code != 200:
            print(f'Skipping page {n+1}: request failed')
            continue
        # Parse the page source
        html = etree.HTML(resp.text)
        rows = html.xpath("//*[@id='list']/table/tbody/tr")
        for i, row in enumerate(rows):
            # Cell texts in column order: IP, port, anonymity, type (HTTP or HTTPS),
            # location, response speed, last verified time, payment type
            cells = row.xpath('./td/text()')
            if len(cells) < 8:
                continue  # skip rows that do not have all eight columns
            csv_data.append(cells[:8])  # add the row
            print(f'Adding row {i+1} of page {n+1}')
        time.sleep(0.8)
    # Write the CSV file
    with open('test2.csv', 'w', newline='', encoding='utf-8-sig') as file:
        writer = csv.writer(file)
        writer.writerow(csv_headers)  # write the header row
        writer.writerows(csv_data)  # write all data rows
    print(csv_data)


if __name__ == "__main__":
    get_proxy()
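The row extraction can be checked without touching the network by running it against a small sample. The markup below is hypothetical, modeled only on the XPath used in the script; the real page may be structured differently. `parse_rows` is an illustrative helper name, not part of the script.

```python
from lxml import etree

# Hypothetical markup modeled on the XPath in the script; the real page may differ
SAMPLE = '''
<div id="list"><table><tbody>
  <tr><td>1.2.3.4</td><td>8080</td><td>high</td><td>HTTP</td></tr>
  <tr><td>5.6.7.8</td><td>3128</td><td>high</td><td>HTTPS</td></tr>
</tbody></table></div>
'''


def parse_rows(source):
    """Return one list of stripped cell texts per table row."""
    html = etree.HTML(source)
    rows = []
    for tr in html.xpath("//*[@id='list']/table/tbody/tr"):
        # string(.) also picks up text nested inside child tags of a cell
        rows.append([td.xpath('string(.)').strip() for td in tr.xpath('./td')])
    return rows


rows = parse_rows(SAMPLE)
print(rows)
# [['1.2.3.4', '8080', 'high', 'HTTP'], ['5.6.7.8', '3128', 'high', 'HTTPS']]
```

Querying each row once and reading its cells relative to that row avoids re-running the full-document XPath for every column.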
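Notes
If the connection times out with no response, possible causes include:
- The crawler was detected
- The URL was entered incorrectly
- The server has stopped providing the service
Several approaches can help: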
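Approach 1: catch errors in a loop and retry the request:
# Retry if the request fails or times out
resp = None
for attempt in range(5):
    try:
        # Short timeout on the first attempt, longer timeout on retries
        resp = requests.get(url, headers=headers, timeout=5 if attempt == 0 else 20)
        if resp.status_code == 200:
            break
    except requests.RequestException:
        continue
if resp is None:
    raise RuntimeError('all requests failed')
# Parse the page source
html = etree.HTML(resp.text)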
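The retry logic can also be wrapped in a small reusable helper. The sketch below is one possible shape, not the only one: `fetch_with_retry` and `flaky` are hypothetical names, and the simulated failure stands in for a real request so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, attempts=5, delay=1.0):
    """Call fetch() until it returns without raising, at most `attempts` times."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # narrow this to requests.RequestException in real use
            last_error = exc
            time.sleep(delay * (attempt + 1))  # wait a little longer each time
    raise RuntimeError(f'all {attempts} attempts failed') from last_error


# Stand-in for a flaky request: fails twice, then succeeds
calls = {'n': 0}


def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise TimeoutError('simulated timeout')
    return 'ok'


result = fetch_with_retry(flaky, delay=0)
print(result, calls['n'])  # ok 3
```

With requests, the callable could be something like `lambda: requests.get(url, headers=headers, timeout=10)`.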
Approach 2: build a list of headers and pick one at random for each request
import random
headers_list = [
{
'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 10; SM-G981B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (iPad; CPU OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/87.0.4280.77 Mobile/15E148 Safari/604.1'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.109 Safari/537.36 CrKey/1.54.248666'
}, {
'user-agent': 'Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.188 Safari/537.36 CrKey/1.54.250320'
}, {
'user-agent': 'Mozilla/5.0 (BB10; Touch) AppleWebKit/537.10+ (KHTML, like Gecko) Version/10.0.9.2372 Mobile Safari/537.10+'
}, {
'user-agent': 'Mozilla/5.0 (PlayBook; U; RIM Tablet OS 2.1.0; en-US) AppleWebKit/536.2+ (KHTML like Gecko) Version/7.2.1.0 Safari/536.2+'
}, {
'user-agent': 'Mozilla/5.0 (Linux; U; Android 4.3; en-us; SM-N900T Build/JSS15J) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'
}, {
'user-agent': 'Mozilla/5.0 (Linux; U; Android 4.1; en-us; GT-N7100 Build/JRO03C) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'
}, {
'user-agent': 'Mozilla/5.0 (Linux; U; Android 4.0; en-us; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 7.0; SM-G950U Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G965U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.111 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 8.1.0; SM-T837A) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.80 Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; U; en-us; KFAPWI Build/JDQ39) AppleWebKit/535.19 (KHTML, like Gecko) Silk/3.13 Safari/535.19 Silk-Accelerated=true'
}, {
'user-agent': 'Mozilla/5.0 (Linux; U; Android 4.4.2; en-us; LGMS323 Build/KOT49I.MS32310c) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/102.0.0.0 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; Lumia 550) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Mobile Safari/537.36 Edge/14.14263'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 6.0.1; Moto G (4)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 10 Build/MOB31T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 8.0.0; Nexus 5X Build/OPR4.170623.006) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 7.1.1; Nexus 6 Build/N6F26U) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 8.0.0; Nexus 6P Build/OPP3.170518.006) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 7 Build/MOB30X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 520)'
}, {
'user-agent': 'Mozilla/5.0 (MeeGo; NokiaN9) AppleWebKit/534.13 (KHTML, like Gecko) NokiaBrowser/8.5.0 Mobile Safari/534.13'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 9; Pixel 3 Build/PQ1A.181105.017.A1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 10; Pixel 4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 11; Pixel 3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.181 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (Linux; Android 8.0.0; Pixel 2 XL Build/OPD1.170816.004) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
}, {
'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1'
}, {
'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1'
}, {
'user-agent': 'Mozilla/5.0 (iPad; CPU OS 11_0 like Mac OS X) AppleWebKit/604.1.34 (KHTML, like Gecko) Version/11.0 Mobile/15A5341f Safari/604.1'
}
]
# Pick one at random
headers = random.choice(headers_list)
Approach 3: set a proxy in code
# Proxy URLs should include the scheme
proxies = {
    'http': 'http://127.0.0.1:1212',
    'https': 'http://127.0.0.1:1212'
}
res = requests.get(url, headers=headers, proxies=proxies, timeout=20)