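XPath usage practice
I picked a novel site as an example and scrape the contents of its rankings page.
I. Analyzing the page structure
Inside a div with id main there are 6 divs whose class names run from box b1 to box b4.
Below each of those divs, the nested tags are ul, li, and a.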
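To make the structure concrete, here is a minimal sketch run against a made-up HTML fragment that mirrors the layout described above. The markup, list titles, novel names, and links are all invented placeholders, not content from the real site; the h3 title element is assumed from the XPath used in the code further down.

```python
from lxml import etree

# Hypothetical markup that imitates the described layout (not the real page).
SAMPLE = '''
<div id="main">
  <div class="box b1">
    <h3>Weekly</h3>
    <ul>
      <li><a href="/book/1.html">Novel A</a></li>
      <li><a href="/book/2.html">Novel B</a></li>
    </ul>
  </div>
  <div class="box b2">
    <h3>Monthly</h3>
    <ul>
      <li><a href="/book/3.html">Novel C</a></li>
    </ul>
  </div>
</div>
'''

doc = etree.HTML(SAMPLE)
# Walk the path div#main -> div "box b1" -> ul -> li -> a and read the link text.
names = doc.xpath('//div[@id="main"]/div[@class="box b1"]/ul/li/a/text()')
# The same path ending in @href returns the attribute values instead.
links = doc.xpath('//div[@id="main"]/div[@class="box b1"]/ul/li/a/@href')
print(names)
print(links)
```

Ending the expression in text() yields the visible link text, while ending it in @href yields the attribute; everything before that last step is the same element path.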
II. Writing the code
There are two ways to fetch the data:
1. Fetch every list in a loop (6 lists)
import requests
from lxml import etree

if __name__ == '__main__':
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0'}
    web = requests.get('https://www.xxx.com/paihang.html', timeout=7, headers=headers)
    selector = etree.HTML(web.text)
    # Loop over the classes "box b1" to "box b4", which together cover all 6 lists
    for n in range(1, 5):
        # List title(s) for this class
        title = selector.xpath('//div[@class="box b%d"]/h3/text()' % n)
        print(title)
        # Novel names
        article = selector.xpath('//div[@class="box b%d"]/ul/li/a/text()' % n)
        # Novel links
        hrefs = selector.xpath('//div[@class="box b%d"]/ul/li/a/@href' % n)
        # Print each name next to its link
        for i, (name, href) in enumerate(zip(article, hrefs)):
            print(i, name, href)
Output: each list's title, followed by its novels as numbered entries, one per line.
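The separate queries for names and links only line up if every a element has both text and an href. A slightly more defensive sketch is to select the a elements once and read both values from each node, resolving relative links with urllib.parse.urljoin. The HTML fragment below is invented for illustration, the helper name extract_entries is hypothetical, and the base URL is the placeholder address used in the example.

```python
from urllib.parse import urljoin

from lxml import etree

BASE_URL = 'https://www.xxx.com/paihang.html'  # placeholder address, not a real site

# Hypothetical markup for a single list (not the real page).
SAMPLE = '''
<div class="box b1">
  <h3>Weekly</h3>
  <ul>
    <li><a href="/book/1.html">Novel A</a></li>
    <li><a href="book/2.html">Novel B</a></li>
  </ul>
</div>
'''


def extract_entries(doc, n, base_url):
    """Return (name, absolute_link) pairs for the list with class "box b<n>"."""
    entries = []
    for a in doc.xpath('//div[@class="box b%d"]/ul/li/a' % n):
        # Read text and href from the same node so they cannot drift apart.
        name = (a.text or '').strip()
        link = urljoin(base_url, a.get('href', ''))
        entries.append((name, link))
    return entries


entries = extract_entries(etree.HTML(SAMPLE), 1, BASE_URL)
for i, (name, link) in enumerate(entries):
    print(i, name, link)
```

Whether the real site uses relative or absolute links is not known here; urljoin leaves absolute URLs unchanged, so the sketch should be harmless in either case.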
2. Use the shared attribute to fetch all lists at once
import requests
from lxml import etree

if __name__ == '__main__':
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0'}
    web = requests.get('https://www.xxx.com/paihang.html', timeout=7, headers=headers)
    selector = etree.HTML(web.text)
    # Novel names from every div whose class starts with "box"
    article = selector.xpath('//div[starts-with(@class, "box")]/ul/li/a/text()')
    # Novel links from the same divs
    hrefs = selector.xpath('//div[starts-with(@class, "box")]/ul/li/a/@href')
    # Print each name next to its link
    for i, (name, href) in enumerate(zip(article, hrefs)):
        print(i, name, href)
Output: the novels from all lists as one numbered sequence.
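One side effect of the shared-attribute query is that names from all lists come back as a single flat list, so it is no longer clear which list each novel belongs to. A possible way to keep that information, sketched here on invented sample markup, is to select the div nodes first and then run relative XPath expressions (starting with ./) on each one. The helper name group_by_list and all titles and names are hypothetical.

```python
from lxml import etree

# Hypothetical markup with two lists (not the real page).
SAMPLE = '''
<div id="main">
  <div class="box b1">
    <h3>Weekly</h3>
    <ul>
      <li><a href="/book/1.html">Novel A</a></li>
      <li><a href="/book/2.html">Novel B</a></li>
    </ul>
  </div>
  <div class="box b2">
    <h3>Monthly</h3>
    <ul>
      <li><a href="/book/3.html">Novel C</a></li>
    </ul>
  </div>
</div>
'''


def group_by_list(doc):
    """Map each list title to the novel names found inside that list's div."""
    groups = {}
    for box in doc.xpath('//div[starts-with(@class, "box")]'):
        # string(./h3) returns the title text as a plain string ('' if missing).
        title = box.xpath('string(./h3)').strip()
        # A leading ./ keeps the query inside the current div.
        groups[title] = box.xpath('./ul/li/a/text()')
    return groups


groups = group_by_list(etree.HTML(SAMPLE))
for title, names in groups.items():
    print(title, names)
```

Note that starts-with(@class, "box") also matches any class value that merely begins with those letters, so on a page with unrelated classes of that shape the predicate may need tightening.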