Installation
pip install selectolax
Advantages
- Very fast parsing
- Robust parsing; it can recover data that other parsing libraries drop on malformed HTML (a rare edge case that most people never hit)
Disadvantages
- Only supports CSS selectors, not XPath
Fast parsing: parsing the same document 1000 times, selectolax is roughly 3x faster than lxml
# -*- coding: utf-8 -*-
# @Author : markadc

import time

import requests
from lxml import etree
from selectolax.parser import HTMLParser

url = 'https://www.baidu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36 Edg/103.0.1264.71'
}
html = requests.get(url, headers=headers).text


def use_lxml():
    start = time.time()
    for _ in range(1000):
        tree = etree.HTML(html)
        lis = tree.xpath('//ul[@id="hotsearch-content-wrapper"]/li')
    end = time.time()
    print(f'Took {end - start:.2f}s using lxml')


def use_selectolax():
    start = time.time()
    for _ in range(1000):
        html_parser = HTMLParser(html)
        lis = html_parser.css('ul#hotsearch-content-wrapper > li')
    end = time.time()
    print(f'Took {end - start:.2f}s using selectolax')


if __name__ == '__main__':
    use_lxml()
    use_selectolax()
The code above produces output like the following
Robust parsing
- The image above shows an HTML file downloaded by a crawler to the local disk; a browser XPath plugin applied to the same rule extracts 50 results
- But in actual code, extraction with XPath yields fewer than 50; the detailed code is below
# -*- coding: utf-8 -*-
# @Author : markadc

from bs4 import BeautifulSoup
from lxml import etree
from selectolax.parser import HTMLParser

with open('./err.html', 'r') as f:
    html = f.read()

# bs4
soup = BeautifulSoup(html, 'lxml')
result = soup.select('div#questions > div')
print(f'Got {len(result)} divs using bs4')

# lxml
tree = etree.HTML(html)
result = tree.xpath('//div[@id="questions"]/div')
print(f'Got {len(result)} divs using lxml xpath')
result = tree.cssselect('div#questions > div')
print(f'Got {len(result)} divs using lxml css')

# selectolax
html_parser = HTMLParser(html)
result = html_parser.css('div#questions > div')
print(f'Got {len(result)} divs using selectolax css')
The code above produces output like the following
- Only selectolax parses out the correct count of 50
Summary
- selectolax is both fast and robust