Installation
pip install selectolax
Advantages
- Very fast parsing
- Robust parsing; it can recover data that other parsing libraries drop on malformed HTML (a rare edge case that most people never hit)
Disadvantages
- Only supports CSS selectors, not XPath
Fast parsing: parsing the same document 1000 times, selectolax is roughly 3x faster than lxml
# -*- coding: utf-8 -*-
# @Author : markadc

import time

import requests
from lxml import etree
from selectolax.parser import HTMLParser

url = 'https://www.baidu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36 Edg/103.0.1264.71'
}
html = requests.get(url, headers=headers).text


def use_lxml():
    start = time.time()
    for _ in range(1000):
        tree = etree.HTML(html)
        lis = tree.xpath('//ul[@id="hotsearch-content-wrapper"]/li')
    end = time.time()
    print(f'Took {end - start:.2f}s using lxml')


def use_selectolax():
    start = time.time()
    for _ in range(1000):
        html_parser = HTMLParser(html)
        lis = html_parser.css('ul#hotsearch-content-wrapper > li')
    end = time.time()
    print(f'Took {end - start:.2f}s using selectolax')


if __name__ == '__main__':
    use_lxml()
    use_selectolax()
The code above produces output like the following
Robust parsing
- The image above shows an HTML file downloaded by a crawler to the local disk; a browser XPath plugin applied to the same rule extracts 50 results
- But in actual code, extraction with XPath yields fewer than 50; the detailed code is below
# -*- coding: utf-8 -*-
# @Author : markadc

from bs4 import BeautifulSoup
from lxml import etree
from selectolax.parser import HTMLParser

with open('./err.html', 'r') as f:
    html = f.read()

# bs4
soup = BeautifulSoup(html, 'lxml')
result = soup.select('div#questions > div')
print(f'Got {len(result)} divs using bs4')

# lxml
tree = etree.HTML(html)
result = tree.xpath('//div[@id="questions"]/div')
print(f'Got {len(result)} divs using lxml xpath')
result = tree.cssselect('div#questions > div')
print(f'Got {len(result)} divs using lxml css')

# selectolax
html_parser = HTMLParser(html)
result = html_parser.css('div#questions > div')
print(f'Got {len(result)} divs using selectolax css')
The code above produces output like the following
- Only selectolax parses out the correct count of 50
Summary
- selectolax is both fast and robust