Performance comparison: lxml, regular expressions, and BeautifulSoup, with timing data showing that the lxml parser is fast

Hello，小高同学

Category: Python爬虫项目实战

Article link: https://blog.csdn.net/huang1600301017/article/details/83478140

Parsers supported by Beautiful Soup

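Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Fast; tolerant of malformed documents	Requires an external C library
lxml XML parser	`BeautifulSoup(markup, "xml")`	Fast; the only parser that supports XML	Requires an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Best tolerance of malformed documents; parses pages the way a browser does; produces HTML5-style documents	Slow; requires an external Python package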

As the comparison shows, the lxml parser can handle both HTML and XML, is fast, and tolerates malformed documents well, so it is the recommended choice.
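To make the table concrete, here is a minimal sketch of how the parser name is passed to `BeautifulSoup`. The sample markup `BROKEN_HTML` and the helper `parse_with` are made up for illustration; `FeatureNotFound` is the exception Beautiful Soup raises when the requested parser is not installed. Different parsers may repair the same broken markup differently, so the printed lists are not guaranteed to match.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, deliberately malformed markup: unclosed <li> and <p> tags.
BROKEN_HTML = "<ul><li>first<li>second</ul><p>unclosed paragraph"


def parse_with(parser_name):
    """Return the parsed tree, or None if that parser is not installed."""
    try:
        return BeautifulSoup(BROKEN_HTML, parser_name)
    except FeatureNotFound:
        return None


results = {}
for parser_name in ("html.parser", "lxml", "html5lib"):
    soup = parse_with(parser_name)
    if soup is None:
        print(parser_name, "is not installed, skipped")
        continue
    # Collect the text of every <li> the parser recovered.
    results[parser_name] = [li.get_text() for li in soup.find_all("li")]
    print(parser_name, results[parser_name])
```

Only `html.parser` ships with Python; `lxml` and `html5lib` have to be installed separately, which is why the sketch skips them when they are missing.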

Many Python crawler engineers say the same thing. Why? The example below puts the claim to the test.

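Example: scrape the text posts on Qiushibaike (糗事百科) and compare how each approach performs. The fields collected are the user ID, the text of the joke, the number of "funny" votes, and the number of comments, as shown in the figure: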

Code:

import re
import time

import requests
from bs4 import BeautifulSoup
from lxml import etree

# Browser-like User-Agent, sent with every request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}


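def get_lxml(url):
    """Fetch one page and extract the fields with lxml and XPath."""
    datas = []
    res = requests.get(url, headers=headers)
    html = etree.HTML(res.text)
    infos = html.xpath('//div[@id="content-left"]/div')
    for info in infos:
        # Every path below is relative to the current post's <div>.
        ids = info.xpath('div[1]/a[2]/h2/text()')
        texts_x = info.xpath('a[1]/div[1]/span[1]/text()')
        num_laughs = info.xpath('div[2]/span[1]/i[1]/text()')
        # Relative path; assumes the comment count sits in the second <span> of the
        # stats <div>. An absolute path such as '//*/i/text()' matches every <i> on the page.
        num_comments = info.xpath('div[2]/span[2]//i/text()')
        for user_id, text_x, num_laugh, num_comment in zip(ids, texts_x, num_laughs, num_comments):
            data = {
                'id': user_id.strip(),
                'text': text_x.strip(),
                'num_laugh': num_laugh,
                'num_comment': num_comment
            }
            datas.append(data)
    return datas

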
def get_re(url):
    """Fetch one page and extract the fields with regular expressions."""
    datas = []
    res = requests.get(url, headers=headers)
    # Raw strings keep escapes such as \d from being treated as string escapes.
    ids = re.findall(r'<h2>(.*?)</h2>', res.text, re.S)
    texts_x = re.findall(r'<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    num_laughs = re.findall(r'<span class="stats-vote"><i class="number">(.*?)</i> 好笑</span>', res.text, re.S)
    num_comments = re.findall(r'<i class="number">(\d+)</i> 评论', res.text, re.S)
    for user_id, text_x, num_laugh, num_comment in zip(ids, texts_x, num_laughs, num_comments):
        data = {
            'id': user_id.strip(),
            'text': text_x.strip(),
            'num_laugh': num_laugh.strip(),
            'num_comment': num_comment.strip()
        }
        datas.append(data)
    return datas

def get_BeautifulSoup(url):
    """Fetch one page and extract the fields with BeautifulSoup and CSS selectors."""
    datas = []
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'lxml')
    ids = soup.select('div.author.clearfix > a > h2')
    texts_x = soup.select('a.contentHerf > div > span')
    num_laughs = soup.select('div.stats > span.stats-vote > i')
    # Assumption: comment counts live under span.stats-comments.
    # A bare 'i' selector would match every <i> on the page, vote counts included.
    num_comments = soup.select('div.stats > span.stats-comments i')
    for user_id, text_x, num_laugh, num_comment in zip(ids, texts_x, num_laughs, num_comments):
        data = {
            'id': user_id.get_text().strip(),
            'text': text_x.get_text().strip(),
            'num_laugh': num_laugh.get_text().strip(),
            'num_comment': num_comment.get_text().strip()
        }
        datas.append(data)
    return datas


if __name__ == '__main__':
    urls = ['https://www.qiushibaike.com/text/page/{}/'.format(i) for i in range(1, 14)]
    for name, function in [('lxml', get_lxml), ('re', get_re), ('BeautifulSoup', get_BeautifulSoup)]:
        # Elapsed time covers download, parsing, and the 0.2 s pause for each of the 13 pages.
        start = time.perf_counter()
        for url in urls:
            function(url)
            time.sleep(0.2)
        end = time.perf_counter()
        print(name, end - start)
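One caveat about the script above: each function downloads the page itself, so network latency is part of every measurement. As a rough sketch of how parsing cost could be isolated, the snippet below times the three approaches on one in-memory HTML string with `timeit`. The sample markup `ITEM`, the helper names, and the repeat count are assumptions made for illustration, not the real page; approaches whose library is not installed are skipped.

```python
import re
import timeit

# Hypothetical markup, loosely modelled on the class names used in the selectors above.
ITEM = (
    '<div class="article"><div class="author clearfix"><a><h2>user{n}</h2></a></div>'
    '<a class="contentHerf"><div class="content"><span>joke {n}</span></div></a>'
    '<div class="stats"><span class="stats-vote"><i class="number">{n}</i> 好笑</span></div></div>'
)
HTML = "<html><body>" + "".join(ITEM.format(n=n) for n in range(200)) + "</body></html>"


def parse_re(html):
    """Extract the user names with a regular expression."""
    return re.findall(r"<h2>(.*?)</h2>", html, re.S)


candidates = {"re": parse_re}

try:
    from lxml import etree

    def parse_lxml(html):
        """Extract the user names with lxml and XPath."""
        return etree.HTML(html).xpath("//h2/text()")

    candidates["lxml"] = parse_lxml
except ImportError:
    pass  # lxml not installed: skip it

try:
    from bs4 import BeautifulSoup

    # Use the lxml backend when available, otherwise the built-in parser.
    BS_PARSER = "lxml" if "lxml" in candidates else "html.parser"

    def parse_bs4(html):
        """Extract the user names with BeautifulSoup."""
        return [h2.get_text() for h2 in BeautifulSoup(html, BS_PARSER).find_all("h2")]

    candidates["BeautifulSoup"] = parse_bs4
except ImportError:
    pass  # bs4 not installed: skip it

timings = {}
for name, func in candidates.items():
    assert len(func(HTML)) == 200  # sanity check: every approach finds all 200 names
    timings[name] = timeit.timeit(lambda: func(HTML), number=20)
    print(f"{name}: {timings[name]:.4f} s for 20 parses")
```

Because no request is made inside the timed call, any difference between the numbers comes from parsing alone. The absolute values depend on the machine and on the library versions.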

Results of running the scraper script: