Python Crawler Study Notes: Comparing the Efficiency of try and if Inside a for Loop

九圣残炎

Article link: https://blog.csdn.net/qq_40878316/article/details/106423600

When scraping Bilibili data, some videos have no description or duration, so indexing the result of an XPath query raises IndexError. For example:

abstract = res.xpath('div[@class="r"]/div[@class="v-desc"]/text()')
times = res.xpath('div[@class="l"]//span[@class="dur"]/text()')
# If the scraped data is empty, the results are abstract=[] and times=[].
# Neither list has any elements, so abstract[0] and times[0] go out of range:
# the index exceeds the length of the list.

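So when I later refactored the code, I added if statements to guard against this: when the list is empty, a placeholder value is assigned; otherwise the value of list[0] is used.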

for res in result:
    image = res.xpath('div[@class="l"]//div[@class="lazy-img"]/img/@src')[0]
    url = res.xpath('div[@class="l"]//a/@href')[0].replace('//', '')
    times = res.xpath('div[@class="l"]//span[@class="dur"]/text()')
    title = res.xpath('div[@class="r"]/a/text()')[0]
    abstract = res.xpath('div[@class="r"]/div[@class="v-desc"]/text()')
    num = res.xpath('div[@class="r"]/div[@class="v-info"]/span/span/text()')
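    # Page 66 contains a video with an empty description, which raises
    # "IndexError: list index out of range", so a check is needed here.
    # The other checks work the same way.
    if len(abstract) == 0:
        abstract = 'No description'
    else:
        abstract = abstract[0]
    if len(times) == 0:
        times = 'Unknown'
    else:
        times = times[0]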
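
The repeated if/else pairs above can be folded into one small helper. The sketch below is only an illustration and not part of the original scraper: `first_or_default` is a hypothetical name, and plain Python lists stand in for the lists that `res.xpath(...)` returns.

```python
def first_or_default(values, default):
    """Return the first element of values, or default when values is empty."""
    # An empty list is falsy, so no explicit len() check is needed.
    return values[0] if values else default


# Plain lists stand in for XPath results here (an assumption for illustration).
abstract = first_or_default([], 'No description')
times = first_or_default(['12:34'], 'Unknown')
print(abstract, times)
```

With such a helper, each optional field takes one line in the loop, and the check still runs on every iteration, exactly as in the if version.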

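But on reflection, I wondered whether running the if checks on every iteration adds overhead, so I then added a try statement so that the checks run only after an exception has occurred.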

    for res in result:
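        # If no data is empty, no index error occurs and we scrape normally;
        # if an error is raised, we fall back to the checks below.
        # Note that we do not use if here to test whether each value is empty.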
        try:
            image = res.xpath('div[@class="l"]//div[@class="lazy-img"]/img/@src')[0]
            url = res.xpath('div[@class="l"]//a/@href')[0].replace('//', '')
            times = res.xpath('div[@class="l"]//span[@class="dur"]/text()')[0]
            title = res.xpath('div[@class="r"]/a/text()')[0]
            abstract = res.xpath('div[@class="r"]/div[@class="v-desc"]/text()')[0]
            num = res.xpath('div[@class="r"]/div[@class="v-info"]/span/span/text()')
        except IndexError:
            image = res.xpath('div[@class="l"]//div[@class="lazy-img"]/img/@src')[0]
            url = res.xpath('div[@class="l"]//a/@href')[0].replace('//', '')
            times = res.xpath('div[@class="l"]//span[@class="dur"]/text()')
            title = res.xpath('div[@class="r"]/a/text()')[0]
            abstract = res.xpath('div[@class="r"]/div[@class="v-desc"]/text()')
            num = res.xpath('div[@class="r"]/div[@class="v-info"]/span/span/text()')
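            # Page 66 contains a video with an empty description, which raises
            # "IndexError: list index out of range", so a check is needed here.
            # The other checks work the same way.
            if len(abstract) == 0:
                abstract = 'No description'
            else:
                abstract = abstract[0]
            if len(times) == 0:
                times = 'Unknown'
            else:
                times = times[0]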
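
In the try version above, the except branch repeats every XPath query. One possible variation, sketched here under the assumption that only some fields are optional, is to wrap each optional lookup in its own try. `first_or_default_try` is a hypothetical helper, and plain lists again stand in for XPath results.

```python
def first_or_default_try(values, default):
    """Return values[0], falling back to default if the list is empty."""
    try:
        return values[0]
    except IndexError:
        # Only reached for empty lists; the common path does no length check.
        return default


# Plain lists stand in for XPath results here (an assumption for illustration).
times = first_or_default_try([], 'Unknown')
title = first_or_default_try(['Some video'], 'Untitled')
print(times, title)
```

This keeps the "handle the exception only when it happens" idea while avoiding a second round of queries for the fields that were already extracted successfully.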

So which of the two approaches is more efficient? I ran a rather unrigorous test, comparing how long a scraping run takes with each approach.

First, scraping pages 1-99 (page 66 is known to trigger the error):

Run	if	try
1	255.25015354156494	253.46807074546814
2	253.84186816215515	252.08237195014954
3	253.26841568946838	251.44973397254944

Next, pages 1-50 (no data here triggers the exception):

Run	if	try
1	128.0856523513794	128.18170762062073
2	127.2362470626831	125.75610876083374
3	126.44024205207825	125.43329238891602

Finally, page 1 and pages 51-100 (includes data that triggers the exception):

Run	if	try
1	141.31432151794434	141.8306381702423
2	126.97643280029297	137.30469703674316
3	135.45368721348794	140.42737317085266

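From these numbers it looks as if the if check is somewhat more efficient when exceptions occur, while try is somewhat more efficient when there are none. But given how unstable the network is, this comparison is not rigorous, so I wrote the following code to compare the two.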
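
To see how much of the difference could be noise, the timings from the three tables can be summarized by their mean and spread. This is a minimal sketch: the numbers are copied from the tables, the dictionary layout is my own, and the unit is presumably seconds, which the original does not state.

```python
from statistics import mean, stdev

# Timings copied from the three tables (presumably seconds; unit not stated).
runs = {
    'pages 1-99': {
        'if': [255.25015354156494, 253.84186816215515, 253.26841568946838],
        'try': [253.46807074546814, 252.08237195014954, 251.44973397254944],
    },
    'pages 1-50': {
        'if': [128.0856523513794, 127.2362470626831, 126.44024205207825],
        'try': [128.18170762062073, 125.75610876083374, 125.43329238891602],
    },
    'pages 1, 51-100': {
        'if': [141.31432151794434, 126.97643280029297, 135.45368721348794],
        'try': [141.8306381702423, 137.30469703674316, 140.42737317085266],
    },
}

for label, timings in runs.items():
    for method, values in timings.items():
        # stdev is the sample standard deviation over the three runs.
        print(f'{label:16} {method:3} mean={mean(values):8.2f} stdev={stdev(values):5.2f}')
```

In the third table, for instance, the spread of the three if runs is larger than the gap between the two means, which supports the point that network variance dominates these measurements.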

import random
import time

T = [1]  # non-empty list
F = []   # empty list
candidates = [T, F]
demoList1 = []
demoList2 = []
demoList3 = []
demoList4 = []

# Case 1: random mix of empty and non-empty lists
for i in range(1000000):
    demoList1.append(candidates[random.randint(0, 1)])
beginTime = time.time()
count = 0
num = 0
FNum = 0
for res in demoList1:
    if len(res) > 0:
        num += res[0]
    else:
        num += 1
        FNum += 1
    count += num
endTime = time.time()
print(f'F count: {FNum}, if check took {endTime - beginTime}')

beginTime = time.time()
count = 0
num = 0
FNum = 0
for res in demoList1:
    try:
        num += res[0]
    except IndexError:
        num += 1
        FNum += 1
    finally:
        count += num
endTime = time.time()
print(f'F count: {FNum}, try check took {endTime - beginTime}')

# Case 2: very few empty lists
for i in range(1000000):
    demoList2.append(candidates[0])
for i in range(100):
    # randrange excludes the upper bound, so the index is always in range
    # (randint(0, 1000000) could return 1000000 and raise IndexError).
    demoList2[random.randrange(1000000)] = F

beginTime = time.time()
count = 0
num = 0
FNum = 0
for res in demoList2:
    if len(res) > 0:
        num += res[0]
    else:
        num += 1
        FNum += 1
    count += num
endTime = time.time()
print(f'F count: {FNum}, if check took {endTime - beginTime}')

beginTime = time.time()
count = 0
num = 0
FNum = 0
for res in demoList2:
    try:
        num += res[0]
    except IndexError:
        num += 1
        FNum += 1
    finally:
        count += num
endTime = time.time()
print(f'F count: {FNum}, try check took {endTime - beginTime}')

# Case 3: no empty lists
for i in range(1000000):
    demoList3.append(T)

beginTime = time.time()
count = 0
num = 0
FNum = 0
for res in demoList3:
    if len(res) > 0:
        num += res[0]
    else:
        num += 1
        FNum += 1
    count += num
endTime = time.time()
print(f'F count: {FNum}, if check took {endTime - beginTime}')

beginTime = time.time()
count = 0
num = 0
FNum = 0
for res in demoList3:
    try:
        num += res[0]
    except IndexError:
        num += 1
        FNum += 1
    finally:
        count += num
endTime = time.time()
print(f'F count: {FNum}, try check took {endTime - beginTime}')

# Case 4: all lists empty
for i in range(1000000):
    demoList4.append(F)
beginTime = time.time()
count = 0
num = 0
FNum = 0
for res in demoList4:
    if len(res) > 0:
        num += res[0]
    else:
        num += 1
        FNum += 1
    count += num
endTime = time.time()
print(f'F count: {FNum}, if check took {endTime - beginTime}')

beginTime = time.time()
count = 0
num = 0
FNum = 0
for res in demoList4:
    try:
        num += res[0]
    except IndexError:
        num += 1
        FNum += 1
    finally:
        count += num
endTime = time.time()
print(f'F count: {FNum}, try check took {endTime - beginTime}')

First run

F count: 500775, if check took 0.7730720043182373
F count: 500775, try check took 0.9919514656066895
F count: 100, if check took 0.7140657901763916
F count: 100, try check took 0.5791869163513184
F count: 0, if check took 0.7121121883392334
F count: 0, try check took 0.5681884288787842
F count: 1000000, if check took 0.793057918548584
F count: 1000000, try check took 1.504162073135376

Second run

F count: 499578, if check took 0.8145346641540527
F count: 499578, try check took 1.003842830657959
F count: 100, if check took 0.812556266784668
F count: 100, try check took 0.6121621131896973
F count: 0, if check took 0.7909219264984131
F count: 0, try check took 0.6411514282226562
F count: 1000000, if check took 0.8359837532043457
F count: 1000000, try check took 1.98175048828125

Third run

F count: 499816, if check took 0.7385780811309814
F count: 499816, try check took 0.9664478302001953
F count: 100, if check took 0.6796119213104248
F count: 100, try check took 0.5826675891876221
F count: 0, if check took 0.6956019401550293
F count: 0, try check took 0.5746748447418213
F count: 1000000, if check took 0.775557279586792
F count: 1000000, try check took 1.4921455383300781

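The results show that when there are no empty lists or very few of them, try is more efficient than if; but when the number of empty lists is large, the if check is more efficient than try.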
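
For readers who want to repeat the comparison at other ratios of empty lists, the same idea can be sketched with the standard `timeit` module. This is a simplified variation rather than the script above: the function names, the list size, and the fractions are my own choices, and absolute numbers will depend on the Python version and machine.

```python
import random
import timeit


def sum_with_if(data):
    """Add item[0] for non-empty lists and 1 for empty ones, using an if check."""
    total = 0
    for item in data:
        if item:
            total += item[0]
        else:
            total += 1
    return total


def sum_with_try(data):
    """Same computation, but rely on IndexError to detect empty lists."""
    total = 0
    for item in data:
        try:
            total += item[0]
        except IndexError:
            total += 1
    return total


def make_data(size, empty_fraction, seed=0):
    """Build a list of [1] and [] entries with roughly the given share of []."""
    rng = random.Random(seed)
    return [[] if rng.random() < empty_fraction else [1] for _ in range(size)]


for fraction in (0.0, 0.0001, 0.5, 1.0):
    data = make_data(100_000, fraction)
    # Take the best of three repeats to reduce the influence of background load.
    t_if = min(timeit.repeat(lambda: sum_with_if(data), number=3, repeat=3))
    t_try = min(timeit.repeat(lambda: sum_with_try(data), number=3, repeat=3))
    print(f'empty fraction {fraction}: if {t_if:.4f}s, try {t_try:.4f}s')
```

Putting the loops inside functions and taking the minimum over several repeats is a common way to make such micro-benchmarks less sensitive to one-off disturbances.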

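When scraping, we often run into empty lists without knowing how many there will be, so rather than guarding with if, it is better to use a try statement directly. Even when we do not know where an error such as IndexError will occur, we can print the error, skip that item, and continue running the code.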
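
The "print it and skip" idea can be sketched as follows. Everything here is hypothetical scaffolding for illustration: `parse_item` and `parse_all` are made-up names, and dictionaries of plain lists stand in for the per-video XPath results.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)


def parse_item(fields):
    """Hypothetical parser: fields maps a name to the list a query returned."""
    # Indexing an empty list raises IndexError, just like abstract[0] would.
    return {name: values[0] for name, values in fields.items()}


def parse_all(items):
    """Parse every item, logging and skipping the ones with missing fields."""
    parsed = []
    for index, fields in enumerate(items):
        try:
            parsed.append(parse_item(fields))
        except IndexError:
            # Record which item failed, then move on to the next one.
            logger.warning('skipping item %d: missing field in %r', index, fields)
    return parsed


items = [
    {'title': ['video A'], 'times': ['03:20']},
    {'title': ['video B'], 'times': []},  # no duration, so parsing fails
    {'title': ['video C'], 'times': ['10:05']},
]
print(parse_all(items))
```

Logging the offending item keeps the run going while leaving a record of which entries need a closer look afterwards.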
