Concurrent Programming (4): How to Use Multithreading, and Rewriting a Crawler with Threads to Compare Speed

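Posts in the concurrent programming series

Concurrent Programming (1): An introduction to concurrent programming in Python

Concurrent Programming (2): How to choose between multithreading, multiprocessing, and coroutines

Concurrent Programming (3): The culprit behind slow Python code: the Global Interpreter Lock (GIL)

Concurrent Programming (4): How to use multithreading, and rewriting a crawler with threads to compare speed

Concurrent Programming (5): A multithreaded crawler using the producer-consumer pattern in Python

Concurrent Programming (6): Thread safety problems and the Lock solution

Concurrent Programming (7): The handy thread pool ThreadPoolExecutor

Concurrent Programming (8): Speeding up a web service with a thread pool

Concurrent Programming (9): Speeding up programs with multiprocessing

Concurrent Programming (10): Speeding up a Flask service with a process pool

Concurrent Programming (11): Concurrent programming with asynchronous IO in Python

Concurrent Programming (12): Using subprocess to launch any program on your computer (playing music, unpacking archives, automatic downloads, and so on)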

 
 

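How to create a thread in Python
# 1. Prepare a function (do_craw stands for whatever work the thread should do)
def my_func(a, b):
    do_craw(a, b)

# 2. Create a thread; args must be a tuple of the arguments passed to target
import threading

t = threading.Thread(target=my_func, args=(100, 200))

# 3. Start the thread
t.start()

# 4. Wait for it to finish
t.join()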
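The four steps above leave `do_craw` undefined, so the listing cannot be run as it stands. Here is a minimal runnable sketch of the same steps. It assumes a stand-in `do_craw` that just sleeps and stores a sum in a `results` dict; both names are made up for illustration and are not part of the crawler.

```python
import threading
import time

results = {}  # hypothetical shared dict; the thread writes its result here


def do_craw(a, b):
    """Stand-in for real work: wait briefly, then record a + b."""
    time.sleep(0.1)  # pretend to wait on the network
    results[(a, b)] = a + b


# 1. Prepare a function
def my_func(a, b):
    do_craw(a, b)


# 2. Create a thread (a single argument would need a trailing comma: args=(100,))
t = threading.Thread(target=my_func, args=(100, 200))

# 3. Start the thread; start() returns immediately while my_func runs in the background
t.start()

# 4. Wait for it to finish; after join() the result is safe to read
t.join()

print(results)  # {(100, 200): 300}
```

A `Thread` does not hand back the return value of its target, which is why this sketch stores the result in a shared dict instead.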

 

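Fetching data the most basic way
# Imports
import requests

# Build the list of URLs to crawl
urls = [
    f"https://w.cnblogs.com/#p{page}"
    for page in range(1, 51)
]

# Define the crawl function
def craw(url):
    '''
    input: url, the link of the page to crawl
    '''
    # Request the page
    r = requests.get(url)
    # Print the URL and the length of the page text
    print(url, len(r.text))

# Call the crawl function
craw(urls[10])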

'''
output:
https://w.cnblogs.com/#p11 70100
'''
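One detail about the URL list is worth a quick check. The `#p{page}` part of each URL is a fragment, and fragments are handled on the client side rather than sent to the server in the HTTP request. That is a likely reason why every page in the outputs of this post reports the same length, 70100. The sketch below uses only the standard library to show which resource each URL refers to once the fragment is removed; it makes no network requests.

```python
from urllib.parse import urldefrag, urlsplit

# Same URL list as in the post
urls = [
    f"https://w.cnblogs.com/#p{page}"
    for page in range(1, 51)
]

# Split one URL into its components
parts = urlsplit(urls[10])
print(parts.path)      # /
print(parts.fragment)  # p11

# Drop the fragment from every URL and collect the distinct results
requested = {urldefrag(url).url for url in urls}
print(requested)       # {'https://w.cnblogs.com/'}
```

All 50 URLs collapse to a single address, so the timings in this post measure 50 requests for the same page. That is still a fair comparison between the single-threaded and multithreaded versions, since both fetch the same thing.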

 

Comparing execution speed: single-threaded versus multithreaded
# -*- coding: utf-8 -*-
# @Time    : 2021-03-20 14:47:02
# @Author  : wlq
# @FileName: blog_spider.py
# @Email   :rd_wlq@163.com

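# Imports
import functools
import requests
import threading
import time

# Decorator that prints how long the wrapped function took, in seconds (for background, see https://blog.csdn.net/qq_42546127/article/details/115007989)
def get_time(func):
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(end_time - start_time)
        return result  # pass the wrapped function's return value through

    return wrapper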

# Build the list of URLs to crawl
urls = [
    f"https://w.cnblogs.com/#p{page}"
    for page in range(1, 51)
]

# The plain crawl function
def craw(url):
    r = requests.get(url)
    print(url, len(r.text))

# Crawl with a single thread: one URL after another
@get_time
def single_thread():
    print("single_thread begin")
    for url in urls:
        craw(url)
    print("single_thread end")

# Crawl with multiple threads: one thread per URL
@get_time
def multi_thread():
    print("multi_thread begin")
    threads = []

    # Create all the threads first
    for url in urls:
        threads.append(
            threading.Thread(target=craw, args=(url,))
        )

    # Start them all
    for thread in threads:
        thread.start()

    # Wait for every thread to finish
    for thread in threads:
        thread.join()

    print("multi_thread end")

# Run both versions
if __name__ == '__main__':
    single_thread()
    multi_thread()

'''
output:
single_thread begin
https://w.cnblogs.com/#p1 70100
https://w.cnblogs.com/#p2 70100
https://w.cnblogs.com/#p3 70100
https://w.cnblogs.com/#p4 70100
https://w.cnblogs.com/#p5 70100
https://w.cnblogs.com/#p6 70100
https://w.cnblogs.com/#p7 70100
https://w.cnblogs.com/#p8 70100
https://w.cnblogs.com/#p9 70100
https://w.cnblogs.com/#p10 70100
https://w.cnblogs.com/#p11 70100
https://w.cnblogs.com/#p12 70100
https://w.cnblogs.com/#p13 70100
https://w.cnblogs.com/#p14 70100
https://w.cnblogs.com/#p15 70100
https://w.cnblogs.com/#p16 70100
https://w.cnblogs.com/#p17 70100
https://w.cnblogs.com/#p18 70100
https://w.cnblogs.com/#p19 70100
https://w.cnblogs.com/#p20 70100
https://w.cnblogs.com/#p21 70100
https://w.cnblogs.com/#p22 70100
https://w.cnblogs.com/#p23 70100
https://w.cnblogs.com/#p24 70100
https://w.cnblogs.com/#p25 70100
https://w.cnblogs.com/#p26 70100
https://w.cnblogs.com/#p27 70100
https://w.cnblogs.com/#p28 70100
https://w.cnblogs.com/#p29 70100
https://w.cnblogs.com/#p30 70100
https://w.cnblogs.com/#p31 70100
https://w.cnblogs.com/#p32 70100
https://w.cnblogs.com/#p33 70100
https://w.cnblogs.com/#p34 70100
https://w.cnblogs.com/#p35 70100
https://w.cnblogs.com/#p36 70100
https://w.cnblogs.com/#p37 70100
https://w.cnblogs.com/#p38 70100
https://w.cnblogs.com/#p39 70100
https://w.cnblogs.com/#p40 70100
https://w.cnblogs.com/#p41 70100
https://w.cnblogs.com/#p42 70100
https://w.cnblogs.com/#p43 70100
https://w.cnblogs.com/#p44 70100
https://w.cnblogs.com/#p45 70100
https://w.cnblogs.com/#p46 70100
https://w.cnblogs.com/#p47 70100
https://w.cnblogs.com/#p48 70100
https://w.cnblogs.com/#p49 70100
https://w.cnblogs.com/#p50 70100
single_thread end
9.648617267608643
multi_thread begin
https://w.cnblogs.com/#p7 70100
https://w.cnblogs.com/#p2 70100
https://w.cnblogs.com/#p1https://w.cnblogs.com/#p4 70100 70100

https://w.cnblogs.com/#p6 70100
https://w.cnblogs.com/#p12https://w.cnblogs.com/#p10 70100
 https://w.cnblogs.com/#p3https://w.cnblogs.com/#p5 7010070100
https://w.cnblogs.com/#p13 70100
 70100https://w.cnblogs.com/#p8 70100

https://w.cnblogs.com/#p11https://w.cnblogs.com/#p9 70100

 https://w.cnblogs.com/#p1670100 
70100
https://w.cnblogs.com/#p19 70100
https://w.cnblogs.com/#p18https://w.cnblogs.com/#p15 70100
https://w.cnblogs.com/#p17 70100 
70100
https://w.cnblogs.com/#p14 70100
https://w.cnblogs.com/#p25 70100
https://w.cnblogs.com/#p24 https://w.cnblogs.com/#p21https://w.cnblogs.com/#p27 70100
70100https://w.cnblogs.com/#p23https://w.cnblogs.com/#p20  70100
70100 70100


https://w.cnblogs.com/#p26 70100
https://w.cnblogs.com/#p30 70100
https://w.cnblogs.com/#p32https://w.cnblogs.com/#p35 70100
https://w.cnblogs.com/#p22https://w.cnblogs.com/#p31 70100
 70100
https://w.cnblogs.com/#p33https://w.cnblogs.com/#p37 70100  70100https://w.cnblogs.com/#p29

70100 
70100
https://w.cnblogs.com/#p28 70100
https://w.cnblogs.com/#p34https://w.cnblogs.com/#p39 70100
 70100
https://w.cnblogs.com/#p36 70100
https://w.cnblogs.com/#p41 70100
https://w.cnblogs.com/#p38 70100
https://w.cnblogs.com/#p40 70100
https://w.cnblogs.com/#p42 70100
https://w.cnblogs.com/#p44https://w.cnblogs.com/#p45 https://w.cnblogs.com/#p46 70100
 70100
70100
https://w.cnblogs.com/#p48 70100
https://w.cnblogs.com/#p47 70100
https://w.cnblogs.com/#p43 70100
https://w.cnblogs.com/#p50 70100
https://w.cnblogs.com/#p49 70100
multi_thread end
0.5046284198760986
'''
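In the multi_thread output above, several URLs and lengths run together on one line. That happens because the threads call `print` at the same time and nothing stops their output from mixing; `print` typically writes the text and the trailing newline separately, so another thread can slip in between. One possible way to keep each line intact is to guard the call with a shared `threading.Lock`. The sketch below assumes hypothetical names (`print_lock`, `report`, `lines`) and uses the length 70100 only as a placeholder, so it needs no network access.

```python
import threading

print_lock = threading.Lock()  # hypothetical name; one lock shared by all threads
lines = []                     # record of what was printed, kept for checking


def report(url, length):
    """Print one result as a single uninterrupted line."""
    message = f"{url} {length}"
    with print_lock:           # only one thread at a time may run this block
        print(message)
        lines.append(message)


threads = [
    threading.Thread(target=report, args=(f"https://w.cnblogs.com/#p{page}", 70100))
    for page in range(1, 51)
]

for thread in threads:
    thread.start()

for thread in threads:
    thread.join()
```

The order of the lines can still vary from run to run; the lock only guarantees that each line is printed whole.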

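The output above shows that the single-threaded version handles the URLs one at a time, in order, and takes 9.648617267608643 seconds. In the multithreaded version the requests run concurrently and finish in no fixed order, and the whole run takes 0.5046284198760986 seconds.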

The ratio between the two works out to
$$\text{multiple} = \frac{9.648617267608643}{0.5046284198760986} = 19.120241523411753 \approx 19$$

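The multithreaded version is about 19 times faster, which shows how much multithreading helps for I/O-bound work such as this crawler.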
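The same comparison can be reproduced without any network access. The sketch below replaces `requests.get` with a hypothetical `fake_fetch` that only sleeps, which is a rough model of a thread waiting on a server. The task count (20) and the delay (0.05 seconds) are arbitrary assumptions, so the numbers it prints will differ from the ones measured above.

```python
import threading
import time


def fake_fetch(delay):
    """Hypothetical stand-in for requests.get: just wait, like a network call."""
    time.sleep(delay)


def run_sequential(n, delay):
    """Run n fake fetches one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        fake_fetch(delay)
    return time.perf_counter() - start


def run_threaded(n, delay):
    """Run n fake fetches in n threads and return the elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_fetch, args=(delay,)) for _ in range(n)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


sequential = run_sequential(20, 0.05)
threaded = run_threaded(20, 0.05)
print(f"sequential: {sequential:.2f}s")
print(f"threaded:   {threaded:.2f}s")
print(f"multiple:   {sequential / threaded:.1f}")
```

The sequential run has to wait out every delay in turn, while the threaded run waits for all of them at once. The speedup comes from overlapping the waiting time, which is why it shows up in I/O-bound tasks like crawling.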