Concurrent Programming (Part 2): A Multithreaded Crawler in Python

两三行#

Tags: python, concurrent programming, multithreading

Article link: https://blog.csdn.net/xubishenghua/article/details/120383229

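This article compares the performance of single-threaded and multithreaded crawlers in Python, describes the unordered nature of multithreaded execution, and explains what `join()` does. It uses examples to introduce the advantages of building a crawler with the producer-consumer pattern and shows how `queue.Queue` provides thread-safe communication between threads. It then shows how to write a producer-consumer crawler and demonstrates the blocking behavior of queues in multithreaded code.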

Single-threaded vs. multithreaded crawling

import requests, time
import threading

urls = [
    f"https://q.cnblogs.com/list/unsolved?page={page}"
    for page in range(1, 50 + 1)
]


def crawling(url):
    data = requests.get(url)
    print(url, len(data.text))


# Single-threaded: fetch the URLs one after another
def single_thread():
    for url in urls:
        crawling(url)

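# Multithreaded: one thread per URL (despite the function's name)
def single_threading():
    threads = []
    # target is the function object itself; adding () would call it immediately.
    # args passes the arguments as a tuple.
    for url in urls: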
        threads.append(threading.Thread(target=crawling, args=(url,)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


if __name__ == '__main__':
    print('start to perform!')
    start = time.time()
    single_thread()
    # single_threading()
    end = time.time()
    print(end - start)
# Single-threaded run

>>start to perform!
https://q.cnblogs.com/list/unsolved?page=1 48108
https://q.cnblogs.com/list/unsolved?page=2 48842

.......

https://q.cnblogs.com/list/unsolved?page=50 49580

9.569056272506714

# Multithreaded run

>>start to perform!
https://q.cnblogs.com/list/unsolved?page=6 49070
https://q.cnblogs.com/list/unsolved?page=19 48875

0.3545997142791748

9.5690 / 0.3545
26.992947813822287, a gap of roughly 27x

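The runs above show how much faster the multithreaded version is. Each request spends most of its time waiting on the network, and with one thread per URL those waits overlap instead of adding up. The output also shows that multithreaded execution is unordered: the operating system's scheduler decides when each thread runs, and responses come back at different times, so the pages are not printed in sequence. The code also calls a method named join(), which is explained next.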
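The timing gap and the unordered output can be reproduced without touching the network. The following is a minimal sketch, not the article's code: `fake_fetch`, `run_sequential`, and `run_threaded` are hypothetical names, and `time.sleep(0.05)` is an assumed stand-in for the time `requests.get()` spends waiting on a response.

```python
import threading
import time


def fake_fetch(page, results):
    # Hypothetical stand-in for requests.get(): sleep to imitate network latency.
    time.sleep(0.05)
    results.append(page)  # list.append is safe to call from several threads in CPython


def run_sequential(pages):
    # Fetch the pages one after another; the waits add up.
    results = []
    start = time.perf_counter()
    for page in pages:
        fake_fetch(page, results)
    return results, time.perf_counter() - start


def run_threaded(pages):
    # One thread per page; the waits overlap.
    results = []
    threads = [
        threading.Thread(target=fake_fetch, args=(page, results))
        for page in pages
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block here until this thread has finished
    return results, time.perf_counter() - start


pages = list(range(1, 11))
seq_results, seq_time = run_sequential(pages)
thr_results, thr_time = run_threaded(pages)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
print("threaded completion order:", thr_results)
```

The sequential run takes about ten times the simulated latency, while the threaded run takes roughly one latency period. The completion order printed for the threaded run may differ between executions.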

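join(): calling thread.join() blocks the calling thread (here, the main thread) until that thread has finished. It does not change how the other threads run. Three cases illustrate this: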

Case 1: no join()

def single_threading():
    threads = []
    # target is the function object itself; adding () would call it immediately.
    # args passes the arguments as a tuple.
    for url in urls:
        threads.append(threading.Thread(target=crawling, args=(url,)))
    for thread in threads:
        thread.start()
    # for thread in threads:
    #     thread.join()

>>start to perform!
0.06383013725280762
https://q.cnblo