Comparing single-threaded and multithreaded crawling
import threading
import time

import requests

urls = [
    f"https://q.cnblogs.com/list/unsolved?page={page}"
    for page in range(1, 50 + 1)
]


def crawling(url):
    # Fetch one page and print its URL and the length of the response body.
    data = requests.get(url)
    print(url, len(data.text))


# Single-threaded: fetch the pages one after another.
def single_thread():
    for url in urls:
        crawling(url)


# Multithreaded (despite the function's name): one thread per URL.
def single_threading():
    threads = []
    for url in urls:
        # target is the function the thread runs; do not add (), or it would
        # be called immediately. args passes the arguments as a tuple.
        threads.append(threading.Thread(target=crawling, args=(url,)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


if __name__ == '__main__':
    print('start to perform!')
    start = time.time()
    single_thread()
    # single_threading()
    end = time.time()
    print(end - start)

# Single-threaded
>>start to perform!
https://q.cnblogs.com/list/unsolved?page=1 48108
https://q.cnblogs.com/list/unsolved?page=2 48842
.......
https://q.cnblogs.com/list/unsolved?page=50 49580
9.569056272506714
# Multithreaded
>>start to perform!
https://q.cnblogs.com/list/unsolved?page=6 49070
https://q.cnblogs.com/list/unsolved?page=19 48875
0.3545997142791748
9.5690/0.3545
26.992947813822287, so the multithreaded run is roughly 27 times faster
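The timings above show how much faster the multithreaded version is for this I/O-bound task. The output also shows that threads finish in no fixed order: thread scheduling is decided by the operating system, and each request takes a different amount of time to return, so the order in which results are printed is not deterministic. The code also calls a function named join(), which is explained next.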
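The speed-up and the out-of-order completion described above can be reproduced without any network access. The following is a minimal sketch, not the article's crawler: `fake_download`, `run_threads`, and the delay values are hypothetical names and numbers introduced only for illustration, with `time.sleep` standing in for the latency of `requests.get`.

```python
import threading
import time


def fake_download(page, delay, finished):
    # Hypothetical stand-in for crawling(url): sleeping simulates waiting
    # on the network, which is where the real crawler spends its time.
    time.sleep(delay)
    # list.append is safe to call from several threads in CPython.
    finished.append(page)


def run_threads(delays):
    # Start one thread per simulated page, wait for all of them, and
    # return the completion order together with the elapsed time.
    finished = []
    threads = [
        threading.Thread(target=fake_download, args=(page, delay, finished))
        for page, delay in enumerate(delays, start=1)
    ]
    start = time.time()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return finished, time.time() - start


if __name__ == '__main__':
    # Page 1 is given the longest delay and page 3 the shortest.
    order, elapsed = run_threads([0.5, 0.25, 0.05])
    print(order, round(elapsed, 2))
```

With these delays the pages should finish in the reverse of the order they were started, and the total time should be close to the longest single delay rather than the sum of all three, because the threads wait concurrently.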
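join(): blocks the calling thread (here, the main thread) until the thread it is called on has finished, so the main thread continues only after that child thread completes. Other threads that have already been started keep running in the meantime. The following three cases illustrate this: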
Case 1: no join
def single_threading():
    threads = []
    for url in urls:
        # target is the function the thread runs; do not add (), or it would
        # be called immediately. args passes the arguments as a tuple.
        threads.append(threading.Thread(target=crawling, args=(url,)))
    for thread in threads:
        thread.start()
    # for thread in threads:
    #     thread.join()

>>start to perform!
0.06383013725280762
https://q.cnblo