Speeding Up a Web Crawler (Threads)

eqwaak0

Published 2024-03-27 15:37:26

Tags: crawler, linux, operations, python

Article link: https://blog.csdn.net/eqwaak0/article/details/137076939

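Summary (auto-generated by CSDN): This article introduces a single-threaded crawler built with Python's requests library and discusses two ways to write a multithreaded crawler, function style and class-wrapper style, along with thread control through the threading library. An example that crawls the links in the wudi text file shows how multithreading is applied and stresses its importance for crawl speed.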

We start with a single-threaded crawler to get a baseline for crawl speed (wudi.txt holds the addresses of the sites to crawl):

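import requests
import time

# Read the crawl targets: each line of wudi.txt is tab-separated, with the URL in the second field
link_list = []
with open('wudi.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

# Request the links one after another and time the whole run
start = time.time()
for eachone in link_list:
    try:
        r = requests.get(eachone)
        print(r.status_code, eachone)
    except Exception as e:
        print('Error:', e)

end = time.time()
print('Single-threaded crawl time:', end - start)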

Crawling 5 of them this way took 100.428 seconds.
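The parsing step above relies on every line of wudi.txt being tab-separated with the URL in the second field. Since the file itself is not shown, here is a minimal sketch of the same parsing on made-up lines; `read_links` and the sample data are hypothetical, and unlike the original loop this version skips lines that have no second field instead of raising an IndexError.

```python
def read_links(lines):
    """Return the second tab-separated field of each line, skipping lines without one."""
    links = []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) > 1 and fields[1]:
            links.append(fields[1])
    return links

# Hypothetical sample lines; the real content of wudi.txt is not shown in the article
sample = [
    "1\thttp://example.com\n",
    "2\thttp://example.org\n",
    "a line without a tab\n",
]
links = read_links(sample)
print(links)
```

With a real file, the same helper could be called as `read_links(open('wudi.txt'))`, since a file object yields its lines one by one.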

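With multithreading, there are two approaches:

1. Function style: call the start_new_thread() function from the _thread module to spawn a new thread.

2. Class-wrapper style: create threads with the threading library by subclassing threading.Thread.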

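import _thread
import time

# Print the current time three times, sleeping `delay` seconds before each print
def print_time(threadName, delay):
    count = 0
    while count < 3:
        time.sleep(delay)
        count += 1
        print(threadName, time.ctime())

_thread.start_new_thread(print_time, ("Thread1", 1))
_thread.start_new_thread(print_time, ("Thread2", 2))
print('Output:')
# The program ends when the main thread exits, so keep it alive
# until both threads are done (the slower one needs about 6 seconds)
time.sleep(7)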

Running the code shows both threads printing while the other is still active.

Usage in _thread:

_thread.start_new_thread(function, args[, kwargs]). Compared with the threading module, however, it is quite limited.
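One way to see the limitation: start_new_thread() only returns a thread identifier, so waiting for a thread to finish has to be arranged by hand. The sketch below does this with one lock per thread from `_thread.allocate_lock()`; the `worker` function, the `finished` list, and the short delays are illustrative choices, not part of the original article.

```python
import _thread
import time

finished = []  # names of the threads that have completed

def worker(name, delay, done_lock):
    time.sleep(delay)
    finished.append(name)
    done_lock.release()  # signal the main thread that this worker is done

locks = []
for name, delay in (("Thread1", 0.1), ("Thread2", 0.2)):
    lock = _thread.allocate_lock()
    lock.acquire()  # held until the worker releases it
    _thread.start_new_thread(worker, (name, delay, lock))
    locks.append(lock)

# Acquiring each lock again blocks until the matching worker has released it
for lock in locks:
    lock.acquire()
print('All threads finished:', sorted(finished))
```

With threading, all of this bookkeeping collapses into a single join() call per thread.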

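Methods of threading.Thread:

run(): the method that represents the thread's activity.
start(): starts the thread's activity.
join([timeout]): waits until the thread terminates.
is_alive(): returns whether the thread is alive (the older spelling isAlive() was removed in Python 3.9).
getName(): returns the thread name (deprecated since Python 3.10 in favor of the name attribute).
setName(): sets the thread name (deprecated since Python 3.10 in favor of the name attribute).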
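As a quick illustration of these methods, here is a minimal sketch that starts one thread and checks its state with is_alive() (the current spelling of isAlive()) before it starts, while it runs, and after join(). The `work` function and the name 'Demo' are made up for the example.

```python
import threading
import time

def work(delay):
    # Stand-in for real work: just wait for a moment
    time.sleep(delay)

t = threading.Thread(target=work, args=(0.2,), name='Demo')

alive_before_start = t.is_alive()   # the thread has not been started yet
t.start()                           # runs work(0.2) in a new thread via run()
alive_while_running = t.is_alive()  # the thread is still sleeping
t.join()                            # block until the thread terminates
alive_after_join = t.is_alive()

print(t.name, alive_before_start, alive_while_running, alive_after_join)
```

Passing `target=` is an alternative to subclassing threading.Thread and overriding run(); both end up executing the work inside run().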

import threading
import time

class All(threading.Thread):
    def __init__(self, name, delay):
        threading.Thread.__init__(self)
        self.name = name
        self.delay = delay

    def run(self):
        print('Starting ' + self.name)
        print_time(self.name, self.delay)
        print('Exiting ' + self.name)

def print_time(threadName, delay):
    count = 0
    while count < 3:
        time.sleep(delay)
        print(threadName, time.ctime())
        count += 1

# Create the threads
threads = []
thread1 = All('Thread1', 1)
thread2 = All('Thread2', 2)
# Start the threads
thread1.start()
thread2.start()

# Add the threads to the list
threads.append(thread1)
threads.append(thread2)
# Wait for all of them to finish
for t in threads:
    t.join()
print('Finished')

threading gives us effective control over threads.

A multithreaded crawler:

import threading
import time
import requests

# Read the crawl targets: the URL is the second tab-separated field of each line
link_list = []
with open('wudi.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()

class myThread(threading.Thread):
    def __init__(self, name, link_range):
        threading.Thread.__init__(self)
        self.name = name
        self.link_range = link_range

    def run(self):
        print('starting ' + self.name)
        crawler(self.name, self.link_range)
        print('exiting ' + self.name)

def crawler(threadName, link_range):
    # link_range is an inclusive (first, last) pair of indices into link_list;
    # an index past the end of link_list is reported by the except clause
    for i in range(link_range[0], link_range[1] + 1):
        try:
            r = requests.get(link_list[i], timeout=20)
            print(threadName, r.status_code, link_list[i])
        except Exception as e:
            print(threadName, 'Error:', e)

thread_list = []
# Tuples rather than sets: sets are unordered and cannot be indexed
link_range_list = [(0, 200), (201, 400), (401, 600), (601, 800), (801, 1000)]

# Start one thread per range: Thread1 to Thread5
for i in range(1, 6):
    thread = myThread('Thread' + str(i), link_range_list[i - 1])
    thread.start()
    thread_list.append(thread)

# Wait for all threads to finish
for thread in thread_list:
    thread.join()

end = time.time()
print('Multithreaded crawl time:', end - start)
print('over')

wudi.txt is my own file of web page addresses *3.

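In this multithreaded version the links are split into 5 parts, one per thread. A thread that finishes its part exits, so toward the end the crawl falls back to a single thread.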
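The five ranges in the crawler are written out by hand and only fit a list of a particular length. As a small sketch, the split could instead be derived from the number of links; `split_ranges` is a hypothetical helper, and it returns inclusive (first, last) pairs in the same form the crawler's link_range uses.

```python
def split_ranges(total, parts):
    """Split indices 0..total-1 into up to `parts` inclusive (first, last) ranges."""
    ranges = []
    base, extra = divmod(total, parts)
    first = 0
    for k in range(parts):
        # The first `extra` ranges get one additional item each
        size = base + (1 if k < extra else 0)
        if size == 0:
            continue  # fewer items than parts: skip empty ranges
        ranges.append((first, first + size - 1))
        first += size
    return ranges

print(split_ranges(1000, 5))
print(split_ranges(7, 3))
```

An even split like this only balances the number of links per thread. It does not remove the effect described above, because a thread whose links respond quickly still finishes early and then sits idle.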

Next time we will cover crawling with Queue, which lets the threads crawl quickly at the same time.
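As a rough preview of that idea, and not the implementation the follow-up article will use, here is a minimal sketch in which five threads pull links from a shared queue.Queue, so no thread runs out of work while links remain. `fake_fetch` is a stand-in for requests.get() that only sleeps, and the example.com links are made up, so the sketch runs without network access.

```python
import queue
import threading
import time

def fake_fetch(link):
    """Stand-in for requests.get(link): sleeps briefly and reports status 200."""
    time.sleep(0.01)
    return 200

def worker(name, work_queue, results, results_lock):
    while True:
        try:
            link = work_queue.get_nowait()  # take the next unclaimed link
        except queue.Empty:
            return  # nothing left to crawl, so this thread exits
        status = fake_fetch(link)
        with results_lock:
            results.append((name, status, link))

# Hypothetical links; the queue is filled completely before any worker starts
links = ['http://example.com/page%d' % i for i in range(20)]
work_queue = queue.Queue()
for link in links:
    work_queue.put(link)

results = []
results_lock = threading.Lock()
threads = [
    threading.Thread(target=worker,
                     args=('Thread%d' % i, work_queue, results, results_lock))
    for i in range(1, 6)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('Crawled', len(results), 'links')
```

Because each thread takes a new link as soon as it finishes the previous one, slow links no longer leave the other threads idle.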