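Multithreading comes up a lot in Python. It does nothing to speed up CPU-bound work (in CPython the GIL lets only one thread run Python bytecode at a time), but it is very handy for blocking operations. A crawler is the classic case: a single-threaded crawler spends almost all of its time blocked, waiting for responses, so a multithreaded crawler makes a big difference. To understand multithreading better, I wrote a simple framework that crawls a website with multiple threads.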
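To make the point about blocking work concrete before touching a real site, here is a minimal sketch that uses `time.sleep` as a stand-in for a network request. `fake_request`, `run_sequential` and `run_threaded` are names made up for this illustration, and the 0.05 second delay is an arbitrary assumption.

```python
import threading
import time


def fake_request(delay=0.05):
    """Stand-in for a blocking network call (hypothetical, no real I/O)."""
    time.sleep(delay)


def run_sequential(n):
    """Issue n fake requests one after another and return the elapsed time."""
    start = time.perf_counter()
    for _ in range(n):
        fake_request()
    return time.perf_counter() - start


def run_threaded(n):
    """Issue n fake requests in n threads and return the elapsed time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_request) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every thread has finished
    return time.perf_counter() - start


if __name__ == "__main__":
    print("sequential: {:.2f}s".format(run_sequential(10)))
    print("threaded:   {:.2f}s".format(run_threaded(10)))
```

While one thread sleeps (or waits on a socket), the others are free to run, so the waits overlap instead of adding up. A loop doing pure arithmetic would not benefit in the same way.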
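Modules needed: requests, threading, lxml, queue, time
Site crawled: https://www.jd.com/ (JD.com)
JD is far too large to crawl in full, so only the mobile phone category is crawled
Number of threads: 10
Here is the code: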
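# Multithreaded crawl of JD's mobile phone category
import requests
import threading
from lxml.html import etree
from queue import Queue, Empty
import time

# Proxy IP; you can also use your own IP, since JD generally does not ban IPs.
# Note: this mapping only covers http:// URLs; add an "https" key to proxy the https:// pages below.
proxies = {"http": "http://222.189.246.31:9999"}
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}  # pose as the Chrome browser


# Custom thread class
class BookTread(threading.Thread):
    def __init__(self, url_queue, file, lock):
        super(BookTread, self).__init__()  # initialise the parent Thread class
        self.url_queue = url_queue  # queue of URLs to crawl
        self.file = file  # output file, shared by all threads
        self.lock = lock  # lock shared by all threads, held while writing to the file

    def run(self):
        while True:
            try:
                # Take a URL without blocking; checking empty() and then calling get()
                # can hang if another thread takes the last URL in between
                url = self.url_queue.get_nowait()
            except Empty:
                break  # the queue is empty, so stop crawling
            response = requests.get(url, headers=header, proxies=proxies)  # request the page
            data = response.content.decode()  # decode the page body as UTF-8
            html = etree.HTML(data)  # parse the page into an element tree so XPath can be used
            # Extract the phone titles; other fields can be extracted the same way
            title_list = html.xpath("//div[@class='p-name']//em/text()")
            with self.lock:  # only one thread writes at a time
                for title in title_list:
                    self.file.write(title.strip() + "\n")


if __name__ == "__main__":
    start_time = time.time()
    q = Queue()  # create the queue
    for i in range(1, 141):  # put every URL into the queue
        url = "https://list.jd.com/list.html?cat=9987,653,655&page={}&sort=sort_rank_asc&trans=1&JL=6_0_0&ms=9#J_main".format(i)
        q.put(url)
    lock = threading.Lock()  # one lock for all threads
    thread_list = []
    # One shared file handle, closed automatically once every thread has finished
    with open("E:/python/spider/jd_phone1.txt", "a", encoding="utf-8") as file:
        # Start 10 threads
        for j in range(10):
            t = BookTread(q, file, lock)  # pass the queue, file and lock to the thread
            t.start()  # start the thread
            thread_list.append(t)
        # Block the main thread until every worker has finished, then print the timing
        for thread in thread_list:
            thread.join()
    end_time = time.time()
    print("Crawling the phone titles took {} seconds".format(end_time - start_time))  # print the elapsed time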
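A worker loop that pulls from a shared queue is safest when it takes each item in a single step. If it checks `empty()` first and then calls a blocking `get()`, another thread may grab the last item in between and the loop hangs. The sketch below isolates that pattern without any network access. `worker`, `run_pool` and the doubling step are placeholders invented for this illustration, not part of the crawler.

```python
import threading
from queue import Queue, Empty


def worker(task_queue, results, lock):
    """Drain the queue until it is empty, storing one result per item."""
    while True:
        try:
            item = task_queue.get_nowait()  # raises Empty instead of blocking
        except Empty:
            return
        processed = item * 2  # placeholder for "download and parse"
        with lock:  # one lock shared by every worker
            results.append(processed)


def run_pool(items, num_threads=10):
    """Process all items with num_threads workers and return the results."""
    task_queue = Queue()
    for item in items:
        task_queue.put(item)
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(task_queue, results, lock))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    out = run_pool(range(1, 141))
    print(len(out))
```

The lock only protects anything if all workers hold the same `Lock` object. A lock created separately inside each thread never blocks the other threads.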
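Result:
As you can see, a total of 8404 phone titles were scraped.
Time taken: roughly 5 seconds.
Without a single-threaded run to compare against, it is hard to appreciate how fast the multithreaded version is, so below I crawl the same pages again with a single thread.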
import requests
from lxml.html import etree
import time

proxies = {"http": "http://222.189.246.31:9999"}
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}

start_time = time.time()
# The with block closes the file even if a request fails part-way through
with open("E:/python/spider/jd_phone1.txt", "w", encoding="utf-8") as file:
    for i in range(1, 141):  # fetch the 140 list pages one after another
        url = "https://list.jd.com/list.html?cat=9987,653,655&page={}&sort=sort_rank_asc&trans=1&JL=6_0_0&ms=9#J_main".format(i)
        response = requests.get(url, headers=header, proxies=proxies)
        data = response.content.decode()  # decode the page body as UTF-8
        html = etree.HTML(data)  # parse the page so XPath can be used
        title_list = html.xpath("//div[@class='p-name']//em/text()")
        for title in title_list:
            file.write(title.strip() + "\n")
end_time = time.time()
print("Crawling the phone titles took {} seconds".format(end_time - start_time))
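5 seconds versus 24 seconds: a clear win for the multithreaded version. And this was only 140 pages; with a much larger number of pages the gap between the two would be even more noticeable.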
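For comparison, the same fan-out over pages can also be written with the standard library's `concurrent.futures.ThreadPoolExecutor`, which manages the threads and the work queue itself. This is only a sketch: `fetch_titles` is a hypothetical stand-in that sleeps instead of making a request, and `crawl` is a name chosen for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_titles(page):
    """Hypothetical stand-in for downloading and parsing one list page."""
    time.sleep(0.01)  # simulate waiting on the network
    return ["phone {}-{}".format(page, n) for n in range(3)]


def crawl(pages, max_workers=10):
    """Fetch every page using a pool of worker threads and collect the titles."""
    titles = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of `pages`, not in completion order
        for page_titles in pool.map(fetch_titles, pages):
            titles.extend(page_titles)
    return titles


if __name__ == "__main__":
    start = time.perf_counter()
    all_titles = crawl(range(1, 141))
    print(len(all_titles), "titles in {:.2f}s".format(time.perf_counter() - start))
```

Because the results are gathered in the main thread, no lock is needed around the shared list, and the output can be written to the file in one place once the pool has finished.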