Testing multithreading by crawling the Douban Movie Top 75

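This post uses the threading module to write a simple multithreaded crawler, then compares its crawl speed against a single-threaded crawler.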

import requests
import re
import threading
import time

# Single-threaded crawl
def spider(url,headers):
    # Pass headers by keyword: the second positional argument of requests.get is params
    response = requests.get(url,headers=headers).text
    # Extract the detail-page link of every movie on the list page
    pattern = re.compile('<div class="pic">.*?<a href="(.*?)">',re.S)
    linkList = pattern.findall(response)
    for link in linkList:
        html = requests.get(link,headers=headers).text
        p1 = re.compile('<span class="top250-no">(.*?)</span>',re.S)
        p2 = re.compile('<span property="v:itemreviewed">(.*?)</span>',re.S)
        num = re.findall(p1,html)
        title = re.findall(p2,html)
        print(num[0],':',title[0])


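# Multithreaded crawl (three threads)
lock = threading.RLock() # thread lock (not actually used: the acquire/release calls are commented out)
# Fetch the rank and title of a single movie
def infoSpider(link,headers):
    # Pass headers by keyword: the second positional argument of requests.get is params
    html = requests.get(link, headers=headers).text
    p1 = re.compile('<span class="top250-no">(.*?)</span>', re.S)
    p2 = re.compile('<span property="v:itemreviewed">(.*?)</span>', re.S)
    num = re.findall(p1, html)
    title = re.findall(p2, html)
    print(num[0], ':', title[0])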

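# A, B and C each take every third link, starting at offsets 0, 1 and 2.
# Iterating up to len(linkList) avoids an IndexError if a page yields fewer than 25 links.
def A(linkList,headers):
    # lock.acquire()  # left disabled: holding the lock for the whole loop would serialize the threads
    for i in range(0, len(linkList), 3):
        url = linkList[i]
        infoSpider(url, headers)
    # lock.release()

def B(linkList,headers):
    # lock.acquire()
    for i in range(1, len(linkList), 3):
        url = linkList[i]
        infoSpider(url, headers)
    # lock.release()

def C(linkList,headers):
    # lock.acquire()
    for i in range(2, len(linkList), 3):
        url = linkList[i]
        infoSpider(url, headers)
    # lock.release()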

def spider2(url,headers):
    # Pass headers by keyword: the second positional argument of requests.get is params
    response = requests.get(url,headers=headers).text
    pattern = re.compile('<div class="pic">.*?<a href="(.*?)">',re.S)
    linkList = pattern.findall(response)
    t1 = threading.Thread(target=A, args=(linkList,headers))
    t2 = threading.Thread(target=B, args=(linkList,headers))
    t3 = threading.Thread(target=C, args=(linkList,headers))
    t1.start()
    t2.start()
    t3.start()
    t1.join()
    t2.join()
    t3.join()

def main():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }
    # Single-threaded test
    start1 = time.time()
    for i in range(3):
        url = 'https://movie.douban.com/top250?start=%d'%(i*25)
        spider(url,headers)
    end1 = time.time()
    # Multithreaded test
    start2 = time.time()
    for i in range(3):
        url = 'https://movie.douban.com/top250?start=%d'%(i*25)
        spider2(url,headers)
    end2 = time.time()
    print(end1-start1)  # single-threaded run time in seconds
    print(end2-start2)  # multithreaded run time in seconds


if __name__ == '__main__':
    main()
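
The three functions A, B and C differ only in their starting offset, so the same split can be written once for any number of threads. The following is a minimal sketch of that idea, not part of the original script: `strided_chunks`, `crawl_in_threads` and the `fetch` parameter are hypothetical names, and `fetch` stands in for whatever does the real work (for example a wrapper around `infoSpider`). Here the lock is held only while writing to the shared dictionary, so the slow calls still overlap.

```python
import threading


def strided_chunks(items, n_workers):
    """Split items into n_workers strided slices: items[0::n], items[1::n], ..."""
    return [items[k::n_workers] for k in range(n_workers)]


def crawl_in_threads(links, fetch, n_workers=3):
    """Call fetch(link) for every link, one thread per strided slice.

    fetch is a hypothetical callable supplied by the caller; the results
    are returned as a dict mapping each link to fetch(link).
    """
    results = {}
    results_lock = threading.Lock()  # guards the shared results dict

    def worker(chunk):
        for link in chunk:
            value = fetch(link)      # slow part runs outside the lock
            with results_lock:       # lock only the short shared write
                results[link] = value

    threads = [threading.Thread(target=worker, args=(chunk,))
               for chunk in strided_chunks(links, n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == '__main__':
    demo_links = ['link%d' % i for i in range(7)]
    print(strided_chunks(demo_links, 3))
    print(crawl_in_threads(demo_links, fetch=str.upper))
```

With `n_workers=3` the slices match the index ranges used by A, B and C, and they also work when a page returns fewer than 25 links.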

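In this test the three-thread crawl was roughly three times as fast as the single-thread crawl, i.e. it took about one third of the time.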
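
A crawler spends most of its time waiting on the network, which is why threads help here even under the GIL: while one thread is blocked, the others keep going. The sketch below illustrates this without touching Douban. It is an assumption-laden stand-in, not the original benchmark: `fake_fetch` simply sleeps to imitate a blocked request, and `run_sequential` and `run_threaded` are hypothetical helper names.

```python
import threading
import time


def fake_fetch(delay=0.05):
    """Stand-in for requests.get: just waits, like a blocked network call."""
    time.sleep(delay)


def run_sequential(n_requests, delay=0.05):
    """Perform the simulated requests one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n_requests):
        fake_fetch(delay)
    return time.perf_counter() - start


def run_threaded(n_requests, n_threads=3, delay=0.05):
    """Spread the simulated requests over n_threads; return elapsed seconds."""
    def worker(count):
        for _ in range(count):
            fake_fetch(delay)

    # Same strided split as A/B/C: thread k handles requests k, k+n, k+2n, ...
    counts = [len(range(k, n_requests, n_threads)) for k in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in counts]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == '__main__':
    print('sequential: %.2fs' % run_sequential(6))
    print('threaded:   %.2fs' % run_threaded(6, n_threads=3))
```

With real requests the ratio depends on server latency, rate limiting and network conditions, so the measured speedup will generally be less clean than in this simulation.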
