Fetching Web Page URLs with a Multithreaded Crawler

Latest recommended article published on 2024-07-07 23:56:48

洲小洲

Category: Crawlers. Tags: python, crawler

Article link: https://blog.csdn.net/weixin_43907174/article/details/124460683

Code

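import queue
import threading
import time
import urllib.request

# Starting URL; swap in whichever site you actually want to crawl.
url = "https://www.pythontab.com/html/pythonjichu/"

# Build the queue: the index page plus pages 2 through 9.
# The queue gets its own name so it does not shadow the queue module.
url_queue = queue.Queue()
url_queue.put(url)
for i in range(2, 10):
    new_url = url + str(i) + '.html'
    url_queue.put(new_url)


# Worker function: each thread pulls URLs from the shared queue until it is empty.
def fetchurl(urlQueue):
    while True:
        try:
            # Non-blocking get; raises queue.Empty once the queue is drained.
            current_url = urlQueue.get_nowait()
            number = urlQueue.qsize()
            print(number)
        except queue.Empty:
            break
        print('Current URL:', current_url)
        print('Thread {} fetched URL: {}'.format(threading.current_thread().name, current_url))
        try:
            # Request the page that was taken from the queue.
            response = urllib.request.urlopen(current_url)
            status_code = response.getcode()
            if status_code == 200:
                time.sleep(0.5)
        except Exception as e:
            # Report the failure and move on to the next URL.
            print('Failed to fetch {}: {}'.format(current_url, e))
            continue


start_time = time.time()
# Prepare the list of threads.
threads = []
thread_num = 10  # number of threads
for i in range(thread_num):
    thread = threading.Thread(target=fetchurl, args=(url_queue,))
    threads.append(thread)
for t in threads:
    t.start()
for t in threads:
    t.join()
end_time = time.time()
print('Elapsed time:', end_time - start_time)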
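
The script above relies on one property: every URL placed in the queue is taken by exactly one thread. The sketch below is one way to check that without touching the network. It is not part of the original script; the helper names `build_url_queue`, `worker`, `crawl` and `fake_fetch` are made up for illustration, and `fake_fetch` is a stand-in that always reports status 200 instead of calling `urllib.request.urlopen`.

```python
import queue
import threading


def build_url_queue(base_url, last_page):
    """Queue the index page plus pages 2..last_page (hypothetical helper)."""
    url_queue = queue.Queue()
    url_queue.put(base_url)
    for i in range(2, last_page + 1):
        url_queue.put(base_url + str(i) + ".html")
    return url_queue


def worker(url_queue, fetch, results, lock):
    """Drain the shared queue, recording one result per URL."""
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            break  # nothing left to do for this thread
        try:
            status = fetch(url)
        except Exception as exc:
            status = repr(exc)  # keep the error instead of dropping the URL
        with lock:
            results[url] = status


def crawl(url_queue, fetch, thread_num=4):
    """Run thread_num workers over the queue and return {url: status}."""
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(url_queue, fetch, results, lock))
        for _ in range(thread_num)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_fetch(url):
    # Assumption: stands in for urllib.request.urlopen(url).getcode().
    return 200


base = "https://www.pythontab.com/html/pythonjichu/"
results = crawl(build_url_queue(base, 9), fake_fetch)
print(len(results))
```

To crawl for real, `fake_fetch` could be replaced by a function that calls `urllib.request.urlopen(url).getcode()`; the queue and thread handling stay the same.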