python爬虫网络请求超时_Python-爬虫-多线程、线程池模拟（urllib、requests、UserAgent、超时等）...

最新推荐文章于 2024-01-14 12:38:10 发布

weixin_39796116

最新推荐文章于 2024-01-14 12:38:10 发布

阅读量179

点赞数

文章标签： python爬虫网络请求超时

我想这样：

1.创建一个线程池，线程池数量可以定为初始化16大小（如果无可用线程，则再次分配16个线程加入到线程池 - 目前线程编号有重复）

2.然后url列表装载到一个队列Queue里面

3.接下来遍历url列表数量（无需获取url，只是为了启动一个线程来处理url），同时启动一个线程（该线程会从队列里面去获取url进行爬取）

4（attention）. 然后主线程等待子线程运行完毕（过程中加入了运行线程是否活着的判断，如果运行了就不用join了）

5（attention）. 网络请求添加了超时请求，github模拟会比较慢，懒得等

So，看代码

thread_pool.py

#!/usr/bin/python3

# -*- coding: UTF-8 -*-

# 文件名：thread_pool.py

from threading import Thread

from queue import Queue

import time as Time

from urllib import request

tread_pool_len = 16

threads_pool = []

running_thread = []

url_list = [

'http://www.baidu.com',

'https://github.com/FanChael/DocPro',

'http://www.baidu.com',

'https://github.com/FanChael/DocPro',

'http://www.baidu.com',

'https://github.com/FanChael',

]

# url列表长度

url_len = len(url_list)

# 创建队列并初始化

queue = Queue(url_len)

for url in url_list:

queue.put(url)

# 伪装浏览器

headers = {

'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',

}

# 自定义线程

class my_thread(Thread):

def __init__(self):

Thread.__init__(self)

def run(self):

if not queue.empty():

print(self.getName(), '运行中')

data = ''

try:

req = request.Request(queue.get(), None, headers)

with request.urlopen(req, timeout=5) as uf:

while True:

data_temp = uf.read(1024)

if not data_temp:

break

data += data_temp.decode('utf-8', 'ignore')

# print('线程', self.getName(), '获取数据=', data)

except Exception as err:

print(self.getName(), str(err))

else:

pass

# 初始化线程池

def init_thread(count):

thread_count = len(threads_pool)

for i in range(thread_count, count):

thead = my_thread()

thead.setName('第' + str(i) + '号线程')

threads_pool.append(thead)

# 获取可用线程 - 优化思路：每次都遍历一遍效率低，可以封装对象，设置标示位，执行结束后改变标志位状态；但这样还是要循环一遍；此时取到一定数量或者快到头了，然后再从头遍历

def get_available():

for c_thread in threads_pool:

if not c_thread.isAlive():

threads_pool.remove(c_thread)

return c_thread

# 扩容线程

init_thread(tread_pool_len)

return get_available()

if __name__ == '__main__':

# 初始化线程池

init_thread(tread_pool_len)

# 启动时间

start_time = Time.time()

# 启动线程去从队列获取url执行请求

for i in range(url_len):

a_thread = get_available()

if a_thread:

running_thread.append(a_thread)

a_thread.start()

# 主线程等所有子线程运行完毕

for t in running_thread:

if t.isAlive():

t.join()

# 结束时间

end_time = Time.time()

print(len(running_thread), '个线程, ', '运行时间: ', end_time - start_time, '秒')

print('空余线程数: ', len(threads_pool))

Result :

D:\PycharmProjects\python_study\venv3.x\Scripts\python.exe D:/PycharmProjects/python_study/protest/thread_pool.py

第0号线程运行中

第1号线程运行中

第2号线程运行中

第3号线程运行中

第4号线程运行中

第5号线程运行中

第6号线程运行中

第7号线程运行中

第8号线程运行中

第9号线程运行中

第0号线程运行中

第1号线程运行中

第2号线程运行中

第1号线程

第2号线程 The read operation timed out

13 个线程, 运行时间: 20.04409170150757 秒

空余线程数: 7

Process finished with exit code 0

补充....这个地方我还会去看哈主流的线程池爬虫方案（其中官方线程池的用法参考： python线程池 ThreadPoolExecutor 的用法及实战），然后学习下，然后补充

threadpoolexecutor_practice.py

#!/usr/bin/python3

# -*- coding: UTF-8 -*-

# 文件名：threadpoolexecutor_practice.py

from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED, ALL_COMPLETED, as_completed

from urllib import request

# 伪装浏览器

headers = {

'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',

}

url_list = [

'http://www.baidu.com',

'https://github.com/FanChael/DocPro',

'http://www.baidu.com',

'https://github.com/FanChael/DocPro',

'http://www.baidu.com',

'https://github.com/FanChael',

]

def spider(url_path):

data_html = ''

try:

req = request.Request(url_path, None, headers)

# 爬到内容不对的还需要结合selenium等获取动态js内容

with request.urlopen(req, timeout=5) as uf:

while True:

data_temp = uf.read(1024)

if not data_temp:

break

data_html += data_temp.decode('utf-8', 'ignore')

# 爬到的数据可以本地或者数据库 - 总之进行一系列后续处理

print(url_path, " 完成")

except Exception as err:

print(str(err))

else:

pass

return data_html

# 创建一个最大容量为1的线程

executor = ThreadPoolExecutor(max_workers=16)

if __name__ == '__main__':

tasks = []

# 执行蜘蛛并加入执行列表

for url in url_list:

# 执行函数，并传入参数

task = executor.submit(spider, url)

tasks.append(task)

# 等待方式1：结束

# wait(tasks, return_when=ALL_COMPLETED)

# 等待方式2：结束

for future in as_completed(tasks):

# spider方法无返回，则返回为None

data = future.result()

print(f"main:{data[0:10]}")

# 等待方式3: 结束 - 替代submit并伴随等待！

# for data in executor.map(spider, url_list):

# print(data)

print('结束啦')

用官方的线程池，更简单一些，别人都做好了处理线程的管理。其实点击进去看看源码，大概也知道，也有类似的扩容处理，然后调用封装，任务也都是放到的队列里面的。比如下面一段源码：

OK，先酱紫...下一步数据库连接，正则匹配学哈。。差不多公司项目就可以看看了....具体其他的再深入...

weixin_39796116

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫