Python - very simple multithreaded parallel URL fetching (without a queue)
I spent a whole day looking for the simplest possible multithreaded URL fetcher in Python, but most scripts I found use queues, multiprocessing, or complex libraries.
Eventually I wrote one myself, which I am posting as an answer. Please feel free to suggest any improvements.
I figure other people might be looking for something similar.
5 Answers
43 votes
Simplifying your original version as far as possible:
import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

def fetch_url(url):
    urlHandler = urllib2.urlopen(url)
    html = urlHandler.read()
    print "'%s' fetched in %ss" % (url, (time.time() - start))

threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print "Elapsed Time: %s" % (time.time() - start)
The only new tricks here are:
Keep track of the threads you create.
Don't bother with a counter of threads if you just want to know when they're all done; join already tells you that.
If you don't need any state or an external API, you don't need a Thread subclass, just a target function.
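The same pattern works unchanged on Python 3; here is a minimal sketch, my addition rather than part of the answer, assuming urllib.request in place of urllib2:

import threading
import time
from urllib.request import urlopen  # Python 3 home of urlopen

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com"]

def fetch_url(url):
    # Plain target function, no Thread subclass needed.
    html = urlopen(url).read()
    print("'%s' fetched in %ss" % (url, time.time() - start))

threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # join is what tells you the threads are all done

print("Elapsed Time: %s" % (time.time() - start))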
abarnert answered 2020-02-09T03:20:34Z
29 votes
multiprocessing has a thread pool that doesn't start other processes:
#!/usr/bin/env python
from multiprocessing.pool import ThreadPool
from time import time as timer
from urllib2 import urlopen

urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

def fetch_url(url):
    try:
        response = urlopen(url)
        return url, response.read(), None
    except Exception as e:
        return url, None, e

start = timer()
results = ThreadPool(20).imap_unordered(fetch_url, urls)
for url, html, error in results:
    if error is None:
        print("%r fetched in %ss" % (url, timer() - start))
    else:
        print("error fetching %r: %s" % (url, error))
print("Elapsed Time: %s" % (timer() - start,))
Advantages compared to the Thread-based solution:
ThreadPool allows limiting the maximum number of concurrent connections (20 in the code example)
the output is not garbled because all output is in the main thread
errors are logged
the code works on both Python 2 and 3 without changes (assuming from urllib.request import urlopen on Python 3); see the sketch after this list.
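A minimal sketch of the import shim that last point assumes (my addition, not part of the original answer):

try:
    from urllib2 import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen   # Python 3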
jfs answered 2020-02-09T03:21:16Z
14 votes
The main example in the concurrent.futures docs does everything you're looking for, a lot more simply. Plus, it can handle huge numbers of URLs by only doing 5 at a time, and it handles errors much more nicely.
Of course, the module is only built in with Python 3.2 or later… but if you're using 2.5-3.1, you can install the backport, futures, from PyPI. All you need to change from the example code is to search-and-replace concurrent.futures with futures and, for 2.x, urllib.request with urllib2.
Here's the example backported to 2.x, modified to use your URL list and to add timings:
import concurrent.futures
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    conn = urllib2.urlopen(url, timeout=timeout)
    return conn.read()  # urllib2 responses have read(), not readall()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in urls}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print '%r generated an exception: %s' % (url, exc)
        else:
            print '"%s" fetched in %ss' % (url, (time.time() - start))

print "Elapsed Time: %ss" % (time.time() - start)
But you can make this even simpler. Really, all you need is:
def load_url(url):
    conn = urllib2.urlopen(url, timeout=60)  # timeout must be passed by keyword
    data = conn.read()
    print '"%s" fetched in %ss' % (url, (time.time() - start))
    return data

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    pages = executor.map(load_url, urls)

print "Elapsed Time: %ss" % (time.time() - start)
abarnert answered 2020-02-09T03:21:50Z
1 votes
I am now posting a different solution: making the worker threads non-daemon and joining them to the main thread (which means blocking the main thread until all worker threads have finished), instead of notifying the end of each worker thread with a callback to a global function (as I did in my previous answer), since, as was pointed out in some comments, that way is not thread-safe.
import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

class FetchUrl(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url

    def run(self):
        urlHandler = urllib2.urlopen(self.url)
        html = urlHandler.read()
        print "'%s' fetched in %ss" % (self.url, (time.time() - start))

for url in urls:
    FetchUrl(url).start()

# Join all existing threads to main thread.
for thread in threading.enumerate():
    if thread is not threading.currentThread():
        thread.join()

print "Elapsed Time: %s" % (time.time() - start)
Daniele B answered 2020-02-09T03:22:11Z
-1 votes
This script fetches the content from a set of URLs defined in an array. It spawns a thread for each URL to fetch, so it is meant to be used for a limited set of URLs.
Instead of using a queue object, each thread notifies its end with a callback to a global function, which keeps a count of the number of threads still running.
import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]
left_to_fetch = len(urls)

class FetchUrl(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        # Threads are left non-daemon (the default) so the process
        # stays alive until every fetch has called back.
        self.url = url

    def run(self):
        urlHandler = urllib2.urlopen(self.url)
        html = urlHandler.read()
        finished_fetch_url(self.url)

def finished_fetch_url(url):
    "callback function called when a FetchUrl thread ends"
    print "\"%s\" fetched in %ss" % (url, (time.time() - start))
    global left_to_fetch
    left_to_fetch -= 1
    if left_to_fetch == 0:
        # all urls have been fetched
        print "Elapsed Time: %ss" % (time.time() - start)

# spawn a FetchUrl thread for each url to fetch
for url in urls:
    FetchUrl(url).start()
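As the joined-threads answer above notes, the unsynchronized left_to_fetch counter is the part that is not thread-safe: two threads can read and decrement it at the same time. A minimal sketch of guarding it with a lock, as a hypothetical fix rather than part of the original answer:

counter_lock = threading.Lock()  # hypothetical guard for the shared counter

def finished_fetch_url(url):
    "thread-safe variant of the callback"
    print "\"%s\" fetched in %ss" % (url, (time.time() - start))
    global left_to_fetch
    with counter_lock:
        left_to_fetch -= 1
        done = (left_to_fetch == 0)
    if done:
        print "Elapsed Time: %ss" % (time.time() - start)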
Daniele B answered 2020-02-09T03:22:35Z