Python - very simple multithreaded parallel URL fetching (without a queue)
I spent a whole day looking for the simplest possible multithreaded URL fetcher in Python, but most scripts I found use queues, multiprocessing, or complex libraries.
Eventually I wrote one myself, which I am posting as an answer. Please feel free to suggest any improvements.
I figure other people might be looking for something similar.
5 Answers
43 votes
Simplifying your original version as far as possible:
import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

def fetch_url(url):
    urlHandler = urllib2.urlopen(url)
    html = urlHandler.read()
    print "'%s' fetched in %ss" % (url, (time.time() - start))

threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print "Elapsed Time: %s" % (time.time() - start)
The only new tricks here are:
Keep track of the threads you create.
Don't bother with a counter of threads if you just want to know when they're all done; join already tells you that.
If you don't need any state or an external API, you don't need a Thread subclass, just a target function.
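The same pattern works unchanged on Python 3; here is a minimal sketch, my addition rather than part of the answer, assuming urllib.request in place of urllib2:

import threading
import time
from urllib.request import urlopen  # Python 3 home of urlopen

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com"]

def fetch_url(url):
    # Plain target function, no Thread subclass needed.
    html = urlopen(url).read()
    print("'%s' fetched in %ss" % (url, time.time() - start))

threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # join is what tells you the threads are all done

print("Elapsed Time: %s" % (time.time() - start))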
abarnert answered 2020-02-09T03:20:34Z
29 votes
multiprocessing has a thread pool that doesn't start other processes:
#!/usr/bin/env python
from multiprocessing.pool import ThreadPool
from time import time as timer
from urllib2 import urlopen

urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

def fetch_url(url):
    try:
        response = urlopen(url)
        return url, response.read(), None
    except Exception as e:
        return url, None, e

start = timer()
results = ThreadPool(20).imap_unordered(fetch_url, urls)
for url, html, error in results:
    if error is None:
        print("%r fetched in %ss" % (url, timer() - start))
    else:
        print("error fetching %r: %s" % (url, error))
print("Elapsed Time: %s" % (timer() - start,))
Advantages compared to the Thread-based solution:
ThreadPool allows limiting the maximum number of concurrent connections (20 in the code example)
the output is not garbled because all output is in the main thread
errors are logged
the code works on both Python 2 and 3 without changes (assuming from urllib.request import urlopen on Python 3); see the sketch after this list.
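A minimal sketch of the import shim that last point assumes (my addition, not part of the original answer):

try:
    from urllib2 import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen   # Python 3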
jfs answered 2020-02-09T03:21:16Z
14 votes
The main example in the concurrent.futures docs does everything you're looking for, a lot more simply. Plus, it can handle huge numbers of URLs by only doing 5 at a time, and it handles errors much more nicely.
Of course, the module is only built in with Python 3.2 or later… but if you're using 2.5-3.1, you can install the backport, futures, from PyPI. All you need to change from the example code is to search-and-replace concurrent.futures with futures and, for 2.x, urllib.request with urllib2.
Here's the example backported to 2.x, modified to use your URL list and to add timings:
import concurrent.futures
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    conn = urllib2.urlopen(url, timeout=timeout)
    return conn.read()  # urllib2 responses have read(), not readall()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in urls}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print '%r generated an exception: %s' % (url, exc)
        else:
            print '"%s" fetched in %ss' % (url, (time.time() - start))

print "Elapsed Time: %ss" % (time.time() - start)
But you can make this even simpler. Really, all you need is:
def load_url(url):
    conn = urllib2.urlopen(url, timeout=60)  # timeout must be passed by keyword
    data = conn.read()
    print '"%s" fetched in %ss' % (url, (time.time() - start))
    return data

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    pages = executor.map(load_url, urls)

print "Elapsed Time: %ss" % (time.time() - start)
abarnert answered 2020-02-09T03:21:50Z
1 votes
I am now posting a different solution: making the worker threads non-daemon and joining them to the main thread (which means blocking the main thread until all worker threads have finished), instead of notifying the end of each worker thread with a callback to a global function (as I did in my previous answer), since, as was pointed out in some comments, that way is not thread-safe.
import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

class FetchUrl(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url

    def run(self):
        urlHandler = urllib2.urlopen(self.url)
        html = urlHandler.read()
        print "'%s' fetched in %ss" % (self.url, (time.time() - start))

for url in urls:
    FetchUrl(url).start()

# Join all existing threads to main thread.
for thread in threading.enumerate():
    if thread is not threading.currentThread():
        thread.join()

print "Elapsed Time: %s" % (time.time() - start)
Daniele B answered 2020-02-09T03:22:11Z
-1 votes
This script fetches the content from a set of URLs defined in an array. It spawns a thread for each URL to fetch, so it is meant to be used for a limited set of URLs.
Instead of using a queue object, each thread notifies its end with a callback to a global function, which keeps a count of the number of threads still running.
import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]
left_to_fetch = len(urls)

class FetchUrl(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        # Threads are left non-daemon (the default) so the process
        # stays alive until every fetch has called back.
        self.url = url

    def run(self):
        urlHandler = urllib2.urlopen(self.url)
        html = urlHandler.read()
        finished_fetch_url(self.url)

def finished_fetch_url(url):
    "callback function called when a FetchUrl thread ends"
    print "\"%s\" fetched in %ss" % (url, (time.time() - start))
    global left_to_fetch
    left_to_fetch -= 1
    if left_to_fetch == 0:
        # all urls have been fetched
        print "Elapsed Time: %ss" % (time.time() - start)

# spawn a FetchUrl thread for each url to fetch
for url in urls:
    FetchUrl(url).start()
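As the joined-threads answer above notes, the unsynchronized left_to_fetch counter is the part that is not thread-safe: two threads can read and decrement it at the same time. A minimal sketch of guarding it with a lock, as a hypothetical fix rather than part of the original answer:

counter_lock = threading.Lock()  # hypothetical guard for the shared counter

def finished_fetch_url(url):
    "thread-safe variant of the callback"
    print "\"%s\" fetched in %ss" % (url, (time.time() - start))
    global left_to_fetch
    with counter_lock:
        left_to_fetch -= 1
        done = (left_to_fetch == 0)
    if done:
        print "Elapsed Time: %ss" % (time.time() - start)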
Daniele B answered 2020-02-09T03:22:35Z