Python urllib2.urlopen() is slow, need a better way to read several URLs...

As the title suggests, I'm working on a site written in Python and it makes several calls to the urllib2 module to read websites. I then parse them with BeautifulSoup.

As I have to read 5-10 sites, the page takes a while to load.

I'm just wondering if there's a way to read the sites all at once? Or any tricks to make it faster, like should I close the urllib2.urlopen after each read, or keep it open?

Added: also, if I were to just switch over to PHP, would that be faster for fetching and parsing HTML and XML files from other sites? I just want it to load faster, as opposed to the ~20 seconds it currently takes.

Solution

I'm rewriting Dumb Guy's code below using modern Python modules like threading and Queue.

import threading, urllib2
import Queue

urls_to_load = [
    'http://stackoverflow.com/',
    'http://slashdot.org/',
    'http://www.archive.org/',
    'http://www.yahoo.co.jp/',
]

def read_url(url, queue):
    # Fetch one URL and push the raw page body onto the shared queue.
    data = urllib2.urlopen(url).read()
    print('Fetched %s from %s' % (len(data), url))
    queue.put(data)

def fetch_parallel():
    # Start one thread per URL; each thread blocks on network I/O,
    # so the downloads overlap instead of waiting on each other.
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result))
               for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def fetch_sequencial():
    # Baseline: fetch the same URLs one after another.
    result = Queue.Queue()
    for url in urls_to_load:
        read_url(url, result)
    return result
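For reference, here is a minimal usage sketch (not part of the original answer) showing how you might drain the Queue returned by fetch_parallel() and hand each page to BeautifulSoup, as in the question. It assumes the old BeautifulSoup 3 import that was current at the time; with bs4 the import would be from bs4 import BeautifulSoup.

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; on newer setups: from bs4 import BeautifulSoup

if __name__ == '__main__':
    # All threads have been joined inside fetch_parallel(), so the
    # queue already holds every fetched page.
    pages = fetch_parallel()
    while not pages.empty():
        html = pages.get_nowait()
        soup = BeautifulSoup(html)
        print(soup.title)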

Best time for fetch_sequencial() is 2s. Best time for fetch_parallel() is 0.9s.

Also it is incorrect to say threads are useless in Python because of the GIL. This is one of those cases where threads are useful in Python, because the threads are blocked on I/O. As you can see from my results, the parallel case is about 2 times faster.
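If you want to reproduce the comparison yourself, a rough timing sketch (not from the original answer; absolute numbers will depend on your network) could look like this:

import time

start = time.time()
fetch_sequencial()
print('sequential: %.2fs' % (time.time() - start))

start = time.time()
fetch_parallel()
print('parallel:   %.2fs' % (time.time() - start))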
