Python urllib2.urlopen() is slow, need a better way to read several URLs...

As the title suggests, I'm working on a site written in Python and it makes several calls to the urllib2 module to read websites. I then parse them with BeautifulSoup.

As I have to read 5-10 sites, the page takes a while to load.

I'm just wondering if there's a way to read the sites all at once? Or any tricks to make it faster, like should I close the urllib2.urlopen after each read, or keep it open?

Added: also, if I were to just switch over to PHP, would that be faster for fetching and parsing HTML and XML files from other sites? I just want it to load faster, as opposed to the ~20 seconds it currently takes.

Solution

I'm rewriting Dumb Guy's code below using modern Python modules like threading and Queue.

import threading, urllib2
import Queue

urls_to_load = [
    'http://stackoverflow.com/',
    'http://slashdot.org/',
    'http://www.archive.org/',
    'http://www.yahoo.co.jp/',
]

def read_url(url, queue):
    # Fetch one URL and push the raw page body onto the shared queue.
    data = urllib2.urlopen(url).read()
    print('Fetched %s from %s' % (len(data), url))
    queue.put(data)

def fetch_parallel():
    # Start one thread per URL; each thread blocks on network I/O,
    # so the downloads overlap instead of waiting on each other.
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result))
               for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def fetch_sequencial():
    # Baseline: fetch the same URLs one after another.
    result = Queue.Queue()
    for url in urls_to_load:
        read_url(url, result)
    return result
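For reference, here is a minimal usage sketch (not part of the original answer) showing how you might drain the Queue returned by fetch_parallel() and hand each page to BeautifulSoup, as in the question. It assumes the old BeautifulSoup 3 import that was current at the time; with bs4 the import would be from bs4 import BeautifulSoup.

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; on newer setups: from bs4 import BeautifulSoup

if __name__ == '__main__':
    # All threads have been joined inside fetch_parallel(), so the
    # queue already holds every fetched page.
    pages = fetch_parallel()
    while not pages.empty():
        html = pages.get_nowait()
        soup = BeautifulSoup(html)
        print(soup.title)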

Best time for fetch_sequencial() is 2s. Best time for fetch_parallel() is 0.9s.

Also it is incorrect to say threads are useless in Python because of the GIL. This is one of those cases where threads are useful in Python, because the threads are blocked on I/O. As you can see from my results, the parallel case is about 2 times faster.
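If you want to reproduce the comparison yourself, a rough timing sketch (not from the original answer; absolute numbers will depend on your network) could look like this:

import time

start = time.time()
fetch_sequencial()
print('sequential: %.2fs' % (time.time() - start))

start = time.time()
fetch_parallel()
print('parallel:   %.2fs' % (time.time() - start))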
