1. Downloading a URL with the urllib2 module
import urllib2

def download(url):
    # Fetch the page and return its raw content
    return urllib2.urlopen(url).read()
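2. Catching exceptions
When a download error occurs, this version of the function catches the exception and returns None.
import urllib2

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        # Report the failure and signal it to the caller with None
        print 'Download error:', e.reason
        html = None
    return html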
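For readers on Python 3, the same idea can be sketched as follows. This is a minimal sketch that assumes Python 3, where urllib2 was split into urllib.request and urllib.error; it is not the original author's code.

```python
import urllib.error
import urllib.request


def download(url):
    """Return the body at url as bytes, or None if the download fails."""
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:
        # URLError (and its subclass HTTPError) exposes the cause as .reason
        print('Download error:', e.reason)
        html = None
    return html
```

Because a failed download returns None rather than raising, callers should check the result for None before trying to parse it.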
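3. Retrying downloads
4xx errors occur when there is a problem with the request, whereas 5xx errors occur when there is a problem on the server side. Repeating a bad request will not fix it, but a server-side problem may be temporary, so download only needs to retry when a 5xx error occurs. Below is a new version of the code that supports retrying.
import urllib2

def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            # Only HTTP errors carry a code; retry on 5xx server errors only
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1)
    return html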
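The retry rule can be exercised without a network connection. The sketch below assumes Python 3 (urllib.request and urllib.error instead of urllib2) and adds a hypothetical opener parameter, not present in the original, so that a fake opener can stand in for the real one.

```python
import urllib.error
import urllib.request


def download(url, num_retries=2, opener=urllib.request.urlopen):
    """Return the body at url, retrying up to num_retries times on 5xx errors.

    opener is a hypothetical hook (an assumption, not in the original code)
    that lets a fake opener replace urlopen for testing.
    """
    print('Downloading:', url)
    try:
        html = opener(url).read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        html = None
        # Only HTTPError has .code; a plain URLError (e.g. DNS failure) is not retried
        if num_retries > 0 and 500 <= getattr(e, 'code', 0) < 600:
            return download(url, num_retries - 1, opener)
    return html
```

With the default num_retries=2, a URL that keeps returning a 5xx status is requested at most three times in total, while a 4xx response is requested once and yields None immediately.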
4. Setting a user agent
Set a default user agent, 'wswp', and send it in the request headers.
import urllib2

def download(url, user_agent='wswp', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Pass user_agent along so retries use the same header
                return download(url, user_agent, num_retries - 1)
    return html
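To confirm that the header is actually attached before anything is sent, the request can be inspected on its own. This is a small sketch assuming Python 3's urllib.request; build_request is a hypothetical helper name introduced only for illustration.

```python
import urllib.request


def build_request(url, user_agent='wswp'):
    """Return a Request for url that carries the given User-agent header.

    build_request is a hypothetical helper, not part of the original code.
    """
    headers = {'User-agent': user_agent}
    return urllib.request.Request(url, headers=headers)


# Building a Request does not open a connection, so this runs offline
request = build_request('http://example.com')
print(request.get_header('User-agent'))  # prints: wswp
```

Passing a different user_agent argument overrides the default, which is useful when a site blocks or rate-limits a particular agent string.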