Typically, we fetch a resource from the network with code like this:
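import cStringIO
import traceback
import urllib2

try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request, timeout=10)
except Exception:
    print traceback.format_exc()
else:
    # Read the body only if the request itself succeeded
    content = cStringIO.StringIO(response.read())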
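However, resources on the network are unpredictable. For example, I may want to fetch an image file, but the server hits an internal error and returns a plain string instead, or it returns an error code.
So before reading the content, we need to check that the response is in the state we expect. For example: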
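import cStringIO
import traceback
import urllib2

response = None
content = None
try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request, timeout=10)
except Exception:
    print traceback.format_exc()

if response is None:
    # The request failed, so there is nothing to inspect or read
    print "request failed"
# Check that the status code is a normal (2xx) one
elif response.code < 200 or response.code >= 300:
    print "some error found"
# Check that the content type is an image type
elif response.headers.type is None or response.headers.type.find("image") == -1:
    print "not an image"
else:
    print "this is an image resource"
    # Only now read the resource
    content = cStringIO.StringIO(response.read())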
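The checks above can also be packed into a small reusable helper. The following is only a minimal sketch, ported to Python 3 on the assumption that urllib.request stands in for urllib2 and io.BytesIO for cStringIO. The names looks_like_image and fetch_image are hypothetical and do not come from the code above.

```python
import http.client
import io
import urllib.request


def looks_like_image(status, content_type):
    """Return True if the status is 2xx and the Content-Type is image/*."""
    if status is None or not 200 <= status < 300:
        return False
    return content_type is not None and content_type.lower().startswith("image/")


def fetch_image(url, timeout=10):
    """Return the response body as a BytesIO, or None if any check fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            content_type = response.headers.get_content_type()
            if not looks_like_image(response.status, content_type):
                return None
            # Read the body only after both checks have passed
            return io.BytesIO(response.read())
    except (OSError, ValueError, http.client.HTTPException):
        # URLError/HTTPError and timeouts are OSError subclasses;
        # a malformed URL raises ValueError
        return None
```

A caller would then write content = fetch_image(url) and treat a None result as "not a usable image". Keeping the status and content-type check in its own function makes it possible to test that logic without touching the network.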