1. Last time we left off at timeout handling, so let's pick up from there.
import urllib.request
import urllib.error

try:
    # In real code the timeout is usually 3 or 5 seconds or even longer; as soon as
    # the request takes more than the time you set, execution jumps to the except branch
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.01)
    print(response.read().decode("utf-8"))
except urllib.error.URLError as e:
    print("time out")
Handled this way, the program stays short, clear, and easy to follow.
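One caveat: the except branch above also fires for errors that are not timeouts at all (DNS failures, refused connections). A small helper can tell the cases apart, since urlopen wraps the underlying socket.timeout in the error's reason attribute. This is a sketch; the name is_timeout is my own:

```python
import socket
import urllib.error

def is_timeout(err):
    """True when a URLError was caused by the request timing out."""
    # urlopen stores the underlying exception in err.reason;
    # a timed-out request carries a socket.timeout there
    return isinstance(err.reason, socket.timeout)
```

With this, the except branch could print "time out" only for genuine timeouts and re-raise anything else.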
2. Status codes
Let's request Baidu:
response = urllib.request.urlopen("http://www.baidu.com")
print(response.status)
Here is the output:
D:\program\python\python\python.exe D:/program/python/douban/test/testUrllib.py
200
Process finished with exit code 0
As you can see, the run returned a status code of 200, which means the server delivered the page successfully. Let's try another one:
response = urllib.request.urlopen("http://douban.com")
print(response.status)
D:\program\python\python\python.exe D:/program/python/douban/test/testUrllib.py
Traceback (most recent call last):
  File "D:/program/python/douban/test/testUrllib.py", line 31, in <module>
    response = urllib.request.urlopen("http://douban.com")
  File "D:\program\python\python\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "D:\program\python\python\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "D:\program\python\python\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "D:\program\python\python\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "D:\program\python\python\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "D:\program\python\python\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 418:
Process finished with exit code 1
It raised an error with status code 418. A 418 ("I'm a teapot") here means the site's anti-crawler mechanism has spotted us.
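Rather than letting the 418 crash the script, the HTTPError can be caught: it carries the status code the server sent back in its code attribute. A minimal sketch (the helper name status_of is my own):

```python
import urllib.error
import urllib.request

def status_of(url):
    """Return the HTTP status code, whether the request succeeds or fails."""
    try:
        return urllib.request.urlopen(url).status
    except urllib.error.HTTPError as e:
        # The server did answer, just with an error status (e.g. 418)
        return e.code
```

status_of("http://douban.com") would then return 418 instead of raising the traceback above.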
3. Using the response's getheaders() method to get the header information. We'll stick with Baidu for the example, because Douban would detect us.
response = urllib.request.urlopen("http://www.baidu.com")
print(response.getheaders())
The output:
D:\program\python\python\python.exe D:/program/python/douban/test/testUrllib.py
[('Bdpagetype', '1'), ('Bdqid', '0x8f297c2c000029fb'), ('Cache-Control', 'private'), ('Content-Type', 'text/html;charset=utf-8'), ('Date', 'Fri, 10 Jun 2022 08:18:55 GMT'), ('Expires', 'Fri, 10 Jun 2022 08:18:16 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BAIDUID=3503CD0BB7BBA79CB7D355C2B99F78CA:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BIDUPSID=3503CD0BB7BBA79CB7D355C2B99F78CA; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1654849135; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BAIDUID=3503CD0BB7BBA79C8687F503C24A4E78:FG=1; max-age=31536000; expires=Sat, 10-Jun-23 08:18:55 GMT; domain=.baidu.com; path=/; version=1; comment=bd'), ('Set-Cookie', 'BDSVRTM=0; path=/'), ('Set-Cookie', 'BD_HOME=1; path=/'), ('Set-Cookie', 'H_PS_PSSID=36559_36597_36455_31253_36421_36166_36570_36531_36519_26350_36469; path=/; domain=.baidu.com'), ('Traceid', '1654849135065809204210315912949889247739'), ('Vary', 'Accept-Encoding'), ('Vary', 'Accept-Encoding'), ('X-Frame-Options', 'sameorigin'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'), ('Connection', 'close'), ('Transfer-Encoding', 'chunked')]
Process finished with exit code 0
You can see it returns a long list of key-value pairs. You can also find these in the browser itself: open the Baidu page, press F12, refresh, stop loading, and look at the response headers of the request. Comparing the two, they match almost exactly.
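Alongside getheaders(), the response object also offers getheader("Name") to look up a single header value. A small sketch (the content_type helper is my own; any header name works the same way):

```python
import urllib.request

def content_type(url):
    """Return the Content-Type header of a response, or None if absent."""
    with urllib.request.urlopen(url) as resp:
        # getheader() looks up one header; getheaders() returns all pairs
        return resp.getheader("Content-Type")
```

For the Baidu page above, this would pick out just the ('Content-Type', 'text/html;charset=utf-8') entry from the list.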
4. Fetching data by imitating a browser
First, let's run a simple test against http://httpbin.org:
import urllib.parse
import urllib.request

url = "http://httpbin.org/post"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36 Edg/101.0.1210.39"
}  # disguise the request as coming from a browser by wrapping it in browser-style headers
data = bytes(urllib.parse.urlencode({'name': 'eric'}), encoding="utf-8")
req = urllib.request.Request(url=url, data=data, headers=headers, method="POST")
response = urllib.request.urlopen(req)
print(response.read().decode("utf-8"))
Here is the output:
D:\program\python\python\python.exe D:/program/python/douban/test/testUrllib.py
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "eric"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "9",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36 Edg/101.0.1210.39",
    "X-Amzn-Trace-Id": "Root=1-62a3040e-524eae5153f5b30e6b324dd4"
  },
  "json": null,
  "origin": "49.78.13.48",
  "url": "http://httpbin.org/post"
}
Process finished with exit code 0
Unlike before, the request we sent now looks much more like it came from a browser.
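Since httpbin echoes the request back as JSON, the response body can be parsed with the json module to check exactly what the server saw. A sketch wrapping the POST above (the post_form name is my own):

```python
import json
import urllib.parse
import urllib.request

def post_form(url, fields):
    """POST fields as application/x-www-form-urlencoded and parse the JSON reply."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    req = urllib.request.Request(url, data=data, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Against http://httpbin.org/post, post_form(url, {'name': 'eric'})['form'] should come back as {'name': 'eric'}, matching the "form" block in the output above.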
Now for the real thing:
url = "http://douban.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36 Edg/102.0.1245.33"
}
req = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(req)
print(response.read().decode("utf-8"))
The output is too long to paste here, but the script runs fine with no errors.
So, as you can see, if we want to scrape Douban, this simple disguise is enough to make Douban treat us as a normal browser, and the page can be fetched normally.
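The disguise pattern can be wrapped in one small helper so every request in a scraper carries the same browser User-Agent. A sketch reusing the User-Agent string above (the get_html name is my own):

```python
import urllib.request

UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36 Edg/102.0.1245.33")

def get_html(url, user_agent=UA):
    """GET a page with a browser-like User-Agent and return the decoded body."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

With this, fetching any page of the site becomes a one-liner like get_html("http://douban.com").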
And that's a wrap!!