When crawling the Baidu homepage, requests.get() succeeds even with nothing but a url argument; the status code is [200]:
>>> import requests
>>> url='https://www.baidu.com/'
>>> r=requests.get(url)
>>> r
<Response [200]>
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç\x99¾åº¦ä¸\x80ä¸\x8b class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ\x96°é\x97»</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>å\x9c°å\x9b¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§\x86é¢\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å\x90§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç\x99»å½\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">ç\x99»å½\x95</a>\');\r\n </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ\x9b´å¤\x9a产å\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å\x85³äº\x8eç\x99¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使ç\x94¨ç\x99¾åº¦å\x89\x8då¿\x85读</a> <a href=http://jianyi.baidu.com/ class=cp-feedback>æ\x84\x8fè§\x81å\x8f\x8dé¦\x88</a> 京ICPè¯\x81030173å\x8f· <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
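Incidentally, the garbled title (ç\x99¾åº¦ä¸\x80ä¸\x8b and similar) is an encoding artifact, not a failed request: the server does not declare a charset in its HTTP headers, so requests falls back to ISO-8859-1 even though the page bytes are actually UTF-8. A minimal fix, sketched in the same session:
>>> r.encoding                 # requests' guess, based on the HTTP headers alone
'ISO-8859-1'
>>> r.encoding='utf-8'         # or r.encoding=r.apparent_encoding; r.text now decodes correctly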
However, when crawling some other sites, the same call may not get a proper response, as shown below.
>>> import requests
>>> url='https://bj.lianjia.com/zufang/'
>>> r=requests.get(url)
>>> r
<Response [403]>
>>> r.text
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\r\n<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<h1>403 Forbidden</h1>\r\n<p>You don\'t have permission to access the URL on this server. Sorry for the inconvenience.<br/>\r\nPlease report this message and include the following information to us.<br/>\r\nThank you very much!</p>\r\n<table>\r\n<tr>\r\n<td>URL:</td>\r\n<td>https://bj.lianjia.com/zufang/</td>\r\n</tr>\r\n<tr>\r\n<td>Server:</td>\r\n<td>proxy17-online.mars.ljnode.com</td>\r\n</tr>\r\n<tr>\r\n<td>Date:</td>\r\n<td>2019/07/03 01:00:14</td>\r\n</tr>\r\n</table>\r\n<hr/>Powered by Lianjia</body>\r\n</html>\r\n'
>>> from bs4 import BeautifulSoup
>>> bs=BeautifulSoup(r.text,'html.parser')
>>> bs
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<h1>403 Forbidden</h1>
<p>You don't have permission to access the URL on this server. Sorry for the inconvenience.<br/>
Please report this message and include the following information to us.<br/>
Thank you very much!</p>
<table>
<tr>
<td>URL:</td>
<td>https://bj.lianjia.com/zufang/</td>
</tr>
<tr>
<td>Server:</td>
<td>proxy17-online.mars.ljnode.com</td>
</tr>
<tr>
<td>Date:</td>
<td>2019/07/03 01:00:14</td>
</tr>
</table>
<hr/>Powered by Lianjia</body>
</html>
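A script should not eyeball the Response object the way we do here; it can test the status code directly, or ask requests to raise an exception on failure. A quick sketch against the 403 response above:
>>> r.status_code
403
>>> r.ok                       # False for any 4xx/5xx status
False
>>> r.raise_for_status()       # raises requests.exceptions.HTTPError: 403 Client Error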
As the output shows, requests.get() did not return [200]: the site responded 403 Forbidden, refusing access. The fix is to add a headers argument to requests.get() so the request looks like a normal browser visit, sidestepping this simple anti-crawler check. The header values can be found as follows:
The header values come from the browser's developer tools. On the page to be crawled, right-click and choose Inspect, open the Network tab in the panel that appears, and press F5 to refresh so the requests are listed. Select a request, find the User-Agent line under Headers, and copy it into your code. Rewrite it as a dictionary (keys and values are both strings), assign it to a variable named headers, and add headers=headers after the url argument of get(). (If User-Agent alone still does not get a correct response, try adding Cookie and other fields to the dictionary; for simplicity, you can include every field listed under Request Headers.) For example:
>>> headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
>>> r=requests.get(url,headers=headers)
>>> r
<Response [200]>
>>> r=requests.get(url,headers)
>>> r
<Response [403]>
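This positional call still returns [403] because the second positional parameter of requests.get(url, params=None, **kwargs) is params, not headers: the dictionary is serialized into the URL's query string (inspecting r.url would show ?User-Agent=Mozilla%2F5.0... appended), and no extra request header is sent at all.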
>>> r=requests.get(url,h=headers)
Traceback (most recent call last):
File "<pyshell#154>", line 1, in <module>
r=requests.get(url,h=headers)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'h'
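Here the TypeError is raised before any network traffic happens: get() forwards unknown keyword arguments through **kwargs to Session.request(), whose signature defines headers but no parameter named h, so the call is rejected outright. The same happens with header= further below.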
>>> r=requests.get(url,headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'})
>>> r
<Response [200]>
>>> r=requests.get(url,'User-Agent'='Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
SyntaxError: keyword can't be an expression
>>> r=requests.get(url,User-Agent='Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
SyntaxError: keyword can't be an expression
>>> r=requests.get(url,'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
SyntaxError: invalid syntax
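Passing the header as a bare keyword argument cannot work either: a keyword name must be a valid Python identifier, but 'User-Agent' is a string literal and User-Agent parses as the subtraction expression User - Agent, hence the syntax errors. This is precisely why header fields have to be wrapped in a dictionary.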
>>> r=requests.get(url,header=headers)
Traceback (most recent call last):
File "<pyshell#160>", line 1, in <module>
r=requests.get(url,header=headers)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'header'
>>> head={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
>>> r=requests.get(url,header=head)
Traceback (most recent call last):
File "<pyshell#162>", line 1, in <module>
r=requests.get(url,header=head)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'header'
>>> head={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
>>> r=requests.get(url,headers=head)
>>> r
<Response [200]>
These experiments show that the argument must be passed exactly as headers=<dictionary>; no other spelling works. The variable holding the dictionary can have any name (head works just as well as headers), but the keyword itself must be headers.
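Putting the pieces together, here is a minimal sketch of the whole workflow as a script (the User-Agent string is the one captured above; any current browser's value works, and html.parser is one of several parsers BeautifulSoup accepts):

import requests
from bs4 import BeautifulSoup

url='https://bj.lianjia.com/zufang/'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

r=requests.get(url,headers=headers)      # headers= is the only spelling that works
r.raise_for_status()                     # abort with HTTPError on any 4xx/5xx response
bs=BeautifulSoup(r.text,'html.parser')   # parse the HTML we actually received
print(bs.title)                          # sanity check: real page content came back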
Reference: Python——爬虫【Requests设置请求头Headers】