Python Crawler: Scraping Zhihu

Latest recommended article published 2023-05-31 09:03:56

weixin_34331102

Tags: crawler, python

Original link: https://my.oschina.net/jack088/blog/3050285

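1. A first attempt at scraping Zhihu
import requests
url = 'http://www.zhihu.com/'
# Send a plain GET request with no custom headers and read the body as text
res = requests.get(url).text
print(res)
Result:

A direct request like this comes back with a 400 error: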
E:\360Downloads\Python36\python3.exe E:/work/yansong/python1/zhihuClimbInsect/zhihu_Spider.py
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>

This happens because Zhihu applies anti-crawler measures and rejects the request.
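To see what such a request looks like from the server's side, the sketch below prints the User-Agent that requests sends when none is supplied, without contacting any server. That this header is what Zhihu's check keys on is an assumption; the post only says that anti-crawler measures are in place.

```python
import requests

# The User-Agent string requests uses when the caller supplies none.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # has the form python-requests/<version>

# Prepare (but do not send) a request so the outgoing headers can be inspected.
session = requests.Session()
prepared = session.prepare_request(requests.Request("GET", "http://www.zhihu.com/"))
print(prepared.headers["User-Agent"])
```

A header of this form openly identifies the client as a script rather than a browser.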

2. Scraping while posing as a browser:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
url = 'http://www.zhihu.com/'
res = requests.get(url,headers=headers).text
print(res)
Run it again and the data is returned successfully. However, this approach does not work for every website.
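Since a browser-like header is not guaranteed to work everywhere, it helps to check the outcome of each request instead of printing the body blindly. A minimal sketch, assuming a hypothetical helper named fetch_html (not part of the original post) and an arbitrarily chosen 10-second timeout:

```python
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

def fetch_html(url, headers=HEADERS, timeout=10):
    """Return the response body, or None if the request fails or is rejected.

    fetch_html is a hypothetical helper written for this example.
    """
    try:
        res = requests.get(url, headers=headers, timeout=timeout)
        res.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, and HTTP error statuses alike.
        print(f"request to {url} failed: {exc}")
        return None
    return res.text
```

Calling fetch_html('http://www.zhihu.com/') then yields either the page text or None, so a rejected request is caught instead of being mistaken for real page content.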

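3. Scraping through a proxy
If the same IP keeps crawling the same website, the server will eventually block it. The fix is to switch IPs by routing requests through a proxy server, so that the target site sees the proxy server's IP address rather than our real one. Xici Proxy (西刺代理, http://www.xicidaili.com/nn/) lists many usable domestic (Chinese) IPs, and Yun Proxy (云代理, http://www.ip3366.net/) lists many foreign IPs that can be used directly.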

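How do we add a proxy to the crawler? According to the official requests documentation (http://docs.python-requests.org/zh_CN/latest/user/advanced.html#proxies), if you need to use a proxy, you can configure an individual request by passing the proxies argument to any request method: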

import requests
# Example proxy addresses; free proxies expire quickly, so substitute live ones
proxies = {
    "http": "http://61.135.217.7:80",
    "https": "https://118.190.95.26:9001",
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
url = 'http://www.zhihu.com/'
# Route this single request through the proxy that matches the URL scheme
res = requests.get(url, headers=headers, proxies=proxies).text
print(res)
print(len(res))
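When many requests share the same headers and proxy, they can also be attached to a requests.Session once instead of being repeated on every call. A small sketch that only prepares a request and sends nothing; the proxy address is the example one from this post and has very likely expired:

```python
import requests

# Example proxy address from the post; assume it no longer works.
PROXIES = {"http": "http://61.135.217.7:80"}
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

session = requests.Session()
session.headers.update(HEADERS)  # sent with every request made via this session
session.proxies.update(PROXIES)  # likewise applied to requests made via this session

# Inspect the outgoing request without sending it.
prepared = session.prepare_request(requests.Request("GET", "http://www.zhihu.com/"))
print(prepared.headers["User-Agent"])
print(session.proxies)
```

The requests documentation cautions that proxy settings taken from environment variables can override session.proxies, so passing proxies= on each request is the more predictable choice when such variables are set.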
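If an exception occurs while crawling a site through a proxy server, consider whether the proxy IP has expired. One remedy is to write a crawler that fetches the latest proxy IPs in real time.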
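Handling an expired proxy can be sketched as a simple failover loop: try each candidate in turn and skip any that raise. The helper name fetch_via_proxies, the 5-second timeout, and the host:port address format are assumptions made for this example, not something the original post specifies.

```python
import requests

def fetch_via_proxies(url, proxy_addresses, headers=None, timeout=5):
    """Try each proxy in order; return (text, address) for the first that works.

    proxy_addresses is a list of "host:port" strings (hypothetical format).
    Returns (None, None) if every proxy fails or the list is empty.
    """
    for address in proxy_addresses:
        proxy_url = f"http://{address}"
        proxies = {"http": proxy_url, "https": proxy_url}
        try:
            res = requests.get(url, headers=headers, proxies=proxies, timeout=timeout)
            res.raise_for_status()
        except requests.RequestException:
            # Proxy is dead, too slow, or the request was rejected: try the next one.
            continue
        return res.text, address
    return None, None
```

A caller would pass the freshly scraped proxy list as proxy_addresses and treat a (None, None) result as the signal to refresh that list.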

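For fetching the latest proxy IPs in real time with Python, see the post "Fetching the Latest Proxy IPs in Real Time with Python" (Python实时抓取最新的代理IP).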

Reposted from: https://my.oschina.net/jack088/blog/3050285