Crawler--09: Anti-Crawling Mechanisms
1. UA-based anti-crawling
- Rotating the user agent is one of the most important counter-anti-crawling strategies.
- user-agent: the User-Agent header that identifies the client making the request
- The fake_useragent module generates random real-world user-agent strings.
- Install the fake_useragent module:

```
pip install fake_useragent
```

```python
from fake_useragent import UserAgent

ua = UserAgent()
s = ua.random  # a random user-agent string, different on each call
print(s)
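```

In practice, the random user agent is passed to requests through the headers parameter. A minimal sketch, using httpbin.org/ip (which also appears later in these notes) purely as a test target:

```python
import requests
from fake_useragent import UserAgent

# Send the request with a freshly randomized User-Agent header.
headers = {'user-agent': UserAgent().random}
res = requests.get('http://www.httpbin.org/ip', headers=headers)
print(res.status_code)
```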
2. IP-based anti-crawling
1) Related websites
- a site that returns your current IP address
- a site for checking your current IP address
- a reasonably good IP proxy platform
2) IP blocking
- Some websites track how many times an IP visits within a given time window; if the frequency is too high, the current IP gets banned.
- Solution: use proxy IPs.
- The requests module has a proxies parameter for setting a proxy IP (see the examples below).
- The free IPs offered on proxy sites are mostly unreliable.
- Paid proxy IPs: a usage example follows further below.
- On Windows, run ipconfig in cmd to view the machine's internal (LAN) IP address.
- Check your outgoing IP: use the IPIP website to see the public IP you browse with.
- Recommended proxy website
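The first script makes 20 requests, each through a proxy picked at random from a hand-collected list of free IPs; http://www.httpbin.org/ip echoes back the IP it sees, so the output shows whether the proxy actually took effect: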
```python
import random
import requests

# Hand-collected free proxy IPs in 'host:port' form (free proxies expire quickly).
ips = [
    '223.240.244.48:23564',
    '121.233.226.191:5412',
    '114.100.3.87:766',
    '58.219.59.76:5412',
    '180.113.10.47:894',
    '27.40.111.110:36410',
    '42.56.3.242:766',
    '180.113.12.163:5412',
    '113.237.243.46:3617',
    '58.219.59.129:36410',
    '223.240.242.44:5412',
    '117.60.239.133:5412',
    '114.97.199.48:3617',
    '163.179.204.157:3617',
    '180.125.97.143:894',
    '60.174.190.152:23564',
    '49.86.177.230:36410',
    '182.101.237.158:5412',
    '114.98.139.136:23564',
    '114.225.241.237:23564',
]

url = 'http://www.httpbin.org/ip'
for i in range(20):
    try:
        # Short timeout so dead proxies are skipped quickly.
        ip = random.choice(ips)
        res = requests.get(url, proxies={'http': 'http://' + ip}, timeout=0.5)
        print(res.text)
    except Exception as e:
        print('Error!', e)
```
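Paid proxies usually require authentication. With requests, the credentials are embedded in the proxy URL in user:password@host:port form; the account below is the one from the original notes: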
```python
import requests

url = 'http://www.httpbin.org/ip'
# Authenticated paid proxy: credentials embedded as user:password@host:port.
proxies = {
    'http': 'http://1550023517:[email protected]:16817',
    'https': 'http://1550023517:[email protected]:16817'
}
result = requests.get(url, proxies=proxies)
print(result.text)
```
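A more systematic approach is to wrap the provider's extraction API in a small proxy-pool class that downloads a batch of proxies and weeds out the dead ones. The class below uses the Kuaidaili (kdlapi.com) extraction API from the notes and tests each proxy against Baidu: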
```python
import requests

class ProxyPool():
    def __init__(self):
        # Extraction API of the paid provider (kdlapi.com); returns one proxy per line.
        self.proxy_url = 'http://dev.kdlapi.com/api/getproxy/?orderid=992045485987175&num=100&protocol=2&method=1&an_ha=1&sep=2'
        self.test_url = 'https://www.baidu.com/'
        self.headers = {
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
        }

    def get_proxy_pool(self):
        # Fetch the raw proxy list and test each entry.
        html = requests.get(url=self.proxy_url, headers=self.headers).text
        proxy_list = html.split('\n')
        for proxy in proxy_list:
            self.test_proxy(proxy)

    def test_proxy(self, proxy):
        proxies = {
            'http': 'http://{}'.format(proxy),
            'https': 'https://{}'.format(proxy)
        }
        # The original notes break off here; the rest is a plausible reconstruction:
        # request the test URL through the proxy and keep it only if that succeeds.
        try:
            res = requests.get(self.test_url, proxies=proxies, headers=self.headers, timeout=5)
            if res.status_code == 200:
                print('usable proxy:', proxy)
        except Exception:
            print('unusable proxy:', proxy)
```
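Assuming the order ID in proxy_url is still valid, the pool can then be exercised with:

```python
if __name__ == '__main__':
    ProxyPool().get_proxy_pool()
```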