1. Setting headers
Fetching a page with Python's requests.get(url) sometimes returns an empty page. The reason is that the site has detected that your request does not look like it came from a real browser.
Setting the headers parameter to mimic a real browser's request usually solves the problem.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}
response = requests.get(url, headers=headers)
Setting User-Agent alone is usually enough. To find your own User-Agent:
Chrome - F12 (Inspect) - Network tab - press Ctrl+R - click any entry under Name - read the User-Agent from the request headers.
You can also add cookies to the headers, which further lowers the chance of being blocked.
See: cookies
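As a sketch of how a cookie rides along with the request: the snippet below attaches one via a Cookie header on a prepared (unsent) request, so you can inspect exactly what would go on the wire. The cookie string here is a made-up placeholder; on a real site you would copy it from the same DevTools Network panel as the User-Agent.

```python
import requests

# Hypothetical cookie value; a real one is site-specific and usually
# tied to a logged-in browser session.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    'Cookie': 'sessionid=abc123; theme=dark',
}

# Preparing the request without sending it shows the headers that
# requests would actually transmit.
req = requests.Request('GET', 'https://example.com/', headers=headers).prepare()
print(req.headers['Cookie'])
```

The same cookie can also be passed as a dict via the `cookies=` parameter of requests.get; putting it in headers simply mirrors what you copy out of the browser verbatim.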
2. Setting proxies
Sometimes you write scripts that inflate page-view counts. Without proxies set, every request comes from the same IP, which backend monitoring detects easily.
# step 1: obtain a proxies dict
import random

import requests
from bs4 import BeautifulSoup

class get_kuaidaili_ip():  # fetch free proxy IPs from kuaidaili
    # rotate User-Agents to make scraping the proxy list itself harder to block
    def random_agent(self):
        user_agents = [
            "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_2 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8H7 Safari/6533.18.5",
            "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_2 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8H7 Safari/6533.18.5",
            "MQQBrowser/25 (Linux; U; 2.3.3; zh-cn; HTC Desire S Build/GRI40;480*800)",
            "Mozilla/5.0 (Linux; U; Android 2.3.3; zh-cn; HTC_DesireS_S510e Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Mozilla/5.0 (SymbianOS/9.3; U; Series60/3.2 NokiaE75-1 /110.48.125 Profile/MIDP-2.1 Configuration/CLDC-1.1 ) AppleWebKit/413 (KHTML, like Gecko) Safari/413",  # trailing comma was missing here, which silently concatenated two entries into one
            'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
            'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
            'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11'
        ]
        return random.choice(user_agents)

    # scrape the proxy-list page and parse out ip:port pairs
    def get_ip_list(self, url, headers, proxies):
        web_data = requests.get(url, headers=headers, proxies=proxies)
        soup = BeautifulSoup(web_data.text, 'lxml')
        ips = soup.find_all('tr')
        ip_list = []
        for i in range(1, len(ips)):  # skip the table's header row
            tds = ips[i].find_all('td')
            ip_list.append(tds[0].text + ':' + tds[1].text)
        return ip_list

    def get_random_ip(self, ip_list):
        proxy_list = ['http://' + ip for ip in ip_list]
        proxy_ip = random.choice(proxy_list)
        proxies = {'http': proxy_ip}
        return proxies

    def get_one(self, proxies=None):
        # url = 'http://www.xicidaili.com/nn/5'
        url = 'https://www.kuaidaili.com/free/inha/%s/' % random.randint(1, 10)
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
        }
        ip_list = self.get_ip_list(url, headers=headers, proxies=proxies)
        return self.get_random_ip(ip_list)
ip_fetcher = get_kuaidaili_ip()  # renamed from `url` to avoid shadowing the target URL
proxies = ip_fetcher.get_one()
# step 2: use the proxy
response = requests.get(url, headers=headers, proxies=proxies)
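One quirk worth noting about the proxies dict above: with only an 'http' key, requests sends https:// URLs directly, bypassing the proxy. A small sketch (the address below is a placeholder, not a real proxy) that registers the same HTTP proxy under both schemes:

```python
def build_proxies(ip_port):
    """Build a requests-style proxies dict from an 'ip:port' string.

    Registering the proxy under both keys routes http:// and https://
    URLs through it; with only an 'http' key, https traffic would
    bypass the proxy entirely.
    """
    addr = 'http://' + ip_port
    return {'http': addr, 'https': addr}

proxies = build_proxies('203.0.113.5:8080')  # placeholder address
# requests.get(url, headers=headers, proxies=proxies, timeout=5)
```

Free proxies die often, so in practice you would also wrap the request in try/except requests.RequestException, set a timeout, and rotate to another IP from the list on failure.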
3. How crawler techniques escalate
Of the steps in the figure below, the author has already used 1, 2, 3 and 5. The next things to try:
- multiple cookies
- defeating CAPTCHAs
- using selenium
Original source:
从零学爬虫–给微信公众号阅读量作个弊:刷阅读量 (Learning crawlers from scratch: gaming WeChat official-account view counts)