切换代理ip一直是我们在反反爬虫过程中常用的手段,但是目前各大ip代理网站的优质ip的价格都十分高昂,用于个人不太划算。好在有些网站提供免费的ip,经过测试,他们响应速度较为良好。如果我们将他们爬取下来并加以维护,足以满足我们个人的使用。国内主要的ip代理网站有:西刺免费代理ip,快代理免费代理,Proxy360代理,全网代理ip
————————————————————————————————————————————————————————————-
第一次尝试
在第一次爬取西刺免费代理ip网站的时候由于疏忽,直接headers本机暴露了自己的ip,导致自己的ip被封。只怪自己年少无知,没想到提供ip代理支持爬虫工作的网站还有反爬虫的机制,如此傲娇我等佩服,可见后端运维人员也是”风清亮节“,”出淤泥而不染“。
第二次尝试
在有了第一次教训下,我套上了一层代理ip,我内心十分拒绝,为何人与人之间连最基本的信任都没有。
通过对网站的简单分析,我们不难爬取ip数据。代码如下
#!/usr/bin/env python
#-*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup
import pymysql
base_url = "https://www.kuaidaili.com/free/inha/"
headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"}
proxies={'http':'115.223.253.171:8060'}
def reptile_ip(url):
list = []
html = requests.get(url,headers = headers,proxies = proxies)
soup = BeautifulSoup(html.content,"html.parser")
items = soup.find("table",class_ = "table table-bordered table-striped").find_all("tr")
items =items[1:]
for i in items:
book = {}
book["ip"] = i.find_all("td")[0].text
book["post"] = i.find_all("td")[1].text
book["type"] = i.find_all("td")[3].text
book["place"] = i.find_all("td")[4].text
book["response_time"] = i.find_all("td")[5].text
list.append(book)
return list
if __name__ == "__main__":
url = base_url
conn = pymysql.connect("localhost",user="cp328",passwd = "cP+86743175",db ="test")
cursor = conn.cursor()
cursor.execute("drop table if exists ip_proxy")
createtab = """create table ip_proxy(
id integer NOT NULL auto_increment PRIMARY KEY,
ip char(50) not null ,
post char(20) not null ,
type char(20) not null,
place char(50)not null,
response_time char(20) not null)"""
sql = "insert into ip_proxy (ip,post,type,place,response_time) values (%s,%s,%s,%s,%s)"
cursor.execute(createtab)
for i in range(1,10):
print(i)
lists = reptile_ip(url+str(i))
for list in lists:
try:
cursor.execute(sql, (list["ip"], list["post"], list["type"], list["place"], list["response_time"]))
conn.commit()
print(str(list["ip"])+"has been keeped")
except:
conn.rollback()
但是实际上封ip的问题依然没有解决。我们无法爬取大量的ip数据。但是经过分析,与西刺代理不同,快代理的机制只是限制ip的访问次数,并没有对ip进行封禁。我们可以在当我们的IP被限制访问次数的时候切换自己的代理ip就可以了,而与此同时,我们也可以利用已经爬取的ip数据进行访问。
第三次尝试
方法:用他的ip去爬他自己,他山之石可以攻玉。
#!/usr/bin/env python
#-*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup
import pymysql
import time
global ip_num
ip_num = 1
base_url = "https://www.kuaidaili.com/free/inha/"
headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"}
proxies={'http':'115.223.245.117:9000'}
def proxies_switch(url):
print("正在进行ip的切换...")
global ip_num
stutas = False
while stutas ==False: #找到合适的ip——post地址
print("正在验证第%s个ip地址"%(ip_num))
sql = "select ip,post from ip_proxy where id = %s" % (ip_num)
cursor.execute(sql)
items = cursor.fetchall()
ip = items[0][0]
post = items[0][1]
ip_post = ip+":"+post
response = requests.get(url,proxies = {'http':ip_post})
stutas = response.ok
ip_num = ip_num+1
proxies['http'] = ip_post
print("第%个ip地址测试成功"%(ip_num))
time.sleep(0.5) #切换成功之后sleep一秒 防止新的ip_post被封
def reptile_ip(url):
list = []
html = requests.get(url,headers = headers,proxies = proxies)
print("连接情况为",html.ok)
if html.ok == False:
proxies_switch(url)
html = requests.get(url, headers=headers, proxies=proxies)
soup = BeautifulSoup(html.content,"html.parser")
items = soup.find("table",class_ = "table table-bordered table-striped").find_all("tr")
items =items[1:]
for i in items:
book = {}
book["ip"] = i.find_all("td")[0].text
book["post"] = i.find_all("td")[1].text
book["type"] = i.find_all("td")[3].text
book["place"] = i.find_all("td")[4].text
book["response_time"] = i.find_all("td")[5].text
list.append(book)
return list
if __name__ == "__main__":
url = base_url
conn = pymysql.connect("localhost",user="cp328",passwd = "cP+86743175",db ="test")
cursor = conn.cursor()
cursor.execute("drop table if exists ip_proxy")
createtab = """create table ip_proxy(
id integer NOT NULL auto_increment PRIMARY KEY,
ip char(50) not null ,
post char(20) not null ,
type char(20) not null,
place char(50)not null,
response_time char(20) not null)"""
sql = "insert into ip_proxy (ip,post,type,place,response_time) values (%s,%s,%s,%s,%s)"
cursor.execute(createtab)
for i in range(1,70):
print(i)
lists = reptile_ip(url+str(i))
for list in lists:
try:
cursor.execute(sql, (list["ip"], list["post"], list["type"], list["place"], list["response_time"]))
conn.commit()
print(str(list["ip"])+"has been keeped")
except:
conn.rollback()
经过测试,问题基本解决。
下图是保存在数据库的ip代理池,便于下次直接取用
如果读者有一些细节问题,可直接在评论区反馈。
谢谢阅读,如有帮助,可订阅博主,后期有各种博文更新。