Scraping proxy IP addresses from Kuaidaili
When running a crawler, you often get detected and blocked from the target site. The usual remedy is to route requests through a proxy, which hides the real client.
Some common sites offering free proxy IPs:
https://www.kuaidaili.com/free/
https://www.xicidaili.com/
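Once a proxy is scraped, requests can route traffic through it via the `proxies` argument. A minimal sketch — the address below is a placeholder, not a live proxy:

```python
import requests

# Hypothetical proxy scraped from one of the sites above (placeholder values)
proxy_ip = "127.0.0.1"
proxy_port = "8888"

# requests expects a mapping from URL scheme to proxy URL
proxies = {
    "http": "http://{}:{}".format(proxy_ip, proxy_port),
    "https": "http://{}:{}".format(proxy_ip, proxy_port),
}

# With a real proxy, the request below would go out through it:
# requests.get("http://httpbin.org/ip", proxies=proxies, timeout=5)
```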
from bs4 import BeautifulSoup
import time
import requests
import urllib3

# Suppress the InsecureRequestWarning triggered by verify=False below
urllib3.disable_warnings()

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}
def acquire_ip(page):
    url = "https://www.kuaidaili.com/free/inha/{}/".format(page)
    res = requests.get(url, verify=False, headers=headers)
    html = res.text
    soup = BeautifulSoup(html, 'lxml')
    # The proxy table lives inside <div id="list">
    table_all = soup.find("div", id="list")
    # Column names come from the <th> cells in the table header
    table_headers = table_all.find("thead").find("tr").find_all("th")
    content_headers = []
    for table_header in table_headers:
        content_headers.append(table_header.string)
    # Each <tr> in the body is one proxy; collect the text of its <td> cells
    table_trs = table_all.find('tbody').find_all('tr')
    content_text = []
    for table_tr in table_trs:
        tds = table_tr.find_all('td')
        inner_list = []
        for td in tds:
            inner_list.append(td.string)
        content_text.append(inner_list)
    # Pair each row with the column names to get one dict per proxy
    result = []
    for i in content_text:
        result.append(dict(zip(content_headers, i)))
    return result
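The parsing logic boils down to zipping the `<th>` texts with each row's `<td>` texts. The same pattern can be checked offline on inline HTML (using the stdlib `html.parser` backend so no extra install is needed):

```python
from bs4 import BeautifulSoup

html = """
<div id="list"><table>
  <thead><tr><th>IP</th><th>PORT</th></tr></thead>
  <tbody>
    <tr><td>1.2.3.4</td><td>8080</td></tr>
    <tr><td>5.6.7.8</td><td>3128</td></tr>
  </tbody>
</table></div>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", id="list")
# Column names from the header, cell texts from each body row
headers = [th.string for th in table.find("thead").find_all("th")]
rows = [[td.string for td in tr.find_all("td")]
        for tr in table.find("tbody").find_all("tr")]
# zip each row with the column names to build one dict per proxy
result = [dict(zip(headers, row)) for row in rows]
```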
# Scraping many pages at once would hold all results in memory, so use a
# generator instead, and sleep between requests to avoid being detected
def main(num):
    for i in range(1, num):
        yield acquire_ip(i)
        time.sleep(2)
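Because main is a generator, nothing is fetched when it is called; each step of iteration triggers one acquire_ip call followed by the sleep. A toy sketch of the same pattern, with the network fetch replaced by a stub:

```python
import time

def fetch(page):
    # Stub standing in for acquire_ip(page)
    return [{"IP": "page-%d" % page}]

def pages(num, delay=0):
    # Same shape as main(): lazily yield one page of results at a time
    for i in range(1, num):
        yield fetch(i)
        time.sleep(delay)

gen = pages(4)      # no fetches yet; gen is just a generator object
first = next(gen)   # fetches page 1 only
rest = list(gen)    # fetches pages 2 and 3
```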
if __name__ == '__main__':
    import csv

    # main() returns a generator; pages are fetched lazily as we iterate below
    contents = main(8)
    content_headers = ("IP", "PORT", "匿名度", "类型", "位置", "响应速度", "最后验证时间")
    with open('csv.csv', 'w', encoding="utf-8", newline="") as fs:
        result = csv.DictWriter(fs, content_headers)
        result.writeheader()
        for content in contents:
            result.writerows(content)
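The csv.DictWriter step can be verified in isolation with an in-memory buffer instead of a file:

```python
import csv
import io

headers = ("IP", "PORT")
rows = [{"IP": "1.2.3.4", "PORT": "8080"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, headers)
writer.writeheader()   # writes the header row
writer.writerows(rows) # writes one line per dict, keyed by the header names
```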
Result:
Problem encountered: sometimes the resulting CSV file shows garbled characters when opened in Excel.
Workaround: open the file in Notepad++ (or any editor that can convert encodings), change the encoding to ANSI, save, and then open it in Excel.
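An alternative that avoids re-encoding by hand: write the file as utf-8-sig, which prepends a UTF-8 byte order mark that Excel uses to detect the encoding. A minimal sketch (writing only the header row):

```python
import csv

headers = ("IP", "PORT", "匿名度")
# utf-8-sig prepends the BOM (EF BB BF) so Excel recognizes UTF-8
with open("csv.csv", "w", encoding="utf-8-sig", newline="") as fs:
    writer = csv.DictWriter(fs, headers)
    writer.writeheader()

with open("csv.csv", "rb") as fs:
    data = fs.read()
```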
————————————————
Copyright notice: this is an original article by CSDN blogger 「那个雨季」, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/qq_43534980/article/details/105475640