Analysis of the 66ip free proxy site:
I once tried the bulk-extraction page the site provides; after pulling the results down with a regex everything looked fine, but a couple of days later the counts turned abnormal.
So I fell back to the regular approach: paginated crawling, fetching all 1300 pages and parsing each one.
For that, a pair of regular expressions is enough:
import requests, re
from redis import Redis

redis = Redis(db=7)

def craw_66ip():
    url = 'http://www.66ip.cn/{}.html'
    for i in range(1, 1301):  # pages 1 through 1300
        r = requests.get(url.format(i)).text
        # Raw strings keep the regex backslashes intact
        ips = re.findall(r'td>(\w+\.\w+\.\w+\.\w+)</td', r, re.S)
        ports = re.findall(r'\.\w+</td.*?>(\w+)</td', r, re.S)
        for ip, port in zip(ips, ports):
            proxy = ip + ':' + port  # the 'IP:port' string
            redis.rpush('nowashhttp', proxy)
            print('added', proxy)

craw_66ip()
Here the IPs and ports are each matched with a straightforward regex. Once the two lists are in hand, they are zipped together into N strings of the form "IP:port", which are pushed into a Redis list for the cleaning program to use later.
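The extraction step can be checked against a single sample row; the row below mimics the table markup the regexes assume (the actual 66ip page layout may differ):

```python
import re

# A sample row in the assumed 66ip table layout (for illustration only)
html = '<tr><td>58.218.200.226</td><td>8080</td><td>Jiangsu</td></tr>'

ips = re.findall(r'td>(\w+\.\w+\.\w+\.\w+)</td', html, re.S)
ports = re.findall(r'\.\w+</td.*?>(\w+)</td', html, re.S)

# Zip the two lists into "IP:port" strings, as the crawler does
proxies = [ip + ':' + port for ip, port in zip(ips, ports)]
print(proxies)  # ['58.218.200.226:8080']
```

The port regex anchors on the tail of the IP (`\.\w+</td`) and then lazily skips to the next cell, which is why the two lists stay aligned.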
Analysis of the Xici free proxy site
Xici has four categories: domestic transparent, domestic high-anonymity, domestic HTTP, and domestic HTTPS.
Analysis shows all four are crawled the same way, so a single function covers them; only the URL differs.
from bs4 import BeautifulSoup
import requests
from redis import Redis

redis = Redis(db=7)

def craw_xici():
    ip_list = []
    url = 'http://www.xicidaili.com/wt/'    # domestic HTTP
    url1 = 'http://www.xicidaili.com/wn/'   # domestic HTTPS
    url2 = 'http://www.xicidaili.com/nn/'   # domestic high-anonymity
    url3 = 'http://www.xicidaili.com/nt/'   # domestic transparent
    headers = {
        'Cookie': '_free_proxy_session=BAh7B0kiD3Nlc3Npb25faWQGOgZFVEki'
                  'JWIzYzA0ZWNhN2U4YmJiZmI3N2M1YzQ0ZmFjZDU1OGFhBjsAVEkiE'
                  'F9jc3JmX3Rva2VuBjsARkkiMXhOSktRWmRoaGlLRXd0UnU1NmtDWT'
                  'FvVzh6SVFZUWxTWnlLeGVIVVVpNEU9BjsARg%3D%3D--f0f7b59b27'
                  'a7bbb3e87c4eb1f5043c9c5f5ef435; __guid=264997385.176939'
                  '6082676313900.1532227595275.8433; Hm_lvt_0cf76c77469e965'
                  'd2957f0553e6ecf59=1532227595; monitor_count=5; Hm_lpvt_0'
                  'cf76c77469e965d2957f0553e6ecf59=1532227642',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKi'
                      't/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22'
                      '1 Safari/537.36 SE 2.X MetaSr 1.0'
    }

    def get_one_page(url):
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'lxml')
        ips = soup.select('table tr')
        ips.remove(ips[0])          # drop the table header row
        for i in ips:
            for j, k in enumerate(i):
                if j == 3:          # the child at index 3 is the IP <td>
                    port = k.find_next_sibling().get_text()
                    ip = k.get_text()
                    ip_list.append(ip + ':' + port)

    def get_all_pages(url):
        for i in range(1, 20):      # pages 1 through 19 of each category
            url_ = url + str(i)
            get_one_page(url_)

    get_all_pages(url)
    get_all_pages(url1)
    get_all_pages(url2)
    get_all_pages(url3)
    print(len(ip_list))
    print(ip_list)
    for i in ip_list:
        redis.rpush('nowashhttp', i)

craw_xici()
Again the data is stored in Redis. In my view it is best to dump every IP into a single list first: Redis handles a high rate of writes easily, so just keep pushing everything onto one key. Once everything is written in, the cleaning program validates each entry, and the proxies that pass are stored in a separate set, which also deduplicates them.
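The cleaning step described above might be sketched as follows. Only the `nowashhttp` list name comes from the code above; `to_proxies`, `check_proxy`, the `httpbin.org` test URL, and the `washedhttp` set name are my own hypothetical choices:

```python
import requests

def to_proxies(addr):
    """Turn an 'IP:port' string into a requests-style proxies dict."""
    return {'http': 'http://' + addr, 'https': 'http://' + addr}

def check_proxy(addr, test_url='http://httpbin.org/ip', timeout=5):
    """Return True if the proxy answers a test request in time."""
    try:
        r = requests.get(test_url, proxies=to_proxies(addr), timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

def wash(redis):
    """Drain the raw 'nowashhttp' list; keep survivors in a set.

    A Redis set rejects duplicate members automatically, which is
    exactly the deduplication the workflow above calls for.
    """
    while True:
        raw = redis.lpop('nowashhttp')
        if raw is None:
            break
        addr = raw.decode()
        if check_proxy(addr):
            redis.sadd('washedhttp', addr)  # hypothetical key name
```

`lpop` pops from the head of the list while the crawlers `rpush` onto the tail, so the washer can run concurrently with the crawlers without either blocking the other.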