Obtaining proxies for a web crawler
Scraping a site's free proxy IPs
- Free IP proxies: there are plenty of proxy-list sites, but very few that are even moderately reliable. One worth recommending is http://www.xicidaili.com/nn; relatively speaking, its proxies are fairly good.
- Paid IP proxies: Abuyun (https://www.abuyun.com/) offers good proxies, but it is quite expensive. If your crawler's demand for proxies is not especially high, the free ones will do.
- The scraping code below is offered for reference (it uses the bs4 and urllib2 packages, among others):
#!/usr/bin/env python
# encoding: utf-8
"""
Spider ip proxy. Website: http://www.xicidaili.com/nn
Authors: idKevin
Date: 20170717
"""
from bs4 import BeautifulSoup
import urllib2
import time
import socket
import random


def getContent(Url):
    """ Get the web site content. """
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0'}
    req = urllib2.Request(Url, headers=header)  # build the request
    while True:  # keep retrying until the page is fetched
        try:
            response = urllib2.urlopen(req).read()  # web site content
            break
        except urllib2.HTTPError as e:  # log each failure to ease debugging
            print 'HTTPError:', e
            time.sleep(random.choice(range(5, 20)))
        except urllib2.URLError as e:
            print 'URLError:', e
            time.sleep(random.choice(range(10, 30)))
        except socket.timeout as e:
            print 'timeout:', e
            time.sleep(random.choice(range(15, 20)))
        except Exception as e:
            print 'error:', e
            time.sleep(random.choice(range(10, 20)))
    return response  # the website content


def extractIPAddress(content):
    """ Extract proxy IP addresses and ports. """
    proxys = []  # proxy list
    soup = BeautifulSoup(content, 'html.parser')  # soup object
    trs = soup.find_all('tr')  # one tr per proxy row
    for tr in trs[1:]:  # skip the table header row
        td = tr.find_all('td')  # td[1] holds the IP, td[2] the port
        proxys.append(str(td[1].contents[0]) + ":" + str(td[2].contents[0]))
    return proxys


def getProxys():
    """ main function. """
    Url = 'http://www.xicidaili.com/nn/1'  # first page of the proxy list
    content = getContent(Url)  # fetch html content
    proxys = extractIPAddress(content)  # extract the proxys
    # for e in proxys:  # print each proxy on screen
    #     print e
    return proxys
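Assuming the module above is saved as xiciProxys.py (the file name is inferred from the import in the filter script below), using it takes only a few lines:

import xiciProxys  # the scraper module above, saved as xiciProxys.py

proxys = xiciProxys.getProxys()  # scrape page 1 of the proxy list
for proxy in proxys:
    print proxy  # each entry has the form "ip:port", e.g. "1.2.3.4:8118"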
Filtering out invalid proxy IPs
- Not all of the scraped proxies are usable: our network may be unable to reach a given proxy, or the proxy may be unable to reach our target site, so we weed out the invalid ones.
- Filter the proxies against the site you actually intend to crawl; the filter condition should be set according to that site. The filter code is below (it pulls in the scraper module above with import xiciProxys):
#!/usr/bin/env python
# encoding: utf-8
"""
Effective proxys. Website: http://www.xicidaili.com/nn
Authors: idKevin
Date: 20170717
"""
import urllib2
import re
import xiciProxys  # the user-defined scraper module above


def testProxys(proxys):
    """ Keep only the proxys that actually work. """
    validProxys = []
    Url = "http://ip.chinaz.com/getip.aspx"  # echoes the requester's ip and address
    for proxy in proxys:
        try:
            # route both http and https traffic through the candidate proxy
            proxy_handler = urllib2.ProxyHandler({'http': proxy, 'https': proxy})
            opener = urllib2.build_opener(proxy_handler)
            urllib2.install_opener(opener)
            # request the test website through the proxy
            response = urllib2.urlopen(Url, timeout=5).read()
            # set the filtration condition according to the target website
            if re.findall('{ip:.*?,address:.*?}', response) != []:  # drop invalid proxys
                validProxys.append(proxy)
                print "%s\t%s" % (proxy, response)
        except Exception:
            # print "%s\t%s" % (proxy, "invalid")
            continue
    return validProxys


if __name__ == '__main__':
    proxys = xiciProxys.getProxys()  # scrape candidate proxys with the module above
    testProxys(proxys)  # print and collect the ones that respond
Sample output:
49.84.29.115:8118 {ip:'49.84.29.115',address:'江苏省镇江市 电信'}
114.97.238.176:8118 {ip:'114.97.238.176',address:'安徽省合肥市包河区 电信'}
122.72.32.72:80 {ip:'36.102.239.105',address:'浙江省 电信'}
1.197.56.73:808 {ip:'1.197.56.73',address:'河南省许昌市 电信'}
113.206.74.90:8118 {ip:'113.206.74.90',address:'重庆市 联通'}
121.31.79.66:8123 {ip:'121.31.79.66',address:'广西梧州市 联通'}
110.73.1.255:8123 {ip:'110.73.1.255',address:'广西防城港市 联通'}
180.108.189.180:8118 {ip:'180.108.189.180',address:'江苏省苏州市 电信'}
171.39.117.158:8123 {ip:'171.39.117.158',address:'广西百色市 联通'}
112.114.92.175:8118 {ip:'112.114.92.175',address:'云南省临沧市 电信'}
110.73.14.172:8123 {ip:'110.73.14.172',address:'广西防城港市 联通'}
42.53.40.61:80 {ip:'42.53.40.61',address:'辽宁省 联通'}
A run like this yields roughly 13 free proxies; for target sites whose demands on proxies are not especially high, that can just about meet the need.
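To actually crawl through the validated list, the usual pattern is to pick a random proxy per request. A minimal sketch follows (the helper name fetchViaRandomProxy is mine, not from the scripts above):

import random
import urllib2

def fetchViaRandomProxy(url, validProxys):
    """ Fetch url through a proxy chosen at random from the validated list. """
    proxy = random.choice(validProxys)  # rotate proxies to spread requests
    handler = urllib2.ProxyHandler({'http': proxy, 'https': proxy})
    opener = urllib2.build_opener(handler)  # opener bound to this proxy only
    return opener.open(url, timeout=10).read()

Building a per-call opener instead of calling urllib2.install_opener keeps each request's proxy choice independent of global state.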