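Tips for crawling large amounts of data quickly while reducing the chance of your IP being banned:
1. Use multiple proxy IPs and multiple threads
2. Set a random interval between page requests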
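As a rough illustration of the two tips, here is a minimal sketch of a random pause between requests and a random proxy choice. The names `random_delay` and `pick_proxy` and the 1 to 3 second default range are my own assumptions for the example, not values from any particular site or library.

```python
import random
import time


def random_delay(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random duration between min_seconds and max_seconds.

    The sleep function is a parameter (hypothetical design choice) so the
    helper can be exercised without actually waiting. Returns the delay used.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def pick_proxy(proxies):
    """Return one proxy chosen at random from a non-empty list."""
    return random.choice(proxies)
```

Calling `random_delay()` before each page request spreads the requests out irregularly, and `pick_proxy(ipTrueList)` would pick a different validated proxy for each browser session.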
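Searching on Baidu turned up a very good blog post on multithreading with threading:
http://www.cnblogs.com/tkqasn/p/5700281.html
It explains things in great detail and I found it very useful.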
Checking with multiple threads whether the proxy IPs are usable:
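import threading

from selenium import webdriver

ipTrueList = []


def validIpList():
    """Check every entry of ipList in its own thread; usable ones end up in ipTrueList."""
    global ipTrueList
    ipTrueList = []
    threads = []
    # ipList is assumed to be defined elsewhere. Each entry is passed straight to
    # add_argument, so it presumably looks like "--proxy-server=http://host:port".
    for ip in ipList:
        threads.append(threading.Thread(target=validIp, args=(ip,)))
    for t in threads:
        t.start()
    # Wait until every check has finished
    for t in threads:
        t.join()


def validIp(ip):
    """Open Baidu through the given proxy and record the proxy if the page loads."""
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(ip)
    # Recent Selenium releases take options=; older ones used chrome_options=
    browser = webdriver.Chrome(options=chrome_options)
    try:
        browser.get("https://www.baidu.com")
        # The page counts as loaded if it contains the text "百度"
        if "百度" in browser.page_source:
            ipTrueList.append(ip)
    except Exception:
        # Any error means the proxy is treated as unusable
        pass
    finally:
        # Always close the browser, whether or not the check succeeded
        browser.quit()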
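
The approach above starts one thread, and one Chrome instance, per proxy all at once, which can get heavy with a long proxy list. One possible variation, sketched here as an assumption rather than as part of the original method, is to cap the number of simultaneous checks with `concurrent.futures.ThreadPoolExecutor` from the standard library. The names `filter_working`, `check`, and `max_workers=8` are hypothetical; `check` stands for any function that returns a truthy value when a proxy works.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_working(proxies, check, max_workers=8):
    """Return the proxies for which check(proxy) is truthy, in input order.

    At most max_workers checks run at the same time. A check that raises
    an exception is treated as a failed proxy.
    """

    def safe_check(proxy):
        try:
            return bool(check(proxy))
        except Exception:
            return False

    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        results = list(pool.map(safe_check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]
```

With a variant of `validIp` that returns True or False instead of appending to a global list, the call would look like `ipTrueList = filter_working(ipList, validIp)`.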