Tool: pyspider
Without further ado, here is the site we want to scrape: https://www.kuaidaili.com/free/
Our goal is to scrape the IPs and ports listed there, so we no longer have to copy them one by one.
Code:
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
    }

    def __init__(self):
        self.url = 'https://www.kuaidaili.com/free/'

    @every(minutes=24 * 60)
    def on_start(self):
        for page in range(1, 2613):
            self.crawl(self.url + 'inha/' + str(page) + '/',
                       callback=self.index_page,
                       validate_cert=False, fetch_type='js')

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        pass
Run it, and we can confirm there are indeed 2612 pages.
Next we need to pick out the IP and port, which calls for a CSS selector. pyspider's built-in selector helper couldn't select the right elements, so I copied the selector from 360 Speed Browser's developer tools instead:
Code:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-12-17 15:14:03
# Project: daili

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
    }

    def __init__(self):
        self.url = 'https://www.kuaidaili.com/free/'

    @every(minutes=24 * 60)
    def on_start(self):
        for page in range(1, 2613):
            self.crawl(self.url + 'inha/' + str(page) + '/',
                       callback=self.index_page,
                       validate_cert=False, fetch_type='js')

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        IP = response.doc('#list > table > tbody > tr:nth-child(1) > td:nth-child(1)').text()
        PORT = response.doc('#list > table > tbody > tr:nth-child(1) > td:nth-child(2)').text()
        print(IP)
The printed IP result:
Now write it to a file:
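Note that the selector above (`tr:nth-child(1)`) grabs only the first row of each page. In pyspider, `response.doc` is a PyQuery object, so every row can be iterated instead. As a stand-alone sketch of the same idea using only the standard library on a small sample table (the sample HTML and values are illustrative, not real data from the site):

```python
import xml.etree.ElementTree as ET

# Illustrative snippet shaped like the proxy table on the page.
sample = """<table><tbody>
<tr><td data-title="IP">58.22.177.215</td><td data-title="PORT">9999</td></tr>
<tr><td data-title="IP">120.83.98.216</td><td data-title="PORT">9999</td></tr>
</tbody></table>"""

root = ET.fromstring(sample)
pairs = []
for tr in root.iter('tr'):              # walk every row, not just the first
    tds = tr.findall('td')
    pairs.append((tds[0].text, tds[1].text))
print(pairs)
```

With PyQuery inside pyspider, the equivalent is selecting `'#list > table > tbody > tr'` and looping over `.items()`.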
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
    }

    def __init__(self):
        self.url = 'https://www.kuaidaili.com/free/'

    @every(minutes=24 * 60)
    def on_start(self):
        for page in range(1, 2613):
            self.crawl(self.url + 'inha/' + str(page) + '/',
                       callback=self.index_page,
                       validate_cert=False, fetch_type='js')

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        IP = response.doc('#list > table > tbody > tr:nth-child(1) > td:nth-child(1)').text()
        PORT = response.doc('#list > table > tbody > tr:nth-child(1) > td:nth-child(2)').text()
        # Open in text append mode ('a'), not 'wb': binary mode rejects str,
        # and 'w' would overwrite the file on every callback.
        with open('E:/ip_port.txt', 'a') as f:
            f.write(IP + ':' + PORT + '\n')
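Since `index_page` fires once per crawled page, the output file has to be opened in append mode; otherwise each callback would clobber the lines written by the previous one. A minimal stand-alone sketch of the append pattern (file name is illustrative):

```python
import os

path = 'ip_port_demo.txt'  # illustrative file name
if os.path.exists(path):
    os.remove(path)        # start from a clean file for the demo

# Simulate three callbacks, each appending one "ip:port" line.
for ip, port in [('1.2.3.4', '8080'), ('5.6.7.8', '9999'), ('9.9.9.9', '3128')]:
    with open(path, 'a', encoding='utf-8') as f:
        f.write(ip + ':' + port + '\n')

with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)  # all three lines survive because of append mode
```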
Result:
Next up: exporting in bulk.