Tool: pyspider
Without further ado, here is the site we want to scrape: https://www.kuaidaili.com/free/
Our goal is to scrape the IPs and ports listed there, so we no longer have to copy them one by one.
Code:
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
    }

    def __init__(self):
        self.url = 'https://www.kuaidaili.com/free/'

    @every(minutes=24 * 60)
    def on_start(self):
        for page in range(1, 2613):
            self.crawl(self.url + 'inha/' + str(page) + '/',
                       callback=self.index_page,
                       validate_cert=False, fetch_type='js')

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        pass
Run it, and we can confirm there are indeed 2612 pages.
Next we need to pick out the IP and port, which calls for a CSS selector. pyspider's built-in selector helper couldn't select the right elements, so I copied the selector from 360 Speed Browser's developer tools instead:
Code:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-12-17 15:14:03
# Project: daili

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
    }

    def __init__(self):
        self.url = 'https://www.kuaidaili.com/free/'

    @every(minutes=24 * 60)
    def on_start(self):
        for page in range(1, 2613):
            self.crawl(self.url + 'inha/' + str(page) + '/',
                       callback=self.index_page,
                       validate_cert=False, fetch_type='js')

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        IP = response.doc('#list > table > tbody > tr:nth-child(1) > td:nth-child(1)').text()
        PORT = response.doc('#list > table > tbody > tr:nth-child(1) > td:nth-child(2)').text()
        print(IP)
The printed IP result:
Now write it to a file:
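Note that the selector above (`tr:nth-child(1)`) grabs only the first row of each page. In pyspider, `response.doc` is a PyQuery object, so every row can be iterated instead. As a stand-alone sketch of the same idea using only the standard library on a small sample table (the sample HTML and values are illustrative, not real data from the site):

```python
import xml.etree.ElementTree as ET

# Illustrative snippet shaped like the proxy table on the page.
sample = """<table><tbody>
<tr><td data-title="IP">58.22.177.215</td><td data-title="PORT">9999</td></tr>
<tr><td data-title="IP">120.83.98.216</td><td data-title="PORT">9999</td></tr>
</tbody></table>"""

root = ET.fromstring(sample)
pairs = []
for tr in root.iter('tr'):              # walk every row, not just the first
    tds = tr.findall('td')
    pairs.append((tds[0].text, tds[1].text))
print(pairs)
```

With PyQuery inside pyspider, the equivalent is selecting `'#list > table > tbody > tr'` and looping over `.items()`.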
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
    }

    def __init__(self):
        self.url = 'https://www.kuaidaili.com/free/'

    @every(minutes=24 * 60)
    def on_start(self):
        for page in range(1, 2613):
            self.crawl(self.url + 'inha/' + str(page) + '/',
                       callback=self.index_page,
                       validate_cert=False, fetch_type='js')

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        IP = response.doc('#list > table > tbody > tr:nth-child(1) > td:nth-child(1)').text()
        PORT = response.doc('#list > table > tbody > tr:nth-child(1) > td:nth-child(2)').text()
        # Open in text append mode ('a'), not 'wb': binary mode rejects str,
        # and 'w' would overwrite the file on every callback.
        with open('E:/ip_port.txt', 'a') as f:
            f.write(IP + ':' + PORT + '\n')
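Since `index_page` fires once per crawled page, the output file has to be opened in append mode; otherwise each callback would clobber the lines written by the previous one. A minimal stand-alone sketch of the append pattern (file name is illustrative):

```python
import os

path = 'ip_port_demo.txt'  # illustrative file name
if os.path.exists(path):
    os.remove(path)        # start from a clean file for the demo

# Simulate three callbacks, each appending one "ip:port" line.
for ip, port in [('1.2.3.4', '8080'), ('5.6.7.8', '9999'), ('9.9.9.9', '3128')]:
    with open(path, 'a', encoding='utf-8') as f:
        f.write(ip + ':' + port + '\n')

with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)  # all three lines survive because of append mode
```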
Result:
Next up: exporting in bulk.