Scraping usable proxies with Scrapy

Latest recommended article published on 2024-01-24 15:46:45

huzai9527

Categories: python, scrapy

Article link: https://blog.csdn.net/huzai9527/article/details/84202569


1. Analyze the structure of the free proxy site

I scraped three fields: IP, port, and type.
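To make the column layout concrete, here is a minimal sketch that pulls the three fields out of a single table row using only the standard library. The markup in `SAMPLE_ROW` is hypothetical (it is not copied from the site); it only mirrors the layout implied by the selectors used in the spider later on, where column 2 holds the IP, column 3 the port, and column 6 the type. The names `CellCollector` and `extract_fields` are made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup: one row shaped like the layout the spider's selectors
# imply (column 2 = IP, column 3 = port, column 6 = type).
SAMPLE_ROW = """
<table><tr>
  <td>flag</td><td>192.0.2.1</td><td>8080</td>
  <td>somewhere</td><td>anonymous</td><td>HTTP</td>
</tr></table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_td = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Only text inside a <td> is kept; whitespace between tags is ignored.
        if self._in_td:
            self.rows[-1][-1] += data.strip()


def extract_fields(html):
    """Return (ip, port, type) tuples from columns 2, 3 and 6 of each row."""
    collector = CellCollector()
    collector.feed(html)
    return [(r[1], r[2], r[5]) for r in collector.rows if len(r) >= 6]


print(extract_fields(SAMPLE_ROW))
```

In the real project Scrapy's CSS selectors do this work; the sketch only shows which cell of a row each field comes from.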

2. Analyze the data to scrape and write items.py

Accordingly, define the corresponding fields in items.py:

import scrapy
class IproxyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    ip = scrapy.Field()
    type = scrapy.Field()
    port = scrapy.Field()

3. Scrape all the free IPs
- In the spiders directory, create IpSpider.py

import scrapy
import Iproxy.items


class IpSpider(scrapy.Spider):
    name = 'IpSpider'
    allowed_domains = ['xicidaili.com']
    start_urls = ['http://www.xicidaili.com/']

    def parse(self, response):
        # One item per page: each field holds the list of all values found in that table column.
        item = Iproxy.items.IproxyItem()
        item['ip'] = response.css('tr td:nth-child(2)::text').extract()     # column 2: IP address
        item['port'] = response.css('tr td:nth-child(3)::text').extract()   # column 3: port
        item['type'] = response.css('tr td:nth-child(6) ::text').extract()  # column 6: proxy type
        yield item
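Because the spider stores each whole column as a list, the three lists later have to be lined up by index. A small sketch of doing that pairing explicitly, assuming the lists are in the same row order; `pair_columns` is a hypothetical helper name and the sample values are made up:

```python
def pair_columns(ips, ports, types):
    """Combine three parallel column lists into one dict per proxy.

    zip() stops at the shortest list, so a column that came back with
    fewer cells cannot raise IndexError. Surrounding whitespace is stripped.
    """
    return [
        {"ip": ip.strip(), "port": port.strip(), "type": kind.strip()}
        for ip, port, kind in zip(ips, ports, types)
    ]


# Made-up values standing in for what the three .extract() calls return.
rows = pair_columns(
    ["192.0.2.1", "192.0.2.2"],
    [" 8080", "3128"],
    ["HTTP", "HTTPS", "stray"],
)
for row in rows:
    print(row)
```

Note that zip() only guards against lists of different lengths. If a cell is missing in the middle of a column, the rows after it would still be misaligned, so selecting row by row is the more robust option when the page layout allows it.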

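4. Check whether each proxy works, and store it in the database if it does
- Because these are free IPs, we need to check whether each one is actually usable; usable ones are stored in the database and the rest are discarded
- The checking and storage logic goes in pipelines.py
- How the check works: request Baidu through the proxy; if the request succeeds, the proxy is usable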

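# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
import requests


class IproxyPipeline(object):
    def process_item(self, item, spider):
        # Keyword arguments are unambiguous and are required by PyMySQL 1.0+.
        db = pymysql.connect(host="localhost", user="root",
                             password="168168", database="spider")
        cursor = db.cursor()
        # The loop starts at index 1, as in the original, so the first entry is skipped.
        for i in range(1, len(item['ip'])):
            ip = item['ip'][i] + ':' + item['port'][i]
            try:
                if not self.proxyIpCheck(ip):
                    print('Proxy ' + ip + ' is not usable')
                    continue
                print('Proxy ' + ip + ' is usable; saving it to the database')
                # Parameterized query: the driver handles quoting and escaping.
                cursor.execute('insert into proxyIp values (%s)', (ip,))
                db.commit()
            except Exception:
                # Undo the failed insert and move on to the next proxy.
                db.rollback()
        db.close()
        return item

    def proxyIpCheck(self, ip):
        # Both entries use http://, assuming a plain HTTP proxy; an https:// proxy URL
        # would make requests open a TLS connection to the proxy itself.
        proxies = {'http': 'http://' + ip, 'https': 'http://' + ip}
        try:
            r = requests.get('https://www.baidu.com/', proxies=proxies, timeout=1)
            return r.status_code == 200
        except requests.RequestException:
            # Timeouts, refused connections and proxy errors all count as "not usable".
            return False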
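As the template comment at the top of the file says, the pipeline only runs once it is registered in settings.py. A minimal sketch of that fragment, assuming the project package is named Iproxy (as the spider's `import Iproxy.items` suggests) and that the class lives in the default pipelines.py module; adjust the dotted path if your layout differs:

```python
# settings.py (fragment)
# Assumption: the project package is named Iproxy and the pipeline class
# lives in Iproxy/pipelines.py; adjust the dotted path to your layout.
ITEM_PIPELINES = {
    # The integer sets the run order; pipelines with lower values run first.
    "Iproxy.pipelines.IproxyPipeline": 300,
}
```

With this in place, running `scrapy crawl IpSpider` from the project directory should pass every item the spider yields through `process_item`.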

5. Results
- As you can see, quite a lot of the IPs are unusable
- The usable ones are stored in the database

[Image: TIM图片20181118172841.jpg]
