Notes on Learning the Python Scrapy Crawler Framework

As early as last year I had already written crawlers for a few specific sites based on material found through searching, but the learning was unsystematic: everything was pieced together from search results, and I did not really understand the deeper principles. Over the past couple of days I have been working through a video course to study the framework more thoroughly.

 

Official site: https://scrapy.org/

I will not go over installation; see the official site. My own setup is Windows 7, with Anaconda3 (64-bit) as the Python environment.

2. First steps with Scrapy

Create a file named stackoverflow_spider.py containing the code below, then run it from the command line:

# The -o option saves the scraped data to a local JSON file (CSV, XML and other formats are also supported)
scrapy runspider stackoverflow_spider.py -o quotes.json

The spider code:
import scrapy

class StackOverflowSpider(scrapy.Spider):
    # Name of this spider
    name = "stackoverflow"
    # Initial URLs to crawl; parse() is used as the default callback
    start_urls = ["http://stackoverflow.com/questions?sort=votes"]

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            # Turn the relative link into an absolute URL
            full_url = response.urljoin(href.extract())
            # Register parse_question as the callback for the question page
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css(".question .vote-count-post::text").extract()[0],
            'body': response.css(".question .post-text").extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }

Some of the framework's higher-level features:

1. Built-in data extractors
2. An interactive shell for debugging extraction code
3. Built-in support for exporting results as JSON, CSV, XML, etc. (see the example commands after this list)
4. Automatic encoding handling
5. Support for custom extensions
6. A rich set of built-in extensions and middlewares for handling:
   1) cookies and sessions
   2) HTTP features like compression, authentication, caching
   3) user-agent spoofing
   4) robots.txt
   5) crawl depth restriction
7. Remote debugging of Scrapy
8. More besides, e.g. crawling XML/CSV feeds and automatically downloading images.
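
As a quick illustration of item 3, the built-in exporters are selected simply by the extension of the output file passed to -o (using the stackoverflow spider from above):

scrapy runspider stackoverflow_spider.py -o quotes.csv
scrapy runspider stackoverflow_spider.py -o quotes.xml
scrapy runspider stackoverflow_spider.py -o quotes.jl    # JSON lines, one item per line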

3. Basic workflow

Create a project:

scrapy startproject tutorial

This automatically generates the following files:

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file  

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

Then use the genspider shortcut command to generate a spider file that you can quickly adapt:

scrapy genspider toc_spider toscrape.com

First, a simple example that just saves the page source to a local file:

# -*- coding: utf-8 -*-
import scrapy


class TocSpiderSpider(scrapy.Spider):
    name = 'toc_spider'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename,'wb') as f:
            f.write(response.body)

Run the crawl:

scrapy crawl toc_spider

# If you are not sure which spiders the project contains, run scrapy list first to see them

Of course, when scraping we usually want specific fields that will later be stored in a database, so we define an Item: one field for every piece of data we intend to extract.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ToscrapeItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()

Then import the item class in the spider file and use it:

# Import at the top of the spider file: the project package name followed by
# the item class, which can be copied from the items.py file
from tocscrape.items import ToscrapeItem


# Then use it inside the parsing logic:

item = ToscrapeItem()

item['name'] = response.xpath('...').extract()[0]   # fill in the XPath for this field

yield item
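
Putting the item and the spider together, here is a minimal sketch of a complete spider that fills the field and yields the item; the CSS selectors assume the markup of quotes.toscrape.com, so treat them as an assumption:

# -*- coding: utf-8 -*-
# Minimal sketch: the genspider skeleton from above, extended to fill
# ToscrapeItem. The selectors assume the markup of quotes.toscrape.com.
import scrapy

from tocscrape.items import ToscrapeItem


class TocSpiderSpider(scrapy.Spider):
    name = 'toc_spider'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = ToscrapeItem()
            # here 'name' simply holds the quote's author, as defined in items.py
            item['name'] = quote.css('small.author::text').extract()[0]
            yield item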

4.1 Basic concepts: the command line

help: Scrapy's basic help command, used to display usage information.
scrapy --help

version: show version information; add the -v flag to also see the versions of the individual components (the installed Python, Scrapy, lxml, Twisted, and so on).
scrapy version
scrapy version -v


Create a project:
scrapy startproject projectname

Generate a spider inside a project. A project can contain multiple spiders, but their names must be unique. Run this after creating the project, from inside the project directory, passing the spider name and the domain to crawl:
scrapy genspider example example.com

List the spiders defined in the project:
scrapy list

view: open the downloaded page in a browser to see it the way Scrapy sees it, which is useful for checking whether the response actually contains the data your extraction code expects:
scrapy view https://www.baidu.com/

parse: parse a given page with the spider's parse function; must be run from inside the project directory:
scrapy parse url

shell: an extremely useful command for debugging extraction logic, testing XPath expressions, inspecting the page source, and so on. It opens an interactive session in which the response object (and other Scrapy objects) are already available. Essential for debugging:
scrapy shell url

PS: a handy trick for pulling a matched value out of the response: response.xpath('/html/body/div/li/em/text()').re('\d+')[0]
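
A rough sketch of what a shell session looks like (the URL and the selector are only illustrations):

$ scrapy shell http://quotes.toscrape.com/
...
In [1]: response.status
Out[1]: 200

In [2]: response.css('small.author::text').extract_first()
Out[2]: u'Albert Einstein'

In [3]: view(response)   # open the downloaded page in the browser
Out[3]: True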

runspider: run a self-contained spider, i.e. a single .py file written without creating a project via startproject:
scrapy runspider demo_spider.py

bench: run a quick benchmark; also useful for checking that Scrapy is installed correctly:
scrapy bench

5. Basic concepts: important Scrapy objects

The Request object (check the official documentation for the latest version):

Constructor parameters:
class scrapy.http.Request(
    url [ ,
    callback,
    method='GET',
    headers,
    body,
    cookies,
    meta,               # very important: used to pass data between callback functions
    encoding='utf-8',
    priority=0,
    dont_filter=False,
    errback ] )

Other attributes and methods:
url
method
headers
body
meta
copy()
replace()

Subclasses:
FormRequest -- very important; used for form requests, in particular to implement logins
Response -- normally not instantiated by hand: you simply use the object Scrapy passes to your callbacks, which carries plenty of useful data
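
Since meta is the mechanism for passing data between callbacks, here is a minimal sketch of the pattern, assuming a hypothetical listing page on example.com with made-up selectors and field names:

# A minimal sketch of passing a partially-filled item between callbacks via
# Request.meta; the site, selectors and field names are illustrative only.
import scrapy

class MetaDemoSpider(scrapy.Spider):
    name = "meta_demo"
    start_urls = ["http://www.example.com/list"]

    def parse(self, response):
        for href in response.css('a.item::attr(href)').extract():
            item = {'list_url': response.url}       # data collected on the listing page
            yield scrapy.Request(
                response.urljoin(href),
                meta={'item': item},                # hand the partial item to the next callback
                callback=self.parse_detail)

    def parse_detail(self, response):
        item = response.meta['item']                # pick the partial item back up
        item['title'] = response.css('h1::text').extract_first()
        yield item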

On the topic of logging in, here is the code from the course:

# -*- coding: utf-8 -*-
import json
import scrapy
from scrapy import FormRequest
from scrapy.mail import MailSender

from bioon import settings
from bioon.items import BioonItem

class BioonspiderSpider(scrapy.Spider):
    name = "bioonspider"
    allowed_domains = ["bioon.com"]
    start_urls=['http://login.bioon.com/login']
    
    def parse(self,response):
        #Extract the cookie information from response.headers
        r_headers = response.headers['Set-Cookie']
        cookies_v = r_headers.split(';')[0].split('=')
        
        cookies = {cookies_v[0]:cookies_v[1]}
        
        #Headers for the simulated request
        headers = {
        'Host':	'login.bioon.com',
        'Referer':'http://login.bioon.com/login',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0',
        'X-Requested-With':'XMLHttpRequest' 
        }
        
        #Get the CSRF token
        csrf_token = response.xpath(
            '//input[@id="csrf_token"]/@value').extract()[0]
        
        #Get the URL the login form posts to
        login_url = response.xpath(
            '//form[@id="login_form"]/@action').extract()[0]
        end_login = response.urljoin(login_url)
        
        #Build the POST data
        formdata={
        #use your own registered account name
        'account':'********',
        'client_id':'usercenter',
        'csrf_token':csrf_token,
        'grant_type':'grant_type',
        'redirect_uri':'http://login.bioon.com/userinfo',
        #use your own registered username
        'username':'********',
        #use the password for your own account
        'password':'xxxxxxx',
        }
        
        #Submit the simulated login request
        return FormRequest(
        end_login,
        formdata=formdata,
        headers=headers,
        cookies=cookies,
        callback=self.after_login
        )

    def after_login(self,response):
        
        self.log('Now handling bioon login page.')
        
        aim_url = 'http://news.bioon.com/Cfda/'
        
        obj = json.loads(response.body)
        
        print "Loging state: ", obj['message']
        if "success" in obj['message']:
            self.logger.info("=========Login success.==========")
        
        return scrapy.Request(aim_url,callback = self.parse_list)
    
    def parse_list(self,response):
        
        lis_news = response.xpath(
            '//ul[@id="cms_list"]/li/div/h4/a/@href').extract()
        
        for li in lis_news:
            end_url = response.urljoin(li)
            yield scrapy.Request(end_url,callback=self.parse_content)
    
    def parse_content(self,response):
        
        head = response.xpath(
            '//div[@class="list_left"]/div[@class="title5"]')[0]
        
        item=BioonItem()
        
        item['title'] = head.xpath('h1/text()').extract()[0]
            
        item['source'] = head.xpath('p/text()').re(ur'来源:(.*?)\s(.*?)$')[0]
        
        item['date_time'] = head.xpath('p/text()').re(ur'来源:(.*?)\s(.*?)$')[1]
        
        item['body'] = response.xpath(
            '//div[@class="list_left"]/div[@class="text3"]').extract()[0]
        
        return item

        
    def closed(self,reason):
        import pdb;pdb.set_trace()
        self.logger.info("Spider closed: %s"%str(reason))
        mailer = MailSender.from_settings(self.settings)
        mailer.send(
            to=["******@qq.com"], 
            subject="Spider closed", 
            body=str(self.crawler.stats.get_stats()), 
            cc=["**********@xxxxxxxx.com"]
            )
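
For comparison: instead of extracting csrf_token and the form action by hand, FormRequest.from_response can pre-populate them from the login page (the formxpath argument needs Scrapy 1.1 or later). A minimal sketch under that assumption, with placeholder credentials:

# Minimal sketch using FormRequest.from_response, which copies the form's
# action URL and hidden inputs (e.g. csrf_token) and merges in formdata.
import scrapy
from scrapy import FormRequest

class LoginDemoSpider(scrapy.Spider):
    name = "login_demo"
    start_urls = ["http://login.bioon.com/login"]

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formxpath='//form[@id="login_form"]',
            formdata={"username": "********", "password": "xxxxxxx"},
            callback=self.after_login)

    def after_login(self, response):
        self.logger.info("Login response status: %s", response.status)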

The items file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BioonItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    source =scrapy.Field()
    date_time = scrapy.Field()
    body = scrapy.Field()
    

The settings file:

# -*- coding: utf-8 -*-

# Scrapy settings for bioon project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
#Name of the bot implemented by this Scrapy project (this is also the project name).
BOT_NAME = 'bioon'

SPIDER_MODULES = ['bioon.spiders']
NEWSPIDER_MODULE = 'bioon.spiders'

#A dict holding the downloader middlewares enabled in the project and their orders. Default: {}
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
}

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0'

#A dict holding the item pipelines enabled in the project and their orders. Empty by default;
#the values are arbitrary, but by convention they are kept in the 0-1000 range.
ITEM_PIPELINES={
#'bioon.pipelines.BioonPipeline':500
}

#How long the downloader should wait before downloading consecutive pages from the same
#website. Can be used to throttle the crawl speed and reduce server load; decimals are supported:
DOWNLOAD_DELAY = 0.25    # 250 ms of delay

#Maximum depth allowed when crawling a site. 0 means no limit.
DEPTH_LIMIT=0

#Whether to enable the DNS in-memory cache. Default: True
DNSCACHE_ENABLED=True

#File name to use for logging output. If None, standard error is used. Default: None
LOG_FILE='scrapy.log'

#Minimum level to log. Available levels: CRITICAL, ERROR, WARNING, INFO, DEBUG. Default: 'DEBUG'
LOG_LEVEL='DEBUG'

#If True, all standard output (and errors) of the process will be redirected to the log.
#For example, executing print 'hello' will show up in the Scrapy log.
#Default: False
LOG_STDOUT=False

#Maximum number of concurrent requests that will be performed to any single domain. Default: 8
CONCURRENT_REQUESTS_PER_DOMAIN=8

#Default: True ,Whether to enable the cookies middleware. If disabled, no cookies will be sent to web servers.
COOKIES_ENABLED = True

#feed settings
FEED_URI = 'file:///C:/Users/stwan/Desktop/bioon/a.txt'
FEED_FORMAT = 'jsonlines'

LOG_ENCODING = None

##----------------------Mail settings------------------------
#Default: ’scrapy@localhost’,Sender email to use (From: header) for sending emails.
MAIL_FROM='*********@163.com'

#Default: ’localhost’, SMTP host to use for sending emails.
MAIL_HOST="smtp.163.com"

#Default: 25, SMTP port to use for sending emails.
MAIL_PORT="25"

#Default: None, User to use for SMTP authentication. If disabled no SMTP authentication will be performed.
MAIL_USER="*********@163.com"

#Default: None, Password to use for SMTP authentication, along with MAIL_USER.
MAIL_PASS="xxxxxxxxxxxxx"

#Enforce using STARTTLS. STARTTLS is a way to take an existing insecure connection, 
#and upgrade it to a secure connection using SSL/TLS.
MAIL_TLS=False

#Default: False, Enforce connecting using an SSL encrypted connection
MAIL_SSL=False

The pipelines file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from bioon.handledb import adb_insert_data,exec_sql
from bioon.settings import DBAPI,DBKWARGS

class BioonPipeline(object):
    def process_item(self, item, spider):
        print "Now in pipeline:"
        print item['title']
        print item['source']
        print "End of pipeline."
        #store data
        #adb_insert_data(item,"tablename",DBAPI,**DBKWARGS)
        return item

The middlewares file:

# Importing base64 library because we'll need it ONLY in case if the proxy we are going to use requires authentication
#-*- coding:utf-8-*-
import base64
from proxy import GetIp,counter
from scrapy import log
ips=GetIp().get_ips()

class ProxyMiddleware(object):
    http_n=0     #counter for http requests
    https_n=0    #counter for https requests  
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        if request.url.startswith("http://"):
            n=ProxyMiddleware.http_n
            n=n if n<len(ips['http']) else 0 
            request.meta['proxy']= "http://%s:%d"%(ips['http'][n][0],int(ips['http'][n][1]))
            log.msg('Sequence - http: %s - %s'%(n,str(ips['http'][n])))
            ProxyMiddleware.http_n=n+1

        if request.url.startswith("https://"):
            n=ProxyMiddleware.https_n
            n=n if n<len(ips['https']) else 0             
            request.meta['proxy']= "https://%s:%d"%(ips['https'][n][0],int(ips['https'][n][1]))
            log.msg('Sequence - https: %s - %s'%(n,str(ips['https'][n])))
            ProxyMiddleware.https_n=n+1 

Crawling the IP list from the xici proxy site:

The spider class:

# -*- coding: utf-8 -*-
import scrapy
from collectips.items import CollectipsItem

class XiciSpider(scrapy.Spider):
    name = "xici"
    allowed_domains = ["xicidaili.com"]
    start_urls = (
        'http://www.xicidaili.com',
    )
    
    def start_requests(self):
        reqs=[]
        
        for i in range(1,206):
            req=scrapy.Request("http://www.xicidaili.com/nn/%s"%i)
            reqs.append(req)
        
        return reqs
    
    def parse(self, response):
        ip_list=response.xpath('//table[@id="ip_list"]')
        
        trs = ip_list[0].xpath('tr')
        
        items=[]
        
        for ip in trs[1:]:
            pre_item=CollectipsItem()
            
            pre_item['IP'] = ip.xpath('td[3]/text()')[0].extract()
            
            pre_item['PORT'] = ip.xpath('td[4]/text()')[0].extract()
            
            pre_item['POSITION'] = ip.xpath('string(td[5])')[0].extract().strip()
            
            pre_item['TYPE'] = ip.xpath('td[7]/text()')[0].extract()
            
            pre_item['SPEED'] = ip.xpath(
                'td[8]/div[@class="bar"]/@title').re('\d{0,2}\.\d{0,}')[0]
                
            pre_item['LAST_CHECK_TIME'] = ip.xpath('td[10]/text()')[0].extract()
            
            items.append(pre_item)
            
        return items
    

item:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CollectipsItem(scrapy.Item):
    # define the fields for your item here like:
    IP = scrapy.Field()
    PORT = scrapy.Field()
    POSITION = scrapy.Field()
    TYPE = scrapy.Field()
    SPEED = scrapy.Field()
    LAST_CHECK_TIME = scrapy.Field()
    

settings:

# -*- coding: utf-8 -*-

# Scrapy settings for collectips project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'collectips'

SPIDER_MODULES = ['collectips.spiders']
NEWSPIDER_MODULE = 'collectips.spiders'

# database connection parameters
DBKWARGS={'db':'ippool','user':'root', 'passwd':'toor',
    'host':'localhost','use_unicode':True, 'charset':'utf8'}


# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'collectips.pipelines.CollectipsPipeline': 300,
}

#Configure log file name
LOG_FILE = "scrapy.log"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0'

pipelines:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import MySQLdb

class CollectipsPipeline(object):

    def process_item(self, item, spider):

        DBKWARGS = spider.settings.get('DBKWARGS')
        con = MySQLdb.connect(**DBKWARGS)
        cur = con.cursor()
        sql = ("insert into proxy(IP,PORT,TYPE,POSITION,SPEED,LAST_CHECK_TIME) "
            "values(%s,%s,%s,%s,%s,%s)")
        lis = (item['IP'],item['PORT'],item['TYPE'],item['POSITION'],item['SPEED'],
            item['LAST_CHECK_TIME'])
        try:
            cur.execute(sql,lis)
        except Exception,e:
            print "Insert error:",e
            con.rollback()
        else:
            con.commit()
        cur.close()
        con.close()
        return item
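
The pipeline above opens and closes a new database connection for every single item. A sketch of the same insert logic using the open_spider/close_spider hooks so that one connection is reused for the whole crawl (same DBKWARGS assumption):

# -*- coding: utf-8 -*-
# Sketch: same insert logic, but one MySQL connection per crawl instead of
# one per item, using the open_spider/close_spider pipeline hooks.
import MySQLdb

class CollectipsDBPipeline(object):

    def open_spider(self, spider):
        # called once when the spider starts
        self.con = MySQLdb.connect(**spider.settings.get('DBKWARGS'))
        self.cur = self.con.cursor()

    def close_spider(self, spider):
        # called once when the spider finishes
        self.cur.close()
        self.con.close()

    def process_item(self, item, spider):
        sql = ("insert into proxy(IP,PORT,TYPE,POSITION,SPEED,LAST_CHECK_TIME) "
               "values(%s,%s,%s,%s,%s,%s)")
        args = (item['IP'], item['PORT'], item['TYPE'], item['POSITION'],
                item['SPEED'], item['LAST_CHECK_TIME'])
        try:
            self.cur.execute(sql, args)
        except Exception as e:
            spider.logger.error("Insert error: %s", e)
            self.con.rollback()
        else:
            self.con.commit()
        return item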

Code for crawling Tmall product listings:

The spider class:

# -*- coding: utf-8 -*-
import scrapy
from topgoods.items import TopgoodsItem

class TmGoodsSpider(scrapy.Spider):
    name = "tm_goods"
    allowed_domains = ["http://www.tmall.com"]
    start_urls = (
        'http://list.tmall.com/search_product.htm?type=pc&totalPage=100&cat=50025135&sort=d&style=g&from=sn_1_cat-qp&active=1&jumpto=10#J_Filter',
    )
    #count of pages processed
    count=0 
     
    def parse(self, response):
          
        TmGoodsSpider.count += 1
        
        divs = response.xpath("//div[@id='J_ItemList']/div[@class='product']/div")
        if not divs:
            self.log( "List Page error--%s"%response.url )
        
        print "Goods numbers: ",len(divs)
        
        for div in divs:
            item=TopgoodsItem()
            #product price
            item["GOODS_PRICE"] = div.xpath("p[@class='productPrice']/em/@title")[0].extract()
            #product name
            item["GOODS_NAME"] = div.xpath("p[@class='productTitle']/a/@title")[0].extract()
            #product URL
            pre_goods_url = div.xpath("p[@class='productTitle']/a/@href")[0].extract()
            item["GOODS_URL"] = pre_goods_url if "http:" in pre_goods_url else ("http:"+pre_goods_url)
            #image URL
            try:
                file_urls = div.xpath('div[@class="productImg-wrap"]/a[1]/img/@src|'
                'div[@class="productImg-wrap"]/a[1]/img/@data-ks-lazyload').extract()[0]
                item['file_urls'] = ["http:"+file_urls]
            except Exception,e:
                print "Error: ",e
                import pdb;pdb.set_trace()
            yield scrapy.Request(url=item["GOODS_URL"],meta={'item':item},callback=self.parse_detail,
            dont_filter=True)

    def parse_detail(self,response):

        div = response.xpath('//div[@class="extend"]/ul')
        if not div:
            self.log( "Detail Page error--%s"%response.url )
            
        item = response.meta['item']
        div=div[0]
        #shop name
        item["SHOP_NAME"] = div.xpath("li[1]/div/a/text()")[0].extract()
        #shop URL
        pre_shop_url = div.xpath("li[1]/div/a/@href")[0].extract()
        item["SHOP_URL"] = response.urljoin(pre_shop_url)
        #company name
        item["COMPANY_NAME"] = div.xpath("li[3]/div/text()")[0].extract().strip()
        #company location
        item["COMPANY_ADDRESS"] = div.xpath("li[4]/div/text()")[0].extract().strip()
        
        yield item

items:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TopgoodsItem(scrapy.Item):
    # define the fields for your item here like:
    GOODS_PRICE = scrapy.Field()
    GOODS_NAME = scrapy.Field()
    GOODS_URL = scrapy.Field()
    SHOP_NAME = scrapy.Field()
    SHOP_URL = scrapy.Field()
    COMPANY_NAME = scrapy.Field()
    COMPANY_ADDRESS = scrapy.Field()
    
    #image URLs
    file_urls = scrapy.Field()
    

settings (note that image downloading is configured here):

# -*- coding: utf-8 -*-

# Scrapy settings for topgoods project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'topgoods'

SPIDER_MODULES = ['topgoods.spiders']
NEWSPIDER_MODULE = 'topgoods.spiders'

DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware':301,
    }
#The next three lines configure image downloading
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

IMAGES_URLS_FIELD = 'file_urls'
IMAGES_STORE = r'.'
# IMAGES_THUMBS = {
    # 'small': (50, 50),
    # 'big': (270, 270),
# }

LOG_FILE = "scrapy.log"

Settings for the proxy IP module:

middlewares:

# Importing base64 library because we'll need it ONLY in case 
#if the proxy we are going to use requires authentication
#-*- coding:utf-8-*-
import base64
from proxy import GetIp,counter
import logging
ips=GetIp().get_ips()

class ProxyMiddleware(object):
    http_n=0     #counter for http requests
    https_n=0    #counter for https requests  
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        if request.url.startswith("http://"):
            n=ProxyMiddleware.http_n
            n=n if n<len(ips['http']) else 0 
            request.meta['proxy']= "http://%s:%d"%(
                ips['http'][n][0],int(ips['http'][n][1]))
            logging.info('Sequence - http: %s - %s'%(n,str(ips['http'][n])))
            ProxyMiddleware.http_n=n+1

        if request.url.startswith("https://"):
            n=ProxyMiddleware.https_n
            n=n if n<len(ips['https']) else 0             
            request.meta['proxy']= "https://%s:%d"%(
                ips['https'][n][0],int(ips['https'][n][1]))
            logging.info('Sequence - https: %s - %s'%(n,str(ips['https'][n])))
            ProxyMiddleware.https_n=n+1 
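
For the custom middleware above to take effect it also has to be registered in the project's DOWNLOADER_MIDDLEWARES setting; a sketch, where the project/module path is a placeholder and the order is chosen to be lower than HttpProxyMiddleware (110 in the earlier settings) so the proxy is set before that middleware runs:

DOWNLOADER_MIDDLEWARES = {
    'yourproject.middlewares.ProxyMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
}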

Create proxy.py in the same directory:

import sys
from handledb import exec_sql
import socket
import urllib2

dbapi="MySQLdb"
kwargs={'user':'root','passwd':'toor','db':'ippool','host':'localhost', 'use_unicode':True}

def counter(start_at=0):
    '''Function: count number
	Usage: f=counter(i) print f() #i+1'''
    count=[start_at]
    def incr():
        count[0]+=1
        return count[0]
    return incr

def use_proxy (browser,proxy,url):
    '''Open browser with proxy'''
    #After visited transfer ip
    profile=browser.profile
    profile.set_preference('network.proxy.type', 1)  
    profile.set_preference('network.proxy.http', proxy[0])  
    profile.set_preference('network.proxy.http_port', int(proxy[1]))  
    profile.set_preference('permissions.default.image',2)
    profile.update_preferences() 
    browser.profile=profile
    browser.get(url)
    browser.implicitly_wait(30)
    return browser
    
class Singleton(object):
    '''Signal instance example.'''
    def __new__(cls, *args, **kw):  
        if not hasattr(cls, '_instance'):  
            orig = super(Singleton, cls)  
            cls._instance = orig.__new__(cls, *args, **kw)  
        return cls._instance 

class GetIp(Singleton):
    def __init__(self):
        sql='''SELECT  `IP`,`PORT`,`TYPE`
        FROM  `proxy` 
        WHERE `TYPE` REGEXP  'HTTP|HTTPS'
        AND  `SPEED`<5 OR `SPEED` IS NULL
        ORDER BY `proxy`.`TYPE` ASC 
        LIMIT 50 '''
        self.result = exec_sql(sql,**kwargs)
    def del_ip(self,record):
        '''delete ip that can not use'''
        sql="delete from proxy where IP='%s' and PORT='%s'"%(record[0],record[1])
        print sql
        exec_sql(sql,**kwargs)
        print record ," was deleted."
    def judge_ip(self,record):
        '''Judge IP can use or not'''
        http_url="http://www.baidu.com/"
        https_url="https://www.alipay.com/"
        proxy_type=record[2].lower()
        url=http_url if  proxy_type== "http" else https_url
        proxy="%s:%s"%(record[0],record[1])
        try:
            req=urllib2.Request(url=url)
            req.set_proxy(proxy,proxy_type)
            response=urllib2.urlopen(req,timeout=30)
        except Exception,e:
            print "Request Error:",e
            self.del_ip(record)
            return False
        else:
            code=response.getcode()
            if code>=200 and code<300:
                print 'Effective proxy',record
                return True
            else:
                print 'Invalid proxy',record
                self.del_ip(record)
                return False
        
    def get_ips(self):
        print "Proxy getip was executed."
        http=[h[0:2] for h in self.result if h[2] =="HTTP" and self.judge_ip(h)]
        https=[h[0:2] for h in self.result if h[2] =="HTTPS" and self.judge_ip(h)]
        print "Http: ",len(http),"Https: ",len(https)
        return {"http":http,"https":https}

 
