Python proxy pool project: IPProxys, an open-source Python IP proxy pool

Latest recommended article published on 2024-06-03 11:38:08

涂诗语

Article tags: Python proxy pool project

Article link: https://blog.csdn.net/weixin_36286464/article/details/113653882


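The blog is being updated again as of today; thank you all for following and supporting it. For the past few days I have been writing an open-source IP proxy pool project. As the previous posts showed, proxy IPs are one of the most important ways to get past anti-crawler mechanisms. A large, stable supply of proxy IPs plays a major role in crawling work, but stable IP pools are generally expensive. That is the reason this open-source project exists: it crawls the free IPs published by proxy-listing sites (about 70% of them do not work, but there are many sites and a lot of IPs), checks whether they work, and stores the good ones in a database. It also runs an HTTP server that exposes an API for your crawlers to call.

Enough preamble. Let's get to today's topic: a walkthrough of my open-source project, IPProxys.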


The project is laid out as follows:

[Figure: project directory structure]

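api package: implements the HTTP server and provides the API (GET requests, JSON responses)

data folder: holds the database file and qqwry.dat (used to look up the geographic location of an IP)

db package: wraps the database operations

spider package: the core crawler, which scrapes proxy IPs from proxy-listing sites

test package: test cases; not part of the running project

util package: utility classes; IPAddress.py looks up the geographic location of an IP

validator package: tests whether an IP address is usable

config.py: configuration (including how IP addresses are parsed and the database settings)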
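
To make the division of labour between the spider, validator and db packages more concrete, here is a minimal sketch of the "validate, then keep the working ones" step in Python 3. It is not the project's actual code: the names `Proxy`, `check_proxy` and `filter_working`, and the test URL, are assumptions made for illustration. The checker is passed in as a parameter so the filtering logic can be tried without network access.

```python
import time
import urllib.request
from typing import Callable, Iterable, List, NamedTuple, Optional


class Proxy(NamedTuple):
    ip: str
    port: int


def check_proxy(proxy: Proxy, test_url: str = "http://example.com/",
                timeout: float = 5.0) -> Optional[float]:
    """Return the response time in seconds through the proxy, or None on failure.

    The test URL is a placeholder; any stable page would do.
    """
    address = "http://%s:%d" % (proxy.ip, proxy.port)
    handler = urllib.request.ProxyHandler({"http": address, "https": address})
    opener = urllib.request.build_opener(handler)
    start = time.time()
    try:
        with opener.open(test_url, timeout=timeout) as response:
            response.read(1)
    except Exception:
        return None
    return time.time() - start


def filter_working(candidates: Iterable[Proxy],
                   check: Callable[[Proxy], Optional[float]] = check_proxy
                   ) -> List[Proxy]:
    """Keep only the proxies whose check succeeds, fastest first."""
    timed = []
    for proxy in candidates:
        speed = check(proxy)
        if speed is not None:
            timed.append((speed, proxy))
    timed.sort(key=lambda pair: pair[0])
    return [proxy for _, proxy in timed]
```

With a fake checker such as `lambda p: 0.1`, `filter_working` can be exercised offline; with the default `check_proxy` it issues one real request per candidate, so a real run over many free proxies would probably want to do the checks concurrently.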

Now for the key code.

First, apiServer.py:

#coding:utf-8
'''
Supported query keywords: count, types, protocol, country, area
'''
import BaseHTTPServer
import json
import urlparse

from config import API_PORT
from db.SQLiteHelper import SqliteHelper

__author__ = 'Xaxdus'

# keylist=['count', 'types','protocol','country','area']

class WebRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):

    def do_GET(self):
        """Parse the query string, query the database and return JSON."""
        params = {}
        parsed_path = urlparse.urlparse(self.path)
        try:
            # parse_qsl URL-decodes each value and copes with zero, one or many parameters
            for key, value in urlparse.parse_qsl(parsed_path.query):
                params[key] = value
            print params
            str_count = ''
            conditions = []
            for key in params:
                if key == 'count':
                    # int() rejects anything that is not a number
                    str_count = 'LIMIT 0,%d' % int(params[key])
                elif key == 'country' or key == 'area':
                    # prefix match; double any single quote so the value cannot close the literal
                    conditions.append(key + " LIKE '" + params[key].replace("'", "''") + "%'")
                elif key == 'types' or key == 'protocol':
                    conditions.append(key + "=%d" % int(params[key]))
            # '1=1' keeps the WHERE clause valid when no filter is given
            conditions = ' AND '.join(conditions) if conditions else '1=1'
            sqlHelper = SqliteHelper()
            result = sqlHelper.select(sqlHelper.tableName, conditions, str_count)
            print result
            data = json.dumps(result)
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(data)
        except Exception as e:
            print e
            self.send_response(404)
            self.end_headers()

if __name__ == '__main__':
    server = BaseHTTPServer.HTTPServer(('0.0.0.0', API_PORT), WebRequestHandler)
    server.serve_forever()

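As the code shows, the handler parses the request parameters: count (number of proxies to return), types (anonymity: 0 = high anonymity, 1 = transparent), protocol (0: http, 1: https http), country, and area (province or city). For example, requesting http://127.0.0.1:8000/?count=8&types=0 returns JSON data, as shown in the figure below: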

[Figure: screenshot of the JSON response]
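
A crawler could consume this interface roughly as follows. This is a sketch in Python 3, assuming the server returns a JSON array of `[ip, port]` pairs (which is what `json.dumps` would produce for the `ip, port` rows selected from the database). The helper names `build_query_url`, `parse_proxies` and `fetch_proxies` and the sample payload are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base="http://127.0.0.1:8000/", **filters):
    """Append the filters (count, types, protocol, country, area) as a query string."""
    return base + "?" + urllib.parse.urlencode(filters)


def parse_proxies(payload):
    """Turn the JSON text into a list of 'ip:port' strings."""
    return ["%s:%s" % (ip, port) for ip, port in json.loads(payload)]


def fetch_proxies(**filters):
    """Call the API server; only works while apiServer.py is running."""
    with urllib.request.urlopen(build_query_url(**filters), timeout=10) as resp:
        return parse_proxies(resp.read().decode("utf-8"))


# Offline demonstration with a hand-written payload (not real proxies).
print(build_query_url(count=8, types=0))
print(parse_proxies('[["10.0.0.1", 8080], ["10.0.0.2", 3128]]'))
```

Keeping URL building and response parsing separate from the network call makes both easy to check without a running server.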

Next, SQLiteHelper.py (mainly the SQLite operations):

#coding:utf-8
import sqlite3

from config import DB_CONFIG
from db.SqlHelper import SqlHelper

__author__ = 'Xaxdus'

class SqliteHelper(SqlHelper):

    tableName = 'proxys'

    def __init__(self):
        '''
        Open the database connection
        :return:
        '''
        self.database = sqlite3.connect(DB_CONFIG['dbPath'], check_same_thread=False)
        self.cursor = self.database.cursor()
        # create the table structure
        self.createTable()

    def createTable(self):
        self.cursor.execute("create TABLE IF NOT EXISTS %s (id INTEGER PRIMARY KEY ,ip VARCHAR(16) NOT NULL,"
            "port INTEGER NOT NULL ,types INTEGER NOT NULL ,protocol INTEGER NOT NULL DEFAULT 0,"
            "country VARCHAR (20) NOT NULL,area VARCHAR (20) NOT NULL,updatetime TimeStamp NOT NULL DEFAULT (datetime('now','localtime')) ,speed DECIMAL(3,2) NOT NULL DEFAULT 100)" % self.tableName)
        self.database.commit()

    def select(self, tableName, condition, count):
        '''
        :param tableName: table name
        :param condition: WHERE condition
        :param count: LIMIT clause (may be an empty string)
        :return:
        '''
        command = 'SELECT DISTINCT ip,port FROM %s WHERE %s ORDER BY speed ASC %s ' % (tableName, condition, count)
        self.cursor.execute(command)
        result = s
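
The `select` method interpolates the condition straight into the SQL string. Binding the values through `?` placeholders is a common way to guard against injection, and the sketch below shows one way it could look in Python 3 with an in-memory database. `build_where` and `select_proxies` are hypothetical helpers, not part of IPProxys, and the table is a cut-down stand-in for `proxys` with made-up rows.

```python
import sqlite3

INT_KEYS = ("types", "protocol")      # matched exactly
PREFIX_KEYS = ("country", "area")     # matched by prefix with LIKE


def build_where(params):
    """Return (where_clause, values) using '?' placeholders instead of inlined values."""
    clauses, values = [], []
    for key in INT_KEYS:
        if key in params:
            clauses.append(key + " = ?")
            values.append(int(params[key]))
    for key in PREFIX_KEYS:
        if key in params:
            clauses.append(key + " LIKE ?")
            values.append(params[key] + "%")
    # '1=1' keeps the WHERE clause valid when no filter is given
    return (" AND ".join(clauses) or "1=1"), values


def select_proxies(conn, params):
    """Run the filtered query; only column names (from fixed tuples) enter the SQL text."""
    where, values = build_where(params)
    sql = "SELECT DISTINCT ip, port FROM proxys WHERE %s ORDER BY speed ASC" % where
    if "count" in params:
        sql += " LIMIT ?"
        values.append(int(params["count"]))
    return conn.execute(sql, values).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE proxys (ip TEXT, port INTEGER, types INTEGER, "
             "protocol INTEGER, country TEXT, area TEXT, speed REAL)")
conn.executemany("INSERT INTO proxys VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("10.0.0.1", 8080, 0, 0, "China", "Beijing", 1.5),
    ("10.0.0.2", 3128, 0, 0, "China", "Shanghai", 0.4),
    ("10.0.0.3", 80, 1, 0, "China", "Beijing", 0.1),
])
print(select_proxies(conn, {"types": "0", "count": "8"}))
```

Because user input only ever travels in the `values` list, a value such as `x' OR '1'='1` is treated as an ordinary string to match rather than as SQL.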
