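When you run a distributed crawler across a wide-area network, network conditions differ from node to node, and you regularly run into connection errors, timeouts, unreachable hosts, and failed domain lookups. This is where a DNS cache helps.
The most direct approach is to edit /etc/hosts, but that setting is system-wide. You can also control resolution from inside the program.
The idea is simple: before the HTTP client binds a socket address, the socket module resolves the domain name to an IP. Make that lookup return the IP you want and you are done.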
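To make the idea concrete before looking at real code, here is a minimal sketch of pinning one hostname to a fixed IP by wrapping socket.getaddrinfo. The names HOST_OVERRIDES and patched_getaddrinfo, and the example.test mapping, are made up for illustration and do not come from sqlmap or from the project described here.

```python
import socket

# Hypothetical per-process overrides: hostname -> IP (illustrative values only).
HOST_OVERRIDES = {"example.test": "127.0.0.1"}

# Keep a reference to the real resolver so we can still delegate to it.
_original_getaddrinfo = socket.getaddrinfo


def patched_getaddrinfo(host, port, *args, **kwargs):
    """Swap in the pinned IP for known hosts, then call the real getaddrinfo."""
    return _original_getaddrinfo(HOST_OVERRIDES.get(host, host), port, *args, **kwargs)


# From here on, code in this process that resolves names through
# socket.getaddrinfo sees the pinned address for example.test.
socket.getaddrinfo = patched_getaddrinfo
```

Because the override happens at lookup time, only this process is affected and /etc/hosts stays untouched. Hosts that are not in the table fall through to the normal resolver.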
Here is a piece of code lifted from lib/core/option.py in the open-source project sqlmap:
def _setDNSCache():
    """
    Makes a cached version of socket._getaddrinfo to avoid subsequent DNS requests.
    """

    def _getaddrinfo(*args, **kwargs):
        if args in kb.cache:
            return kb.cache[args]
        else:
            kb.cache[args] = socket._getaddrinfo(*args, **kwargs)
            return kb.cache[args]

    if not hasattr(socket, "_getaddrinfo"):
        socket._getaddrinfo = socket.getaddrinfo
        socket.getaddrinfo = _getaddrinfo
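To drop this into my own project, the function needs some rework.
The project requires that DNS settings be saved to a local file and read back from it. It is also worth pointing out that the choice of DNS servers behind the cache matters a lot.
The reworked version: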
# -*- coding:utf-8 -*-
# @version: 1.0
# @author: ZhangZhipeng
# @date: 2015-12-01
# source: sqlmap/lib/core/option.py line_no: 1105
# https://github.com/sqlmapproject/sqlmap
import socket
from dns.resolver import Resolver
from dns.exception import DNSException
resolver = Resolver()
_dnscache = {}
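dns_name_servers = [
    '114.114.114.114',  # 114DNS
    '223.5.5.5',        # AliDNS
    '223.6.6.6',        # AliDNS
    '8.8.8.8',          # Google Public DNS
]
# Merge with the system's servers and drop duplicates.
# Note: set() does not preserve order, so the servers are tried in arbitrary order.
resolver.nameservers = list(set(dns_name_servers + resolver.nameservers))
print("dns servers: %s" % resolver.nameservers)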
def start_dns_cache(file_name=None):
    load_hosts_dns_setting(file_name)
    set_socket_getaddrinfo(file_name)
    dump_hosts_dns_setting(file_name)

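def load_hosts_dns_setting(file_name):
    """Preload _dnscache from a hosts-style file with one "ip host" pair per line."""
    if not file_name:
        return
    try:
        with open(file_name) as f:
            lines = f.read().splitlines()
    except (IOError, OSError):
        # A missing or unreadable file just means there is nothing to preload.
        return
    for raw_line in lines:
        # Split the stripped line on any whitespace (the original split the unstripped line).
        ip_host = raw_line.strip().split()
        if len(ip_host) != 2:
            continue
        ip, host = ip_host
        # (2, 1, 0, ...) is (AF_INET, SOCK_STREAM, proto 0) on Linux; port 80 is assumed.
        addrinfo = (2, 1, 0, '', (ip, 80))
        _dnscache.setdefault((host, 80, 0, 1), []).append(addrinfo)
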
def dump_hosts_dns_setting(file_name):
    """Rewrite the file with every "ip host" pair currently in _dnscache."""
    if not file_name:
        return
    dns_list = set()
    for hostinfo, addrinfo_list in _dnscache.items():
        host = hostinfo[0]
        for addrinfo in addrinfo_list:
            ip = addrinfo[-1][0]
            dns_list.add(ip + " " + host + "\n")
    with open(file_name, "w") as f:
        f.write("".join(dns_list))

def set_socket_getaddrinfo(file_name=None):
    def _getaddrinfo(*args, **kwargs):
        # The cache key is the positional argument tuple, e.g. (host, port, family, socktype).
        if args in _dnscache:
            return _dnscache[args]
        # Ask the configured DNS servers first.
        ip_list = get_addrinfo(args[0])
        if not ip_list:
            # Fall back to the system resolver (this also covers literal IP addresses).
            # The cache is only filled after success, so a failed lookup is never cached as [].
            _dnscache[args] = socket._getaddrinfo(*args, **kwargs)
            return _dnscache[args]
        # (2, 1, 0, ...) is (AF_INET, SOCK_STREAM, proto 0) on Linux.
        _dnscache[args] = [(2, 1, 0, '', (ip, args[1])) for ip in ip_list]
        if file_name:
            # Append the new entries to the hosts-style file right away.
            with open(file_name, "a") as f:
                for ip in ip_list:
                    f.write(ip + " " + args[0] + "\n")
        return _dnscache[args]

    # Patch only once, keeping the original function as socket._getaddrinfo.
    if not hasattr(socket, '_getaddrinfo'):
        socket._getaddrinfo = socket.getaddrinfo
        socket.getaddrinfo = _getaddrinfo

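def get_addrinfo(host):
    """Return the IPs for host as strings, or [] if the lookup fails."""
    try:
        # dnspython 2.x deprecates query() in favor of resolve().
        return [rdata.to_text() for rdata in resolver.query(host)]
    except DNSException:
        return []
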
############### test ###############
def test():
    set_socket_getaddrinfo()
    import urllib
    import requests
    urllib.urlopen('http://baidu.com')
    requests.get("http://baidu.com")
    urllib.urlopen('http://10.20.12.10:26680/client/weibo')

def test_get_addr(hosts=None):
    if type(hosts) in (tuple, list):
        for i in hosts:
            if not i:
                continue
            # print i
            for addrinfo in get_addrinfo(i):
                print("%s %s" % (addrinfo, i))

if __name__ == '__main__':
    test()
    hosts = """
    api.weibo.com
    beacon.sina.com.cn
    login.sina.com.cn
    login.weibo.cn
    passport.weibo.com
    rs.sinajs.cn
    s.weibo.com
    weibo.com
    www.weibo.com
    """
    hosts = hosts.replace(" ", "").splitlines()
    test_get_addr(hosts)
    start_dns_cache("dns-test.hosts")
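This version writes entries straight into a plain-text file. You could instead serialize _dnscache directly, so that only load, update, and dump are needed, but then the config file is no longer human-readable, which is inconvenient and not really advisable. Storing the cache in redis or mongodb is worth considering.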
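For comparison, here is a rough sketch of the serialization route. It uses JSON rather than pickle so the file stays at least somewhat inspectable; that choice, and the function names dump_cache and load_cache, are my own assumptions and not part of the code above. JSON has no tuples and cannot use tuples as keys, so the cache is stored as a list of pairs and the tuple shapes are rebuilt on load.

```python
import json


def dump_cache(cache, path):
    """Write the cache as a JSON list of [key, addrinfo_list] pairs."""
    with open(path, "w") as f:
        json.dump([[list(key), value] for key, value in cache.items()], f)


def load_cache(path):
    """Read the file back, restoring the tuples that getaddrinfo callers expect."""
    with open(path) as f:
        pairs = json.load(f)
    cache = {}
    for key, addrinfo_list in pairs:
        cache[tuple(key)] = [
            (family, socktype, proto, canonname, tuple(sockaddr))
            for family, socktype, proto, canonname, sockaddr in addrinfo_list
        ]
    return cache
```

A round trip would look like dump_cache(_dnscache, "dns-cache.json") at shutdown and _dnscache.update(load_cache("dns-cache.json")) at startup, where the file name is only an example.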
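Pros:
- Only affects the current program
- The DNS servers are configurable (hard-coded)
- You maintain your own DNS-to-host config file
- DNS settings can be persisted (optional)
Cons:
- Does not take effect system-wide. You could pass file_name=/etc/hosts, but be very careful: dump_hosts_dns_setting rewrites the whole file.
- Every domain that gets accessed is cached. You could add a parameter so that only domains already present in the dns-host config are cached and persisted, or keep a separate config file listing which domains to cache and persist.
- There is no read/write lock, so it is not safe with multiple threads.
- Each newly resolved entry is written to the file immediately. For a large crawler with a huge number of sites this can drive I/O up. A buffer would help: persist after a certain number of domains, or on a time interval.
- Writing to a plain file is not safe in itself. A NoSQL or SQL store such as SQLite, redis, or MySQL could manage the data instead.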
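Three of the drawbacks above (no lock, caching every domain, writing on every lookup) can be addressed together. The following is only a sketch under my own assumptions: the class BufferedDnsCache and its parameters flush_every and allowed_hosts are hypothetical names, and it stores plain host-to-IP sets instead of full addrinfo tuples to keep the example short.

```python
import threading


class BufferedDnsCache(object):
    """Hypothetical helper: a lock-protected cache that persists in batches."""

    def __init__(self, file_name, flush_every=100, allowed_hosts=None):
        self._lock = threading.Lock()
        self._cache = {}      # host -> set of IP strings
        self._pending = []    # "ip host\n" lines not yet written to disk
        self._file_name = file_name
        self._flush_every = flush_every
        # None means "cache every host"; otherwise only these hosts are kept.
        self._allowed = set(allowed_hosts) if allowed_hosts else None

    def get(self, host):
        """Return the cached IPs for host as a sorted list, or None."""
        with self._lock:
            ips = self._cache.get(host)
            return sorted(ips) if ips else None

    def add(self, host, ip):
        """Record a new host/IP pair; return True if it was actually added."""
        if self._allowed is not None and host not in self._allowed:
            return False
        with self._lock:
            ips = self._cache.setdefault(host, set())
            if ip in ips:
                return False
            ips.add(ip)
            self._pending.append(ip + " " + host + "\n")
            # Touch the disk only once enough new entries have piled up.
            if len(self._pending) >= self._flush_every:
                self._flush_locked()
        return True

    def flush(self):
        """Force pending entries to disk, e.g. at shutdown or on a timer."""
        with self._lock:
            self._flush_locked()

    def _flush_locked(self):
        # The caller must already hold self._lock.
        if self._pending:
            with open(self._file_name, "a") as f:
                f.writelines(self._pending)
            self._pending = []
```

A single lock is the simplest option; if lookups heavily outnumber insertions, a reader/writer lock or a database-backed store, as suggested in the list, may fit better.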
References: