The complete code comes to less than 100 lines, yet it implements the essential functionality of a web crawler. The design is quite elegant and well worth studying.
Reference:
https://linux.cn/article-8265-1.html
This crawler uses no third-party libraries at all; it is written directly against the standard library, and it does not use coroutines (yield, yield from) either.
The complete code is as follows:
from selectors import DefaultSelector, EVENT_WRITE, EVENT_READ
import socket
import re
import urllib.parse
import time

# URLs not yet fetched
urls_todo = set(['/'])
# URLs already parsed
seen_urls = set(['/'])
concurrency_achieved = 0
selector = DefaultSelector()
stopped = False
class Fetcher:
    def __init__(self, url):
        self.response = b''
        self.url = url
        self.sock = None

    def fetch(self):
        global concurrency_achieved
        concurrency_achieved = max(concurrency_achieved, len(urls_todo))
        self.sock = socket.socket()
        self.sock.setblocking(False)
        try:
            self.sock.connect(('xkcd.com', 80))
        except BlockingIOError:
            pass
        selector.register(self.sock.fileno(), EVENT_WRITE, self.connected)

    def connected(self, key, mask):
        selector.unregister(key.fd)
        get = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(self.url)
        self.sock.send(get.encode('ascii'))
        selector.register(key.fd, EVENT_READ, self.read_response)

    def read_response(self, key, mask):
        global stopped
        chunk = self.sock.recv(4096)  # 4k chunk size.
        if chunk:
            self.response += chunk
        else:
            selector.unregister(key.fd)  # Done reading.
            links = self.parse_links()
            for link in links.difference(seen_urls):
                urls_todo.add(link)
                Fetcher(link).fetch()
            seen_urls.update(links)
            urls_todo.remove(self.url)
            if not urls_todo:
                stopped = True

    def body(self):
        body = self.response.split(b'\r\n\r\n', 1)[1]
        return body.decode('utf-8')

    def parse_links(self):
        if not self.response:
            print('error: {}'.format(self.url))
            return set()
        if not self._is_html():
            return set()
        urls = set(re.findall(r'''(?i)href=["']?([^\s"'<>]+)''',
                              self.body()))
        links = set()
        for url in urls:
            normalized = urllib.parse.urljoin(self.url, url)
            parts = urllib.parse.urlparse(normalized)
            if parts.scheme not in ('', 'http', 'https'):
                continue
            host, port = urllib.parse.splitport(parts.netloc)
            if host and host.lower() not in ('xkcd.com', 'www.xkcd.com'):
                continue
            defragmented, frag = urllib.parse.urldefrag(parts.path)
            links.add(defragmented)
        return links

    def _is_html(self):
        head, body = self.response.split(b'\r\n\r\n', 1)
        headers = dict(h.split(': ') for h in head.decode().split('\r\n')[1:])
        return headers.get('Content-Type', '').startswith('text/html')
start = time.time()
fetcher = Fetcher('/')
fetcher.fetch()
while not stopped:
    events = selector.select()
    for event_key, event_mask in events:
        callback = event_key.data
        callback(event_key, event_mask)
print('{} URLs fetched in {:.1f} seconds, achieved concurrency = {}'.format(
    len(seen_urls), time.time() - start, concurrency_achieved))
Here are a few key points:
1. The selectors module. New in Python 3.4, it provides efficient I/O multiplexing and is commonly used in non-blocking socket programming. Let's first look at a few of its key methods.
register(fileobj, events, data=None)
Purpose: register a file object for selection.
Parameters: fileobj — either an fd (file descriptor) or an object with a fileno() method;
events — an event mask made up of the constants EVENT_READ and/or EVENT_WRITE;
data — the official docs only say "data is an opaque object"; what it is actually used for is up to the caller, and typically a callback function is passed here;
Return value: an instance of the SelectorKey class.
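As a quick illustration (this snippet is not part of the crawler; the host example.com and the callback name on_writable are placeholders), here is a minimal sketch of register() being used with a non-blocking socket and a callback passed as data:
from selectors import DefaultSelector, EVENT_WRITE
import socket

selector = DefaultSelector()

def on_writable(key, mask):
    # Will be invoked by our own loop once the socket is ready for writing.
    print('fd {} is writable'.format(key.fd))

sock = socket.socket()
sock.setblocking(False)
try:
    sock.connect(('example.com', 80))   # non-blocking connect returns immediately
except BlockingIOError:
    pass

# Register for write readiness and stash the callback in `data`.
# register() accepts either the socket object itself or its fileno().
key = selector.register(sock, EVENT_WRITE, on_writable)
print(key)   # SelectorKey(fileobj=..., fd=..., events=2, data=on_writable)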
2. How does the program run? Look at the following code:
while not stopped:
    events = selector.select()
    for event_key, event_mask in events:
        callback = event_key.data
        callback(event_key, event_mask)
The loop is driven by the global variable stopped, which is easy enough to understand. But what exactly is selector.select()?
Documentation: https://docs.python.org/3/library/selectors.html
The documentation doesn't say much about what select() actually does, so let's look at the source code (selectors.py):
@abstractmethod
def select(self, timeout=None):
    """Perform the actual selection, until some monitored file objects are
    ready or a timeout expires.

    Parameters:
    timeout -- if timeout > 0, this specifies the maximum wait time, in
               seconds
               if timeout <= 0, the select() call won't block, and will
               report the currently ready file objects
               if timeout is None, select() will block until a monitored
               file object becomes ready

    Returns:
    list of (key, events) for ready file objects
    `events` is a bitwise mask of EVENT_READ|EVENT_WRITE
    """
    raise NotImplementedError
The source does have a docstring, and it explains things fairly clearly, but we still need to know what the key is. Let's look at the source of the register() method:
def register(self, fileobj, events, data=None):
    if (not events) or (events & ~(EVENT_READ | EVENT_WRITE)):
        raise ValueError("Invalid events: {!r}".format(events))

    key = SelectorKey(fileobj, self._fileobj_lookup(fileobj), events, data)

    if key.fd in self._fd_to_key:
        raise KeyError("{!r} (FD {}) is already registered"
                       .format(fileobj, key.fd))

    self._fd_to_key[key.fd] = key
    return key
As you can see, the key here is a SelectorKey, and the definition of SelectorKey is also very simple:
SelectorKey = namedtuple('SelectorKey', ['fileobj', 'fd', 'events', 'data'])

SelectorKey.__doc__ = """SelectorKey(fileobj, fd, events, data)

    Object used to associate a file object to its backing
    file descriptor, selected event mask, and attached data.
"""
Now, back to the question we started with: how does the program run?
The program starts with Fetcher.fetch(), which calls selector.register(). That method looks up the file descriptor fd of the fileobj argument, builds a SelectorKey named tuple from the four values fileobj, fd, events and data, and stores it so that selector.select() can later return it.
Now look at the following code again:
callback = event_key.data
callback(event_key, event_mask)
event_key.data is exactly the data argument that was passed as the last parameter to register(), event_key is the SelectorKey that register() built from fileobj, and event_mask corresponds to the events argument of register() (EVENT_READ, EVENT_WRITE). In other words, this is the place that really decides what the data argument of register() is used for. (Here it is used as a callback function.)
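To watch the whole register → select → callback cycle in isolation, here is a minimal, self-contained sketch. It is not taken from the crawler; it uses socket.socketpair() purely so that one end can make the other end readable:
from selectors import DefaultSelector, EVENT_READ
import socket

selector = DefaultSelector()
left, right = socket.socketpair()    # two connected sockets, just for demonstration
left.setblocking(False)
right.setblocking(False)

def on_readable(key, mask):
    # key is the SelectorKey that register() built; key.fileobj is `right` here.
    print('received:', key.fileobj.recv(1024))
    selector.unregister(key.fileobj)

selector.register(right, EVENT_READ, on_readable)
left.send(b'hello')                  # makes `right` readable

events = selector.select()           # blocks until a registered fd is ready
for event_key, event_mask in events:
    callback = event_key.data        # the function we stashed in `data`
    callback(event_key, event_mask)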
That covers the most essential part; when I have time I'll come back and see whether there are any other points worth emphasizing.
3. Socket programming
A socket is the cornerstone of network communication and the basic unit of TCP/IP networking. TCP is a lower-level transport protocol that specifies how connections are established and how data is transmitted, while the HTTP we use every day is an application-layer protocol that specifies the format of the content being transferred; HTTP data is carried over TCP, so anything that supports HTTP necessarily supports TCP as well. Reference:
https://docs.python.org/3/library/socket.html
Sockets are fairly low level, so they are complicated and come with plenty of rules. Here we only give a brief introduction; look at the following client code that fetches data:
import socket                    # import the socket module

s = socket.socket()              # create a socket object
host = 'www.baidu.com'           # server address
port = 80                        # port number
response = b''
url = '/'

s.connect((host, port))          # connect to the server
get = 'GET {0} HTTP/1.0\r\nHost: {1}\r\n\r\n'.format(url, host)
s.send(get.encode('ascii'))      # send the request

while True:
    chunk = s.recv(1024)         # read the data returned by the server
    if not chunk:
        break
    response += chunk

body = response.split(b'\r\n\r\n', 1)[1]
print(body)
s.close()                        # close the connection
Let's look at a few more methods:
socket.fileno() — Return the socket's file descriptor (a small integer).
This method returns a file descriptor fd, which is exactly the argument we passed to register() in the crawler code at the top.
socket.recv(bufsize) — Receive data from the socket. The return value is a bytes object representing the data received. The maximum amount of data to be received at once is specified by bufsize.
This receives TCP data and returns a bytes object of at most bufsize bytes. If we need to receive 5000 bytes and bufsize is 1024, it takes 5 calls to recv() to read everything. bufsize is simply the chunk size; for better network performance it is usually best not to transfer too much data at once, so bufsize should not be set too large.
socket.send(bytes) — Send data to the socket. The socket must be connected to a remote socket.
The bytes argument has to follow the expected format; the format used in the code above is what you need.
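One supplementary detail, not covered above: send() may transmit only part of the buffer and returns the number of bytes actually sent, so a robust client either loops over send() or simply uses socket.sendall(). A minimal sketch, reusing the same www.baidu.com request from the example above:
import socket

s = socket.socket()
s.connect(('www.baidu.com', 80))
request = 'GET / HTTP/1.0\r\nHost: www.baidu.com\r\n\r\n'.encode('ascii')

# send() may write only part of the buffer, so loop until everything is sent...
total = 0
while total < len(request):
    total += s.send(request[total:])

# ...or simply let sendall() do the looping internally:
# s.sendall(request)
s.close()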
4. HTTP and HTTPS
In fact, the complete code above doesn't actually retrieve any data, because the site only allows access over https while our code uses http, so it needs a small change:
def fetch(self):
    global concurrency_achieved
    concurrency_achieved = max(concurrency_achieved, len(urls_todo))

    # wrap the socket with ssl so we speak https (requires `import ssl` at the top)
    self.sock = ssl.wrap_socket(socket.socket())
    # self.sock.setblocking(False)
    try:
        self.sock.connect(('xkcd.com', 443))   # port 443 instead of the original 80
    except BlockingIOError:
        pass
    selector.register(self.sock.fileno(), EVENT_WRITE, self.connected)
The code above makes three changes: first, the socket is wrapped with ssl; second, the non-blocking setblocking(False) call is commented out; and third, the port is changed from 80 to 443.
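As a side note, ssl.wrap_socket() is deprecated on newer Python versions (and removed in 3.12); the recommended approach is to wrap the socket through an SSLContext. Below is a minimal blocking sketch of the same HTTPS request, kept separate from the crawler since a non-blocking TLS handshake needs extra handling:
import socket
import ssl

context = ssl.create_default_context()
sock = context.wrap_socket(socket.socket(), server_hostname='xkcd.com')
sock.connect(('xkcd.com', 443))          # TCP connect plus TLS handshake (blocking)
sock.sendall(b'GET / HTTP/1.0\r\nHost: xkcd.com\r\n\r\n')

response = b''
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()
print(response.split(b'\r\n\r\n', 1)[0].decode())   # print just the response headers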