Distributed Crawlers in Practice (1): Internet Protocol Fundamentals

贫僧草头

Categories: Python, Search Engines, Data Science. Tags: crawler, Internet, distributed

Article link: https://blog.csdn.net/Lookka/article/details/74911387

The HTTP Protocol

HTTP Headers

In the header section of an HTTP request, a few fields deserve particular attention:

 Accept: text/plain
 Accept-Charset: utf-8
 Accept-Encoding: gzip, deflate
 Accept-Language: en-US
 Connection: keep-alive
 Content-Length: 348
 Content-Type: application/x-www-form-urlencoded
 Date: Tue, 15 Nov 1994 08:12:31 GMT
 Host: en.wikipedia.org:80
 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36
 Cookie: CP=H2; WMF-Last-Access=10-Jul-2017; WMF-Last-Access-Global=10-Jul-2017; GeoIP=HK:HCW:Hong_Kong:22.28:114.15:v4
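
To make the header settings concrete, here is a minimal sketch of attaching a few of these fields to a request with the standard library. It is written for Python 3 (`urllib.request`), whereas the crawler listing later in this post uses Python 2's `urllib2`. The helper name `build_request` and the constant `REQUEST_HEADERS` are my own, chosen for illustration only.

```python
import urllib.request

# A small subset of the request headers listed above (illustrative values).
REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US",
    "Connection": "keep-alive",
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/59.0.3071.115 Safari/537.36"
    ),
}


def build_request(url, cookie=None):
    """Return a Request carrying browser-like headers (hypothetical helper)."""
    headers = dict(REQUEST_HEADERS)
    if cookie:
        # Cookies usually come from an earlier response's Set-Cookie header.
        headers["Cookie"] = cookie
    return urllib.request.Request(url, headers=headers)


req = build_request("http://en.wikipedia.org/", cookie="CP=H2")
# Passing req to urllib.request.urlopen() would actually send it;
# this sketch only builds the request object.
```

Note that `urllib.request` stores header names in capitalized form, so the User-Agent value is read back with `req.get_header("User-agent")`.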

Fields worth noting in an HTTP response header:

 Accept-Patch: text/example;charset=utf-8
 Cache-Control: max-age=3600
 Content-Encoding: gzip
 Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT
 Content-Language: da
 Content-Length: 348
 ETag: "737060cd8c284d8af7ad3082f209582d"
 Expires: Thu, 01 Dec 1994 16:00:00 GMT
 Location: http://www.w3.org/pub/WWW/People.html
 Set-Cookie: UserID=JohnDoe; Max-Age=3600; Version=1
 Status: 200 OK
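
As a rough illustration of why some of these response fields matter to a crawler, the Python 3 sketch below does two things: it uses Content-Encoding to decide whether the body needs decompressing, and it turns ETag and Last-Modified into the conditional request headers If-None-Match and If-Modified-Since for the next visit. The function names are hypothetical, and `headers` is assumed to be a plain dict whose keys are spelled exactly as shown.

```python
import gzip


def decode_body(headers, body):
    """Decompress the body when the server says it is gzip-encoded."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        return gzip.decompress(body)
    return body


def revalidation_headers(headers):
    """Build conditional request headers from a previous response."""
    conditional = {}
    if "ETag" in headers:
        conditional["If-None-Match"] = headers["ETag"]
    if "Last-Modified" in headers:
        conditional["If-Modified-Since"] = headers["Last-Modified"]
    return conditional


# Values taken from the sample response header above.
previous = {
    "Content-Encoding": "gzip",
    "ETag": '"737060cd8c284d8af7ad3082f209582d"',
    "Last-Modified": "Tue, 15 Nov 1994 12:45:26 GMT",
}
page = decode_body(previous, gzip.compress(b"<html></html>"))
next_visit = revalidation_headers(previous)
```

If the page has not changed, a server that honors these conditional headers can answer with 304 Not Modified and no body, which saves a download.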

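Capturing Traffic with the Postman Chrome Extension

The capture tools I use most are Firebug (which seems to be no longer maintained), Postman, and Wireshark.

Postman's strengths: it is simple to use, it can generate code automatically, and it can forge or replay requests.

After installing the Postman extension for Chrome, open the Inspector and click the request you want to examine. The request and response data appear in the right-hand panel:

[Figure: Postman panel showing the request and response data]

The "Code" button in the right-hand panel generates code for the request:

[Figure: request code generated by Postman]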

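Status Codes

Every HTTP request is answered with a status code:

2XX: success
3XX: redirection
4XX: client error
5XX: server error

 400 Bad Request: the request has a syntax error and the server cannot understand it
 401 Unauthorized: the request is not authorized; this status code must be used together with the WWW-Authenticate header field
 403 Forbidden: the server received the request but refuses to serve it
 404 Not Found: the requested resource does not exist, e.g. a mistyped URL
 500 Internal Server Error: the server hit an unexpected error
 503 Service Unavailable: the server cannot handle the request right now; it may recover after a while
 300 Multiple Choices: several resources are available; the crawler can process or discard them
 301 Moved Permanently: redirect
 302 Found: redirect
 304 Not Modified: the requested resource has not been updated; discard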
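
One way a crawler might act on these codes is sketched below in Python 3. The action labels and the function name `crawl_action` are my own invention, and the mapping is only one reasonable policy: for 300 the list above allows either processing or discarding, and this sketch picks discarding.

```python
def crawl_action(status):
    """Map an HTTP status code to a crawler action (illustrative policy)."""
    if 200 <= status < 300:
        return "process"          # success: parse the page
    if status in (301, 302):
        return "follow_redirect"  # fetch the URL in the Location header
    if 300 <= status < 400:
        return "discard"          # 300 and 304: nothing new to download
    if 400 <= status < 500:
        return "give_up"          # client error: retrying will not help
    if status == 503:
        return "retry_later"      # server may recover after a while
    return "give_up"              # other server errors and unknown codes


actions = {code: crawl_action(code) for code in (200, 301, 304, 404, 503)}
```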

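Breadth-First vs. Depth-First

Which is better, breadth-first or depth-first? The answer is: it depends.

In most cases breadth-first is the better choice, because:

Important pages tend to be close to the seed nodes
Breadth-first lends itself to several crawlers working concurrently
The web is generally not very deep, and most pages can be reached by many paths

Of course, the stock answer is to combine breadth-first and depth-first.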
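
The difference between the two strategies comes down to which end of the frontier the next URL is taken from. The Python 3 sketch below walks a small in-memory link graph both ways; the graph, the function name `crawl_order`, and the page names are made up for illustration, and no network access is involved.

```python
from collections import deque


def crawl_order(graph, seed, breadth_first=True):
    """Return the order in which pages are visited, starting from seed."""
    frontier = deque([seed])
    seen = set()
    order = []
    while frontier:
        # Breadth-first takes the oldest entry (queue);
        # depth-first takes the newest entry (stack).
        url = frontier.popleft() if breadth_first else frontier.pop()
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        links = graph.get(url, [])
        if not breadth_first:
            # Push in reverse so the first link is explored first.
            links = list(reversed(links))
        for link in links:
            if link not in seen:
                frontier.append(link)
    return order


# Hypothetical site: the seed links to a and b, and both link to c.
site = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "d"]}
bfs = crawl_order(site, "seed", breadth_first=True)
dfs = crawl_order(site, "seed", breadth_first=False)
```

Here `bfs` is `['seed', 'a', 'b', 'c', 'd']` and `dfs` is `['seed', 'a', 'c', 'b', 'd']`: breadth-first finishes every page one hop from the seed before it goes deeper.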

Breadth-first sample code:

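# Python 2 code. Third-party dependencies: lxml and pybloomfilter.
# The output directory 'iterate/' must exist before the crawler starts.
import urllib2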
from collections import deque
import json
from lxml import etree
import httplib
import hashlib
from pybloomfilter import BloomFilter

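# Breadth-first crawler: cur_queue holds the URLs of the current level and
# child_queue collects the links found on them for the next level.
class CrawlBSF: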
    request_headers = {
        'host': "www.mafengwo.cn",
        'connection': "keep-alive",
        'cache-control': "no-cache",
        'upgrade-insecure-requests': "1",
        'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36",
        'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        'accept-language': "zh-CN,en-US;q=0.8,en;q=0.6"
        }

    cur_level = 0
    max_level = 5
    dir_name = 'iterate/'
    iter_width = 50
    downloaded_urls = []

    du_md5_file_name = dir_name + 'download.txt'
    du_url_file_name = dir_name + 'urls.txt'

    download_bf = BloomFilter(1024*1024*16, 0.01)

    cur_queue = deque()
    child_queue = deque()

    def __init__(self, url):
        self.root_url = url
        self.cur_queue.append(url)
        self.du_file = open(self.du_url_file_name, 'a+')
        try:
            # Reload the MD5 digests of URLs downloaded by earlier runs.
            # strip() removes the line ending whatever form it takes.
            self.dumd5_file = open(self.du_md5_file_name, 'r')
            self.downloaded_urls = [line.strip() for line in self.dumd5_file
                                    if line.strip()]
            self.dumd5_file.close()
            for urlmd5 in self.downloaded_urls:
                self.download_bf.add(urlmd5)
        except IOError:
            print "File not found"
        finally:
            self.dumd5_file = open(self.du_md5_file_name, 'a+')

    def enqueueUrl(self, url):
        self.child_queue.append(url)

    def dequeuUrl(self):
        try:
            url = self.cur_queue.popleft()
            return url
        except IndexError:
            self.cur_level += 1
            if self.cur_level == self.max_level:
                return None
            if len(self.child_queue) == 0:
                return None
            self.cur_queue = self.child_queue
            self.child_queue = deque()
            return self.dequeuUrl()

    def getpagecontent(self, cur_url):
        print "downloading %s at level %d" % (cur_url, self.cur_level)
        # Build the file name before the try block so the IOError handler can
        # always refer to it (urllib2.URLError is a subclass of IOError).
        filename = cur_url[7:].replace('/', '_')
        try:
            req = urllib2.Request(cur_url, headers=self.request_headers)
            response = urllib2.urlopen(req)
            html_page = response.read()
            fo = open("%s%s.html" % (self.dir_name, filename), 'wb+')
            fo.write(html_page)
            fo.close()
        except urllib2.HTTPError as e:
            print e
            return
        except httplib.BadStatusLine:
            print 'BadStatusLine'
            return
        except IOError:
            print 'IO Error at ' + filename
            return
        except Exception as e:
            print e
            return
        # print 'add ' + hashlib.md5(cur_url).hexdigest() + ' to list'
        dumd5 = hashlib.md5(cur_url).hexdigest()
        self.downloaded_urls.append(dumd5)
        self.dumd5_file.write(dumd5 + '\r\n')
        self.du_file.write(cur_url + '\r\n')
        self.download_bf.add(dumd5)

        html = etree.HTML(html_page.lower().decode('utf-8'))
        hrefs = html.xpath(u"//a")

        for href in hrefs:
            try:
                if 'href' in href.attrib:
                    val = href.attrib['href']
                    # Skip javascript: pseudo-links
                    if val.find('javascript:') != -1:
                        continue
                    # Resolve site-relative links; skip other relative forms
                    if not val.startswith('http://'):
                        if val.startswith('/'):
                            val = 'http://www.mafengwo.cn' + val
                        else:
                            continue
                    # Drop a trailing slash so equivalent URLs hash the same
                    if val.endswith('/'):
                        val = val[:-1]
                    # if hashlib.md5(val).hexdigest() not in self.downloaded_urls:
                    # Enqueue only URLs whose MD5 is not in the Bloom filter
                    if hashlib.md5(val).hexdigest() not in self.download_bf:
                        self.enqueueUrl(val)
                    else:
                        print 'Skip %s' % (val)
            except ValueError:
                continue

    def start_crawl(self):
        while True:
            url = self.dequeuUrl()
            if url is None:
                break
            self.getpagecontent(url)
        self.dumd5_file.close()
        self.du_file.close()

crawler = CrawlBSF("http://www.mafengwo.cn")
crawler.start_crawl()
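
The deduplication step in the listing (normalize the link, hash it with MD5, test it against the set of known digests) can be tried on its own without a network connection. The Python 3 sketch below is a simplified stand-in, not the code above: it uses a plain `set` instead of the Bloom filter, resolves links with `urllib.parse.urljoin`, and records a URL when it is first seen rather than after it is downloaded. The class name `SeenUrls` is hypothetical.

```python
import hashlib
from urllib.parse import urljoin


class SeenUrls:
    """Track URLs by MD5 digest (a set stands in for the Bloom filter)."""

    def __init__(self):
        self._digests = set()

    @staticmethod
    def normalize(base_url, href):
        """Resolve href against base_url and drop one trailing slash."""
        absolute = urljoin(base_url, href)
        return absolute[:-1] if absolute.endswith("/") else absolute

    def add_if_new(self, url):
        """Record url and return True, or return False if already seen."""
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True


seen = SeenUrls()
first = SeenUrls.normalize("http://www.mafengwo.cn", "/mdd/")
again = SeenUrls.normalize("http://www.mafengwo.cn", "http://www.mafengwo.cn/mdd")
results = [seen.add_if_new(first), seen.add_if_new(again)]
```

A set gives exact answers but grows with the number of URLs. A Bloom filter keeps memory roughly fixed at the cost of occasional false positives, which for a crawler means a few pages are skipped by mistake.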