crawler4j Source Code Analysis (Part 3): Fetcher

lvvista

Original article: https://blog.csdn.net/lvvista/article/details/37655687

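For a crawler, the Fetcher's main job is to retrieve the resource behind a given URL and hand it to the Parser. Usually the two are connected through a page buffer, which keeps the coupling between them to a minimum. crawler4j takes a different route: every crawler thread has its own parser, all parsers share a single global fetcher, and the fetcher's page-fetching function is designed to be reentrant, so the parsers never need to synchronize with each other or contend for the fetcher. As WebCrawler's main processing function processPage shows, once a page has been fetched, all parsing and information extraction happen right there in the same function. This removes the interaction between modules and greatly simplifies the implementation.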

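crawler4j's fetcher is simple: four classes in total. PageFetcher does all the main work, PageFetchResult holds the response, IdleConnectionMonitorThread is the thread that monitors the connection pool, and CustomFetchStatus handles the mapping of HTTP status codes, including crawler-specific ones such as PageTooBig. Overall, the fetcher module does two things. The first is to initialize the HTTP request parameters from the configuration and to create the HttpClient connection manager, PoolingClientConnectionManager, together with its monitor thread. The connection manager itself is not a thread; the polling is done by IdleConnectionMonitorThread, whose job in crawler4j is to close idle and expired connections so that system resources are used efficiently. The second is fetching the pages, which is covered further below. The request parameters are set like this: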

params.setParameter(ClientPNames.COOKIE_POLICY, CookiePolicy.BROWSER_COMPATIBILITY);
params.setParameter(CoreProtocolPNames.USER_AGENT, config.getUserAgentString());
params.setIntParameter(CoreConnectionPNames.SO_TIMEOUT, config.getSocketTimeout());
params.setIntParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, config.getConnectionTimeout());

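These are the basic HTTP settings. SO_TIMEOUT is the maximum period of inactivity allowed between two consecutive data packets on a connection; if it is exceeded, the socket is considered to have timed out. CONNECTION_TIMEOUT is the maximum time allowed for establishing a connection; if it is exceeded, the connection attempt is considered to have failed.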
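
To make the difference between the two timeouts concrete, here is a minimal sketch that uses the JDK's own HttpURLConnection instead of HttpClient, purely so that the example has no dependencies. The class name TimeoutSketch and the timeout values are made up for illustration: setConnectTimeout plays the role of CONNECTION_TIMEOUT, and setReadTimeout plays the role of SO_TIMEOUT.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of crawler4j or HttpClient.
class TimeoutSketch {

    // Prepares a connection object with both timeouts set.
    // openConnection() does not touch the network; nothing is sent until connect().
    static HttpURLConnection configure(String url, int connectTimeoutMs, int socketTimeoutMs) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            // Counterpart of CONNECTION_TIMEOUT: maximum time to establish the connection.
            conn.setConnectTimeout(connectTimeoutMs);
            // Counterpart of SO_TIMEOUT: maximum time to wait for the next piece of data.
            conn.setReadTimeout(socketTimeoutMs);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = configure("http://example.com/", 30000, 20000);
        System.out.println(conn.getConnectTimeout() + " / " + conn.getReadTimeout());
    }
}
```

A slow server that accepts the connection quickly but then stalls in the middle of the body would trip the second timeout, not the first.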

Only two settings concern PoolingClientConnectionManager:

connectionManager = new PoolingClientConnectionManager(schemeRegistry);
connectionManager.setMaxTotal(config.getMaxTotalConnections());
connectionManager.setDefaultMaxPerRoute(config.getMaxConnectionsPerHost());

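One is the maximum total number of connections; the other is the maximum number of connections per host (per route, in HttpClient's terms).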
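
How the two limits interact can be sketched with plain semaphores. This is not how HttpClient implements its pool; ConnectionLimiter is a hypothetical class that only models the bookkeeping, and unlike the real connection manager it refuses immediately instead of making the caller wait for a free connection.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical model of a "max total" plus "max per host" limit; not HttpClient code.
class ConnectionLimiter {
    private final Semaphore total;
    private final int maxPerHost;
    private final ConcurrentHashMap<String, Semaphore> perHost = new ConcurrentHashMap<>();

    ConnectionLimiter(int maxTotal, int maxPerHost) {
        this.total = new Semaphore(maxTotal);
        this.maxPerHost = maxPerHost;
    }

    // Tries to reserve one connection slot for the host without blocking.
    boolean tryAcquire(String host) {
        if (!total.tryAcquire()) {
            return false; // global limit reached
        }
        Semaphore hostSlots = perHost.computeIfAbsent(host, h -> new Semaphore(maxPerHost));
        if (!hostSlots.tryAcquire()) {
            total.release(); // hand the global slot back
            return false;    // per-host limit reached
        }
        return true;
    }

    // Returns a slot that was previously obtained with tryAcquire(host).
    void release(String host) {
        Semaphore hostSlots = perHost.get(host);
        if (hostSlots != null) {
            hostSlots.release();
            total.release();
        }
    }
}
```

With a total of 3 and a per-host limit of 2, a third simultaneous request to the same host is refused even though the pool as a whole still has room.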

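Creating and maintaining connections is the most resource-intensive part of HTTP, so managing and reusing them efficiently matters. crawler4j therefore uses HttpClient's PoolingClientConnectionManager to manage its HTTP connections. The manager's main function is to hand an existing connection to each new request; if no connection is available for the request's route, it creates a new one, and it automatically enforces the per-host connection limit. That limit is especially important for a crawler: too many simultaneous connections to one host put a heavy load on the remote server and can get the crawler blocked. It is therefore one way to implement polite crawling, in addition to setting a delay between two requests. Below is the code of the connection monitor thread, which simply keeps polling and closes expired and idle connections: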

public IdleConnectionMonitorThread(PoolingClientConnectionManager connMgr) {
    super("Connection Manager");
    this.connMgr = connMgr;
}

@Override
public void run() {
    try {
        while (!shutdown) {
            synchronized (this) {
                wait(5000);
                // Close expired connections
                connMgr.closeExpiredConnections();
                // Optionally, close connections
                // that have been idle longer than 30 sec
                connMgr.closeIdleConnections(30, TimeUnit.SECONDS);
            }
        }
    } catch (InterruptedException ex) {
        // terminate
    }
}
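
The snippet reads a shutdown flag but does not show how it gets set. A minimal, dependency-free sketch of the same wait/notify pattern follows. PeriodicMonitor and shutdownAndAwait are hypothetical names, and the Runnable stands in for the closeExpiredConnections() and closeIdleConnections() calls; this is an illustration of the pattern, not crawler4j's class.

```java
// Hypothetical sketch of a polling monitor thread with a clean shutdown.
class PeriodicMonitor extends Thread {
    private final Runnable task;
    private final long intervalMs;
    private volatile boolean shutdown;

    PeriodicMonitor(Runnable task, long intervalMs) {
        super("Periodic Monitor");
        this.task = task;
        this.intervalMs = intervalMs;
        setDaemon(true); // housekeeping should not keep the JVM alive
    }

    @Override
    public void run() {
        try {
            while (true) {
                synchronized (this) {
                    if (shutdown) break;   // checked under the lock, so no wake-up is missed
                    wait(intervalMs);      // sleep until the interval elapses or shutdown() notifies
                    if (shutdown) break;
                }
                task.run();                // the periodic housekeeping work
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt(); // treat interruption as a request to stop
        }
    }

    // Sets the flag and wakes the thread so it does not sleep through the full interval.
    public void shutdown() {
        shutdown = true;
        synchronized (this) {
            notifyAll();
        }
    }

    // Signals shutdown, then waits up to timeoutMs for the thread to exit.
    boolean shutdownAndAwait(long timeoutMs) {
        shutdown();
        try {
            join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !isAlive();
    }
}
```

The notifyAll() in shutdown() is what lets the thread stop promptly instead of waiting out the rest of its polling interval.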

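In addition, crawler4j deals with content encoding on both the request and the response side. On the request side it sets a header announcing that gzip-encoded responses are accepted; on the response side it registers a response interceptor that wraps gzip-encoded entities so that they are decompressed when read. Responses that carry no Content-Encoding header pass through unchanged. Here is the code: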

get = new HttpGet(toFetchURL);
get.addHeader("Accept-Encoding", "gzip");

// Register a response interceptor
httpClient.addResponseInterceptor(new HttpResponseInterceptor() {
    @Override
    public void process(final HttpResponse response, final HttpContext context)
            throws HttpException, IOException {
        HttpEntity entity = response.getEntity();
        Header contentEncoding = entity.getContentEncoding();
        if (contentEncoding != null) {
            HeaderElement[] codecs = contentEncoding.getElements();
            for (HeaderElement codec : codecs) {
                // Wrap gzip-encoded responses so that they are decompressed on read
                if (codec.getName().equalsIgnoreCase("gzip")) {
                    response.setEntity(new GzipDecompressingEntity(response.getEntity()));
                    return;
                }
            }
        }
    }
});
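
What the interceptor achieves can be tried out without HttpClient. The following sketch uses only java.util.zip; GzipSketch, gzip and readBody are hypothetical names, and the byte array stands in for a response body. The body is decompressed only when the declared content encoding is gzip, and is otherwise read as is.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for "server compresses, client decompresses".
class GzipSketch {

    // Simulates a server that gzip-compresses a text body.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a body, decompressing it only if the Content-Encoding value is gzip.
    static String readBody(byte[] body, String contentEncoding) {
        try {
            InputStream in = new ByteArrayInputStream(body);
            if ("gzip".equalsIgnoreCase(contentEncoding)) {
                in = new GZIPInputStream(in); // same wrapping idea as the interceptor
            }
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling readBody(gzip("hello"), "gzip") gives back the original text, while a body without a content encoding is returned untouched.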

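Now for fetching the page itself. All of this work is done in the fetchHeader function. It first builds a GET request from the URL it receives (setting the request header Accept-Encoding: gzip) and then stores the response in a PageFetchResult. This class has only five member variables: statusCode, entity, responseHeaders, fetchedUrl and movedToUrl. The first is the response status code, the second is the response body, and the third holds the response headers; fetchedUrl stores the URL of this request, and movedToUrl stores the URL after a redirect.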

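For each response, the fetcher first checks the status code. If it indicates a redirect (only 301 and 302 are handled), the redirect target is canonicalized and stored: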

if (statusCode != HttpStatus.SC_OK) {
    if (statusCode != HttpStatus.SC_NOT_FOUND) {
        if (statusCode == HttpStatus.SC_MOVED_PERMANENTLY || statusCode == HttpStatus.SC_MOVED_TEMPORARILY) {
            Header header = response.getFirstHeader("Location");
            if (header != null) {
                String movedToUrl = header.getValue();
                movedToUrl = URLCanonicalizer.getCanonicalURL(movedToUrl, toFetchURL);
                fetchResult.setMovedToUrl(movedToUrl);
            }
            fetchResult.setStatusCode(statusCode);
            return fetchResult;
        }
        logger.info("Failed: " + response.getStatusLine().toString() + ", while fetching " + toFetchURL);
    }
    fetchResult.setStatusCode(response.getStatusLine().getStatusCode());
    return fetchResult;
}
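
The second argument to getCanonicalURL matters because a Location header may hold a relative URL, which only makes sense relative to the page that was requested. The sketch below shows just that resolution step with java.net.URI. RedirectSketch and resolveLocation are hypothetical names, and URLCanonicalizer presumably does more normalization than this.

```java
import java.net.URI;

// Hypothetical illustration of resolving a redirect target; not crawler4j's URLCanonicalizer.
class RedirectSketch {

    // Resolves a Location header value against the URL that was fetched.
    // Returns null when either value is not a syntactically valid URI.
    static String resolveLocation(String fetchedUrl, String location) {
        try {
            return URI.create(fetchedUrl).resolve(location.trim()).toString();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String base = "http://example.com/a/b.html";
        System.out.println(resolveLocation(base, "/login"));
        System.out.println(resolveLocation(base, "../c.html"));
        System.out.println(resolveLocation(base, "https://other.example/x"));
    }
}
```

An absolute Location value replaces the base entirely, while a relative one inherits the scheme and host of the fetched URL.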

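Next, the fetcher checks whether the entity is larger than the configured maximum download size. If it is, the page is not processed any further and the status code is set to CustomFetchStatus.PageTooBig: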

if (fetchResult.getEntity() != null) {
    long size = fetchResult.getEntity().getContentLength();
    if (size == -1) {
        Header length = response.getLastHeader("Content-Length");
        if (length == null) {
            length = response.getLastHeader("Content-length");
        }
        if (length != null) {
            size = Integer.parseInt(length.getValue());
        } else {
            size = -1;
        }
    }
    if (size > config.getMaxDownloadSize()) {
        fetchResult.setStatusCode(CustomFetchStatus.PageTooBig);
        get.abort(); // body too large: end the request here
        return fetchResult;
    }

    fetchResult.setStatusCode(HttpStatus.SC_OK);
    return fetchResult;

}
get.abort(); // no body at all: end the request here
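
The decision made by this block can be isolated into a small function that is easy to test. The sketch below is a hypothetical restatement, not crawler4j code: SizeCheckSketch, declaredSize and isTooBig are made-up names. It uses Long.parseLong rather than Integer.parseInt so that a declared length above 2 GB does not throw, and it treats an unparsable header as "size unknown".

```java
// Hypothetical restatement of the size check; names are illustrative only.
class SizeCheckSketch {

    // Returns the declared body size in bytes, or -1 when it cannot be determined.
    static long declaredSize(long entityContentLength, String contentLengthHeader) {
        if (entityContentLength >= 0) {
            return entityContentLength;      // the entity already knows its length
        }
        if (contentLengthHeader == null) {
            return -1;                       // no header to fall back on
        }
        try {
            return Long.parseLong(contentLengthHeader.trim());
        } catch (NumberFormatException e) {
            return -1;                       // malformed header: treat as unknown
        }
    }

    // True when the declared size exceeds the limit; an unknown size (-1) never does.
    static boolean isTooBig(long entityContentLength, String contentLengthHeader,
                            long maxDownloadSize) {
        return declaredSize(entityContentLength, contentLengthHeader) > maxDownloadSize;
    }
}
```

As in the original logic, a response whose size cannot be determined passes the check, so the limit only protects against bodies that declare their length up front.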

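That is all the fetcher does. Converting the body and extracting information are outside its responsibilities: turning the body into a byte array and parsing it are done in WebCrawler. Strictly speaking, this belongs to the parser, which the next part covers.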

Page page = new Page(curURL);
int docid = curURL.getDocid();

if (!fetchResult.fetchContent(page)) {
    onContentFetchError(curURL);
    return;
}

if (!parser.parse(page, curURL.getURL())) {
    onParseError(curURL);
    return;
}
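One thing deserves attention here: the two get.abort() calls in the size-check code of the fetcher. If the response has no body, or the body exceeds the size limit, the request is over at that point, so get.abort() ends it directly, and the response stream carried by the connection is released along with it. For a successful request, abort is not called, because the later body conversion (the fetchResult.fetchContent(page) call in WebCrawler) still has to read from that response stream, so it must stay open. The custom gzip wrapper defined in PageFetcher shows this: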

private static class GzipDecompressingEntity extends HttpEntityWrapper {

    public GzipDecompressingEntity(final HttpEntity entity) {
        super(entity);
    }

    @Override
    public InputStream getContent() throws IOException, IllegalStateException {

        // the wrapped entity's getContent() decides about repeatability
        InputStream wrappedin = wrappedEntity.getContent();

        return new GZIPInputStream(wrappedin);
    }

    @Override
    public long getContentLength() {
        // length of ungzipped content is not known
        return -1;
    }

}

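The getContent() method above wraps the content stream of the underlying HttpEntity in a GZIPInputStream. This wrapping is what allows the body to be converted into a byte array later: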

// In PageFetchResult
public boolean fetchContent(Page page) {
    try {
        page.load(entity);
        page.setFetchResponseHeaders(responseHeaders);
        return true;
    } catch (Exception e) {
        logger.info("Exception while fetching content for: " + page.getWebURL().getURL() + " [" + e.getMessage()
                + "]");
    }
    return false;
}

// In Page
public void load(HttpEntity entity) throws Exception {

    contentType = null;
    Header type = entity.getContentType();
    if (type != null) {
        contentType = type.getValue();
    }

    contentEncoding = null;
    Header encoding = entity.getContentEncoding();
    if (encoding != null) {
        contentEncoding = encoding.getValue();
    }

    Charset charset = ContentType.getOrDefault(entity).getCharset();
    if (charset != null) {
        contentCharset = charset.displayName();
    }

    // Reads the whole response stream into memory
    contentData = EntityUtils.toByteArray(entity);
}
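
The charset lookup in load() relies on the charset parameter of the Content-Type header. A rough, dependency-free approximation of that lookup is sketched below. CharsetSketch and charsetOf are hypothetical names, and HttpClient's ContentType parser is more thorough than this simple split on semicolons.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical approximation of extracting the charset from a Content-Type value.
class CharsetSketch {

    // Returns the charset named in e.g. "text/html; charset=UTF-8", or null if
    // the parameter is absent or names a charset this JVM does not support.
    static Charset charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {          // parts[0] is the media type
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return null;                          // unknown or malformed charset name
                }
            }
        }
        return null;
    }
}
```

When the header carries no charset, the result is null, which corresponds to the case in load() where contentCharset is simply left unset.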

That completes the walkthrough of the fetcher's workflow. The next part looks at the design and implementation of the parser.
