Nutch Search Engine Source Code Study, Part 1: Web Page Fetching (2)

iteye_12007

Category: Search Engine. Tags: search engine, Socket, C++, C#, C

Article link: https://blog.csdn.net/iteye_12007/article/details/81969400

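Today we look at how the protocol-http plugin in the Nutch source code fetches and downloads web pages. protocol-http consists of just two classes, HttpResponse and Http. HttpResponse sends the request to the web server and reads the response, which is how the page gets downloaded. Http is very simple and is essentially a Facade over HttpResponse: it sets up the configuration and then creates an HttpResponse. Callers apparently only need to deal with the Http class (I have not read all of the code, so this is a guess).
Let's look at the HttpResponse class.
Reading its source should start from the constructor
public HttpResponse(HttpBase http, URL url, CrawlDatum datum) throws ProtocolException, IOException
It first checks that the protocol is http: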

if (!"http".equals(url.getProtocol()))
      throw new HttpException("Not an HTTP url:" + url);

Next it gets the path: if url.getFile() is empty the path is "/", otherwise it is url.getFile():

    String path = "".equals(url.getFile()) ? "/" : url.getFile();

Then the host name and port number are taken from the URL. If the URL has no port, the port defaults to 80 and the request address carries no port (portString = ""). Otherwise the port is read from the URL and portString becomes ":" + port:

String host = url.getHost();
    int port;
    String portString;
    if (url.getPort() == -1) {
      port= 80;
      portString= "";
    } else {
      port= url.getPort();
      portString= ":" + port;
}
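
The defaulting rules above can be tried out in isolation. The following is only a sketch, not Nutch code: the class name UrlParts and the method split are made up for illustration, and the sketch relies on nothing beyond java.net.URI and java.net.URL.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper for illustration; not part of Nutch.
class UrlParts {
    // Returns {host, port, portString, path}, defaulting the same way
    // as the constructor code discussed above.
    static String[] split(String spec) {
        URL url;
        try {
            url = URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
        String path = "".equals(url.getFile()) ? "/" : url.getFile();
        boolean noPort = url.getPort() == -1;   // -1 means the URL names no port
        int port = noPort ? 80 : url.getPort();
        String portString = noPort ? "" : ":" + port;
        return new String[] { url.getHost(), String.valueOf(port), portString, path };
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", split("http://example.com")));
        System.out.println(String.join(" | ", split("http://example.com:8080/a/b?q=1")));
    }
}
```

Note that url.getFile() includes the query string, so the path sent to the server keeps any "?q=1" part.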

Then the socket is created and its read timeout (SO_TIMEOUT) is set. The connect timeout is passed separately to socket.connect() further down:

    socket = new Socket();                    // create the socket
    socket.setSoTimeout(http.getTimeout());

Depending on whether a proxy is used, sockHost and sockPort are set either to the proxy or to the target host:

String sockHost = http.useProxy() ? http.getProxyHost() : host;
int sockPort = http.useProxy() ? http.getProxyPort() : port;

An InetSocketAddress is created and the connection is established:

InetSocketAddress sockAddr= new InetSocketAddress(sockHost, sockPort);
socket.connect(sockAddr, http.getTimeout());
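
The two timeouts used so far do different jobs: the value given to connect() bounds connection setup, while setSoTimeout() bounds each blocking read. A small sketch, assuming loopback networking is available, makes the read timeout visible. TimeoutDemo and readTimesOut are hypothetical names, and the local ServerSocket merely stands in for a web server that never answers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; not part of Nutch.
class TimeoutDemo {
    // Connects to a local server that never sends data and reports
    // whether the blocking read gave up with a timeout.
    static boolean readTimesOut(int timeoutMillis) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket socket = new Socket()) {
            socket.setSoTimeout(timeoutMillis);              // read timeout
            socket.connect(new InetSocketAddress(loopback, server.getLocalPort()),
                           timeoutMillis);                   // connect timeout
            try {
                socket.getInputStream().read();              // blocks: nothing is ever written
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```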

Get the socket's output stream, which is used to send the request:

// make request
      OutputStream req = socket.getOutputStream();

The following code sends the GET request to the server. When a proxy is used the request line carries the full URL, otherwise only the path:

StringBuffer reqStr = new StringBuffer("GET ");
      if (http.useProxy()) {
         reqStr.append(url.getProtocol()+"://"+host+portString+path);
      } else {
         reqStr.append(path);
      }

      reqStr.append(" HTTP/1.0\r\n");
      reqStr.append("Host: ");
      reqStr.append(host);
      reqStr.append(portString);
      reqStr.append("\r\n");
      reqStr.append("Accept-Encoding: x-gzip, gzip\r\n");
      String userAgent = http.getUserAgent();
      if ((userAgent == null) || (userAgent.length() == 0)) {
        if (Http.LOG.isFatalEnabled()) { Http.LOG.fatal("User-agent is not set!"); }
      } else {
        reqStr.append("User-Agent: ");
        reqStr.append(userAgent);
        reqStr.append("\r\n");
      }
      reqStr.append("\r\n");
      byte[] reqBytes= reqStr.toString().getBytes();
      req.write(reqBytes);
      req.flush();
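
To see what actually goes over the wire, the request text can be assembled without any socket. This is a simplified sketch of the same steps, not the Nutch code itself: RequestBuilder and build are illustrative names, and the scheme is hard-coded to http since the constructor has already rejected everything else.

```java
// Hypothetical helper for illustration; not part of Nutch.
class RequestBuilder {
    // Builds the raw HTTP/1.0 GET request text.
    static String build(String host, String portString, String path,
                        boolean useProxy, String userAgent) {
        StringBuilder req = new StringBuilder("GET ");
        // Through a proxy the request line carries the absolute URL,
        // otherwise just the path.
        req.append(useProxy ? "http://" + host + portString + path : path);
        req.append(" HTTP/1.0\r\n");
        req.append("Host: ").append(host).append(portString).append("\r\n");
        req.append("Accept-Encoding: x-gzip, gzip\r\n");
        if (userAgent != null && !userAgent.isEmpty()) {
            req.append("User-Agent: ").append(userAgent).append("\r\n");
        }
        req.append("\r\n");   // the blank line ends the header section
        return req.toString();
    }

    public static void main(String[] args) {
        System.out.print(build("example.com", "", "/index.html", false, "MyCrawler/0.1"));
    }
}
```

The user agent "MyCrawler/0.1" is a placeholder. In Nutch the value comes from http.getUserAgent().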

Next the response is processed. The socket's input stream is wrapped in a PushbackInputStream for convenience, so that bytes read too far can be pushed back:

PushbackInputStream in =                  // process response
        new PushbackInputStream(
          new BufferedInputStream(socket.getInputStream(), Http.BUFFER_SIZE), 
          Http.BUFFER_SIZE) ;

Extract the status code and the HTTP headers of the response. A 100 (Continue) response is skipped, and the loop repeats until a final status arrives:

boolean haveSeenNonContinueStatus= false;
      while (!haveSeenNonContinueStatus) {
        // parse status code line
        this.code = parseStatusLine(in, line); 
        // parse headers
        parseHeaders(in, line);
        haveSeenNonContinueStatus= code != 100; // 100 is "Continue"
      }

Then read the content:

readPlainContent(in);

Get the content encoding. If the content is gzip-compressed, decompress it:

String contentEncoding = getHeader(Response.CONTENT_ENCODING);
      if ("gzip".equals(contentEncoding) || "x-gzip".equals(contentEncoding)) {
        content = http.processGzipEncoded(content, url);
      } else {
        if (Http.LOG.isTraceEnabled()) {
          Http.LOG.trace("fetched " + content.length + " bytes from " + url);
        }
      }
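
The body of http.processGzipEncoded is not shown in this article. As a rough idea of what decoding a gzip body involves, here is a sketch based on java.util.zip. GzipDemo and its methods are hypothetical and leave out whatever size limits and error handling Nutch applies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in; not the Nutch implementation.
class GzipDemo {
    // Compresses bytes, only needed here to produce test input.
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompresses a complete gzip stream held in memory.
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            for (int n = gz.read(buf); n != -1; n = gz.read(buf)) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Mirrors the Content-Encoding check in the constructor.
    static byte[] decode(byte[] body, String contentEncoding) {
        if ("gzip".equals(contentEncoding) || "x-gzip".equals(contentEncoding)) {
            return gunzip(body);
        }
        return body;
    }
}
```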

That completes the whole process.
Now let's look at parseStatusLine, parseHeaders, readPlainContent and readChunkedContent.
private int parseStatusLine(PushbackInputStream in, StringBuffer line)
throws IOException, HttpException
This function extracts the status of the response, for example the status code in "200 OK".
The status line of a response usually looks like "HTTP/1.1 200" or "HTTP/1.1 200 OK" (for a successful response). The status code sits between the first space and the second:

int codeStart = line.indexOf(" ");
int codeEnd = line.indexOf(" ", codeStart+1);

In the first case (no reason phrase, so there is no second space):

if (codeEnd == -1) 
      codeEnd = line.length();

the status code (200) ends at line.length().
Otherwise the status code (200) ends at line.indexOf(" ", codeStart+1).
The status code is then extracted:

int code;
    try {
      code= Integer.parseInt(line.substring(codeStart+1, codeEnd));
    } catch (NumberFormatException e) {
      throw new HttpException("bad status line '" + line 
                              + "': " + e.getMessage(), e);
}
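
Put together, the status-code extraction fits in one small method. The sketch below works on a String instead of the stream, and throws IllegalArgumentException where Nutch throws its own HttpException. StatusLine and parseCode are illustrative names.

```java
// Hypothetical helper for illustration; not part of Nutch.
class StatusLine {
    // Extracts the numeric status code from a line such as
    // "HTTP/1.1 200" or "HTTP/1.1 404 Not Found".
    static int parseCode(String line) {
        int codeStart = line.indexOf(' ');
        int codeEnd = line.indexOf(' ', codeStart + 1);
        if (codeEnd == -1) {
            codeEnd = line.length();   // no reason phrase after the code
        }
        try {
            return Integer.parseInt(line.substring(codeStart + 1, codeEnd));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("bad status line '" + line + "'", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseCode("HTTP/1.1 200"));
        System.out.println(parseCode("HTTP/1.1 404 Not Found"));
    }
}
```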

Next, look at

    private void parseHeaders(PushbackInputStream in, StringBuffer line)
        throws IOException, HttpException

This function adds the response headers to the headers Metadata structure that has already been created.
A loop reads the headers.
In an HTTP response the header section and the body are normally separated by a blank line. For a blank line readLine returns 0 as the number of characters read (the readLine implementation is covered after this function):

    while (readLine(in, line, true) != 0)

If the blank line is missing, the body follows the headers directly. The body usually starts with <!DOCTYPE, <HTML or <html, so if a line contains one of these the header section has ended:

      // handle HTTP responses with missing blank line after headers
      int pos;
      if ( ((pos= line.indexOf("<!DOCTYPE")) != -1) 
           || ((pos= line.indexOf("<HTML")) != -1) 
           || ((pos= line.indexOf("<html")) != -1) )

The part that was read too far is then pushed back into the stream, and the line is truncated to length pos:

       in.unread(line.substring(pos).getBytes("UTF-8"));
        line.setLength(pos);

The truncated line is handed to processHeaderLine(line), after which the method returns. Lines that contain no such marker go straight to processHeaderLine(line):

        try {
            //TODO: (CM) We don't know the header names here
            //since we're just handling them generically. It would
            //be nice to provide some sort of mapping function here
            //for the returned header names to the standard metadata
            //names in the ParseData class
          processHeaderLine(line);
       } catch (Exception e) {
          // fixme:
          e.printStackTrace(LogUtil.getErrorStream(Http.LOG));
        }
        return;
      }
      processHeaderLine(line);
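
The push-back step is easier to follow on a concrete input. In the sketch below, a header line that already contains the start of the body is split at "<html", and the surplus bytes are returned to the stream so the next reader sees them first. UnreadDemo and split are hypothetical names. The push-back buffer size of 4096 is an arbitrary choice for this demo, and it must be at least as large as the data pushed back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Nutch.
class UnreadDemo {
    // lineRead is a line already taken from the stream, restOfStream is
    // what is still unread. Returns {header part, body as the next reader sees it}.
    static String[] split(String lineRead, String restOfStream) {
        try {
            PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(restOfStream.getBytes(StandardCharsets.UTF_8)), 4096);
            StringBuilder line = new StringBuilder(lineRead);
            int pos = line.indexOf("<html");
            if (pos != -1) {
                // return the over-read bytes to the front of the stream
                in.unread(line.substring(pos).getBytes(StandardCharsets.UTF_8));
                line.setLength(pos);
            }
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            for (int b = in.read(); b != -1; b = in.read()) {
                body.write(b);
            }
            return new String[] { line.toString(),
                                  new String(body.toByteArray(), StandardCharsets.UTF_8) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The line terminator that readLine consumed is not restored, so the pushed-back text joins directly onto the rest of the body.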

Now let's see how a single header line is processed:
private void processHeaderLine(StringBuffer line)
throws IOException, HttpException
Response headers typically look like this:
Cache-Control: private
Date: Fri, 14 Dec 2007 15:32:06 GMT
Content-Length: 7602
Content-Type: text/html
Server: Microsoft-IIS/6.0

With that in mind the following code is easy to follow:

int colonIndex = line.indexOf(":");       // key is up to colon

If there is no ":" and the line is not blank, an HttpException is thrown. A line that contains only whitespace is silently ignored:

    if (colonIndex == -1) {
      int i;
      for (i= 0; i < line.length(); i++)
        if (!Character.isWhitespace(line.charAt(i)))
          break;
      if (i == line.length())
        return;
      throw new HttpException("No colon in header:" + line);
}

Otherwise the key-value pair can be extracted:
the key is the part from 0 to colonIndex, and the value is what follows the colon once leading spaces and tabs have been skipped.
Finally the pair is stored in headers:

    String key = line.substring(0, colonIndex);

    int valueStart = colonIndex+1;            // skip whitespace
    while (valueStart < line.length()) {
      int c = line.charAt(valueStart);
      if (c != ' ' && c != '\t')
       break;
      valueStart++;
    }
    String value = line.substring(valueStart);
    headers.set(key, value);
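
The same parsing logic can be exercised on its own. This sketch returns the pair as a Map.Entry instead of writing into Nutch's Metadata, and uses IllegalArgumentException in place of HttpException. HeaderLine and parse are illustrative names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper for illustration; not part of Nutch.
class HeaderLine {
    // Returns the key/value pair of a header line, or null for a blank line.
    static Map.Entry<String, String> parse(String line) {
        int colonIndex = line.indexOf(':');       // key is up to the first colon
        if (colonIndex == -1) {
            if (line.chars().allMatch(Character::isWhitespace)) {
                return null;                      // blank line: nothing to store
            }
            throw new IllegalArgumentException("No colon in header:" + line);
        }
        String key = line.substring(0, colonIndex);
        int valueStart = colonIndex + 1;          // skip spaces and tabs after the colon
        while (valueStart < line.length()) {
            char c = line.charAt(valueStart);
            if (c != ' ' && c != '\t') {
                break;
            }
            valueStart++;
        }
        return new SimpleEntry<>(key, line.substring(valueStart));
    }
}
```

Because only the first colon is used as the separator, a value such as "Fri, 14 Dec 2007 15:32:06 GMT" keeps its own colons.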

Now let's look at a frequently used helper function:
private static int readLine(PushbackInputStream in, StringBuffer line,
boolean allowContinuedLine) throws IOException

The implementation:
It first sets the length of line to 0 and then reads one character at a time until the end of the stream (c == -1). For each character c:
If it is \r and the next character is \n, the \n is consumed as well, and control falls through to the \n case. If it is \n (end of line), line.length() > 0 (the line is not empty) and continued lines are allowed, the next character is peeked: a ' ' or \t means the line continues, so that character is consumed and reading goes on. Otherwise the line is complete and the number of characters read is returned. Any other character is simply appended to line. If the stream ends before a line terminator is seen, an EOFException is thrown:

    line.setLength(0);
    for (int c = in.read(); c != -1; c = in.read()) {
      switch (c) {
        case '\r':
          if (peek(in) == '\n') {
            in.read();
          }
        case '\n': 
          if (line.length() > 0) {
            // at EOL -- check for continued line if the current
            // (possibly continued) line wasn't blank
            if (allowContinuedLine) 
              switch (peek(in)) {
                case ' ' : case '\t':                   // line is continued
                  in.read();
                  continue;
              }
          }
          return line.length();      // else complete
        default :
          line.append((char)c);
      }
    }
    throw new EOFException();
  }
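
The peek helper used above is not shown in the article. A self-contained version of the loop, with peek assumed to be a read followed by an unread, shows how a continued header line is joined. LineReader and readHeaderLines are hypothetical names, and StringBuilder replaces StringBuffer.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical re-creation for illustration; not the Nutch source.
class LineReader {
    // Assumed implementation of peek: read one byte and push it back.
    static int peek(PushbackInputStream in) throws IOException {
        int c = in.read();
        if (c != -1) {
            in.unread(c);
        }
        return c;
    }

    static int readLine(PushbackInputStream in, StringBuilder line,
                        boolean allowContinuedLine) throws IOException {
        line.setLength(0);
        for (int c = in.read(); c != -1; c = in.read()) {
            switch (c) {
                case '\r':
                    if (peek(in) == '\n') {
                        in.read();               // swallow the \n of a \r\n pair
                    }
                    // fall through: \r alone also ends the line
                case '\n':
                    if (line.length() > 0 && allowContinuedLine) {
                        int next = peek(in);
                        if (next == ' ' || next == '\t') {
                            in.read();           // line is continued
                            continue;
                        }
                    }
                    return line.length();
                default:
                    line.append((char) c);
            }
        }
        throw new EOFException();
    }

    // Reads header lines up to, but not including, the blank line.
    static List<String> readHeaderLines(String raw) {
        PushbackInputStream in = new PushbackInputStream(
            new ByteArrayInputStream(raw.getBytes(StandardCharsets.ISO_8859_1)));
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        try {
            while (readLine(in, line, true) != 0) {
                lines.add(line.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Only the single space or tab that marks the continuation is dropped, so "B: x" followed by a line "\ty" becomes "B: xy".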

Next, how the content is read, that is, the implementation of
private void readPlainContent(InputStream in)
throws HttpException, IOException
First the content length is taken from headers (the headers were read into the metadata earlier). If there is no Content-Length header it stays at Integer.MAX_VALUE:

int contentLength = Integer.MAX_VALUE;    // get content length
    String contentLengthString = headers.get(Response.CONTENT_LENGTH);
    if (contentLengthString != null) {
      contentLengthString = contentLengthString.trim();
      try {
        contentLength = Integer.parseInt(contentLengthString);
      } catch (NumberFormatException e) {
       throw new HttpException("bad content length: "+contentLengthString);
      }
}

If it is larger than http.getMaxContent() (set by http.content.limit in the configuration file),
it is capped at maxContent and the download stops at roughly that length. Because the check happens after each buffer has been written, the stored content can exceed the limit by less than one buffer (Http.BUFFER_SIZE bytes):

    if (http.getMaxContent() >= 0
     && contentLength > http.getMaxContent())   // limit download size
      contentLength  = http.getMaxContent();

    ByteArrayOutputStream out = new ByteArrayOutputStream(Http.BUFFER_SIZE);
    byte[] bytes = new byte[Http.BUFFER_SIZE];
    int length = 0;                           // read content
    for (int i = in.read(bytes); i != -1; i = in.read(bytes)) {
      out.write(bytes, 0, i);
      length += i;
      if (length >= contentLength)
        break;
    }
    content = out.toByteArray();
  }
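
For comparison, a reader that never stores more than the limit only has to bound each read by the number of bytes still allowed. This is a variant written for illustration, not what the Nutch code above does. BoundedRead and its parameters are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical variant for illustration; not part of Nutch.
class BoundedRead {
    // Reads at most limit bytes from in, using a buffer of bufferSize bytes.
    static byte[] read(InputStream in, int limit, int bufferSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(bufferSize);
        byte[] buf = new byte[bufferSize];
        int remaining = limit;
        try {
            while (remaining > 0) {
                // never ask for more than is still allowed
                int n = in.read(buf, 0, Math.min(buf.length, remaining));
                if (n == -1) {
                    break;                       // end of stream before the limit
                }
                out.write(buf, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```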

That's all for today.
