An Analysis of HTTP Handling in WebSphinx, JoBo, and Other Crawlers

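I have read the code of several open-source web crawlers, but my attention went mostly to the design patterns used in the details, and I felt I did not have a good grasp of the systems as a whole. So I decided to spend some time analyzing how each crawler handles HTTP.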

The first question is how to fetch a remote web page. In Java, the following simple approaches are available:

1) Write the code yourself and fetch the page through a URL. The code is as follows:

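    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Fetches the page at urlString and returns its text, or null on any failure.
    public static String getHtml(String urlString)
    {
        try
        {
            StringBuilder html = new StringBuilder();
            URL url = new URL(urlString);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // try-with-resources closes the reader and the underlying stream, even on failure.
            // Note: without an explicit charset, InputStreamReader uses the platform default.
            try (BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream())))
            {
                String temp;
                while ((temp = br.readLine()) != null)
                {
                    html.append(temp).append("\n");
                }
            }
            return html.toString();
        } catch (Exception e)
        {
            e.printStackTrace();
            return null;
        }
    }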
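One weakness of the hand-written approach is that the reader decodes the page with the platform default charset, which can garble pages served in another encoding. The following is a minimal sketch, assuming the server declares the encoding in its Content-Type header; the class name `CharsetSniffer` and the method `fromContentType` are hypothetical names invented for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSniffer {
    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // header value such as "text/html; charset=GBK". Returns the fallback
    // when the header is missing, has no charset, or names an unknown charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }
}
```

In `getHtml` this could be used as `new InputStreamReader(conn.getInputStream(), CharsetSniffer.fromContentType(conn.getContentType(), StandardCharsets.UTF_8))`. Pages that declare their encoding only in an HTML meta tag are not covered by this sketch.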

2) Use the HttpClient package (compared with the code above, HttpClient is more flexible, more configurable, and easier to use):

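Within an extensible OO framework, HttpClient implements all the HTTP methods (GET, POST, PUT, DELETE, HEAD, OPTIONS, and TRACE), supports HTTPS (HTTP over SSL) encryption, connects transparently through HTTP proxies, and can tunnel HTTPS connections through an HTTP proxy by using the CONNECT method.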

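The basic steps for programming with HttpClient:
1: Create an instance of HttpClient.
2: Create an instance of one of the methods (DeleteMethod, EntityEnclosingMethod, ExpectContinueMethod, GetMethod, HeadMethod, MultipartPostMethod, OptionsMethod, PostMethod, PutMethod, TraceMethod), usually passing the target URL as a parameter.
3: Tell HttpClient to execute the method.
4: Read the response.
5: Release the connection.
6: Process the response.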

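    // The classes below come from org.apache.commons.httpclient (HttpClient 3.x),
    // plus java.io.IOException and java.io.InputStream. `url` is the target URL string.

    /* 1. Create the HttpClient object and set its parameters */
    HttpClient httpClient = new HttpClient();
    // Set the HTTP connection timeout to 5 seconds
    httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);

    /* 2. Create the GetMethod object and set its parameters */
    GetMethod getMethod = new GetMethod(url);
    // Set the GET request (socket read) timeout to 5 seconds
    getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
    // Set the retry handler; this uses the default handler, which retries up to three times
    getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler());

    /* 3. Execute the HTTP GET request */
    try
    {
        int statusCode = httpClient.executeMethod(getMethod);

        /* 4. Check the status code */
        if (statusCode != HttpStatus.SC_OK)
        {
            System.err.println("Method failed: " + getMethod.getStatusLine());
        }

        /* 5. Process the HTTP response */
        Header[] headers = getMethod.getResponseHeaders();
        for (Header h : headers)
            System.out.println(h.getName() + " " + h.getValue());

        // Read the HTTP response body; use one of the two forms below.
        // As a byte array:
        byte[] responseBody = getMethod.getResponseBody();
        // As an InputStream, recommended when the page content is large:
        InputStream response = getMethod.getResponseBodyAsStream();
    }
    catch (HttpException e)
    {
        // A fatal exception: the protocol may be wrong or the returned content may be malformed
        System.out.println("Please check your provided http address!");
        e.printStackTrace();
    }
    catch (IOException e)
    {
        // A network (transport) error occurred
        e.printStackTrace();
    }
    finally
    {
        /* 6. Release the connection */
        getMethod.releaseConnection();
    }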
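The snippet above recommends reading the body as an InputStream when pages are large. For a crawler it is often also useful to cap how much of a page is read at all. The following is a minimal sketch using only the JDK, so it works with any InputStream regardless of the HTTP library; `BoundedReader`, `readUpTo`, and the 4096-byte buffer size are hypothetical choices made for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class BoundedReader {
    // Reads at most maxBytes from the stream in fixed-size chunks and
    // returns what was read. The caller remains responsible for closing
    // the stream (or releasing the connection it belongs to).
    static byte[] readUpTo(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int remaining = maxBytes;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                break; // end of stream reached before the cap
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

With the code above, the stream returned by `getMethod.getResponseBodyAsStream()` could be passed to `BoundedReader.readUpTo(response, limit)` before the connection is released in the `finally` block.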

3) Now let's look at how JoBo handles this:

The JoBo crawler has an HttpTool class, which is essentially responsible for retrieving and handling pages over the HTTP protocol:

Purpose of the HttpTool class: "Class for retrieving documents from HTTP servers. The main purpose of this class is to retrieve a document from an HTTP server. For many purposes the Java URLInputStream is good for this, but if you want to have full control over the HTTP headers (both request and response headers), HttpTool is the answer."
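To make "full control over the HTTP headers" concrete, here is a minimal sketch of what a request looks like when the client writes every header itself. This is not JoBo's code; `RawHttpRequest` and `buildGet` are hypothetical names, and the sketch only builds the request text that such a client would write to a socket.

```java
import java.net.URL;
import java.util.Map;

class RawHttpRequest {
    // Builds the text of an HTTP/1.1 GET request. Apart from Host and
    // Connection, every header line is chosen by the caller.
    static String buildGet(URL url, Map<String, String> headers) {
        String path = url.getFile(); // path plus query string, may be empty
        if (!path.startsWith("/")) {
            path = "/" + path;
        }
        String host = url.getPort() == -1
                ? url.getHost()
                : url.getHost() + ":" + url.getPort();
        StringBuilder sb = new StringBuilder();
        sb.append("GET ").append(path).append(" HTTP/1.1\r\n");
        sb.append("Host: ").append(host).append("\r\n");
        for (Map.Entry<String, String> h : headers.entrySet()) {
            sb.append(h.getKey()).append(": ").append(h.getValue()).append("\r\n");
        }
        // Ask the server to close the connection after the response,
        // then end the header section with an empty line.
        sb.append("Connection: close\r\n\r\n");
        return sb.toString();
    }
}
```

A crawler working at this level can set headers such as User-Agent or Authorization exactly as it wants, at the cost of parsing the status line and response headers itself.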

    public HttpDoc retrieveDocument(URL u, int method, String parameters) throws HttpException
    {
        DocAndConnection docAndConnection = retrieveDocumentInternal(u, method, parameters, null, null);
        HttpDoc doc = docAndConnection != null ? docAndConnection.httpDoc : null;
        if (doc != null && doc.getHttpCode() == 401)
        {
            String authProtName = NTLMAuthorization.WWW_AUTHENTICATE_HEADER;
            String authProtValue = doc.getHeaderValue(authProtName);
            if (authProtValue == null)
            {
                authProtName = NTLMAuthorization.PROXY_AUTHENTICATE_HEADER;
                authProtValue = doc.getHeaderValue(authProtName);