Using the Lobo Cobra Toolkit to Retrieve the HTML of Rendered Pages


Cobra Parsing: Disable Persistent Connections and Set Socket Timeouts

I use the Cobra Toolkit to parse web pages for various projects. In a project that runs eight concurrent parsers, I found that some of the parsers would hang indefinitely in a socket read during JavaScript processing. I think, but have not confirmed, that most of the hung sockets are related to persistent connections to Google's Safe Browsing/YouTube servers in the 1e100.net domain. So I wanted a way to disable persistent connections in Cobra. Because a server may also simply never respond, URLConnection read timeouts would be a useful option, too.
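To illustrate why a read timeout helps with this kind of hang, here is a minimal sketch that is independent of Cobra. The class and method names (ReadTimeoutDemo, readTimesOut) are hypothetical and not part of the code presented on this page. It points a URLConnection at a local socket that never answers; with a read timeout set, the read should fail with a SocketTimeoutException instead of blocking forever.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical demo class, not part of the Cobra Toolkit.
final class ReadTimeoutDemo {

    /** Returns true if a read from a server that never replies times out. */
    static boolean readTimesOut(int timeoutMillis) {
        // Bind to an ephemeral loopback port. The OS completes the TCP
        // handshake from the listen backlog, but nothing ever replies.
        try (ServerSocket silent =
                 new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"))) {
            URI uri = URI.create("http://127.0.0.1:" + silent.getLocalPort() + "/");
            URLConnection connection = uri.toURL().openConnection(Proxy.NO_PROXY);
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            try (InputStream in = connection.getInputStream()) {
                in.read();
                return false; // the server answered, which is not expected here
            } catch (SocketTimeoutException expected) {
                return true;  // the read gave up after timeoutMillis
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + readTimesOut(500));
    }
}
```

If the two timeout calls are removed, the same read has no deadline at all, which matches the hang I saw in the parsers.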

I created some simple code that causes URLConnection objects created by Cobra to be configured with a timeout and to optionally disable persistent connections.

The Cobra DOM parser requires an org.lobobrowser.html.UserAgentContext object. A sample UserAgentContext object is provided in org.lobobrowser.html.test.SimpleUserAgentContext. This class has a createHttpRequest() method that returns an org.lobobrowser.html.test.SimpleHttpRequest object whenever the parser needs to open a socket connection. In one of the several open() methods, SimpleHttpRequest creates a URLConnection object. The timeout and persistence settings need to be applied to the URLConnection.
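The settings in question are plain java.net.URLConnection calls. As a rough sketch outside of Cobra, with a hypothetical helper class and method (ConnectionSettingsSketch, configure) and a placeholder URL, they look like this:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper, shown only to illustrate the URLConnection calls.
final class ConnectionSettingsSketch {

    /** Creates a URLConnection (without connecting) and applies the settings. */
    static URLConnection configure(String url, int timeoutMillis, boolean persistent) {
        try {
            // openConnection() only creates the object; no socket is opened yet.
            URLConnection connection = URI.create(url).toURL().openConnection();
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            if (!persistent) {
                // Ask the server to close the connection after the response.
                connection.setRequestProperty("Connection", "close");
            }
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection c = configure("http://example.com/", 60_000, false);
        System.out.println(c.getReadTimeout() + " ms, Connection: "
                + c.getRequestProperty("Connection"));
    }
}
```

The JDK also documents an http.keepAlive system property that turns persistent connections off for the whole JVM. The per-connection header used here is narrower in scope, which is what I wanted for Cobra.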

I created two simple classes in the package com.benjysbrain.CobraExtension:

  • CobraUserAgentContext extends SimpleUserAgentContext.
  • CobraHttpRequest extends SimpleHttpRequest.

CobraUserAgentContext

The CobraUserAgentContext constructor takes two parameters:

  • int timeout - the timeout in milliseconds used on the URLConnection setReadTimeout() and setConnectTimeout() methods
  • boolean persistent - false to disable persistent connections

The createHttpRequest() method returns a CobraHttpRequest object that has been configured with the timeout and persistence values.

CobraHttpRequest

The new constructor for this class takes the UserAgentContext and proxy parameters of the parent class and adds the timeout and persistent settings described for CobraUserAgentContext. The five-parameter open() method is overridden. When it is called, the parent's five-parameter open() method creates a URLConnection, and then the timeout and persistence settings are applied to that URLConnection.

Using the code

Substitute CobraUserAgentContext for SimpleUserAgentContext in your programs. Use the constructor that allows you to set timeouts and persistence values, and pass the object to org.lobobrowser.html.parser.DocumentBuilderImpl. Whenever the parser needs a new URL connection, it will use the CobraHttpRequest object, which sets the timeouts and persistence settings.

To compile the code listed below, you will need Java 1.5 or later, as the timeout methods of URLConnection are not in earlier versions. Create the com.benjysbrain.CobraExtension directory structure, put the source in the leaf directory, add the Cobra Toolkit jar to your classpath, and compile the source. I recommend setting a timeout value of about one minute, but you might want to increase this depending on the responsiveness of the servers from which pages are parsed.
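Since the timeout is passed as an int in milliseconds, it is easy to get the magnitude wrong. The following small sketch (the class and method names are hypothetical, and it uses java.util.concurrent.TimeUnit rather than anything from Cobra) converts a readable duration to milliseconds. It rejects zero because URLConnection treats a timeout of zero as infinite, and negative values because setReadTimeout() and setConnectTimeout() throw IllegalArgumentException for them.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for computing timeout arguments.
final class TimeoutValues {

    /** Converts a duration to a positive int millisecond timeout. */
    static int toTimeoutMillis(long duration, TimeUnit unit) {
        long millis = unit.toMillis(duration);
        // 0 would mean "wait forever"; negative values are rejected by URLConnection.
        if (millis <= 0 || millis > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("timeout out of range: " + millis + " ms");
        }
        return (int) millis;
    }

    public static void main(String[] args) {
        // About one minute, as recommended above.
        System.out.println(toTimeoutMillis(1, TimeUnit.MINUTES));
    }
}
```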

 

    package com.benjysbrain.CobraExtension;

    import org.lobobrowser.html.test.*;
    import org.lobobrowser.html.*;

    /**
     * CobraUserAgentContext is a subclass of
     * org.lobobrowser.html.test.SimpleUserAgentContext that overrides the
     * createHttpRequest() method to provide an HttpRequest object with a
     * URLConnection object with timeouts and other properties. In addition
     * to the new createHttpRequest() method, a new constructor has been added.
     * <p>
     * The Cobra Toolkit (http://lobobrowser.org/cobra.jsp) is part of the
     * Lobo Project.
     * <p>
     * Java 1.5 or later required.
     * <p>
     * Copyright 2010 by Ben E. Cline. This source code is provided for
     * educational purposes "as is" with no warranty. If you use the code,
     * please acknowledge the author.
     * <p>
     * http://www.benjysbrain.com
     *
     * @author Benjy Cline
     */
    public class CobraUserAgentContext extends SimpleUserAgentContext {

        /** The read timeout and connection timeout in milliseconds. */
        int timeout;

        /**
         * If false, HttpRequest objects have the "Connection: close"
         * property set to discourage persistent connections.
         */
        boolean persistent;

        /**
         * Create a CobraUserAgentContext object where createHttpRequest()
         * returns an HttpRequest object with a URLConnection object with
         * the specified timeout and persistence setting.
         */
        public CobraUserAgentContext(int timeout, boolean persistent) {
            super();
            this.timeout = timeout;
            this.persistent = persistent;
        }

        /**
         * Create an HttpRequest object, used to load images, scripts, etc.,
         * with timeout and persistence values.
         */
        public HttpRequest createHttpRequest() {
            return new CobraHttpRequest(this, this.getProxy(), timeout, persistent);
        }
    }

 

    package com.benjysbrain.CobraExtension;

    import org.lobobrowser.html.test.*;
    import org.lobobrowser.html.*;
    import java.io.*;

    /**
     * CobraHttpRequest is a subclass of
     * org.lobobrowser.html.test.SimpleHttpRequest. It adds a constructor and
     * a modified version of the open() method. If the new constructor is
     * used, a timeout and persistent state are used during open() calls to
     * configure the URLConnection object. See
     * com.benjysbrain.CobraExtension.CobraUserAgentContext.
     * <p>
     * The Cobra Toolkit (http://lobobrowser.org/cobra.jsp) is part of the
     * Lobo Project.
     * <p>
     * Java 1.5 or later required.
     * <p>
     * Copyright 2010 by Ben E. Cline. This source code is provided for
     * educational purposes "as is" with no warranty. If you use the code,
     * please acknowledge the author.
     * <p>
     * http://www.benjysbrain.com
     *
     * @author Benjy Cline
     */
    public class CobraHttpRequest extends SimpleHttpRequest {

        /** The read timeout and connection timeout in milliseconds. */
        int timeout = 1000 * 60 * 30;

        /**
         * If false, HttpRequest objects have the "Connection: close"
         * property set to discourage persistent connections.
         */
        boolean persistent = false;

        /**
         * Create an HttpRequest object whose open() methods create
         * URLConnection objects with timeout and persistence values.
         */
        public CobraHttpRequest(UserAgentContext context, java.net.Proxy proxy,
                                int timeout, boolean persistent) {
            super(context, proxy);
            this.timeout = timeout;
            this.persistent = persistent;
        }

        /**
         * Override the primary open() method so that the URLConnection
         * object can be configured.
         */
        public void open(final String method, final java.net.URL url,
                         boolean asyncFlag, final String userName,
                         final String password) throws java.io.IOException {
            // The parent creates the URLConnection and stores it in the
            // inherited connection field.
            super.open(method, url, asyncFlag, userName, password);
            connection.setReadTimeout(timeout);
            connection.setConnectTimeout(timeout);
            if (!persistent) {
                connection.setRequestProperty("Connection", "close");
            }
        }
    }

These classes are not particularly general, but they can serve as a model for more elegant code. If you have questions or comments or if you discover errors in this page or the code, please let me know at the e-mail address in the footer of this page.


This page © copyright 2010 by Ben E. Cline.  E-Mail:  

 

 

 

 
