Java web page parsing tools and toolkits

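Jsoup is a very good library for parsing web pages. It is written in Java and offers DOM-like traversal and CSS selectors for finding and extracting content from a document.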

Some notes on using it follow.

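Today I worked on a project that uses Jsoup to parse a website. When connecting to one site with Jsoup.connect(url).get(), the following exception appeared from time to time:

java.net.SocketTimeoutException: Read timed out

The cause is that the default socket timeout is fairly short, while some sites respond slowly, so the read gives up before the response has arrived.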
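The mechanism can be reproduced without jsoup or a real website. The following is a minimal sketch using only the JDK: a local server accepts the connection but never sends anything, and the client read gives up after its timeout. The class name ReadTimeoutDemo, the method readTimesOut, and the 200 ms value are illustrative assumptions, not part of the original project.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ReadTimeoutDemo {

    /**
     * Connects to a local server that accepts the connection but never sends
     * a byte, and reports whether the read gave up with a timeout.
     */
    static boolean readTimesOut(int timeoutMillis) {
        // Port 0 lets the OS pick a free port; the listen backlog holds our connection.
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(), silentServer.getLocalPort())) {
            client.setSoTimeout(timeoutMillis); // read timeout, in milliseconds
            try {
                client.getInputStream().read(); // blocks: the server never writes
                return false;
            } catch (SocketTimeoutException e) {
                return true; // same exception type as in the jsoup case
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Read timed out: " + readTimesOut(200));
    }
}
```

A slow website behaves like the silent server here, only less extreme: if the response takes longer than the configured read timeout, the client sees a SocketTimeoutException.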

Solution:

Set a timeout explicitly when making the connection.

doc = Jsoup.connect(url).timeout(5000).get();

The argument is in milliseconds, so 5000 sets the timeout to 5 s.

Test code follows.

1. Without a timeout:

package jsoupTest;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupTest {

    public static void main(String[] args) throws IOException {
        String url = "http://www.weather.com.cn/weather/101010400.shtml";
        long start = System.currentTimeMillis();
        Document doc = null;
        try {
            // No timeout() call, so jsoup's default timeout applies.
            doc = Jsoup.connect(url).get();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.out.println("Time is:" + (System.currentTimeMillis() - start) + "ms");
        }
        // doc is still null if get() failed, so this line then throws a NullPointerException.
        Elements elem = doc.getElementsByTag("Title");
        System.out.println("Title is:" + elem.text());
    }
}

This sometimes times out:

java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at sun.net.www.http.ChunkedInputStream.fastRead(Unknown Source)
    at sun.net.www.http.ChunkedInputStream.read(Unknown Source)
    at java.io.FilterInputStream.read(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(Unknown Source)
    at java.util.zip.InflaterInputStream.fill(Unknown Source)
    at java.util.zip.InflaterInputStream.read(Unknown Source)
    at java.util.zip.GZIPInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at java.io.FilterInputStream.read(Unknown Source)
    at org.jsoup.helper.DataUtil.readToByteBuffer(DataUtil.java:113)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:447)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:393)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:159)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:148)
    at jsoupTest.JsoupTest.main(JsoupTest.java:17)
Time is:3885ms
Exception in thread "main" java.lang.NullPointerException
    at jsoupTest.JsoupTest.main(JsoupTest.java:25)
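The NullPointerException at the end of the trace is a follow-on error: the timeout was caught and printed, doc stayed null, and the program went on to call a method on it. One possible way to avoid that, sketched here with only the JDK, is to wrap the fetch so that a failure produces an empty Optional instead of null. The names SafeFetch and tryFetch are hypothetical, and the lambda in main is a stand-in for a request that times out.

```java
import java.net.SocketTimeoutException;
import java.util.Optional;
import java.util.concurrent.Callable;

class SafeFetch {

    /** Runs the fetch and returns its result, or an empty Optional if it failed. */
    static <T> Optional<T> tryFetch(Callable<T> fetch) {
        try {
            return Optional.ofNullable(fetch.call());
        } catch (Exception e) {
            // Mirrors the broad catch in the test program, but the caller
            // can no longer dereference a null result by accident.
            System.err.println("Fetch failed: " + e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a request that times out.
        Optional<String> page = tryFetch(() -> {
            throw new SocketTimeoutException("Read timed out");
        });
        System.out.println(page.map(p -> "Title is:" + p).orElse("No document, skipping parse"));
    }
}
```

With jsoup on the classpath, the call would presumably look like tryFetch(() -> Jsoup.connect(url).get()), followed by a map over the returned document to read its title.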

2. With a timeout set, the request generally does not time out:

package jsoupTest;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupTest {

    public static void main(String[] args) throws IOException {
        String url = "http://www.weather.com.cn/weather/101010400.shtml";
        long start = System.currentTimeMillis();
        Document doc = null;
        try {
            // The only change from the first version: a 5000 ms (5 s) timeout.
            doc = Jsoup.connect(url).timeout(5000).get();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.out.println("Time is:" + (System.currentTimeMillis() - start) + "ms");
        }
        // Still throws a NullPointerException if the request failed and doc is null.
        Elements elem = doc.getElementsByTag("Title");
        System.out.println("Title is:" + elem.text());
    }
}

Output:

Time is:4158ms
Title is:顺义天气预报-今日_明日_一周天气预报:16日星期五  多云转晴  11/-4℃
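A fixed 5 s timeout only makes timeouts less likely; a slow enough response can still exceed it. As a possible extension, not something the test program above does, the request could be retried with a longer timeout each time. The sketch below uses only the JDK; RetryFetch, TimedFetch, and fetchWithRetry are hypothetical names, and the lambda in main fakes a site that needs at least 10 s.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

class RetryFetch {

    /** A fetch that accepts a timeout in milliseconds, e.g. a jsoup request. */
    @FunctionalInterface
    interface TimedFetch<T> {
        T fetch(int timeoutMillis) throws IOException;
    }

    /**
     * Tries the fetch up to maxAttempts times, doubling the timeout after each
     * SocketTimeoutException. Other IOExceptions are not retried.
     */
    static <T> T fetchWithRetry(TimedFetch<T> fetch, int initialTimeoutMillis, int maxAttempts)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        int timeout = initialTimeoutMillis;
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.fetch(timeout);
            } catch (SocketTimeoutException e) {
                last = e;      // remember the failure in case every attempt times out
                timeout *= 2;  // give the next attempt more time
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        // Fake fetch: succeeds only once the timeout reaches 10 seconds.
        String result = fetchWithRetry(t -> {
            if (t < 10_000) {
                throw new SocketTimeoutException("Read timed out");
            }
            return "fetched with timeout " + t + "ms";
        }, 5000, 3);
        System.out.println(result);
    }
}
```

With jsoup on the classpath, the call would presumably be fetchWithRetry(t -> Jsoup.connect(url).timeout(t).get(), 5000, 3). Doubling the timeout is one simple policy among several; a fixed timeout with a short pause between attempts would work as well.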
