As with my previous post, when I was learning to build a web crawler with httpclient I found plenty of material online, but it all targeted the older 3.1 or 4.3 versions. Having worked through httpclient 4.5 on my own, I'd like to share my implementation of page crawling with httpclient + jsoup.
Fetching pages with httpclient + jsoup is very simple. First, use httpclient's GET method (see my previous post on GET if anything is unclear) to download the full content of a page; then parse that content with jsoup to extract every link on the page; finally, fetch each of those linked pages with GET as well. This way we obtain the content of every page a given page links to. For example, if a page contains 50 links, we end up with 50 pages. The full code is below; take a look.
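Before diving into the full classes, here is a minimal, self-contained sketch of the jsoup half of the idea. The hardcoded HTML string stands in for a page body that httpclient would fetch over the network, and the class name LinkExtractDemo is just for illustration:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractDemo {
    public static void main(String[] args) {
        // A hardcoded HTML fragment stands in for the page body
        // that httpclient would fetch over the network
        String html = "<html><body>"
                + "<a href=\"http://blog.csdn.net/a\">A</a>"
                + "<a href=\"http://blog.csdn.net/b\">B</a>"
                + "</body></html>";
        Document doc = Jsoup.parse(html);
        // "a[href]" selects only anchors that actually carry a href attribute
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("href"));
        }
    }
}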
Fetching the URLs
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.config.RequestConfig.Builder;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GetUrl {
    public static List<String> getUrl(String ur) {
        // Create the default httpClient instance
        CloseableHttpClient client = HttpClients.createDefault();
        // List of pages that have already been visited
        List<String> urllist = new ArrayList<String>();
        // Create the GET request
        HttpGet get = new HttpGet(ur);
        // Set the connection timeouts
        Builder custom = RequestConfig.custom();
        RequestConfig config = custom.setConnectTimeout(5000)
                .setConnectionRequestTimeout(1000)
                .setSocketTimeout(5000).build();
        get.setConfig(config);
        // Set request headers to mimic a browser
        get.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        get.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        try {
            // Execute the GET request
            CloseableHttpResponse response = client.execute(get);
            // Get the response entity
            HttpEntity entity = response.getEntity();
            // Convert the response entity to a String
            String html = EntityUtils.toString(entity, "UTF-8");
            // Parse the String into a jsoup Document
            Document doc = Jsoup.parse(html);
            // Find all <a> tags on the page
            Elements links = doc.getElementsByTag("a");
            for (Element element : links) {
                // Read the href attribute of the <a> tag
                String url = element.attr("href");
                // Keep it only if it is a blog URL we have not seen yet
                if (url.startsWith("http://blog.csdn.net/") && !urllist.contains(url)) {
                    GetPage.getPage(url);
                    System.out.println(url);
                    urllist.add(url);
                }
            }
            response.close();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return urllist;
    }
}
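One thing worth noting: attr("href") returns the attribute verbatim, so a relative link such as /article/list never matches the http:// prefix check above and is silently dropped. If you parse with a base URI, jsoup can resolve relative links to absolute ones for you. A small self-contained sketch of that variant (the class name and the example path are made up for illustration):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteLinkDemo {
    public static void main(String[] args) {
        // A relative link in the markup; the second argument to parse()
        // is the base URI jsoup resolves it against
        String html = "<a href=\"/some/article\">an article</a>";
        Document doc = Jsoup.parse(html, "http://blog.csdn.net/");
        for (Element element : doc.select("a[href]")) {
            // "abs:" asks jsoup for the absolute form of the attribute;
            // this prints http://blog.csdn.net/some/article
            System.out.println(element.attr("abs:href"));
        }
    }
}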
Saving a page for each fetched URL
import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.util.UUID;

import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.config.RequestConfig.Builder;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class GetPage {
    public static boolean getPage(String url) {
        // Create the default httpClient instance
        CloseableHttpClient client = HttpClients.createDefault();
        // BufferedReader for reading the response body
        BufferedReader br = null;
        // Set the connection timeouts
        Builder custom = RequestConfig.custom();
        RequestConfig config = custom.setConnectTimeout(5000)
                .setConnectionRequestTimeout(1000)
                .setSocketTimeout(5000).build();
        // Create the GET request
        HttpGet get = new HttpGet(url);
        get.setConfig(config);
        // Set request headers to mimic a browser
        get.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        get.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        try {
            // Execute the GET request
            CloseableHttpResponse response = client.execute(get);
            // Get the response entity
            HttpEntity entity = response.getEntity();
            br = new BufferedReader(new InputStreamReader(entity.getContent(), "UTF-8"));
            // Read the body line by line
            StringBuilder page = new StringBuilder();
            String line;
            while ((line = br.readLine()) != null) {
                page.append(line).append("\n");
            }
            // Write the page to the output directory under a random file name
            FileWriter writer = new FileWriter("D:/html/" + UUID.randomUUID() + ".html");
            // Create a character output stream
            PrintWriter fout = new PrintWriter(writer);
            fout.print(page.toString());
            fout.close();
            response.close();
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        } finally {
            try {
                if (br != null) {
                    br.close();
                }
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return true;
    }
}
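If you don't need line-by-line processing, the whole read-and-write loop can be replaced by EntityUtils, which GetUrl above already uses. A sketch of a hypothetical helper (SavePage and save are names I made up; it assumes the same D:/html/ output directory exists):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.UUID;

import org.apache.http.HttpEntity;
import org.apache.http.util.EntityUtils;

public class SavePage {
    // Hypothetical helper: saves an already-fetched response entity in one call
    public static void save(HttpEntity entity) throws IOException {
        // EntityUtils drains the stream and decodes it as UTF-8
        String page = EntityUtils.toString(entity, "UTF-8");
        // Assumes D:/html/ exists, as GetPage above does
        Files.write(Paths.get("D:/html/" + UUID.randomUUID() + ".html"),
                page.getBytes("UTF-8"));
    }
}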
Finally, the test class
import java.util.List;

public class Test {
    public static void main(String[] args) {
        // Crawl the start page and collect every blog link on it
        List<String> list = GetUrl.getUrl("http://blog.csdn.net/");
        System.out.println("Now crawling the pages behind each collected link");
        int i = 1;
        for (String url : list) {
            GetUrl.getUrl(url);
            System.out.println("Finished link #" + i);
            i++;
        }
    }
}
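Note that urllist is local to each getUrl call, so the second-level calls in the loop above can re-download pages the first pass already saved. One way to avoid that (a sketch only; VisitedSet and markVisited are names I made up, not part of the code above) is a set shared across all calls:

import java.util.HashSet;
import java.util.Set;

public class VisitedSet {
    // Shared across all crawl calls; Set.add returns false for duplicates
    private static final Set<String> visited = new HashSet<String>();

    public static boolean markVisited(String url) {
        synchronized (visited) {
            return visited.add(url);
        }
    }
}

Inside GetUrl's loop, the check !urllist.contains(url) would then become VisitedSet.markVisited(url).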
That's how to crawl web pages with httpclient + jsoup. I hope it helps.