A Simple Web Crawler Built with Jsoup

Latest recommended article published on 2024-08-22 11:00:00

郑斯道

Category: Java. Tags: crawler, data, web crawler

Article link: https://blog.csdn.net/zhengshidao/article/details/72845794


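This article walks through implementing a simple web crawler with Jsoup: from the basic principle of crawling, to fetching page resources with HttpClient, to parsing the content with Jsoup. Examples show how to filter and collect child links and how to scrape large amounts of data, including images and link resources.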


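I knew nothing about crawlers until a new project required scraping a large amount of data from web pages, so I spent a day learning how. The core of it turns out to be quite simple.
* Since I am writing a blog post about it, I want to be thorough. I owe that to myself and to my readers!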

What is a crawler?

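As I understand it, a crawler collects URLs from the internet and visits the pages behind them one by one.
A page contains many links. For each link we decide whether it is one we want, and then visit and process those child links in turn.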

for each link in all links on the current page
{
        if (the link is one we want && the link has not been visited)
        {
                process the link
                mark the link as visited
        }
}
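
The check in the pseudocode above could be written in Java roughly as follows. This is only a minimal sketch: the class name `LinkSelector`, the method `selectNew`, and the example URLs are made up for illustration, and "wanted" is modeled as a plain `Predicate<String>`. A link is processed only when it is both wanted and not yet visited.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper; the names are illustrative and not from the original post.
final class LinkSelector {

    // Returns the links that are wanted and not yet visited,
    // marking each returned link as visited.
    static List<String> selectNew(List<String> links,
                                  Predicate<String> wanted,
                                  Set<String> visited) {
        List<String> selected = new ArrayList<>();
        for (String link : links) {
            // Set.add returns false when the link is already in the set,
            // so a duplicate is never selected twice.
            if (wanted.test(link) && visited.add(link)) {
                selected.add(link);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Set<String> visited = new HashSet<>();
        List<String> links = Arrays.asList(
                "https://example.com/news/1",
                "https://example.com/about",
                "https://example.com/news/1",
                "https://example.com/news/2");
        // Assumption for the demo: we only want links under /news/.
        System.out.println(selectNew(links, l -> l.contains("/news/"), visited));
    }
}
```

Keeping the visited set outside the method lets the same set be shared across every page the crawler handles.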

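For a crawler:
1. First, you supply a seed link.
2. In the seed page, you look for the child links you want.
3. You follow those child links to other pages and extract the content you need from them.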
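
One possible way to decide which child links are "wanted" is to resolve each `href` against the seed URL and keep only links on the seed's own host. The sketch below assumes that rule. The class `ChildLinks`, the method `sameHostLinks`, and the sample hrefs are hypothetical, and a real project may well filter by path or pattern instead.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; class and method names are illustrative only.
final class ChildLinks {

    // Resolves each href against the seed URL and keeps only http(s) links
    // that stay on the seed's host. Hrefs that are not valid URIs are skipped.
    static List<String> sameHostLinks(String seedUrl, List<String> hrefs) {
        URI seed = URI.create(seedUrl);
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            URI resolved;
            try {
                resolved = seed.resolve(href.trim());
            } catch (IllegalArgumentException e) {
                continue; // malformed href, ignore it
            }
            String scheme = resolved.getScheme();
            boolean isHttp = "http".equalsIgnoreCase(scheme)
                    || "https".equalsIgnoreCase(scheme);
            if (isHttp && seed.getHost() != null
                    && seed.getHost().equalsIgnoreCase(resolved.getHost())) {
                result.add(resolved.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "/blog/1", "item?id=2",
                "https://other.example/x", "mailto:someone@example.com");
        System.out.println(sameHostLinks("https://www.oschina.net/news/", hrefs));
    }
}
```

Resolving first matters because pages often use relative links, which cannot be requested until they are turned into absolute URLs.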

Two things are involved here:

HTTP
A content parser

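The main process looks like this:

1. Take a seed URL, for example www.oschina.net.
2. Request the page with HttpClient. The page is bound to contain other URLs, which can serve as seeds for the next round.
3. Parse out what you want with regular expressions or Jsoup, for example the other URLs on the page or all of its images.
4. Take the next seed URL found in step 3 and repeat from step 1.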
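
The loop described above (fetch a page, parse out its links, queue them as new seeds) can be sketched without any third-party library. Everything here is an assumption made for illustration: the class `CrawlLoop`, the `fetcher` function standing in for the HTTP request, the `maxPages` limit, and the in-memory pages used in `main`. Links are extracted with a deliberately naive regular expression; Jsoup is the more robust choice for real HTML.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from the original post.
final class CrawlLoop {

    // Naive pattern for href="..." attributes. It matches any tag that
    // carries an href, so it is good enough for a demo only.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Breadth-first crawl from the seed. fetcher maps a URL to its HTML
    // ("" when the page is unavailable). Returns URLs in visiting order
    // and stops once maxPages pages have been visited.
    static Set<String> crawl(String seed, Function<String, String> fetcher, int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already visited
            }
            String html = fetcher.apply(url);     // fetch the page
            frontier.addAll(extractLinks(html));  // parsed links become new seeds
        }
        return visited;
    }

    public static void main(String[] args) {
        // In-memory pages stand in for the network in this demo.
        Map<String, String> pages = new HashMap<>();
        pages.put("http://a/", "<a href=\"http://a/1\">1</a><a href=\"http://a/2\">2</a>");
        pages.put("http://a/1", "<a href=\"http://a/\">home</a>");
        pages.put("http://a/2", "<a href=\"http://a/3\">3</a>");
        System.out.println(crawl("http://a/", u -> pages.getOrDefault(u, ""), 10));
    }
}
```

Passing the fetcher in as a function keeps the loop testable offline; in a real run it could be replaced by any method that takes a URL and returns the page source. The `maxPages` cap is there because an unbounded crawl of a live site may never finish.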

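Let's first look at how to fetch an entire page with HttpClient.
Before using HttpClient, you need to add the HttpClient jar to your project.

In newer HttpClient releases (4.3 and later) the client instance is a CloseableHttpClient, so that is what we use here.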

// Imports needed by the two methods below (Apache HttpClient 4.x)
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

    public static String get(String url) {
        String result = "";
        // try-with-resources closes the response first and then the client,
        // even if execute() throws
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
             CloseableHttpResponse response = httpclient.execute(new HttpGet(url))) {
            // Parse the body only when the server answered 200 OK
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                System.out.println(response.getStatusLine());
                HttpEntity entity = response.getEntity();
                // Read the result from the response stream
                result = readResponse(entity, "utf-8");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return result;
    }

    /**
     * Reads the response body as text.
     *
     * @param resEntity the response entity; may be null
     * @param charset   the charset used to decode the body, e.g. "utf-8"
     * @return the body text, or an empty string when there is no entity
     */
    private static String readResponse(HttpEntity resEntity, String charset) throws IOException {
        if (resEntity == null) {
            return "";
        }
        StringBuilder res = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(resEntity.getContent(), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so put it back
                res.append(line).append('\n');
            }
        }
        return res.toString();
    }

This returns the source of the whole page, including its HTML. For www.baidu.com it looks like this:

<!DOCTYPE html>
<!--STATUS OK-->
<html>
 <head>
  <meta http-equiv="content-type" content="text/html;charset=utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=Edge">
  <meta content="always" name="referrer">
  <link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css">
  <title>百度一下，你就知道</title>