How to Quickly Scrape a Novel You Want

Latest recommended article published 2023-05-21 21:01:00

星夜007

Category: Crawlers

Article link: https://blog.csdn.net/q893487191/article/details/106340028

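Using jsoup

All examples below use the Biquge (笔趣阁) site. Please use crawlers with care and avoid putting excessive load on the target site.
Downloading a novel as a txt file used to mean hunting through various sites for ages; now it takes 10 s at most.

1. A simple demo first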

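import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SimpleDemo { // class name is illustrative
    public static void main(String[] args) throws Exception {
        String furl = "http://www.xbiquge.la/25/25430/12402769.html";
        Document doc = Jsoup.connect(furl).timeout(5000).get(); // GET the page, 5 s timeout
        Element content = doc.getElementById("content");        // the chapter body
        System.out.println(content.text());
    }
}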

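The example above breaks down into a few steps:

- A GET request to the URL fetches the HTML page, which jsoup wraps in a Document object.
- jsoup provides CSS-like selectors for locating elements in that document.

Running it prints the content of one chapter.

Before the demo is useful in practice, it still needs to:

- fetch all chapters and save them to a file in order
- run as fast as possible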

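2. A practical demo

We need the following steps:

- Get the URLs of all chapters.
- Fetch them with multiple threads and merge the results.
- Save the result to local disk.

Test URL: http://www.xbiquge.la/25/25430/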

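import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

public class NovelCrawler { // class name is illustrative

    public static void main(String[] args) throws Exception {
        String furl = "http://www.xbiquge.la/25/25430/";
        long start = System.currentTimeMillis();
        Document doc = Jsoup.connect(furl).timeout(5000).get();
        String title = doc.title();
        // The page title is expected to hold the novel name followed by "小说" ("novel");
        // fall back to the full title if the marker is missing.
        int cut = title.indexOf("小说");
        String novelName = cut > 0 ? title.substring(0, cut) : title;
        Element list = doc.getElementById("list");

        if (list == null) {
            return;
        }
        // Collect the links to all chapters.
        Elements chapterList = list.select("dl dd a");

        if (chapterList.isEmpty()) {
            return;
        }

        // Counts finished tasks; an atomic class keeps the increment thread-safe.
        AtomicInteger num = new AtomicInteger(0);
        // Thread pool used to fetch chapters in parallel.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(32, 32, 30, TimeUnit.SECONDS, new LinkedBlockingDeque<>());
        // Content element of every chapter, indexed by chapter position.
        Element[] successElemenetArr = new Element[chapterList.size()];
        // Name of every chapter, indexed the same way.
        String[] chapNameArr = new String[chapterList.size()];

        int i = 0;
        for (Element cp : chapterList) {
            int finalI = i++;
            executor.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        String href = cp.attr("href");
                        chapNameArr[finalI] = cp.text();

                        String tturl = "http://www.xbiquge.la" + href;
                        Document ttDoc = Jsoup.connect(tturl).get();

                        // Each task writes only to its own slot (the "hash method" discussed later),
                        // so no thread-safe collection is needed and chapter order is preserved.
                        successElemenetArr[finalI] = ttDoc.getElementById("content");
                    } catch (Exception e) {
                        System.err.println("Failed to fetch chapter " + finalI + ": " + e.getMessage());
                    } finally {
                        // Count failed tasks too; otherwise the wait loop below never ends.
                        System.out.println(num.incrementAndGet());
                    }
                }
            });
        }
        // Block until every task has finished. Polling is not elegant; a CountDownLatch would be cleaner.
        while (num.get() != chapterList.size()) {
            Thread.sleep(100);
        }
        executor.shutdown();
        long end = System.currentTimeMillis();
        // Write to the local disk.
        writeFile(novelName, successElemenetArr, chapNameArr);
        System.out.println("Elapsed (ms): " + (end - start));
    }

    private static void writeFile(String name, Element[] elements, String[] chapNameArr) throws Exception {
        // Written straight to the D: drive for convenience.
        File file = new File("D:\\" + name + ".txt");
        // try-with-resources closes the stream even if a write fails.
        try (FileOutputStream fos = new FileOutputStream(file)) {
            // Novel name.
            String title = "       " + name + "\n";
            fos.write(title.getBytes(StandardCharsets.UTF_8));
            for (int i = 0; i < elements.length; i++) {
                // Chapter name.
                String chapName = "\n     " + chapNameArr[i] + "\n\n";
                fos.write(chapName.getBytes(StandardCharsets.UTF_8));
                Element element = elements[i];
                if (element == null) {
                    // The fetch failed or the page had no "content" element.
                    continue;
                }
                // One output line per text node of the chapter body.
                List<TextNode> textNodes = element.textNodes();
                for (TextNode tn : textNodes) {
                    fos.write(tn.text().getBytes(StandardCharsets.UTF_8));
                    fos.write("\n".getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }
}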

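Note: the code that writes to disk at the end is not great in terms of performance or layout; optimize it yourself if you are interested.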
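
The main method waits for its tasks by polling a counter, and its own comment suggests CountDownLatch as the cleaner option. Below is a minimal sketch of that idea, not the author's code: LatchDemo, fetchAll and fetchChapter are hypothetical names, and fetchChapter is a stand-in for the Jsoup request (it fails on purpose for some chapters) so the sketch runs without network access.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class LatchDemo {
    // Hypothetical stand-in for the Jsoup request; every 4th chapter fails.
    static String fetchChapter(int index) {
        if (index % 4 == 3) {
            throw new IllegalStateException("simulated network failure");
        }
        return "chapter " + index;
    }

    // Fetches `count` chapters in parallel and returns them in chapter order.
    // A failed chapter leaves a null slot.
    static String[] fetchAll(int count) {
        String[] results = new String[count];
        CountDownLatch done = new CountDownLatch(count);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < count; i++) {
            final int index = i;
            pool.submit(() -> {
                try {
                    results[index] = fetchChapter(index);
                } catch (RuntimeException e) {
                    System.err.println("chapter " + index + " failed: " + e.getMessage());
                } finally {
                    done.countDown(); // count down on success and on failure
                }
            });
        }
        try {
            // Blocks until every task has counted down, or the timeout expires.
            if (!done.await(30, TimeUnit.SECONDS)) {
                System.err.println("timed out waiting for chapters");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return results;
    }

    public static void main(String[] args) {
        for (String chapter : fetchAll(8)) {
            System.out.println(chapter);
        }
    }
}
```

Calling countDown in a finally block matters: a task that throws still releases the waiting thread, and the timeout on await is a second safety net.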

3. Multithreading concepts used in the code

AtomicInteger

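In a multithreaded environment, i++ is not atomic: it reads the value, adds one, and writes it back. Two threads can read the same stale value, so increments get lost and the final result ends up too small.

AtomicInteger avoids this by combining a volatile field with CAS (compare-and-swap).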
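
The difference can be observed directly. The sketch below is illustrative only (CounterDemo and run are made-up names): several threads increment a plain int and an AtomicInteger side by side. How many updates the plain counter loses varies from run to run, and it may occasionally lose none; the atomic counter always reaches the expected total.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CounterDemo {
    static int plain = 0; // unsynchronized counter
    static final AtomicInteger atomic = new AtomicInteger(0);

    // Starts `threads` threads that each increment both counters `perThread` times.
    // Returns {plain total, atomic total}.
    static int[] run(int threads, int perThread) {
        plain = 0;
        atomic.set(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    plain++;                  // read-modify-write, not atomic
                    atomic.incrementAndGet(); // atomic increment
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return new int[] { plain, atomic.get() };
    }

    public static void main(String[] args) {
        int[] totals = run(4, 100_000);
        System.out.println("plain=" + totals[0] + " atomic=" + totals[1]);
    }
}
```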

ThreadPoolExecutor parameters

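The thread pool takes several parameters:
- Core pool size: when a task is submitted and fewer than this many threads are running, a new thread is started, even if existing threads are idle.
- Maximum pool size: once the blocking queue is full, further tasks start new threads, but the total never exceeds this value.
- Keep-alive time for idle threads.
- Blocking queue: typically bounded, unbounded, or priority-based.
- Rejection policy: how a task is handled when the queue is full and the pool has reached its maximum size. The task is not necessarily discarded.
  
  The options are to discard the task, throw an exception (the default), run the task on the submitting thread, drop the oldest queued task to make room for the new one, or supply a custom handler.

With the unbounded LinkedBlockingDeque used in the demo, the queue never fills up, so the maximum pool size and the rejection policy never come into play while the pool is running.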
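
How these parameters interact is easiest to see with a deliberately tiny pool. The sketch below is an illustration, not part of the original demo: PoolDemo and run are made-up names, and the sizes (1 core thread, at most 2 threads, queue capacity 1) are chosen only so that four blocking tasks exercise every rule.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolDemo {
    // Submits four blocking tasks and returns {pool size, queued tasks, rejected tasks}.
    static int[] run() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 2, 30, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(1),           // bounded queue
                new ThreadPoolExecutor.AbortPolicy()); // default policy: throw
        CountDownLatch release = new CountDownLatch(1);
        Runnable blocker = () -> {
            try {
                release.await(); // keep the worker busy until released
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        int rejected = 0;
        for (int i = 0; i < 4; i++) {
            try {
                // Task 1 starts the core thread, task 2 is queued,
                // task 3 starts a second thread, task 4 is rejected.
                pool.execute(blocker);
            } catch (RejectedExecutionException e) {
                rejected++;
            }
        }
        int[] snapshot = { pool.getPoolSize(), pool.getQueue().size(), rejected };
        release.countDown(); // let the blocked tasks finish
        pool.shutdown();
        return snapshot;
    }

    public static void main(String[] args) {
        int[] s = run();
        System.out.println("pool size=" + s[0] + ", queued=" + s[1] + ", rejected=" + s[2]);
    }
}
```

Running it prints pool size=2, queued=1, rejected=1. Swapping AbortPolicy for ThreadPoolExecutor.CallerRunsPolicy would make the fourth task run on the submitting thread instead of throwing.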

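How many threads to use depends on whether the work is I/O-bound or CPU-bound. The work here is I/O-bound, for which a common rule of thumb is roughly 3 × the number of CPUs; for CPU-bound work it is the number of CPUs + 1.

Since these are HTTP calls, bandwidth also plays a part.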
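
These rules of thumb can be turned into code by asking the JVM for the CPU count. A small sketch follows; PoolSizing and its two methods are made-up names, and the multipliers are just the heuristics quoted above, not measured optima.

```java
class PoolSizing {
    // Heuristic for I/O-bound work: about 3 threads per CPU.
    static int ioBoundThreads(int cpus) {
        return 3 * cpus;
    }

    // Heuristic for CPU-bound work: one thread per CPU, plus one.
    static int cpuBoundThreads(int cpus) {
        return cpus + 1;
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("CPUs: " + cpus);
        System.out.println("I/O-bound pool size: " + ioBoundThreads(cpus));
        System.out.println("CPU-bound pool size: " + cpuBoundThreads(cpus));
    }
}
```

Either value could be passed as the core and maximum size of the ThreadPoolExecutor in place of a hard-coded number; the right figure for a crawler still depends on bandwidth and on how much load the target site tolerates.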

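The hash method

The code does not store results in a ConcurrentMap implementation. Instead it uses an array indexed by each chapter's position (the array + hash approach), which neatly sidesteps both thread-safe collections and sorting afterwards: each task writes only to its own slot, and the slots are already in chapter order. Anyone who has read the HashMap source or done LeetCode problems should find this familiar.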
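
The idea can be shown in isolation. In the sketch below (SlotDemo and collect are made-up names, and the random sleep merely simulates variable network latency), tasks finish in an unpredictable order, yet the array comes back in chapter order because every task writes to the slot that matches its index.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

class SlotDemo {
    // Fills one slot per task; completion order is random, slot order is not.
    static String[] collect(int count) {
        String[] slots = new String[count];
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < count; i++) {
            final int index = i;
            pool.execute(() -> {
                try {
                    // Simulated latency so that tasks finish out of order.
                    Thread.sleep(ThreadLocalRandom.current().nextInt(20));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                slots[index] = "chapter " + index; // no other task touches this slot
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return slots;
    }

    public static void main(String[] args) {
        for (String slot : collect(10)) {
            System.out.println(slot);
        }
    }
}
```

No locking is needed because no two tasks share a slot; the only synchronization is waiting for the pool to finish before reading the array.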
