A first attempt at a Java crawler: scraping Zongheng's free novels with jsoup
I had been studying Java EE, but last month I took a job in Shenzhen doing Java crawling, so I taught myself jsoup and wrote a simple crawler.
Since I enjoy reading novels, I targeted Zongheng.
I split the whole process into four steps:
1. Fetch the details of the novel list on the current page
2. Move to the next page of the list
3. Fetch the content of the current chapter
4. Move to the next chapter and repeat step 3
Fetching the current page's novel list and moving to the next list page
Code first:
public class Book {
    private String name;     // book title
    private String author;   // author
    private String classify; // category
    private String url;      // URL of the book
    private String path;     // save path
    // getters and setters omitted
}
Open Zongheng's listing of free, completed novels at page 1. Clicking through to page 2 changes the URL
from http://book.zongheng.com/store/c0/c0/b0/u0/p1/v0/s1/t0/u0/i1/ALL.html
to http://book.zongheng.com/store/c0/c0/b0/u0/p2/v0/s1/t0/u0/i1/ALL.html
So I replaced the page number with a placeholder:
http://book.zongheng.com/store/c0/c0/b0/u0/p<number>/v0/s1/t0/u0/i1/ALL.html
Then I read the maximum page count from the scraped pagination widget and looped over every page to crawl them all.
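The placeholder substitution described above can be sketched as follows. The URL template matches the one used in the crawler below; the maximum page count of 3 is only an assumed value, since in the real crawler it is parsed from the pagination widget.

```java
// Sketch: build the listing URL for each page by substituting the
// <number> placeholder in the template.
public class PageUrls {
    static final String TEMPLATE =
        "http://book.zongheng.com/store/c0/c0/b0/u0/p<number>/v0/s1/t0/u0/i1/ALL.html";

    static String urlFor(int page) {
        // replace() treats the placeholder as a literal string,
        // unlike replaceAll(), which interprets it as a regex
        return TEMPLATE.replace("<number>", String.valueOf(page));
    }

    public static void main(String[] args) {
        int maxPage = 3; // assumed; really parsed from the pagination widget
        for (int p = 1; p <= maxPage; p++) {
            System.out.println(urlFor(p));
        }
    }
}
```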
Before switching pages, I scrape the novels on the current page: for each book I grab the link behind its title, along with the title, author, and category, store them in a Book entity, and put it into a BlockingQueue; multiple threads then take from the queue and crawl each novel.
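The producer/consumer hand-off just described could look like the sketch below. It is a minimal illustration, not the crawler's actual code: Book is simplified to a String title, and the worker just collects each title where the real crawler would fetch the novel.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

// Sketch: the list scraper is the producer, putting each discovered
// book on a BlockingQueue; a fixed thread pool consumes the queue.
public class CrawlQueue {
    // Drains the queue with `workers` threads; returns the books each
    // worker "crawled" (here just collected into a shared list).
    static List<String> crawlAll(BlockingQueue<String> books, int workers)
            throws InterruptedException {
        List<String> done = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        int n = books.size();
        for (int i = 0; i < n; i++) {
            pool.execute(() -> {
                try {
                    done.add(books.take()); // real crawler fetches the book here
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return done;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> books = new LinkedBlockingDeque<>();
        books.put("Book A");
        books.put("Book B");
        System.out.println(crawlAll(books, 2)); // order may vary across runs
    }
}
```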
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.Iterator;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingDeque;
public class ZhongHeng {
    private final String listUrl = "http://book.zongheng.com/store/c0/c0/b0/u0/p<number>/v0/s1/t0/u0/i1/ALL.html";
    private BlockingQueue<Book> books = new LinkedBlockingDeque<>();
    private String path;
    private int num;

    private void getList(int page, int total) throws IOException {
        String url = listUrl.replaceAll("<number>", String.valueOf(page));
        Connection con = Jsoup.connect(url);
        Document dom = con.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8")
                .header("Accept-Encoding", "gzip, deflate")
                .header("Accept-Language", "zh-CN,zh;q=0.9")
                .header("Cache-Control", "max-age=0")
                .header("Connection", "keep-alive")