Scraping novel content: from GUI to crawler (2)

CaCu999

Published 2021-07-15 11:12:39

Article link: https://blog.csdn.net/qq_41940001/article/details/118527137

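Day 2 & Day 3: Reading content from a URL

1. Crawler

Given a URL, send a request and obtain the response.
Parse the response content, extract the data you want, and collect any new URLs to follow.
Reference: https://www.cnblogs.com/sanmubird/p/7857474.html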
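
The two steps above form a loop: fetch a page, parse it, queue the URLs it yields. Below is a minimal sketch of that loop using only the JDK. The class name `CrawlLoopSketch`, the `fetcher` parameter, and the regex-based link extraction are illustrative assumptions, not the code used later in this post (which relies on HttpClient and Jsoup).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the fetch-parse-follow loop. */
final class CrawlLoopSketch {
    // Naive href extractor, good enough for a demo; a real HTML parser is more robust.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    /** Visits pages breadth-first from startUrl and returns the URLs in visit order. */
    static List<String> crawl(String startUrl, Function<String, String> fetcher, int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already processed
            }
            String html = fetcher.apply(url); // step 1: request -> response
            if (html == null) {
                continue; // nothing to parse
            }
            Matcher m = HREF.matcher(html); // step 2: parse and collect new URLs
            while (m.find()) {
                queue.add(m.group(1));
            }
        }
        return new ArrayList<>(visited);
    }
}
```

Passing the fetcher in as a function keeps the loop testable offline: an in-memory map can stand in for the network, and a real HTTP call can be plugged in later.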

2. HttpComponents (HttpClient)

1) Importing

Maven

           <dependency>
               <groupId>org.apache.httpcomponents</groupId>
               <artifactId>httpclient</artifactId>
               <version>4.5.2</version>
           </dependency>

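Reference: https://blog.csdn.net/mao_xiaoxi/article/details/89206313

Importing the jar manually
- Download
  - Go to http://hc.apache.org/downloads.cgi
  - Download the binary package https://mirror-hk.koddos.net/apache//httpcomponents/httpclient/binary/httpcomponents-client-5.1-bin.zip
  - Note: this archive is HttpClient 5.1, whose classes live under org.apache.hc.client5. The Maven dependency above and the code below use the 4.5.x API (org.apache.http), so a 4.5.x jar is the matching download.
- Import
  - Add a lib folder to the project and put the jar in it
  - Add it to the project
    - Select the jar, right-click, Build Path, Add to Build Path
    - A "Referenced Libraries" node appears in the project
- Verify
  - Right-click the project name, Build Path, Configure Build Path, select Libraries, and check that the added jar is listed
  - Reference: https://www.cnblogs.com/zhxdxf/p/7598371.html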

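2) Basic steps

1. Create an HttpClient object.
2. Create an instance of the request method and specify the request URL: an HttpGet or HttpPost object, depending on the request type.
3. Set the request parameters by calling HttpGet.setParams() or HttpPost.setEntity().
4. Send the request: the HttpClient object's execute(request) sends it and returns an HttpResponse.
5. Read the server's response: HttpResponse.getAllHeaders() returns the response headers, and HttpResponse.getEntity() returns the HttpEntity that wraps the response body.
6. Release the connection.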

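3) Difference between HttpClient and CloseableHttpClient

CloseableHttpClient implements the HttpClient interface.
If an HttpClient is never explicitly closed, its connections stay open for a while. When other concurrent requests arrive during that window, the process can run out of handles and throw an exception. The TCP connection may also end up in the CLOSE_WAIT state: the peer has closed its side, but the socket stays in that state until the application closes it.
- Workaround: HttpClient client = new HttpClient(new HttpClientParams(),new SimpleHttpConnectionManager(true)); (this is the older Commons HttpClient 3.x API)
- Drawback: this amounts to closing the connection after every use, so there are many new/close cycles, which is costly for the JVM and hurts performance
CloseableHttpClient: backed by a connection pool, and can be closed explicitly with close().
Reference: https://blog.csdn.net/qq_31868149/article/details/103402184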
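
Because CloseableHttpClient and CloseableHttpResponse are Closeable, they can be managed with try-with-resources, which closes them even when reading the response fails. The sketch below shows that behavior with a hypothetical `FakeConnection` class standing in for the real client and response, so it runs without the HttpClient library.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

final class CloseSketch {
    /** Hypothetical stand-in for a client or response that holds a connection. */
    static final class FakeConnection implements Closeable {
        private final String name;
        private final List<String> log;

        FakeConnection(String name, List<String> log) {
            this.name = name;
            this.log = log;
            log.add("open " + name);
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    /** Returns the order of events when the body of the try block fails. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        try (FakeConnection client = new FakeConnection("client", log);
             FakeConnection response = new FakeConnection("response", log)) {
            log.add("use " + client.name + " and " + response.name);
            throw new IllegalStateException("simulated failure while reading");
        } catch (IllegalStateException e) {
            // Both resources were already closed, in reverse order, before we get here.
            log.add("caught");
        }
        return log;
    }
}
```

Calling `CloseSketch.demo()` returns the event log; the close entries appear before "caught", which is the guarantee a manual `res.close()` at the end of a method does not give when an exception is thrown earlier.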

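4) Experiment code

Steps for reading a web page

1. Open the browser: create the HttpClient
    Create an HttpClient object
    CloseableHttpClient client = HttpClientBuilder.create().build();
2. Enter the address: create the GET request
    Create an HttpGet object with the URL
    HttpGet get = new HttpGet("");
3. Press Enter: execute the GET request
    Execute the request with the HttpClient and receive the result as a response
    CloseableHttpResponse res = client.execute(get);
4. Show the result: read the response content
    Branch on the response status code
        1xx: informational
        200: success
        3xx: redirection
        404: page does not exist, resource not found
    On success (status code 200), get the HttpEntity object and read the response content
    HttpEntity entity = res.getEntity();
    result = EntityUtils.toString(entity);
5. Close the browser
    res.close();
Reference: https://blog.csdn.net/u014429653/article/details/106985970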
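
Step 4 branches on the status code before reading the body. With HttpClient 4.x the code is available as `res.getStatusLine().getStatusCode()`. The helper below is a hypothetical sketch of that branching in plain Java; the class and method names are assumptions for illustration.

```java
/** Hypothetical helper that classifies an HTTP status code. */
final class StatusSketch {
    /** Maps a status code to a short description of its class. */
    static String describe(int statusCode) {
        if (statusCode == 404) {
            return "not found";
        }
        switch (statusCode / 100) {
            case 1: return "informational";
            case 2: return "success";
            case 3: return "redirection";
            case 4: return "client error";
            case 5: return "server error";
            default: return "unknown";
        }
    }

    /** Only a plain 200 response is treated as having a page worth parsing. */
    static boolean shouldReadBody(int statusCode) {
        return statusCode == 200;
    }
}
```

In the fetch code this would typically guard the call to `EntityUtils.toString(entity)`, so that an error page is not written to disk as if it were the novel.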

Steps for writing a file

1. Open the file: File file = new File(path);
2. Create the file if it does not exist: file.createNewFile();
3. Open the writers (a FileWriter wrapped in a BufferedWriter)
    FileWriter fw = new FileWriter(path);
    BufferedWriter bw = new BufferedWriter(fw);
4. Write the string: bw.write(str);
5. Close the writer: bw.close();
Reference: https://www.cnblogs.com/x_wukong/p/4679116.html
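
The same steps can be written more compactly with java.nio, which also lets you name the output encoding instead of relying on the platform default. This is a sketch under assumptions: the class name, the temp file, and the choice of UTF-8 are illustrative and not taken from the original code.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class WriteTextSketch {
    /** Writes text to path as UTF-8, creating the file or replacing its content. */
    static void writeText(Path path, String text) throws IOException {
        // newBufferedWriter creates the file if missing and truncates it otherwise.
        try (BufferedWriter bw = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            bw.write(text);
        } // the writer is flushed and closed here, even if write() throws
    }

    /** Round trip through a temp file: returns what was read back. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("novel", ".txt"); // hypothetical file
            writeText(tmp, "\u7b2c1\u7ae0\nhello"); // "第1章" written as Unicode escapes
            String back = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            Files.delete(tmp);
            return back;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading the file back with the same charset returns the original string, Chinese characters included.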

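Implementation

Reading the web page

      import org.apache.http.HttpEntity;
      import org.apache.http.client.methods.CloseableHttpResponse;
      import org.apache.http.client.methods.HttpGet;
      import org.apache.http.impl.client.CloseableHttpClient;
      import org.apache.http.impl.client.HttpClients;
      import org.apache.http.util.EntityUtils;

      public static void catchContent() throws Exception {
          String str = "https://www.jjwxc.net/onebook.php?novelid=4889825";
          // (Converting the string to a java.net.URL first is not needed: HttpGet accepts a String)
          // Open the browser: create the HttpClient.
          // try-with-resources closes the response and the client ("close the browser")
          // even if an exception is thrown along the way.
          try (CloseableHttpClient client = HttpClients.createDefault()) {
              // Enter the address: create the GET request
              HttpGet httpget = new HttpGet(str);
              httpget.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36");
              // Press Enter: execute the GET request
              try (CloseableHttpResponse res = client.execute(httpget)) {
                  // Show the result: read the response body as a string
                  HttpEntity entity = res.getEntity();
                  if (entity != null) {
                      // Decode the body as GBK
                      String result = EntityUtils.toString(entity, "gbk");
                      String path = "H:\\daily\\2021\\07\\04\\a.txt";
                      writeHTML(result, path);
                  }
              }
          }
      }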

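Writing the file

                  import java.io.BufferedWriter;
                  import java.io.File;
                  import java.io.FileWriter;

                  public static void writeHTML(String str, String path) throws Exception {
                      File file = new File(path);
                      // Create the file if it does not exist yet
                      if (!file.exists()) {
                          file.createNewFile();
                      }
                      // FileWriter uses the platform default charset.
                      // try-with-resources closes the writer even if write() fails.
                      try (BufferedWriter bw = new BufferedWriter(new FileWriter(file.getAbsoluteFile()))) {
                          bw.write(str);
                      }
                      System.out.println("ok");
                  }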

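3. Analyzing the HTML content

1) Jsoup

Jsoup parses a URL or a string of HTML text, and lets you extract and manipulate content through DOM methods, CSS selectors, and jQuery-like operations.

2) Steps

Parsing
- Parse the whole page into a document: Document document = Jsoup.parse(str);
- Get elements by class: document.getElementsByClass("readtd")
- Get an element by id: e.getElementById("novelintro")
- Note: a page may contain several elements with the same class, so the return type is Elements. An id is unique within a page, so the return type is Element.
- Reference: https://www.cnblogs.com/sam-uncle/p/10922366.html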

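Code

Reading the book information

      import org.jsoup.Jsoup;
      import org.jsoup.nodes.Document;
      import org.jsoup.nodes.Element;

      public static void readInfo(String str) throws Exception {
          // Read all the book information: title, author, and blurb
          Document document = Jsoup.parse(str);
          // Get the title and author name
          Element titles = document.getElementById("oneboolt").getElementsByClass("sptd").first();
          String title = titles.text();
          // The first element with class "readtd" contains the whole blurb
          Element e = document.getElementsByClass("readtd").first();
          String info = title;
          // Get the element with id "novelintro" and its children. Ids are unique, so it could
          // also be looked up on the whole document, though that would be somewhat slower.
          // novelintro holds the blurb and its only markup is line breaks,
          // so take the inner HTML and strip the br tags.
          String str2 = e.getElementById("novelintro").html();
          // The search string is assumed to be "<br>", based on the comment above
          str2 = str2.replace("<br>", "");
          info = info + "\n\n" + str2;
          // Write to a file
          String path = "H:\\daily\\2021\\07\\05\\b.txt";
          writeHTML(info, path);
      }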

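Reading each chapter's name and link

  import java.util.ArrayList;
  import java.util.List;
  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.jsoup.nodes.Element;
  import org.jsoup.select.Elements;

  public static void readChapterUrl(String str) throws Exception {
      Document document = Jsoup.parse(str);
      Element e = document.getElementById("oneboolt");
      Elements chapters = e.getElementsByAttributeValue("itemprop", "headline");
      int i = 1;
      List<String> title = new ArrayList<>();
      List<String> link = new ArrayList<>();
      for (Element chapter : chapters) {
          String url = chapter.getElementsByAttributeValue("itemprop", "url").attr("href");
          // Stop at the first chapter without a link.
          // Compare by content: == on strings compares references, not text.
          if (url.isEmpty()) break;
          String name = "第" + (i++) + "章\t " + chapter.text();
          link.add(url);
          title.add(name);
          System.out.println(name + "\n" + url);
      }
  }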
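
The loop above stops at the first chapter whose href is empty (Jsoup's attr() returns an empty string when the attribute is missing). In Java, `==` on strings compares object references, so an emptiness check written with `==` may work only by accident. A small sketch of the difference, with hypothetical names:

```java
/** Hypothetical helper contrasting reference equality with a content check. */
final class EmptyCheckSketch {
    /** Content-based check: true for null or a zero-length string. */
    static boolean isBlankLink(String href) {
        return href == null || href.isEmpty();
    }

    /** Reference comparison, shown only to illustrate the pitfall. */
    static boolean sameReference(String a, String b) {
        return a == b;
    }
}
```

For an empty string built at runtime, such as `new String("")`, `sameReference(s, "")` is false while `isBlankLink(s)` is true, which is why `isEmpty()` or `equals("")` is the safer test.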

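Reading the chapter content

Replacing content with regular expressions: https://www.cnblogs.com/xiaoshen666/articles/10641002.html

public static String readContent() throws Exception {
    String str = "http://www.jjwxc.net/onebook.php?novelid=5392905&chapterid=1";
    // surfNet is not shown in this post; it is assumed to fetch the URL and return the page HTML
    String result = surfNet(str);
    Document document = Jsoup.parse(result);
    Element e = document.getElementsByClass("noveltext").first();
    // ownText() returns only this element's own text, not that of its child elements
    String novel = e.ownText();
    // br tags are read back as spaces, so replacing whitespace with newlines restores the layout.
    // In a Java string literal the regex \s must be written as "\\s".
    novel = novel.replaceAll("\\s", "\n");
    return novel;
}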
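
To make the whitespace replacement concrete, here is a small stand-alone sketch. The first method does the one-for-one replacement used above; the second, which trims the text and collapses runs of whitespace into a single newline, is an optional variant added here for illustration and is not part of the original code.

```java
/** Hypothetical helpers showing regex-based whitespace replacement. */
final class WhitespaceSketch {
    /** Replaces every single whitespace character with a newline. */
    static String spacesToNewlines(String text) {
        return text.replaceAll("\\s", "\n");
    }

    /** Trims the text, then turns each run of whitespace into one newline. */
    static String collapseToNewlines(String text) {
        return text.trim().replaceAll("\\s+", "\n");
    }
}
```

With the input `"a b  c"`, the first method yields `"a\nb\n\nc"` (two spaces become two newlines), while the second yields `"a\nb\nc"`.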

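Writing to a txt file

Writing with FileWriter and PrintWriter (as written, each call overwrites the file, so content cannot be added at the end)

   public static void writeHTML(String str, String path) throws Exception {
       File file = new File(path);
       // Create the file if it does not exist yet
       if (!file.exists()) {
           file.createNewFile();
       }
       // Closing the PrintWriter also flushes and closes the wrapped FileWriter,
       // so no separate flush() or close() calls are needed.
       try (PrintWriter pw = new PrintWriter(new FileWriter(file.getAbsoluteFile()))) {
           pw.println(str);
       }
   }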
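
The overwrite behavior comes from how the FileWriter is opened. FileWriter also has a constructor that takes a second boolean argument, and passing true opens the file in append mode. A minimal sketch, assuming a hypothetical `appendLine` helper and a temp file for the demo:

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class AppendSketch {
    /** Appends one line to the file instead of replacing its content. */
    static void appendLine(Path path, String line) throws IOException {
        // The second constructor argument (true) selects append mode.
        try (PrintWriter pw = new PrintWriter(new FileWriter(path.toFile(), true))) {
            pw.print(line);
            pw.print('\n'); // fixed separator so the output is the same on every platform
        }
    }

    /** Appends two lines to a temp file and returns the file content. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("append", ".txt"); // hypothetical file
            appendLine(tmp, "chapter 1");
            appendLine(tmp, "chapter 2");
            String content = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            Files.delete(tmp);
            return content;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After two calls the file holds both lines, so chapters can be written one at a time without a random-access stream.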

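Writing with a random-access file stream

    public static void writeHTML(String str, String path) throws Exception {
        // A random-access file stream can read and write at any position in the file
        try (RandomAccessFile randomfile = new RandomAccessFile(path, "rw")) {
            // Move to the end of the file so the new content is appended
            long length = randomfile.length();
            randomfile.seek(length);
            // Written as raw bytes; Chinese text came out garbled.
            // Note that getBytes() with no argument uses the platform default charset.
            str = str + "\n";
            randomfile.write(str.getBytes());
        }
    }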
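
One possible cause of the garbled Chinese text is a charset mismatch: `getBytes()` with no argument encodes with the platform default, which may differ from the encoding the reader of the file expects. The sketch below names the charset explicitly. UTF-8, the class name, and the temp file are assumptions for illustration; whether this fixes the original symptom depends on how the file is opened afterwards.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class RandomAccessAppendSketch {
    /** Appends text plus a newline at the end of the file, encoded as UTF-8. */
    static void append(Path path, String text) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "rw")) {
            raf.seek(raf.length()); // jump to the end of the file
            raf.write((text + "\n").getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Appends Chinese and ASCII text to a temp file and reads it back as UTF-8. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("raf", ".txt"); // hypothetical file
            append(tmp, "\u4e2d\u6587"); // "中文" written as Unicode escapes
            append(tmp, "abc");
            String content = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            Files.delete(tmp);
            return content;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

As long as the writer and the reader agree on the charset, the Chinese characters survive the round trip.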