Crawling CSDN Blogs with Java

蜗牛5

Category: Big Data. Tags: java, crawler

Article link: https://blog.csdn.net/leiguang55555/article/details/50918478

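I recently had to build a search tool for technical blogs, which is how I got started with web crawlers. Since Java is the language I know best (mainly because it has so many ready-made libraries), I wrote a small demo. Let's get into it.
Reference blog link
1. Request rules. I use Jsoup to request web pages. It lets you configure a request according to rules you set, so a request-rule class is needed. PS: in the end it did not get used much.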

The parent rule class, holding the settings most requests need:

package com.lg.po;

import java.io.Serializable;

/**
 * Parent rule class holding the settings most requests need.
 * @author LG
 */
public class SuperRule implements Serializable {

    /**
     * URL of the page
     */
    private String url;
    /**
     * Request method, GET or POST
     */
    private int requestType = GET;
    /**
     * GET request method
     */
    public final static int GET = 0;
    /**
     * POST request method
     */
    public final static int POST = 1;
    /**
     * Parameter names
     */
    private String[] params;
    /**
     * Values matching the parameter names, by index
     */
    private String[] values;
    /**
     * The tag used for the first filtering pass over the returned HTML
     */
    private String resultTagName;
    /**
     * How resultTagName is interpreted: CLASS / ID / SELECTION.
     * Defaults to ID.
     */
    private int type = ID;

    public final static int ID = 0;
    public final static int CLASS = 1;
    public final static int SELECTION = 2;

    public SuperRule() {}

    public SuperRule(String url) {
        this.url = url;
    }

    public SuperRule(String url, int requestType, String[] params, String[] values,
            String resultTagName, int type) {
        super();
        this.url = url;
        this.requestType = requestType;
        this.params = params;
        this.values = values;
        this.resultTagName = resultTagName;
        this.type = type;
    }
    // The remaining getters and setters are left to the reader
}
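To make the role of `resultTagName`, `type`, `params` and `values` more concrete, here is a minimal sketch of how such a rule could be turned into request inputs. It is not the author's code: the class `RuleSketch`, its two helper methods and the sample names `article_list` and `page` are hypothetical, and it assumes that an ID maps to a `#name` CSS selector and a CLASS to `.name`, which is the form Jsoup's `select` method accepts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RuleSketch {
    // Same values as the constants in SuperRule.
    public static final int ID = 0;
    public static final int CLASS = 1;
    public static final int SELECTION = 2;

    /** Turns resultTagName plus its type into a CSS selector string (assumed mapping). */
    public static String toSelector(String resultTagName, int type) {
        switch (type) {
            case ID:
                return "#" + resultTagName;
            case CLASS:
                return "." + resultTagName;
            case SELECTION:
                return resultTagName; // already a full selector
            default:
                throw new IllegalArgumentException("unknown type: " + type);
        }
    }

    /** Pairs params[i] with values[i], keeping the original order. */
    public static Map<String, String> toRequestData(String[] params, String[] values) {
        Map<String, String> data = new LinkedHashMap<>();
        if (params == null || values == null) {
            return data; // a rule without parameters yields an empty map
        }
        if (params.length != values.length) {
            throw new IllegalArgumentException("params and values differ in length");
        }
        for (int i = 0; i < params.length; i++) {
            data.put(params[i], values[i]);
        }
        return data;
    }

    public static void main(String[] args) {
        // "article_list" and "page" are made-up names for illustration.
        System.out.println(toSelector("article_list", CLASS)); // .article_list
        System.out.println(toRequestData(new String[] {"page"}, new String[] {"2"})); // {page=2}
    }
}
```

The resulting map could be handed to a Jsoup `Connection` as request data and the selector string to `select` on the parsed document, but how the original project wires these together is not shown in the article.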

The rule class for parsing CSDN blogs:

package com.lg.po;
/**
 * Blog rule class
 * @author LG
 */
public class BlogRule extends SuperRule {

    /**
     * Whether to go from a blog space straight to one specific blog post; defaults to false
     */
    private boolean isDirect = false;
    /**
     * The type of this blog link; defaults to BLOG
     */
    private int blogType = BLOG;
    /**
     * HOME: my home page, SPACE: a blog space, BLOG: a single blog post
     */
    public final static int HOME = 0;
    public final static int SPACE = 1;
    public final static int BLOG = 2;

    public BlogRule() {}

    public BlogRule(boolean isDirect, int blogType) {
        super();
        this.isDirect = isDirect;
        this.blogType = blogType;
    }
    // Getters and setters omitted
}
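The article does not show how `blogType` gets chosen for a given link. As one possible illustration, the sketch below guesses the type from the URL path. Everything in it is an assumption: the class `BlogTypeSketch`, the `classify` method, and the URL shapes (a post contains `/article/details/`, a blog space is a single path segment holding the user name, anything else counts as HOME) are inferred only from the article link quoted at the top of this post.

```java
import java.net.URI;

public class BlogTypeSketch {
    // Same values as the constants in BlogRule.
    public static final int HOME = 0;
    public static final int SPACE = 1;
    public static final int BLOG = 2;

    /** Guesses the link type from the URL path (assumed URL shapes, see above). */
    public static int classify(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            path = "";
        }
        if (path.contains("/article/details/")) {
            return BLOG; // a single post
        }
        // Drop leading and trailing slashes, then count path segments.
        String trimmed = path.replaceAll("^/+|/+$", "");
        if (!trimmed.isEmpty() && !trimmed.contains("/")) {
            return SPACE; // exactly one segment: treated as a user's blog space
        }
        return HOME; // fallback for everything else
    }

    public static void main(String[] args) {
        System.out.println(classify("https://blog.csdn.net/leiguang55555/article/details/50918478")); // 2
        System.out.println(classify("https://blog.csdn.net/leiguang55555")); // 1
        System.out.println(classify("https://blog.csdn.net/")); // 0
    }
}
```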

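2. Fetching page content. If you request a server too often or too many times, you may run into its anti-crawler measures. There are many workarounds online, such as lowering the request rate or crawling in separate time windows. Here I switch proxies: when one IP stops working, another one takes over right away. Free IP proxy addresses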
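The proxy-switching idea can be sketched on its own, independent of any HTTP library. The class below is a hypothetical illustration, not the code from this project: the name `ProxyRotator`, the `host:port` string format for queue entries and the second proxy address are all assumptions. It shows one way to make sure that when several threads notice the same dead proxy, only the first report actually advances the queue.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ProxyRotator {
    private final Queue<String> ipQueue; // entries assumed to look like "host:port"
    private volatile String current;     // the proxy currently in use

    public ProxyRotator(Queue<String> ipQueue) {
        this.ipQueue = ipQueue;
        this.current = ipQueue.poll();
    }

    public String current() {
        return current;
    }

    /**
     * Switches to the next proxy only if `failed` is still the one in use.
     * A thread reporting an already replaced proxy just gets the current one back.
     */
    public synchronized String rotate(String failed) {
        if (failed != null && failed.equals(current)) {
            String next = ipQueue.poll();
            if (next == null) {
                throw new IllegalStateException("no proxies left");
            }
            current = next;
        }
        return current;
    }

    /** Converts a "host:port" entry into a java.net.Proxy without a DNS lookup. */
    public static Proxy toProxy(String hostPort) {
        int colon = hostPort.lastIndexOf(':');
        String host = hostPort.substring(0, colon);
        int port = Integer.parseInt(hostPort.substring(colon + 1));
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        Queue<String> queue = new ConcurrentLinkedQueue<>();
        queue.add("218.106.96.196:80"); // the default proxy used in this article
        queue.add("10.0.0.2:8080");     // made-up second proxy for illustration
        ProxyRotator rotator = new ProxyRotator(queue);
        String failed = rotator.current();
        System.out.println(rotator.rotate(failed)); // 10.0.0.2:8080
        System.out.println(rotator.rotate(failed)); // stale report, still 10.0.0.2:8080
    }
}
```

Comparing against the failed proxy inside a synchronized method plays the same role as a "only one thread may switch" flag, without the flag having to be reset by hand.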

package com.lg.utils;

import java.io.IOException;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.CountDownLatch;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.lg.filter.DatasBackUp;
import com.lg.po.BlogRule;
import com.lg.po.SuperRule;
/**
 * Utility class for parsing web pages
 * @author LG
 */
public class ParseCommenUtil {

    /**
     * Flag ensuring that only one thread switches the IP at a time
     */
    private static  boolean ischange = true;
    /**
     * Proxy IP
     */
    private static String proxyHost = "218.106.96.196";
    /**
     * Proxy port
     */
    private static int proxyPort = 80;

    /**
     * Downloads the page at the given URL
     */
    public static Document download(SuperRule rule,Queue<String> ipQueue){  
        Document doc = downloadHttp(rule,ipQueue);  
        try {
            doc.setBaseUri("http://blog.csdn.net");// set the page's base URI, because some links in the page are relative
        } catch (Exception e) {
            System.
