Writing a Stable Web Crawler with Spring Boot

1. Introduction

This article shows how to build a stable crawler with Spring Boot. It covers three kinds of data: static HTML pages (no JavaScript execution), HTTP/HTTPS API responses, and pages that must be rendered by JavaScript first (which requires the Chrome browser). MySQL is used for storage. The program runs on a schedule: fetch the page, parse the data, and write it to MySQL. Baidu Gushitong (百度股市通) is used as the example target.

2. Creating the Project

The project is developed in IDEA. First create a Spring Boot project with Group set to com.crawler and Artifact set to example (screenshot omitted).

Check the Web module, then set the project name to example (screenshots omitted).

3. Crawled Data and Table Schemas

1. Crawl the overview data from the Baidu Gushitong concepts page: https://gupiao.baidu.com/concept/
2. Fetch today's price data for a single stock via the API below:

https://gupiao.baidu.com/api/stocks/stocktimeline?from=pc&os_ver=1&cuid=xxx&vv=100&format=json&stock_code=sh600358&timestamp=

The overview data includes the hot concept, its driving event, and the associated stock data; it maps to the following MySQL table:

CREATE TABLE `baidu_hot` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title_line1` varchar(255) DEFAULT NULL,
  `title_line2` int(11) DEFAULT NULL,
  `title_line3` varchar(255) DEFAULT NULL,
  `title_line4` varchar(255) DEFAULT NULL,
  `dirver_thing` text,
  `hot_stock_name_1` varchar(255) DEFAULT NULL,
  `hot_stock_code_1` varchar(11) DEFAULT NULL,
  `hot_stock_price_1` double DEFAULT NULL,
  `hot_stock_increment_1` varchar(20) DEFAULT NULL,
  `hot_stock_name_2` varchar(255) DEFAULT NULL,
  `hot_stock_code_2` varchar(11) DEFAULT NULL,
  `hot_stock_price_2` double DEFAULT NULL,
  `hot_stock_increment_2` varchar(20) DEFAULT NULL,
  `hot_stock_name_3` varchar(255) DEFAULT NULL,
  `hot_stock_code_3` varchar(11) DEFAULT NULL,
  `hot_stock_price_3` double DEFAULT NULL,
  `hot_stock_increment_3` varchar(20) DEFAULT NULL,
  `insert_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8;

The Baidu Gushitong API responses are stored directly as JSON; the table is:

CREATE TABLE `stock` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `stock_id` varchar(30) DEFAULT NULL,
  `data` json DEFAULT NULL,
  `insert_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;
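Because the data column uses MySQL's native JSON type (available from MySQL 5.7), stored responses can later be inspected in SQL without re-parsing them in Java. A sketch (the path $.errorNo is an assumption about the API's JSON layout):

-- pull one field out of the stored JSON for the latest rows
SELECT stock_id, JSON_EXTRACT(data, '$.errorNo') AS error_no
FROM stock
ORDER BY insert_time DESC
LIMIT 10;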

The database is named baidugushi.
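For reference, the database itself can be created with a statement matching the utf8 charset used by the tables:

CREATE DATABASE baidugushi DEFAULT CHARACTER SET utf8;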

4. Spring Boot Project Configuration

4.1 Logging
Logging uses logback, producing one INFO-level and one ERROR-level log file per day. The configuration file, named logback-spring.xml, takes effect when placed in the resources directory:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
    <appender name="consoleLog" class="ch.qos.logback.core.ConsoleAppender">
        <layout class="ch.qos.logback.classic.PatternLayout">
            <pattern>%d - %msg%n</pattern>
        </layout>
    </appender>
    <appender name="fileInfoLog" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <filter class="ch.qos.logback.classic.filter.LevelFilter">
            <level>ERROR</level>
            <onMatch>DENY</onMatch>
            <onMismatch>ACCEPT</onMismatch>
        </filter>
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
        <!-- daily rolling policy -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- log file path (Windows) -->
            <fileNamePattern>D:\logs\example.info.%d.log</fileNamePattern>
            <!-- Linux: <fileNamePattern>/data/log/crawler.info.%d.log</fileNamePattern> -->
        </rollingPolicy>
    </appender>
    <appender name="fileErrorLog" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>ERROR</level>
        </filter>
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
        <!-- daily rolling policy -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- log file path (Windows) -->
            <fileNamePattern>D:\logs\example.error.%d.log</fileNamePattern>
            <!-- Linux: <fileNamePattern>/data/log/crawler.error.%d.log</fileNamePattern> -->
        </rollingPolicy>
    </appender>
    <root level="info">
        <appender-ref ref="consoleLog"/>
        <appender-ref ref="fileInfoLog"/>
        <appender-ref ref="fileErrorLog"/>
    </root>
</configuration>
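With this configuration, anything logged through SLF4J lands in these files; a minimal usage sketch (the class name is illustrative):

package com.crawler.example;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingDemo {
    // INFO goes to consoleLog and fileInfoLog; ERROR also goes to fileErrorLog
    private static final Logger logger = LoggerFactory.getLogger(LoggingDemo.class);

    public void run() {
        logger.info("crawler started");
        logger.error("download failed");
    }
}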

4.2 MySQL Configuration
First create four packages under com.crawler.example: dirver, entity, map, and web (the spelling "dirver" matches the project source).
MyBatis is used to connect to MySQL.
Before configuring, add the following Maven dependencies:

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>6.0.6</version>
        </dependency>

        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>1.1.1</version>
        </dependency>

The application.properties file contains:

mybatis.type-aliases-package=com.crawler.example.entity
spring.datasource.driverClassName = com.mysql.cj.jdbc.Driver
# local debugging
spring.datasource.url = jdbc:mysql://127.0.0.1:3306/baidugushi?useUnicode=true&characterEncoding=UTF-8&useSSL=false&autoReconnect=true&serverTimezone=UTC
spring.datasource.username = root
spring.datasource.password = pwd

mybatis.type-aliases-package specifies the package of the MyBatis entity classes that map to the database tables; change the username and password to your own.
4.3 Tomcat Setup
The project is ultimately released as a war package; a jar is less convenient to deploy and manage in this setup.
In pom.xml, change packaging to war:

    <packaging>war</packaging>

Exclude the embedded Tomcat and add the required servlet dependency:

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <exclusions>
                <exclusion>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-starter-tomcat</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>javax.servlet</groupId>
            <artifactId>javax.servlet-api</artifactId>
            <version>3.1.0</version>
            <scope>provided</scope>
        </dependency>

After modifying pom.xml, a servlet initializer class is also needed; ExampleApplication referenced below is the startup class generated by Spring Boot:

package com.crawler.example;

import org.springframework.boot.builder.SpringApplicationBuilder;
import org.springframework.boot.web.support.SpringBootServletInitializer;

public class SpringBootStartApplication extends SpringBootServletInitializer {
    @Override
    protected SpringApplicationBuilder configure(SpringApplicationBuilder builder) {
        // point to the original Application class whose main method starts the app
        return builder.sources(ExampleApplication.class);
    }
}

4.4 Debugging in IDEA
Tomcat must be installed on the machine. In IDEA, open Run → Edit Configurations, delete the default Spring Boot run configuration, and create a new Tomcat configuration as shown in the two screenshots below. Uncheck the option under "After launch", otherwise a browser window opens every time the project starts.
(screenshot omitted: selecting the local Tomcat installation)
(screenshot omitted: choosing the war artifact deployed on startup)

5. Crawling Static Pages

5.1 Writing the Scheduled Task
The crawler fetches pages on a schedule. HTML parsing uses Jsoup, a Java HTML parser that can fetch a URL or parse raw HTML text directly, and provides a very convenient API for extracting and manipulating data via DOM traversal, CSS selectors, and jQuery-like methods.
Add the following dependency:

        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>
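As a quick illustration of the Jsoup API (a standalone sketch, separate from the crawler code below):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDemo {
    public static void main(String[] args) {
        // parse raw HTML text directly (a URL works too, via Jsoup.connect(url).get())
        Document doc = Jsoup.parse("<div class=\"hot\"><a href=\"/x\">title</a></div>");
        // CSS-selector style
        Element link = doc.select("div.hot a").first();
        System.out.println(link.text());        // "title"
        System.out.println(link.attr("href"));  // "/x"
        // DOM style
        System.out.println(doc.getElementsByClass("hot").get(0).text());
    }
}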

The page download component:

package com.crawler.example.web;

import com.crawler.example.dirver.BaiDuHotProcess;
import com.crawler.example.entity.BaiduHot;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.crawler.example.map.BaiduHotMap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;
import java.io.IOException;
import java.util.ArrayList;

// downloads and stores the Baidu Gushitong hot-concept page
@RestController
public class BaiduHotDown {
    private static final Logger logger = LoggerFactory.getLogger(BaiduHotDown.class);
    @Autowired
    BaiduHotMap baiduHotMap;

    @Scheduled(cron = "0/20 * * * * ? ")
    public void downBaiduHot(){
        String url = "https://gupiao.baidu.com/concept/";
        try {
            Document doc = Jsoup.connect(url).get();
            ArrayList<BaiduHot> abh = new BaiDuHotProcess().processBaiduHot(doc);
            for(BaiduHot b:abh){
                baiduHotMap.InsertBaiduHot(b);
            }
        } catch (IOException e) {
            logger.error("failed to download " + url, e);
        }
    }
}

5.2 Data extraction and parsing:

package com.crawler.example.dirver;

import com.crawler.example.entity.BaiduHot;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;

public class BaiDuHotProcess {

    public ArrayList<BaiduHot> processBaiduHot(Document doc){

        ArrayList<BaiduHot> abh = new ArrayList<BaiduHot>();
        // extract each hot-concept block
        Elements divsBig = doc.getElementsByClass("hot-concept clearfix");
        for(int i=0;i<divsBig.size();i++){
            BaiduHot baiduHot = new BaiduHot();
            // header column of the concept
            Elements column1 = divsBig.get(i).getElementsByClass("concept-header column1");
            // concept title
            baiduHot.title_line1 = column1.get(0).getElementsByClass("text-ellipsis").get(0).ownText();
            // hot-search index
            baiduHot.title_line2 = Integer.parseInt(column1.get(0).getElementsByTag("h3").get(0).getElementsByTag("span").get(0).ownText());
            // publish time
            baiduHot.title_line3 = column1.get(0).getElementsByTag("p").get(0).ownText();
            // brief summary
            baiduHot.title_line4 = column1.get(0).getElementsByTag("p").get(1).ownText();
            // driving-event description
            baiduHot.dirver_thing = divsBig.get(i).getElementsByClass("concept-event column3").get(0).ownText();
            // recommended stocks
            Elements stockUl = divsBig.get(i).getElementsByClass("no-click");
            // stock 1: name, code, price, change
            baiduHot.hot_stock_name_1 = stockUl.get(0).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
            baiduHot.hot_stock_code_1 = stockUl.get(0).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
            baiduHot.hot_stock_price_1 = Double.parseDouble(stockUl.get(0).getElementsByClass("column2").get(1).ownText());
            baiduHot.hot_stock_increment_1 = stockUl.get(0).child(2).ownText();
            // stock 2: name, code, price, change
            baiduHot.hot_stock_name_2 = stockUl.get(1).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
            baiduHot.hot_stock_code_2 = stockUl.get(1).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
            baiduHot.hot_stock_price_2 = Double.parseDouble(stockUl.get(1).getElementsByClass("column2").get(1).ownText());
            baiduHot.hot_stock_increment_2 = stockUl.get(1).child(2).ownText();
            // stock 3: name, code, price, change
            baiduHot.hot_stock_name_3 = stockUl.get(2).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
            baiduHot.hot_stock_code_3 = stockUl.get(2).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
            baiduHot.hot_stock_price_3 = Double.parseDouble(stockUl.get(2).getElementsByClass("column2").get(1).ownText());
            baiduHot.hot_stock_increment_3 = stockUl.get(2).child(2).ownText();

            abh.add(baiduHot);
        }
        return abh;
    }
}

The schedule is defined by a cron expression; an online builder such as http://cron.qqe2.com/ can generate one with a few clicks.
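For reference, a few common Spring cron patterns (the format is second, minute, hour, day of month, month, day of week; the class and method names here are illustrative, only the first expression is used in this article):

package com.crawler.example;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class CronExamples {

    @Scheduled(cron = "0/20 * * * * ?")          // every 20 seconds
    public void every20Seconds() { }

    @Scheduled(cron = "0 0 9 * * ?")             // every day at 09:00
    public void dailyAtNine() { }

    @Scheduled(cron = "0 0/30 9-15 * * MON-FRI") // every 30 minutes, 09:00-15:30, weekdays
    public void tradingHours() { }
}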
The entity class and mapper interface are as follows.

package com.crawler.example.map;

import com.crawler.example.entity.BaiduHot;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Mapper;

// mapper for the baidu_hot table
@Mapper
public interface BaiduHotMap {
    @Insert("insert into baidu_hot(title_line1,title_line2,title_line3,title_line4,dirver_thing,hot_stock_name_1," +
            "hot_stock_code_1,hot_stock_price_1,hot_stock_increment_1,hot_stock_name_2,hot_stock_code_2,hot_stock_price_2," +
            "hot_stock_increment_2,hot_stock_name_3,hot_stock_code_3,hot_stock_price_3,hot_stock_increment_3) values(" +
            "#{title_line1},#{title_line2},#{title_line3},#{title_line4},#{dirver_thing},#{hot_stock_name_1},#{hot_stock_code_1}," +
            "#{hot_stock_price_1},#{hot_stock_increment_1},#{hot_stock_name_2},#{hot_stock_code_2},#{hot_stock_price_2},#{hot_stock_increment_2}," +
            "#{hot_stock_name_3},#{hot_stock_code_3},#{hot_stock_price_3},#{hot_stock_increment_3})")
    public void InsertBaiduHot(BaiduHot baiduHot);
}

The entity class:

package com.crawler.example.entity;

public class BaiduHot implements Cloneable {

    public int id;

    public String title_line1;

    public int title_line2;

    public String title_line3;

    public String title_line4;

    public String dirver_thing;

    public String hot_stock_name_1;

    public String hot_stock_code_1;

    public double hot_stock_price_1;

    public String hot_stock_increment_1;

    public String hot_stock_name_2;

    public String hot_stock_code_2;

    public double hot_stock_price_2;

    public String hot_stock_increment_2;

    public String hot_stock_name_3;

    public String hot_stock_code_3;

    public double hot_stock_price_3;

    public String hot_stock_increment_3;
}

Finally, add the @EnableScheduling annotation to the startup class so the @Scheduled tasks actually run.
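A minimal sketch of the startup class after this change (assuming the default class generated by Spring Initializr):

package com.crawler.example;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

// @EnableScheduling turns on the @Scheduled crawler tasks in this project
@SpringBootApplication
@EnableScheduling
public class ExampleApplication {
    public static void main(String[] args) {
        SpringApplication.run(ExampleApplication.class, args);
    }
}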
The captured data ends up in the baidu_hot table (screenshot omitted).

6. Crawling HTTP API Data

6.1 Fetching today's data for a single stock
First add the org.json dependency used for JSON parsing:

        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20160810</version>
        </dependency>

The source follows; when the scheduler is enabled, it downloads today's data for stock sh600358:

package com.crawler.example.web;

import com.crawler.example.dirver.GetJson;
import com.crawler.example.entity.StockPrice;
import com.crawler.example.map.StockPriceMap;
import org.json.JSONObject;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;

// downloads and stores a stock's price data
@RestController
public class BaiduStockPrice {

    @Autowired
    StockPriceMap stockPriceMap;

    // download the stock's intraday price timeline (uncomment @Scheduled to enable)
    //@Scheduled(cron = "0/20 * * * * ? ")
    public void downStockPrice(){
        // build the URL
        String url = "https://gupiao.baidu.com/api/stocks/stocktimeline?from=pc&os_ver=1&cuid=xxx&vv=100&format=json&stock_code=sh600358&timestamp=" + System.currentTimeMillis();
        // fetch the JSON data
        JSONObject stock = new GetJson().getHttpJson(url, 1);
        if (stock == null) {
            return; // download failed
        }
        StockPrice stockPrice = new StockPrice();
        stockPrice.stock_id = "sh600358";
        stockPrice.data = stock.toString();
        // store the JSON in the database
        stockPriceMap.insertIntoStock(stockPrice);
    }
}
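Since the @Scheduled line is commented out, the task never fires on its own. Because the class is already a @RestController, one way to trigger it manually is to expose an endpoint; an illustrative addition, not part of the original project (the mapping path is made up, and @GetMapping requires Spring 4.3+):

// hypothetical manual trigger; would be added inside BaiduStockPrice
@org.springframework.web.bind.annotation.GetMapping("/stock/download")
public String triggerDownload() {
    downStockPrice();
    return "ok";
}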

The HTTP download helper that fetches a URL and returns parsed JSON:

package com.crawler.example.dirver;

import org.json.JSONObject;

import javax.net.ssl.HttpsURLConnection;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class GetJson {
    public JSONObject getHttpJson(String url,int comefrom){
        try {
            URL realUrl = new URL(url);
            HttpURLConnection connection = (HttpURLConnection)realUrl.openConnection();
            connection.setRequestProperty("accept", "*/*");
            connection.setRequestProperty("connection", "Keep-Alive");
            connection.setRequestProperty("user-agent","Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
            // open the actual connection
            connection.connect();
            // request succeeded
            if(connection.getResponseCode()==200){
                InputStream is=connection.getInputStream();
                ByteArrayOutputStream baos=new ByteArrayOutputStream();
                byte[] buffer = new byte[4096]; // 4 KB read buffer
                int len=0;
                while((len=is.read(buffer))!=-1){
                    baos.write(buffer, 0, len);
                }
                String jsonString = baos.toString("utf-8");
                baos.close();
                is.close();
                // parse into a JSONObject
                return getJsonString(jsonString, comefrom);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException ex) {
           ex.printStackTrace();
        }
        return null;
    }

    public JSONObject getHttpsJson(String url){
        try {
            URL realUrl = new URL(url);
            HttpsURLConnection httpsConn = (HttpsURLConnection)realUrl.openConnection();
            httpsConn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
            httpsConn.setRequestProperty("connection", "Keep-Alive");
            httpsConn.setRequestProperty("user-agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36");
            httpsConn.setRequestProperty("Accept-Charset","utf-8");
            httpsConn.setRequestProperty("contentType", "utf-8");
            httpsConn.connect();
            if(httpsConn.getResponseCode()==200){
                InputStream is = httpsConn.getInputStream();
                ByteArrayOutputStream baos=new ByteArrayOutputStream();
                byte[] buffer = new byte[4096]; // 4 KB read buffer
                int len=0;
                while((len=is.read(buffer))!=-1){
                    baos.write(buffer, 0, len);
                }
                String jsonString=baos.toString("utf-8");
                baos.close();
                is.close();
                return new JSONObject(jsonString);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return  null;
    }

    public JSONObject getJsonString(String str, int comefrom){
        if(comefrom==1){
            // the body is already plain JSON
            return new JSONObject(str);
        }else if(comefrom==2){
            // JSONP-style body such as callback({...}): strip the wrapper
            int start = str.indexOf('(');
            int end = str.lastIndexOf(')');
            return new JSONObject(str.substring(start + 1, end));
        }
        return null;
    }
}

The second argument of getHttpJson selects the parsing mode: 1 means the response body is plain JSON, 2 means the JSON is wrapped in parentheses (a JSONP-style response) and is unwrapped first.
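A usage sketch (both URLs are hypothetical placeholders, and org.json.JSONObject is assumed to be imported):

GetJson client = new GetJson();
// mode 1: the response body is plain JSON, e.g. {"a":1}
JSONObject plain = client.getHttpJson("https://example.com/plain.json", 1);
// mode 2: the response body is JSONP-style, e.g. cb({"a":1}); the wrapper is stripped
JSONObject jsonp = client.getHttpJson("https://example.com/data?callback=cb", 2);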
6.2 Entity class and mapper
The entity class:

package com.crawler.example.entity;

public class StockPrice {

    public  String stock_id;

    public String data;

    public String getStock_id() {
        return stock_id;
    }

    public void setStock_id(String stock_id) {
        this.stock_id = stock_id;
    }

    public String getData() {
        return data;
    }

    public void setData(String data) {
        this.data = data;
    }
}

The mapper interface:

package com.crawler.example.map;

import com.crawler.example.entity.StockPrice;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Mapper;

@Mapper
public interface StockPriceMap {

    @Insert("insert into stock(stock_id,data) values(#{stock_id},#{data})")
    public void insertIntoStock(StockPrice stockPrice);

}

The resulting rows in the stock table (screenshot omitted).

7. Crawling Dynamic Pages

7.1 Crawling pages that require rendering

Sometimes data only appears after the page has been rendered by JavaScript, or the site has anti-crawler measures; in those cases a real browser can be used to download the page. Such a crawler is easier to deploy on Windows Server, since installing Chrome on CentOS is comparatively difficult. With this approach almost any page can be captured.
A Java library for driving Chrome is cdp4j (Chrome DevTools Protocol for Java):
https://github.com/webfolderio/cdp4j
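The library is published on Maven Central; a dependency sketch (the version shown is an assumption, check the project's releases page for the one matching your setup):

        <dependency>
            <groupId>io.webfolder</groupId>
            <artifactId>cdp4j</artifactId>
            <version>2.2.1</version>
        </dependency>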

We again crawl the Baidu Gushitong hot-concept page; the crawler is:

package com.crawler.example.web;

import com.crawler.example.dirver.BaiDuHotProcess;
import com.crawler.example.entity.BaiduHot;
import com.crawler.example.map.BaiduHotMap;
import io.webfolder.cdp.Launcher;
import io.webfolder.cdp.session.Session;
import io.webfolder.cdp.session.SessionFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;

import java.util.ArrayList;

@RestController
public class BaiduHotDownChorme {

    @Autowired
    BaiduHotMap baiduHotMap;

    @Scheduled(cron = "0/20 * * * * ? ")
    public void downBaiDuHot(){
        ArrayList<String> command = new ArrayList<String>();
        // run Chrome headless (no browser window)
        command.add("--headless");
        Launcher launcher = new Launcher();
        try (SessionFactory factory = launcher.launch(command);
             Session session = factory.create()){
            session.navigate("https://gupiao.baidu.com/concept/");
            session.waitDocumentReady();
            String content = (String) session.getContent();
            Document doc = Jsoup.parse(content);
            ArrayList<BaiduHot> abh = new BaiDuHotProcess().processBaiduHot(doc);
            for(BaiduHot b:abh){
                baiduHotMap.InsertBaiduHot(b);
            }
        }catch (Exception e){
            e.printStackTrace();
        }

    }
}

These are the three ways of crawling web page data in Java. The full project source is on GitHub:
https://github.com/xiaoyangmoa/java-crawler
