When Java Meets Web Crawling: My Database Never Runs Short of Data Again


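Preface:
A web crawler (also called a web spider or web robot, and within the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.
Put simply, a crawler is a network robot that collects and organizes data from the internet on a person's behalf. Picture the internet as a large web and the crawler as a spider moving across it: whenever it comes across its prey (the resource it is looking for), it grabs it.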

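This post is a detailed record of a project I built while learning to write crawlers. It mainly uses Java, Spring Boot, MyBatis, MySQL, and HttpClient, and the site being crawled is Autohome (汽车之家, www.autohome.com.cn).
It is a simple crawler written in Java, suited to plain pages and pages that do not force a login.
Let's begin: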

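Step 1. Analyze the site's data, work out which fields are needed, and create the database table. Here we need the car name, acceleration, fuel consumption, editor names, editor remarks, and car images.
This is the part of the page we need to extract data from:
[image]
The database table I created:
[image]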


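Step 2. Create the crawler project (I used Spring Boot) and configure it.
[image]
1. The project structure is shown above:
common: utilities (the connection configuration and job scheduling settings live here)
controller: controller layer (not needed this time)
job: job layer (the main crawler logic lives here)
mapper: database access layer
pojo: data objects
service: service layer
image: storage for downloaded images
htmll: storage for saved page files (not needed in this project)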

2. Project configuration:
// Application entry point

package com.example.data;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DataApplication {

    public static void main(String[] args) {
        SpringApplication.run(DataApplication.class, args);
    }

}

pom.xml dependencies

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.3.5.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.example</groupId>
    <artifactId>data</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>data</name>
    <description>Demo project for Spring Boot</description>

    <properties>
        <java.version>1.8</java.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jdbc</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-jdbc</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-thymeleaf</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web-services</artifactId>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>2.1.3</version>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>net.sf.json-lib</groupId>
            <artifactId>json-lib</artifactId>
            <version>2.2.3</version>
            <classifier>jdk15</classifier>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.4</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
        <dependency>
            <groupId>org.quartz-scheduler</groupId>
            <artifactId>quartz</artifactId>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-context-support</artifactId>
            <version>5.2.8.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.12</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <!-- no explicit version: inherited from spring-boot-starter-parent -->
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.13</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-api</artifactId>
            <version>5.6.3</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>

</project>

If these dependencies fail to resolve for you, the cause is probably a version conflict; look up compatible versions on Maven Repository.
application.properties data source settings

# Database to connect to
spring.datasource.url=jdbc:mysql://127.0.0.1/database?serverTimezone=GMT%2B8
# Database username
spring.datasource.username=root
# Database password
spring.datasource.password=1234
# JDBC driver class
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
# Location of the mapper XML files
mybatis.mapper-locations=classpath:/mappers/*Mapper.xml
# Package scanned for type aliases (the data objects)
mybatis.type-aliases-package=com.example.data.pojo
mybatis.configuration.call-setters-on-nulls=true
Step 3. Write the pojo data object
Code:

package com.example.data.pojo;

import org.springframework.stereotype.Component;

import java.util.Date;

@Component
public class Car {
    private Integer id;
    private String title;
    private String editor_name1;
    private String editor_name2;
    private String editor_name3;
    private String editor_remark1;
    private String editor_remark2;
    private String editor_remark3;
    private String img;
    private Double test_speed;
    private Double test_oil;
    private Date created;
    private Date updated;

    public Car() {
    }
    public Car(Integer id, String title, String editor_name1, String editor_name2, String editor_name3, String editor_remark1, String editor_remark2, String editor_remark3, String img, Double test_speed, Double test_oil, Date created, Date updated) {
        this.id = id;
        this.title = title;
        this.editor_name1 = editor_name1;
        this.editor_name2 = editor_name2;
        this.editor_name3 = editor_name3;
        this.editor_remark1 = editor_remark1;
        this.editor_remark2 = editor_remark2;
        this.editor_remark3 = editor_remark3;
        this.img = img;
        this.test_speed = test_speed;
        this.test_oil = test_oil;
        this.created = created;
        this.updated = updated;
    }

    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getEditor_name1() {
        return editor_name1;
    }

    public void setEditor_name1(String editor_name1) {
        this.editor_name1 = editor_name1;
    }

    public String getEditor_name2() {
        return editor_name2;
    }

    public void setEditor_name2(String editor_name2) {
        this.editor_name2 = editor_name2;
    }

    public String getEditor_name3() {
        return editor_name3;
    }

    public void setEditor_name3(String editor_name3) {
        this.editor_name3 = editor_name3;
    }

    public String getEditor_remark1() {
        return editor_remark1;
    }

    public void setEditor_remark1(String editor_remark1) {
        this.editor_remark1 = editor_remark1;
    }

    public String getEditor_remark2() {
        return editor_remark2;
    }

    public void setEditor_remark2(String editor_remark2) {
        this.editor_remark2 = editor_remark2;
    }

    public String getEditor_remark3() {
        return editor_remark3;
    }

    public void setEditor_remark3(String editor_remark3) {
        this.editor_remark3 = editor_remark3;
    }

    public String getImg() {
        return img;
    }

    public void setImg(String img) {
        this.img = img;
    }

    public Double getTest_speed() {
        return test_speed;
    }

    public void setTest_speed(Double test_speed) {
        this.test_speed = test_speed;
    }

    public Double getTest_oil() {
        return test_oil;
    }

    public void setTest_oil(Double test_oil) {
        this.test_oil = test_oil;
    }

    public Date getCreated() {
        return created;
    }

    public void setCreated(Date created) {
        this.created = created;
    }

    public Date getUpdated() {
        return updated;
    }

    public void setUpdated(Date updated) {
        this.updated = updated;
    }

    @Override
    public String toString() {
        return "Car{" +
                "id=" + id +
                ", title='" + title + '\'' +
                ", editor_name1='" + editor_name1 + '\'' +
                ", editor_name2='" + editor_name2 + '\'' +
                ", editor_name3='" + editor_name3 + '\'' +
                ", editor_remark1='" + editor_remark1 + '\'' +
                ", editor_remark2='" + editor_remark2 + '\'' +
                ", editor_remark3='" + editor_remark3 + '\'' +
                ", img='" + img + '\'' +
                ", test_speed=" + test_speed +
                ", test_oil=" + test_oil +
                ", created=" + created +
                ", updated=" + updated +
                '}';
    }
}

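I recommend keeping the pojo property names and types consistent with the database column names and types. For the problems that can arise otherwise, read up on MyBatis; one reference course is 《网易云课堂-MyBatis视频教程》 (the MyBatis video tutorial on NetEase Cloud Classroom).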

Step 4. Write the database CRUD operations in the mapper and service layers. I only needed insert and query; the full code follows:

CarMapper

package com.example.data.mapper;

import com.example.data.pojo.Car;
import org.apache.ibatis.annotations.*;

import java.util.ArrayList;

@Mapper
public interface CarMapper {

    @Insert("insert into car_test (title,editor_name1,editor_name2,editor_name3,editor_remark1,editor_remark2,editor_remark3,img,test_speed,test_oil,created,updated) values (#{title},#{editor_name1},#{editor_name2},#{editor_name3},#{editor_remark1},#{editor_remark2},#{editor_remark3},#{img},#{test_speed},#{test_oil},#{created},#{updated}) ")
    public void insertCar(Car car);

    @Select("select title from car_test  limit #{param1},#{param2}")
    public ArrayList<String> queryAllTitle(int page,int num);
    
    @Select("select * from car_test  where title=#{title}")
    public ArrayList<Car> queryCars(String title);
}

CarService

package com.example.data.service;

import com.example.data.mapper.CarMapper;
import com.example.data.pojo.Car;
import org.springframework.stereotype.Service;

import javax.annotation.Resource;
import java.util.ArrayList;

@Service("carService")
public class CarService {
    @Resource(name="carMapper")
    CarMapper carMapper;
    public CarMapper getCarMapper() {
        return carMapper;
    }
    public void setCarMapper(CarMapper carMapper) {
        this.carMapper = carMapper;
    }

    public void insertCar(Car car){
        carMapper.insertCar(car);
     }

    public ArrayList<String> queryAllTitle(int page,int num){
        //return carMapper.queryAllTitle(page,num);
        return carMapper.queryAllTitle((page-1)*num,num);
    }

    public ArrayList<Car> queryCars(String title){
        return carMapper.queryCars(title);
    }
}
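
The `queryAllTitle` method above turns a 1-based page number into the row offset that MySQL's `LIMIT offset, count` expects. Here is a minimal standalone sketch of that arithmetic, with a guard for invalid input that the service method does not have. The class and method names are illustrative and not part of the project.

```java
class PageOffsetDemo {

    // Converts a 1-based page number and a page size into a LIMIT offset.
    static int offset(int page, int num) {
        if (page < 1 || num < 1) {
            throw new IllegalArgumentException("page and num must be >= 1");
        }
        return (page - 1) * num;
    }

    public static void main(String[] args) {
        System.out.println(offset(1, 20)); // first page starts at row 0
        System.out.println(offset(3, 20)); // third page starts at row 40
    }
}
```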


Step 5. Configure the utility classes: the HTTP connection manager and the job scheduling settings
1. HttpClientConnection, the HttpClient connection manager:

package com.example.data.common;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class HttpClientConnection {

    @Bean // register the connection manager as a bean
    public PoolingHttpClientConnectionManager poolingHttpClientConnectionManager(){
        // create the connection manager
        PoolingHttpClientConnectionManager cm=new PoolingHttpClientConnectionManager();
        // maximum total connections
        cm.setMaxTotal(200);
        // maximum connections per route (host)
        cm.setDefaultMaxPerRoute(20);

        return cm;
    }
}

2. Scheduler, the job trigger configuration:

package com.example.data.common;

import com.example.data.job.CrawlerAutohomeJob;
import org.quartz.CronTrigger;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.quartz.CronTriggerFactoryBean;
import org.springframework.scheduling.quartz.JobDetailFactoryBean;
import org.springframework.scheduling.quartz.SchedulerFactoryBean;

@Configuration
public class Scheduler {

    @Bean("closeConnectJobBean")
    public JobDetailFactoryBean closeConnectJobBean(){
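        // create the job detail factory bean
        JobDetailFactoryBean jobDetailFactoryBean=new JobDetailFactoryBean();
        // key under which the Spring ApplicationContext is stored in the job data map,
        // so the job can look the context up at run time
        jobDetailFactoryBean.setApplicationContextJobDataKey("context");
        // the job class to run
        jobDetailFactoryBean.setJobClass(CrawlerAutohomeJob.class);
        // keep the job stored even when no trigger is attached to it
        jobDetailFactoryBean.setDurability(true);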

        return jobDetailFactoryBean;
    }

    @Bean("closeConnectJobTrigger") // trigger for the job defined above
    // @Qualifier injects the bean by name
    public CronTriggerFactoryBean cronTriggerFactoryBean(
            @Qualifier(value = "closeConnectJobBean")JobDetailFactoryBean itemJobBean){
        // create the cron trigger factory bean
        CronTriggerFactoryBean cronTriggerFactoryBean=new CronTriggerFactoryBean();
        // bind the trigger to the job detail
        cronTriggerFactoryBean.setJobDetail(itemJobBean.getObject());
        // cron expression: fire every 5 seconds
        cronTriggerFactoryBean.setCronExpression("0/5 * * * * ?");

        return cronTriggerFactoryBean;
    }

    @Bean
    public SchedulerFactoryBean schedulerFactoryBean(CronTrigger[] cronTriggerImpl){
        // create the scheduler factory
        SchedulerFactoryBean schedulerFactoryBean=new SchedulerFactoryBean();
        // register the triggers with the scheduler
        schedulerFactoryBean.setTriggers(cronTriggerImpl);

        return  schedulerFactoryBean;
    }
}
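
The seconds field `0/5` in the cron expression above means "start at second 0, then every 5 seconds". As a rough illustration of how such a `start/step` field expands, here is a simplified sketch. It is not Quartz's actual parser, and `CronStepDemo` and `expand` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class CronStepDemo {

    // Expands a "start/step" field into the matching values below maxExclusive.
    static List<Integer> expand(String field, int maxExclusive) {
        String[] parts = field.split("/");
        int start = Integer.parseInt(parts[0]);
        int step = Integer.parseInt(parts[1]);
        if (step < 1) {
            throw new IllegalArgumentException("step must be >= 1");
        }
        List<Integer> values = new ArrayList<>();
        for (int v = start; v < maxExclusive; v += step) {
            values.add(v);
        }
        return values;
    }

    public static void main(String[] args) {
        // seconds within one minute at which the trigger fires
        System.out.println(expand("0/5", 60));
    }
}
```

For a crawl of 100 pages a much longer interval may be more appropriate; five seconds is simply the value used in this project.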


Step 6. Write the HTTP service class ApiService, which fetches page HTML and downloads page images:

package com.example.data.service;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Service;

import javax.annotation.Resource;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.UUID;

@Service("apiService")
public class ApiService {

    // inject the HttpClient connection manager
    @Resource
    private PoolingHttpClientConnectionManager cm;
    public PoolingHttpClientConnectionManager getCm() {
        return cm;
    }
    public void setCm(PoolingHttpClientConnectionManager cm) {
        this.cm = cm;
    }

    public String getHtml(String url) {
        // build an HttpClient backed by the shared connection manager
        CloseableHttpClient httpClient= HttpClients.custom().setConnectionManager(cm).build();
        // create the GET request
        HttpGet httpGet=new HttpGet(url);
        // set the User-Agent header (left empty here)
        httpGet.setHeader("User-Agent","");
        // apply the request timeouts
        httpGet.setConfig(this.getConfig());

        CloseableHttpResponse response=null;
        try{
            // execute the request
            response=httpClient.execute(httpGet);
            // only handle 200 OK responses
            if(response.getStatusLine().getStatusCode()==200){
                String html="";
                // EntityUtils.toString fails on a null entity, so check first
                if(response.getEntity()!=null){
                   html= EntityUtils.toString(response.getEntity(),"UTF-8");
                }
                // return the page HTML
                return html;
            }
        }catch (Exception e){
            e.printStackTrace();
        }finally {
            try{
                if(response!=null){
                    // close the response so the connection goes back to the pool
                    response.close();
                }
                // do not close httpClient: it would shut down the shared connection manager
                /*httpClient.close();*/
            }catch(Exception e){
                e.printStackTrace();
            }
        }
        return null;
    }

    public String getImage(String url){
        // build an HttpClient backed by the shared connection manager
        CloseableHttpClient httpClient= HttpClients.custom().setConnectionManager(cm).build();
        // create the GET request
        HttpGet httpGet=new HttpGet(url);
        // set the User-Agent header (left empty here)
        httpGet.setHeader("User-Agent","");
        // apply the request timeouts
        httpGet.setConfig(this.getConfig());

        CloseableHttpResponse response=null;
        try {
            // execute the request
            response=httpClient.execute(httpGet);
            // only handle 200 OK responses
            if(response.getStatusLine().getStatusCode()==200){
                String contentType =response.getEntity().getContentType().getValue();
                // derive the extension from the content type, e.g. image/jpeg -> .jpeg
                String extName="."+contentType.split("/")[1];
                // random file name
                String imgName= UUID.randomUUID().toString()+extName;
                // output location; try-with-resources closes the file stream when done
                try(OutputStream outputStream=new FileOutputStream(new File("E:/Spring/database/src/main/resources/image/car/"+imgName))){
                    // write the response body to the file
                    response.getEntity().writeTo(outputStream);
                }
                // return the image file name
                return imgName;
            }
        }catch (Exception e){
            e.printStackTrace();
        }finally {
            try{
                if(response!=null){
                    // close the response so the connection goes back to the pool
                    response.close();
                }
                // do not close httpClient: it would shut down the shared connection manager
                /*httpClient.close();*/
            }catch (Exception e){
                e.printStackTrace();
            }
        }
        return null;
    }

    // builds the per-request timeout settings (all values in milliseconds)
    private RequestConfig getConfig(){
        RequestConfig config=RequestConfig.custom().setConnectTimeout(1000)// timeout for establishing a connection
                .setConnectionRequestTimeout(500)// timeout for obtaining a connection from the pool
                .setSocketTimeout(10000)// timeout while waiting for data on the socket
                .build();
        return config;
    }
}

In getImage, UUID.randomUUID().toString()+extName produces a random image file name, which looks like this:
[image]
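
To make the naming step concrete, here is a small standalone sketch of deriving a file name from a Content-Type value. It also drops any parameter after a semicolon (for example `image/png; charset=UTF-8`), a case that the `split("/")[1]` in getImage does not handle. `ImageNameDemo` and `imageName` are illustrative names, not part of the project.

```java
import java.util.UUID;

class ImageNameDemo {

    // Builds "<random uuid>.<subtype>" from a Content-Type such as "image/jpeg".
    static String imageName(String contentType) {
        // keep only the media type, dropping parameters like "; charset=UTF-8"
        String mediaType = contentType.split(";")[0].trim();
        int slash = mediaType.indexOf('/');
        if (slash < 0 || slash == mediaType.length() - 1) {
            throw new IllegalArgumentException("unexpected content type: " + contentType);
        }
        String ext = mediaType.substring(slash + 1);
        return UUID.randomUUID().toString() + "." + ext;
    }

    public static void main(String[] args) {
        System.out.println(imageName("image/jpeg"));
        System.out.println(imageName("image/png; charset=UTF-8"));
    }
}
```

Random names avoid collisions between images from different cars, at the cost of not being able to tell from the name which car an image belongs to; the database row keeps that link.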
Step 7. Write the job class CrawlerAutohomeJob, which holds the code this crawler actually runs:

package com.example.data.job;

import com.example.data.common.TitleFilter;
import com.example.data.pojo.Car;
import com.example.data.service.ApiService;
import com.example.data.service.CarService;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.quartz.DisallowConcurrentExecution;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationContext;
import org.springframework.scheduling.quartz.QuartzJobBean;

import javax.annotation.Resource;
import java.util.Date;

@DisallowConcurrentExecution // do not start a new run while the previous one is still executing
public class CrawlerAutohomeJob extends QuartzJobBean {

    @Resource(name="apiService")
    ApiService apiService;
    public ApiService getApiService() {
        return apiService;
    }
    public void setApiService(ApiService apiService) {
        this.apiService = apiService;
    }

    @Resource(name="carService")
    CarService carService;
    public CarService getCarService() {
        return carService;
    }
    public void setCarService(CarService carService) {
        this.carService = carService;
    }

    @Autowired
    TitleFilter titleFilter;
    @Override
    protected void executeInternal(JobExecutionContext jobExecutionContext) throws JobExecutionException {
        ApplicationContext applicationContext= (ApplicationContext) jobExecutionContext.getJobDetail().getJobDataMap().get("context");
        /*applicationContext.getBean(PoolingHttpClientConnectionManager.class).closeExpiredConnections();*/
        apiService=applicationContext.getBean(ApiService.class);
        carService=applicationContext.getBean(CarService.class);
        titleFilter=applicationContext.getBean(TitleFilter.class);

        for(int i=1;i<=100;i++) {
            testCrawler(i);
        }
    }

    public void testCrawler(int val){
        String html=apiService.getHtml("https://www.autohome.com.cn/bestauto/"+val);
        if(html==null){
            return; // the request failed, so there is nothing to parse for this page
        }
        Document dom= Jsoup.parse(html);
        // select the review blocks
        Elements divs=dom.select("#bestautocontent div .uibox");

        for(Element div : divs){
            // each div holds the review of one car
            String title=div.select("div.uibox-title").first().text();
            if(!carService.queryCars(title).isEmpty()){
                continue; // already stored: skip it and move on to the next block
            }
            Car car=getCar(div);
            // download the review images and record their file names
            String img =getCarImage(div);
            car.setImg(img);
            Date date=new Date();
            /*SimpleDateFormat simpleDateFormat=new SimpleDateFormat("yyyy-MM-dd hh:mm:ss");
            String time=simpleDateFormat.format(date);*/
            car.setCreated(date);
            car.setUpdated(date);
            carService.insertCar(car);
        }
    }

    // build a Car from the given review div
    private Car getCar(Element div){
        Car car=new Car();
        String title=div.select("div.uibox-title").first().text();
        // each measured value ends with a one-character unit, stripped before parsing
        String speedText=div.select(".tabbox1 dd:nth-child(2)  div.dd-div2").first().text();
        String oilText=div.select(".tabbox1 dd:nth-child(3)  div.dd-div2").first().text();
        Double test_speed= Double.parseDouble(speedText.substring(0,speedText.length()-1));
        Double test_oil= Double.parseDouble(oilText.substring(0,oilText.length()-1));
        String editor_name1=div.select(".tabbox2   dd:nth-child(2) div.dd-div1").first().text();
        String editor_name2=div.select(".tabbox2   dd:nth-child(3) div.dd-div1").first().text();
        String editor_name3=div.select(".tabbox2   dd:nth-child(4) div.dd-div1").first().text();
        String editor_remark1=div.select(".tabbox2 dd:nth-child(2)  div.dd-div3-pp").first().text();
        String editor_remark2=div.select(".tabbox2 dd:nth-child(3)  div.dd-div3-pp").first().text();
        String editor_remark3=div.select(".tabbox2 dd:nth-child(4)  div.dd-div3-pp").first().text();

        car.setEditor_name1(editor_name1);
        car.setEditor_name2(editor_name2);
        car.setEditor_name3(editor_name3);
        car.setEditor_remark1(editor_remark1);
        car.setEditor_remark2(editor_remark2);
        car.setEditor_remark3(editor_remark3);
        car.setTest_speed(test_speed);
        car.setTest_oil(test_oil);
        car.setTitle(title);
        return car;
    }
    // download every image in the given review div; returns the file names joined by "@"
    private String getCarImage(Element div){
        String image="";
        Elements page=div.select("ul.piclist02 li");
        for(Element i :page){
            String imgUrl="https:"+i.select("img").attr("src");
            String imgName=apiService.getImage(imgUrl);
            if(imgName==null){
                continue; // download failed: leave this image out
            }
            image+=imgName+"@";
        }
        // drop the trailing "@" (strings are compared by content, not with !=)
        if(!image.isEmpty()) {
            image = image.substring(0, image.length() - 1);
        }
        return image;
    }
}
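
The img column ends up holding several file names joined by `@`. Below is a standalone sketch of that encoding and of reading it back, assuming file names never contain `@` (which holds for the UUID-based names generated in getImage). `ImageListDemo` and its methods are illustrative and not part of the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ImageListDemo {

    // Joins the non-null names with "@"; returns "" when there is nothing to join.
    static String join(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (name != null) {
                kept.add(name);
            }
        }
        return String.join("@", kept);
    }

    // Splits a stored value back into the individual file names.
    static List<String> split(String stored) {
        if (stored == null || stored.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(stored.split("@"));
    }

    public static void main(String[] args) {
        // the null stands for an image whose download failed
        String stored = join(Arrays.asList("a.jpeg", null, "b.png"));
        System.out.println(stored);
        System.out.println(split(stored));
    }
}
```

Using `String.join` also removes the need to trim a trailing separator afterwards.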

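The hardest part for me was analyzing the DOM and pulling the data out of it.
[image]
It took a lot of trial, error, and debugging to get right. Below is the markup that has to be analyzed, which holds all the data for one car. The approach is to work from the outer element inward, step by step, until you reach the DOM elements that contain the data:
[image]
The code above is the version that runs for real. If you want to debug it and find problems before running the whole project, you can first write a test class like this: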

package com.example.data.service;

import com.example.data.DataApplication;
import com.example.data.common.MyFilter;
import com.example.data.common.TitleFilter;
import com.example.data.pojo.Car;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

import javax.annotation.Resource;
import java.util.Date;

@RunWith(SpringJUnit4ClassRunner.class)
@SpringBootTest(classes= DataApplication.class)
public class AutoCrawlerTest {

    @Resource(name="apiService")
    ApiService apiService;
    public ApiService getApiService() {
        return apiService;
    }
    public void setApiService(ApiService apiService) {
        this.apiService = apiService;
    }

    @Resource(name="carService")
    CarService carService;
    public CarService getCarService() {
        return carService;
    }
    public void setCarService(CarService carService) {
        this.carService = carService;
    }

    @Autowired
    TitleFilter titleFilter;

    @Test
    public void testCrawler(){
        String html=apiService.getHtml("https://www.autohome.com.cn/bestauto/1");
        if(html==null){
            throw new IllegalStateException("failed to fetch the page");
        }
        Document dom= Jsoup.parse(html);
        // select the review blocks
        Elements divs=dom.select("#bestautocontent div .uibox");

        for(Element div : divs){
            // each div holds the review of one car
            String title=div.select("div.uibox-title").first().text();
            if(!carService.queryCars(title).isEmpty()){
                continue; // already stored: skip it
            }
            Car car=getCar(div);
            // download the review images and record their file names
            String img =getCarImage(div);
            car.setImg(img);
            Date date=new Date();
            /*SimpleDateFormat simpleDateFormat=new SimpleDateFormat("yyyy-MM-dd hh:mm:ss");
            String time=simpleDateFormat.format(date);*/
            car.setCreated(date);
            car.setUpdated(date);
            carService.insertCar(car);
        }
    }

    // build a Car from the given review div
    private Car getCar(Element div){
        Car car=new Car();
        String title=div.select("div.uibox-title").first().text();
        // each measured value ends with a one-character unit, stripped before parsing
        String speedText=div.select(".tabbox1 dd:nth-child(2)  div.dd-div2").first().text();
        String oilText=div.select(".tabbox1 dd:nth-child(3)  div.dd-div2").first().text();
        Double test_speed= Double.parseDouble(speedText.substring(0,speedText.length()-1));
        Double test_oil= Double.parseDouble(oilText.substring(0,oilText.length()-1));
        String editor_name1=div.select(".tabbox2   dd:nth-child(2) div.dd-div1").first().text();
        String editor_name2=div.select(".tabbox2   dd:nth-child(3) div.dd-div1").first().text();
        String editor_name3=div.select(".tabbox2   dd:nth-child(4) div.dd-div1").first().text();
        String editor_remark1=div.select(".tabbox2 dd:nth-child(2)  div.dd-div3-pp").first().text();
        String editor_remark2=div.select(".tabbox2 dd:nth-child(3)  div.dd-div3-pp").first().text();
        String editor_remark3=div.select(".tabbox2 dd:nth-child(4)  div.dd-div3-pp").first().text();

        car.setEditor_name1(editor_name1);
        car.setEditor_name2(editor_name2);
        car.setEditor_name3(editor_name3);
        car.setEditor_remark1(editor_remark1);
        car.setEditor_remark2(editor_remark2);
        car.setEditor_remark3(editor_remark3);
        car.setTest_speed(test_speed);
        car.setTest_oil(test_oil);
        car.setTitle(title);
        return car;
    }
    // download every image in the given review div; returns the file names joined by "@"
    private String getCarImage(Element div){
        String image="";
        Elements page=div.select("ul.piclist02 li");
        for(Element i :page){
            String imgUrl="https:"+i.select("img").attr("src");
            String imgName=apiService.getImage(imgUrl);
            if(imgName==null){
                continue; // download failed: leave this image out
            }
            image+=imgName+"@";
        }
        // drop the trailing "@" (strings are compared by content, not with !=)
        if(!image.isEmpty()) {
            image = image.substring(0, image.length() - 1);
        }
        return image;
    }
}

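Step 8. Run the project and test it
All the code is now written, so it is time to run the project by starting the application entry point, DataApplication (that is the name in my project; run whatever yours is called).
[image]
It runs, and the crawl succeeds:
Data written to the database:
[image]
[image]
Downloaded images:
[image]
Note: while it runs, you will see errors whenever a page lacks the DOM elements we try to extract. This is expected and nothing to worry about.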
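
One way to make those errors less disruptive is to treat a missing or malformed value as absent rather than letting the exception abort the page. Here is a minimal sketch for the numeric fields, assuming the value is a number followed by a short unit such as `7.9s` (the exact text on the page may differ). `MeasurementDemo` and `parse` are hypothetical names, not part of the project.

```java
class MeasurementDemo {

    // Returns the leading number of text like "7.9s", or null if there is none.
    static Double parse(String text) {
        if (text == null) {
            return null;
        }
        String trimmed = text.trim();
        int end = 0;
        // accept digits and decimal points at the start of the text
        while (end < trimmed.length()
                && (Character.isDigit(trimmed.charAt(end)) || trimmed.charAt(end) == '.')) {
            end++;
        }
        if (end == 0) {
            return null;
        }
        try {
            return Double.parseDouble(trimmed.substring(0, end));
        } catch (NumberFormatException e) {
            return null; // for example "." or "1.2.3"
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("7.9s"));
        System.out.println(parse("10.5L/100km"));
        System.out.println(parse("--"));
    }
}
```

A null result can then be stored as-is, since the Car fields are Double objects rather than primitives.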

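To recap: this post records a project from my crawler studies. It crawls Autohome (汽车之家) with a simple Java crawler of the kind suited to plain pages and pages that do not force a login.


Reference video I followed while learning: 《腾讯课堂-java爬虫项目|抓取汽车之家百万数据》 (on Tencent Classroom)
Full source code for this project: 《CSDN-当java遇上爬虫,我的数据库再也不缺数据了项目详细源代码》 (on CSDN)
If this post helps you, you can study it and the source code alongside the video mentioned above. My code differs from the video's in a few places and is offered only as a reference.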




Thanks for your support. If you run into questions while learning, feel free to get in touch so we can work through them and improve together.


------ End ------
