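Preface:
A web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.
Put simply, a crawler is a web robot that collects and organizes data from the internet on people's behalf. Picture the internet as a large web and the crawler as a spider moving across it: whenever it comes across its prey (the resources it needs), it grabs it.
This post is a detailed record of a project I built while learning to write crawlers. It mainly uses Java, Spring Boot, MyBatis, MySQL, and HttpClient, and the site being crawled is Autohome (汽车之家, https://www.autohome.com.cn).
It is a simple crawler written in Java, suited mainly to simple pages and to pages that do not force a login.
Let's begin:
Step 1. Analyze the site's data, work out which fields are needed, and create the database. Here we need the car name, acceleration, fuel consumption, editors, editors' comments, car images, and so on.
This is the part of the page we need to extract data from:
The database I created looks like this: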
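A table consistent with the insert statement used in step 4 might be sketched as follows. The table and column names come from that statement and from the Car POJO; the column types and lengths, and the `CarTableDdl` class that holds the statement, are assumptions made for illustration rather than the actual schema.

```java
/** Holds a hypothetical DDL statement for the car_test table (types and lengths are assumptions). */
final class CarTableDdl {
    private CarTableDdl() {}

    // Column names mirror the fields of the Car POJO; img stores '@'-separated file names.
    static final String CREATE_TABLE = String.join("\n",
        "CREATE TABLE IF NOT EXISTS car_test (",
        "  id INT AUTO_INCREMENT PRIMARY KEY,",
        "  title VARCHAR(255) NOT NULL,",
        "  editor_name1 VARCHAR(64),",
        "  editor_name2 VARCHAR(64),",
        "  editor_name3 VARCHAR(64),",
        "  editor_remark1 TEXT,",
        "  editor_remark2 TEXT,",
        "  editor_remark3 TEXT,",
        "  img TEXT,",
        "  test_speed DOUBLE,",
        "  test_oil DOUBLE,",
        "  created DATETIME,",
        "  updated DATETIME",
        ") DEFAULT CHARSET = utf8mb4");

    public static void main(String[] args) {
        // Print the statement so it can be pasted into a MySQL client.
        System.out.println(CREATE_TABLE);
    }
}
```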
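Step 2. Create the crawler project (I used Spring Boot) and configure it.
1. The project structure is as follows:
common: common utilities (the connection configuration and the job parameter classes live here)
controller: controller layer (not needed this time)
job: job layer (the main crawler code lives here)
mapper: database access layer
pojo: data objects
service: service layer
image: image storage
htmll: saved web pages (not needed in this project)
2. Project configuration:
// Application entry point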
package com.example.data;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
@SpringBootApplication
public class DataApplication {
public static void main(String[] args) {
SpringApplication.run(DataApplication.class, args);
}
}
pom.xml dependencies
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.3.5.RELEASE</version>
<relativePath/> <!-- lookup parent from repository -->
</parent>
<groupId>com.example</groupId>
<artifactId>data</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>data</name>
<description>Demo project for Spring Boot</description>
<properties>
<java.version>1.8</java.version>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jdbc</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-jdbc</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-thymeleaf</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web-services</artifactId>
</dependency>
<dependency>
<groupId>org.mybatis.spring.boot</groupId>
<artifactId>mybatis-spring-boot-starter</artifactId>
<version>2.1.3</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-devtools</artifactId>
<scope>runtime</scope>
<optional>true</optional>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>net.sf.json-lib</groupId>
<artifactId>json-lib</artifactId>
<version>2.2.3</version>
<classifier>jdk15</classifier>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.4</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.4</version>
</dependency>
<dependency>
<groupId>org.quartz-scheduler</groupId>
<artifactId>quartz</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context-support</artifactId>
<version>5.2.8.RELEASE</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.12</version>
</dependency>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.11.3</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<!-- version is managed by spring-boot-starter-parent; pinning 1.5.9.RELEASE conflicts with 2.3.5.RELEASE -->
<scope>test</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.13</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-api</artifactId>
<version>5.6.3</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
</plugins>
</build>
</project>
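If the dependencies above cause errors, the likely cause is a version conflict; look up compatible versions in the Maven Repository.
application.properties data source configuration
# Database to connect to
spring.datasource.url=jdbc:mysql://127.0.0.1/database?serverTimezone=GMT%2B8
# Database user name
spring.datasource.username=root
# Database password
spring.datasource.password=1234
# JDBC driver class
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
# Location of the mapper XML files
mybatis.mapper-locations=classpath:/mappers/*Mapper.xml
# Package scanned for type aliases (the POJO package of this project)
mybatis.type-aliases-package=com.example.data.pojo
mybatis.configuration.call-setters-on-nulls=true
Note that a .properties file only treats # as a comment at the start of a line; a trailing "#..." after a value becomes part of the value, so the comments are placed on their own lines.
Step 3. Write the POJO data object
Code: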
package com.example.data.pojo;
import org.springframework.stereotype.Component;
import java.util.Date;
@Component
public class Car {
private Integer id;
private String title;
private String editor_name1;
private String editor_name2;
private String editor_name3;
private String editor_remark1;
private String editor_remark2;
private String editor_remark3;
private String img;
private Double test_speed;
private Double test_oil;
private Date created;
private Date updated;
public Car() {
}
public Car(Integer id, String title, String editor_name1, String editor_name2, String editor_name3, String editor_remark1, String editor_remark2, String editor_remark3, String img, Double test_speed, Double test_oil, Date created, Date updated) {
this.id = id;
this.title = title;
this.editor_name1 = editor_name1;
this.editor_name2 = editor_name2;
this.editor_name3 = editor_name3;
this.editor_remark1 = editor_remark1;
this.editor_remark2 = editor_remark2;
this.editor_remark3 = editor_remark3;
this.img = img;
this.test_speed = test_speed;
this.test_oil = test_oil;
this.created = created;
this.updated = updated;
}
public Integer getId() {
return id;
}
public void setId(Integer id) {
this.id = id;
}
public String getTitle() {
return title;
}
public void setTitle(String title) {
this.title = title;
}
public String getEditor_name1() {
return editor_name1;
}
public void setEditor_name1(String editor_name1) {
this.editor_name1 = editor_name1;
}
public String getEditor_name2() {
return editor_name2;
}
public void setEditor_name2(String editor_name2) {
this.editor_name2 = editor_name2;
}
public String getEditor_name3() {
return editor_name3;
}
public void setEditor_name3(String editor_name3) {
this.editor_name3 = editor_name3;
}
public String getEditor_remark1() {
return editor_remark1;
}
public void setEditor_remark1(String editor_remark1) {
this.editor_remark1 = editor_remark1;
}
public String getEditor_remark2() {
return editor_remark2;
}
public void setEditor_remark2(String editor_remark2) {
this.editor_remark2 = editor_remark2;
}
public String getEditor_remark3() {
return editor_remark3;
}
public void setEditor_remark3(String editor_remark3) {
this.editor_remark3 = editor_remark3;
}
public String getImg() {
return img;
}
public void setImg(String img) {
this.img = img;
}
public Double getTest_speed() {
return test_speed;
}
public void setTest_speed(Double test_speed) {
this.test_speed = test_speed;
}
public Double getTest_oil() {
return test_oil;
}
public void setTest_oil(Double test_oil) {
this.test_oil = test_oil;
}
public Date getCreated() {
return created;
}
public void setCreated(Date created) {
this.created = created;
}
public Date getUpdated() {
return updated;
}
public void setUpdated(Date updated) {
this.updated = updated;
}
@Override
public String toString() {
return "Car{" +
"id=" + id +
", title='" + title + '\'' +
", editor_name1='" + editor_name1 + '\'' +
", editor_name2='" + editor_name2 + '\'' +
", editor_name3='" + editor_name3 + '\'' +
", editor_remark1='" + editor_remark1 + '\'' +
", editor_remark2='" + editor_remark2 + '\'' +
", editor_remark3='" + editor_remark3 + '\'' +
", img='" + img + '\'' +
", test_speed=" + test_speed +
", test_oil=" + test_oil +
", created=" + created +
", updated=" + updated +
'}';
}
}
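I recommend keeping the POJO property names and types consistent with the database column names and types. For the problems that can arise otherwise, read up on MyBatis; a video course for reference: 《网易云课堂-MyBatis视频教程》 (NetEase Cloud Classroom, MyBatis video tutorial).
Step 4. Write the database operations in the mapper and service layers. I only needed insert and query; the code is as follows: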
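If the columns stay in snake_case but camelCase Java properties are preferred, MyBatis has a `mapUnderscoreToCamelCase` setting (in Spring Boot, `mybatis.configuration.map-underscore-to-camel-case=true`). The sketch below only illustrates the naming rule that such a mapping follows; `ColumnNames` is a hypothetical helper written for this post, not MyBatis code.

```java
import java.util.Locale;

/** Illustrates the snake_case to camelCase naming rule (hypothetical helper, not part of MyBatis). */
final class ColumnNames {
    private ColumnNames() {}

    static String toCamelCase(String column) {
        StringBuilder sb = new StringBuilder();
        boolean upperNext = false;
        for (char c : column.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c == '_') {
                // Drop the underscore and capitalize the character that follows it.
                upperNext = true;
            } else if (upperNext) {
                sb.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCamelCase("editor_name1"));
        System.out.println(toCamelCase("test_speed"));
    }
}
```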
CarMapper
package com.example.data.mapper;
import com.example.data.pojo.Car;
import org.apache.ibatis.annotations.*;
import java.util.ArrayList;
@Mapper
public interface CarMapper {
@Insert("insert into car_test (title,editor_name1,editor_name2,editor_name3,editor_remark1,editor_remark2,editor_remark3,img,test_speed,test_oil,created,updated) values (#{title},#{editor_name1},#{editor_name2},#{editor_name3},#{editor_remark1},#{editor_remark2},#{editor_remark3},#{img},#{test_speed},#{test_oil},#{created},#{updated}) ")
public void insertCar(Car car);
@Select("select title from car_test limit #{param1},#{param2}")
public ArrayList<String> queryAllTitle(int page,int num);
@Select("select * from car_test where title=#{title}")
public ArrayList<Car> queryCars(String title);
}
CarService
package com.example.data.service;
import com.example.data.mapper.CarMapper;
import com.example.data.pojo.Car;
import org.springframework.stereotype.Service;
import javax.annotation.Resource;
import java.util.ArrayList;
@Service("carService")
public class CarService {
@Resource(name="carMapper")
CarMapper carMapper;
public CarMapper getCarMapper() {
return carMapper;
}
public void setCarMapper(CarMapper carMapper) {
this.carMapper = carMapper;
}
public void insertCar(Car car){
carMapper.insertCar(car);
}
public ArrayList<String> queryAllTitle(int page,int num){
//return carMapper.queryAllTitle(page,num);
return carMapper.queryAllTitle((page-1)*num,num);
}
public ArrayList<Car> queryCars(String title){
return carMapper.queryCars(title);
}
}
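queryAllTitle takes a 1-based page number and converts it into the row offset that MySQL's "limit offset, count" expects. A minimal sketch of that conversion, with a hypothetical `PageOffset` helper and an argument check that the service above does not perform:

```java
/** Converts a 1-based page number into the row offset used by "LIMIT offset, count". */
final class PageOffset {
    private PageOffset() {}

    static int offsetFor(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        // Page 1 starts at row 0, page 2 at row pageSize, and so on.
        return (page - 1) * pageSize;
    }

    public static void main(String[] args) {
        System.out.println(offsetFor(1, 20));
        System.out.println(offsetFor(3, 20));
    }
}
```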
Step 5. Configure the utility classes: the HTTP connection and the job's run parameters
1. HttpClientConnection: the HttpClient connection manager. Code:
package com.example.data.common;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
@Configuration
public class HttpClientConnection {
    // Expose the connection manager as a bean
    @Bean
    public PoolingHttpClientConnectionManager poolingHttpClientConnectionManager() {
        // Create the connection manager
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        // Maximum total number of connections
        cm.setMaxTotal(200);
        // Maximum number of connections per route (host)
        cm.setDefaultMaxPerRoute(20);
        return cm;
    }
}
2. Scheduler: the job and trigger configuration. Code:
package com.example.data.common;
import com.example.data.job.CrawlerAutohomeJob;
import org.quartz.CronTrigger;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.quartz.CronTriggerFactoryBean;
import org.springframework.scheduling.quartz.JobDetailFactoryBean;
import org.springframework.scheduling.quartz.SchedulerFactoryBean;
@Configuration
public class Scheduler {
    @Bean("closeConnectJobBean")
    public JobDetailFactoryBean closeConnectJobBean() {
        // Create the job detail factory bean
        JobDetailFactoryBean jobDetailFactoryBean = new JobDetailFactoryBean();
        // Key under which the Spring ApplicationContext is stored in the job data map;
        // the job uses this key to get hold of the context
        jobDetailFactoryBean.setApplicationContextJobDataKey("context");
        // Set the job class
        jobDetailFactoryBean.setJobClass(CrawlerAutohomeJob.class);
        // Keep the job stored even when no trigger is bound to it
        jobDetailFactoryBean.setDurability(true);
        return jobDetailFactoryBean;
    }

    // Define the cron trigger for the job above; @Qualifier injects the bean by name
    @Bean("closeConnectJobTrigger")
    public CronTriggerFactoryBean cronTriggerFactoryBean(
            @Qualifier(value = "closeConnectJobBean") JobDetailFactoryBean itemJobBean) {
        // Create the cron trigger factory bean
        CronTriggerFactoryBean cronTriggerFactoryBean = new CronTriggerFactoryBean();
        // Bind the trigger to the job detail
        cronTriggerFactoryBean.setJobDetail(itemJobBean.getObject());
        // Cron expression: fire every 5 seconds
        cronTriggerFactoryBean.setCronExpression("0/5 * * * * ? ");
        return cronTriggerFactoryBean;
    }

    @Bean
    public SchedulerFactoryBean schedulerFactoryBean(CronTrigger[] cronTriggerImpl) {
        // Create the scheduler
        SchedulerFactoryBean schedulerFactoryBean = new SchedulerFactoryBean();
        // Register the triggers with the scheduler
        schedulerFactoryBean.setTriggers(cronTriggerImpl);
        return schedulerFactoryBean;
    }
}
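The string passed to setCronExpression is a Quartz cron expression: six space-separated fields (seconds, minutes, hours, day-of-month, month, day-of-week) plus an optional seventh for the year. The sketch below simply labels each field of "0/5 * * * * ? " to make that layout visible. `CronFields` is a hypothetical helper, not part of Quartz, and it does not validate or schedule anything.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Labels the fields of a Quartz-style cron expression (hypothetical helper, no validation). */
final class CronFields {
    private static final String[] NAMES = {
        "seconds", "minutes", "hours", "day-of-month", "month", "day-of-week", "year"
    };

    private CronFields() {}

    static Map<String, String> describe(String cron) {
        String[] parts = cron.trim().split("\\s+");
        if (parts.length < 6 || parts.length > NAMES.length) {
            throw new IllegalArgumentException("expected 6 or 7 fields but got " + parts.length);
        }
        // LinkedHashMap keeps the fields in the order they appear in the expression.
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            fields.put(NAMES[i], parts[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // "0/5" in the seconds field means: starting at second 0, every 5 seconds.
        describe("0/5 * * * * ? ").forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```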
Step 6. Write ApiService, the HTTP service layer that fetches page HTML and downloads page images. Code:
package com.example.data.service;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Service;
import javax.annotation.Resource;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.UUID;
@Service("apiService")
public class ApiService {
    // Inject the HttpClient connection manager
    @Resource
    private PoolingHttpClientConnectionManager cm;

    public PoolingHttpClientConnectionManager getCm() {
        return cm;
    }

    public void setCm(PoolingHttpClientConnectionManager cm) {
        this.cm = cm;
    }

    public String getHtml(String url) {
        // Build an HttpClient backed by the shared connection manager
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        // Create the GET request
        HttpGet httpGet = new HttpGet(url);
        // Set the User-Agent header (the value is left empty in this post)
        httpGet.setHeader("User-Agent", "");
        // Apply the request configuration
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        try {
            // Execute the request and obtain the response
            response = httpClient.execute(httpGet);
            // Only handle successful responses
            if (response.getStatusLine().getStatusCode() == 200) {
                String html = "";
                // EntityUtils.toString fails on a null entity, so check for null first
                if (response.getEntity() != null) {
                    html = EntityUtils.toString(response.getEntity(), "UTF-8");
                }
                // Return the page HTML as a string
                return html;
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    // Close the response so the connection goes back to the pool
                    response.close();
                }
                // Do not close httpClient: that would shut down the shared connection manager
                /*httpClient.close();*/
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return null;
    }

    public String getImage(String url) {
        // Build an HttpClient backed by the shared connection manager
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        // Create the GET request
        HttpGet httpGet = new HttpGet(url);
        // Set the User-Agent header (the value is left empty in this post)
        httpGet.setHeader("User-Agent", "");
        // Apply the request configuration
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        try {
            // Execute the request and obtain the response
            response = httpClient.execute(httpGet);
            // Only handle successful responses
            if (response.getStatusLine().getStatusCode() == 200) {
                String contentType = response.getEntity().getContentType().getValue();
                // Derive the file extension from the Content-Type, e.g. image/jpeg gives .jpeg
                String extName = "." + contentType.split("/")[1];
                // Generate a random file name
                String imgName = UUID.randomUUID().toString() + extName;
                // Output location (a path on my machine; adjust it to your environment)
                File target = new File("E:/Spring/database/src/main/resources/image/car/" + imgName);
                // Write the response body to the file; try-with-resources closes the stream
                try (OutputStream outputStream = new FileOutputStream(target)) {
                    response.getEntity().writeTo(outputStream);
                }
                // Return the file name
                return imgName;
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    // Close the response so the connection goes back to the pool
                    response.close();
                }
                // Do not close httpClient: that would shut down the shared connection manager
                /*httpClient.close();*/
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return null;
    }

    // Build the request configuration (all timeouts are in milliseconds)
    private RequestConfig getConfig() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)          // timeout for establishing a connection
                .setConnectionRequestTimeout(500) // timeout for obtaining a connection from the pool
                .setSocketTimeout(10000)          // timeout while waiting for data on the socket
                .build();
        return config;
    }
}
In getImage, UUID.randomUUID().toString()+extName produces a random image file name, which looks like this:
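The naming and saving steps of getImage can also be sketched with the standard library alone. The version below additionally tolerates a Content-Type that carries parameters (such as "image/png; charset=UTF-8") and takes the target directory as an argument instead of hard-coding it. `ImageFiles`, its method names, and the ".bin" fallback are assumptions for illustration, not part of the project.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;
import java.util.UUID;

/** Hypothetical helper: derives a random image file name and stores the image bytes. */
final class ImageFiles {
    private ImageFiles() {}

    /** Maps a Content-Type such as "image/jpeg" to ".jpeg"; falls back to ".bin" (an assumption). */
    static String extensionFor(String contentType) {
        if (contentType == null) {
            return ".bin";
        }
        // Ignore parameters such as "; charset=UTF-8".
        int semi = contentType.indexOf(';');
        String mime = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim().toLowerCase(Locale.ROOT);
        int slash = mime.indexOf('/');
        if (slash < 0 || slash == mime.length() - 1) {
            return ".bin";
        }
        return "." + mime.substring(slash + 1);
    }

    /** Random UUID plus the extension, in the same style as getImage. */
    static String randomName(String contentType) {
        return UUID.randomUUID().toString() + extensionFor(contentType);
    }

    /** Copies the body into directory/fileName, creating the directory if needed. */
    static Path save(InputStream body, Path directory, String fileName) throws IOException {
        Files.createDirectories(directory);
        Path target = directory.resolve(fileName);
        // try-with-resources closes the input stream even if the copy fails.
        try (InputStream in = body) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

In the real service the input stream would come from `response.getEntity().getContent()`, and the directory could be read from a configuration property.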
Step 7. Write the job itself, CrawlerAutohomeJob, which is the code this crawler actually runs. Code:
package com.example.data.job;
import com.example.data.common.TitleFilter;
import com.example.data.pojo.Car;
import com.example.data.service.ApiService;
import com.example.data.service.CarService;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.quartz.DisallowConcurrentExecution;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationContext;
import org.springframework.scheduling.quartz.QuartzJobBean;
import javax.annotation.Resource;
import java.util.Date;
@DisallowConcurrentExecution // Do not start a new run while the previous one is still executing
public class CrawlerAutohomeJob extends QuartzJobBean {
    @Resource(name="apiService")
    ApiService apiService;

    public ApiService getApiService() {
        return apiService;
    }

    public void setApiService(ApiService apiService) {
        this.apiService = apiService;
    }

    @Resource(name="carService")
    CarService carService;

    public CarService getCarService() {
        return carService;
    }

    public void setCarService(CarService carService) {
        this.carService = carService;
    }

    @Autowired
    TitleFilter titleFilter;

    @Override
    protected void executeInternal(JobExecutionContext jobExecutionContext) throws JobExecutionException {
        // Fetch the Spring context stored under the "context" key, then look up the beans from it
        ApplicationContext applicationContext = (ApplicationContext) jobExecutionContext.getJobDetail().getJobDataMap().get("context");
        /*applicationContext.getBean(PoolingHttpClientConnectionManager.class).closeExpiredConnections();*/
        apiService = applicationContext.getBean(ApiService.class);
        carService = applicationContext.getBean(CarService.class);
        titleFilter = applicationContext.getBean(TitleFilter.class);
        // Crawl list pages 1 to 100
        for (int i = 1; i <= 100; i++) {
            testCrawler(i);
        }
    }

    public void testCrawler(int val) {
        String html = apiService.getHtml("https://www.autohome.com.cn/bestauto/" + val);
        if (html == null) {
            return; // the request failed, so there is nothing to parse for this page
        }
        Document dom = Jsoup.parse(html);
        // Select the divs that hold the individual reviews
        Elements divs = dom.select("#bestautocontent div .uibox");
        for (Element div : divs) {
            // Read the review title
            String title = div.select("div.uibox-title").first().text();
            if (!carService.queryCars(title).isEmpty()) {
                continue; // already stored, so skip it and move on to the next one
            }
            Car car = getCar(div);
            // Download the review images
            String img = getCarImage(div);
            car.setImg(img);
            Date date = new Date();
            /*SimpleDateFormat simpleDateFormat=new SimpleDateFormat("yyyy-MM-dd hh:mm:ss");
            String time=simpleDateFormat.format(date);*/
            car.setCreated(date);
            car.setUpdated(date);
            carService.insertCar(car);
        }
    }

    // Build a Car object from the given review div
    private Car getCar(Element div) {
        Car car = new Car();
        String title = div.select("div.uibox-title").first().text();
        String speedText = div.select(".tabbox1 dd:nth-child(2) div.dd-div2").first().text();
        String oilText = div.select(".tabbox1 dd:nth-child(3) div.dd-div2").first().text();
        // Strip the trailing one-character unit before parsing; each value uses its own length
        Double test_speed = Double.parseDouble(speedText.substring(0, speedText.length() - 1));
        Double test_oil = Double.parseDouble(oilText.substring(0, oilText.length() - 1));
        String editor_name1 = div.select(".tabbox2 dd:nth-child(2) div.dd-div1").first().text();
        String editor_name2 = div.select(".tabbox2 dd:nth-child(3) div.dd-div1").first().text();
        String editor_name3 = div.select(".tabbox2 dd:nth-child(4) div.dd-div1").first().text();
        String editor_remark1 = div.select(".tabbox2 dd:nth-child(2) div.dd-div3-pp").first().text();
        String editor_remark2 = div.select(".tabbox2 dd:nth-child(3) div.dd-div3-pp").first().text();
        String editor_remark3 = div.select(".tabbox2 dd:nth-child(4) div.dd-div3-pp").first().text();
        car.setEditor_name1(editor_name1);
        car.setEditor_name2(editor_name2);
        car.setEditor_name3(editor_name3);
        car.setEditor_remark1(editor_remark1);
        car.setEditor_remark2(editor_remark2);
        car.setEditor_remark3(editor_remark3);
        car.setTest_speed(test_speed);
        car.setTest_oil(test_oil);
        car.setTitle(title);
        return car;
    }

    // Download the images in the given review div and return their file names joined by "@"
    private String getCarImage(Element div) {
        String image = "";
        Elements page = div.select("ul.piclist02 li");
        for (Element i : page) {
            String imgUrl = "https:" + i.select("img").attr("src");
            String imgName = apiService.getImage(imgUrl);
            if (imgName == null) {
                continue; // the download failed, so do not record a "null" file name
            }
            image += imgName + "@";
        }
        // Compare string content, not references, then drop the trailing "@"
        if (!image.isEmpty()) {
            image = image.substring(0, image.length() - 1);
        }
        return image;
    }
}
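getCar removes the trailing unit with substring before calling Double.parseDouble, which breaks as soon as the unit is longer than one character or the value is missing. A more tolerant sketch reads the leading number with a regular expression and returns an empty result when there is none. `MeasurementParser` is a hypothetical helper, and sample inputs such as "8.5s" are assumptions about what the page text looks like.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the leading decimal number from text such as "8.5s". */
final class MeasurementParser {
    // Optional leading whitespace, then digits with an optional fractional part.
    private static final Pattern LEADING_NUMBER = Pattern.compile("^\\s*(\\d+(?:\\.\\d+)?)");

    private MeasurementParser() {}

    static OptionalDouble parse(String text) {
        if (text == null) {
            return OptionalDouble.empty();
        }
        Matcher m = LEADING_NUMBER.matcher(text);
        return m.find()
                ? OptionalDouble.of(Double.parseDouble(m.group(1)))
                : OptionalDouble.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("8.5s"));
        System.out.println(parse("--"));
    }
}
```

A caller could then write something like `parse(text).ifPresent(car::setTest_speed)` and leave the field null when the page has no value.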
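For me the hardest part here was analyzing the DOM and extracting the data.
It took a lot of trial, error, and debugging to get it working. Below is the markup that has to be analyzed, which is also the data needed for a single car. The approach is to pick out the DOM block you need and work inward from the outer layer, step by step, until you reach the elements that hold the data:
The above is the program as it actually runs. If you want to debug it and hunt for bugs, you can write a test class and run it first. Code: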
package com.example.data.service;
import com.example.data.DataApplication;
import com.example.data.common.MyFilter;
import com.example.data.common.TitleFilter;
import com.example.data.pojo.Car;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import javax.annotation.Resource;
import java.util.Date;
@RunWith(SpringJUnit4ClassRunner.class)
@SpringBootTest(classes= DataApplication.class)
public class AutoCrawlerTest {
    @Resource(name="apiService")
    ApiService apiService;

    public ApiService getApiService() {
        return apiService;
    }

    public void setApiService(ApiService apiService) {
        this.apiService = apiService;
    }

    @Resource(name="carService")
    CarService carService;

    public CarService getCarService() {
        return carService;
    }

    public void setCarService(CarService carService) {
        this.carService = carService;
    }

    @Autowired
    TitleFilter titleFilter;

    @Test
    public void testCrawler() {
        String html = apiService.getHtml("https://www.autohome.com.cn/bestauto/1");
        Document dom = Jsoup.parse(html);
        // Select the divs that hold the individual reviews
        Elements divs = dom.select("#bestautocontent div .uibox");
        for (Element div : divs) {
            // Read the review title
            String title = div.select("div.uibox-title").first().text();
            if (!carService.queryCars(title).isEmpty()) {
                continue; // already stored, so skip it
            }
            Car car = getCar(div);
            // Download the review images
            String img = getCarImage(div);
            car.setImg(img);
            Date date = new Date();
            /*SimpleDateFormat simpleDateFormat=new SimpleDateFormat("yyyy-MM-dd hh:mm:ss");
            String time=simpleDateFormat.format(date);*/
            car.setCreated(date);
            car.setUpdated(date);
            carService.insertCar(car);
        }
    }

    // Build a Car object from the given review div
    private Car getCar(Element div) {
        Car car = new Car();
        String title = div.select("div.uibox-title").first().text();
        String speedText = div.select(".tabbox1 dd:nth-child(2) div.dd-div2").first().text();
        String oilText = div.select(".tabbox1 dd:nth-child(3) div.dd-div2").first().text();
        // Strip the trailing one-character unit before parsing; each value uses its own length
        Double test_speed = Double.parseDouble(speedText.substring(0, speedText.length() - 1));
        Double test_oil = Double.parseDouble(oilText.substring(0, oilText.length() - 1));
        String editor_name1 = div.select(".tabbox2 dd:nth-child(2) div.dd-div1").first().text();
        String editor_name2 = div.select(".tabbox2 dd:nth-child(3) div.dd-div1").first().text();
        String editor_name3 = div.select(".tabbox2 dd:nth-child(4) div.dd-div1").first().text();
        String editor_remark1 = div.select(".tabbox2 dd:nth-child(2) div.dd-div3-pp").first().text();
        String editor_remark2 = div.select(".tabbox2 dd:nth-child(3) div.dd-div3-pp").first().text();
        String editor_remark3 = div.select(".tabbox2 dd:nth-child(4) div.dd-div3-pp").first().text();
        car.setEditor_name1(editor_name1);
        car.setEditor_name2(editor_name2);
        car.setEditor_name3(editor_name3);
        car.setEditor_remark1(editor_remark1);
        car.setEditor_remark2(editor_remark2);
        car.setEditor_remark3(editor_remark3);
        car.setTest_speed(test_speed);
        car.setTest_oil(test_oil);
        car.setTitle(title);
        return car;
    }

    // Download the images in the given review div and return their file names joined by "@"
    private String getCarImage(Element div) {
        String image = "";
        Elements page = div.select("ul.piclist02 li");
        for (Element i : page) {
            String imgUrl = "https:" + i.select("img").attr("src");
            String imgName = apiService.getImage(imgUrl);
            if (imgName == null) {
                continue; // the download failed, so do not record a "null" file name
            }
            image += imgName + "@";
        }
        // Compare string content, not references, then drop the trailing "@"
        if (!image.isEmpty()) {
            image = image.substring(0, image.length() - 1);
        }
        return image;
    }
}
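Step 8. Run the project and test it
All the code for the project is now written, so it is time to run it, that is, to run the application entry point DataApplication (that is my class name; run whatever you named yours).
A successful crawl looks like this:
Data crawled into the database:
Downloaded images:
Note: during a run you may see errors because some entries on the page do not contain the DOM nodes we try to extract (a selector matches nothing, first() returns null, and the following call throws a NullPointerException). This is expected, so there is no need to worry.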
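One way to keep a single malformed entry from aborting the rest of a page is to catch the exception per entry and move on. The sketch below shows the pattern with the standard library only, using strings in place of jsoup elements. `SafeLoop` and `forEachSkippingFailures` are hypothetical names, not part of the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: applies an action to every item and skips the ones that fail. */
final class SafeLoop {
    private SafeLoop() {}

    /** Returns the number of items whose action threw a RuntimeException. */
    static <T> int forEachSkippingFailures(List<T> items, Consumer<T> action) {
        int failures = 0;
        for (T item : items) {
            try {
                action.accept(item);
            } catch (RuntimeException e) {
                // Covers NullPointerException and NumberFormatException, among others.
                failures++;
                System.err.println("Skipping item: " + e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<Double> parsed = new ArrayList<>();
        // "n/a" cannot be parsed; it is counted as a failure and the loop continues.
        int failures = forEachSkippingFailures(
                Arrays.asList("1.5", "n/a", "2"),
                s -> parsed.add(Double.parseDouble(s)));
        System.out.println(parsed + ", failures: " + failures);
    }
}
```

In the crawler, the list would be the review divs and the action would be the body of the for loop in testCrawler.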
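To recap, this post records the project I built while learning to write crawlers: a simple Java crawler for Autohome (汽车之家), suited mainly to simple pages and to pages that do not force a login.
Video course I followed while learning: 《腾讯课堂-java爬虫项目|抓取汽车之家百万数据》 (Tencent Classroom, Java crawler project: scraping a million records from Autohome)
Full source code of this project: 《CSDN-当java遇上爬虫,我的数据库再也不缺数据了项目详细源代码》 (on CSDN)
If you find this post helpful, you can study it and the source code alongside the video course mentioned above. My code differs somewhat from the code in the video and is offered only as a reference.
Thank you for your support. If you run into problems while learning, feel free to reach out and discuss them with me. Let's keep improving together!
-------- End --------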