Reading this article takes about 3 minutes.
xxl-crawler is an open-source Java crawler by Xu Xueli (xuxueli). If you are familiar with Java, it is very comfortable to work with.
Repository:
https://github.com/xuxueli/xxl-crawler
Official documentation:
https://www.xuxueli.com/xxl-crawler/#爬虫示例参考
0x01: Create a new project and add the following dependencies to pom.xml
<dependency>
    <groupId>com.xuxueli</groupId>
    <artifactId>xxl-crawler</artifactId>
    <version>1.2.2</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.2</version>
</dependency>
0x02: Define the page data object
Two tools are recommended here for quickly and visually obtaining a page element's jQuery/CSS selector (cssQuery) expression.
Chrome DevTools: locate the element on the page, select it in the Elements panel, then right-click and choose "Copy → Copy selector";
(Figure: Chrome DevTools usage)
jQuery Selector Helper (Chrome extension): locate the element, open the Selector panel on the right side of the Elements tab, then pick the element.
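Before wiring a copied selector into the crawler, it can be handy to sanity-check it directly with the jsoup dependency declared above. The sketch below runs the same row selector used later in this article against a tiny hand-written HTML fragment (the fragment is invented for illustration, not the real page):

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorCheck {

    // Apply the row selector and return "code -> onclick" for each matched row.
    static List<String> extractRows(String html) {
        Document doc = Jsoup.parse(html);
        // Same style of expression that Chrome's "Copy selector" produces.
        Elements rows = doc.select("body > div.container > div > div > table > tbody > tr");
        List<String> out = new ArrayList<>();
        for (Element row : rows) {
            out.add(row.select("td:nth-child(1)").text()
                    + " -> " + row.select("td:nth-child(2) > a").attr("onclick"));
        }
        return out;
    }

    public static void main(String[] args) {
        // A tiny stand-in for the real page, just to exercise the selector.
        String html = "<body><div class=\"container\"><div><div>"
                + "<table><tbody>"
                + "<tr><td>P001</td><td><a onclick=\"open('a.shtml?id=1')\">Item one</a></td></tr>"
                + "<tr><td>P002</td><td><a onclick=\"open('a.shtml?id=2')\">Item two</a></td></tr>"
                + "</tbody></table>"
                + "</div></div></div></body>";
        extractRows(html).forEach(System.out::println);
    }
}
```

If the selector prints the rows you expect here, the same cssQuery should behave identically inside xxl-crawler, since xxl-crawler uses jsoup underneath.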
package com.spider.page.vo;

import com.xuxueli.crawler.annotation.PageFieldSelect;
import com.xuxueli.crawler.annotation.PageSelect;
import com.xuxueli.crawler.conf.XxlCrawlerConf.SelectType;

// One instance is populated for each <tr> matched by the cssQuery below.
@PageSelect(cssQuery = "body > div.container > div > div > table > tbody > tr")
public class GzGemasComCnPageMainVo {

    @PageFieldSelect(cssQuery = "td:nth-child(1)")
    private String code;

    @PageFieldSelect(cssQuery = "td:nth-child(2)")
    private String title;

    @PageFieldSelect(cssQuery = "td:nth-child(3)")
    private String status;

    @PageFieldSelect(cssQuery = "td:nth-child(4)")
    private String date;

    // This site puts the detail link in the anchor's onclick handler rather
    // than in href, so we capture the attribute value instead of the text.
    @PageFieldSelect(cssQuery = "td:nth-child(2) > a", selectType = SelectType.ATTR, selectVal = "onclick")
    private String url;

    public String getCode() { return code; }
    public void setCode(String code) { this.code = code; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getStatus() { return status; }
    public void setStatus(String status) { this.status = status; }

    public String getDate() { return date; }
    public void setDate(String date) { this.date = date; }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
}
0x03: Create the crawler and fetch the data
XxlCrawler crawler = new XxlCrawler.Builder()
        .setUrls("http://gz.gemas.com.cn/portal/article/proList.shtml?proType=guquan&typeGz=G3T3&proSource=&pageIndex=1")
        .setAllowSpread(false)  // disable spread crawling: only the seed URL above is fetched; set to true to spider outward across the whole site
        .setThreadCount(1)
        .setPageParser(new PageParser<GzGemasComCnPageMainVo>() {
            @Override
            public void parse(Document html, Element pageVoElement, GzGemasComCnPageMainVo gzGemasComCnPageVo) {
                // The PageVo arrives already populated from the annotations on the VO class
                String pageUrl = html.baseUri();
                logger.info("pageUrl: " + pageUrl);
                logger.info("Code: " + gzGemasComCnPageVo.getCode() + ", Title: " + gzGemasComCnPageVo.getTitle()
                        + ", sdate: " + gzGemasComCnPageVo.getDate() + ", url: " + gzGemasComCnPageVo.getUrl()
                        + ", status: " + gzGemasComCnPageVo.getStatus());
            }
        })
        .build();
crawler.start(true);  // true = run synchronously, blocking until the crawl finishes
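Note that the `url` field captured above holds the raw onclick JavaScript, not a clean link. A small helper can pull the URL back out of it. This is a minimal sketch: the regex assumes the handler wraps the path in single quotes (e.g. `window.open('/portal/article/...shtml?...')`), which is a guess at the handler's shape, so adjust the pattern to whatever the real onclick value looks like:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OnclickUrlExtractor {

    // Assumed handler shape: a single-quoted path ending in .shtml,
    // e.g. window.open('/portal/article/proDetail.shtml?id=123').
    private static final Pattern QUOTED_URL =
            Pattern.compile("'([^']*\\.shtml[^']*)'");

    // Return the first quoted .shtml path inside the onclick value, or null.
    static String extractUrl(String onclick) {
        if (onclick == null) {
            return null;
        }
        Matcher m = QUOTED_URL.matcher(onclick);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical onclick value for illustration.
        System.out.println(extractUrl("window.open('/portal/article/proDetail.shtml?id=123')"));
        // → /portal/article/proDetail.shtml?id=123
    }
}
```

In the `parse` callback you could call `extractUrl(gzGemasComCnPageVo.getUrl())` before storing the record, so downstream code sees a usable relative URL instead of a JavaScript snippet.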
(Video: walkthrough of the key steps)