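A Brief Introduction to WebMagic
WebMagic is a simple, flexible crawler framework for Java. With it you can quickly build a crawler that is efficient and easy to maintain, and beginners can pick it up quickly.
WebMagic has four main components: PageProcessor (page parsing), Scheduler (URL management), Downloader (page download), and Pipeline (persistence).
This post does not cover WebMagic itself in depth; see the official WebMagic website for more.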
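To picture how the four components hand work to each other, here is a toy model of the crawl loop. It is plain JDK code, not WebMagic itself: the types below only borrow the component names, and the in-memory "site" is made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Toy model of a crawl loop; illustrative types, not WebMagic's own classes. */
final class MiniCrawler {

    /** What a processor extracts from one page: a title plus links to follow. */
    static final class Extracted {
        final String title;
        final List<String> links;
        Extracted(String title, List<String> links) {
            this.title = title;
            this.links = links;
        }
    }

    interface Downloader { String download(String url); }
    interface PageProcessor { Extracted process(String url, String body); }
    interface Pipeline { void persist(String url, String title); }

    /** Scheduler role: a FIFO queue that ignores URLs it has already seen. */
    static final class Scheduler {
        private final Queue<String> queue = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();
        void push(String url) { if (seen.add(url)) queue.add(url); }
        String poll() { return queue.poll(); }
    }

    static void crawl(String startUrl, Downloader downloader, PageProcessor processor, Pipeline pipeline) {
        Scheduler scheduler = new Scheduler();
        scheduler.push(startUrl);
        String url;
        while ((url = scheduler.poll()) != null) {
            String body = downloader.download(url);          // Downloader: fetch the page
            if (body == null) continue;
            Extracted result = processor.process(url, body);  // PageProcessor: extract data and links
            result.links.forEach(scheduler::push);            // Scheduler: queue new URLs once
            pipeline.persist(url, result.title);              // Pipeline: store the result
        }
    }

    /** Runs the loop over a fake site (URL -> "title|link1,link2") and returns what was stored. */
    static List<String> run() {
        Map<String, String> site = Map.of(
                "/list", "List|/a,/b",
                "/a", "Article A|/list",
                "/b", "Article B|");
        List<String> saved = new ArrayList<>();
        crawl("/list",
                site::get,
                (url, body) -> {
                    String[] parts = body.split("\\|", -1);
                    List<String> links = parts[1].isEmpty() ? List.of() : List.of(parts[1].split(","));
                    return new Extracted(parts[0], links);
                },
                (url, title) -> saved.add(url + "=" + title));
        return saved;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In real WebMagic code you normally write only the PageProcessor and, if you need persistence, a Pipeline; the framework supplies default Scheduler and Downloader implementations.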
Quick Start and Spring Boot Integration
Add the WebMagic dependencies to your Spring Boot project:
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>
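1. Create an XXXPageProcessor class that implements PageProcessor and its two methods, process() and getSite(). The code is as follows:
A quick note: this example fetches a list of articles; adapt it to your own business requirements.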
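import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;

// Project classes (CompanyNoticeService, ArticleService, GdEnergyDto, ...) come from your own packages.

@Component
public class CompanyNoticeProcessor implements PageProcessor {

    // XPath of the table rows that make up the article list.
    private static final String ROWS_XPATH =
            "//div[@id=\"container2\"]/table/tbody/tr[2]/td/table/tbody/tr";

    private static final Site site = Site.me().setSleepTime(1000).setTimeOut(10000).setRetryTimes(0);

    // Static because the spider is started with "new CompanyNoticeProcessor()", which
    // bypasses Spring; the @Autowired setters below fill these fields in at startup.
    private static CompanyNoticeService companyNoticeService;
    private static ArticleService articleService;

    @Override
    public void process(Page page) {
        Html html = page.getHtml();
        // all() returns every node matched by the XPath, one string per row.
        List<String> rows = html.xpath(ROWS_XPATH).all();
        // XPath positions are 1-based. Rows 1 and 2 are skipped, and the upper
        // bound excludes the last row.
        for (int i = 3; i < rows.size(); i++) {
            String href = html.xpath(ROWS_XPATH + "[" + i + "]/td/a/@href").get();
            if (href == null) {
                continue; // row without a link
            }
            GdEnergyDto energyDto = GdEnergyUtil.href(href, 2);
            // Crawl the detail page only if the article is not stored yet.
            Article existing = articleService.getArticleByTitle(energyDto.getTitle(), energyDto.getSheetId());
            if (existing == null) {
                GdElectricityConstant.ARTICLE_SHEETID = energyDto.getSheetId();
                companyNoticeService.executeCompanyNoticeDetail(energyDto);
            }
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    @Autowired
    public void setCompanyNoticeService(CompanyNoticeService companyNoticeService) {
        CompanyNoticeProcessor.companyNoticeService = companyNoticeService;
    }

    @Autowired
    public void setArticleService(ArticleService articleService) {
        CompanyNoticeProcessor.articleService = articleService;
    }
}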
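When I first started, I did not know how to get all the child elements under a given tag, and searching around did not turn up a suitable method. Some people suggested smartContent(), but in my tests it returned empty content. After some experimenting on my own, I found that all() does the job.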
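The difference between "every match" and "one match at a position" can be tried without a crawler. The sketch below uses the JDK's built-in javax.xml.xpath API rather than WebMagic's selectors, on a small made-up table (the markup and the class name XPathAllDemo are illustrative). One difference worth knowing: the JDK returns an empty string when nothing matches, whereas WebMagic's get() returns null.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathAllDemo {

    // Made-up sample: a header row followed by two rows with links.
    static final String SAMPLE =
            "<table>"
            + "<tr><td>Header</td></tr>"
            + "<tr><td><a href='/notice/1.html'>First</a></td></tr>"
            + "<tr><td><a href='/notice/2.html'>Second</a></td></tr>"
            + "</table>";

    /** Every href in the table: the counterpart of calling all() on a selector. */
    static List<String> allHrefs(String xml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "/table/tr/td/a/@href",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                hrefs.add(nodes.item(i).getNodeValue());
            }
            return hrefs;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid XPath or XML", e);
        }
    }

    /** The href in one specific row; XPath positions start at 1, not 0. */
    static String hrefInRow(String xml, int row) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(
                    "/table/tr[" + row + "]/td/a/@href",
                    new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid XPath or XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(allHrefs(SAMPLE));
        System.out.println(hrefInRow(SAMPLE, 2));
    }
}
```

Because positions are 1-based, row 1 here is the header and the first link sits in row 2, which is why the list processor builds its per-row XPath with an index that starts above the header rows.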
2. Create an XXXPipeline class that implements the Pipeline interface and its process() method.
A Pipeline is mainly used for persistence-layer operations.
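import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

// MyBatis-Plus 3.x package, assumed from the QueryWrapper usage below.
import com.baomidou.mybatisplus.core.conditions.query.QueryWrapper;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class NoticeDetailPipeline implements Pipeline {

    // Static for the same reason as in the processor: the pipeline is created with "new".
    private static ArticleService articleService;

    @Override
    public void process(ResultItems resultItems, Task task) {
        // "article" is the key the detail processor used in page.putField(...).
        Article art = resultItems.get("article");
        if (art == null) {
            return;
        }
        // Look for an existing article with mark = 1 and the same title.
        QueryWrapper<Article> queryWrapper = new QueryWrapper<>();
        queryWrapper.lambda().eq(Article::getMark, 1).eq(Article::getTitle, art.getTitle());
        Article one = articleService.getOne(queryWrapper);
        // Save only when no such article exists, to avoid duplicates.
        if (one == null) {
            articleService.edit(art);
        }
    }

    @Autowired
    public void setArticleService(ArticleService articleService) {
        NoticeDetailPipeline.articleService = articleService;
    }
}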
3. Create the service class.
The service is what starts the crawl: it creates the Spider and passes in the PageProcessor and Pipeline implementations.
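import org.springframework.stereotype.Service;

import us.codecraft.webmagic.Spider;

@Service
public class CompanyNoticeServiceImpl implements CompanyNoticeService {

    // Crawl the article list page.
    @Override
    public void executeCompanyNotice() {
        // Only one start URL is queued and the list processor adds no further
        // requests, so most of the 100 threads stay idle; a smaller pool would do.
        Spider.create(new CompanyNoticeProcessor())
                .addUrl(GdElectricityConstant.COMPANY_NOTICES_URL)
                .thread(100)
                .run(); // run() blocks until the crawl finishes
    }

    // Crawl a single article detail page and persist it through the pipeline.
    @Override
    public void executeCompanyNoticeDetail(GdEnergyDto energyDto) {
        Spider.create(new CompanyNoticeDetailProcessor())
                .addUrl(energyDto.getUrl())
                .thread(100)
                .addPipeline(new NoticeDetailPipeline())
                .run();
    }
}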
4. The implementation class that fetches the article details
import java.util.Date;
import java.util.List;

import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;

@Component
public class CompanyNoticeDetailProcessor implements PageProcessor {

    // Common XPath prefix of the detail page's content area.
    private static final String BASE_XPATH =
            "//div[@class=\"layui-col-xs9\"]/table/tbody/tr/td";

    private static final Site site = Site.me().setSleepTime(1000).setTimeOut(10000).setRetryTimes(0);

    @Override
    public void process(Page page) {
        Html html = page.getHtml();
        String title = html.xpath(BASE_XPATH + "/table[1]/tbody/tr[1]/td/text()").toString();
        String auth = html.xpath(BASE_XPATH + "/table[1]/tbody/tr[2]/td/text()").toString();
        // Every <p> element of the article body, in document order.
        List<String> paragraphs = html.xpath(BASE_XPATH + "/table[2]/tbody/tr/td/p").all();
        // Join the fragments directly. Going through List.toString() and splitting
        // on "," would also strip commas that belong to the article text.
        String conData = String.join("", paragraphs);

        Article article = new Article();
        article.setTitle(title);
        article.setCover("");
        article.setContent(conData);
        article.setCateId(30);
        article.setAuthor(auth);
        article.setAuditorId(0);
        article.setAuditStatus(2);
        article.setDocumentStatus(1);
        article.setAuditTime(new Date());
        // Set by the list processor just before this detail crawl was started.
        article.setSheetId(GdElectricityConstant.ARTICLE_SHEETID);
        // Hand the article to the pipeline under the key "article".
        page.putField("article", article);
    }

    @Override
    public Site getSite() {
        return site;
    }
}
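One detail matters when gluing together the fragments returned by all(): converting the list with toString() and then splitting on commas also removes commas that are part of the content. The JDK-only sketch below compares the two approaches on made-up fragments (the class name and sample strings are illustrative).

```java
import java.util.List;

final class JoinFragmentsDemo {

    /** Round-trips through List.toString(), strips the brackets, splits on "," and concatenates. */
    static String viaToString(List<String> fragments) {
        String text = fragments.toString();                   // "[a, b]"
        String inner = text.substring(1, text.length() - 1);  // "a, b"
        StringBuilder sb = new StringBuilder();
        for (String part : inner.split(",")) {
            sb.append(part);
        }
        return sb.toString();
    }

    /** Concatenates the fragments as they are. */
    static String viaJoin(List<String> fragments) {
        return String.join("", fragments);
    }

    public static void main(String[] args) {
        List<String> fragments = List.of("<p>Price: 1,000 yuan</p>", "<p>End</p>");
        System.out.println(viaToString(fragments)); // the comma inside "1,000" is lost
        System.out.println(viaJoin(fragments));     // the content is unchanged
    }
}
```

With the sample above, the toString() route turns "1,000" into "1000", while String.join leaves the text intact.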
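Note: the classes that implement PageProcessor and Pipeline must be annotated with @Component. Spring then creates one instance of each at startup and calls its @Autowired setters, which fill in the static service fields; the instances that the service later creates with new share those static fields.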
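Why this works can be shown without Spring. The sketch below is a JDK-only simulation under stated assumptions: ArticleService and Processor are stand-ins, and the explicit setter call plays the role of what the container does for an @Component bean with an @Autowired setter.

```java
final class StaticInjectionDemo {

    /** Stand-in for a Spring-managed service (hypothetical). */
    static final class ArticleService {
        String describe() { return "real service"; }
    }

    /** Same shape as the processors in this post: static field, instance setter. */
    static final class Processor {
        private static ArticleService articleService;

        // In the real class this setter carries @Autowired and is called by Spring.
        void setArticleService(ArticleService service) {
            Processor.articleService = service;
        }

        String use() {
            return articleService == null ? "not injected" : articleService.describe();
        }
    }

    /** Returns what a "new" instance sees before and after the container has run. */
    static String[] run() {
        Processor.articleService = null; // reset so the demo is repeatable
        String before = new Processor().use();

        // Simulates container startup: one managed instance receives the dependency.
        Processor managedBean = new Processor();
        managedBean.setArticleService(new ArticleService());

        // A different instance, created by hand, sees the same static field.
        String after = new Processor().use();
        return new String[] { before, after };
    }

    public static void main(String[] args) {
        String[] result = run();
        System.out.println(result[0] + " -> " + result[1]);
    }
}
```

The trade-off is shared global state. An alternative, if it suits your project, is to inject the processor and pipeline beans into the service and pass those to Spider.create(...) and addPipeline(...) instead of creating new instances; the fields can then be ordinary instance fields.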