I. Introduction to WebMagic
WebMagic is an open-source Java vertical-crawler framework. Its goal is to simplify crawler development so that developers can focus on their own logic. The core of WebMagic is very simple, yet it covers the entire crawling workflow, which also makes it good material for learning crawler development. See the official site: http://webmagic.io/
II. Quick Start
WebMagic uses Maven for dependency management. Add the following dependencies to your project to use it:
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.4</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.4</version>
</dependency>
WebMagic ships with slf4j-log4j12 as its slf4j binding. If your project provides its own slf4j implementation, exclude this dependency:
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.4</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
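Before wiring WebMagic into Spring Boot, it helps to see the smallest possible crawler. The sketch below (class name and XPath are my own, not from this article) implements the two-method PageProcessor interface from webmagic-core and prints the extracted page title with the built-in ConsolePipeline:

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

// Minimal standalone crawler: fetch one page, extract its <title>, print the result.
public class MinimalProcessor implements PageProcessor {

    // crawl-wide settings: retry failed downloads 3 times, wait 1s between requests
    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // toString() collapses the selection into a single String
        page.putField("title", page.getHtml().xpath("//title/text()").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new MinimalProcessor())
                .addUrl("http://webmagic.io/")      // seed URL
                .addPipeline(new ConsolePipeline()) // dump extracted fields to stdout
                .thread(1)
                .run();
    }
}
```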
III. Crawling data into a database (Spring Boot + MyBatis + WebMagic)
1. The controller layer is shown below. A crawl is triggered by requesting http://localhost:8090/robinBootApi/crawlData/crawlBookList
package com.robinbootweb.dmo.controller;

import com.robinboot.facade.CrawlFacadeService;
import com.robinboot.query.CrawlQuery;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.ResponseBody;

/**
 * @author TF12778
 * @date 2021/1/28 14:02
 */
@Controller
@RequestMapping(value = "/crawlData")
public class CrawlController {

    private static final Log logger = LogFactory.getLog(CrawlController.class);

    @Autowired
    CrawlFacadeService webMagicFacadeService;

    /**
     * Crawl Douban movie data.
     * GET http://localhost:8090/robinBootApi/crawlData/crawlDoubanMoive
     */
    @ResponseBody
    @RequestMapping(value = "/crawlDoubanMoive", method = RequestMethod.GET)
    public void crawlDoubanMoive(CrawlQuery query) {
        webMagicFacadeService.crawlDoubanMoive(query);
    }

    /**
     * Crawl the full book list.
     * GET http://localhost:8090/robinBootApi/crawlData/crawlBookList
     */
    @ResponseBody
    @RequestMapping(value = "/crawlBookList", method = RequestMethod.GET)
    public void crawlBookList(CrawlQuery query) {
        webMagicFacadeService.crawlBookList(query);
    }
}
2. The facade layer launches the actual crawl
In a processor, page.addTargetRequests() queues further URLs to crawl, and page.putField() stores extraction results. page.getHtml().xpath() extracts content by rule, and extraction calls can be chained; at the end of a chain, toString() converts the selection into a single String, while all() converts it into a List of Strings.
Spider is the crawler's entry class. Pipeline is the interface for outputting and persisting results; here bookListPipeline writes results to the database.
package com.robinboot.service.facade.impl;

import com.robinboot.facade.CrawlFacadeService;
import com.robinboot.query.CrawlQuery;
import com.robinboot.service.WebMagic.Book.BookDetailPipeline;
import com.robinboot.service.WebMagic.Book.BookListPipeline;
import com.robinboot.service.WebMagic.Book.LittleBookDetailProcessor;
import com.robinboot.service.WebMagic.Book.LittleBookProcessor;
import com.robinboot.service.WebMagic.DoubanMoiveProcessor;
import com.robinboot.service.WebMagic.HttpClientDownloader;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import us.codecraft.webmagic.Spider;

/**
 * @author TF12778
 * @date 2021/1/28 13:44
 */
@Service("crawlFacadeService")
public class CrawlFacadeServiceImpl implements CrawlFacadeService {

    @Autowired
    BookListPipeline bookListPipeline;

    @Autowired
    BookDetailPipeline bookDetailPipeline;

    @Autowired
    LittleBookDetailProcessor littleBookDetailProcessor;

    @Override
    public void crawlDoubanMoive(CrawlQuery query) {
        Spider.create(new DoubanMoiveProcessor())
                .setDownloader(new HttpClientDownloader())
                .addUrl("https://www.douban.com/doulist/3907668/")
                // .addPipeline(new MySqlPipeline())
                .thread(5)
                .run();
    }

    /**
     * http://localhost:8090/robinBootApi/user/requestBookList
     */
    @Override
    public void crawlBookList(CrawlQuery query) {
        Spider.create(new LittleBookProcessor())
                .setDownloader(new HttpClientDownloader())
                .addUrl("http://deeee.cn:8080/tcshop/1111/catalog/toyy.html?orderBy=0")
                // `this` matters here: passing the container-managed pipeline (rather
                // than a `new` instance) ensures its autowired DAO dependencies are set
                .addPipeline(this.bookListPipeline)
                .thread(5)
                .run();
    }
}
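The article references BookListPipeline but does not show its source. A pipeline receives every field stored via page.putField() and decides how to persist it. The sketch below is an assumption of what such a class might look like (the DAO name and field key are hypothetical, not from the article); it implements the real Pipeline interface from webmagic-core:

```java
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

// Hypothetical sketch of BookListPipeline: registered as a Spring bean so that
// MyBatis mappers can be autowired into it, then passed to the Spider via
// addPipeline(this.bookListPipeline).
@Component
public class BookListPipeline implements Pipeline {

    // @Autowired
    // BookDao bookDao;  // assumed MyBatis mapper, not shown in the article

    @Override
    public void process(ResultItems resultItems, Task task) {
        // the key read here must match the one written by page.putField()
        String bookName = resultItems.get("bookName");
        if (bookName == null) {
            return; // skip pages where extraction produced nothing
        }
        // bookDao.insert(bookName);  // persist via MyBatis
        System.out.println("saving: " + bookName);
    }
}
```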
3. LittleBookProcessor, the page-processing class. The crawler's main parameters, URLs, and so on are configured here.
PageProcessor is part of webmagic-core; implementing your own PageProcessor is all it takes to define your crawling logic.
package com.robinboot.service.WebMagic.Book;
import com.robinboot.service.WebMagic.HttpClientDownloader;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;
import us.code
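The LittleBookProcessor source is cut off above. As a rough sketch of what a list-page processor of this shape typically contains (the XPath expressions, regex, and field names below are illustrative assumptions, not the article's actual ones):

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

// Illustrative list-page processor: queues detail-page links and extracts
// the visible book names from the list page.
public class BookListProcessorSketch implements PageProcessor {

    private final Site site = Site.me()
            .setRetryTimes(3)    // retry failed downloads up to 3 times
            .setSleepTime(1000)  // 1s politeness delay between requests
            .setTimeOut(10000);  // 10s download timeout

    @Override
    public void process(Page page) {
        // queue detail-page links found on this page for later crawling
        page.addTargetRequests(page.getHtml().links().regex(".*/book/\\d+").all());
        // chained extraction: xpath() narrows the selection, all() returns a List<String>
        page.putField("bookNames",
                page.getHtml().xpath("//div[@class='book']/a/text()").all());
    }

    @Override
    public Site getSite() {
        return site;
    }
}
```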