I. Requirements
The business required a news-feed feature. The initial plan was to build it on top of a third-party service API, but after those negotiations fell through, we decided to develop our own crawler interface instead. After reviewing the available documentation, we chose the open-source WebMagic framework to implement the crawler.
II. Implementation
1. Add the dependencies
Add the following to the project's pom.xml:
<!-- crawler -->
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>
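One detail worth calling out before the processor code in the next step: the listing declares a static ConcurrentHashMap shared across spider threads, presumably so that the same item is not processed or pushed twice. A minimal, self-contained sketch of that thread-safe deduplication pattern is shown below (the class and method names here are illustrative, not taken from the project):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CrawlDedup {

    // Shared across all spider threads, like the static map in the processor.
    private static final ConcurrentMap<String, Boolean> SEEN = new ConcurrentHashMap<>();

    // Returns true only the first time a URL is offered. putIfAbsent is
    // atomic, so two threads racing on the same URL cannot both win.
    public static boolean markIfNew(String url) {
        return SEEN.putIfAbsent(url, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        System.out.println(markIfNew("https://example.com/flash/1")); // true
        System.out.println(markIfNew("https://example.com/flash/1")); // false
    }
}
```

The same effect could be achieved with `ConcurrentHashMap.newKeySet().add(url)`; either way, the check-and-insert happens as one atomic operation, which a plain HashMap would not guarantee under concurrent crawling.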
2. Create the page processor
Create a class implementing WebMagic's PageProcessor interface (for reference only):
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import team.biteeny.admin.db.write.cache.ConfigMapper;
import team.biteeny.admin.db.write.mapper.CrawlMapper;
import team.biteeny.admin.db.write.model.CrawlModel;
import team.biteeny.push.getui.PushApp;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.Json;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
@Component
public class HuobiInfoProcessor implements PageProcessor {

    @Autowired
    private CrawlMapper crawlMapper;

    @Autowired
    private ConfigMapper configMapper;

    private Site site;

    // Shared across spider threads; ConcurrentHashMap keeps access thread-safe.
    private static Map<String, Object> map = new ConcurrentHashMap<>();

    @Override
    public void process(Page page) {
        if (page.getUrl().toString().contains("flash")) {
            insertFlash