The WebMagic Crawler Framework

Official documentation: http://webmagic.io/docs/zh/

No preamble, straight to the code.

Unit test that launches the crawler

@Test
public void webMagic() {
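    // `pipeline` is a field of the test class (not shown) holding the Pipeline component defined below.
    // Wait 5000 ms between requests and map each matching page onto the WebMagic model.
    OOSpider.create(Site.me().setSleepTime(5000), pipeline, WebMagic.class)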
            .addUrl("http://www.119.cn/content/201903/28/c102918.html")
            .addUrl("http://www.119.cn/content/201902/21/c99922.html")
            .addUrl("http://www.119.cn/content/201902/20/c99843.html")
            .addUrl("http://www.119.cn/content/201901/30/c98820.html")
            .addUrl("http://www.119.cn/content/201901/16/c97731.html")
            .addUrl("http://www.119.cn/content/201901/15/c97590.html")
            .addUrl("http://www.119.cn/content/201901/10/c97269.html")
            .addUrl("http://www.119.cn/content/201901/03/c96718.html")
            .addUrl("http://www.119.cn/content/201901/03/c96677.html")
            .addUrl("http://www.119.cn/content/201901/03/c96663.html")
            .addUrl("http://www.119.cn/content/201812/28/c96264.html")
            .addUrl("http://www.119.cn/content/201812/27/c96167.html")
            .addUrl("http://www.119.cn/content/201812/26/c96028.html")
            .addUrl("http://www.119.cn/content/201812/26/c96027.html")
            .addUrl("http://www.119.cn/content/201812/26/c96025.html")
            // Crawl with 5 threads; run() blocks until the spider finishes.
            .thread(5).run();
}

Pipeline component holding the business logic

@Component
public class Pipeline implements PageModelPipeline<WebMagic> {
    @Resource
    private NoticeMapper noticeMapper;

    @Override
    public void process(WebMagic webMagic, Task task) {
        // Skip pages whose title has already been stored.
        String title = webMagic.getTitle();
        NoticeEntity old = noticeMapper.selectTitle(title);
        if (old == null) {
            NoticeEntity noticeEntity = new NoticeEntity();
            // Use a 32-character UUID with the hyphens removed as the notice ID.
            noticeEntity.setNoticeId(UUID.randomUUID().toString().replace("-", ""));
            noticeEntity.setClassifyId("402882456ad2f2b9016ad3f03a6a000a");
            noticeEntity.setContent(webMagic.getContent());
            noticeEntity.setIsHomeShow(1);
            noticeEntity.setIsTop(1);
            noticeEntity.setStatus(1);
            noticeEntity.setTitle(title);
            noticeEntity.setVisiblePersonId("2");
            noticeMapper.insert(noticeEntity);
        }
    }

}
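
One thing to watch in the pipeline above: the spider runs with 5 threads, and "select by title, then insert" is two separate steps, so two threads that handle pages with the same title at the same moment could both pass the check. The following is only a minimal in-memory sketch of an atomic "insert if absent" step; the class TitleDedupSketch and its methods are hypothetical names, and the map merely stands in for the notice table. With a real database, a unique constraint on the title column would be the usual way to get the same guarantee.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the notice table, keyed by title.
class TitleDedupSketch {
    private final Map<String, String> idByTitle = new ConcurrentHashMap<>();

    // Stores the title with a fresh ID only if it is not present yet.
    // Returns true if this call stored it, false if it already existed.
    boolean insertIfAbsent(String title) {
        String id = UUID.randomUUID().toString().replace("-", "");
        // putIfAbsent checks and inserts in one atomic step.
        return idByTitle.putIfAbsent(title, id) == null;
    }

    // Returns the stored ID for a title, or null if the title is unknown.
    String idOf(String title) {
        return idByTitle.get(title);
    }

    int size() {
        return idByTitle.size();
    }
}
```

Calling insertIfAbsent twice with the same title stores it once; the second call returns false and leaves the first ID untouched.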

Annotated page model class

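// Pages whose URL matches this pattern are extracted into WebMagic objects.
@TargetUrl("http://www.119.cn/content/*/*/*.html")
// Pages matching this pattern are visited only to discover links to target pages.
@HelpUrl("http://www.119.cn/picture.html")
public class WebMagic {

    // Article title, selected by XPath.
    @ExtractBy("//h1[@class='title']/text()")
    private String title;

    // Article body; the earlier selector is kept for reference.
    //    @ExtractBy("//div[@class='contentBox']")
    @ExtractBy("//div[@class='bodyBox']")
    private String content;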

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

}
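
The @TargetUrl value above is a simplified wildcard pattern and not a plain regular expression: according to the WebMagic documentation, a literal "." is escaped and "*" acts as a wildcard. The sketch below only approximates that translation with standard java.util.regex so the matching can be tried in isolation; TargetUrlSketch, toRegex and matches are hypothetical names, and the framework's actual wildcard expansion may be stricter than ".*".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of WebMagic: approximates how a
// @TargetUrl wildcard pattern can be turned into a regular expression.
class TargetUrlSketch {

    // Escape literal dots first, then let each '*' match any run of characters.
    static Pattern toRegex(String wildcard) {
        String regex = wildcard.replace(".", "\\.").replace("*", ".*");
        return Pattern.compile(regex);
    }

    // True if the whole URL matches the wildcard pattern.
    static boolean matches(String wildcard, String url) {
        return toRegex(wildcard).matcher(url).matches();
    }

    public static void main(String[] args) {
        String pattern = "http://www.119.cn/content/*/*/*.html";
        // An article URL from the test above: prints true.
        System.out.println(matches(pattern, "http://www.119.cn/content/201903/28/c102918.html"));
        // The help page is not under /content/: prints false.
        System.out.println(matches(pattern, "http://www.119.cn/picture.html"));
    }
}
```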

With these three pieces in place, the crawler is complete.

 
