webmagic text extraction: how do I crawl the content of a specified URL with Heritrix or webmagic?

Latest recommended article published on 2024-03-21 22:39:00

你踩到我法袍了


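Article tags: webmagic, text extraction

Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_36231030/article/details/112864701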


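@wangxing.xu: I don't quite follow the question. Crawling usually means deep crawling, that is, continuing on to fetch the second-level links found on each page. If all you need is to parse the content behind a single URL, the usual tool is jsoup. webmagic also uses jsoup under the hood to parse HTML; it simply wraps the deep-crawling steps around it.

Also, the Document object is itself an abstraction: it can hold not only HTML but also JSON strings and so on. In short, it can hold whatever response content the URL returns.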
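To make the "single URL, no deep crawl" case concrete, here is a minimal sketch that uses only the JDK (Java 11+ `java.net.http`) rather than jsoup, so it runs without extra dependencies. The class `SinglePageFetcher` and its methods are hypothetical names invented for this illustration, and the regex-based title lookup is a deliberately crude stand-in for a real HTML parser such as jsoup. It shows that the response body is just a string, whether the server returned HTML or JSON.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of jsoup or webmagic.
class SinglePageFetcher {

    // Crude <title> matcher; a real HTML parser is more robust than a regex.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Fetches one URL and returns the raw response body (HTML, JSON, ...).
    static String fetchBody(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Returns the trimmed text of the first <title> element, if there is one.
    static Optional<String> extractTitle(String body) {
        Matcher m = TITLE.matcher(body);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }
}
```

A call such as `SinglePageFetcher.extractTitle(SinglePageFetcher.fetchBody(url))` handles one page and follows no links. For anything beyond a quick check, parsing the body with jsoup is likely the better choice.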

If you really do want to use a crawler framework, here is the webmagic example:

package com.scistor.datavision.operator.webcrawler.webmagic;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // Part 1: site-level crawl configuration (encoding, crawl interval, retry count, etc.)
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    // process() is the core hook for custom crawler logic; the extraction rules go here
    @Override
    public void process(Page page) {
        // Part 2: define how to extract information from the page and store it
        page.putField("author",
                page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name",
                page.getHtml()
                        .xpath("//h1[@class='entry-title public']/strong/a/text()")
                        .toString());
        if (page.getResultItems().get("name") == null) {
            // No repository name found: skip this page so it is not sent to the pipeline
            page.setSkip(true);
        }
        page.putField("readme",
                page.getHtml().xpath("//div[@id='readme']/tidyText()"));

        // Part 3: discover follow-up URLs on the page and queue them for crawling
        page.addTargetRequests(page.getHtml().links()
                .regex("(https://github\\.com/\\w+/\\w+)").all());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                // Start crawling from "https://github.com/code4craft"
                .addUrl("https://github.com/code4craft")
                .addPipeline(new ConsolePipeline())
                // Crawl with 5 threads
                .thread(5)
                // Start the spider
                .run();
    }
}
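
The two regular expressions in `process` decide which author is recorded and which links are followed, so it can help to try them in isolation. The sketch below reuses the same patterns with plain `java.util.regex`, outside webmagic. The class `GithubUrlPatterns` and its method names are hypothetical, and it assumes find-style matching, which may differ in detail from how webmagic's selectors apply a regex.

```java
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it only mirrors the patterns used in process().
class GithubUrlPatterns {

    // Same pattern as the "author" field: capture the first path segment.
    private static final Pattern AUTHOR =
            Pattern.compile("https://github\\.com/(\\w+)/.*");

    // Same pattern as the follow-up links: owner/repo, nothing deeper.
    private static final Pattern REPO_LINK =
            Pattern.compile("https://github\\.com/\\w+/\\w+");

    // Returns the owner segment of a GitHub URL, if the pattern matches.
    static Optional<String> author(String url) {
        Matcher m = AUTHOR.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Collects the distinct owner/repo URLs found in a piece of text, in order.
    static Set<String> repoLinks(String text) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = REPO_LINK.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }
}
```

Two things stand out when experimenting with these patterns. The start URL `https://github.com/code4craft` has no slash after the owner name, so the author pattern does not match it. And `\w` does not match a hyphen, so owner or repository names containing `-` are missed or cut short. Whether that matters depends on which pages you intend to crawl.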
