WebMagic

罗汉翔

Last modified: 2022-04-13 14:46:23

Category: Java. Tags: GitHub

First published: 2022-04-13 11:57:52

Article link: https://blog.csdn.net/qq_44885775/article/details/124129875

WebMagic official site: Introduction · WebMagic Documents

Part1

Source: Introduction · WebMagic Documents

Method	Description	Example
create(PageProcessor)	Create a Spider	Spider.create(new GithubRepoProcessor())
addUrl(String…)	Add initial URLs	spider.addUrl("http://webmagic.io/docs/")
addRequest(Request...)	Add initial Requests	spider.addRequest(new Request("http://webmagic.io/docs/"))
thread(n)	Use n threads	spider.thread(5)
run()	Start the spider; blocks the current thread	spider.run()
start()/runAsync()	Start asynchronously; the current thread keeps running	spider.start()
stop()	Stop the spider	spider.stop()
test(String)	Fetch a single page as a test	spider.test("http://webmagic.io/docs/")
addPipeline(Pipeline)	Add a Pipeline; one Spider can have several Pipelines	spider.addPipeline(new ConsolePipeline())
setScheduler(Scheduler)	Set the Scheduler; one Spider can have only one Scheduler	spider.setScheduler(new RedisScheduler())
setDownloader(Downloader)	Set the Downloader; one Spider can have only one Downloader	spider.setDownloader(new SeleniumDownloader())
get(String)	Synchronous call that returns the result directly	ResultItems result = spider.get("http://webmagic.io/docs/")
getAll(String…)	Synchronous call that returns a list of results directly	List<ResultItems> results = spider.getAll("http://webmagic.io/docs/", "http://webmagic.io/xxx")

Site

Settings that belong to the site itself, such as encoding, HTTP headers, timeout, retry policy, and proxy, are all configured through the Site object.

Method	Description	Example
setCharset(String)	Set the encoding	site.setCharset("utf-8")
setUserAgent(String)	Set the User-Agent	site.setUserAgent("Spider")
setTimeOut(int)	Set the timeout, in milliseconds	site.setTimeOut(3000)
setRetryTimes(int)	Set the number of retries	site.setRetryTimes(3)
setCycleRetryTimes(int)	Set the number of cycle retries	site.setCycleRetryTimes(3)
addCookie(String,String)	Add one cookie	site.addCookie("dotcomt_user","code4craft")
setDomain(String)	Set the domain; addCookie only takes effect after the domain is set	site.setDomain("github.com")
addHeader(String,String)	Add one HTTP header	site.addHeader("Referer","https://github.com")
setHttpProxy(HttpHost)	Set an HTTP proxy	site.setHttpProxy(new HttpHost("127.0.0.1",8080))

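Cycle retry (cycleRetry) is a mechanism added in version 0.3.0.

It puts a URL whose download failed back at the tail of the queue and retries it until the retry limit is reached, so that pages are not missed because of transient network problems.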
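The queue behaviour described above can be illustrated with a small stand-alone simulation. This is only a sketch of the idea, not WebMagic's code: the class CycleRetryDemo, the method crawl, and the flaky downloader are hypothetical names made up for the illustration, and it assumes that a URL is dropped once it has been re-queued cycleRetryTimes times.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Stand-alone simulation of cycle retry; not WebMagic's actual implementation. */
final class CycleRetryDemo {

    /**
     * Processes urls in FIFO order. A url whose download fails goes back to the
     * tail of the queue until it has been re-queued cycleRetryTimes times.
     * Returns the order in which download attempts happened.
     */
    static List<String> crawl(List<String> seeds, int cycleRetryTimes,
                              Predicate<String> download) {
        Deque<String> queue = new ArrayDeque<>(seeds);
        Map<String, Integer> failures = new HashMap<>();
        List<String> attempts = new ArrayList<>();
        while (!queue.isEmpty()) {
            String url = queue.pollFirst();
            attempts.add(url);
            if (download.test(url)) {
                continue; // success: nothing more to do for this url
            }
            int failed = failures.merge(url, 1, Integer::sum);
            if (failed <= cycleRetryTimes) {
                queue.addLast(url); // failure: retry later, behind the other urls
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        // Hypothetical downloader: "b" fails on its first attempt only.
        Map<String, Integer> seen = new HashMap<>();
        Predicate<String> flaky =
                url -> !(url.equals("b") && seen.merge(url, 1, Integer::sum) == 1);
        System.out.println(crawl(Arrays.asList("a", "b", "c"), 3, flaky)); // prints [a, b, c, b]
    }
}
```

The failed URL is retried after "c" rather than immediately, which is the difference between cycle retry and an in-place retry.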

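WebMagic has supported HTTP proxies since version 0.4.0. Because proxy use cases vary so widely, this part of the API has never been stable, but the need is real, so WebMagic will keep improving proxy support. The API released so far is only a beta and may change later. All proxy-related settings live in the Site class.

API	Description
Site.setHttpProxy(HttpHost httpProxy)	Set a single plain HTTP proxy
Site.setUsernamePasswordCredentials(UsernamePasswordCredentials usernamePasswordCredentials)	Set the username and password for the HTTP proxy
Site.setHttpProxyPool(List<String[]> httpProxyList, boolean isUseLastProxy)	Set a proxy pool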

Setting a single plain HTTP proxy

The proxy is 101.101.101.101 on port 8888, with the username "username" and the password "password".

site.setHttpProxy(new HttpHost("101.101.101.101",8888))
    .setUsernamePasswordCredentials(new UsernamePasswordCredentials("username","password"));

Setting a proxy pool

The pool contains two IPs, 101.101.101.101 and 102.102.102.102.

Each entry passed to site.setHttpProxyPool carries a proxy IP and port (the entries below also carry a username and password). isUseLastProxy controls whether the proxy configuration from the previous run is reused after a restart.

List<String[]> poolHosts = new ArrayList<String[]>();
poolHosts.add(new String[]{"username","password","101.101.101.101","8888"});
poolHosts.add(new String[]{"username","password","102.102.102.102","8888"});
site.setHttpProxyPool(poolHosts,false);

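From version 0.6.0 on, you can provide your own proxy pool by implementing the ProxyPool interface. WebMagic's current pool logic is to use the IPs in the pool in turn; if an IP fails more than 20 times, its reuse interval is increased by 2 hours. See SimpleProxyPool for the actual implementation.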
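The rotation rule just described can be sketched in plain Java. This is an illustration under stated assumptions, not the code of SimpleProxyPool: the class RoundRobinProxyPool and its methods next and reportFailure are hypothetical, hosts are plain "ip:port" strings, and "increasing the reuse interval" is modelled simply as benching the host for two hours.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative round-robin pool; the real SimpleProxyPool may work differently. */
final class RoundRobinProxyPool {
    private static final int MAX_FAILURES = 20;                      // threshold from the text
    private static final long PENALTY_MILLIS = 2L * 60 * 60 * 1000;  // 2 hours

    private final List<String> hosts;
    private final Map<String, Integer> failures = new HashMap<>();
    private final Map<String, Long> usableAfter = new HashMap<>();
    private int cursor = 0;

    RoundRobinProxyPool(List<String> hosts) {
        this.hosts = new ArrayList<>(hosts);
    }

    /** Returns the next host whose penalty has expired, or null if none is usable. */
    String next(long nowMillis) {
        for (int i = 0; i < hosts.size(); i++) {
            String host = hosts.get(cursor);
            cursor = (cursor + 1) % hosts.size();
            if (usableAfter.getOrDefault(host, 0L) <= nowMillis) {
                return host;
            }
        }
        return null;
    }

    /** Records a failure; past the threshold the host is benched for two hours. */
    void reportFailure(String host, long nowMillis) {
        int count = failures.merge(host, 1, Integer::sum);
        if (count > MAX_FAILURES) {
            usableAfter.put(host, nowMillis + PENALTY_MILLIS);
            failures.put(host, 0); // start counting again after the penalty
        }
    }
}
```

Passing the clock in as nowMillis keeps the sketch deterministic; a real pool would more likely read System.currentTimeMillis() itself.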

Part2

Source: WebMagic基本使用_小白学习之旅的博客-CSDN博客_webmagic教程

Part3

WebMagic基本使用_小白学习之旅的博客-CSDN博客_webmagic教程

Part4

4.7 配置代理 · WebMagic Documents

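Starting with version 0.7.1, WebMagic uses a new proxy API, ProxyProvider. Whereas Site holds "configuration", ProxyProvider is positioned more as a "component", so the proxy is no longer set on Site but on HttpClientDownloader.

API	Description
HttpClientDownloader.setProxyProvider(ProxyProvider proxyProvider)	Set the proxy

Set a single plain HTTP proxy at 101.101.101.101, port 8888, with the username "username" and the password "password":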

HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
httpClientDownloader.setProxyProvider(SimpleProxyProvider.from(new Proxy("101.101.101.101",8888,"username","password")));
spider.setDownloader(httpClientDownloader);

Set a proxy pool containing two IPs, 101.101.101.101 and 102.102.102.102, without a password:

HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
httpClientDownloader.setProxyProvider(SimpleProxyProvider.from(new Proxy("101.101.101.101",8888),new Proxy("102.102.102.102",8888)));

Part5 Spider monitoring

4.6 爬虫的监控 · WebMagic Documents

On Windows, run jconsole.exe from the command prompt.

Part6

Java爬虫框架WebMagic的介绍及使用(定时任务、代理)_Piconjo_Official的博客-CSDN博客_webmagic定时任务

Extraction with CSS selectors

page.putField("ul",page.getHtml().css("div.right-1 ul").all());

page.putField("ul",page.getHtml().$("div.right-1 ul").all());

css() is equivalent to $()

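import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

// Implement the PageProcessor interface and override its methods
public class JobProcessor implements PageProcessor {

    // Parse the page
    @Override
    public void process(Page page)
    {
        // Put the result into ResultItems as a key-value pair
        page.putField("ul",page.getHtml().css("div.right-1 ul").all());
    }

    private Site site=Site.me();

    @Override
    public Site getSite()
    {
        return site;
    }

    // Main method: run the spider
    public static void main(String[] args)
    {
        // Create the Spider with the processor, add the start URL, and run it
        Spider.create(new JobProcessor()).addUrl("http://www.zjitc.net/xwzx/tztg.htm").run();
    }
}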

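Getting elements
A single extraction rule, whether XPath, a CSS selector, or a regular expression, may match more than one element.
WebMagic offers different APIs for retrieving one element or all of them.
To return a single String result (the first match by default):

get()
Example: String link = html.links().get();

toString()
Example: String link = html.links().toString();

To return all extraction results:

all()
Example: List<String> links = html.links().all();

// Use every a tag inside the ul of the div with class right-1 as a target link
page.addTargetRequests(page.getHtml().css("div.right-1 ul a").links().all());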

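Site: spider configuration

Site.me() lets you configure the spider, including the character encoding, crawl interval, timeout, and number of retries.

private Site site=Site.me()
            .setCharset("utf8") // character encoding
            .setTimeOut(10000) // timeout, in milliseconds
            .setRetrySleepTime(3000) // interval between retries, in milliseconds
            .setRetryTimes(3); // number of retries

public Site getSite() {
        return site;
}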

Other settings:
setUserAgent(String): set the User-Agent
addCookie(String,String): add a cookie
setDomain(String): set the domain
addHeader(String,String): add a request header
setHttpProxy(HttpHost): set an HTTP proxy

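Returning the Site

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class Xxxxxxxx implements PageProcessor {

    private Site site=Site.me()
            .setCharset("utf8") // character encoding
            .setTimeOut(10000) // timeout, in milliseconds
            .setRetrySleepTime(3000) // interval between retries, in milliseconds
            .setRetryTimes(3); // number of retries

    @Override
    public void process(Page page) {
        // page parsing goes here
    }

    @Override
    public Site getSite() {
        return site;
    }

}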

A spider that uses a proxy

@Scheduled(fixedDelay = 1000)
public void Process()
{
    // Create the Downloader
    HttpClientDownloader httpClientDownloader=new HttpClientDownloader();

    // Give the downloader the proxy server details
    httpClientDownloader.setProxyProvider(SimpleProxyProvider.from(new Proxy("183.91.33.41",89)));

    Spider.create(new ProxyTest())
            .addUrl("http://ip.chinaz.com/")
            // Set the downloader
            .setDownloader(httpClientDownloader)
            .run();
}

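Scheduled tasks

Scheduling is done with Spring Task, which is built into Spring.
This scheduling feature was added in Spring 3.0.
The @Scheduled annotation starts the spider on a schedule to crawl data.

Attributes:
cron: a cron expression that runs the task at specific times
fixedDelay: how long to wait after the previous run finishes before running again; type long, in milliseconds
fixedDelayString: how long to wait after the previous run finishes before running again; type String, in milliseconds
fixedRate: run the task at a fixed frequency; type long, in milliseconds
fixedRateString: run the task at a fixed frequency; type String, in milliseconds
initialDelay: how long to wait before the first run; type long, in milliseconds
initialDelayString: how long to wait before the first run; type String, in milliseconds
zone: the time zone; defaults to the current time zone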
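The difference between fixedDelay and fixedRate is easiest to see by computing when each run would start. The following is a small model rather than Spring code: the class ScheduleModel and the run durations are hypothetical, and it assumes a single scheduler thread, so runs never overlap and an overrunning task pushes the next fixedRate run back.

```java
import java.util.ArrayList;
import java.util.List;

/** Models run start times on a single scheduler thread; not Spring's implementation. */
final class ScheduleModel {

    /** fixedDelay: the next run starts delay ms after the previous run ends. */
    static List<Long> fixedDelayStarts(long delay, long[] durations) {
        List<Long> starts = new ArrayList<>();
        long start = 0;
        for (long duration : durations) {
            starts.add(start);
            start = start + duration + delay;
        }
        return starts;
    }

    /**
     * fixedRate: runs are due every period ms counted from the first start.
     * A run that overruns its slot delays the next one until it has finished.
     */
    static List<Long> fixedRateStarts(long period, long[] durations) {
        List<Long> starts = new ArrayList<>();
        long due = 0;   // time at which the next run is scheduled
        long free = 0;  // time at which the previous run finished
        for (long duration : durations) {
            long start = Math.max(due, free);
            starts.add(start);
            free = start + duration;
            due += period;
        }
        return starts;
    }

    public static void main(String[] args) {
        long[] durations = {300, 1500, 300}; // hypothetical run times in ms
        System.out.println(fixedDelayStarts(1000, durations)); // prints [0, 1300, 3800]
        System.out.println(fixedRateStarts(1000, durations));  // prints [0, 1000, 2500]
    }
}
```

For a crawler whose run time varies with the site, fixedDelay guarantees a pause between runs, while fixedRate tries to keep to the clock.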

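cron expressions
Some jobs have stricter requirements: they must run at specific times rather than at a fixed interval.
That is when a cron expression is needed.
A cron expression is a schedule made of up to seven sub-expressions, each describing one detail of the timing.
The sub-expressions are separated by spaces and mean, in order:

1. Seconds 0-59
2. Minutes 0-59
3. Hours 0-23
4. Day-of-Month 1-31
5. Month 1-12 or the strings JAN FEB…
6. Day-of-Week 1-7 or the strings SUN MON…
7. Year (optional in Quartz; Spring's @Scheduled accepts only the first six fields)
Special characters:
/ means "every"; for example, 0/15 in the Minutes field means every 15 minutes starting at minute 0
? means "no specific value"; it is used in Day-of-Month or Day-of-Week when the other of the two is specified
* means every value in the field
L means the last day of the month or week, or, after a weekday number, the last such weekday of the month
Example: 6L in Day-of-Week means the last Friday of the month (Quartz numbering, 1 = SUN)

So 0 0 12 ? * WED means 12:00 noon every Wednesday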
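To make the start/step notation concrete, here is a minimal sketch that expands one cron field into the values it matches. It is not how Spring or Quartz parse expressions: the class CronFieldDemo and the method expand are hypothetical, and only the forms *, a single number, and start/step are handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Expands one simplified cron field; handles only "*", "n", and "start/step". */
final class CronFieldDemo {

    static List<Integer> expand(String field, int min, int max) {
        List<Integer> values = new ArrayList<>();
        if (field.equals("*")) {
            for (int v = min; v <= max; v++) {
                values.add(v);
            }
        } else if (field.contains("/")) {
            String[] parts = field.split("/");
            int start = parts[0].equals("*") ? min : Integer.parseInt(parts[0]);
            int step = Integer.parseInt(parts[1]);
            if (step <= 0) {
                throw new IllegalArgumentException("step must be positive: " + field);
            }
            for (int v = start; v <= max; v += step) {
                values.add(v);
            }
        } else {
            values.add(Integer.parseInt(field));
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(expand("0/15", 0, 59)); // prints [0, 15, 30, 45]
        System.out.println(expand("0/8", 0, 59));  // prints [0, 8, 16, 24, 32, 40, 48, 56]
    }
}
```

Note that 0/8 in the Seconds field restarts at second 0 of every minute, so the gap between second 56 and the next second 0 is only 4 seconds.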

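import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class TaskTest {

    // Fires every 8 seconds, starting at second 0 of each minute
    @Scheduled(cron = "0/8 * * * * *")
    public void test()
    {
        System.out.println("scheduled task start");
    }
}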

Part7

webMagic发送post请求绑定参数_简单.的博客-CSDN博客_webmagic设置请求头

WebMagic 0.7.0 removed the old approach of setting NameValuePair in request.extra; use RequestBody instead.

WebMagic versions below 0.7.0 use extras

Request request = new Request("");
request.setMethod(HttpConstant.Method.POST);
NameValuePair[] nameValuePair = new NameValuePair[] {
        new BasicNameValuePair("id","100"), new BasicNameValuePair("tag","2")};
request.putExtra("nameValuePair", nameValuePair);
spider.addRequest(request);

WebMagic 0.7.0 and above use RequestBody

Request request = new Request("");
request.setMethod(HttpConstant.Method.POST);
request.setRequestBody(HttpRequestBody.json("{\"id\":1}","utf-8"));
