Learning WebMagic and WebCollector

Latest recommended article published on 2022-12-29 11:51:40

weixin_43030390

Categories: java, crawlers

Article link: https://blog.csdn.net/weixin_43030390/article/details/85072529

Preface

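To collect some reasonably realistic test data, I studied two Java crawler frameworks, WebMagic and WebCollector.
I ran into several pitfalls with WebMagic, to the point that I had to give up on it and switch to WebCollector.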

WebMagic

slf4j logging problem

This one is easy to solve: just exclude the slf4j-log4j12 binding that webmagic-core pulls in.

		<dependency>
			<groupId>us.codecraft</groupId>
			<artifactId>webmagic-core</artifactId>
			<version>0.7.3</version>
			<exclusions>
				<exclusion>
					<groupId>org.slf4j</groupId>
					<artifactId>slf4j-log4j12</artifactId>
				</exclusion>
			</exclusions>
		</dependency>
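
To confirm that the exclusion took effect, one option is to count the SLF4J bindings visible on the classpath. The following is a minimal sketch using only the JDK; it assumes SLF4J 1.x, which discovers its binding through the resource `org/slf4j/impl/StaticLoggerBinder.class`. The class name `Slf4jBindingCheck` and the helper `findOnClasspath` are made up for illustration and are not part of WebMagic or SLF4J.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class Slf4jBindingCheck {
    // Resource that SLF4J 1.x looks up to find a logging binding (assumption: SLF4J 1.x).
    static final String BINDER_PATH = "org/slf4j/impl/StaticLoggerBinder.class";

    // Returns every classpath location that provides the given resource.
    static List<URL> findOnClasspath(String resourcePath) {
        try {
            ClassLoader loader = Slf4jBindingCheck.class.getClassLoader();
            return Collections.list(loader.getResources(resourcePath));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<URL> bindings = findOnClasspath(BINDER_PATH);
        // More than one entry means several bindings are competing.
        System.out.println("SLF4J bindings found: " + bindings.size());
        for (URL url : bindings) {
            System.out.println("  " + url);
        }
    }
}
```

Run it with the same classpath as the crawler; if slf4j-log4j12 still shows up in the list, the exclusion is probably not being applied to the dependency that brings it in.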

httpcore and httpclient jar conflict

Here I picked a pair of versions that works together:

		<dependency>
			<groupId>org.apache.httpcomponents</groupId>
			<artifactId>httpclient</artifactId>
			<version>4.5.5</version>
		</dependency>
		<dependency>
			<groupId>org.apache.httpcomponents</groupId>
			<artifactId>httpcore</artifactId>
			<version>4.4</version>
		</dependency>
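
When a jar conflict like this is suspected, it can help to print which jar a class is actually loaded from at runtime. The sketch below uses only the JDK; the class name `JarLocator` and the helper `locate` are hypothetical. `org.apache.http.client.HttpClient` lives in httpclient and `org.apache.http.HttpEntity` in httpcore, so the two lines of output should point at the versions declared above.

```java
import java.security.CodeSource;

public class JarLocator {
    // Returns the location a class was loaded from, or a short marker if unknown.
    static String locate(String className) {
        try {
            // Load without initializing the class; we only need its origin.
            Class<?> type = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            // Classes from the JDK itself have no code source.
            return source == null ? "(JDK runtime)" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        System.out.println("httpclient: " + locate("org.apache.http.client.HttpClient"));
        System.out.println("httpcore:   " + locate("org.apache.http.HttpEntity"));
    }
}
```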

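HTTPS sites that require TLS 1.2 are not supported

The crawl fails with javax.net.ssl.SSLException: Received fatal alert: protocol_version
The WebMagic author posted a workaround and said the next version would fix it, but there has been no update for a long time, so the project is probably abandoned.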

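		1. Copy the HttpClientDownloader and HttpClientGenerator classes
			into your own project.
		2. In the copied HttpClientGenerator class, change the
			buildSSLConnectionSocketFactory method so that it returns:
			return new SSLConnectionSocketFactory(createIgnoreVerifySSL(),
					new String[]{"SSLv3", "TLSv1", "TLSv1.1", "TLSv1.2"},
					null,
					new DefaultHostnameVerifier()); // bypass certificate verification first
		3. In the copied HttpClientDownloader class, replace the imported
			HttpClientGenerator with the modified copy in your own project.
		4. When starting the spider, call setDownloader with the
			HttpClientDownloader class from your own project:
			Spider
				.create(new webMagicDemo())
				.setDownloader(new HttpClientDownloader())
				.addUrl("http://www.dytt8.net")
				.thread(1)
				.run();
		PS: if the Registry used in the HttpClientGenerator constructor fails
			to compile after copying, you can drop its generic type parameter.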
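
The protocol list in step 2 only helps if the running JVM knows those protocol names. As a rough diagnostic, the sketch below prints what the JVM supports and filters a wanted list against it, using only the JDK. The class `TlsProtocolCheck` and its helpers are hypothetical names and are not part of WebMagic or HttpClient.

```java
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;

public class TlsProtocolCheck {
    // Protocol names the default SSLContext of this JVM supports.
    static List<String> supportedByJvm() {
        try {
            return Arrays.asList(
                    SSLContext.getDefault().getSupportedSSLParameters().getProtocols());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    // Keeps only the wanted protocols that are also supported, in the wanted order.
    static List<String> usable(List<String> wanted, List<String> supported) {
        List<String> result = new ArrayList<>(wanted);
        result.retainAll(supported);
        return result;
    }

    public static void main(String[] args) {
        List<String> wanted = Arrays.asList("SSLv3", "TLSv1", "TLSv1.1", "TLSv1.2");
        List<String> supported = supportedByJvm();
        System.out.println("Supported by JVM: " + supported);
        System.out.println("Usable from list: " + usable(wanted, supported));
    }
}
```

If TLSv1.2 is missing from the first line of output, changing the socket factory alone is unlikely to remove the protocol_version error; upgrading the JDK would be the next thing to try.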

demo

If all goes well, it should work at this point... I had no such luck (or skill).

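import java.util.concurrent.atomic.AtomicInteger;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class MyProcessor implements PageProcessor {
    // Site-level crawl settings: encoding, crawl interval, retry count, etc.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);
    // Updated from several worker threads, so use an atomic counter
    private static final AtomicInteger count = new AtomicInteger();

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        // Post pages look like http(s)://www.cnblogs.com/<letters, digits, hyphens>/p/<7 digits>.html
        if (!page.getUrl().regex("https?://www\\.cnblogs\\.com/[a-z0-9-]+/p/[0-9]{7}\\.html").match()) {
            // Not a post page: queue the post links found in the list
            page.addTargetRequests(
                    page.getHtml().xpath("//*[@id=\"post_list\"]/div/div[@class='post_item_body']/h3/a/@href").all());
        } else {
            // Post page: extract the content we need
            System.out.println("Scraped content: " +
                    page.getHtml().xpath("//*[@id=\"Header1_HeaderTitle\"]/text()").get());
            count.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        System.out.println("Starting crawl...");
        long startTime = System.currentTimeMillis();
        Spider.create(new MyProcessor()).addUrl("https://www.cnblogs.com/").thread(5).run();
        long endTime = System.currentTimeMillis();
        System.out.println("Crawl finished in about " + ((endTime - startTime) / 1000)
                + " seconds, scraped " + count.get() + " records");
    }
}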
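
A link filter of this kind can be checked in isolation with plain java.util.regex before running the spider. The sketch below compares an http-only pattern with one that also accepts https. The sample URL is hypothetical, `UrlPatternCheck` is a made-up class name, and `Matcher.matches()` is only an approximation of how WebMagic applies the pattern.

```java
import java.util.regex.Pattern;

public class UrlPatternCheck {
    // Pattern that only accepts the http scheme (dots unescaped, spaces in the class).
    static final Pattern HTTP_ONLY =
            Pattern.compile("http://www.cnblogs.com/[a-z 0-9 -]+/p/[0-9]{7}.html");

    // Variant that accepts http or https and escapes the literal dots.
    static final Pattern EITHER_SCHEME =
            Pattern.compile("https?://www\\.cnblogs\\.com/[a-z0-9-]+/p/[0-9]{7}\\.html");

    // True when the whole URL matches the given pattern.
    static boolean isPostUrl(Pattern pattern, String url) {
        return pattern.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Hypothetical post URL, used only to exercise the patterns.
        String url = "https://www.cnblogs.com/example-user/p/1234567.html";
        System.out.println("http-only pattern matches:     " + isPostUrl(HTTP_ONLY, url));
        System.out.println("either-scheme pattern matches: " + isPostUrl(EITHER_SCHEME, url));
    }
}
```

With an https seed URL, a pattern that only accepts http would classify every page as a list page, so the content branch would never run. That is one thing worth ruling out, although it may not be the whole story here.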

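Unsolved problem

The spider could not fetch any HTML. After getting through every other obstacle, I fell at the very last hurdle... Any experts out there? Advice is welcome.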

WebCollector

No problems at all! It crawled everything perfectly.
