Keyword Switching for Multi-Topic Crawlers Built on the webmagic Framework

Latest recommended article published on 2020-12-30 09:04:13

zhg_vincent

Categories: Java crawlers, Notes. Tags: webmagic, multi-topic keyword switching

Article link: https://blog.csdn.net/vincent2014linux/article/details/90377313

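1. Background

In a multi-topic crawler, we usually start by analyzing the URL patterns of the target site (list pages in particular), then predefine keywords according to the project requirements, so that the URLs to be crawled (the seed URLs) can be controlled precisely.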

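1.1 Analysis, part one

Keywords appear in URLs in many situations: a particular section of a site, an AJAX request sent by some module, and so on.

Example: to crawl tourist-attraction information for Hangzhou from the ly.com (Tongcheng Travel) site, the URL is https://so.ly.com/hot?q=%E6%9D%AD%E5%B7%9E

Here %E6%9D%AD%E5%B7%9E is the percent-encoded (URL-encoded) UTF-8 form of the keyword 杭州 (Hangzhou). The source of this value is the UTF-8 byte sequence of the keyword, not its raw Unicode code points.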

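Example: for domestic tours from Hangzhou to Beijing on the same site, the URL is:

https://gny.ly.com/list?src=%E6%9D%AD%E5%B7%9E&dest=%E5%8C%97%E4%BA%AC&prop=0

This is equivalent to https://gny.ly.com/list?src=杭州&dest=北京&prop=0 and, in practice, also to https://gny.ly.com/list?src=杭州&dest=北京. Entering the URL in a browser shows the first page of the list for this topic. Clicking "next page" shows that the URL of the second page is: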

https://gny.ly.com/list?src=%E6%9D%AD%E5%B7%9E&dest=%E5%8C%97%E4%BA%AC&start=2

The third page is:

https://gny.ly.com/list?src=%E6%9D%AD%E5%B7%9E&dest=%E5%8C%97%E4%BA%AC&start=3

...

https://gny.ly.com/list?src=%E6%9D%AD%E5%B7%9E&dest=%E5%8C%97%E4%BA%AC&start=n

From this we can derive the URL construction rule for this module: https://gny.ly.com/list?src= + keyword 1 (URL-encoded) + "&dest=" + keyword 2 (URL-encoded) + "&start=" + index (the page index).
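
The construction rule above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name ListUrlBuilder and its methods are hypothetical and not part of the original project, and UTF-8 is assumed as the keyword charset because it reproduces the encoded values shown in the URLs above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not from the original project) that applies the
// rule: base + src=<kw1> + "&dest=" + <kw2> + "&start=" + <page index>.
class ListUrlBuilder {
    static final String BASE = "https://gny.ly.com/list?src=";

    // Percent-encodes a keyword using its UTF-8 bytes (assumed charset).
    static String encode(String keyword) {
        try {
            return URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM.
            throw new IllegalStateException(e);
        }
    }

    // Builds the list-page URL for one keyword pair and one page index.
    static String build(String src, String dest, long pageIndex) {
        return BASE + encode(src) + "&dest=" + encode(dest) + "&start=" + pageIndex;
    }

    public static void main(String[] args) {
        // "\u676D\u5DDE" is Hangzhou, "\u5317\u4EAC" is Beijing.
        for (long page = 2; page <= 3; page++) {
            System.out.println(build("\u676D\u5DDE", "\u5317\u4EAC", page));
        }
    }
}
```

Keeping the encoding step in one place makes it easy to change the charset for sites that expect something other than UTF-8.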

Another example is Baidu News, whose keyword-search URLs look like this:

https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&rsv_dl=ns_pc&word=浙江++消防&pn=10

https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&rsv_dl=ns_pc&word=浙江++消防&pn=20

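1.2 Analysis, part two

Abstracting one step further: the keywords are usually set in a configuration file or stored in a database. The crawler reads them from there and stores them in kw1List and kw2List.

Examples of the two approaches follow.

Configuration 1 (YAML file)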

filters:
  searchfilter:
    kwfixvalue: [ 浙江, 江苏, 上海, 北京, 天津 ]
    kwvalue: [ 火灾, 坍塌, 爆炸, 事故, 安全, 伤亡 ]

Configuration 2 (database)

Origin city ID	Origin city name	Destination ID	Destination name
0510	无锡	0571	杭州
001	北京	021	南京
0519	常州	0996	乌鲁木齐

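The logic that builds the next list-page URL (that is, paging through the list) needs the following variables: the current keyword 1, the current keyword 2, index1 (the position of the current keyword 1 in list1), index2 (the position of the current keyword 2 in list2), and the index of the page already crawled (the page number shown by the site).

2. Solution

Based on the analysis above, we extract the keyword selection and switching from the list-page URL construction logic and define it as a POJO class, which can be named KeywordOptions. Its fields and code are as follows:

currentPage: current page index (which page of the search results for keyword 1 + keyword 2)
currentFixIndex: index of keyword 2 in list2
kwFixValue: keyword 2
currentIndex: index of keyword 1 in list1
kwValue: keyword 1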

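public class KeywordOptions {
    private Long currentPage;          // page index within the current keyword pair's results
    private Integer currentFixIndex;   // index of keyword 2 in its list
    private String kwFixValue = null;  // keyword 2
    private Integer currentIndex;      // index of keyword 1 in its list
    private String kwValue = null;     // keyword 1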

    public KeywordOptions() {
    }

    public Long getCurrentPage() {
        return this.currentPage;
    }

    public void setCurrentPage(Long currentPage) {
        this.currentPage = currentPage;
    }

    public Integer getCurrentFixIndex() {
        return this.currentFixIndex;
    }

    public void setCurrentFixIndex(Integer currentFixIndex) {
        this.currentFixIndex = currentFixIndex;
    }

    public Integer getCurrentIndex() {
        return this.currentIndex;
    }

    public void setCurrentIndex(Integer currentIndex) {
        this.currentIndex = currentIndex;
    }

    public String getKwFixValue() {
        return this.kwFixValue;
    }

    public void setKwFixValue(String kwFixValue) {
        this.kwFixValue = kwFixValue;
    }

    public String getKwValue() {
        return this.kwValue;
    }

    public void setKwValue(String kwValue) {
        this.kwValue = kwValue;
    }
}

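Next, we write an abstract class BasePageProcessor that implements webmagic's PageProcessor interface, and put the methods needed by the common business requirements into it. The first one is the keyword-switching logic: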

// Advances ko to the next keyword combination.
// Returns false when no search filter is configured or every combination has been visited.
private boolean nextKeyword(KeywordOptions ko) {
    if (this.searchFilterConfig == null || this.kwValues == null || this.kwValues.isEmpty()) {
        return false;
    }
    int kwSize = this.kwValues.size();
    int kwFixSize = (this.kwFixValues == null) ? 0 : this.kwFixValues.size();

    if (ko.getCurrentIndex() >= kwSize - 1) {
        // kwValues is exhausted: wrap around to its first entry and move to the next fixed keyword.
        ko.setCurrentIndex(0);
        if (ko.getCurrentFixIndex() >= kwFixSize - 1) {
            // The fixed keywords are exhausted as well, so there is nothing left to switch to.
            return false;
        }
        ko.setCurrentFixIndex(ko.getCurrentFixIndex() + 1);
    } else {
        // Otherwise just move to the next entry in kwValues.
        ko.setCurrentIndex(ko.getCurrentIndex() + 1);
    }

    // Refresh the keyword values from the updated indexes.
    // The casts are kept because the field declarations are not shown here.
    ko.setKwValue((String) this.kwValues.get(ko.getCurrentIndex()));
    if (this.kwFixValues != null) {
        ko.setKwFixValue((String) this.kwFixValues.get(ko.getCurrentFixIndex()));
    }
    return true;
}
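
To make the visiting order of the switching logic easier to see, here is a minimal standalone sketch of the same "odometer" idea: the inner keyword index advances first, and only when it wraps around does the fixed keyword index advance. The class KeywordOdometer is hypothetical and independent of webmagic, and it assumes that crawling starts at index 0 of both lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, framework-free illustration of the switching order.
class KeywordOdometer {
    // Returns the "fixed+keyword" pairs in the order they would be visited.
    static List<String> visitOrder(List<String> kwFixValues, List<String> kwValues) {
        List<String> visited = new ArrayList<>();
        if (kwFixValues.isEmpty() || kwValues.isEmpty()) {
            return visited;
        }
        int fixIndex = 0; // plays the role of currentFixIndex
        int index = 0;    // plays the role of currentIndex
        while (true) {
            visited.add(kwFixValues.get(fixIndex) + "+" + kwValues.get(index));
            if (index < kwValues.size() - 1) {
                index++;                 // next inner keyword
            } else if (fixIndex < kwFixValues.size() - 1) {
                index = 0;               // wrap the inner list
                fixIndex++;              // advance the fixed keyword
            } else {
                break;                   // every combination visited
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> order = visitOrder(Arrays.asList("A", "B"), Arrays.asList("x", "y", "z"));
        System.out.println(order);
    }
}
```

With two fixed keywords and three inner keywords the sketch visits six combinations, all three inner keywords for the first fixed keyword and then all three for the second.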

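The method that builds a URL from a KeywordOptions object is shown below. It is declared public so that subclasses can override it for sites with different URL construction rules.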

public String koToUrl(KeywordOptions ko) {
    // baseUrl presumably ends where the page index belongs, so the page index is appended first.
    StringBuilder builder = new StringBuilder(this.baseUrl);
    builder.append(ko.getCurrentPage());
    // Without a search filter, or without any keyword, the URL is just baseUrl + page index.
    if (this.searchFilterConfig == null
            || (ko.getKwValue() == null && ko.getKwFixValue() == null)) {
        return builder.toString();
    }
    builder.append("&");
    if (ko.getKwValue() != null) {
        builder.append(this.encodeKeyword(ko.getKwValue()));
    }
    if (ko.getKwFixValue() != null) {
        // The two keywords are joined with "+".
        builder.append("+").append(this.encodeKeyword(ko.getKwFixValue()));
    }
    return builder.toString();
}

// URL-encodes a keyword with kwCharset, or returns it unchanged when no charset is configured.
// As in the original logic, a keyword that cannot be encoded is left out of the URL.
private String encodeKeyword(String keyword) {
    if (this.kwCharset == null) {
        return keyword;
    }
    try {
        return URLEncoder.encode(keyword, this.kwCharset);
    } catch (UnsupportedEncodingException e) {
        logger.warn("unsupported keyword charset {}", this.kwCharset, e);
        return "";
    }
}

Finally, we obtain the request for the next list page (which wraps the URL):

public synchronized Request nextListPage(KeywordOptions ko) {
    // Do nothing if the task has finished or list switching is locked.
    if (this.listAddLock || this.isComplete) {
        return null;
    }
    // Get the configuration-file parser instance.
    ConfigParser parser = ConfigParser.getInstance();
    Boolean fixed = (Boolean) parser.getValue(this.commonConfig, "fixed", false, this.spiderConfig.getConfigPath() + ".common");
    // A fixed page URL has no next list page.
    if (Boolean.TRUE.equals(fixed)) {
        return null;
    }
    if (ko.getCurrentPage() >= this.totalPages) {
        // Last list page for the current keywords: reset to the first page and switch keywords.
        ko.setCurrentPage(Long.valueOf(String.valueOf(this.commonConfig.get("firstpage"))));
        if (!this.nextKeyword(ko)) {
            // No keyword combination is left, so the task is complete.
            this.isComplete = true;
            return this.nextListPageHook((Request) null);
        }
    } else {
        // Not the last page: increase the current page index by one.
        ko.setCurrentPage(ko.getCurrentPage() + 1L);
    }
    String url = this.koToUrl(ko);
    return this.nextListPageHook(this.pushRequest(url, ko));
}
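
The combined effect of paging and keyword switching can be illustrated with a small standalone simulation. It is only a sketch under simplifying assumptions: the class ListPageSimulator is hypothetical, it uses a single keyword list, and it assumes every keyword has the same number of result pages, whereas in the real processor totalPages is a field that is presumably updated per list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of "page through the list, then switch keyword".
class ListPageSimulator {
    // Returns "keyword#page" entries in the order the list pages would be requested.
    static List<String> crawlOrder(List<String> keywords, long firstPage, long totalPages) {
        List<String> order = new ArrayList<>();
        if (keywords.isEmpty()) {
            return order;
        }
        int kwIndex = 0;
        long page = firstPage;
        order.add(keywords.get(kwIndex) + "#" + page); // the seed request
        while (true) {
            if (page >= totalPages) {
                // Last page reached: reset the page index and switch keyword.
                page = firstPage;
                if (kwIndex >= keywords.size() - 1) {
                    break; // no keyword left, the task is complete
                }
                kwIndex++;
            } else {
                page++; // not the last page: go to the next page
            }
            order.add(keywords.get(kwIndex) + "#" + page);
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(crawlOrder(Arrays.asList("a", "b"), 1, 3));
    }
}
```

Each keyword is paged from its first to its last page before the next keyword starts, which mirrors the branch structure of nextListPage.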

The page-processing logic in BasePageProcessor is as follows:

public void process(Page page) {
    if (page.getUrl().toString().contains(this.baseUrl)) {
        // List page. Status code 600 is a custom error code that marks a failed download.
        if (page.getStatusCode() == 600) {
            this.listAddLock = false;
            return;
        }
        // Parse the list page; concrete crawlers override processList(page).
        if (this.processList(page)) {
            this.processSuccessListPageCount.incrementAndGet();
            logger.info("list page crawl success url={}", page.getUrl());
            this.listAddLock = false;
        } else {
            this.processErrorListPageCount.incrementAndGet();
            logger.warn("list page crawl failed url={}", page.getUrl());
        }
        // Every list request stores its KeywordOptions instance as JSON in the "ko" extra.
        KeywordOptions ko = JSON.parseObject((String) page.getRequest().getExtra("ko"), KeywordOptions.class);
        if (ko != null) {
            // Pass the current keyword on to the detail-page requests found on this list page.
            for (Request request : page.getTargetRequests()) {
                request.putExtra("kw", ko.getKwValue());
            }
        }
        // Get the next list page. Without a KeywordOptions instance there is nothing to advance,
        // and calling nextListPage(null) could throw a NullPointerException.
        Request listpage = (ko != null) ? this.nextListPage(ko) : null;
        if (listpage != null) {
            listpage.putExtra("nocheckdup", true);
            page.putField("listPage", listpage);
        }
    } else {
        // Detail page: check for a download error first as well.
        if (page.getStatusCode() == 600) {
            return;
        }

        try {
            // processPage is also overridden by the concrete crawler.
            this.processPage(page);
            this.processSuccessPageCount.incrementAndGet();
        } catch (Exception e) {
            this.processErrorPageCount.incrementAndGet();
            logger.warn("page process failed url={}", page.getUrl(), e);
        }

        ResultItems items = page.getResultItems();
        String keyword = (String) page.getRequest().getExtra("kw");
        if (keyword == null) {
            // Fall back to the first configured keyword when the request carries none.
            keyword = (this.kwValues != null && !this.kwValues.isEmpty()) ? (String) this.kwValues.get(0) : null;
        }

        if (keyword != null) {
            // Tag every extracted record with the keyword that led to it.
            for (Map.Entry<String, Object> entry : items.getAll().entrySet()) {
                Object value = entry.getValue();
                if (value instanceof Map) {
                    ((Map<String, Object>) value).put("keyword", keyword);
                }
            }
        }
    }
}
