Scraping Yahoo! Answers Data with Web-Harvest

moonsheep_liu

Category: Web Crawlers. Tags: yahoo encoding class url function list

Article link: https://blog.csdn.net/moonsheep_liu/article/details/7230903

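The previous (reposted) article gave a brief introduction to using Web-Harvest. This article uses scraping Yahoo! Answers data as an example to point out issues to watch for along the way. The best usage documentation is, of course, the user manual on the official website.

Web-Harvest is distributed in three packages; this article uses the source package. The most important part of any scraping job is writing the config file. The source package contains a Java class, Test.java, whose source code is as follows: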

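import java.io.IOException;

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

public class Test {

    public static void main(String[] args) throws IOException {
        // Load the scraping configuration from the XML file.
        ScraperConfiguration config = new ScraperConfiguration("e:/temp/yahooanswer/auto racing.xml"); //line a
        // Create the scraper; the second argument is its working directory.
        Scraper scraper = new Scraper(config, "e:/temp/wikianswer"); //line b

        scraper.setDebug(true);

        // Run the scraper and report the elapsed time in milliseconds.
        long startTime = System.currentTimeMillis();
        scraper.execute();
        System.out.println("time elapsed: " + (System.currentTimeMillis() - startTime));
    }

}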

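The .xml file on line a is the scraping configuration, and line b gives the path where the scraped data is stored (the scraper's working directory). The program fetches the first 5 pages of resolved questions in the Yahoo! Answers category sports/auto racing, 20 per page, and writes them to a file in the following format:

Next, let's look at auto racing.xml. The XML file is as follows: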

<?xml version="1.0" encoding="utf-8"?>

<include path="functions.xml"/>

<var-def name="home">http://answers.yahoo.com</var-def>

<var-def name="QALinks"> //defines the variable QALinks, whose value is the function download-m
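
The run described earlier covers the first 5 pages of resolved questions at 20 entries per page, so a complete crawl should yield 100 entries. The plain-Java sketch below only illustrates that paging arithmetic. It is not part of Web-Harvest, and the `PageUrls` class, the `build` method, the base URL, and the `page` query parameter are all hypothetical placeholders rather than the real Yahoo! Answers URL scheme or anything taken from auto racing.xml.

```java
import java.util.ArrayList;
import java.util.List;

public class PageUrls {

    // Builds one URL per listing page, numbered from 1 to pages.
    // Assumption: the "?page=N" parameter is a placeholder, not the site's real scheme.
    public static List<String> build(String baseUrl, int pages) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= pages; page++) {
            urls.add(baseUrl + "?page=" + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        int pages = 5;     // first 5 pages, as in the article
        int perPage = 20;  // 20 entries per page, as in the article

        // Hypothetical listing URL, used for illustration only.
        List<String> urls = build("http://answers.yahoo.com/hypothetical-listing", pages);

        System.out.println("pages to fetch: " + urls.size());
        System.out.println("entries expected: " + urls.size() * perPage);
        for (String url : urls) {
            System.out.println(url);
        }
    }
}
```

Comparing the number of entries actually written against the expected total is a quick way to notice a page that failed to download or a pattern in the config that stopped matching.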
