Scraping Baidu Search Results with Web-Harvest

Latest recommended article published on 2020-11-25 08:55:38

iteye_19961

Category: Software - JAVA. Tags: Baidu, Web, XML, PHP, C


1. Create a project, import the required libraries (I used version 1.0), and write the configuration file
<config charset="gbk">


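    <!-- Download the Baidu result page for the query in the wd parameter
         (GBK percent-encoded) and convert the HTML to well-formed XML -->
    <var-def name="start" id="startpage">
        <html-to-xml>
            <http url="http://www.baidu.com/s?wd=%CD%E6%BE%DF"/>
        </html-to-xml>
    </var-def>

    <!-- Select every div with class 'r' from the converted page; each one is a result block -->
    <var-def name="urlList" id="urlList">
        <xpath expression="//div[@class='r']">
            <var name="start"/>
        </xpath>
    </var-def>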


    <file action="write" path="baidu/catalog.xml" charset="utf-8">
        <![CDATA[ <catalog> ]]>
        <loop item="item" index="i">
            <list><var name="urlList"/></list>
            <body>
                <xquery>
                    <xq-param name="item" type="node()"><var name="item"/></xq-param>
                    <xq-expression><![CDATA[
                            declare variable $item as node() external;

                            let $name := data($item//span/font[1]/text()[1])
                            let $url := data($item//span/font[2]/text())
                                return
                                    <website>
                                        <name>{normalize-space($name)}</name>
                                        <url>{normalize-space($url)}</url>
                                    </website>
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>
        <![CDATA[ </catalog> ]]>
    </file>
</config>
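
The wd parameter in the configuration above is the search keyword 玩具 ("toys"), percent-encoded as GBK bytes. To search for a different keyword, the URL has to be encoded the same way. The following is a minimal sketch using only the Java standard library; the class name BaiduQueryUrl and the helper buildUrl are hypothetical and not part of Web-Harvest, and it assumes the JVM ships the GBK charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class BaiduQueryUrl {

    // Hypothetical helper: builds a search URL in the same shape as the one
    // used in the configuration file.
    static String buildUrl(String keyword) {
        try {
            // The keyword is percent-encoded as GBK bytes, matching charset="gbk".
            return "http://www.baidu.com/s?wd=" + URLEncoder.encode(keyword, "GBK");
        } catch (UnsupportedEncodingException e) {
            // Only reached if this JVM does not provide the GBK charset.
            throw new IllegalStateException("GBK charset is not available", e);
        }
    }

    public static void main(String[] args) {
        // The keyword from the article, written with Unicode escapes so the
        // result does not depend on the source file encoding.
        System.out.println(buildUrl("\u73a9\u5177"));
        // prints http://www.baidu.com/s?wd=%CD%E6%BE%DF
    }
}
```

The printed URL can be pasted into the url attribute of the http element.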

2. Write the Java code

import java.io.IOException;

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

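public class Test {

    public static void main(String[] args) throws IOException {

        // Load the configuration from step 1. The second argument is the working
        // directory; the relative output path baidu/catalog.xml is resolved against it.
        ScraperConfiguration config = new ScraperConfiguration("c:/baidu.xml");
        Scraper scraper = new Scraper(config, "c:/tmp/");
        scraper.setDebug(true); // enable Web-Harvest debug mode

        // Run the configuration and report how long it took.
        long startTime = System.currentTimeMillis();
        scraper.execute();
        System.out.println("time elapsed: " + (System.currentTimeMillis() - startTime) + " ms");

    }
}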

3. Check the result (the configuration writes it to baidu/catalog.xml)

<catalog>
<website>
<name>上海丽强专业大型</name>
<url>www.liqiang-toy.com</url>
</website>
<website>
<name>多样型大型</name>
<url>www.yonglangplay.com</url>
</website>
<website>
<name>童博士卡通</name>
<url>www.tbs88.com</url>
</website>
<website>
<name>芝麻街</name>
<url>c49.txooo.js.cn</url>
</website>
<website>
<name>童博士, 中国平价学生用品..</name>
<url>www.cfsj8.cn</url>
</website>
<website>
<name>充气</name>
<url>www.xmcaili.com</url>
</website>
<website>
<name>找木制</name>
<url>www.tengyuetoys.com</url>
</website>
<website>
<name>米多迪</name>
<url>b146.txooo.com</url>
</website>
</catalog>
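
To use this output from a program, the catalog can be read back with the XML parser in the Java standard library. The following is a minimal sketch, not part of the original project: the class CatalogReader and the method readWebsites are hypothetical names, and the sample string in main stands in for the contents of the generated file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CatalogReader {

    // Returns one "name -> url" string per <website> element in the catalog.
    static List<String> readWebsites(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList sites = doc.getElementsByTagName("website");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < sites.getLength(); i++) {
                Element site = (Element) sites.item(i);
                // Each <website> holds exactly one <name> and one <url> child.
                String name = site.getElementsByTagName("name").item(0).getTextContent().trim();
                String url = site.getElementsByTagName("url").item(0).getTextContent().trim();
                result.add(name + " -> " + url);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the catalog XML", e);
        }
    }

    public static void main(String[] args) {
        // Two entries copied from the result above; the names are written with
        // Unicode escapes so the sample does not depend on the file encoding.
        String xml = "<catalog>"
                + "<website><name>\u5145\u6c14</name><url>www.xmcaili.com</url></website>"
                + "<website><name>\u7c73\u591a\u8fea</name><url>b146.txooo.com</url></website>"
                + "</catalog>";
        for (String site : readWebsites(xml)) {
            System.out.println(site);
        }
    }
}
```

For the real file, the same DocumentBuilder can parse a java.io.File pointing at the generated catalog.xml instead of a string.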

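4. Conclusion
Pretty cool, isn't it? The result can now be used for further analysis. For example, a keyword-research site (http://www.wordtracker.cc/) could be built quickly this way.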
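
As one small illustration of such analysis, the sketch below counts the scraped hosts by their last domain label. It is only an assumed example of a first analysis step, not the method behind any particular keyword site; the class DomainTally and the method countBySuffix are hypothetical names, and the host list is copied from the result in step 3.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DomainTally {

    // Counts hosts by the label after the last dot (for example "com" or "cn").
    static Map<String, Integer> countBySuffix(List<String> hosts) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String host : hosts) {
            // If the host has no dot, lastIndexOf returns -1 and the whole host is used.
            String suffix = host.substring(host.lastIndexOf('.') + 1);
            counts.merge(suffix, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The eight <url> values from the result in step 3.
        List<String> hosts = Arrays.asList(
                "www.liqiang-toy.com", "www.yonglangplay.com", "www.tbs88.com",
                "c49.txooo.js.cn", "www.cfsj8.cn", "www.xmcaili.com",
                "www.tengyuetoys.com", "b146.txooo.com");
        System.out.println(countBySuffix(hosts)); // prints {cn=2, com=6}
    }
}
```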