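Target URL: http://www.google.com.hk/search?num=100&hl=zh-CN&tbm=nws&q=搜索引擎
Meaning of each parameter:
num: number of results to return (100 here, which is also the maximum)
hl: interface language (zh-CN is Simplified Chinese)
tbm: search category (nws restricts the search to news)
q: the search keyword (here 搜索引擎, "search engine")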
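The parameters above can also be assembled programmatically. The following is a minimal Python sketch, not part of the tool itself; the function name build_news_search_url is hypothetical, and it assumes the keyword should be percent-encoded as UTF-8.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://www.google.com.hk/search"

def build_news_search_url(keyword, num=100, hl="zh-CN"):
    """Assemble the search URL from the parameters described above."""
    params = {
        "num": num,    # number of results to request
        "hl": hl,      # interface language
        "tbm": "nws",  # restrict the search to news
        "q": keyword,  # search keyword, percent-encoded as UTF-8 by urlencode
    }
    return BASE_URL + "?" + urlencode(params)

url = build_news_search_url("搜索引擎")
print(url)

# Decoding the query string again recovers the original keyword.
decoded_keyword = parse_qs(urlsplit(url).query)["q"][0]
print(decoded_keyword)
```

The printed URL carries the keyword in percent-encoded form; a browser shows the same address with the Chinese characters decoded.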
Open SourceViewer, click the "Raw File" (原始文件) tab in the lower-left corner, and enter the following configuration in the window that pops up:
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
<var-def name="url">
<template>$$URL$$</template>
</var-def>
<var-def name="content">
<http url="${url}">
<http-header name="User-Agent">Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1</http-header>
<http-header name="Accept">*/*</http-header>
<http-header name="Accept-Charset">GBK,utf-8;q=0.7,*;q=0.3</http-header>
<http-header name="Accept-Language">zh-CN,zh;q=0.8</http-header>
</http>
</var-def>
<file action="write" type="text" path="$$OUTHTML$$">
<template>${content}</template>
</file>
</config>
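For readers who want to see what this configuration amounts to outside the tool, here is a rough Python equivalent. It is only a sketch under assumptions: that the configuration issues a plain GET request with the four headers listed and writes the response body to the output path. The names build_request and fetch_to_file are hypothetical, and the sketch saves the raw bytes instead of re-encoding them as text.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Same header values as in the configuration above.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) "
                  "Gecko/20060111 Firefox/1.5.0.1",
    "Accept": "*/*",
    "Accept-Charset": "GBK,utf-8;q=0.7,*;q=0.3",
    "Accept-Language": "zh-CN,zh;q=0.8",
}

def build_request(url):
    """Create a GET request carrying the headers from the configuration."""
    return Request(url, headers=HEADERS)

def fetch_to_file(url, out_path, timeout=30):
    """Download the page and save the raw response body (hypothetical helper)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        body = response.read()
    with open(out_path, "wb") as out_file:
        out_file.write(body)

# Building the request does not touch the network; only fetch_to_file does.
search_url = ("http://www.google.com.hk/search?num=100&hl=zh-CN&tbm=nws&q="
              + quote("搜索引擎"))
request = build_request(search_url)
print(request.get_method(), request.full_url)
```

Calling fetch_to_file(search_url, "out.html") would perform the actual download; the file name here is only an example.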
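Click the "OK" button to close the configuration window, then reopen SourceViewer to see the collected data.
Return to the main window and enter the following configuration in the configuration editor: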
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <news>
      <!-- Each li element with class "g" is treated as one search result -->
      <xsl:for-each select="//li[@class='g']">
        <item>
          <!-- Text of the span inside the div with class "slp" -->
          <datetime>
            <xsl:value-of select=".//div[@class='slp']/span/text()" />
          </datetime>
          <!-- Link target of the result heading -->
          <url>
            <xsl:value-of select=".//h3[@class='r']/a/@href" />
          </url>
          <!-- Join every text node under the link, including text inside nested tags -->
          <title>
            <xsl:for-each select=".//h3[@class='r']/a//text()">
              <xsl:value-of select="concat(.,'')" />
            </xsl:for-each>
          </title>
        </item>
      </xsl:for-each>
    </news>
  </xsl:template>
</xsl:stylesheet>
Click Run to see the results window.
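To make the extraction rules in the stylesheet easier to follow, the sketch below applies the same three selections in Python with the standard library's ElementTree. The sample markup is made up to match the XPath expressions and is not real search output; real result pages are generally not well-formed XML, and their class names may differ. The function name extract_items is hypothetical, and span.text only approximates the stylesheet's text() selection (it takes the text before any child element).

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample shaped like the markup the stylesheet expects.
SAMPLE = """<html><body><ol>
<li class="g">
  <h3 class="r"><a href="http://example.com/a">First <b>search</b> story</a></h3>
  <div class="slp"><span>2 hours ago</span></div>
</li>
<li class="g">
  <h3 class="r"><a href="http://example.com/b">Second story</a></h3>
  <div class="slp"><span>1 day ago</span></div>
</li>
</ol></body></html>"""

def extract_items(markup):
    """Return one dict per result, mirroring the item elements of the stylesheet."""
    root = ET.fromstring(markup)
    items = []
    for li in root.findall(".//li[@class='g']"):
        link = li.find(".//h3[@class='r']/a")
        span = li.find(".//div[@class='slp']/span")
        items.append({
            # Text of the span inside the "slp" div
            "datetime": (span.text or "") if span is not None else "",
            # href attribute of the heading link
            "url": link.get("href", "") if link is not None else "",
            # All text under the link, including text inside nested tags
            "title": "".join(link.itertext()) if link is not None else "",
        })
    return items

items = extract_items(SAMPLE)
for item in items:
    print(item["datetime"], item["url"], item["title"])
```

Joining itertext() plays the same role as the inner xsl:for-each over text nodes: a title that contains highlighted words in nested tags still comes out as one string.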