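Running:
1. Download jspider-0.5.0-dev.zip and unzip it.
2. Click Start -> Run -> cmd to open a command prompt window, then change to the jspider-0.5.0-dev/bin directory.
3. Try crawling the contents of the site http://j-spider.sourceforge.net:
The following appears on the screen: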
Build: 20030502
Started from .
[Engine] jspider.home = ..
[Engine] default output folder = ..\output
[Engine] starting with configuration 'default'
Two new files appear in the bin directory: out.txt and velocity.log.
The contents of out.txt are as follows:
JSpider startup script
JSPIDER_HOME = ..
------------------------------------------------------------
INFO [ core.impl.PluginFactory ] Loading 4 plugins.
INFO [ core.impl.PluginFactory ] Loading plugin configuration 'console'
INFO [ mod.plugin.console.ConsolePlugin ] Prefix set to ' [ Plugin ] '
INFO [ core.impl.PluginFactory ] Plugin not configured for local event filtering
INFO [ core.impl.PluginFactory ] Plugin Name : Console writer JSpider module
INFO [ core.impl.PluginFactory ] Plugin Version : v1.0
INFO [ core.impl.PluginFactory ] Plugin Vendor : http://www.javacoding.net
INFO [ core.impl.PluginFactory ] Loading plugin configuration 'velocity'
INFO [ core.impl.PluginFactory ] Plugin uses local event filtering
INFO [ core.impl.PluginFactory ] Plugin Name : Velocity Template JSpider module
INFO [ core.impl.PluginFactory ] Plugin Version : v1.0
INFO [ core.impl.PluginFactory ] Plugin Vendor : http://www.javacoding.net
INFO [ core.impl.PluginFactory ] Loading plugin configuration 'statusbasedfilewriter'
INFO [ mod.plugin.statusbasedfilewriter.StatusBasedFileWriterPlugin ] initialized.
INFO [ core.impl.PluginFactory ] Plugin not configured for local event filtering
INFO [ core.impl.PluginFactory ] Plugin Name : Status based Filewriter JSpider plugin
INFO [ core.impl.PluginFactory ] Plugin Version : v1.0
INFO [ core.impl.PluginFactory ] Plugin Vendor : http://www.javacoding.net
INFO [ core.impl.PluginFactory ] Loading plugin configuration 'xmldump'
INFO [ core.impl.PluginFactory ] Plugin uses local event filtering
INFO [ core.impl.PluginFactory ] Plugin Name : Velocity Template JSpider module
INFO [ core.impl.PluginFactory ] Plugin Version : v1.0
INFO [ core.impl.PluginFactory ] Plugin Vendor : http://www.javacoding.net
INFO [ core.impl.PluginFactory ] Loaded 4 plugins.
INFO [ mod.plugin.velocity.VelocityPlugin ] writing trace file: true
INFO [ mod.plugin.velocity.VelocityPlugin ] writing dump file: true
INFO [ mod.plugin.velocity.VelocityPlugin ] Velocity template folder : velocity
INFO [ mod.plugin.velocity.VelocityPlugin ] Writing to trace file: ./velocity-trace.out
INFO [ mod.plugin.velocity.VelocityPlugin ] Writing to dump file: ./velocity-dump.out
INFO [ mod.plugin.velocity.VelocityPlugin ] writing trace file: false
INFO [ mod.plugin.velocity.VelocityPlugin ] writing dump file: true
INFO [ mod.plugin.velocity.VelocityPlugin ] Velocity template folder : xmldump
INFO [ mod.plugin.velocity.VelocityPlugin ] Writing to dump file: ./xml-dump.xml
INFO [ core.storage.StorageFactory ] Storage provider class is 'class net.javacoding.jspider.core.storage.memory.InMemoryStorageProvider'
INFO [ core.SpiderContext ] default user Agent is 'JSpider v0.5.0-dev (http://j-spider.sourceforge.net)'
INFO [ core.task.SchedulerFactory ] TaskScheduler provider class is 'class net.javacoding.jspider.core.task.impl.DefaultSchedulerProvider'
INFO [ core.Spider ] Spider born - threads: spiders: 5 , thinkers: 1
[ Plugin ] Module : Console writer JSpider module
[ Plugin ] Version: v1.0
[ Plugin ] Vendor : http://www.javacoding.net
[ Plugin ] Spidering Started , baseURL = http://j-spider.sourceforge.net
INFO [ core.SpiderContext ] using userAgent 'JSpider v0.5.0-dev (http://j-spider.sourceforge.net)' for site 'http://j-spider.sourceforge.net'
[ Plugin ] site discovered : http://j-spider.sourceforge.net
[ Plugin ] resource discovered: http://j-spider.sourceforge.net
INFO [ core.throttle.ThrottleFactory ] Throttle provider class is 'class net.javacoding.jspider.core.throttle.impl.DistributedLoadThrottleProvider'
[ Plugin ] Job monitor: 0 % ( 0 / 1 ) [ S:0% (0/1) | T:0% (0/0) ] [ blocked:1 ] [ assigned:1 ]
[ Plugin ] resource discovered: http://j-spider.sourceforge.net/robots.txt
[ Plugin ] 200 - http://j-spider.sourceforge.net/robots.txt - text/plain 527 461 ms
INFO [ mod.plugin.statusbasedfilewriter.StatusBasedFileWriterPlugin ] creating file for status ' 200 '
[ Plugin ] robots.txt fetched from site [ Site: http://j-spider.sourceforge.net - ROBOTSTXT_HANDLED * ]
[ Plugin ] net.javacoding.jspider.api.event.site.UserAgentObeyedEvent obeyed rules for useragent 'JSpider' as found in robots.txt on site http://j-spider.sourceforge.net
[ Plugin ] ThreadPool Thinkers occupation: 0 % [ idle: 100%, blocked: 0%, busy: 0% ] , size: 1
[ Plugin ] ThreadPool Spiders occupation: 20 % [ idle: 80%, blocked: 20%, busy: 0% ] , size: 5
[ Plugin ] Job monitor: 66 % ( 2 / 3 ) [ S:50% (1/2) | T:100% (1/1) ] [ blocked:0 ] [ assigned:1 ]
[ Plugin ] ThreadPool Thinkers occupation: 0 % [ idle: 100%, blocked: 0%, busy: 0% ] , size: 1
[ Plugin ] ThreadPool Spiders occupation: 20 % [ idle: 80%, blocked: 0%, busy: 20% ] , size: 5
……
[ Plugin ] Job monitor: 66 % ( 2 / 3 ) [ S:50% (1/2) | T:100% (1/1) ] [ blocked:0 ] [ assigned:1 ]
[ Plugin ] ThreadPool Thinkers occupation: 0 % [ idle: 100%, blocked: 0%, busy: 0% ] , size: 1
[ Plugin ] ThreadPool Spiders occupation: 20 % [ idle: 80%, blocked: 0%, busy: 20% ] , size: 5
[ Plugin ] 200 - http://j-spider.sourceforge.net - text/html 5687 9673 ms
[ Plugin ] resource discovered: http://j-spider.sourceforge.net/css/ie.css
[ Plugin ] resource discovered: http://j-spider.sourceforge.net/img/grey.gif
……
[ Plugin ] resource discovered: http://j-spider.sourceforge.net/img/title_information.gif
INFO [ core.SpiderContext ] site http://www.sourceforge.net must not be handled.
[ Plugin ] site discovered : http://www.sourceforge.net
……
[ Plugin ] http://j-spider.sourceforge.net parsed (handled)
[ Plugin ] 200 - http://j-spider.sourceforge.net/css/ie.css - text/css 114 440 ms
[ Plugin ] http://j-spider.sourceforge.net/css/ie.css - Ignored for parsing
[ Plugin ] Job monitor: 58 % ( 29 / 50 ) [ S:12% (3/24) | T:100% (26/26) ] [ blocked:0 ] [ assigned:6 ]
[ Plugin ] ThreadPool Thinkers occupation: 0 % [ idle: 100%, blocked: 0%, busy: 0% ] , size: 1
[ Plugin ] ThreadPool Spiders occupation: 100 % [ idle: 0%, blocked: 100%, busy: 0% ] , size: 5
[ Plugin ] 200 - http://j-spider.sourceforge.net/img/grey.gif - image/gif 49 441 ms
[ Plugin ] http://j-spider.sourceforge.net/img/grey.gif - Ignored for parsing
[ Plugin ] Job monitor: 60 % ( 31 / 51 ) [ S:16% (4/24) | T:100% (27/27) ] [ blocked:0 ] [ assigned:6 ]
[ Plugin ] ThreadPool Thinkers occupation: 0 % [ idle: 100%, blocked: 0%, busy: 0% ] , size: 1
[ Plugin ] ThreadPool Spiders occupation: 100 % [ idle: 0%, blocked: 100%, busy: 0% ] , size: 5
ERROR [ core.task.work.SpiderHttpURLTask ] exception during spidering
java.io.IOException
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at net.javacoding.jspider.core.task.work.SpiderHttpURLTask.execute(Unknown Source)
at net.javacoding.jspider.core.threading.WorkerThread.run(Unknown Source)
Caused by: java.io.FileNotFoundException: http://j-spider.sourceforge.net/img/logo.gif
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.HttpURLConnection.getResponseCode(Unknown Source)
... 2 more
[ Plugin ] 404 - ERROR !!!http://j-spider.sourceforge.net/img/logo.gif
INFO [ mod.plugin.statusbasedfilewriter.StatusBasedFileWriterPlugin ] creating file for status ' 404 '
[ Plugin ] Job monitor: 62 % ( 32 / 51 ) [ S:20% (5/24) | T:100% (27/27) ] [ blocked:0 ] [ assigned:6 ]
……
[ Plugin ] 200 - http://j-spider.sourceforge.net/img/title_other.gif - image/gif 1044 9403 ms
[ Plugin ] http://j-spider.sourceforge.net/img/title_other.gif - Ignored for parsing
INFO [ core.Spider ] Stopped spider workers
INFO [ core.Spider ] Stopped thinker workers
[ Plugin ]
SPIDERING SUMMARY :
known urls . : 47
visited urls .. : 27
parsed urls : 11
parse ignored urls .. : 16
parse error urls . : 0
not visited urls . : 20
fetching ignored urls .. : 19
forbidden urls : 0
fetch error urls . : 1
not yet visited urls .. : 0
[ Plugin ] Spidering Stopped
INFO [ mod.plugin.velocity.VelocityPlugin ] writing dump - this could take a while
INFO [ mod.plugin.velocity.VelocityPlugin ] writing dump - this could take a while
[ Plugin ] Job monitor: 100 % ( 92 / 92 ) [ S:100% (28/28) | T:100% (64/64) ] [ blocked:0 ] [ assigned:0 ]
INFO [ core.Spider ] Spidering done!
INFO [ core.Spider ] Elapsed time : 46127
[ Plugin ] ThreadPool Thinkers occupation: 0 % [ idle: 100%, blocked: 0%, busy: 0% ] , size: 1
[ Plugin ] ThreadPool Spiders occupation: 0 % [ idle: 100%, blocked: 0%, busy: 0% ] , size: 5
As the log shows, the actual spider, parse, and dump actions are all carried out by plugins. velocity.log is the log file of the velocity plugin.
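The counts in the SPIDERING SUMMARY of the log above can be cross-checked with a few lines of Python. This is only a sketch: the function names parse_summary and is_consistent are made up for illustration, and the relationships it tests (visited = parsed + parse ignored + parse error, and so on) are inferred from the numbers of this one run, not taken from the JSpider documentation.

```python
import re

# Summary lines copied from the log of this run.
SUMMARY = """\
known urls . : 47
visited urls .. : 27
parsed urls : 11
parse ignored urls .. : 16
parse error urls . : 0
not visited urls . : 20
fetching ignored urls .. : 19
forbidden urls : 0
fetch error urls . : 1
not yet visited urls .. : 0
"""


def parse_summary(text):
    """Map each label (padding dots stripped) to its integer count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*([a-z ]+?)[ .]*:\s*(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def is_consistent(c):
    """Check the assumed relationships between the counters."""
    visited_ok = c["visited urls"] == (
        c["parsed urls"] + c["parse ignored urls"] + c["parse error urls"]
    )
    not_visited_ok = c["not visited urls"] == (
        c["fetching ignored urls"] + c["forbidden urls"]
        + c["fetch error urls"] + c["not yet visited urls"]
    )
    known_ok = c["known urls"] == c["visited urls"] + c["not visited urls"]
    return visited_ok and not_visited_ok and known_ok


if __name__ == "__main__":
    counts = parse_summary(SUMMARY)
    print(counts)
    print("consistent:", is_consistent(counts))
```

Under these assumptions, the single fetch error in the summary lines up with the one 404 reported in the log for img/logo.gif.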
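The default output directory is output. It contains 7 files: 200.out, 404.out, log4j.out, README.txt, velocity-dump.out, velocity-trace.out, and xml-dump.xml; these files record the results of the scan. The *.out files are plain text and can be opened with a text editor. The output directory contains no fetched pages, which means the default configuration does not save the pages it downloads.
The configuration files are in the conf directory. There are two kinds:
(1) *.properties: program configuration files, which configure the program's behavior.
For example, the contents of conf\default\plugins\download\sites.properties: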
# Websites configuration file
# -----------------------------------------------------------------------------
#
# $Id: sites.properties,v 1.4 2003/04/25 21:28:55 vanrogu Exp $
#
# -----------------------------------------------------------------------------
jspider.site.config.base = base
jspider.site.config.default = skip
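(2) *.vm: output format configuration files, which configure the program's output. JSpider uses the third-party tool Velocity for this.
*.vm files are Velocity template files. For example, conf\default\plugins\velocity\engineSpideringStoppedEvent.vm: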
known urls . : ${event.summary.known}
visited urls .. : ${event.summary.visited}
parsed urls : ${event.summary.parsed}
parse ignored urls .. : ${event.summary.ignoredForParsing}
parse error urls . : ${event.summary.parseErrors}
not visited urls . : ${event.summary.notVisited}
fetching ignored urls .. : ${event.summary.ignoredForFetching}
forbidden urls : ${event.summary.forbidden}
fetch error urls . : ${event.summary.fetchErrors}
not yet visited urls .. : ${event.summary.unvisited}
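
To make the two kinds of configuration files more concrete, here is a small Python sketch. It is not JSpider or Velocity code: parse_properties is a simplified reader for key = value lines (the real Java properties format also allows ":" separators and line continuations), and render is a hand-written stand-in that only substitutes ${a.b.c} references from nested dictionaries, whereas Velocity resolves such references against Java objects. Both function names are made up for illustration.

```python
import re

# The two settings from sites.properties shown above.
SITES_PROPERTIES = """\
# Websites configuration file
jspider.site.config.base = base
jspider.site.config.default = skip
"""


def parse_properties(text):
    """Read 'key = value' lines, skipping blank lines and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def render(template, context):
    """Replace each ${a.b.c} with context['a']['b']['c'] (simplified)."""
    def lookup(match):
        value = context
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)

    return re.sub(r"\$\{([A-Za-z_][\w.]*)\}", lookup, template)


if __name__ == "__main__":
    print(parse_properties(SITES_PROPERTIES))
    # Values taken from the summary printed in out.txt.
    context = {"event": {"summary": {"known": 47, "visited": 27}}}
    print(render("known urls . : ${event.summary.known}", context))
    print(render("visited urls .. : ${event.summary.visited}", context))
```

The second half shows why the summary in out.txt has the same layout as the template: each line of the .vm file is copied through unchanged except for the ${...} placeholders.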