1 Cygwin
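Cygwin is a free UNIX-like environment that runs on Windows. It is implemented as a DLL (dynamic-link library) that emulates UNIX on top of Windows.
Installation: http://cppforever.spaces.live.com/Blog/cns!9DB38506DFDBED76!998.entry (online installation)
(direct download)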
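2 Nutch: a web search engine
Nutch has two main parts: the crawler and the searcher. The crawler fetches pages from the web and builds an index of them. The searcher uses that index to look up the user's query keywords and produce the search results. The index is the only interface between the two, so apart from the index the two parts are loosely coupled.
http://baike.baidu.com/view/46642.htm (introduction to Nutch)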
3
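Extract Nutch to drive D:.
Start Cygwin, then change to the Nutch directory:
cd /cygdrive/d/nutch
In crawl-urlfilter.txt, add a rule that accepts URLs in the csdn.net domain:
+^http://([a-z0-9]*\.)*csdn.net/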
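The csdn.net rule can be tried outside Nutch before starting a crawl. The following is a minimal sketch, not Nutch's own code. It assumes that the dot after the host-label group is meant as an escaped dot (`\.`) and that the filter looks for the pattern anywhere it can match from the start of the URL (find semantics). The class name CsdnUrlFilterDemo and the sample URLs are made up for illustration.

```java
import java.util.regex.Pattern;

public class CsdnUrlFilterDemo {
    // The crawl-urlfilter.txt rule without its leading '+' (the '+' only marks "accept").
    public static final Pattern CSDN_RULE =
            Pattern.compile("^http://([a-z0-9]*\\.)*csdn.net/");

    // Assumption: the rule is applied with find(), so it only has to match a prefix of the URL.
    public static boolean accepts(String url) {
        return CSDN_RULE.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.csdn.net/",
            "http://blog.csdn.net/some/page",
            "http://www.example.com/",
            "https://www.csdn.net/"
        };
        for (String url : samples) {
            // '+' marks an accepted URL, '-' a URL this rule does not accept.
            System.out.println((accepts(url) ? "+ " : "- ") + url);
        }
    }
}
```

Note that the dot in csdn.net is not escaped, so it matches any character. Writing csdn\.net in the rule would be stricter.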
In nutch-site.xml, set the crawler's agent name:
<property>
<name>http.agent.name</name>
<value>csdn.net</value>
<description>csdn.net</description>
</property>
mkdir urls
echo http://www.csdn.net/ > urls/csdn
export JAVA_HOME="/cygdrive/c/Program Files/Java/jdk1.6.0_07"
bin/nutch crawl urls -dir csdn -depth 3 -threads 4 >& crawl.log
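The generated index files are written to nutch/csdn.
(The index can be inspected with Luke: http://www.getopt.org/luke/)
Copy nutch-1.0.war into the servlet container's webapps directory.
Edit nutch-site.xml in the web application's classes directory (WEB-INF/classes) and add:
<property>
<name>searcher.dir</name>
<value>D:/nutch/csdn</value>
</property>
This sets the index directory that the searcher uses.
Reference: http://billion318.spaces.live.com/blog/cns!36A80F7663730159!109.entry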
4 Word segmentation with Paoding
Add the properties files from paoding.jar to the Nutch classpath.
Edit paoding-dic-home.properties:
paoding.dic.home=D:/paoding/dic (note the forward slashes)
Use PaodingAnalyzer in Nutch's analyzer class, NutchDocumentAnalyzer:
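private static Analyzer PAODING_ANALYZER;

public NutchDocumentAnalyzer(Configuration conf) {
    // Create the Paoding analyzer once and reuse it for every field.
    PAODING_ANALYZER = new PaodingAnalyzer();
}

public TokenStream tokenStream(String fieldName, Reader reader) {
    // Use Paoding for all fields instead of the default analyzers.
    Analyzer analyzer = PAODING_ANALYZER;
    return analyzer.tokenStream(fieldName, reader);
}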
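The note about forward slashes in paoding.dic.home can be illustrated with a short sketch. It assumes that Paoding reads paoding-dic-home.properties through java.util.Properties, which has not been verified here. The class name DicHomeSlashDemo is made up for illustration. In the properties format a backslash starts an escape sequence, so a Windows path written with backslashes loses its separators when loaded.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class DicHomeSlashDemo {
    // Parses properties text and returns the value stored under paoding.dic.home.
    public static String dicHome(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Not expected for an in-memory reader.
            throw new IllegalStateException(e);
        }
        return props.getProperty("paoding.dic.home");
    }

    public static void main(String[] args) {
        // Forward slashes survive loading unchanged.
        System.out.println(dicHome("paoding.dic.home=D:/paoding/dic"));
        // Single backslashes in the file are treated as escapes and dropped.
        System.out.println(dicHome("paoding.dic.home=D:\\paoding\\dic"));
    }
}
```

The first call prints D:/paoding/dic. The second prints D:paodingdic, which is not a usable path. Doubling each backslash in the file would also work, but forward slashes are simpler.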
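5 Searching pdf|doc|xls|ppt|txt files
By default, the content of txt files is searchable.
pdf|doc|xls|ppt need extra configuration:
parse-plugins.xml: specifies which parser is used for each file type
nutch-default.xml:
<name>plugin.includes</name> // specifies which parse plugins (file types) are included
<value>.....|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword)|...
<name>http.content.limit</name> // maximum size of fetched content
<value>-1</value> // -1 means no limit; otherwise a "parse incomplete" exception may occur
regex-urlfilter.txt: remove ppt and xls from the list of filtered suffixes (pdf and doc are not filtered by default)
crawl-urlfilter.txt: remove ppt and xls from the list of filtered suffixes (pdf and doc are not filtered by default)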
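The effect of taking ppt and xls out of the suffix rule can be checked with a short sketch. The suffix lists below are shortened, hypothetical stand-ins for the one in regex-urlfilter.txt and crawl-urlfilter.txt, and the class name SuffixFilterDemo is made up. The sketch assumes that a rule starting with '-' rejects any URL in which its pattern is found.

```java
import java.util.regex.Pattern;

public class SuffixFilterDemo {
    // Hypothetical, shortened suffix list that still contains ppt and xls.
    public static final Pattern WITH_OFFICE =
            Pattern.compile("\\.(gif|jpg|png|zip|ppt|xls|exe)$");
    // The same list after ppt and xls have been removed.
    public static final Pattern WITHOUT_OFFICE =
            Pattern.compile("\\.(gif|jpg|png|zip|exe)$");

    // Assumption: a '-' rule rejects every URL in which the pattern is found.
    public static boolean rejected(Pattern skipRule, String url) {
        return skipRule.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.csdn.net/files/slides.ppt",
            "http://www.csdn.net/files/report.xls",
            "http://www.csdn.net/files/archive.zip"
        };
        for (String url : samples) {
            System.out.println(url
                    + " before=" + (rejected(WITH_OFFICE, url) ? "skipped" : "fetched")
                    + " after=" + (rejected(WITHOUT_OFFICE, url) ? "skipped" : "fetched"));
        }
    }
}
```

With the edited list the ppt and xls URLs are no longer skipped, while other suffixes such as zip remain filtered.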
6 Running crawl from Windows
Set the environment variables: NUTCH_HOME=D:/nutch
PATH: %PATH%;d:/cygwin/bin;d:/cygwin/usr/bin;
nutch.bat crawl d:/nutch/csdn -dir d:/nutch/csdn -depth 2 -topN 5000 (the paths here must be absolute)
(this bat file is based on http://wangxuliangboy.javaeye.com/blog/279552)
@echo off
set JAVA_HEAP_MAX="-Xmx512M"
rem With no command given, fall through to the usage text below.
if not "%1"=="" goto INIT
:echoMSG
echo Title: Nutch
echo Usage: nutch COMMAND
echo where COMMAND is one of:
echo   crawl        one-step crawler for intranets
echo   inject       inject new urls into the database
echo   generate     generate new segments to fetch
echo   fetchlist    print the fetchlist of a segment
echo   fetch        fetch a segment's pages
echo   parse        parse a segment's pages
echo   index        run the indexer on a segment's fetcher output
echo   merge        merge several segment indexes
echo   dedup        remove duplicates from a set of segment indexes
echo   updatedb     update db from segments after fetching
echo   updatesegs   update segments with link data from the db
echo   mergesegs    merge multiple segments into a single segment
echo   analyze      adjust database link-analysis scoring
echo   segread      read, fix and dump segment data
echo   segslice     append, join and slice segment data
echo   server       run a search server
echo   namenode     run the NDFS namenode
echo   datanode     run an NDFS datanode
echo   ndfs         run an NDFS admin client
echo   jobtracker   run the MapReduce job Tracker node
echo   tasktracker  run a MapReduce task Tracker node
echo  or
echo   CLASSNAME    run the class named CLASSNAME
echo Most commands print help when invoked w/o parameters.
goto end
:INIT
set NUTCH_HOME=%NUTCH_HOME%
if "%NUTCH_HOME%"=="" echo NUTCH_HOME NOT FOUND IN ENVIRONMENT
set CLASSPATH=%NUTCH_HOME%;%NUTCH_HOME%/conf;%NUTCH_HOME%/plugin;%NUTCH_HOME%/lib
@echo @echo off>setclasspath.bat
for %%i in (%NUTCH_HOME%/nutch-*.jar) do @echo set CLASSPATH=%%CLASSPATH%%;%%i>>setclasspath.bat
for %%i in (%NUTCH_HOME%/lib/*.jar) do @echo set CLASSPATH=%%CLASSPATH%%;%%i>>setclasspath.bat
goto EXEC
:EXEC
call setclasspath
if "%1" == "crawl" set CLASS=org.apache.nutch.crawl.Crawl
if "%1" == "inject" set CLASS=org.apache.nutch.crawl.Injector
if "%1" == "generate" set CLASS=org.apache.nutch.crawl.Generator
if "%1" == "fetchlist" set CLASS=org.apache.nutch.pagedb.FetchListEntry
if "%1" == "fetch" set CLASS=org.apache.nutch.fetcher.Fetcher
if "%1" == "fetch2" set CLASS=org.apache.nutch.fetcher.Fetcher2
if "%1" == "convdb" set CLASS=org.apache.nutch.tools.compat.CrawlDbConverter
if "%1" == "parse" set CLASS=org.apache.nutch.parse.ParseSegment
if "%1" == "index" set CLASS=org.apache.nutch.indexer.Indexer
if "%1" == "merge" set CLASS=org.apache.nutch.indexer.IndexMerger
if "%1" == "dedup" set CLASS=org.apache.nutch.indexer.DeleteDuplicates
if "%1" == "updatedb" set CLASS=org.apache.nutch.crawl.CrawlDb
if "%1" == "mergesegs" set CLASS=org.apache.nutch.segment.SegmentMerger
if "%1" == "readdb" set CLASS=org.apache.nutch.crawl.CrawlDbReader
if "%1" == "segread" echo [DEPRECATED] Command 'segread' is deprecated, use 'readseg' instead.
if "%1" == "segread" set CLASS=org.apache.nutch.segment.SegmentReader
if "%1" == "server" set CLASS=org.apache.nutch.searcher.DistributedSearch$Server
echo %CLASSPATH%
call "%JAVA_HOME%/bin/java" %JAVA_HEAP_MAX% -classpath "%CLASSPATH%" %CLASS% %2 %3 %4 %5 %6 %7 %8 %9
:end
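The core of the batch file is a lookup from a command name to the Java class that gets launched. A minimal sketch of that lookup, using only the class names listed in the batch file, could look like this. The class name NutchCommandTable and the method resolve are made up for illustration. Returning the argument itself as a class name follows the "CLASSNAME" line of the usage text; the batch file has no such branch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NutchCommandTable {
    // Command names and classes copied from the "if ... set CLASS=" lines of the batch file.
    private static final Map<String, String> CLASSES = new LinkedHashMap<>();
    static {
        CLASSES.put("crawl", "org.apache.nutch.crawl.Crawl");
        CLASSES.put("inject", "org.apache.nutch.crawl.Injector");
        CLASSES.put("generate", "org.apache.nutch.crawl.Generator");
        CLASSES.put("fetchlist", "org.apache.nutch.pagedb.FetchListEntry");
        CLASSES.put("fetch", "org.apache.nutch.fetcher.Fetcher");
        CLASSES.put("fetch2", "org.apache.nutch.fetcher.Fetcher2");
        CLASSES.put("convdb", "org.apache.nutch.tools.compat.CrawlDbConverter");
        CLASSES.put("parse", "org.apache.nutch.parse.ParseSegment");
        CLASSES.put("index", "org.apache.nutch.indexer.Indexer");
        CLASSES.put("merge", "org.apache.nutch.indexer.IndexMerger");
        CLASSES.put("dedup", "org.apache.nutch.indexer.DeleteDuplicates");
        CLASSES.put("updatedb", "org.apache.nutch.crawl.CrawlDb");
        CLASSES.put("mergesegs", "org.apache.nutch.segment.SegmentMerger");
        CLASSES.put("readdb", "org.apache.nutch.crawl.CrawlDbReader");
        CLASSES.put("segread", "org.apache.nutch.segment.SegmentReader");
        CLASSES.put("server", "org.apache.nutch.searcher.DistributedSearch$Server");
    }

    // Returns the class for a known command; otherwise treats the argument as a class name
    // (an assumption taken from the usage text, not from the batch file's logic).
    public static String resolve(String command) {
        String cls = CLASSES.get(command);
        return cls != null ? cls : command;
    }

    public static void main(String[] args) {
        String command = args.length > 0 ? args[0] : "crawl";
        System.out.println(command + " -> " + resolve(command));
    }
}
```

Seen this way, "nutch.bat crawl ..." amounts to running java with org.apache.nutch.crawl.Crawl as the main class and passing the remaining arguments through.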