Install ant and svn on Linux.
Check out the Nutch 1.8 source with svn:
svn co http://svn.apache.org/repos/asf/nutch/tags/release-1.8/
Change into the checkout with cd ./release-1.8 and run ant, which downloads the jars Nutch depends on.
Nutch manages its dependencies through Ivy.
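The checkout and build steps above can be collected into a small script. This is only a sketch: the helper names missing_tools and build_nutch are made up for illustration, and checking for java alongside ant and svn is an assumption.

```shell
#!/bin/sh
# Print each named tool that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check out the 1.8 tag and build it with ant (URL taken from the text above).
# Only defined here, not called, because it needs network access.
build_nutch() {
    svn co http://svn.apache.org/repos/asf/nutch/tags/release-1.8/ &&
    cd ./release-1.8 &&
    ant
}
```

Calling missing_tools ant svn java prints nothing when everything is installed; build_nutch can then be run to perform the checkout and build.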
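The ant build generates two directories, build and runtime. runtime holds the two ways of running Nutch, in its local and deploy subdirectories.
local contains bin, conf, lib, plugins, test and urls; its bin directory holds the nutch and crawl commands.
Both are shell scripts, so you can open them with vi to read exactly what they do.
deploy contains apache-nutch-1.8.job and bin. When a command is run in deploy mode, apache-nutch-1.8.job is submitted to the JobTracker and runs as MapReduce jobs.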
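A quick way to confirm that the build produced the layout described above is to test for the expected entries. The function name missing_local_entries is hypothetical, and the list it checks is taken from the directory listing in the text (test and urls are left out because they may not exist right after a build).

```shell
#!/bin/sh
# Print the entries expected under runtime/local that are missing.
# Usage: missing_local_entries <path-to-runtime/local>
missing_local_entries() {
    local_dir=$1
    for entry in bin conf lib plugins bin/nutch bin/crawl; do
        [ -e "$local_dir/$entry" ] || printf '%s\n' "$entry"
    done
}
```

For the paths used in these notes the call would be missing_local_entries /usr/local/release-1.8/runtime/local; no output means the layout is complete.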
Run the crawl command from /usr/local/release-1.8/runtime/local/bin/:
./crawl
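Missing seedDir : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
Run without arguments, crawl prints the usage line above. seedDir is the directory holding the seed URLs, crawlDir is the directory where the crawl data is stored, solrURL is the address of the Solr server, and numberOfRounds corresponds to depth in earlier versions.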
Under the local directory, create the seed directory with mkdir urls, then create a file urls/url.txt containing http://www.163.com
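Because crawl expects a seed directory (seedDir) rather than a single file, one way to prepare it is sketched below. The function name make_seed_dir is made up for illustration; the file name url.txt and the example URL come from the text.

```shell
#!/bin/sh
# Create a seed directory containing url.txt with one URL per line.
# Usage: make_seed_dir <seedDir> <url>...
make_seed_dir() {
    # Require the directory plus at least one URL.
    [ $# -ge 2 ] || return 1
    seed_dir=$1
    shift
    mkdir -p "$seed_dir" || return 1
    printf '%s\n' "$@" > "$seed_dir/url.txt"
}
```

From the local directory, make_seed_dir urls http://www.163.com produces urls/url.txt with that single seed.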
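Running Nutch 1.8 this way also requires Solr; Solr 4.8.1 is used here.
Download solr-4.8.1 from http://mirror.bit.edu.cn/apache/lucene/solr/4.8.1/, unzip it into a directory of your choice, and set SOLR_HOME to that directory.
Then cd ${SOLR_HOME}/example and run java -jar start.jar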
5. Verify the Solr installation
After installing solr-4.8.1, open the following URL to verify that the installation succeeded:
http://localhost:8983/solr/#/
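The same check can be made from the command line, which helps when Solr has just been started and may not be listening yet. This is a sketch that assumes curl is installed; the function names solr_is_up and wait_for_solr are hypothetical.

```shell
#!/bin/sh
# Return 0 if the URL answers with an HTTP success status, non-zero otherwise.
solr_is_up() {
    curl --silent --fail --max-time 5 --output /dev/null "$1"
}

# Poll the URL about once per second until it answers or the attempts run out.
# Usage: wait_for_solr <url> [attempts]   (default: 30 attempts)
wait_for_solr() {
    url=$1
    attempts=${2:-30}
    while [ "$attempts" -gt 0 ]; do
        solr_is_up "$url" && return 0
        attempts=$((attempts - 1))
        if [ "$attempts" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}
```

For example, wait_for_solr http://localhost:8983/solr/ returns success once the admin page is being served.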
With Solr running, start the crawl from /usr/local/release-1.8/runtime/local/bin/ to fetch the seed URLs in the given directory:
./crawl ../urls ../data http://master:8983/solr/ 2
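The four crawl arguments are easy to get wrong, so a small pre-flight check can be useful. The sketch below only validates the arguments and prints the command line; it does not start a crawl. The function name crawl_command is made up for illustration.

```shell
#!/bin/sh
# Validate the four crawl arguments and print the command that would be run.
# Usage: crawl_command <seedDir> <crawlDir> <solrURL> <numberOfRounds>
crawl_command() {
    if [ $# -ne 4 ]; then
        echo "usage: crawl_command <seedDir> <crawlDir> <solrURL> <numberOfRounds>" >&2
        return 1
    fi
    if [ ! -d "$1" ]; then
        echo "seedDir is not a directory: $1" >&2
        return 1
    fi
    case $4 in
        ''|*[!0-9]*)
            echo "numberOfRounds must be a non-negative integer: $4" >&2
            return 1
            ;;
    esac
    printf './crawl %s %s %s %s\n' "$1" "$2" "$3" "$4"
}
```

Run from the bin directory, crawl_command ../urls ../data http://master:8983/solr/ 2 prints the crawl command used above once ../urls exists.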
6. Integrate Solr and Nutch
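We now have both Nutch and Solr installed and set up correctly, and Nutch has already created crawl data from the seed URL(s). The steps below delegate searching to Solr so that the crawled links become searchable. Here ${APACHE_SOLR_HOME} is the Solr installation directory (SOLR_HOME above) and ${NUTCH_RUNTIME_HOME} is Nutch's runtime/local directory.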
- mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org
- cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
- vi ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
- Insert exactly this at line 351: <field name="_version_" type="long" indexed="true" stored="true"/>
- Restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example
- Run the Solr index command (the ../data paths are relative to the bin directory used for the crawl above, so adjust them if you run it from elsewhere):
bin/nutch solrindex http://127.0.0.1:8983/solr/ ../data/crawldb -linkdb ../data/linkdb ../data/segments/*
The call signature for running the solrindex has changed. The linkdb is now optional, so you need to denote it with a "-linkdb" flag on the command line.
This will send all crawl data to Solr for indexing. For more information please see bin/nutch solrindex
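The schema steps in the list above (back up, copy, add the _version_ field) can also be scripted. This is a sketch rather than the tutorial's own method: it replaces the manual vi edit with an awk insertion, the function name install_nutch_schema is hypothetical, and the grep guard that skips the insertion when the field is already present is an added assumption.

```shell
#!/bin/sh
# Back up Solr's schema.xml as schema.xml.org, then install the Nutch schema
# with the _version_ field inserted so that it becomes line <line> (default 351).
# If the Nutch schema is shorter than <line> lines, no field is inserted.
# Usage: install_nutch_schema <solr_conf_dir> <nutch_conf_dir> [line]
install_nutch_schema() {
    solr_conf=$1
    nutch_conf=$2
    line_no=${3:-351}
    field='<field name="_version_" type="long" indexed="true" stored="true"/>'

    mv "$solr_conf/schema.xml" "$solr_conf/schema.xml.org" || return 1
    if grep -q 'name="_version_"' "$nutch_conf/schema.xml"; then
        # Field already defined: a plain copy is enough.
        cp "$nutch_conf/schema.xml" "$solr_conf/schema.xml"
    else
        # Print the field just before the original line number line_no.
        awk -v n="$line_no" -v f="$field" 'NR == n { print f } { print }' \
            "$nutch_conf/schema.xml" > "$solr_conf/schema.xml"
    fi
}
```

With the variables from the steps above, the call would be install_nutch_schema ${APACHE_SOLR_HOME}/example/solr/collection1/conf ${NUTCH_RUNTIME_HOME}/conf, followed by a Solr restart.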
If all has gone to plan, we are now ready to search with http://localhost:8983/solr/admin/. If you want to see the raw HTML indexed by Solr, change the content field definition in schema.xml to:
<field name="content" type="text" stored="true" indexed="true"/>
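To check from the command line that documents actually reached the index, a select query can be sent to Solr. The sketch below only builds the URL. It assumes the default core of the Solr 4.x example, collection1, and does not URL-encode the query, so it suits simple queries only. The function name solr_count_url is hypothetical.

```shell
#!/bin/sh
# Build a Solr select URL that returns only the match count for a query.
# Usage: solr_count_url <solr_base_url> [query]   (default query: *:*)
solr_count_url() {
    base=${1%/}          # strip one trailing slash, if any
    query=${2:-*:*}
    printf '%s/collection1/select?q=%s&rows=0&wt=json\n' "$base" "$query"
}
```

Passing the result to curl, as in curl "$(solr_count_url http://localhost:8983/solr/)", returns a JSON response whose numFound value is the number of indexed documents.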