Installing Nutch and integrating it with Solr

天边tbdp

Article link: https://blog.csdn.net/tbdp6411/article/details/26700347

First, install ant and svn on the Linux machine.
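Before going further it can help to confirm that the tools are actually on the PATH. The following is only a minimal sketch: the helper name missing_tools is made up for illustration, and checking for java as well is an assumption (ant and Solr both need a JVM).

```shell
#!/bin/sh
# Report which of the given tools are missing from PATH.
# Prints one missing tool name per line; returns non-zero if any is missing.
# (missing_tools is a hypothetical helper, not part of Nutch or Solr.)
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# svn and ant are needed for the checkout and build; java is assumed here
# because both ant and Solr run on the JVM.
if missing="$(missing_tools svn ant java)"; then
    echo "all build tools found"
else
    # Left unquoted on purpose so the names are joined on one line.
    echo "missing tools:" $missing
fi
```

How the missing tools get installed depends on the distribution's package manager, so that part is left out.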

Check out the Nutch 1.8 source with svn:

svn co http://svn.apache.org/repos/asf/nutch/tags/release-1.8/

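Change into the checkout with cd ./release-1.8 and run ant. The build downloads all the jars that Nutch depends on.

Nutch manages its dependencies with Ivy.

After the ant build finishes, two directories are generated: build and runtime. The runtime directory contains deploy and local, which correspond to the two ways of running Nutch.

The local directory contains bin, conf, lib, plugins, test and urls. The bin directory holds the nutch and crawl commands.

Both are shell scripts, so you can open them with vi and read exactly what they do.

The deploy directory contains apache-nutch-1.8.job and bin. When we run a command from deploy, apache-nutch-1.8.job is submitted to the Hadoop JobTracker and executed as MapReduce jobs.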
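The difference between the two runtime directories comes down to whether a .job file is present. A small sketch of telling them apart, assuming the layout described above (the function name nutch_mode is hypothetical, and the demo uses a throwaway directory rather than a real build):

```shell
#!/bin/sh
# Print "deploy" if the directory holds a Nutch .job file, otherwise "local".
# (nutch_mode is an illustrative helper, not a Nutch command.)
nutch_mode() {
    dir="$1"
    for f in "$dir"/*nutch*.job; do
        # If the glob matched nothing, $f is the literal pattern and -f fails.
        if [ -f "$f" ]; then
            echo deploy
            return 0
        fi
    done
    echo local
}

# Demo on a fake runtime tree so nothing real is touched.
demo="$(mktemp -d)"
mkdir -p "$demo/local/bin" "$demo/deploy/bin"
: > "$demo/deploy/apache-nutch-1.8.job"
nutch_mode "$demo/local"     # prints: local
nutch_mode "$demo/deploy"    # prints: deploy
rm -rf "$demo"
```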

Run the crawl command from /usr/local/release-1.8/runtime/local/bin/. Without arguments it prints its usage:

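./crawl
Missing seedDir : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>

Here seedDir is the directory holding the seed URL files, crawlDir is the directory where the crawl data is stored, solrURL is the URL of the Solr server, and numberOfRounds corresponds to depth in earlier versions.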
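To make the four arguments concrete, here is a dry-run sketch that only validates them and prints the command line it would run. The wrapper name crawl_cmd and the argument values are assumptions for illustration; nothing is crawled.

```shell
#!/bin/sh
# Validate the four crawl arguments and print the resulting command line.
# Nothing is executed; crawl_cmd is a hypothetical dry-run helper.
crawl_cmd() {
    if [ "$#" -ne 4 ]; then
        echo "usage: crawl_cmd <seedDir> <crawlDir> <solrURL> <numberOfRounds>" >&2
        return 1
    fi
    case "$4" in
        ''|*[!0-9]*)
            # numberOfRounds must consist of digits only.
            echo "numberOfRounds must be a number: $4" >&2
            return 1
            ;;
    esac
    echo "./crawl $1 $2 $3 $4"
}

# Example values (assumed): seeds in ../urls, data in ../data, two rounds.
crawl_cmd ../urls ../data http://localhost:8983/solr/ 2
```

Checking the arguments up front avoids starting a long crawl that fails in a later round because of a mistyped value.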

In the local directory, create the seed directory with mkdir urls, then create a file url.txt inside it containing http://www.163.com
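A minimal sketch of preparing the seeds, assuming the seed directory is called urls (the name the later crawl command passes as ../urls) and holds a plain text file with one URL per line. The helper make_seeds is hypothetical, and the demo writes into a temporary directory rather than runtime/local.

```shell
#!/bin/sh
# Create a seed directory containing url.txt with one URL per line.
# (make_seeds is an illustrative helper; the file name url.txt follows the text.)
make_seeds() {
    seed_dir="$1"
    shift
    mkdir -p "$seed_dir" || return 1
    printf '%s\n' "$@" > "$seed_dir/url.txt"
}

# Demo in a scratch directory; for a real run the target would be
# runtime/local/urls.
work="$(mktemp -d)"
make_seeds "$work/urls" http://www.163.com
cat "$work/urls/url.txt"
rm -rf "$work"
```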

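To run Nutch 1.8 we also need Solr installed. Here we use Solr 4.8.1.

Download solr-4.8.1 from http://mirror.bit.edu.cn/apache/lucene/solr/4.8.1/, unzip it into a directory of your choice, and set SOLR_HOME to that directory.

Then cd ${SOLR_HOME}/example and run java -jar start.jar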

5. Verify the Solr installation

After installing and starting solr-4.8.1, open the following URL to check that the installation succeeded:

http://localhost:8983/solr/#/
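The same check can be scripted. This is only a sketch: it assumes curl is installed and that the Solr admin page answers with HTTP 200 once the server is up. The function name wait_for_url and the retry values are made up for illustration.

```shell
#!/bin/sh
# Poll a URL until it answers with HTTP 200, or give up after N tries.
# Arguments: url [tries] [delay-in-seconds]. Returns 0 on success, 1 otherwise.
# (wait_for_url is a hypothetical helper; it relies on curl being available.)
wait_for_url() {
    url="$1"
    tries="${2:-10}"
    delay="${3:-2}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -s: quiet, -L: follow redirects, -o /dev/null: discard the body,
        # -w '%{http_code}': print only the status code.
        code="$(curl -s -L -o /dev/null -w '%{http_code}' "$url" 2>/dev/null)"
        if [ "$code" = "200" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example (needs curl and a running Solr):
# wait_for_url http://localhost:8983/solr/ 10 2 && echo "Solr is up"
```

The fragment part of the browser URL (#/) is never sent to the server, so the script polls http://localhost:8983/solr/ instead.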

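With Solr running, we can crawl the seed URLs in the given directory (master is the host running Solr here). From /usr/local/release-1.8/runtime/local/bin/, run:

./crawl ../urls ../data http://master:8983/solr/ 2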

6. Integrate Solr and Nutch

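We now have both Nutch and Solr installed and set up correctly, and Nutch has already created crawl data from the seed URL(s). Below are the steps to delegate searching to Solr so that the crawled links become searchable. Here ${APACHE_SOLR_HOME} is the Solr directory (SOLR_HOME above) and ${NUTCH_RUNTIME_HOME} is runtime/local: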

mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
vi ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
Insert exactly this at line 351: <field name="_version_" type="long" indexed="true" stored="true"/>
Restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example
Run the Solr index command:

bin/nutch solrindex http://127.0.0.1:8983/solr/ ../data/crawldb -linkdb ../data/linkdb ../data/segments/*

The call signature for running the solrindex has changed. The linkdb is now optional, so you need to denote it with a "-linkdb" flag on the command line.

This will send all crawl data to Solr for indexing. For more information please see bin/nutch solrindex
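The manual vi edit of schema.xml in the steps above can also be scripted. The following is a sketch under assumptions: the helper insert_at_line is hypothetical, it keeps its own .bak copy (separate from the schema.xml.org backup made earlier), and line 351 is simply the number given in the steps. Whether that line falls inside the <fields> element depends on the schema.xml shipped with your Nutch version, so check the file before relying on it.

```shell
#!/bin/sh
# Insert a line of text so that it becomes line N of a file.
# A copy of the previous content is kept as <file>.bak.
# If the file has fewer than N lines, nothing is inserted.
# (insert_at_line is an illustrative helper; text must not contain
# backslashes, because awk -v interprets escape sequences.)
insert_at_line() {
    file="$1"
    lineno="$2"
    text="$3"
    cp "$file" "$file.bak" || return 1
    awk -v n="$lineno" -v t="$text" 'NR == n { print t } { print }' \
        "$file.bak" > "$file"
}

# Demo on a scratch file instead of the real schema.xml.
demo="$(mktemp)"
printf 'a\nb\nc\n' > "$demo"
insert_at_line "$demo" 2 '<field name="_version_" type="long" indexed="true" stored="true"/>'
cat "$demo"
rm -f "$demo" "$demo.bak"
```

For the real file the call would take the collection1/conf/schema.xml path and 351 as arguments, followed by the Solr restart.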

If all has gone to plan, we are now ready to search through the Solr admin UI at http://localhost:8983/solr/#/ (the URL checked in step 5). If you want to see the raw HTML indexed by Solr, change the content field definition in schema.xml to:

<field name="content" type="text" stored="true" indexed="true"/>
