Configuring Nutch 1.7 and Solr 4.6 Integration on Ubuntu 13.10

苟渝

Published 2021-01-28 09:35:19

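Tags: nutch solr mysql

Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. Reproductions must include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_34305716/article/details/113469237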

1. System preparation

Install Ubuntu 13.10, configure the package sources, then run sudo apt-get update and sudo apt-get upgrade.

2. Preparing the required software

(1) Installing ant

Run sudo apt-get install ant1.7, then check the installation with ant -version, which should print:

Apache Ant version 1.7.1 compiled on September 3 2011

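This indicates that the installation succeeded.

(2) Installing and configuring the JDK

Download the JDK from the official site and extract it to /opt/jdk.

Environment variables: run sudo gedit /etc/profile and append the following to the end of the file:

export JAVA_HOME=/opt/jdk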

export PATH=$JAVA_HOME/bin:$PATH

export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

Save and exit, then run source /etc/profile to make the configuration take effect.

Verify: both java -version and java should print output (omitted here).
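Shell variable names are case-sensitive, so the exported name has to be exactly JAVA_HOME for the PATH and CLASSPATH lines to expand correctly. The following is a minimal sketch of a sanity check; check_java_home is a hypothetical helper name, not part of the JDK or of nutch.

```shell
# Sketch: confirm that JAVA_HOME is set and points at a JDK-like directory.
# check_java_home is a hypothetical helper introduced for illustration.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no executable found at $JAVA_HOME/bin/java" >&2
        return 1
    fi
    echo "JAVA_HOME looks valid: $JAVA_HOME"
}
```

Running check_java_home in a new terminal after source /etc/profile shows whether the variable survived the edit.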

(3) nutch

Download nutch 1.7 and extract it to /opt/nutch.

cd /opt/nutch

bin/nutch

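The usage help should now appear, which means the installation succeeded. Next comes the configuration.

step1: Edit conf/nutch-site.xml and set the agent name sent in HTTP requests:

<property>

<name>http.agent.name</name>

<value>Friendly Crawler</value>

</property>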

step2: Create the seed directory

mkdir -p urls

step3: Write the seed URL into the file urls/seed.txt, for example with sudo gedit urls/seed.txt:

http://www.linuxidc.com

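step4: Configure conf/regex-urlfilter.txt: comment out the default accept-all rule and add a rule for the site to crawl (the rule has to match the seed URL, otherwise the seed is filtered out):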

# accept anything else

# +.

# added by yoyo

+36kr.com
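In regex-urlfilter.txt each rule is a + (accept) or - (reject) followed by a regular expression, the first rule that matches a URL decides, and a URL that matches no rule is ignored. A rough way to check a seed against the file before crawling is sketched below. It assumes the patterns are simple enough for grep -E to behave like the Java regex engine that nutch uses, and first_rule is a hypothetical helper name.

```shell
# Sketch: report which rule in a regex-urlfilter.txt style file decides a URL.
# Assumption: grep -E approximates the Java regexes for simple patterns.
# first_rule is a hypothetical helper, not a nutch command.
first_rule() {
    url=$1
    rules=$2
    while IFS= read -r rule; do
        case $rule in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        sign=$(printf '%s' "$rule" | cut -c1)      # "+" accepts, "-" rejects
        pattern=$(printf '%s' "$rule" | cut -c2-)  # the rest of the line is the regex
        if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
            printf '%s\n' "$sign"
            return 0
        fi
    done < "$rules"
    printf 'none\n'   # no rule matched, so the URL would be ignored
}
```

For example, first_rule http://www.linuxidc.com conf/regex-urlfilter.txt prints the deciding sign. A seed that yields - or none would be dropped before fetching.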

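step5: Edit conf/nutch-site.xml again and add a parser.skip.truncated property:

<property>

<name>parser.skip.truncated</name>

<value>false</value>

</property>

The reason is that packet captures with tcpdump or wireshark show that this site returns its page content in truncated segments. With its default settings nutch does not process such content, so handling has to be turned on.

Reference: http://lucene.472066.n3.nabble.com/Content-Truncation-in-Nutch-2-1-MySQL-td4038888.html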
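Both properties from step1 and step5 live inside the configuration element of conf/nutch-site.xml. As a sketch of what the complete file could look like, the helper below writes it in one go. write_nutch_site is a hypothetical name, the function overwrites any existing nutch-site.xml in the given directory, and the agent name is inserted without XML escaping.

```shell
# Sketch: generate a nutch-site.xml containing the two properties from
# step1 and step5. write_nutch_site is a hypothetical helper; it overwrites
# the target file and does not escape XML special characters.
write_nutch_site() {
    conf_dir=$1
    agent_name=$2
    cat > "$conf_dir/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$agent_name</value>
  </property>
  <property>
    <name>parser.skip.truncated</name>
    <value>false</value>
  </property>
</configuration>
EOF
}
```

With the layout used in this article, the call would be write_nutch_site /opt/nutch/conf "Friendly Crawler".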

step6: Test crawl

bin/nutch crawl urls -dir crawl

(4) Installing Solr

Download solr 4.6 and extract it to /opt/solr.

cd /opt/solr/example

java -jar start.jar

If the page http://localhost:8983/solr/ opens normally, Solr is running.
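Solr takes a few seconds to start, so a script that launches it and then indexes right away may hit a closed port. The sketch below polls the URL with curl until it answers. wait_for_url is a hypothetical helper, and the default of 30 attempts at one-second intervals is an arbitrary assumption.

```shell
# Sketch: wait until a URL answers successfully, or give up.
# wait_for_url is a hypothetical helper; it relies on curl being installed.
wait_for_url() {
    url=$1
    tries=${2:-30}          # number of attempts, one second apart
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl fail on HTTP errors, -sS keeps it quiet otherwise
        if curl -fsS -o /dev/null "$url" 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

A typical use would be wait_for_url http://localhost:8983/solr/ && echo "Solr is up".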

3. Integrating Nutch with Solr

(1) Setting environment variables:

Run sudo gedit /etc/profile and add:

export NUTCH_RUNTIME_HOME=/opt/nutch

export APACHE_SOLR_HOME=/opt/solr

(2) Integration

mkdir ${APACHE_SOLR_HOME}/example/solr/conf

cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/

Restart solr:

java -jar start.jar

Build the index:

bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/

This fails with:

Active IndexWriters :

SOLRIndexWriter

solr.server.url : URL of the SOLR instance (mandatory)

solr.commit.size : buffer size when sending to SOLR (default 1000)

solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)

solr.auth : use authentication (default false)

solr.auth.username : use authentication (default false)

solr.auth : username for authentication

solr.auth.password : password for authentication

Exception in thread "main" java.io.IOException: Job failed!

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)

at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)

at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)

at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)

at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)

at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
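The stack trace only says that the indexing job failed. When nutch runs in local mode it also writes a log to logs/hadoop.log under its runtime directory, and the underlying cause is usually recorded there. A small sketch for pulling out the latest error lines follows. recent_errors is a hypothetical helper, and the search terms are an assumption about what the relevant lines contain.

```shell
# Sketch: show the most recent error-like lines from a nutch log file.
# Assumptions: local mode logging to logs/hadoop.log, and that the relevant
# lines contain "ERROR" or "Exception". recent_errors is a hypothetical helper.
recent_errors() {
    log_file=${1:-logs/hadoop.log}
    max_lines=${2:-20}
    grep -E 'ERROR|Exception' "$log_file" | tail -n "$max_lines"
}
```

Run from /opt/nutch, recent_errors with no arguments reads logs/hadoop.log and prints at most 20 lines.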

The fix follows http://stackoverflow.com/questions/13429481/error-while-indexing-in-solr-data-crawled-by-nutch

Several other fields of this kind also have to be added: edit ~/solr-4.4.0/example/solr/collection1/conf/schema.xml and add the following fields inside …:

(3) Verification

rm crawl/ -Rf

bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/

…………

…………

CrawlDb update: Merging segment data into db.

CrawlDb update: finished at 2014-03-03 08:55:30, elapsed: 00:00:01

LinkDb: starting at 2014-03-03 08:55:30

LinkDb: linkdb: crawl/linkdb

LinkDb: URL normalize: true

LinkDb: URL filter: true

LinkDb: internal links will be ignored.

LinkDb: adding segment: file:/opt/nutch/crawl/segments/20140303085430

LinkDb: adding segment: file:/opt/nutch/crawl/segments/20140303085441

LinkDb: finished at 2014-03-03 08:55:31, elapsed: 00:00:01

Indexer: starting at 2014-03-03 08:55:31

Indexer: deleting gone documents: false

Indexer: URL filtering: false

Indexer: URL normalizing: false

Active IndexWriters :

SOLRIndexWriter

solr.server.url : URL of the SOLR instance (mandatory)

solr.commit.size : buffer size when sending to SOLR (default 1000)

solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)

solr.auth : use authentication (default false)

solr.auth.username : use authentication (default false)

solr.auth : username for authentication

solr.auth.password : password for authentication

Indexer: finished at 2014-03-03 08:55:35, elapsed: 00:00:03

SolrDeleteDuplicates: starting at 2014-03-03 08:55:35

SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/

SolrDeleteDuplicates: finished at 2014-03-03 08:55:36, elapsed: 00:00:01

crawl finished: crawl

To search the crawled content, open http://localhost:8983/solr/#/collection1/query in a browser and click Execute Query.
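The same query can be sent from the command line through Solr's select handler. The sketch below only builds the request URL. solr_select_url is a hypothetical helper, the core name collection1 is taken from the URL above, and the query string is assumed to be URL-encoded already.

```shell
# Sketch: build a select URL for a Solr core.
# solr_select_url is a hypothetical helper; the query must already be
# URL-encoded, and wt=json asks Solr for a JSON response.
solr_select_url() {
    base=${1%/}             # drop one trailing slash from the base URL
    core=$2
    query=$3
    rows=${4:-10}
    printf '%s/%s/select?q=%s&wt=json&rows=%s\n' "$base" "$core" "$query" "$rows"
}

# Example (needs a running Solr, so it is left commented out):
# curl -s "$(solr_select_url http://localhost:8983/solr/ collection1 '*:*' 5)"
```

The query *:* matches every indexed document, which makes it a convenient check that the crawl actually reached the index.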
