Nutch and Solr Series (1): Installing Nutch and Solr on Windows 7

巨峰

Category: Nutch and Solr. Tags: nutch, solr

Article link: https://blog.csdn.net/xzf19901108/article/details/78255736

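1. Version selection
Nutch 1.9 + Solr 4.8.1 (Nutch 1.x ships prebuilt binary releases, whereas Nutch 2.x is distributed as source only and has to be compiled yourself)
Nutch 1.9 download: nutch 1.9
Solr 4.8.1 download: solr 4.8.1
2. Environment selection
Nutch is designed to run on Linux, so here it is run inside Cygwin
Solr runs in a local Tomcat 8
The JDK is also installed locally on Windows 7
The resulting environment is: Windows 7 64-bit + Cygwin 64-bit + JDK 1.8 64-bit + Tomcat 8.5.9 (JDK 1.8 is not well supported by Tomcat 7.0)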
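
Before going further, it can help to confirm from the Cygwin prompt that the pieces listed above are actually visible. The following is only a minimal sketch: the function name `check_env` is made up for this example, and the JAVA_HOME warning rests on the assumption that Nutch's launcher script relies on that variable.

```shell
#!/usr/bin/env bash
# Sketch: report which of the given commands can be found on PATH.
# check_env is a hypothetical helper name, not part of Nutch or Cygwin.

check_env() {
  local cmd missing=0
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "found:   $cmd"
    else
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  # Assumption: the Nutch launcher needs JAVA_HOME, so warn when it is empty.
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "warning: JAVA_HOME is not set" >&2
  fi
  return "$missing"
}

# Example call from the Cygwin prompt (commented out so sourcing has no side effects):
# check_env java bash && uname -sm
```

The function returns a non-zero status if any command is missing, so it can gate later steps with `&&`.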
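3. Installing Nutch 1.9 (install Cygwin yourself beforehand)
3.1 Unpack the downloaded Nutch 1.9 archive and copy it into the home directory under the Cygwin installation path, giving D:\cygwin64\home\apache-nutch-1.9

3.2 Because Nutch is designed to run on Linux, one small problem remains even when it is run inside Cygwin: the org.apache.hadoop.fs.FileUtil class in hadoop-core-1.2.0.jar, found in the apache-nutch-1.9\lib directory, has to be patched at the source level before Nutch runs correctly. A patched jar is available online, so there is no need to patch it yourself: hadoop-core-1.2.0.jar. Replace the original jar with the downloaded hadoop-core-1.2.0.jar, then go to Nutch's bin directory and run the nutch command to verify the installation (the console session at the end of this article shows the expected output).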
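
The jar swap in step 3.2 can be done by hand in Explorer, but a small shell function makes it repeatable and keeps a backup. This is a sketch only: `replace_hadoop_jar` is a hypothetical helper, and the paths in the commented example are assumptions about where the files live.

```shell
#!/usr/bin/env bash
# Sketch: swap in the patched hadoop-core jar, keeping a copy of the original.
# replace_hadoop_jar is a hypothetical helper name.

replace_hadoop_jar() {
  local nutch_home="$1"    # e.g. /home/apache-nutch-1.9 inside Cygwin
  local patched_jar="$2"   # path to the downloaded, patched hadoop-core-1.2.0.jar
  local target="$nutch_home/lib/hadoop-core-1.2.0.jar"

  [ -f "$patched_jar" ] || { echo "patched jar not found: $patched_jar" >&2; return 1; }
  [ -f "$target" ] || { echo "original jar not found: $target" >&2; return 1; }

  # Back up the original once; the .orig suffix keeps it from matching *.jar.
  if [ ! -f "$target.orig" ]; then
    cp "$target" "$target.orig" || return 1
  fi
  cp "$patched_jar" "$target"
}

# Example (paths are assumptions; commented out so sourcing has no side effects):
# replace_hadoop_jar /home/apache-nutch-1.9 ~/hadoop-core-1.2.0.jar \
#   && /home/apache-nutch-1.9/bin/nutch
```

Keeping the original as hadoop-core-1.2.0.jar.orig makes it easy to undo the change if the patched jar turns out not to match your setup.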
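4. Installing Solr 4.8.1 (deployed in Tomcat 8.5.9)
4.1 Unpack Solr 4.8.1, take solr-4.8.1.war from the solr-4.8.1\dist directory, rename it to solr.war (renaming is optional), and put it in a directory of your choice (I used D:\apps)

4.2 Copy all jars from solr-4.8.1\example\lib\ext into Tomcat's lib directory (Solr needs these jars at runtime; without them it fails to deploy)

4.3 Copy the solr folder under solr-4.8.1\example (including everything in its subdirectories) into the Tomcat root directory (Solr uses it at runtime; without it Solr fails to deploy)

4.4 In tomcat-8.5.9\conf\Catalina\localhost (create these folders yourself if they do not exist, or start Tomcat once to have them created automatically), add a file named solr.xml. This file is the deployment descriptor for Solr and takes the place of dropping the war directly into Tomcat's webapps directory. Its content is: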
<Context docBase="D:/apps/solr.war" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="D:/tomcat-8.5.9/solr" override="true" />
</Context>

4.5 Copy schema.xml from D:\cygwin64\home\apache-nutch-1.9\conf into D:\tomcat-8.5.9\solr\collection1\conf, which is inside the solr folder copied to the Tomcat root in step 4.3 (back up the schema.xml already in that directory first)

4.6 In schema.xml, comment out
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
(this factory class no longer exists in Solr 4.x, so Solr cannot be accessed while the line is active)

4.7 Inside the <fields> node of schema.xml, add
<!-- added to resolve Solr's "_version_ field must exist in schema" error -->
<field name="_version_" type="long" stored="true" indexed="true" multiValued="false"/>
(otherwise Solr reports the "_version_ field must exist in schema" error on startup)
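
The file-copying part of this section (steps 4.1 to 4.5) can also be scripted from the Cygwin prompt. The following is a rough sketch under stated assumptions: `deploy_solr` and `to_win_path` are hypothetical helper names, every path in the commented example is an assumption, and the schema edits from steps 4.6 and 4.7 still have to be made by hand afterwards. Because Tomcat runs on the Windows JVM, the descriptor needs D:/... style paths, which `cygpath -m` produces when it is available.

```shell
#!/usr/bin/env bash
# Sketch of steps 4.1-4.5. deploy_solr and to_win_path are hypothetical helpers.

# Convert a Cygwin path to the D:/... form; pass it through unchanged elsewhere.
to_win_path() {
  if command -v cygpath >/dev/null 2>&1; then
    cygpath -m "$1"
  else
    printf '%s\n' "$1"
  fi
}

deploy_solr() {
  local solr_dist="$1"     # unpacked solr-4.8.1 directory
  local tomcat_home="$2"   # Tomcat root directory
  local apps_dir="$3"      # directory that will hold solr.war
  local nutch_home="$4"    # unpacked apache-nutch-1.9 directory
  local solr_home="$tomcat_home/solr"
  local conf_dir="$solr_home/collection1/conf"
  local ctx_dir="$tomcat_home/conf/Catalina/localhost"

  mkdir -p "$apps_dir" "$ctx_dir" || return 1
  cp "$solr_dist/dist/solr-4.8.1.war" "$apps_dir/solr.war" || return 1   # 4.1
  cp "$solr_dist"/example/lib/ext/*.jar "$tomcat_home/lib/" || return 1  # 4.2
  cp -r "$solr_dist/example/solr" "$tomcat_home/" || return 1            # 4.3

  # 4.4: write the context descriptor with Windows-style paths.
  cat > "$ctx_dir/solr.xml" <<EOF
<Context docBase="$(to_win_path "$apps_dir/solr.war")" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="$(to_win_path "$solr_home")" override="true" />
</Context>
EOF

  # 4.5: back up the stock schema once, then copy in the Nutch schema.
  if [ ! -f "$conf_dir/schema.xml.bak" ]; then
    cp "$conf_dir/schema.xml" "$conf_dir/schema.xml.bak" || return 1
  fi
  cp "$nutch_home/conf/schema.xml" "$conf_dir/schema.xml"
}

# Example (all paths are assumptions; commented out so sourcing has no side effects):
# deploy_solr /cygdrive/d/solr-4.8.1 /cygdrive/d/tomcat-8.5.9 /cygdrive/d/apps /home/apache-nutch-1.9
```

Stop Tomcat before running something like this, and start it again only after the schema.xml edits are in place.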
5. To improve Tomcat's handling of Chinese characters, edit server.xml in the tomcat-8.5.9\conf directory and change
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443" />
to
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443"
maxHttpHeaderSize="8192"
maxThreads="150" minSpareThreads="25" maxSpareThreads="25"
enableLookups="false" acceptCount="100" disableUploadTimeout="true"
URIEncoding="UTF-8" useBodyEncodingForURI="true"/>
6. Open http://127.0.0.1:8080/solr to access the Solr service; if the page loads normally, Solr has been deployed successfully
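
The same check can be made from the Cygwin prompt instead of a browser. A minimal sketch, assuming the curl package is installed in Cygwin (it is not part of the setup described here); `check_solr` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: report whether the Solr web app answers over HTTP.
# check_solr is a hypothetical helper; it assumes curl is installed.

check_solr() {
  local url="${1:-http://127.0.0.1:8080/solr/}"
  local code
  # -s: silent, -L: follow redirects, -o /dev/null: discard the body,
  # -w '%{http_code}': print only the final HTTP status code.
  code=$(curl -s -L -o /dev/null -w '%{http_code}' --max-time 10 "$url") || code=000
  if [ "$code" = "200" ]; then
    echo "Solr is up at $url"
  else
    echo "Solr not reachable at $url (HTTP $code)" >&2
    return 1
  fi
}

# Example (commented out so sourcing has no side effects):
# check_solr http://127.0.0.1:8080/solr/
```

A status of 000 means no HTTP response was received at all, which usually indicates that Tomcat is not running or is listening on a different port.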
7. To make Tomcat easier to manage, add
<role rolename="manager-gui"/>
<user username="tomcat" password="s3cret" roles="manager-gui"/>
to the tomcat-users.xml file in the tomcat-8.5.9\conf directory

Console session verifying the Nutch installation (step 3.2):

Administrator@Magic ~
$ cd /home/apache-nutch-1.9/bin

Administrator@Magic /home/apache-nutch-1.9/bin
$ ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  index             run the plugin-based indexer on parsed segments and linkdb
  dedup             deduplicate entries in the crawldb and give them a special status
  solrindex         run the solr indexer on parsed segments and linkdb - DEPRECATED use the index command instead
  solrdedup         remove duplicates from solr - DEPRECATED use the dedup command instead
  solrclean         remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
  clean             remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
