Building a Search Engine with Nutch + MongoDB + ElasticSearch + Kibana

Preface:

This article describes how to build a web-crawling search stack with Nutch, MongoDB, ElasticSearch, and Kibana: Nutch crawls the web pages, MongoDB stores the crawled data, ElasticSearch indexes it, and Kibana visualizes the indexed results. The steps are as follows:

Environment:

OS: Ubuntu 14.04

JDK version: jdk1.8.0_45
Download the installation package with wget and unpack it:

gannyee@ubuntu:~/download$ wget https://www.reucon.com/cdn/java/jdk-8u45-linux-x64.tar.gz
gannyee@ubuntu:~/download$ tar zxvf jdk-8u45-linux-x64.tar.gz

Unpacking produces the jdk1.8.0_45 directory. Check whether /usr/lib/ already contains a jvm directory; if not, create one:

gannyee@ubuntu:~/download$ sudo mkdir /usr/lib/jvm

Move the unpacked jdk1.8.0_45 into /usr/lib/jvm:

gannyee@ubuntu:~/download$ sudo mv jdk1.8.0_45 /usr/lib/jvm

Open /etc/profile to set the environment variables:

gannyee@ubuntu:~/download$ sudo vim /etc/profile

Append the following to the end of profile:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_45
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
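Whether the variables actually point at a JDK can be checked with a small helper (a sketch; `check_java_home` is my own name, not a system command):

```shell
check_java_home() {
  # succeed only if the argument is non-empty and contains an executable bin/java
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

check_java_home "$JAVA_HOME" && echo "JAVA_HOME looks good" || echo "JAVA_HOME is not set correctly"
```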

Then reload the file so the variables take effect:

gannyee@ubuntu:~/download$ source /etc/profile

At this point the JDK is installed. Check the JDK version:

gannyee@ubuntu:~/download$ java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

If the version information is not printed, something probably went wrong in the previous steps; review them carefully.

Ant version: 1.9.4
Download the package with wget:

gannyee@ubuntu:~/download$ wget https://archive.apache.org/dist/ant/binaries/binaries/apache-ant-1.9.4-bin.tar.gz

Unpack it to get the apache-ant-1.9.4 directory and move it into /usr/local/ant:

gannyee@ubuntu:~/download$ sudo tar zvxf apache-ant-1.9.4-bin.tar.gz
gannyee@ubuntu:~/download$ sudo mkdir /usr/local/ant
gannyee@ubuntu:~/download$ sudo mv apache-ant-1.9.4 /usr/local/ant

Open /etc/profile to set the environment variables:

gannyee@ubuntu:~/download$ sudo vim /etc/profile

Append the following to the end of profile:

export ANT_HOME=/usr/local/ant/apache-ant-1.9.4
export PATH=$PATH:$ANT_HOME/bin

Reload the file so the variables take effect:

gannyee@ubuntu:~/download$ source /etc/profile

Check the Ant version:

gannyee@ubuntu:~/download$ ant -version
Apache Ant(TM) version 1.9.4 compiled on April 29 2014

With that, the environment the search stack needs is fully prepared!

The data flow of the stack is shown in the figure below.

[Figure: data flow between Nutch, MongoDB, ElasticSearch, and Kibana]

Image source: http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/

MongoDB: download, install, start

An open-source document database and one of the best-known NoSQL stores.
Version: MongoDB-2.6.11

gannyee@ubuntu:~/download$ wget https://fastdl.mongodb.org/src/mongodb-src-r2.6.11.tar.gz
gannyee@ubuntu:~/download$ sudo tar zxvf mongodb-src-r2.6.11.tar.gz
gannyee@ubuntu:~/download$ mv mongodb-src-r2.6.11/ ../mongodb/
gannyee@ubuntu:~$ cd mongodb/
gannyee@ubuntu:~/mongodb$ sudo mkdir log/ conf/ data/

Starting with version 2.6, MongoDB uses a YAML-based configuration file format; the available options are described in the MongoDB reference documentation.

Create se.yml (the paths below point at the directories just created):

gannyee@ubuntu:~/mongodb$ vim conf/se.yml
net:
  port: 27017
  bindIp: 127.0.0.1
systemLog:
  destination: file
  path: "/home/gannyee/mongodb/log/mongodb.log"
  logAppend: true
processManagement:
  fork: true
  pidFilePath: "/home/gannyee/mongodb/log/mongodb.pid"
storage:
  dbPath: "/home/gannyee/mongodb/data"
  directoryPerDB: true
  smallFiles: true

Start MongoDB:

gannyee@ubuntu:~/mongodb$ ./bin/mongod -f conf/se.yml

Open the Mongo shell to verify that MongoDB started successfully:

gannyee@ubuntu:~/mongodb$ ./bin/mongo
MongoDB shell version: 2.6.11
connecting to: test
> show dbs
admin  (empty)
local  0.031GB
> exit
bye
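The shell can only connect once mongod is listening on its port. In scripts, a small polling helper can wait for the server (a sketch assuming bash and the default port 27017; `wait_for_port` is a made-up name):

```shell
wait_for_port() {
  # poll host:port up to $3 times, one second apart, using bash's /dev/tcp
  local host="$1" port="$2" tries="${3:-10}" i
  for i in $(seq 1 "$tries"); do
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}

wait_for_port 127.0.0.1 27017 3 && echo "mongod is up" || echo "mongod is not reachable"
```

The same helper works later for ElasticSearch on port 9200 and Kibana on port 5601.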

Shut down MongoDB:

> use admin
> db.shutdownServer()

If you want a GUI management tool for MongoDB on Ubuntu, Robomongo is recommended.
Download: http://app.robomongo.org/files/linux/robomongo-0.8.5-x86_64.deb

Download and install Robomongo, then connect it to the database:

gannyee@ubuntu:~/mongodb$ sudo wget http://app.robomongo.org/files/linux/robomongo-0.8.5-x86_64.deb
gannyee@ubuntu:~/mongodb$ sudo dpkg -i robomongo-0.8.5-x86_64.deb

Run `robomongo` to open the client and create a new connection; only the host and port are needed.
Note: the first time I installed it the connection succeeded but no data was visible. The fix was to reinstall with root privileges.

To allow access from outside the machine, change bindIp: 127.0.0.1 to bindIp: 0.0.0.0 in the configuration file.

Then open http://localhost:27017 in a browser; if the following message appears, the server is reachable from outside:
It looks like you are trying to access MongoDB over HTTP on the native driver port.

If ./mongod fails to start, it is usually because a previous unclean shutdown (for example, a crash) left mongod locked. To fix it:

    Delete the mongod.lock file and the log file (for example, mongodb.log.2016-1-26T06-55-20); if necessary, delete all the logs.
    mongod --repair --dbpath /home/gannyee/mongodb/data/db --repairpath /home/gannyee/mongodb

ElasticSearch: download, install

A high-performance distributed search engine built on Apache Lucene.
Version: ElasticSearch-1.4.4

gannyee@ubuntu:~/download$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.4.tar.gz
gannyee@ubuntu:~/download$ tar zxvf elasticsearch-1.4.4.tar.gz
gannyee@ubuntu:~/download$ mv elasticsearch-1.4.4 ../elasticsearch
gannyee@ubuntu:~$ cd elasticsearch

Edit elasticsearch.yml under config:

gannyee@ubuntu:~/elasticsearch$ vim config/elasticsearch.yml
......
cluster.name: gannyee
node.name: "gannyee"
node.master: true
node.data: true
path.conf: /home/gannyee/elasticsearch/config
path.data: /home/gannyee/elasticsearch/data
http.enabled: true
network.bind_host: 127.0.0.1
network.publish_host: 127.0.0.1
network.host: 127.0.0.1
......

Start ElasticSearch in the background:

gannyee@ubuntu:~/elasticsearch$ ./bin/elasticsearch -d

To stop ElasticSearch, shut down the single local node:

gannyee@ubuntu:~/elasticsearch$ curl -XPOST http://localhost:9200/_cluster/nodes/_shutdown

or shut down a specific node, e.g. BlrmMvBdSKiCeYGsiHijdg:

gannyee@ubuntu:~/elasticsearch$ curl -XPOST http://localhost:9200/_cluster/nodes/BlrmMvBdSKiCeYGsiHijdg/_shutdown

Check that ElasticSearch is running:

gannyee@ubuntu:~/elasticsearch$ curl -XGET 'http://localhost:9200'
{
  "status" : 200,
  "name" : "gannyee",
  "cluster_name" : "gannyee",
  "version" : {
    "number" : "1.4.4",
    "build_hash" : "c88f77ffc81301dfa9dfd81ca2232f09588bd512",
    "build_timestamp" : "2015-02-19T13:05:36Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.3"
  },
  "tagline" : "You Know, for Search"
}
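For scripted health checks it is handy to pull a single field out of that JSON reply. jq may not be installed on a stock 14.04 box, so the sketch below uses python3 instead (the `es_status` name is mine):

```shell
es_status() {
  # read the ElasticSearch JSON reply on stdin and print its "status" field
  python3 -c 'import json, sys; print(json.load(sys.stdin)["status"])'
}

# demo on a canned reply shaped like the output above:
echo '{"status" : 200, "name" : "gannyee"}' | es_status   # → 200
```

Piping the live output of `curl -s http://localhost:9200` through `es_status` should likewise print 200 when the node is healthy.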

elasticsearch-head is a cluster management tool for ElasticSearch. It is a standalone web page written entirely in HTML5 and can be integrated into ES as a plugin.

Install the elasticsearch-head plugin:

gannyee@ubuntu:~$ cd elasticsearch
gannyee@ubuntu:~/elasticsearch$ ./bin/plugin -install mobz/elasticsearch-head

Restart ElasticSearch and open http://localhost:9200/_plugin/head/ in a browser.
The buttons on the right-hand side of the page (such as node stats and cluster nodes) call the ES status APIs directly and return JSON.

Kibana: download, install

An open-source, browser-based analytics and search dashboard for ElasticSearch.
Version: kibana-4.0.1

gannyee@ubuntu:~/download$ wget https://download.elasticsearch.org/kibana/kibana/kibana-4.0.1-linux-x64.tar.gz
gannyee@ubuntu:~/download$ tar zxvf kibana-4.0.1-linux-x64.tar.gz
gannyee@ubuntu:~/download$ mv kibana-4.0.1-linux-x64/ ../kibana/
gannyee@ubuntu:~/download$ cd ../kibana/
gannyee@ubuntu:~/kibana$ ./bin/kibana

Kibana is now reachable at http://127.0.0.1:5601.
Apache Nutch: install, build, configure

An open-source web crawler that grew out of Lucene. This setup requires the Nutch 2.x series; the 1.x series does not support MongoDB or other backends such as MySQL and HBase.
Version: apache-nutch-2.3.1

Download, build, and configure Nutch 2.3.1:

gannyee@ubuntu:~/download$ wget http://www.apache.org/dyn/closer.lua/nutch/2.3.1/apache-nutch-2.3.1-src.tar.gz
gannyee@ubuntu:~/download$ tar zxvf apache-nutch-2.3.1-src.tar.gz
gannyee@ubuntu:~/download$ mv apache-nutch-2.3.1 ../nutch
gannyee@ubuntu:~/download$ cd ../nutch
gannyee@ubuntu:~/nutch$ export NUTCH_HOME=$(pwd)

Edit conf/nutch-site.xml so that MongoDB is used as the Gora storage backend:

gannyee@ubuntu:~/nutch/conf$ vim nutch-site.xml

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.mongodb.store.MongoStore</value>
  <description>Default class for storing data</description>
</property>

Uncomment the gora-mongodb dependency in the ivy/ivy.xml file:

gannyee@ubuntu:~/nutch$ vim $NUTCH_HOME/ivy/ivy.xml
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />

Make sure MongoStore is set as the default data store:

gannyee@ubuntu:~/nutch$ vim conf/gora.properties
############################
# MongoDBStore properties  #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
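A quick way to confirm which datastore Nutch will pick up is to read the property back out of the file. A minimal sketch (the `gora_datastore` helper and the sample file path are my own inventions):

```shell
gora_datastore() {
  # print the value configured for gora.datastore.default in a properties file
  grep '^gora.datastore.default=' "$1" | cut -d= -f2
}

# demo against a copy of the settings above (hypothetical temp file):
cat > /tmp/gora.properties.sample <<'EOF'
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.servers=localhost:27017
EOF
gora_datastore /tmp/gora.properties.sample   # → org.apache.gora.mongodb.store.MongoStore
```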

Build Nutch:

gannyee@ubuntu:~/nutch$ ant runtime

If the build prints errors like the following:

Trying to override old definition of task javac
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

they are caused by a missing jar and can actually be ignored. To silence them, download sonar-ant-task-2.1.jar, copy it into the $NUTCH_HOME/lib directory, and edit $NUTCH_HOME/build.xml so that the task definition references the jar added above.

The build output is placed in the newly created /nutch/runtime directory.

Finally, confirm that Nutch was built correctly and runs; the output should look like this:

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch
Usage: nutch COMMAND
where COMMAND is one of:
inject inject new urls into the database
hostinject creates or updates an existing host table from a text file
generate generate new batches to fetch from crawl db
fetch fetch URLs marked during generate
parse parse URLs marked during fetch
updatedb update web table after parsing
updatehostdb update host table after parsing
readdb read/dump records from page database
readhostdb display entries from the hostDB
index run the plugin-based indexer on parsed batches
elasticindex run the elasticsearch indexer - DEPRECATED use the index command instead
solrindex run the solr indexer on parsed batches - DEPRECATED use the index command instead
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
clean remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
plugin load a plugin and run one of its classes main()
nutchserver run a (local) Nutch server on a user defined port
webapp run a local Nutch web application
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

Customize your crawl settings:

gannyee@ubuntu:~$ sudo vim ~/nutch/runtime/local/conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>Hist Crawler</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>
  <property>
    <name>elastic.cluster</name>
    <value>hist</value>
  </property>
  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>
  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>6553600</value>
  </property>
</configuration>

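To verify a property value in this file without eyeballing the XML, a small reader can be sketched (assuming python3 is available; the `nutch_prop` helper and the sample file path are mine):

```shell
nutch_prop() {
  # print the <value> of the named <property> in a Nutch-style config file
  python3 - "$1" "$2" <<'EOF'
import sys, xml.etree.ElementTree as ET
path, wanted = sys.argv[1], sys.argv[2]
for prop in ET.parse(path).getroot().findall('property'):
    if prop.findtext('name') == wanted:
        print(prop.findtext('value'))
EOF
}

# demo against a minimal config shaped like the one above (hypothetical temp file):
cat > /tmp/nutch-site.sample.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>
</configuration>
EOF
nutch_prop /tmp/nutch-site.sample.xml elastic.index   # → nutch
```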
Crawl your first web page.
Create a seed list of URLs:

gannyee@ubuntu:~$ mkdir -p ~/nutch/runtime/local/urls
gannyee@ubuntu:~$ echo 'http://www.aossama.com/' > ~/nutch/runtime/local/urls/seed.txt

Edit the conf/regex-urlfilter.txt file and replace the catch-all rule

# accept anything else
+.

with a regular expression matching the domain you want to crawl:

+^http://([a-z0-9]*.)*aossama.com/
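The rule can be dry-run against candidate URLs before crawling. In the sketch below (the `accepts` helper is mine) the dots are escaped, which is slightly stricter than the line above; note that the leading `+` in regex-urlfilter.txt means "accept" and is not part of the regex itself:

```shell
accepts() {
  # mirror the filter rule: match URLs on aossama.com or any of its subdomains
  echo "$1" | grep -Eq '^http://([a-z0-9]*\.)*aossama\.com/'
}

accepts 'http://www.aossama.com/' && echo accepted   # → accepted
accepts 'http://example.com/' || echo rejected       # → rejected
```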

Initialize the crawldb:

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch inject urls/

Generate URLs from the crawldb:

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch generate -topN 80

Fetch all the generated URLs:

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch fetch -all

Parse the fetched URLs:

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch parse -all

Update the database after parsing:

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch updatedb -all

Index the parsed URLs:

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch index -all
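The six commands above form one crawl round and are usually wrapped in a script. A hedged sketch (the `crawl_cycle` name and the NUTCH_BIN/TOPN/ROUNDS variables are my own conventions, not Nutch's):

```shell
crawl_cycle() {
  # NUTCH_BIN, TOPN and ROUNDS are assumptions -- override them as needed
  local nutch="${NUTCH_BIN:-./bin/nutch}"
  local topn="${TOPN:-80}"
  local rounds="${ROUNDS:-1}" i
  "$nutch" inject urls/ || return 1
  for i in $(seq 1 "$rounds"); do
    "$nutch" generate -topN "$topn" || return 1
    "$nutch" fetch -all             || return 1
    "$nutch" parse -all             || return 1
    "$nutch" updatedb -all          || return 1
  done
  "$nutch" index -all
}
```

Run it from runtime/local; raising ROUNDS deepens the crawl by repeating the generate/fetch/parse/updatedb loop before indexing.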

After the crawl of the given page finishes, MongoDB contains a new database, nutch_1:

gannyee@ubuntu:~/mongodb$ ./bin/mongo
MongoDB shell version: 2.6.11
connecting to: test
> show dbs
admin    (empty)
local    0.031GB
nutch_1  0.031GB
test     (empty)
> use nutch_1
switched to db nutch_1
> show tables
system.indexes
webpage

The crawled data can be inspected either from the terminal with shell commands or by clicking through the GUI.
