Nutch 2.2.1 + MySQL + Solr 4.10.3: Installation and Deployment

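Introduction

This article summarizes how I set up a Nutch platform, and is meant as a guide for anyone trying Nutch for the first time.

Environment

· Operating system: Ubuntu 18.04 LTS

· Software versions: Nutch 2.2.1, Solr 4.10.3

Platform structure

As the title suggests, the platform has three parts: Nutch, the database, and the front end.

Nutch: the part to the left of Index in the diagram. It fetches and parses web pages and hands them to the database for storage.

Database: stores the fetched page data. Nutch 1.x is built on Hadoop and uses HDFS as its underlying storage, whereas 2.x goes through Apache Gora, which lets Nutch use stores such as HBase, Accumulo, Cassandra, MySQL, DataFileAvroStore and AvroStore.

Front end: Tomcat is a free, open-source web application server, and Solr is the search application.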

[Figure: architecture diagram]

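Deployment

We start from a fresh Ubuntu 18.04 server. First create a directory to hold the downloaded software; put it wherever suits you. If you are not confident about adapting the paths used in the rest of this tutorial, follow the commands exactly as written.

lemon@ubuntu:~$ mkdir ~/download/ # create a directory for downloaded files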

1. Install the JDK

step 1. Download the Oracle JDK

step 2. Unpack it

step 3. Add it to the environment variables

The commands are as follows:

lemon@ubuntu:~$ cd ~/download/

lemon@ubuntu:~/download$ wget http://download.oracle.com/otn-pub/java/jdk/8u191-b12/2787e4a523244c269598db4e85c51e0c/jdk-8u191-linux-x64.tar.gz

lemon@ubuntu:~/download$ tar vxf jdk-8u191-linux-x64.tar.gz

lemon@ubuntu:~/download$ ls # list the files in the current directory

jdk1.8.0_191 jdk-8u191-linux-x64.tar.gz

lemon@ubuntu:~/download$ sudo mv jdk1.8.0_191/ /usr/local/jdk1.8/ # move the jdk1.8.0_191 directory to /usr/local/ and rename it to jdk1.8

lemon@ubuntu:~/download$ sudo vim /etc/profile # edit the environment variables

Append the following to the end of the file:

export JAVA_HOME=/usr/local/jdk1.8

export JRE_HOME=${JAVA_HOME}/jre

export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib

export PATH=.:${JAVA_HOME}/bin:$PATH

Save the file and reload it so the changes take effect:

lemon@ubuntu:~/download$ source /etc/profile # reload the environment variables

lemon@ubuntu:~$ java -version # if this prints the following, the JDK is installed correctly

java version "1.8.0_191"

Java(TM) SE Runtime Environment (build 1.8.0_191-b12)

Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
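
As an alternative to editing /etc/profile by hand, the same variables can be kept in a small file of their own. The following is a minimal sketch, not part of the original procedure: the file name jdk-env.sh and its temporary location are assumptions for illustration, and the leading "." on PATH is left out because putting the current directory on PATH is generally discouraged.

```shell
# Sketch: write the JDK variables to a separate file and load it.
# "snippet" is a hypothetical path chosen for this example only.
snippet="${TMPDIR:-/tmp}/jdk-env.sh"

cat > "$snippet" <<'EOF'
export JAVA_HOME=/usr/local/jdk1.8
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
EOF

# Load the variables into the current shell and show the result.
. "$snippet"
echo "JAVA_HOME=$JAVA_HOME"
```

To apply it for all users, the file could be copied to /etc/profile.d/ with sudo; on Ubuntu, /etc/profile sources the *.sh files in that directory at login.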

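2. Install MySQL

step 1. Install and configure MySQL

step 2. Create the database and table

I selected the LAMP server option when installing Ubuntu, so MySQL was already installed and only needed to be configured.

Check whether it is installed:

lemon@ubuntu:~$ mysql # if you see the following message, MySQL is installed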

ERROR 1045 (28000): Access denied for user 'lemon'@'localhost' (using password: NO)

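If it is not installed:

lemon@ubuntu:~$ sudo apt-get install mysql-server

lemon@ubuntu:~$ sudo apt install mysql-client

lemon@ubuntu:~$ sudo apt install libmysqlclient-dev

If it is already installed:

lemon@ubuntu:~$ sudo mysql_secure_installation

Either way you then go through the MySQL setup dialog (if it does not start by itself after a fresh install, run sudo mysql_secure_installation). Answer it as follows: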

#1

VALIDATE PASSWORD PLUGIN can be used to test passwords...

Press y|Y for Yes, any other key for No: N (do not enable the password strength check)

#2

Please set the password for root here...

New password: (set the root password)

Re-enter new password: (enter it again)

#3

By default, a MySQL installation has an anonymous user,

allowing anyone to log into MySQL without having to have

a user account created for them...

Remove anonymous users? (Press y|Y for Yes, any other key for No) : Y (remove the anonymous user)

#4

Normally, root should only be allowed to connect from

'localhost'. This ensures that someone cannot guess at

the root password from the network...

Disallow root login remotely? (Press y|Y for Yes, any other key for No) : Y (do not allow root to log in remotely)

#5

By default, MySQL comes with a database named 'test' that

anyone can access...

Remove test database and access to it? (Press y|Y for Yes, any other key for No) : N

#6

Reloading the privilege tables will ensure that all changes

made so far will take effect immediately.

Reload privilege tables now? (Press y|Y for Yes, any other key for No) : Y (reload the privilege tables immediately)

All done!

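Next, log in to MySQL:

# On recent MySQL packages, root cannot log in with a password right after installation; log in through sudo first and switch the authentication method

lemon@ubuntu:~$ sudo mysql -uroot -p

Enter password: (leave empty)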

mysql>

mysql>UPDATE mysql.user SET authentication_string=PASSWORD('LEMON'), plugin='mysql_native_password' WHERE user='root';

mysql> FLUSH PRIVILEGES;

mysql>exit

lemon@ubuntu:~$ sudo service mysql restart
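
The UPDATE statement above relies on the PASSWORD() function, which exists in the MySQL 5.7 shipped with Ubuntu 18.04 but was removed in MySQL 8.0. A hedged alternative that works on both is ALTER USER. The sketch below only prints the SQL; root_auth_sql is a hypothetical helper name, and the password must not contain a single quote.

```shell
# Sketch: print SQL that switches root@localhost to password login.
# root_auth_sql is an illustrative helper, not part of MySQL.
# $1 = the new root password (must not contain a single quote).
root_auth_sql() {
  cat <<SQL
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY '$1';
FLUSH PRIVILEGES;
SQL
}

# Show the statements for the password used in this tutorial.
root_auth_sql 'LEMON'
```

To run it rather than print it, the output can be piped into the client, for example root_auth_sql 'LEMON' | sudo mysql.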

lemon@ubuntu:~$ mysql -u root -p

Enter password: (the password set in the previous step, i.e. the argument of PASSWORD())

mysql>CREATE DATABASE nutch;

mysql>USE nutch

mysql> CREATE TABLE `webpage` (

`id` varchar(767) NOT NULL,

`headers` blob,

`text` mediumtext DEFAULT NULL,

`status` int(11) DEFAULT NULL,

`markers` blob,

`parseStatus` blob,

`modifiedTime` bigint(20) DEFAULT NULL,

`score` float DEFAULT NULL,

`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,

`baseUrl` varchar(767) DEFAULT NULL,

`content` longblob,

`title` varchar(2048) DEFAULT NULL,

`reprUrl` varchar(767) DEFAULT NULL,

`fetchInterval` int(11) DEFAULT NULL,

`prevFetchTime` bigint(20) DEFAULT NULL,

`inlinks` mediumblob,

`prevSignature` blob,

`outlinks` mediumblob,

`fetchTime` bigint(20) DEFAULT NULL,

`retriesSinceFetch` int(11) DEFAULT NULL,

`protocolStatus` blob,

`signature` blob,

`metadata` blob,

`batchId` varchar(767) DEFAULT NULL,

PRIMARY KEY (`id`)

) ENGINE=InnoDB

ROW_FORMAT=COMPRESSED

DEFAULT CHARSET=utf8mb4;

mysql>exit

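* By default, recent MySQL versions do not accept remote connections; to allow remote access, a few changes are needed:

lemon@ubuntu:~$ sudo vim /etc/mysql/mysql.conf.d/mysqld.cnf

# Comment out the line bind-address = 127.0.0.1, then restart the MySQL service

lemon@ubuntu:~$ sudo service mysql restart

You can then reach the database from other machines with tools such as Navicat.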
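
Since remote root login was disabled during the setup dialog, a remote client also needs an account that is allowed to connect from its host. The following is a minimal sketch under assumptions: the user name nutch, the host pattern and the password are placeholders, and remote_user_sql is a hypothetical helper that only prints the SQL.

```shell
# Sketch: print SQL that creates an account for remote access to the
# nutch database. remote_user_sql is an illustrative helper.
# $1 = user name, $2 = host or host pattern, $3 = password (no single quotes).
remote_user_sql() {
  cat <<SQL
CREATE USER '$1'@'$2' IDENTIFIED BY '$3';
GRANT ALL PRIVILEGES ON nutch.* TO '$1'@'$2';
FLUSH PRIVILEGES;
SQL
}

# Example values only: allow user "nutch" from the 192.168.1.x network.
remote_user_sql nutch '192.168.1.%' 'change-me'
```

The output can be piped into mysql -u root -p to apply it. Restricting the host pattern to the local network is narrower than the '%' wildcard, which accepts connections from any address.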

[Figure: Navicat]

3. Install Nutch

step 1. Download Nutch

step 2. Unzip it

step 3. Edit ivy.xml, gora.properties and nutch-site.xml

step 4. Build Nutch

step 5. Configure and run a crawl

The commands are as follows:

lemon@ubuntu:~$ cd ~/download/

lemon@ubuntu:~/download$ wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.zip

lemon@ubuntu:~/download$ unzip apache-nutch-2.2.1-src.zip

# If unzip is not installed, install it first with: sudo apt install unzip

lemon@ubuntu:~/download$ mkdir ~/software

lemon@ubuntu:~/download$ mv apache-nutch-2.2.1 ~/software/

lemon@ubuntu:~/download$ cd ~/software

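Edit ivy.xml (selects the database used by the storage layer):

lemon@ubuntu:~/software$ vim apache-nutch-2.2.1/ivy/ivy.xml

Uncomment the two SQL-related dependency lines (each ends in conf="*->default"/>); in the stock file these are the gora-sql and mysql-connector-java dependencies. Then change the gora-core dependency: the comment shipped in ivy.xml notes that gora-sql is not compatible with gora-core 0.3 and that gora-core should be downgraded to 0.2.1.

Edit gora.properties (the database connection parameters):

lemon@ubuntu:~/software$ vim apache-nutch-2.2.1/conf/gora.properties

Comment out the default database connection settings and add the following: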

gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver

gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true

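gora.sqlstore.jdbc.user=xxxx

gora.sqlstore.jdbc.password=xxxx

Replace the two xxxx placeholders with your MySQL user name and password. If the database is not on this machine, replace localhost with the database host's address.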
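
To make the substitution explicit, the four lines can be generated from a host name and an account. This is only a sketch: gora_sql_props is a hypothetical helper, and the host, user and password in the example call are placeholders.

```shell
# Sketch: print the gora.properties lines for a given database host
# and account. gora_sql_props is an illustrative helper, not part of
# Nutch or Gora.
# $1 = database host, $2 = MySQL user, $3 = MySQL password
gora_sql_props() {
  cat <<EOF
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://$1:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=$2
gora.sqlstore.jdbc.password=$3
EOF
}

# Example values only: a database on another machine.
gora_sql_props 192.0.2.10 nutch 'change-me'
```

The printed lines can be pasted into conf/gora.properties in place of the localhost version.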

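Edit nutch-site.xml (configures Nutch):

lemon@ubuntu:~/software$ vim apache-nutch-2.2.1/conf/nutch-site.xml

Add the following inside the <configuration> element:

<property>
  <name>http.agent.name</name>
  <value>LemonSpider</value>
</property>
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field. This allows selecting non-English language as default one to retrieve. It is a useful setting for search engines build for certain national group.</description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information is available</description>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data. Currently the following stores are available: ….</description>
</property>
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-jsoup</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>LemonSpider,*</value>
</property>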
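
A quick way to confirm that the properties listed above made it into the file is to search for their names. The following is a rough sketch: missing_props is a hypothetical helper that only greps for <name>…</name> entries and does not check that the XML is well formed.

```shell
# Sketch: print every required property name that is NOT present in a
# Nutch configuration file. missing_props is an illustrative helper.
# $1 = path to the file, remaining arguments = property names to look for
missing_props() {
  conf_file="$1"
  shift
  for prop in "$@"; do
    grep -qF "<name>${prop}</name>" "$conf_file" || echo "$prop"
  done
}

# Example: check the file edited in this tutorial, if it exists.
site_xml="$HOME/software/apache-nutch-2.2.1/conf/nutch-site.xml"
if [ -f "$site_xml" ]; then
  missing_props "$site_xml" http.agent.name storage.data.store.class plugin.includes
fi
```

No output means all listed names were found. Note that the build copies conf/ into runtime/local/conf/, so edits made after building need a rebuild or must be repeated in that copy.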

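Build Nutch

lemon@ubuntu:~$ cd ~/software/apache-nutch-2.2.1

lemon@ubuntu:~/software/apache-nutch-2.2.1$ ant

# If ant is not installed, install it first with: sudo apt install ant

# The build takes quite a while and downloads its dependencies, so stay online

Crawl configuration

lemon@ubuntu:~$ cd ~/software/apache-nutch-2.2.1/runtime/local

lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local$ mkdir -p urls

lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local$ echo 'http://www.apache.org/' > urls/seed.txt # the site to crawl

lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local$ bin/nutch crawl urls -depth 3 -topN 5 # run the crawl

-depth is the number of crawl rounds, that is, how many links deep to follow from the seed URLs, and -topN limits each round to the N top-scoring pages. See the official documentation for the full set of options.

If you get an error, check that you followed the steps above exactly and that no characters were accidentally added or deleted while editing the configuration files.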
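
The one-step crawl command above bundles several jobs, and running them one at a time makes it easier to see which job fails. The following is a rough sketch, assuming the Nutch 2.x job names inject, generate, fetch, parse and updatedb with their -topN and -all options; run bin/nutch without arguments to check the exact usage in your build. The NUTCH variable is an assumption introduced for this example and defaults to a dry run that only prints each command.

```shell
# Sketch: one crawl round as separate Nutch 2.x jobs.
# NUTCH defaults to a dry run (prints the commands); set NUTCH=bin/nutch
# inside runtime/local to actually execute them.
NUTCH="${NUTCH:-echo bin/nutch}"

# $1 = maximum number of URLs to select per round (the -topN value)
crawl_round() {
  # select up to $1 URLs for this round
  $NUTCH generate -topN "$1" || return 1
  # download the selected pages
  $NUTCH fetch -all || return 1
  # extract text and outlinks from the fetched pages
  $NUTCH parse -all || return 1
  # merge newly discovered links back into the webpage table
  $NUTCH updatedb
}

# Load the seed list once, then run three rounds (matching -depth 3).
$NUTCH inject urls
for round in 1 2 3; do
  crawl_round 5 || break
done
```

With the default dry run the script just lists the commands in order, which is a convenient way to review them before pointing NUTCH at the real binary.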

Sample output from a successful run:

-finishing thread FetcherThread3, activeThreads=5

-finishing thread FetcherThread9, activeThreads=6

-finishing thread FetcherThread2, activeThreads=7

-finishing thread FetcherThread1, activeThreads=8

-finishing thread FetcherThread8, activeThreads=9

-finishing thread FetcherThread6, activeThreads=4

-finishing thread FetcherThread7, activeThreads=3

-finishing thread FetcherThread0, activeThreads=2

-finishing thread FetcherThread4, activeThreads=1

-finishing thread FetcherThread5, activeThreads=0

0/0 spinwaiting/active, 11 pages, 0 errors, 0.4 0 pages/s, 78 36 kb/s, 0 URLs in 0 queues

-activeThreads=0

ParserJob: resuming: false

ParserJob: forced reparse: false

ParserJob: parsing all

Parsing http://accumulo.apache.org/

Parsing http://activemq.apache.org/

Parsing http://airavata.apache.org/

Parsing http://allura.apache.org/

Parsing http://ambari.apache.org/

Parsing http://www.apache.org/

Parsing http://www.apache.org/foundation/sponsorship.html

Parsing http://www.apache.org/foundation/thanks.html

Parsing http://www.apache.org/licenses/

Parsing http://www.apache.org/licenses/LICENSE-2.0

Parsing http://www.apache.org/security/

lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local$

4. Install Tomcat

step 1. Download Tomcat

step 2. Unpack it

step 3. Start it

lemon@ubuntu:~$ cd download/

lemon@ubuntu:~/download$ wget http://archive.apache.org/dist/tomcat/tomcat-8/v8.0.33/bin/apache-tomcat-8.0.33.tar.gz

lemon@ubuntu:~/download$ tar vxf apache-tomcat-8.0.33.tar.gz

lemon@ubuntu:~/download$ mv apache-tomcat-8.0.33 ~/software/

lemon@ubuntu:~/download$ cd ~/software/apache-tomcat-8.0.33/

lemon@ubuntu:~/software/apache-tomcat-8.0.33$ bin/startup.sh

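At this point, open localhost:8080 or 127.0.0.1:8080 in a browser on the server. Machines on the same LAN can use the server's IP address with port 8080; for example, this server's internal IP is 114.212.167.106, so other machines on the LAN can open 114.212.167.106:8080.

If you see the following page, Tomcat is installed: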

[Figure: Tomcat page]

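5. Install Solr and integrate it with Tomcat

step 1. Download Solr

step 2. Unzip it

step 3. Create a solr directory under Tomcat's webapps directory

step 4. Copy solr.war from solr-4.10.3/example/webapps/ into the solr directory created in step 3 and unpack it there

step 5. Copy solr-4.10.3/example/solr into the Tomcat directory as the Solr home (it contains the collection1 directory), then copy schema.xml from apache-nutch-2.2.1/conf/ into collection1/conf/

step 6. Edit webapps/solr/WEB-INF/web.xml under the Tomcat directory, and set dataDir in collection1/conf/solrconfig.xml

step 7. Copy the jar files in solr-4.10.3/example/lib/ext/ to tomcat/webapps/solr/WEB-INF/lib/

step 8. Create a classes directory under tomcat/webapps/solr/WEB-INF/ and copy log4j.properties from solr-4.10.3/example/resources into it

step 9. Restart Tomcat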

lemon@ubuntu:~$ cd ~/download/

lemon@ubuntu:~/download$ wget http://archive.apache.org/dist/lucene/solr/4.10.3/solr-4.10.3.zip

lemon@ubuntu:~/download$ unzip solr-4.10.3.zip

lemon@ubuntu:~/download$ mv solr-4.10.3 ../software/

lemon@ubuntu:~/download$ cd ../software/

lemon@ubuntu:~/software$ cd apache-tomcat-8.0.33/webapps/

lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ mkdir solr

lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ cp ~/software/solr-4.10.3/example/webapps/solr.war ./solr/

lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ (cd solr && jar xvf solr.war) # unpack the war inside the solr directory it was copied to

lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ cp -r ~/software/solr-4.10.3/example/solr ../

lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ cp ~/software/apache-nutch-2.2.1/conf/schema.xml ../solr/collection1/conf/

lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ vim solr/WEB-INF/web.xml

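Uncomment the following block and set its value to the Solr home directory:

<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-value>/home/lemon/software/apache-tomcat-8.0.33/solr</env-entry-value>
  <env-entry-type>java.lang.String</env-entry-type>
</env-entry>

lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ cd ..

lemon@ubuntu:~/software/apache-tomcat-8.0.33$ vim ~/software/apache-tomcat-8.0.33/solr/collection1/conf/solrconfig.xml

Set the dataDir element to:

<dataDir>${solr.data.dir:/home/lemon/software/apache-tomcat-8.0.33/solr/collection1/data}</dataDir>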

lemon@ubuntu:~/software/apache-tomcat-8.0.33$ cp ~/software/solr-4.10.3/example/lib/ext/* ~/software/apache-tomcat-8.0.33/webapps/solr/WEB-INF/lib/

lemon@ubuntu:~/software/apache-tomcat-8.0.33$ mkdir ~/software/apache-tomcat-8.0.33/webapps/solr/WEB-INF/classes

lemon@ubuntu:~/software/apache-tomcat-8.0.33$ cp ~/software/solr-4.10.3/example/resources/log4j.properties ~/software/apache-tomcat-8.0.33/webapps/solr/WEB-INF/classes
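
Before restarting Tomcat, it can help to confirm that the copy steps above put everything in place. The following is a minimal sketch: missing_solr_files is a hypothetical helper, and the four relative paths are the ones used in this tutorial.

```shell
# Sketch: print the files from the steps above that are still missing
# under a Tomcat directory. missing_solr_files is an illustrative helper.
# $1 = Tomcat installation directory
missing_solr_files() {
  for rel in \
    webapps/solr/WEB-INF/web.xml \
    webapps/solr/WEB-INF/classes/log4j.properties \
    solr/collection1/conf/schema.xml \
    solr/collection1/conf/solrconfig.xml
  do
    [ -e "$1/$rel" ] || echo "$rel"
  done
}

# Example: check the Tomcat directory used in this tutorial.
missing_solr_files "$HOME/software/apache-tomcat-8.0.33"
```

No output means all four files exist; any path that is printed points at the step to repeat.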

Finally, restart Tomcat.

lemon@ubuntu:~/software/apache-tomcat-8.0.33$ bin/shutdown.sh

lemon@ubuntu:~/software/apache-tomcat-8.0.33$ bin/startup.sh


6. Index the crawled data with Solr

lemon@ubuntu:~/software/apache-nutch-2.2.1$ cd ~/software/apache-nutch-2.2.1/runtime/local/

lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local/$ bin/nutch crawl -solr http://127.0.0.1:8080/solr/ -reindex
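
If the crawl form above does not accept these options in your build, indexing can also be run as a job of its own. The following is a rough sketch, assuming the Nutch 2.x solrindex job and its -reindex option; run bin/nutch without arguments to confirm the usage. NUTCH and SOLR_URL are variables introduced for this example, and NUTCH defaults to a dry run that only prints the command.

```shell
# Sketch: send everything already in the webpage table to Solr.
# NUTCH defaults to a dry run; set NUTCH=bin/nutch inside runtime/local
# to actually run the job.
NUTCH="${NUTCH:-echo bin/nutch}"
SOLR_URL="${SOLR_URL:-http://127.0.0.1:8080/solr/}"

index_all() {
  $NUTCH solrindex "$SOLR_URL" -reindex
}

index_all
```

Keeping the Solr URL in a variable makes it easy to point the same command at a Solr instance on another host.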

Search interface:

[Figure: search results]

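Conclusion

I ran into plenty of pitfalls during this setup; this article is the complete procedure with those pitfalls avoided, so following it exactly should work. Since this is a demonstration, I did not use the more complex HBase as the store, though I plan to try it next.

If you hit an error, check that the versions match, the paths are correct, and the configuration edits are right.

I have collected the errors I encountered during deployment into a separate troubleshooting write-up, which I expect to finish soon and hope will be useful.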
