Installing and Using Nutch on a Hadoop Cluster

Latest recommended article published 2016-06-23 10:23:30

july_2


1. Set up a Hadoop 2.5.1 cluster and configure YARN

2. Create the hadoop user

useradd hadoop

passwd hadoop

3. Build the nutch-1.7 directory as the hadoop user
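
The step above does not show the build command itself. Nutch 1.x source releases are built with Apache Ant, and `ant runtime` generates the runtime/local and runtime/deploy directories. The following is a minimal sketch, assuming the source tree is at /home/nutch/nutch-1.7 (the path used in step 4) and that Ant and a JDK are on the PATH. The helper names `has_job_file` and `build_nutch` are hypothetical.

```shell
# Assumed location of the Nutch source tree (taken from step 4).
NUTCH_SRC="${NUTCH_SRC:-/home/nutch/nutch-1.7}"

# has_job_file DIR: succeed if DIR/runtime/deploy contains a .job archive.
has_job_file() {
    for f in "$1"/runtime/deploy/*.job; do
        if [ -f "$f" ]; then
            return 0
        fi
    done
    return 1
}

# build_nutch: run "ant runtime" only when the job archive is missing,
# then confirm that the build produced one. Run this as the hadoop user.
build_nutch() {
    if has_job_file "$NUTCH_SRC"; then
        return 0
    fi
    (cd "$NUTCH_SRC" && ant runtime) && has_job_file "$NUTCH_SRC"
}
```

Calling `build_nutch` as the hadoop user would then build the tree once and skip the build on later runs.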

4. Create the seed file as the hadoop user

Change into the /home/nutch/nutch-1.7/runtime/deploy directory:

mkdir urls

sh-4.2$ cd urls

sh-4.2$ touch seed.txt

sh-4.2$ vi seed.txt
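
The commands above create seed.txt and then edit it interactively in vi. Nutch reads the seed list as one URL per line, so the same file can be written non-interactively. This is a minimal sketch; the helper name `write_seeds` and the example URL are assumptions, not part of the original setup.

```shell
# write_seeds DIR URL...: create DIR/seed.txt with one URL per line.
write_seeds() {
    dir="$1"
    shift
    # Refuse to write an empty seed list.
    if [ "$#" -eq 0 ]; then
        echo "write_seeds: no URLs given" >&2
        return 1
    fi
    mkdir -p "$dir" || return 1
    printf '%s\n' "$@" > "$dir/seed.txt"
}

# Example call (hypothetical seed URL):
#   write_seeds urls http://nutch.apache.org/
```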

5. As the hadoop user, create a directory in HDFS and list it [prepare this in advance]

sh-4.2$ /home/hadoop/hadoop-2.5.1/bin/hdfs dfs -mkdir -p data

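Because the path is relative, HDFS resolves it against the current user's home directory, /user/<username>. When the commands are run as root, the seed file actually ends up at /user/root/data/urls/seed.txt; for the hadoop user it is /user/hadoop/data/urls/seed.txt.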

sh-4.2$ /home/hadoop/hadoop-2.5.1/bin/hdfs dfs -ls

Found 1 items

drwxr-xr-x - hadoop supergroup 0 2014-10-09 17:05 data
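
The remark about the real storage path follows from how HDFS treats relative paths: they are resolved against the calling user's home directory. The sketch below mimics that rule locally so the final location can be predicted before uploading. It assumes the default home prefix /user, and `hdfs_abs_path` is a hypothetical helper rather than a Hadoop command.

```shell
# hdfs_abs_path USER PATH: print the absolute HDFS path that PATH refers to
# when USER runs the command. Absolute paths are printed unchanged.
hdfs_abs_path() {
    case "$2" in
        /*) printf '%s\n' "$2" ;;
        *)  printf '/user/%s/%s\n' "$1" "$2" ;;
    esac
}

# Example calls:
#   hdfs_abs_path hadoop data/urls/seed.txt
#   hdfs_abs_path root data/urls/seed.txt
```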

6. Upload the seed file

/home/hadoop/hadoop-2.5.1/bin/hdfs dfs -put ./urls data

Confirm that the upload succeeded:

/home/hadoop/hadoop-2.5.1/bin/hdfs dfs -cat data/urls/seed.txt
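
The upload and the check can be combined so that a failed upload stops a script instead of going unnoticed. This is a sketch under the assumption that the hdfs binary is at the path used above. The function name `upload_seeds` is hypothetical, and it uses `hdfs dfs -test -e`, which returns a zero exit status when the path exists.

```shell
# Path to the hdfs launcher; override HDFS to point elsewhere.
HDFS="${HDFS:-/home/hadoop/hadoop-2.5.1/bin/hdfs}"

# upload_seeds SRC_DIR DEST_DIR: copy SRC_DIR into DEST_DIR on HDFS, then
# verify that DEST_DIR/<name of SRC_DIR>/seed.txt exists.
upload_seeds() {
    target="$2/$(basename "$1")/seed.txt"
    "$HDFS" dfs -put "$1" "$2" || return 1
    "$HDFS" dfs -test -e "$target"
}

# Example call, matching the commands above:
#   upload_seeds ./urls data && echo "seed file uploaded"
```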

8. Run the Nutch crawler

cd /home/hadoop/hadoop-2.5.1

./bin/hadoop jar apache-nutch-1.7.job org.apache.nutch.crawl.Crawl data/urls -dir crawl -threads 100 -depth 3 -topN 100
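
In the command above, the first argument is the HDFS directory holding the seed list, `-dir` names the output directory, `-threads` sets the number of parallel fetcher threads, `-depth` is the link depth to crawl from the seed pages, and `-topN` caps the number of pages retrieved at each level. No `-solr` URL is passed, so indexing is skipped. The wrapper below is a sketch that only makes those parameters explicit; `run_crawl` and the `HADOOP` and `JOB` variables are hypothetical names, with defaults assumed from the paths in this article.

```shell
# Assumed locations, based on the paths used in this article.
HADOOP="${HADOOP:-/home/hadoop/hadoop-2.5.1/bin/hadoop}"
JOB="${JOB:-/home/hadoop/hadoop-2.5.1/apache-nutch-1.7.job}"

# run_crawl SEEDS OUT THREADS DEPTH TOPN
#   SEEDS    HDFS directory containing seed.txt
#   OUT      HDFS output directory (-dir)
#   THREADS  number of fetcher threads (-threads)
#   DEPTH    link depth from the seed pages (-depth)
#   TOPN     maximum pages retrieved per level (-topN)
run_crawl() {
    "$HADOOP" jar "$JOB" org.apache.nutch.crawl.Crawl "$1" \
        -dir "$2" -threads "$3" -depth "$4" -topN "$5"
}

# Example call, equivalent to the command above:
#   run_crawl data/urls crawl 100 3 100
```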

The output is as follows:

14/10/11 10:00:04 WARN crawl.Crawl: solrUrl is not set, indexing will be skipped...

14/10/11 10:00:06 INFO crawl.Crawl: crawl started in: crawl

14/10/11 10:00:06 INFO crawl.Crawl: rootUrlDir = data/urls

14/10/11 10:00:06 INFO crawl.Crawl: threads = 100

14/10/11 10:00:06 INFO crawl.Crawl: depth = 3

14/10/11 10:00:06 INFO crawl.Crawl: solrUrl=null

14/10/11 10:00:06 INFO crawl.Crawl: topN = 100

14/10/11 10:00:07 INFO crawl.Injector: Injector: starting at 2014-10-11 10:00:07

14/10/11 10:00:07 INFO crawl.Injector: Injector: crawlDb: crawl/crawldb

14/10/11 10:00:07 INFO crawl.Injector: Injector: urlDir: data/urls

14/10/11 10:00:07 INFO Configuration.deprecation: mapred.temp.dir is deprecated. Instead, use mapreduce.cluster.temp.dir

14/10/11 10:00:07 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.

14/10/11 10:00:09 INFO client.RMProxy: Connecting to ResourceManager at idc66/192.168.56.66:8080

14/10/11 10:00:10 INFO client.RMProxy: Connecting to ResourceManager at idc66/192.168.56.66:8080

14/10/11 10:00:20 INFO mapred.FileInputFormat: Total input paths to process : 1

14/10/11 10:00:20 INFO mapreduce.JobSubmitter: number of splits:2

14/10/11 10:00:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1412989543453_0001

14/10/11 10:00:21 INFO impl.YarnClientImpl: Submitted application application_1412989543453_0001

14/10/11 10:00:21 INFO mapreduce.Job: The url to track the job: http://idc66:8088/proxy/application_1412989543453_0001/

14/10/11 10:00:21 INFO mapreduce.Job: Running job: job_1412989543453_0001

14/10/11 10:00:39 INFO mapreduce.Job: Job job_1412989543453_0001 running in uber mode : false

14/10/11 10:00:39 INFO mapreduce.Job: map 0% reduce 0%

14/10/11 10:01:26 INFO mapreduce.Job: map 100% reduce 0%

14/10/11 10:01:51 INFO mapreduce.Job: map 100% reduce 100%

14/10/11 10:01:52 INFO mapreduce.Job: Job job_1412989543453_0001 completed successfully

14/10/11 10:01:52 INFO mapreduce.Job: Counters: 50

File System Counters

FILE: Number of bytes read=6

FILE: Number of bytes written=351842

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=245

HDFS: Number of bytes written=86

HDFS: Number of read operations=9

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=2

Launched reduce tasks=1

Data-local map tasks=2

Total time spent by all maps in occupied slots (ms)=179870

Total time spent by all reduces in occupied slots (ms)=53334

Total time spent by all map tasks (ms)=89935

Total time spent by all reduce tasks (ms)=17778

Total vcore-seconds taken by all map tasks=89935

Total vcore-seconds taken by all reduce tasks=17778

Total megabyte-seconds taken by all map tasks=138140160

Total megabyte-seconds taken by all reduce tasks=54614016

Map-Reduce Framework

Map input records=2

Map output records=0

Map output bytes=0

Map output materialized bytes=12

Input split bytes=200

Combine input records=0

Combine output records=0

Reduce input groups=0

Reduce shuffle bytes=12

Reduce input records=0

Reduce output records=0

Spilled Records=0

Shuffled Maps =2

Failed Shuffles=0

Merged Map outputs=2

GC time elapsed (ms)=8397

CPU time spent (ms)=8130

Physical memory (bytes) snapshot=1627553792

Virtual memory (bytes) snapshot=7534977024

Total committed heap usage (bytes)=1441071104

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

injector

urls_filtered=2

File Input Format Counters

Bytes Read=45

File Output Format Counters

Bytes Written=86

14/10/11 10:01:52 INFO crawl.Injector: Injector: total number of urls rejected by filters: 2

14/10/11 10:01:52 INFO crawl.Injector: Injector: total number of urls injected after normalization and filtering: 0

14/10/11 10:01:52 INFO crawl.Injector: Injector: Merging injected urls into crawl db.

14/10/11 10:01:52 INFO client.RMProxy: Connecting to ResourceManager at idc66/192.168.56.66:8080

14/10/11 10:01:58 INFO mapred.FileInputFormat: Total input paths to process : 1

14/10/11 10:01:58 INFO mapreduce.JobSubmitter: number of splits:1

14/10/11 10:01:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1412989543453_0002

14/10/11 10:01:58 INFO impl.YarnClientImpl: Submitted application application_1412989543453_0002

14/10/11 10:01:58 INFO mapreduce.Job: The url to track the job: http://idc66:8088/proxy/application_1412989543453_0002/

14/10/11 10:01:58 INFO mapreduce.Job: Running job: job_1412989543453_0002

14/10/11 10:02:33 INFO mapreduce.Job: Job job_1412989543453_0002 running in uber mode : false

14/10/11 10:02:33 INFO mapreduce.Job: map 0% reduce 0%

14/10/11 10:02:40 INFO mapreduce.Job: map 100% reduce 0%

14/10/11 10:02:49 INFO mapreduce.Job: map 100% reduce 100%

14/10/11 10:02:49 INFO mapreduce.Job: Job job_1412989543453_0002 completed successfully

14/10/11 10:02:49 INFO mapreduce.Job: Counters: 49

File System Counters

FILE: Number of bytes read=6

FILE: Number of bytes written=234971

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=230

HDFS: Number of bytes written=215

HDFS: Number of read operations=7

HDFS: Number of large read operations=0

HDFS: Number of write operations=4

Job Counters

Launched map tasks=1

Launched reduce tasks=1

Data-local map tasks=1

Total time spent by all maps in occupied slots (ms)=10712

Total time spent by all reduces in occupied slots (ms)=17775

Total time spent by all map tasks (ms)=5356

Total time spent by all reduce tasks (ms)=5925

Total vcore-seconds taken by all map tasks=5356

Total vcore-seconds taken by all reduce tasks=5925

Total megabyte-seconds taken by all map tasks=8226816

Total megabyte-seconds taken by all reduce tasks=18201600

Map-Reduce Framework

Map input records=0

Map output records=0

Map output bytes=0

Map output materialized bytes=6

Input split bytes=144

Combine input records=0

Combine output records=0

Reduce input groups=0

Reduce shuffle bytes=6

Reduce input records=0

Reduce output records=0

Spilled Records=0

Shuffled Maps =1

Failed Shuffles=0

Merged Map outputs=1

GC time elapsed (ms)=44

CPU time spent (ms)=2470

Physical memory (bytes) snapshot=448868352

Virtual memory (bytes) snapshot=5610287104

Total committed heap usage (bytes)=724303872

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=86

File Output Format Counters

Bytes Written=215

14/10/11 10:02:49 INFO client.RMProxy: Connecting to ResourceManager at idc66/192.168.56.66:8080

14/10/11 10:02:49 INFO crawl.Injector: Injector: finished at 2014-10-11 10:02:49, elapsed: 00:02:41

14/10/11 10:02:49 INFO crawl.Generator: Generator: starting at 2014-10-11 10:02:49

14/10/11 10:02:49 INFO crawl.Generator: Generator: Selecting best-scoring urls due for fetch.

14/10/11 10:02:49 INFO crawl.Generator: Generator: filtering: true

14/10/11 10:02:49 INFO crawl.Generator: Generator: normalizing: true

14/10/11 10:02:49 INFO crawl.Generator: Generator: topN: 100

14/10/11 10:02:49 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

14/10/11 10:02:49 INFO crawl.Generator: Generator: jobtracker is 'local', generating exactly one partition.

14/10/11 10:02:49 INFO client.RMProxy: Connecting to ResourceManager at idc66/192.168.56.66:8080

14/10/11 10:02:55 INFO mapred.FileInputFormat: Total input paths to process : 1

14/10/11 10:02:55 INFO mapreduce.JobSubmitter: number of splits:1

14/10/11 10:02:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1412989543453_0003

14/10/11 10:02:55 INFO impl.YarnClientImpl: Submitted application application_1412989543453_0003

14/10/11 10:02:55 INFO mapreduce.Job: The url to track the job: http://idc66:8088/proxy/application_1412989543453_0003/

14/10/11 10:02:55 INFO mapreduce.Job: Running job: job_1412989543453_0003

14/10/11 10:03:08 INFO mapreduce.Job: Job job_1412989543453_0003 running in uber mode : false

14/10/11 10:03:08 INFO mapreduce.Job: map 0% reduce 0%

14/10/11 10:03:17 INFO mapreduce.Job: map 100% reduce 0%

14/10/11 10:04:15 INFO mapreduce.Job: map 100% reduce 100%

14/10/11 10:04:15 INFO mapreduce.Job: Job job_1412989543453_0003 completed successfully

14/10/11 10:04:15 INFO mapreduce.Job: Counters: 49

File System Counters

FILE: Number of bytes read=6

FILE: Number of bytes written=237427

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=205

HDFS: Number of bytes written=0

HDFS: Number of read operations=5

HDFS: Number of large read operations=0

HDFS: Number of write operations=0

Job Counters

Launched map tasks=1

Launched reduce tasks=1

Data-local map tasks=1

Total time spent by all maps in occupied slots (ms)=12016

Total time spent by all reduces in occupied slots (ms)=164820

Total time spent by all map tasks (ms)=6008

Total time spent by all reduce tasks (ms)=54940

Total vcore-seconds taken by all map tasks=6008

Total vcore-seconds taken by all reduce tasks=54940

Total megabyte-seconds taken by all map tasks=9228288

Total megabyte-seconds taken by all reduce tasks=168775680

Map-Reduce Framework

Map input records=0

Map output records=0

Map output bytes=0

Map output materialized bytes=6

Input split bytes=119

Combine input records=0

Combine output records=0

Reduce input groups=0

Reduce shuffle bytes=6

Reduce input records=0

Reduce output records=0

Spilled Records=0

Shuffled Maps =1

Failed Shuffles=0

Merged Map outputs=1

GC time elapsed (ms)=289

CPU time spent (ms)=3510

Physical memory (bytes) snapshot=929054720

Virtual memory (bytes) snapshot=5619216384

Total committed heap usage (bytes)=763297792

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=86

File Output Format Counters

Bytes Written=0

14/10/11 10:04:15 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...

14/10/11 10:04:15 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to fetch.

14/10/11 10:04:15 WARN crawl.Crawl: No URLs to fetch - check your seed list and URL filters.

14/10/11 10:04:15 INFO crawl.Crawl: crawl finished: crawl
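
Note that in this log the Injector rejected both seed URLs ("total number of urls rejected by filters: 2"), so nothing was injected and the crawl stopped at depth 0. The log does not say why. Common causes are malformed seed lines or the patterns in conf/regex-urlfilter.txt, which in deploy mode is packaged inside the .job file. The sketch below is a rough local check for the first cause only. It assumes seeds should be absolute http or https URLs, `check_seeds` is a hypothetical helper, and it does not reproduce Nutch's own filters.

```shell
# check_seeds FILE: print every non-blank line that does not start with an
# absolute http(s) URL. Returns 1 if any such line is found, 0 otherwise.
check_seeds() {
    bad=$(grep -v -E '^[[:space:]]*$' "$1" \
        | grep -v -E '^https?://[^[:space:]/]+' || true)
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad"
        return 1
    fi
    return 0
}

# Example call:
#   check_seeds urls/seed.txt || echo "fix the lines listed above"
```

If the seed lines pass this check, the URL filter patterns are the next thing to review before rebuilding the job file.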

9. View the Nutch crawl results

/home/hadoop/hadoop-2.5.1/bin/hdfs dfs -ls
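
A crawl with `-dir crawl` normally leaves crawldb, linkdb, and segments subdirectories under crawl. The following sketch lists the output directory and reports which of them exist. It assumes the hdfs path used above, and `list_crawl` is a hypothetical helper name.

```shell
# Path to the hdfs launcher; override HDFS to point elsewhere.
HDFS="${HDFS:-/home/hadoop/hadoop-2.5.1/bin/hdfs}"

# list_crawl DIR: list the crawl output directory, then report which of
# the usual Nutch subdirectories are present.
list_crawl() {
    "$HDFS" dfs -ls "$1" || return 1
    for sub in crawldb linkdb segments; do
        if "$HDFS" dfs -test -d "$1/$sub"; then
            echo "$sub: present"
        else
            echo "$sub: missing"
        fi
    done
}

# Example call:
#   list_crawl crawl
```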

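7. Deploy the Nutch job file (do this before running the crawler, which loads apache-nutch-1.7.job from the Hadoop directory)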

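The $Nutch_home/nutch-1.7/runtime/deploy directory contains a packaged file,

apache-nutch-1.7.job

which bundles all the programs to be run together with their configuration files.

Copy this file into the $HADOOP_HOME directory.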
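
The copy step can be sketched as a small function that fails early if the job file has not been built yet. The function name `deploy_job` is hypothetical; the two directories are passed as arguments, so nothing is assumed about where $Nutch_home and $HADOOP_HOME point.

```shell
# deploy_job DEPLOY_DIR HADOOP_DIR: copy apache-nutch-1.7.job from the
# Nutch runtime/deploy directory into the Hadoop directory.
deploy_job() {
    job="$1/apache-nutch-1.7.job"
    if [ ! -f "$job" ]; then
        echo "job file not found: $job" >&2
        return 1
    fi
    cp "$job" "$2/"
}

# Example call, using the variables named above:
#   deploy_job "$Nutch_home/nutch-1.7/runtime/deploy" "$HADOOP_HOME"
```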

Reference: http://my.oschina.net/qiangzigege/blog/317066?p=1#OSC_h2_25
