Nutch 2.x: Crawling a First Website, Step by Step

weixin_33859504

http://www.micmiu.com/opensource/nutch/nutch2x-crawl-first-website/?utm_source=tuicool&utm_medium=referral

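The walkthrough below is based on a self-compiled and self-configured build of Nutch 2.2.1.

After the build, the bin directory contains two scripts, nutch and crawl. Running either one on the command line without arguments prints its usage: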

$ nutch
Usage: nutch COMMAND
where COMMAND is one of:
  inject          inject new urls into the database
  hostinject      creates or updates an existing host table from a text file
  generate        generate new batches to fetch from crawl db
  fetch           fetch URLs marked during generate
  parse           parse URLs marked during fetch
  updatedb        update web table after parsing
  updatehostdb    update host table after parsing
  readdb          read/dump records from page database
  readhostdb      display entries from the hostDB
  elasticindex    run the elasticsearch indexer
  solrindex       run the solr indexer on parsed batches
  solrdedup       remove duplicates from solr
  parsechecker    check the parser for a given url
  indexchecker    check the indexing filters for a given url
  plugin          load a plugin and run one of its classes main()
  nutchserver     run a (local) Nutch server on a user defined port
  junit           runs the given JUnit test
  CLASSNAME       run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

$ crawl
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

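In Nutch 2.x the commands involved in the crawl workflow have been consolidated into the crawl script, so a single crawl command runs the whole workflow; there is no need to run inject, generate, fetch, parse and so on one after another, as older versions required. Beginners can still run the individual commands in sequence and watch how the data changes after each step. The rest of this article walks through the process in detail, using my own blog as the site to crawl: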
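
As a rough illustration of the step-by-step route, the following Python sketch assembles the four commands used in the rest of this article. It only builds and prints the command lines and does not run Nutch. The function name round_commands and its default values are made up for this example; the flags are the ones that appear in the commands in this article, and the batch id is a placeholder that in practice comes from the output of generate.

```python
def round_commands(crawl_id, batch_id, top_n=5, threads=10, seed_dir="urls"):
    """Return argv lists for the inject, generate, fetch and parse steps.

    Hypothetical helper: it mirrors the commands demonstrated in this
    article and knows nothing about other Nutch options.
    """
    return [
        ["nutch", "inject", seed_dir, "-crawlId", crawl_id],
        ["nutch", "generate", "-topN", str(top_n), "-crawlId", crawl_id],
        ["nutch", "fetch", batch_id, "-crawlId", crawl_id, "-threads", str(threads)],
        ["nutch", "parse", batch_id, "-crawlId", crawl_id],
    ]


# The batch id is only known after generate has run; this one is from the article.
for argv in round_commands("micmiublog", "1421027229-1374349927"):
    print(" ".join(argv))
```

Keeping each command as a list of arguments makes it straightforward to hand them to subprocess.run later, one step at a time, while inspecting HBase in between.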

[Preparation]: create the URLs to crawl

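First start HBase (this article uses standalone mode), then create the seed file:
mkdir -p urls
cd urls
touch seed.txt
echo 'http://micmiu.com' > seed.txt
# go back up: the inject step refers to the urls directory by relative path
cd ..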
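
The same seed file can be written from Python, which also makes it easy to catch malformed entries (for example a URL wrapped in typographic quotes) before Nutch sees them. This is only a sketch: write_seed_file is a hypothetical helper, and the validation rule (absolute http or https URL) is an assumption of this example, not a Nutch requirement.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlsplit


def write_seed_file(seed_dir, urls):
    """Write one URL per line to <seed_dir>/seed.txt and return its path."""
    # Validate everything first so nothing is created for bad input.
    for url in urls:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    directory = Path(seed_dir)
    directory.mkdir(parents=True, exist_ok=True)
    seed_file = directory / "seed.txt"
    seed_file.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return seed_file


# Demonstrated in a temporary directory to avoid touching the working directory.
with tempfile.TemporaryDirectory() as tmp:
    path = write_seed_file(Path(tmp) / "urls", ["http://micmiu.com"])
    print(path.read_text(encoding="utf-8"), end="")
```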

After each of the steps below you can inspect HBase to see how the data has changed.

[Step 1]: inject

$ nutch inject urls -crawlId micmiublog

InjectorJob: starting at 2015-01-12 09:42:46

InjectorJob: Injecting urlDir: urls

2015-01-12 09:42:47.096 java[14509:4735452] Unable to load realm info from SCDynamicStore

InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.

InjectorJob: total number of urls rejected by filters: 0

InjectorJob: total number of urls injected after normalization and filtering: 1

Inspect the data in HBase:

hbase(main):016:0> scan 'micmiublog_webpage'

ROW COLUMN+CELL

com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00

com.micmiu:http/ column=f:ts, timestamp=1421026970740, value=\x00\x00\x01J\xDB\xCE\xBC\xF2

com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y

com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0

com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00

com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00

1 row(s) in 0.1010 seconds
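
The row key com.micmiu:http/ in the scan is the seed URL with its host name reversed, which keeps pages of the same domain next to each other in the table. The sketch below reproduces that key format as inferred from the output above. It is not Nutch's own code; reversed_row_key is a hypothetical name, URLs with an explicit port are deliberately rejected because the article shows no example of them, and an empty path is treated as "/" to match the normalized URL seen here.

```python
from urllib.parse import urlsplit


def reversed_row_key(url):
    """Build a reversed-host key such as 'com.micmiu:http/' (no port support)."""
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError(f"URL has no host: {url!r}")
    if parts.port is not None:
        raise ValueError("explicit ports are outside the scope of this sketch")
    host = ".".join(reversed(parts.hostname.split(".")))
    path = parts.path or "/"  # assumption: mirror the normalized form
    if parts.query:
        path += "?" + parts.query
    return f"{host}:{parts.scheme}{path}"


print(reversed_row_key("http://micmiu.com/"))
```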

[Step 2]: generate

$ nutch generate -topN 5 -crawlId micmiublog

GeneratorJob: starting at 2015-01-12 09:47:09

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: true

GeneratorJob: normalizing: true

GeneratorJob: topN: 5

2015-01-12 09:47:09.822 java[14533:4744993] Unable to load realm info from SCDynamicStore

GeneratorJob: finished at 2015-01-12 09:47:13, time elapsed: 00:00:03

GeneratorJob: generated batch id: 1421027229-1374349927
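
The batch id looks like two numbers joined by a hyphen. The first part appears to be a Unix timestamp in seconds: 1421027229 corresponds to 01:47:09 UTC, which matches the GeneratorJob start time of 09:47:09 if the log is in UTC+8. The second part is treated here as an opaque suffix. The sketch below splits the id on that assumption; split_batch_id is a hypothetical helper.

```python
from datetime import datetime, timezone


def split_batch_id(batch_id):
    """Split '<epoch seconds>-<suffix>' into a UTC datetime and the suffix."""
    seconds_text, suffix = batch_id.split("-", 1)
    created = datetime.fromtimestamp(int(seconds_text), tz=timezone.utc)
    return created, suffix


created, suffix = split_batch_id("1421027229-1374349927")
print(created.isoformat(), suffix)
```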

Inspect the data in HBase:

hbase(main):018:0> scan 'micmiublog_webpage'

ROW COLUMN+CELL

com.micmiu:http/ column=f:bid, timestamp=1421027232815, value=1421027229-1374349927

com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00

com.micmiu:http/ column=f:ts, timestamp=1421026970740, value=\x00\x00\x01J\xDB\xCE\xBC\xF2

com.micmiu:http/ column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927

com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y

com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0

com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00

com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00

1 row(s) in 0.0580 seconds
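
Several values in these scans are binary and show up as escape sequences. Assuming the default Gora HBase mapping (f:fi is the fetch interval, f:ts the fetch time, s:s the score) and big-endian encoding, they can be decoded with a few lines of Python. The helper names below are invented for this example.

```python
import struct


def shell_value_to_bytes(text):
    """Turn a value as printed by the HBase shell back into raw bytes.

    Printable characters stand for themselves; other bytes are printed as
    a backslash, the letter x and two hex digits.
    """
    raw = bytearray()
    i = 0
    while i < len(text):
        if text.startswith("\\x", i) and i + 4 <= len(text):
            raw.append(int(text[i + 2:i + 4], 16))  # escaped byte
            i += 4
        else:
            raw.append(ord(text[i]))  # printable ASCII byte
            i += 1
    return bytes(raw)


def as_int(text):
    return struct.unpack(">i", shell_value_to_bytes(text))[0]


def as_long(text):
    return struct.unpack(">q", shell_value_to_bytes(text))[0]


def as_float(text):
    return struct.unpack(">f", shell_value_to_bytes(text))[0]


print(as_int(r"\x00'\x8D\x00"))                   # f:fi
print(as_long(r"\x00\x00\x01J\xDB\xCE\xBC\xF2"))  # f:ts
print(as_float(r"?\x80\x00\x00"))                 # s:s
```

With the values from the scan this gives 2592000 (30 days expressed in seconds), 1421026966770 (milliseconds since the epoch, i.e. 2015-01-12 01:42:46 UTC, which is 09:42:46 in UTC+8 and lines up with the InjectorJob start time) and 1.0.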

[Step 3]: fetch

Note: the batch id printed in the GeneratorJob log of the previous step is passed as the batchId argument of the command below.

It can also be looked up in HBase:

hbase(main):025:0> get 'micmiublog_webpage','com.micmiu:http/',{COLUMNS => 'f:bid'}

COLUMN CELL

f:bid timestamp=1421027232815, value=1421027229-1374349927

1 row(s) in 0.0060 seconds
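
When the steps are scripted, the batch id can also be picked out of the captured generate output instead of being copied by hand. The sketch below assumes the log line has exactly the form shown in step 2; find_batch_id is a hypothetical helper.

```python
import re

# Matches the line printed at the end of the generate step in this article.
BATCH_ID_LINE = re.compile(
    r"^GeneratorJob: generated batch id: (\S+)\s*$", re.MULTILINE
)


def find_batch_id(log_text):
    """Return the batch id found in the generate log, or None if absent."""
    match = BATCH_ID_LINE.search(log_text)
    return match.group(1) if match else None


sample_log = """GeneratorJob: starting
GeneratorJob: finished at 2015-01-12 09:47:13, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1421027229-1374349927
"""
print(find_batch_id(sample_log))
```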

Now run the fetch command:

$ nutch fetch 1421027229-1374349927 -crawlId micmiublog -threads 10

FetcherJob: starting

FetcherJob: batchId: 1421027229-1374349927

FetcherJob: threads: 10

FetcherJob: parsing: false

FetcherJob: resuming: false

FetcherJob : timelimit set for : -1

2015-01-12 09:49:37.095 java[14546:4753667] Unable to load realm info from SCDynamicStore

Using queue mode : byHost

Fetcher: threads: 10

QueueFeeder finished: total 1 records. Hit by time limit :0

fetching http://micmiu.com/ (queue crawl delay=5000ms)

-finishing thread FetcherThread1, activeThreads=1

-finishing thread FetcherThread2, activeThreads=1

-finishing thread FetcherThread3, activeThreads=1

-finishing thread FetcherThread4, activeThreads=1

-finishing thread FetcherThread5, activeThreads=1

-finishing thread FetcherThread6, activeThreads=1

-finishing thread FetcherThread7, activeThreads=1

-finishing thread FetcherThread8, activeThreads=1

Fetcher: throughput threshold: -1

Fetcher: throughput threshold sequence: 5

-finishing thread FetcherThread9, activeThreads=1

-finishing thread FetcherThread0, activeThreads=0

0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues

-activeThreads=0

FetcherJob: done

Inspect the data in HBase:

hbase(main):019:0> scan 'micmiublog_webpage'

ROW COLUMN+CELL

com.micmiu:http/ column=f:bas, timestamp=1421027385487, value=http://micmiu.com/

com.micmiu:http/ column=f:bid, timestamp=1421027232815, value=1421027229-1374349927

com.micmiu:http/ column=f:cnt, timestamp=1421027385487, value=

com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00

com.micmiu:http/ column=f:prot, timestamp=1421027385487, value=\x18\x02,http://www.micmiu.com/\x00\x00

com.micmiu:http/ column=f:pts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xCE\xBC\xF2

com.micmiu:http/ column=f:rpr, timestamp=1421027385487, value=http://micmiu.com/

com.micmiu:http/ column=f:st, timestamp=1421027385487, value=\x00\x00\x00\x05

com.micmiu:http/ column=f:ts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xD5\x17%

com.micmiu:http/ column=f:typ, timestamp=1421027385487, value=text/html

com.micmiu:http/ column=h:Cache-Control, timestamp=1421027385487, value=no-store, no-cache, must-revalidate, post-check=0, pre-check=0

com.micmiu:http/ column=h:Connection, timestamp=1421027385487, value=close

com.micmiu:http/ column=h:Content-Encoding, timestamp=1421027385487, value=gzip

com.micmiu:http/ column=h:Content-Length, timestamp=1421027385487, value=20

com.micmiu:http/ column=h:Content-Type, timestamp=1421027385487, value=text/html; charset=UTF-8

com.micmiu:http/ column=h:Date, timestamp=1421027385487, value=Mon, 12 Jan 2015 01:49:41 GMT

com.micmiu:http/ column=h:Expires, timestamp=1421027385487, value=Thu, 19 Nov 1981 08:52:00 GMT

com.micmiu:http/ column=h:Location, timestamp=1421027385487, value=http://www.micmiu.com/

com.micmiu:http/ column=h:Pragma, timestamp=1421027385487, value=no-cache

com.micmiu:http/ column=h:Server, timestamp=1421027385487, value=LiteSpeed

com.micmiu:http/ column=h:Set-Cookie, timestamp=1421027385487, value=PHPSESSID=5657f9f9da456a7bf6e243f78b7e0182; path=/

com.micmiu:http/ column=h:Vary, timestamp=1421027385487, value=Cookie

com.micmiu:http/ column=h:X-Pingback, timestamp=1421027385487, value=http://www.micmiu.com/xmlrpc.php

com.micmiu:http/ column=h:X-Powered-By, timestamp=1421027385487, value=PHP/5.3.29

com.micmiu:http/ column=mk:_ftcmrk_, timestamp=1421027385487, value=1421027229-1374349927

com.micmiu:http/ column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927

com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y

com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0

com.micmiu:http/ column=mtdt:___rdrdsc__, timestamp=1421027385487, value=y

com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00

com.micmiu:http/ column=ol:http://www.micmiu.com/, timestamp=1421027385487, value=

com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00

1 row(s) in 0.0980 seconds
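
The scan is getting long, so it helps to see which cells each step wrote. Every cell carries the HBase timestamp of the write, and in this walkthrough the three steps so far produced three distinct timestamps. The sketch below groups column names by timestamp; it parses the scan lines with a regular expression that is an assumption of this example and would break on qualifiers containing a comma. Only five of the lines above are used as sample input.

```python
import re
from collections import defaultdict

# One cell per line: "<row> column=<family:qualifier>, timestamp=<ms>, value=<...>"
CELL = re.compile(
    r"^\S+\s+column=(?P<column>[^,]+), timestamp=(?P<ts>\d+), value=.*$"
)


def columns_by_timestamp(scan_text):
    """Map each cell timestamp to the sorted list of columns written at it."""
    groups = defaultdict(list)
    for line in scan_text.splitlines():
        match = CELL.match(line.strip())
        if match:
            groups[int(match["ts"])].append(match["column"])
    return {ts: sorted(cols) for ts, cols in sorted(groups.items())}


sample_scan = r"""
com.micmiu:http/ column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
com.micmiu:http/ column=f:st, timestamp=1421027385487, value=\x00\x00\x00\x05
com.micmiu:http/ column=h:Location, timestamp=1421027385487, value=http://www.micmiu.com/
com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
"""
for ts, cols in columns_by_timestamp(sample_scan).items():
    print(ts, cols)
```

In the sample, the cells at 1421026970740 come from inject, the one at 1421027232815 from generate, and those at 1421027385487 from fetch.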

[Step 4]: parse

$ nutch parse 1421027229-1374349927 -crawlId micmiublog

ParserJob: starting

ParserJob: resuming: false

ParserJob: forced reparse: false

ParserJob: batchId: 1421027229-1374349927

2015-01-12 09:50:03.525 java[14559:4756783] Unable to load realm info from SCDynamicStore

Parsing http://micmiu.com/

http://micmiu.com/ skipped. Content of size 20 was truncated to 0

ParserJob: success
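
The "skipped" line deserves a note. In the fetch step the server answered with a redirect to http://www.micmiu.com/ (see the h:Location cell), the stored content f:cnt is empty, and the Content-Length header says 20. The message therefore looks like the result of comparing the declared length with the number of bytes actually stored. The sketch below shows such a check in Python; it is an illustration of the idea under that assumption, not Nutch's implementation, and is_truncated is a name chosen for this example.

```python
def is_truncated(headers, content):
    """Return True when Content-Length claims more bytes than were stored."""
    declared = headers.get("Content-Length")
    if declared is None:
        return False  # nothing to compare against
    try:
        declared_size = int(declared.strip())
    except ValueError:
        return False  # unparseable header: do not treat as truncated
    return declared_size > len(content)


# Header values taken from the fetch scan; the stored content was empty.
headers = {"Content-Length": "20", "Content-Encoding": "gzip"}
print(is_truncated(headers, b""))
```

If this reading is right, the seed http://micmiu.com/ itself yields nothing to parse, and the real page content lives at the redirect target.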

Inspect the data in HBase:

hbase(main):020:0> scan 'micmiublog_webpage'