Nutch Study Notes, Part 3

Steps for crawling web pages with Nutch

1. Create the seed URL list

http://www.qq.com/
http://www.sina.com.cn/
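
Nutch reads every file inside the directory passed to the inject command, one URL per line, so the list above just needs to be saved as a plain-text file in a directory such as urls/. A minimal sketch (the file name seed.txt is arbitrary):

mkdir -p urls
cat > urls/seed.txt <<EOF
http://www.qq.com/
http://www.sina.com.cn/
EOF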

2. Inject the seed URLs into the Nutch crawldb

hadoop@slave5:~/nutch$ nutch inject crawl/crawldb urls/
Injector: starting at 2013-07-14 17:19:07
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 2
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-07-14 17:19:10, elapsed: 00:00:02
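
The log reports how many URLs survived normalization and the URL filters (by default the filter rules live in conf/regex-urlfilter.txt). To verify what actually ended up in the crawldb, the readdb tool can print basic statistics, e.g.:

bin/nutch readdb crawl/crawldb -stats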

3. Generate a fetch list so the content can be fetched and parsed

hadoop@slave5:~/nutch$ nutch generate crawl/crawldb crawl/segments
Generator: starting at 2013-07-14 17:19:14
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130714171917
Generator: finished at 2013-07-14 17:19:18, elapsed: 00:00:03

The command above creates a new segment directory under crawl/segments, which holds the URLs that are due to be fetched.
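
Each generate run creates one timestamped directory under crawl/segments, which a plain listing confirms:

ls crawl/segments/
# expected here: 20130714171917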

4. The following steps take the newest segment directory as a parameter, so store it in the environment variable SEGMENT

export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`

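Here ls -tr sorts the segment directories by modification time, oldest first, so tail -1 picks the most recently generated one. A quick sanity check:

echo $SEGMENT
# should print crawl/segments/20130714171917 for this run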

5. Run the fetcher to actually start fetching content

nutch fetch $SEGMENT -noParsing

6. Parse the fetched content

bin/nutch parse $SEGMENT

hadoop@slave5:~/nutch$ bin/nutch fetch $SEGMENT -noParsing
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-07-14 17:20:18
Fetcher: segment: crawl/segments/20130714171917
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 2 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.qq.com/ (queue crawl delay=5000ms)
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.sina.com.cn/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-07-14 17:20:21, elapsed: 00:00:02
hadoop@slave5:~/nutch$ nutch parse $SEGMENT
ParseSegment: starting at 2013-07-14 17:21:35
ParseSegment: segment: crawl/segments/20130714171917
Parsed (10ms):http://www.qq.com/
http://www.sina.com.cn/ skipped. Content of size 135311 was truncated to 65536
ParseSegment: finished at 2013-07-14 17:21:37, elapsed: 00:00:01
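
Two things in the log above are worth noting. The warning about http.agent.name means the crawler's agent name should also be listed first in the http.robots.agents property, and http://www.sina.com.cn/ was skipped by the parser because its content exceeded the default 64 KB download limit (the http.content.limit property, 65536 bytes). Both properties can be overridden in conf/nutch-site.xml; a sketch with illustrative values only:

<property>
  <name>http.agent.name</name>
  <value>MyNutchSpider</value> <!-- hypothetical agent name, pick your own -->
</property>
<property>
  <name>http.content.limit</name>
  <value>262144</value> <!-- raises the 64 KB default; a negative value disables truncation -->
</property>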

7. Update the Nutch crawldb

The updatedb command takes the new URLs obtained by fetching and parsing the latest segment in the two steps above and merges them into the Nutch crawldb, so they can be crawled in later rounds. Besides the URLs themselves, Nutch also records each page's fetch state, which prevents the same URLs from being fetched over and over again.

bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize

At this point a complete crawl cycle is finished; you can repeat the cycle several more times to crawl more content (a minimal loop sketch follows the updatedb log below).

hadoop@slave5:~/nutch$ bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
CrawlDb update: starting at 2013-07-14 17:33:55
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130714171917]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-07-14 17:33:56, elapsed: 00:00:01
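
As noted above, generate, fetch, parse and updatedb form one crawl cycle. A minimal sketch for running a few more cycles from the shell, assuming the same crawl/ layout as above (the number of rounds is arbitrary):

for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments
  SEGMENT=crawl/segments/`ls -tr crawl/segments | tail -1`
  bin/nutch fetch $SEGMENT -noParsing
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
done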

8. Build the link database (invert the links)

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

hadoop@slave5:~/nutch$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting at 2013-07-14 17:37:43
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/home/hadoop/apache-nutch-1.7/crawl/segments/20130714171917
LinkDb: finished at 2013-07-14 17:37:44, elapsed: 00:00:01
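
To check what the link inversion produced, the readlinkdb tool can dump the linkdb (the inlinks recorded for each URL) to plain text; the output directory name below is arbitrary:

bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump
cat linkdb_dump/part-* | head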

Step where the error occurred

9. Index the content of all segments into Solr

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Now all the content fetched by Nutch has been indexed by Solr, and you can run queries through the Solr admin interface:

http://127.0.0.1:8983/solr/admin
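
If the indexing step succeeds, the documents can also be queried directly over Solr's HTTP API; a sketch assuming the default core and the url/title/content fields from the Nutch schema.xml:

curl 'http://127.0.0.1:8983/solr/select?q=content:news&fl=url,title&wt=json'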






