Nutch | Running on a fully distributed Hadoop cluster: study notes


hdfs://  url.txt -->

hdfs://  url2.txt --> http ://

One-step crawl: bin/nutch crawl urldir -dir crawldata -depth 3 -topN 5
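In Nutch 1.x the one-step crawl command runs, roughly, the same steps that these notes go through by hand below. As a minimal sketch, one round for a segment that generate has already created could be scripted as follows. The run helper, the DRY_RUN switch, the NUTCH variable and the crawl_round function are illustrative names, not part of Nutch; by default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands by default (DRY_RUN=1).
DRY_RUN="${DRY_RUN:-1}"
NUTCH="${NUTCH:-bin/nutch}"   # path to the nutch launcher (assumption)

# Print a command, and execute it only when DRY_RUN is not 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# One fetch/parse/updatedb round for a segment that `generate` already created.
# $1 = crawldb path, $2 = segment path
crawl_round() {
  local crawldb="$1" segment="$2"
  run "$NUTCH" fetch "$segment"
  run "$NUTCH" parse "$segment"
  run "$NUTCH" updatedb "$crawldb" "$segment"
}

crawl_round crawldatatest/crawldb crawldatatest/segments/20130913122422
```

Setting DRY_RUN=0 would execute the commands instead of printing them; the segment name is the one used throughout these notes.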




     inject: bin/nutch inject crawldatatest/crawldb urldir

     Dump the crawldb for inspection: bin/nutch readdb crawldatatest/crawldb -dump tmpdata/test/crawldb/crawldb_dump


     View: bin/hadoop fs -cat tmpdata/test/crawldb/crawldb_dump/part-00000


      http :// Version: 7
     Status: 1 (db_unfetched)
     Fetch time: Fri Sep 13 10:57:28 CST 2013
     Modified time: Thu Jan 01 08:00:00 CST 1970
     Retries since fetch: 0
     Retry interval: 2592000 seconds (30 days)
     Score: 1.0
     Signature: null

      http ://www. 16 Version: 7
     Status: 1 (db_unfetched)
     Fetch time: Fri Sep 13 10:57:28 CST 2013
     Modified time: Thu Jan 01 08:00:00 CST 1970
     Retries since fetch: 0
     Retry interval: 2592000 seconds (30 days)
     Score: 1.0
     Signature: null
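
For larger dumps it is easier to count records per status than to read them one by one. A small sketch, assuming the text layout shown above (one "Status:" line per record); count_status is a hypothetical helper name and the sample URLs are placeholders:

```shell
# Count CrawlDb records per status in `readdb -dump` text read from stdin.
count_status() {
  awk '/^ *Status:/ { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample input shaped like the dump above (placeholder URLs).
count_status <<'EOF'
http://example.com/ Version: 7
Status: 1 (db_unfetched)
Score: 1.0

http://example.org/ Version: 7
Status: 1 (db_unfetched)
Score: 1.0
EOF
```

Against the real dump this would be fed from bin/hadoop fs -cat tmpdata/test/crawldb/crawldb_dump/part-00000.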


     generate: bin/nutch generate crawldatatest/crawldb crawldatatest/segments
     Dump: bin/nutch readseg -dump crawldatatest/segments/20130913122422 tmpdata/test/segments/20130913122422_dump -nocontent -nofetch -noparse -noparsedata -noparsetext
     View: bin/hadoop fs -cat tmpdata/*/*/*/dump
       Recno:: 0
URL::  http ://

Version: 7
Status: 1 (db_unfetched)
Fetch time: Fri Sep 13 10:57:28 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1379046213002

Recno:: 1
URL::  http ://www. 16

Version: 7
Status: 1 (db_unfetched)
Fetch time: Fri Sep 13 10:57:28 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1379046213002
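
The readseg -dump calls in these notes all follow one pattern: switch off every segment part except the one to inspect. A sketch of a helper that builds that flag list from the six options readseg accepts (-nocontent, -nofetch, -nogenerate, -noparse, -noparsedata, -noparsetext); the function name readseg_only is hypothetical:

```shell
# Print the -no* flags that leave only one segment part enabled.
# $1 = part to keep: content | fetch | generate | parse | parsedata | parsetext
readseg_only() {
  local keep="$1" part flags=""
  for part in content fetch generate parse parsedata parsetext; do
    if [ "$part" != "$keep" ]; then
      flags="$flags -no$part"
    fi
  done
  echo "${flags# }"   # drop the leading space
}

# Show the command line for dumping only the generate data.
echo "bin/nutch readseg -dump SEGMENT OUTDIR $(readseg_only generate)"
```

Building the flags this way avoids typos such as a missing "no" prefix or a dash that an editor has converted to an en dash, both of which readseg would not recognize.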


     fetch: bin/nutch fetch crawldatatest/segments/20130913122422
     Dump: bin/nutch readseg -dump crawldatatest/segments/20130913122422 tmpdata/test/segments/20130913122422_dump_fetch -nocontent -nogenerate -noparse -noparsedata -noparsetext
     View: bin/hadoop fs -cat tmpdata/*/*/*fetch*/dump
Recno:: 0
URL::  http ://
Version: 7
Status: 33 (fetch_success)
Fetch time: Fri Sep 13 12:35:02 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1379046213002Content-Type: text/html_pst_: success(1), lastModified=0

Recno:: 1
URL::  http ://www. 16

Version: 7
Status: 33 (fetch_success)
Fetch time: Fri Sep 13 12:35:05 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1379046213002Content-Type: text/html_pst_: success(1), lastModified=0


     parse: bin/nutch parse crawldatatest/segments/20130913122422
     Dump: bin/nutch readseg -dump crawldatatest/segments/20130913122422 tmpdata/test/segments/20130913122422_dump_parse -nocontent -nogenerate -nofetch -noparsedata -noparsetext
     View: bin/hadoop fs -cat tmpdata/*/*/*parse*/dump | more
Recno:: 0
URL::  http ://3g. 16

Version: 7
Status: 67 (linked)
Fetch time: Fri Sep 13 12:41:48 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.125
Signature: null
Recno:: 64
URL::  http ://

Version: 7
Status: 67 (linked)
Fetch time: Fri Sep 13 12:41:48 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.01754386
Signature: null



     updatedb: bin/nutch updatedb crawldatatest/crawldb -dir crawldatatest/segments
     Check the result of the update: bin/nutch readdb crawldatatest/crawldb -stats
   13/09/13 12:49:34 INFO crawl.CrawlDbReader: TOTAL urls: 65
13/09/13 12:49:34 INFO crawl.CrawlDbReader: retry 0: 65
13/09/13 12:49:34 INFO crawl.CrawlDbReader: min score: 0.017
13/09/13 12:49:34 INFO crawl.CrawlDbReader: avg score: 0.061092306
13/09/13 12:49:34 INFO crawl.CrawlDbReader: max score: 1.0
13/09/13 12:49:34 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 63
13/09/13 12:49:34 INFO crawl.CrawlDbReader: status 2 (db_fetched): 2
13/09/13 12:49:34 INFO crawl.CrawlDbReader: CrawlDb statistics: done
TOTAL urls has grown from the initial 2 to 65: the 2 fetched seed URLs plus 63 newly discovered URLs that are still unfetched.
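
A quick sanity check on the -stats output is to confirm that the per-status counts add up to TOTAL urls. A sketch, assuming the log layout shown above; check_stats is a hypothetical helper name:

```shell
# Read `readdb -stats` log lines from stdin, print the total and the sum of
# the per-status counts, and return non-zero if they differ.
check_stats() {
  awk '
    /TOTAL urls:/      { total = $NF }
    / status [0-9]+ /  { sum += $NF }
    END { printf "total=%d sum=%d\n", total, sum; exit (total == sum ? 0 : 1) }
  '
}

# The statistics lines recorded in these notes.
check_stats <<'EOF'
13/09/13 12:49:34 INFO crawl.CrawlDbReader: TOTAL urls: 65
13/09/13 12:49:34 INFO crawl.CrawlDbReader: retry 0: 65
13/09/13 12:49:34 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 63
13/09/13 12:49:34 INFO crawl.CrawlDbReader: status 2 (db_fetched): 2
13/09/13 12:49:34 INFO crawl.CrawlDbReader: CrawlDb statistics: done
EOF
```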


Dump content: bin/nutch readseg -dump crawldatatest/segments/20130913122422 tmpdata/test/segments/20130913122422_dump_content -noparse -nogenerate -nofetch -noparsedata -noparsetext
View content: bin/hadoop fs -cat tmpdata/*/*/*content*/dump | more

Recno:: 0
URL::  http ://
Version: -1
url:  http ://
base:  http ://
contentType: text/html
metadata: Date=Fri, 13 Sep 2013 04:34:58 GMT Vary=Accept-Encoding Expires=Thu, 01 Nov 2012 10:00:00 GMT Content-Encoding=gzip nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20130913122422 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx Cache-Control=no-cache Pragma=no-cache


Dump parse data: bin/nutch readseg -dump crawldatatest/segments/20130913122422 tmpdata/test/segments/20130913122422_dump_parse_data -noparse -nogenerate -nofetch -nocontent -noparsetext
View: bin/hadoop fs -cat tmpdata/*/*/*parse_data*/dump | more
Recno:: 0
URL::  http ://

Outlinks: 57
  outlink: toUrl:  http :// anchor: 社会民生
  outlink: toUrl:  http :// anchor: 国际观察
  outlink: toUrl:  http :// anchor: 娱乐
  outlink: toUrl:  http :// anchor: 体育
  outlink: toUrl:  http :// anchor: 文化
  outlink: toUrl:  http :// anchor: 历史
  outlink: toUrl:  http :// anchor: 生活
  outlink: toUrl:  http :// anchor: 情感
  outlink: toUrl:  http :// anchor: 财经
  outlink: toUrl:  http :// anchor: 股市
  outlink: toUrl:  http :// anchor: 美食
  outlink: toUrl:  http :// anchor: 旅游


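Dump parse text: bin/nutch readseg -dump crawldatatest/segments/20130913122422 tmpdata/test/segments/20130913122422_dump_parse_text -noparse -nogenerate -nofetch -nocontent -noparsedata
(The sample below still shows Content and ParseData records, so it was apparently captured with those parts not yet switched off.)

View: bin/hadoop fs -cat tmpdata/*/*/*parse_text*/dump | more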
Recno:: 0
URL::  http ://

Version: -1
url:  http ://
base:  http ://
contentType: text/html
metadata: Date=Fri, 13 Sep 2013 04:34:58 GMT Vary=Accept-Encoding Expires=Thu, 01 Nov 2012 10:00:00 GMT Content-Encoding=gzip nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20130913122422 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx Cache-Control=no-cache Pragma=no-cache

Version: 5
Status: success(1,0)
Title: 天涯博客_有见识的人都在此
Outlinks: 57
  outlink: toUrl:  http :// anchor: 社会民生
  outlink: toUrl:  http :// anchor: 国际观察
  outlink: toUrl:  http :// anchor: 娱乐
  outlink: toUrl:  http :// anchor: 体育
  outlink: toUrl:  http :// anchor: 文化
  outlink: toUrl:  http :// anchor: 历史
  outlink: toUrl:  http :// anchor: 生活
Recno:: 1
URL::  http ://www. 16

Version: 5
Status: success(1,0)
Title: 网易
Outlinks: 8
  outlink: toUrl:  http ://m. 16 anchor: 应用
  outlink: toUrl:  http ://m. 16 anchor: 网易新闻
  outlink: toUrl:  http ://music. 16 anchor: 网易云音乐
  outlink: toUrl:  http ://yuedu. 16 anchor: 网易云阅读
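
To compare how many outlinks each parsed page produced without paging through the whole dump, the "URL::" and "Outlinks:" lines can be paired up. A sketch, assuming the dump layout shown above; outlink_counts is a hypothetical helper name and the sample URLs are placeholders:

```shell
# Print "<url> <outlink count>" for each record in readseg dump text on stdin.
outlink_counts() {
  awk '
    /^URL::/     { url = $2 }        # remember the URL of the current record
    /^Outlinks:/ { print url, $2 }   # emit it with its outlink count
  '
}

# Sample input shaped like the dump above (placeholder URLs).
outlink_counts <<'EOF'
Recno:: 0
URL:: http://example.com/

Version: 5
Status: success(1,0)
Outlinks: 57
  outlink: toUrl: http://example.com/a anchor: a
Recno:: 1
URL:: http://example.org/

Version: 5
Status: success(1,0)
Outlinks: 8
EOF
```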

bin/nutch | grep merge    (lists the merge commands)

         bin/nutch mergesegs crawldata/segments_merge -dir crawldata/segments


bin/nutch invertlinks crawldatatest/linkdb -dir crawldatatest/segments
 Dump: bin/nutch readlinkdb crawldatatest/linkdb -dump tmpdata/test/linkdb/linkdb_dump
 View: bin/hadoop fs -cat tmpdata/*/*/*linkdb*/part-*
http ://3g. 16 Inlinks:
 fromUrl:  http ://www. 16 anchor: 网易花田

http :// Inlinks:
 fromUrl:  http :// anchor: 长沙艾敏

http :// Inlinks:
 fromUrl:  http :// anchor: 安歌儿

http :// Inlinks:
 fromUrl:  http :// anchor: ayuan566

http ://caipiao. 16  Inlinks:
 fromUrl:  http ://www. 16 anchor: 网易彩票

http :// Inlinks:
 fromUrl:  http :// anchor: 缠绕夜色

http :// Inlinks:
 fromUrl:  http :// anchor: 陈彤

http :// Inlinks:
 fromUrl:  http :// anchor: 云无心

http :// Inlinks:
 fromUrl:  http :// anchor: 六盘水评论

http :// Inlinks:
 fromUrl:  http :// anchor: 党国英

http :// Inlinks:
 fromUrl:  http :// anchor: 老海博客

http :// Inlinks:
 fromUrl:  http :// anchor: ESPN詹俊

http :// Inlinks:
 fromUrl:  http :// anchor: 丰雪飘

http :// Inlinks:
 fromUrl:  http :// anchor: 还是定风波

http :// Inlinks:
 fromUrl:  http :// anchor: 湖畔小子

http :// Inlinks:
 fromUrl:  http :// anchor: 古月轩主1

http :// Inlinks:
 fromUrl:  http :// anchor: 蒋丰

http :// Inlinks:
 fromUrl:  http :// anchor: 飞一扬

http :// Inlinks:
 fromUrl:  http :// anchor: 金满楼

http :// Inlinks:
 fromUrl:  http :// anchor: 说真话好难

http :// Inlinks:
 fromUrl:  http :// anchor: 蓝田玉烟

http :// Inlinks:
 fromUrl:  http :// anchor: 李承鹏

http :// Inlinks:
 fromUrl:  http :// anchor: 章半仙

http ://m. 16 Inlinks:
 fromUrl:  http ://www. 16 anchor: 应用

http ://m. 16 Inlinks:
 fromUrl:  http ://www. 16 anchor: 网易新闻

http :// Inlinks:
 fromUrl:  http :// anchor: 墨黑纸白

http ://music. 16 Inlinks:
 fromUrl:  http ://www. 16 anchor: 网易云音乐

http :// Inlinks:
 fromUrl:  http :// anchor: 涅阳小生

http :// Inlinks:
 fromUrl:  http ://www. 16 anchor: 有道云笔记

http ://open. 16 Inlinks:
 fromUrl:  http ://www. 16 anchor: 网易公开课

http :// Inlinks:
 fromUrl:  http :// anchor: 红豆火警

http :// Inlinks:
 fromUrl:  http :// anchor: 皮海洲

http :// Inlinks:
 fromUrl:  http :// anchor: 少林修女

http :// Inlinks:
 fromUrl:  http :// anchor: 寒枫化雨

http :// Inlinks:
 fromUrl:  http :// anchor: 烟灰醉余晖

http :// Inlinks:
 fromUrl:  http :// anchor: 陶短房

http :// Inlinks:
 fromUrl:  http :// anchor: 童大焕

http :// Inlinks:
 fromUrl:  http :// anchor: 冉云飞

http :// Inlinks:
 fromUrl:  http :// anchor: 温柔恶女

http :// Inlinks:
 fromUrl:  http :// anchor: 

http :// Inlinks:
 fromUrl:  http :// anchor: 谢不谦

http :// Inlinks:
 fromUrl:  http :// anchor: 信力建

http :// Inlinks:
 fromUrl:  http :// anchor: 评论杨涛

http ://yuedu. 16 Inlinks:
 fromUrl:  http ://www. 16 anchor: 网易云阅读

http :// Inlinks:
 fromUrl:  http :// anchor: 云歇鸢

http :// Inlinks:
 fromUrl:  http :// anchor: 雨润de云温

http :// Inlinks:
 fromUrl:  http :// anchor: 张鸣

http :// Inlinks:
 fromUrl:  http :// anchor: 周其仁

http :// Inlinks:
 fromUrl:  http :// anchor: 周禄宝
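
With a long linkdb dump like this one, ranking targets by their number of inlinks is often more useful than reading the raw list. A sketch, assuming the layout shown above (a target line ending in "Inlinks:" followed by one "fromUrl:" line per inlink); inlink_counts is a hypothetical helper name and the sample URLs are placeholders:

```shell
# Print "<count> <target url>" sorted by descending inlink count,
# reading `readlinkdb -dump` text from stdin.
inlink_counts() {
  awk '
    /Inlinks:[ \t]*$/ { target = $1 }   # start of a new target record
    /^ *fromUrl:/     { n[target]++ }   # one inlink for the current target
    END { for (t in n) print n[t], t }
  ' | sort -rn
}

# Sample input shaped like the dump above (placeholder URLs).
inlink_counts <<'EOF'
http://a.example/ Inlinks:
 fromUrl: http://src.example/ anchor: one
 fromUrl: http://other.example/ anchor: two

http://b.example/ Inlinks:
 fromUrl: http://src.example/ anchor: three
EOF
```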

parsechecker command: bin/nutch parsechecker

13/09/13 13:12:37 INFO parse.ParserChecker: fetching:  http ://
13/09/13 13:12:37 INFO plugin.PluginRepository: Plugins: looking in: /home/hadoop/hadoop-hadoop/hadoop-unjar3881248761965128726/classes/plugins
13/09/13 13:12:38 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
13/09/13 13:12:38 INFO plugin.PluginRepository: Registered Plugins:
13/09/13 13:12:38 INFO plugin.PluginRepository:  the nutch core extension points (nutch-extensionpoints)
13/09/13 13:12:38 INFO plugin.PluginRepository:  Basic URL Normalizer (urlnormalizer-basic)
13/09/13 13:12:38 INFO plugin.PluginRepository:  Html Parse Plug-in (parse-html)
13/09/13 13:12:38 INFO plugin.PluginRepository:  Basic Indexing Filter (index-basic)
13/09/13 13:12:38 INFO plugin.PluginRepository:  HTTP Framework (lib-http)
13/09/13 13:12:38 INFO plugin.PluginRepository:  Pass-through URL Normalizer (urlnormalizer-pass)
13/09/13 13:12:38 INFO plugin.PluginRepository:  Regex URL Filter (urlfilter-regex)
13/09/13 13:12:38 INFO plugin.PluginRepository:  Http Protocol Plug-in (protocol-http)
13/09/13 13:12:38 INFO plugin.PluginRepository:  Regex URL Normalizer (urlnormalizer-regex)
13/09/13 13:12:38 INFO plugin.PluginRepository:  Tika Parser Plug-in (parse-tika)
13/09/13 13:12:38 INFO plugin.PluginRepository:  OPIC Scoring Plug-in (scoring-opic)
13/09/13 13:12:38 INFO plugin.PluginRepository:  CyberNeko HTML Parser (lib-nekohtml)
13/09/13 13:12:38 INFO plugin.PluginRepository:  Anchor Indexing Filter (index-anchor)
13/09/13 13:12:38 INFO plugin.PluginRepository:  Regex URL Filter Framework (lib-regex-filter)
13/09/13 13:12:38 INFO plugin.PluginRepository: Registered Extension-Points:
13/09/13 13:12:38 INFO plugin.PluginRepository:  Nutch URL Normalizer (
13/09/13 13:12:38 INFO plugin.PluginRepository:  Nutch Protocol (org.apache.nutch.protocol.Protocol)
13/09/13 13:12:38 INFO plugin.PluginRepository:  Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
13/09/13 13:12:38 INFO plugin.PluginRepository:  Nutch URL Filter (
13/09/13 13:12:38 INFO plugin.PluginRepository:  Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
13/09/13 13:12:38 INFO plugin.PluginRepository:  HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
13/09/13 13:12:38 INFO plugin.PluginRepository:  Nutch Content Parser (org.apache.nutch.parse.Parser)
13/09/13 13:12:38 INFO plugin.PluginRepository:  Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
13/09/13 13:12:38 INFO http.Http: http.proxy.host = null
13/09/13 13:12:38 INFO http.Http: http.proxy.port = 8080
13/09/13 13:12:38 INFO http.Http: http.timeout = 10000
13/09/13 13:12:38 INFO http.Http: http.content.limit = 65536
13/09/13 13:12:38 INFO http.Http: http.agent = Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36/Nutch-1.6
13/09/13 13:12:38 INFO http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
13/09/13 13:12:38 INFO http.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
13/09/13 13:12:38 INFO conf.Configuration: found resource parse-plugins.xml at file:/home/hadoop/hadoop-hadoop/hadoop-unjar3881248761965128726/parse-plugins.xml
13/09/13 13:12:39 INFO crawl.SignatureFactory: Using Signature impl: org.apache.nutch.crawl.MD5Signature
13/09/13 13:12:39 INFO parse.ParserChecker: parsing:  http ://
13/09/13 13:12:39 INFO parse.ParserChecker: contentType: text/html
13/09/13 13:12:39 INFO parse.ParserChecker: signature: de2214c0120f01f00cb1b2c99f193057
http ://
Version: 5
Status: success(1,0)
Title: 百度一下,你就知道
Outlinks: 30
  outlink: toUrl:  http :// anchor: 搜索设置
  outlink: toUrl: http anchor: 登录
  outlink: toUrl: http anchor: 注册
  outlink: toUrl:  http :// anchor: 
  outlink: toUrl:  http :// anchor: 新?闻
  outlink: toUrl:  http :// anchor: 贴?吧
  outlink: toUrl:  http :// anchor: 知?道
  outlink: toUrl:  http :// anchor: 音?乐
  outlink: toUrl:  http :// anchor: 图?片
  outlink: toUrl:  http :// anchor: 视?频
  outlink: toUrl:  http :// anchor: 地?图
  outlink: toUrl:  http :// anchor: 手写
  outlink: toUrl:  http :// anchor: 拼音
  outlink: toUrl:  http :// anchor: 关闭
  outlink: toUrl:  http :// anchor: 百科
  outlink: toUrl:  http :// anchor: 文库
  outlink: toUrl:  http :// anchor: hao123
  outlink: toUrl:  http :// anchor: 更多>>
  outlink: toUrl:  http :// anchor: 把百度设为主页
  outlink: toUrl:  http :// anchor: 把百度设为主页
  outlink: toUrl:  http :// anchor: 安装百度浏览器
  outlink: toUrl:  http :// anchor: 加入百度推广
  outlink: toUrl:  http :// anchor: 搜索风云榜
  outlink: toUrl:  http :// anchor: 关于百度
  outlink: toUrl:  http :// anchor: About Baidu
  outlink: toUrl:  http :// anchor: 使用百度前必读
  outlink: toUrl:  http :// anchor: 
  outlink: toUrl:  http :// anchor: 
  outlink: toUrl:  http :// anchor: 
  outlink: toUrl:  http :// anchor: 
Content Metadata: Content-Length=4408 Expires=Fri, 13 Sep 2013 05:12:37 GMT Set-Cookie=BAIDUID=6D9 16 8DE43 16 2206106A095B0D79C9F2:FG=1; expires=Fri, 13-Sep-43 05:12:37 GMT; path=/; Connection=Close Server=BWS/1.0 Cache-Control=private Date=Fri, 13 Sep 2013 05:12:37 GMT BDQID=0xad50667e05149c9b P3P=CP=" OTI DSP COR IVA OUR IND COM " Content-Encoding=gzip BDPAGETYPE=1 Content-Type=text/html;charset=utf-8 BDUSERID=0 
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 
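
Most of the parsechecker output above is logging. A sketch of a filter that drops the timestamped log lines and keeps only the parse result, assuming the "YY/MM/DD HH:MM:SS LEVEL" prefix seen in these notes; strip_log_lines is a hypothetical helper name:

```shell
# Remove log lines of the form "13/09/13 13:12:39 INFO ..." from stdin.
strip_log_lines() {
  grep -v -E '^[0-9]{2}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} (INFO|WARN|ERROR) '
}

# Sample input shaped like the output above.
strip_log_lines <<'EOF'
13/09/13 13:12:39 INFO parse.ParserChecker: contentType: text/html
13/09/13 13:12:39 INFO parse.ParserChecker: signature: de2214c0120f01f00cb1b2c99f193057
Version: 5
Status: success(1,0)
Outlinks: 30
EOF
```

In practice the parsechecker output would be piped into this function.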


  crawl             one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD) (crawler)
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
  CLASSNAME         run the class named CLASSNAME
The six main commands: read, merge




