nutch
chengqianl
lucene solr
Extra JARs required by nutch 1.4
nekohtml http://nekohtml.sourceforge.net/ dk.brics.automaton http://www.brics.dk/automaton/ rome http://mirrors.ibiblio.org/pub/mirrors/maven2/rome/rome/0.9/rome-0.9.jar tagsoup-1.1.3 http://www.fi...
Original · 2012-05-27 19:58:21 · 108 reads · 0 comments
nutch SolrIndexer explained
[img]http://dl.iteye.com/upload/attachment/0070/9707/99759312-b08c-308d-b142-17c8b826763f.jpg[/img] The details of this job are the same as for the nutch1.2 index job [url]http://chengqianl.iteye.com/admin/blogs/1597617[/url]. IndexerMap...
Original · 2012-07-18 18:33:07 · 146 reads · 0 comments
nutch1.2 DeleteDuplicates and IndexMerger explained
[img]http://dl.iteye.com/upload/attachment/0070/9571/dc62bf75-a090-399e-bf72-cb1b38a5e7c7.jpg[/img] job 1 map: the default Mapper; its output is key: Text url, value: IndexDoc. job.setInputFormat(Input...
Original · 2012-07-18 16:31:59 · 131 reads · 0 comments
nutch1.2 index explained
First, if the crawl/index and crawl/indexes directories exist, they are deleted. [img]http://dl.iteye.com/upload/attachment/0070/9519/a430b9dc-5f53-30cf-8a29-9fdcfd640db8.jpg[/img] map: IndexerMapReduce. The map input directories are, for every segment, crawl_fet...
Original · 2012-07-18 15:16:56 · 145 reads · 0 comments
nutch LinkDb invert explained
LinkDb [img]http://dl.iteye.com/upload/attachment/0070/9396/c9cab6fc-3367-3c31-9baa-1262cee8a7ee.jpg[/img] map: LinkDb. The input directories are the parse_data directories under every segment in the segments directory. 1. First, for the key (url), if filter and n...
Original · 2012-07-18 14:19:59 · 111 reads · 0 comments
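As a rough illustration of what "invert" means here, the sketch below turns a map of page to outlinks into a map of page to inlinks. It is an in-memory toy, not Nutch's LinkDb job: the class name `LinkInvertSketch` and the method `invert` are hypothetical, and the filtering and normalization step mentioned in the excerpt is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LinkInvertSketch {
    /**
     * Inverts "source page -> pages it links to" into
     * "target page -> pages that link to it".
     * Hypothetical helper; the real job works on parse_data via MapReduce.
     */
    public static Map<String, List<String>> invert(Map<String, List<String>> outlinks) {
        Map<String, List<String>> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : outlinks.entrySet()) {
            String source = entry.getKey();
            for (String target : entry.getValue()) {
                // Group every source under the target it points to.
                inlinks.computeIfAbsent(target, k -> new ArrayList<>()).add(source);
            }
        }
        return inlinks;
    }
}
```

In a MapReduce setting the inner loop would correspond to the map side (emit target, source) and the grouping to the reduce side.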
nutch crawldb update explained
crawldb update [img]http://dl.iteye.com/upload/attachment/0070/9302/e36cc6e0-519e-3a58-8ae0-bdb1eef4840f.jpg[/img] map: CrawlDbFilter. This map is mainly used to merge data. Input: under the segment directory produced by fetch, crawl_fetch...
Original · 2012-07-18 11:01:20 · 133 reads · 0 comments
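nutch fetcher explained
fetcher follows the producer-consumer pattern: the producer, QueueFeeder, continuously reads the input file, and the consumers, FetcherThread, continuously fetch URLs. The map input is crawl/segments/<specific segment>/crawl_generate. QueueFeeder [img]http://dl.iteye.com/upload/attachment/0070/8351/350c6d77...
Original · 2012-07-16 18:04:06 · 201 reads · 0 comments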
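The producer-consumer arrangement described above can be sketched with a bounded queue. This is a minimal illustration, not Nutch's Fetcher code: `FetchSketch`, `run`, and the `DONE` sentinel are hypothetical names, the queue capacity of 16 is arbitrary, and the "fetch" is a placeholder that only records the URL.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class FetchSketch {
    // Sentinel telling a consumer that no more URLs will arrive.
    private static final String DONE = "\u0000DONE";

    /** Runs one feeder thread and `threads` consumer threads; returns the URLs handled. */
    public static List<String> run(List<String> urls, int threads) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(16);
        List<String> fetched = Collections.synchronizedList(new ArrayList<>());

        // Producer: plays the role of QueueFeeder, pushing URLs into the queue.
        Thread feeder = new Thread(() -> {
            try {
                for (String url : urls) queue.put(url);
                for (int i = 0; i < threads; i++) queue.put(DONE); // one sentinel per consumer
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumers: play the role of FetcherThread, taking URLs until the sentinel.
        List<Thread> consumers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread consumer = new Thread(() -> {
                try {
                    while (true) {
                        String url = queue.take();
                        if (url.equals(DONE)) break;
                        fetched.add(url); // placeholder for a real HTTP fetch
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumers.add(consumer);
            consumer.start();
        }

        feeder.start();
        try {
            feeder.join();
            for (Thread consumer : consumers) consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return fetched;
    }
}
```

The bounded queue makes the feeder block when consumers fall behind, so the input is read only as fast as it can be fetched.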
nutch generator explained
[img]http://dl.iteye.com/upload/attachment/0070/8228/5e55caae-08ec-3e9b-a2ec-dafacb1773d7.jpg[/img] job1 map: Selector. The input directory is crawldb/current; the input key (Text) is the url and the value is a CrawlDatum. Its functions are as follows...
Original · 2012-07-16 15:31:34 · 142 reads · 0 comments
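nutch inject explained
nutch's inject consists of two jobs. The first job is shown in the figure below. [img]http://dl.iteye.com/upload/attachment/0070/8193/a71b6a19-b4c3-3cd6-90d8-2a490b9a61c9.jpg[/img] map: InjectMapper. Its functions are as follows: 1. Check whether the url is followed by tab-separated k-v pairs and, if so, record them; 2...
Original · 2012-07-16 14:27:21 · 151 reads · 0 comments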
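Step 1 above (a url followed by tab-separated k-v pairs) might look roughly like the sketch below. It assumes each pair is written as `key=value`, which the excerpt does not state; `SeedLine` and `parse` are hypothetical names, not part of InjectMapper.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SeedLine {
    public final String url;
    public final Map<String, String> metadata = new LinkedHashMap<>();

    private SeedLine(String url) {
        this.url = url;
    }

    /** Splits one seed line into its url and any tab-separated key=value pairs. */
    public static SeedLine parse(String line) {
        String[] parts = line.trim().split("\t");
        SeedLine seed = new SeedLine(parts[0].trim());
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq <= 0) continue; // assumption: skip tokens that are not key=value
            String key = parts[i].substring(0, eq).trim();
            String value = parts[i].substring(eq + 1).trim();
            seed.metadata.put(key, value);
        }
        return seed;
    }
}
```

For example, `SeedLine.parse("http://example.com/\tcategory=news")` yields the url `http://example.com/` with one metadata entry, `category` mapped to `news`.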
nutch configuration files
Initialization in the NutchConfiguration class: public static Configuration createCrawlConfiguration() { Configuration conf = new Configuration(); addNutchResources(conf, true); return conf; } This calls N...
Original · 2012-06-27 16:57:52 · 92 reads · 0 comments
nutch SolrDeleteDuplicates
[img]http://dl.iteye.com/upload/attachment/0070/9722/4cd4c22a-aeae-39a3-ad52-26d98b008fc4.jpg[/img] map: uses the default map. The InputFormat is responsible for splitting and converting the data: job.setInputFormat(SolrInputFormat.class); SolrInpu...
Original · 2012-07-19 12:24:02 · 119 reads · 0 comments