Nutch search returns 0 results...

DSADSFGSADGGGGGGGGGG

Tags: generator thread parsing url 2010 file

Article link: https://blog.csdn.net/DSADSFGSADGGGGGGGGGG/article/details/5643840

crawl started in: mydir
rootUrlDir = urls
threads = 4
depth = 2
topN = 50
Injector: starting
Injector: crawlDb: mydir/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20100601161618
Generator: filtering: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: mydir/segments/20100601161618
Fetcher: threads: 4
QueueFeeder finished: total 1 records.
-finishing thread FetcherThread, activeThreads=3
fetching http://www.sina.com.cn/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=2
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: mydir/crawldb
CrawlDb update: segments: [mydir/segments/20100601161618]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20100601161637
Generator: filtering: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: mydir/segments/20100601161637
Fetcher: threads: 4
fetching http://news.sina.com.cn/pfpnews/js/libweb.js
fetching http://pfp.sina.com.cn/pfpnew/merge/res_PGLS000022_FP.js
QueueFeeder finished: total 3 records.
-finishing thread FetcherThread, activeThreads=3
fetching http://i2.sinaimg.cn/dy/deco/2010/0527/headwww.js
Error parsing: http://news.sina.com.cn/pfpnews/js/libweb.js: org.apache.nutch.parse.ParseException: parser not found for contentType=application/javascript url=http://news.sina.com.cn/pfpnews/js/libweb.js
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

-finishing thread FetcherThread, activeThreads=2
Error parsing: http://pfp.sina.com.cn/pfpnew/merge/res_PGLS000022_FP.js: org.apache.nutch.parse.ParseException: parser not found for contentType=application/javascript url=http://pfp.sina.com.cn/pfpnew/merge/res_PGLS000022_FP.js
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: mydir/crawldb
CrawlDb update: segments: [mydir/segments/20100601161637]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: mydir/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/C:/cygwin/home/Administrator/nutch-1.0/mydir/segments/20100601161618
LinkDb: adding segment: file:/C:/cygwin/home/Administrator/nutch-1.0/mydir/segments/20100601161637
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: mydir/indexes
Dedup: done
merging indexes to: mydir/index
Adding file:/C:/cygwin/home/Administrator/nutch-1.0/mydir/indexes/part-00000
done merging
crawl finished: mydir
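
A log this long is easier to read once it is reduced to which URLs were fetched and which of them failed to parse. The sketch below is one possible way to do that in Python. The function name `summarize_crawl_log` and both regular expressions are hypothetical helpers written against the line shapes that appear in this log ("fetching ..." and "Error parsing: ..."); they are not part of Nutch, and other Nutch versions may format these lines differently.

```python
import re

# Assumed line shapes, taken from the crawl output above (not a Nutch API).
FETCH_RE = re.compile(r"^fetching\s+(\S+)")
ERROR_RE = re.compile(r"^Error parsing:\s+(\S+?):\s+.*contentType=(\S+)")


def summarize_crawl_log(lines):
    """Return (fetched URLs, {failed URL: content type}) from crawl output lines."""
    fetched = []
    parse_errors = {}
    for raw in lines:
        line = raw.strip()
        m = FETCH_RE.match(line)
        if m:
            fetched.append(m.group(1))
            continue
        m = ERROR_RE.match(line)
        if m:
            parse_errors[m.group(1)] = m.group(2)
    return fetched, parse_errors


# A short excerpt of the log above, used as sample input.
sample = """\
fetching http://www.sina.com.cn/
fetching http://news.sina.com.cn/pfpnews/js/libweb.js
Error parsing: http://news.sina.com.cn/pfpnews/js/libweb.js: org.apache.nutch.parse.ParseException: parser not found for contentType=application/javascript url=http://news.sina.com.cn/pfpnews/js/libweb.js
""".splitlines()

fetched, parse_errors = summarize_crawl_log(sample)
no_error_logged = [u for u in fetched if u not in parse_errors]
print(len(fetched), "fetched;", len(parse_errors), "parse errors")
print("no parse error logged for:", no_error_logged)
```

To run it on the real output, save the console log to a file and pass the result of `open(path).read().splitlines()` instead of `sample`. In the log above, four `fetching` lines appear across the two rounds and two of those URLs (both JavaScript files) end in a parse error.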
