Nutch Study Notes, Part 3

The steps Nutch follows to crawl web pages

1. Create a list of seed URLs

http://www.qq.com/
http://www.sina.com.cn/
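
A minimal way to set this up (the file name seed.txt is my own choice; inject reads every file in the directory you pass it):

mkdir -p urls
cat > urls/seed.txt <<EOF
http://www.qq.com/
http://www.sina.com.cn/
EOF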

2. Inject the seed URLs into Nutch's crawldb

hadoop@slave5:~/nutch$ nutch inject crawl/crawldb urls/
Injector: starting at 2013-07-14 17:19:07
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 2
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-07-14 17:19:10, elapsed: 00:00:02
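
The "rejected by filters" line shows that inject already applies URL filtering; the rules live in conf/regex-urlfilter.txt. A sketch of a rule set that would restrict the crawl to the two seed hosts (illustrative, not the configuration actually used above):

# skip URLs containing characters that usually mark dynamic pages
-[?*!@=]
# accept the two seed hosts
+^http://([a-z0-9]*\.)*qq\.com/
+^http://([a-z0-9]*\.)*sina\.com\.cn/
# reject everything else
-.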

3. Generate a fetch list so the content can be fetched and parsed

hadoop@slave5:~/nutch$ nutch generate crawl/crawldb crawl/segments
Generator: starting at 2013-07-14 17:19:14
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130714171917
Generator: finished at 2013-07-14 17:19:18, elapsed: 00:00:03

The command above created a new segment directory under crawl/segments; it holds the list of URLs scheduled for fetching.
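
By default every URL that is due gets selected; on anything larger than this toy crawl you would normally cap the fetch list with -topN (the value 1000 here is arbitrary):

nutch generate crawl/crawldb crawl/segments -topN 1000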

4. Later commands take the newest segment directory as an argument, so store it in the SEGMENT environment variable:

export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`

hadoop@slave5:~/nutch$ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
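
A quick check that the variable picked up the newest segment:

echo $SEGMENT
# crawl/segments/20130714171917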

5. Start the fetcher to actually download the content (the -noParsing flag defers parsing to the separate parse step):

nutch fetch $SEGMENT -noParsing

hadoop@slave5:~/nutch$ bin/nutch fetch $SEGMENT -noParsing
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-07-14 17:20:18
Fetcher: segment: crawl/segments/20130714171917
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 2 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.qq.com/ (queue crawl delay=5000ms)
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.sina.com.cn/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-07-14 17:20:21, elapsed: 00:00:02

6. Parse the fetched content:

bin/nutch parse $SEGMENT

hadoop@slave5:~/nutch$ nutch parse $SEGMENT
ParseSegment: starting at 2013-07-14 17:21:35
ParseSegment: segment: crawl/segments/20130714171917
Parsed (10ms):http://www.qq.com/
http://www.sina.com.cn/ skipped. Content of size 135311 was truncated to 65536
ParseSegment: finished at 2013-07-14 17:21:37, elapsed: 00:00:01
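
Note that the sina.com.cn page was skipped: Nutch truncates downloads at 65536 bytes by default (the http.content.limit property) and skips parsing truncated content. A sketch of raising the limit in conf/nutch-site.xml; a value of -1 removes the limit entirely:

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>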

7. Update the Nutch crawldb. The updatedb command merges the new URLs discovered by the fetch and parse steps above into the crawldb so they can be crawled in later rounds. Besides the URLs themselves, it also records each page's fetch status and schedule, which keeps the same URLs from being fetched over and over.

bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize

At this point one complete crawl cycle has finished. You can repeat the cycle (steps 3 through 7) as many times as you like to crawl more content; see the loop sketch after the transcript below.

hadoop@slave5:~/nutch$ bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
CrawlDb update: starting at 2013-07-14 17:33:55
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130714171917]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-07-14 17:33:56, elapsed: 00:00:01
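
A sketch of scripting the repeated cycle mentioned above (three rounds here is my own arbitrary choice):

for i in 1 2 3; do
  nutch generate crawl/crawldb crawl/segments
  SEGMENT=crawl/segments/`ls -tr crawl/segments | tail -1`
  nutch fetch $SEGMENT -noParsing
  nutch parse $SEGMENT
  nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
done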

8. Build the link database (invert the links)

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

hadoop@slave5:~/nutch$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting at 2013-07-14 17:37:43
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/home/hadoop/apache-nutch-1.7/crawl/segments/20130714171917
LinkDb: finished at 2013-07-14 17:37:44, elapsed: 00:00:01
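
To inspect the result, the linkdb can be dumped to plain text (the output directory name linkdb_dump is my own choice):

bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump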

9. Index the content of all segments into Solr (this is the step where my run hit an error)

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Once this succeeds, everything Nutch has crawled is indexed by Solr, and you can run queries through the Solr Admin page:

http://127.0.0.1:8983/solr/admin
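
You can also query Solr directly over HTTP; a sketch using typical Solr defaults, which may differ on your install:

curl "http://127.0.0.1:8983/solr/select?q=*:*&wt=json"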






