Getting Started with Nutch 1.4 + Solr 3.5

This article walks through installing and configuring Nutch 1.4 with Solr 3.5: environment setup, configuration changes, a test crawl, and Solr integration. It also covers common errors such as the missing Nutch fetcher agent name and "input path does not exist".

1. Download Nutch 1.4 & Solr 3.5

http://mirror.bjtu.edu.cn/apache/lucene/solr/3.5.0/apache-solr-3.5.0.zip
http://mirror.bjtu.edu.cn/apache/nutch/apache-nutch-1.4-bin.zip


2. Extract and Install

2.1 Directory layout

Nutch install directory: /home/xcloud/spider/nutch
Solr install directory: /home/xcloud/spider/solr


2.2 Extract the files

Extract Nutch:
unzip apache-nutch-1.4-bin.zip
mv apache-nutch-1.4 nutch
chmod -R 775 nutch


Extract Solr:
unzip apache-solr-3.5.0.zip
mv apache-solr-3.5.0 solr
chmod -R 775 solr


3. Nutch Configuration and Test Run

3.1 Nutch directory layout

Nutch run directory: $NUTCH_HOME/runtime/local/bin
Nutch config directory: $NUTCH_HOME/runtime/local/conf
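Later commands refer to $NUTCH_HOME and ${SOLR_HOME}. These are conventions of this guide, not variables Nutch or Solr set for you, so export them once per shell (paths from section 2.1):

```shell
# Convenience variables for this guide (install paths from section 2.1)
export NUTCH_HOME=/home/xcloud/spider/nutch
export SOLR_HOME=/home/xcloud/spider/solr

# Sanity check: the run and conf directories referenced below
echo "$NUTCH_HOME/runtime/local/bin"
echo "$NUTCH_HOME/runtime/local/conf"
```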


3.2 Nutch test run

$NUTCH_HOME/runtime/local/bin/nutch


xcloud@xcloud:~/spider/nutch/runtime/local/bin$ ./nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.


Expert: -core option is for developers only. It avoids building the job jar, 
        instead it simply includes classes compiled with ant compile-core. 
        NOTE: this works only for jobs executed in 'local' mode
xcloud@xcloud:~/spider/nutch/runtime/local/bin$


If this usage listing is printed, the installation works.


3.3 Nutch configuration changes

Edit $NUTCH_HOME/runtime/local/conf/nutch-site.xml:


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<!-- Put site-specific property overrides in this file. -->


<configuration>
  <property>
        <name>http.agent.name</name>
        <value>MyNutchAgent</value>
        <description>HTTP 'User-Agent' request header. MUST NOT be empty -
          please set this to a single word uniquely related to your organization.
          NOTE: You should also check other related properties:
                http.robots.agents
                http.agent.description
                http.agent.url
                http.agent.email
                http.agent.version
                and set their values appropriately.
        </description>
  </property>
  <property>
        <name>http.agent.description</name>
        <value></value>
        <description>Further description of our bot- this text is used in
        the User-Agent header. It appears in parenthesis after the agent name.
        </description>
  </property>
  <property>
        <name>http.agent.url</name>
        <value></value>
        <description>A URL to advertise in the User-Agent header. This will
        appear in parenthesis after the agent name. Custom dictates that this
        should be a URL of a page explaining the purpose and behavior of this crawler.
        </description>
  </property>
  <property>
        <name>http.agent.email</name>
        <value></value>
        <description>An email address to advertise in the HTTP 'From' request
        header and User-Agent header. A good practice is to mangle this
        address (e.g. 'info at example dot com') to avoid spamming.
        </description>
  </property>
</configuration>


3.4 Create the seed directory

mkdir $NUTCH_HOME/runtime/local/bin/urls
echo http://nutch.apache.org/ >> $NUTCH_HOME/runtime/local/bin/urls/seed.txt
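To convince yourself the seed file came out right before pointing Nutch at it, the same two steps can be rehearsed in a scratch directory first (the mktemp path here is purely illustrative):

```shell
# Rehearse the seed setup in a throwaway directory
tmp=$(mktemp -d)
mkdir "$tmp/urls"
echo http://nutch.apache.org/ >> "$tmp/urls/seed.txt"

# One URL per line; the Injector reads every file under urls/
cat "$tmp/urls/seed.txt"
rm -rf "$tmp"
```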


3.5 Edit conf/regex-urlfilter.txt

sudo gedit $NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt

Change the catch-all rule

# accept anything else
+.

to:

+^http://([a-z0-9]*\.)*nutch.apache.org/
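Nutch compiles these rules with Java's regex engine, but grep -E is a close enough approximation for a quick sanity check of the pattern body (everything after the leading + include marker):

```shell
# Pattern body of the rule above (the + include marker stripped off)
pattern='^http://([a-z0-9]*\.)*nutch.apache.org/'

# Should match the seed host and its subdomains...
echo 'http://nutch.apache.org/'      | grep -Eq "$pattern" && echo 'seed: match'
echo 'http://wiki.nutch.apache.org/' | grep -Eq "$pattern" && echo 'subdomain: match'

# ...and reject everything else
echo 'http://example.com/'           | grep -Eq "$pattern" || echo 'other host: filtered'
```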




3.6 Crawl

Run from $NUTCH_HOME/runtime/local/bin, where the urls directory from 3.4 was created:

./nutch crawl urls -dir crawl -depth 3 -topN 5

Here -dir names the output crawl directory, -depth is the number of generate/fetch/update rounds, and -topN caps the URLs fetched per round.


3.7 solrindex

With Solr up and running (section 4), index the crawl data:

./nutch solrindex http://127.0.0.1:8983/solr/ mycrawl/crawldb -linkdb mycrawl/linkdb mycrawl/segments/*

(mycrawl here matches the fix from section 5.2; use crawl instead if you kept the directory name from 3.6.)


4. Solr Configuration and Test Run

4.1 Start Solr

Run from ${SOLR_HOME}/example:
java -jar start.jar


4.2 Verify in a browser

http://localhost:8983/solr/admin/
http://localhost:8983/solr/admin/stats.jsp


4.3 Nutch & Solr integration

Replace Solr's example schema with the one that ships with Nutch:

cp $NUTCH_HOME/runtime/local/conf/schema.xml ${SOLR_HOME}/example/solr/conf/

Restart Solr:

java -jar start.jar

Then query for "nutch" in the admin page; the results are returned as XML.




5. Problems Encountered

5.1 Nutch Fetcher: No agents listed in 'http.agent.name' property


Fix: edit $NUTCH_HOME/runtime/local/conf/nutch-site.xml and set http.agent.name, as shown in section 3.3.


5.2 Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist


Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/xcloud/spider/nutch/runtime/local/bin/crawl/segments/20120405100520/parse_data
Input path does not exist: file:/home/xcloud/spider/nutch/runtime/local/bin/crawl/segments/20120405101222/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)


Original: bin/nutch crawl urls -dir crawl -depth 3 -topN 5
Fix: create the output directory with write permission first, then crawl into it:

mkdir mycrawl
chmod 777 mycrawl

bin/nutch crawl urls -dir mycrawl -depth 3 -topN 5


5.3 regex-urlfilter.txt regex rule problem

xcloud@xcloud:~/nutch$ nutch crawl urls -dir db -depth 2 -topN 2
solrUrl is not set, indexing will be skipped...
crawl started in: db
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=null
topN = 2
Injector: starting at 2012-04-09 14:56:20
Injector: crawlDb: db/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Check logs/hadoop.log:

Caused by: java.io.IOException: Invalid first character: http://XXX/product/*
at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:200)
at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
... 21 more
Original rule:

http://XXX/product/*

Change it to:

+^http://XXX/product/*
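The "Invalid first character" message comes from RegexURLFilterBase, which rejects any non-comment rule that does not begin with + (include) or - (exclude). A small shell sketch of that first-character check:

```shell
# Mimic RegexURLFilterBase's first-character validation of filter rules
check_rule() {
  case "$1" in
    [+-]*) echo "ok: $1" ;;
    *)     echo "invalid first character: $1" ;;
  esac
}

check_rule 'http://XXX/product/*'     # the broken rule from the log
check_rule '+^http://XXX/product/*'   # the corrected rule
```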


5.4 Timeout exceptions

2012-04-09 16:35:52,067 ERROR http.Http - java.net.SocketTimeoutException: Read timed out
2012-04-09 16:35:52,067 ERROR http.Http - at java.net.SocketInputStream.socketRead0(Native Method)
2012-04-09 16:35:52,067 ERROR http.Http - at java.net.SocketInputStream.read(SocketInputStream.java:129)
2012-04-09 16:35:52,067 ERROR http.Http - at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
2012-04-09 16:35:52,067 ERROR http.Http - at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
2012-04-09 16:35:52,067 ERROR http.Http - at java.io.FilterInputStream.read(FilterInputStream.java:116)
2012-04-09 16:35:52,068 ERROR http.Http - at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
2012-04-09 16:35:52,068 ERROR http.Http - at java.io.FilterInputStream.read(FilterInputStream.java:90)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:229)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:158)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)

Fix: raise http.timeout to 200000 (the value is in milliseconds) in conf/nutch-default.xml.
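Editing nutch-default.xml works, but the usual Nutch convention is to leave that file alone and override the property in nutch-site.xml instead; a fragment like the following (inside the <configuration> element) has the same effect:

```xml
<!-- In $NUTCH_HOME/runtime/local/conf/nutch-site.xml -->
<property>
  <name>http.timeout</name>
  <value>200000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
```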





2012-04-09 15:28:20,865 INFO  api.RobotRulesParser - Couldn't get robots.txt for http://xxx.html: java.net.SocketTimeoutException: Read timed out

Fix: modify the RobotRulesParser.java source in the lib-http plugin:

source: /home/xcloud/iworkspace/nutch/plugin/lib-http/org/apache/nutch/protocol/http/api/RobotRulesParser.java
class: /home/xcloud/iworkspace/nutch/bin/org/apache/nutch/protocol/http/api
jar path: /home/xcloud/nutch/runtime/local/plugins/lib-http

