Several problems encountered deploying a distributed crawler + SolrCloud

Problem 1. WARN crawl.Generator: Generator: 0 records selected for fetching

    Possible causes:

    1) The regular expressions in regex-urlfilter.txt are wrong, so every candidate URL gets filtered out (see the sketch below).
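For reference, a minimal regex-urlfilter.txt that whitelists a single domain; example.com is a placeholder for your own seed domain. Rules are applied top-down and the first match wins:

    # skip file:, ftp: and mailto: urls
    -^(file|ftp|mailto):
    # accept anything under the seed domain
    +^https?://([a-z0-9-]*\.)*example\.com/
    # reject everything else (a URL that matches no rule is also rejected)
    -.

In Nutch 1.x you can sanity-check the rules by piping seed URLs into bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined, which should print a + or - for each URL.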

Problem 2. Bad Request

request: http://XXXXX:8080/solr/CultureSearch/update?wt=javabin&version=2

This is caused by a problem in the SolrCloud configuration files, mainly the collection's schema.xml.

In my case there were two causes: first, a jar that schema.xml depends on was missing from the classpath even though schema.xml referenced it; second, the type of the _version_ field did not match what SolrCloud expects.
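SolrCloud uses _version_ for optimistic concurrency and requires it to be an indexed, stored long; declaring it with any other type makes update requests fail with Bad Request. The expected declaration in schema.xml (per the stock Solr 4.x example schema):

    <field name="_version_" type="long" indexed="true" stored="true"/>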


Problem 3.

15/04/07 23:31:03 INFO mapreduce.Job: Task Id : attempt_1427710479955_0129_r_000000_0, Status : FAILED
Error: java.io.IOException
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:173)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:137)
    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:511)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:334)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:462)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.solr.client.solrj.SolrServerException: org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing request can not be repeated.
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:475)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)
    ... 14 more
Caused by: org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing request can not be repeated.
    at org.apache.commons.httpclient.methods.EntityEnclosingMethod.writeRequestBody(EntityEnclosingMethod.java:487)
    at org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:2114)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1096)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:422)
    ... 17 more

15/04/07 23:45:44 INFO mapreduce.Job: Task Id : attempt_1427710479955_0129_r_000000_1, Status : FAILED
Error: java.lang.RuntimeException: problem advancing post rec#954132
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1364)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:213)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:209)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:176)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:462)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.io.IOException: Cannot initialize the class: class org.apache.hadoop.io.NullWritable
    at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:49)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
    at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1421)
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1361)
    ... 11 more

15/04/08 00:08:39 ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)


This one is odd; the log suggests it is Solr-related. The commons-httpclient message "Unbuffered entity enclosing request can not be repeated" just means a streamed POST to Solr failed partway through and could not be retried, so the root cause is on the Solr or connection side. I did not find a good fix right away: make sure the Solr configuration files and the related jars are all deployed, and also check the execution permissions of the Tomcat process hosting Solr.
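A rough sketch of those deployment checks; $NUTCH_HOME, /opt/solr and /opt/tomcat are assumed paths and "tomcat" is an assumed service user, so substitute your own:

    # push the Nutch schema into the Solr collection's config
    cp $NUTCH_HOME/conf/schema.xml /opt/solr/CultureSearch/conf/
    # confirm any analyzer/plugin jars referenced by schema.xml are present
    ls /opt/solr/CultureSearch/lib/
    # let the user running Tomcat read and execute the Solr and Tomcat trees
    chown -R tomcat:tomcat /opt/solr /opt/tomcat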

Problem 4. When this error occurs the job aborts and the crawler stops crawling altogether.

(The failed reduce attempts log the same two stack traces shown in Problem 3; the job then gives up:)

Job failed as tasks failed. failedMaps:0 failedReduces:1

    File System Counters
        FILE: Number of bytes read=4550678317
        FILE: Number of bytes written=9135325272
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1948818838
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=84
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Failed reduce tasks=4
        Killed map tasks=10
        Launched map tasks=31
        Launched reduce tasks=4
        Data-local map tasks=18
        Rack-local map tasks=13
        Total time spent by all maps in occupied slots (ms)=2958546
        Total time spent by all reduces in occupied slots (ms)=3033513
    Map-Reduce Framework
        Map input records=12879328
        Map output records=12879328
        Map output bytes=4555697629
        Map output materialized bytes=4582546030
        Input split bytes=2693
        Combine input records=0
        Spilled Records=25446303
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=4776
        CPU time spent (ms)=314730
        Physical memory (bytes) snapshot=7077011456
        Virtual memory (bytes) snapshot=25862344704
        Total committed heap usage (bytes)=7811366912
    File Input Format Counters
        Bytes Read=1948816145

15/04/08 00:08:39 ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
    (same stack trace as in Problem 3)

Two fixes suggested online (editing nutch-default.xml, and keeping the Nutch and Solr configuration files consistent) did not solve the problem. An expert in a chat group suggested checking the Hadoop logs, and the DataNode logs did indeed show an error (quoted below). Searching the web for that error turned up the fix that follows.
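For reference, one way to pull such lines out of the DataNode logs; the log path is an assumption, adjust it to your Hadoop install:

    grep "DataXceiver" $HADOOP_HOME/logs/hadoop-*-datanode-*.log | tail -n 20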


2015-04-10 04:04:33,899 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop4:50010:DataXceiver error processing WRITE_BLOCK operation  src: /XXXXXXX  dest: /XXXXX:50010

Based on this error I raised the DataNode transfer-thread limit in Hadoop's hdfs-site.xml:
    <property>
        <name>dfs.datanode.max.transfer.threads</name>
        <value>8192</value>
    </property>
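This property (formerly dfs.datanode.max.xcievers) caps the number of threads a DataNode can use to move block data. The Hadoop 2.x default is 4096, which a heavy indexing job can exhaust, and the exhaustion surfaces as DataXceiver errors on WRITE_BLOCK; raising it to 8192 gives the job more headroom. The DataNodes must be restarted for the change to take effect.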


Other possible fixes (both sketched below):

Delete the /linkdb path in HDFS;

Change the value of the plugin.folders property in nutch-default.xml.
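A rough sketch of both. The linkdb path here assumes your crawl wrote it to /linkdb, and the plugin.folders value shown is the stock default; in distributed mode it must point at wherever the plugins directory actually sits once the job jar is unpacked:

    # remove the linkdb so the next indexing run rebuilds it
    hadoop fs -rm -r /linkdb

    <property>
        <name>plugin.folders</name>
        <value>plugins</value>
    </property>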

Reposted from: https://my.oschina.net/u/2329222/blog/397032
