Crawlers
feihuadao
nutch2.3.1 SolrDeleteDuplicates.java: null pointer crash during deduplication
Modify the source code as follows:
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
  while (true) {
    if (currentDoc >= numDocs) {
      return false;
…
Original · 2016-11-02 15:56:24 · 586 views · 0 comments
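The excerpt above is cut off, so the author's actual change is not visible here. As a minimal sketch of what a loop-and-skip guard of this shape might look like, assuming the crash comes from a document with a missing field value: `SkippingRecordReader`, the `digest` key and the map-based documents are hypothetical stand-ins, not Nutch or SolrJ classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a record reader; not the Nutch class.
final class SkippingRecordReader {
    private final List<Map<String, Object>> docs;
    private int currentDoc = 0;
    private String currentKey;

    SkippingRecordReader(List<Map<String, Object>> docs) {
        this.docs = docs;
    }

    // Advances to the next document that has a non-null "digest" value.
    // Documents that are null or lack the field are skipped, not dereferenced.
    boolean nextKeyValue() {
        while (true) {
            if (currentDoc >= docs.size()) {
                return false;
            }
            Map<String, Object> doc = docs.get(currentDoc);
            currentDoc++;
            Object digest = (doc == null) ? null : doc.get("digest");
            if (digest == null) {
                continue; // assumed cause of the crash: skip this record
            }
            currentKey = digest.toString();
            return true;
        }
    }

    String getCurrentKey() {
        return currentKey;
    }

    // Convenience helper: collects every key the reader yields.
    static List<String> readAll(List<Map<String, Object>> docs) {
        SkippingRecordReader reader = new SkippingRecordReader(docs);
        List<String> keys = new ArrayList<>();
        while (reader.nextKeyValue()) {
            keys.add(reader.getCurrentKey());
        }
        return keys;
    }
}
```

The point of the `while (true)` wrapper is that a bad record advances the cursor and retries instead of ending the whole job.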
nutch2.3.1 crawl marker flow
crawlstatus:
STATUS_UNFETCHED = 0x01; //Page was not fetched yet
STATUS_FETCHED = 0x02; //Page was successfully fetched
STATUS_GONE = 0x03; //Page no longer exists
ST…
Original · 2016-11-08 16:54:53 · 853 views · 0 comments
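When inspecting stored pages it can help to turn the raw status byte into a readable label. A minimal sketch covering only the three values listed in the excerpt above; the class `CrawlStatusNames` and the method `describe` are hypothetical names for illustration, and any further status values (the excerpt is cut off) fall through to the "unknown" branch.

```java
// Hypothetical helper; only the three values shown in the excerpt are mapped.
final class CrawlStatusNames {
    static final byte STATUS_UNFETCHED = 0x01; // page was not fetched yet
    static final byte STATUS_FETCHED = 0x02;   // page was successfully fetched
    static final byte STATUS_GONE = 0x03;      // page no longer exists

    // Returns a readable label for a status byte, or "unknown (0x..)" otherwise.
    static String describe(byte status) {
        switch (status) {
            case STATUS_UNFETCHED:
                return "unfetched";
            case STATUS_FETCHED:
                return "fetched";
            case STATUS_GONE:
                return "gone";
            default:
                return "unknown (0x" + String.format("%02x", status) + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(STATUS_FETCHED));
    }
}
```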
nutch2.3.1 nutch-site.xml configuration
<configuration>
<property> <name>storage.data.store.class</name> <value>org.apache.gora.mongodb.store.MongoStore</value> </property>
<property> <name>http.agent.name</name> <value>User-…
Original · 2016-11-01 10:54:40 · 985 views · 0 comments
nutch2.3.1: a malformed URL crashes updatejob
The bad URL was probably produced by parsing malformed HTML. Wrap the body of map() in DbUpdateMapper.java in a try/catch:
@Override
public void map(String key, WebPage page, Context context)
    throws IOException, InterruptedException {
  if (Mark.…
Original · 2016-11-01 15:21:34 · 780 views · 0 comments
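The idea of guarding each record so that one bad URL does not abort the whole job can be sketched outside Nutch. This is an illustration under assumptions, not the author's patch (the excerpt above is cut off): `GuardedUrlMapper` and its fields are hypothetical, and the per-record work is reduced to extracting the host with the standard `java.net.URI` API.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a mapper; not the Nutch DbUpdateMapper.
final class GuardedUrlMapper {
    final List<String> hosts = new ArrayList<>();
    int skipped = 0;

    // Processes one record. A URL that cannot be parsed is counted and
    // skipped instead of letting the exception propagate and kill the job.
    void map(String rawUrl) {
        try {
            String host = new URI(rawUrl).toURL().getHost();
            hosts.add(host);
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            // Illegal characters, unknown protocol, or a URL without a scheme.
            skipped++;
        }
    }

    public static void main(String[] args) {
        GuardedUrlMapper mapper = new GuardedUrlMapper();
        mapper.map("http://example.com/page");
        mapper.map("not a url");
        System.out.println(mapper.hosts + " skipped=" + mapper.skipped);
    }
}
```

In a real job the catch branch would normally also log the offending key, so that bad records can be found and cleaned up later.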
nutch2.3.1: meta_keywords longer than the max length 32766 when building the solr6 index
There are two fixes:
1. In the managed schema, set indexed="false" on the meta_* fields.
2. Modify MetaTagsParser.java in the nutch code as follows:
private void addIndexedMetatags(Map<CharSequence, ByteBuffer> metadata, String metatag, String value) {
  //ad…
Original · 2016-11-03 21:41:54 · 1005 views · 0 comments
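The author's change to addIndexedMetatags is cut off above, so it is not reproduced here. One plausible approach for the second fix, shown only as a sketch, is to shorten the value before it is stored so that its UTF-8 encoding stays within 32766 bytes. The class `MetatagTruncator` and its methods are hypothetical names, and the code assumes the limit is measured in UTF-8 bytes rather than characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Nutch's MetaTagsParser.
final class MetatagTruncator {
    static final int MAX_TERM_BYTES = 32766;

    // Returns the longest prefix of value whose UTF-8 encoding fits in
    // maxBytes, cutting only on code point boundaries.
    static String truncateUtf8(String value, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < value.length()) {
            int cp = value.codePointAt(i);
            // Upper bound on the UTF-8 length of this code point.
            int len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + len > maxBytes) {
                break;
            }
            bytes += len;
            i += Character.charCount(cp);
        }
        return value.substring(0, i);
    }

    // Applies the assumed index limit.
    static String fitForIndex(String value) {
        return truncateUtf8(value, MAX_TERM_BYTES);
    }

    public static void main(String[] args) {
        String shortened = fitForIndex("keyword1, keyword2");
        System.out.println(shortened.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Truncating keeps the field searchable at the cost of losing the tail of very long values, whereas the first fix keeps the full value but makes the field unsearchable.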