Nutch Development (Part 3)

Development environment
  • Linux, Ubuntu 20.04 LTS
  • IDEA
  • Nutch 1.18
  • Solr 8.11

Please credit the source when reposting. By 鸭梨的药丸哥

1. Nutch URL filtering

Nutch's URL filtering rules live mainly in regex-urlfilter.txt. Editing this file lets you customize which URLs the crawler keeps and which it drops.

# The default url filter.
# Better for whole-internet crawling.
# Please comment/uncomment rules to your needs.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

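# '+' keeps a URL, '-' filters it out
# The first regular expression that matches decides whether the URL is kept or filtered
# Patterns are tried from top to bottom
# URLs that match no pattern are filtered out by default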


# Filter out file:, ftp:, mailto: and similar URLs
# skip file: ftp: and mailto: urls
-^(?:file|ftp|mailto):

# skip URLs longer than 2048 characters, see also db.max.outlink.length
#-^.{2049,}

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
# Filter out URLs for images, CSS, JS, archives, and similar resources
-(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js|svg)$

# skip URLs containing certain characters as probable queries, etc.
# Filter out dynamic pages
-[?*!@=]
#-[!@]

# Filter out looping URLs such as http://www.baidu.com/p/p/p/p/p/
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# Then accept all other URLs (this rule is commented out here, so URLs matching nothing above are dropped)
# accept anything else
#+.
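
To make the "first match wins" behaviour concrete, here is a minimal Python sketch that evaluates the active rules from the listing above. It is not Nutch's code: Nutch uses Java regular expressions, and I am assuming a rule applies when its pattern is found anywhere in the URL. The helper names `parse_rules` and `accepts` and the example.com URLs are made up for illustration.

```python
import re

# Active rules from the listing above; the final "+." rule is left commented out.
DEFAULT_RULES = r"""
-^(?:file|ftp|mailto):
-(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js|svg)$
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
#+.
"""


def parse_rules(text):
    """Turn regex-urlfilter.txt style text into (keep, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first matching rule decides; a URL that matches nothing is dropped."""
    for keep, pattern in rules:
        if pattern.search(url):  # assumption: "found anywhere in the URL"
            return keep
    return False


strict = parse_rules(DEFAULT_RULES)            # "+." still commented out
lenient = parse_rules(DEFAULT_RULES + "+.\n")  # same rules plus a trailing "+."

for url in [
    "mailto:someone@example.com",          # rejected by the first rule
    "https://example.com/logo.PNG",        # rejected by the suffix rule
    "https://example.com/search?q=nutch",  # rejected as a probable query
    "http://www.baidu.com/p/p/p/p/p/",     # rejected by the loop rule
    "https://example.com/post/1",          # matches no '-' rule
]:
    print(url, accepts(strict, url), accepts(lenient, url))
```

Only the last URL differs between the two rule sets. With `+.` commented out it matches nothing and is dropped, which is why the example in the next section has to add explicit `+` rules.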

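2. Example

Rule order matters, because the first regular expression that matches decides whether a URL is kept or filtered out. Add the filter rules below to the file. With them, Nutch can crawl the posts on several blog sites.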

# Append to the end of the file
# First, include the blog home pages
+^https://www\.(cnblogs|jianshu|csdn|oschina)\.(net|com)$
+^https://cloud\.tencent\.com/developer$
+^https://developer\.aliyun\.com$
+^https://segmentfault\.com$

# Then include each blog's own domain and article path format
+^https://blog\.csdn\.net/[^/]+/article/details/.*
+^https://my\.oschina\.net/.*/blog/.*
+^https://cloud\.tencent\.com/developer/article/.+
+^https://www\.jianshu\.com/p/.+
+^https://www\.cnblogs\.com/.+/p/.+
+^https://developer\.aliyun\.com/article/.+
+^https://segmentfault\.com/a/.+

# Finally, exclude everything else on these sites apart from the home pages
-^https://developer\.aliyun\.com/.+
-^https://segmentfault\.com/.+
-^https://cloud\.tencent\.com/developer/.+
-^https://www\.(cnblogs|jianshu|csdn|oschina)\.(net|com)/.+
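
Why the order matters can be checked with a small Python sketch. It takes the three segmentfault rules from the example (with the literal dots escaped) and evaluates them in two different orders. The function `first_match` and both sample URLs are hypothetical, and Python's `re` module stands in for the Java regex engine that Nutch uses.

```python
import re


def first_match(rules, url):
    """Return the verdict of the first rule found in url, or False if none match."""
    for line in rules:
        keep, pattern = line[0] == "+", line[1:]
        if re.search(pattern, url):
            return keep
    return False


INCLUDE_HOME = r"+^https://segmentfault\.com$"
INCLUDE_POSTS = r"+^https://segmentfault\.com/a/.+"
EXCLUDE_REST = r"-^https://segmentfault\.com/.+"

good_order = [INCLUDE_HOME, INCLUDE_POSTS, EXCLUDE_REST]
bad_order = [INCLUDE_HOME, EXCLUDE_REST, INCLUDE_POSTS]  # exclude placed too early

article = "https://segmentfault.com/a/1190000000000000"  # hypothetical article URL
tag_page = "https://segmentfault.com/t/java"             # hypothetical non-article URL

print(first_match(good_order, article))   # True: the article rule is reached first
print(first_match(good_order, tag_page))  # False: only the exclude rule matches
print(first_match(bad_order, article))    # False: the broad exclude shadows the article rule
```

The broad `-` rule has to come after the specific `+` rules. Otherwise it matches every article URL first and the `+` rules are never reached.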

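3. Building the index in Solr

Nutch can submit indexes to a range of full-text search servers. This comes from Nutch's plugin-based design: with the right plugins enabled, Nutch can index the crawled data on a full-text search server with little effort.

Use the nutch script under bin/: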

./nutch solrindex ../nutch/crawldb/ -dir ../nutch/segments/ -deleteGone

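The location of the Solr server was covered in an earlier post, but to repeat it here: in Nutch 1.18, all index-related settings live in the index-writers.xml configuration file, which can be found in the conf/ directory.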

 <writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
    <!-- Server location and other connection settings -->
     <parameters>
      <param name="type" value="http"/>
      <param name="url" value="http://localhost:8983/solr/nutch"/>
      <param name="collection" value=""/>
      <param name="weight.field" value=""/>
      <param name="commitSize" value="1000"/>
      <param name="auth" value="false"/>
      <param name="username" value="username"/>
      <param name="password" value="password"/>
    </parameters>
     <!-- Field mappings are configured here -->
    <mapping>
      <!-- Copy the value of one field and append it to the value of another field -->
      <copy>
        <!-- <field source="content" dest="search"/> -->
        <!-- <field source="title" dest="title,search"/> -->
      </copy>
      <!-- Renames for the fields used by the index-metadata plugin -->
      <rename>
        <field source="metatag.description" dest="description"/>
        <field source="metatag.keywords" dest="keywords"/>
      </rename>
      <!-- Remove fields -->
      <remove>
        <field source="segment"/>
      </remove>
    </mapping>
  </writer>
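
As a rough illustration of what the three mapping sections mean, the Python sketch below applies copy, rename, and remove rules to a document held as a plain dict. This is not Nutch's implementation. The function `apply_mapping`, the sample document, and the order (copy, then rename, then remove) are all my assumptions. The copy rule is also enabled here, although it is commented out in the file above.

```python
def apply_mapping(doc, copy=(), rename=(), remove=()):
    """Apply copy, rename, and remove rules to a dict of field name -> list of values."""
    result = {name: list(values) for name, values in doc.items()}
    for source, dests in copy:
        if source not in result:
            continue
        for dest in dests.split(","):
            if dest != source:
                # append the source values after whatever dest already holds
                result.setdefault(dest, []).extend(result[source])
    for source, dest in rename:
        if source in result:
            result[dest] = result.pop(source)
    for source in remove:
        result.pop(source, None)
    return result


# Hypothetical document as it might look before it is sent to Solr.
doc = {
    "title": ["Hello"],
    "content": ["body text"],
    "metatag.description": ["a page"],
    "segment": ["20220101000000"],
}

mapped = apply_mapping(
    doc,
    copy=[("content", "search")],
    rename=[("metatag.description", "description"),
            ("metatag.keywords", "keywords")],
    remove=["segment"],
)
print(mapped)
```

In this sketch the `segment` field is dropped, `metatag.description` arrives as `description`, and `search` receives a copy of `content`. The `metatag.keywords` rename does nothing because the sample document has no such field.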

Configuring Solr fields

To see which fields Nutch will index, run the nutch script command below. (The URL is another post of mine, about using the IK analyzer in Solr.)

./nutch indexchecker https://blog.csdn.net/musicmtv/article/details/22758817

Unlike older releases such as Nutch 1.8, Nutch 1.18 does not provide a schema.xml for configuring the Solr core.

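Below is the schema.xml from Nutch 2.4 for reference. The actual field configuration depends on your setup, and the schema.xml files from other Nutch versions are also worth consulting. Note that this file targets an older Solr: as far as I know, Solr 7 and later no longer accept the defaultSearchField and solrQueryParser elements in the schema, so those two lines will probably need to be dropped for Solr 8.11.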

<!-- Fields used by the various plugins in Nutch 2.4 -->
<fields>
    <!-- This field is used internally by Solr, for example by features 
    like partial update functionality and update log. It is NOT required
    if updateLog is turned off in your updateHandler, however it is advised
    to include it as performance improvements are minimal. -->
    <field name="_version_" type="long" indexed="true" stored="true"/>
    
    <field name="id" type="string" stored="true" indexed="true" required="true"/>

    <!-- core fields -->
    <field name="batchId" type="string" stored="true" indexed="false"/>
    <field name="digest" type="string" stored="true" indexed="false"/>
    <field name="boost" type="float" stored="true" indexed="false"/>

    <!-- fields for index-basic plugin -->
    <field name="host" type="url" stored="false" indexed="true"/>
    <field name="url" type="url" stored="true" indexed="true"/>
    <!-- stored=true for highlighting, use term vectors  and positions for fast highlighting -->
    <field name="content" type="text_general" stored="true" indexed="true"/>
    <field name="title" type="text_general" stored="true" indexed="true" multiValued="true"/>
    <field name="cache" type="string" stored="true" indexed="false"/>
    <field name="tstamp" type="date" stored="true" indexed="false"/>

    <!-- catch-all field -->
    <field name="text" type="text_general" stored="false" indexed="true" multiValued="true"/>

    <!-- fields for index-anchor plugin -->
    <field name="anchor" type="text_general" stored="true" indexed="true"
        multiValued="true"/>

    <!-- fields for index-more plugin -->
    <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="contentLength" type="string" stored="true" indexed="false"/>
    <field name="lastModified" type="date" stored="true" indexed="false"/>
    <field name="date" type="tdate" stored="true" indexed="true"/>
    
    <!-- fields for index-metadata plugin -->  
    <dynamicField name="meta_*" type="string" stored="true" indexed="true"/>

    <!-- fields for languageidentifier plugin -->
    <field name="lang" type="string" stored="true" indexed="true"/>

    <!-- fields for subcollection plugin -->
    <field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>

    <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
    <field name="author" type="string" stored="true" indexed="true"/>
    <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="feed" type="string" stored="true" indexed="true"/>
    <field name="publishedDate" type="date" stored="true" indexed="true"/>
    <field name="updatedDate" type="date" stored="true" indexed="true"/>

    <!-- fields for creativecommons plugin -->
    <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>

    <!-- fields for tld plugin -->    
    <field name="tld" type="string" stored="false" indexed="false"/>

    <!-- fields for index-html plugin
         Note: although raw document content may be binary,
               index-html adds a String to the index field -->
    <field name="rawcontent" type="string" stored="true" indexed="false"/>

 </fields>
 <uniqueKey>id</uniqueKey>
 <defaultSearchField>text</defaultSearchField>
 <solrQueryParser defaultOperator="OR"/>

  <!-- copyField commands copy one field to another at the time a document
        is added to the index.  It's used either to index the same field differently,
        or to add multiple fields to the same field for easier/faster searching.  -->

 <copyField source="content" dest="text"/>
 <copyField source="url" dest="text"/>
 <copyField source="title" dest="text"/>
 <copyField source="anchor" dest="text"/>
 <copyField source="author" dest="text"/>

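4. About Nutch plugins

Nutch's functionality can be extended by adding plugins of various kinds. The available plugins can be found in the plugins/ directory. To choose which ones to use, set the plugin.includes property in conf/nutch-site.xml.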

<property>
  	<name>plugin.includes</name>
    <!-- Use a regular expression to select the plugins you need -->
  	<value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
	<description>Regular expression naming plugin directory names to
	    include. Any plugin not matching this expression is excluded.
	    In any case you need at least include the nutch-extensionpoints plugin. By
	    default Nutch includes crawling just HTML and plain text via HTTP,
	    and basic indexing and search plugins. In order to use HTTPS please enable
	    protocol-httpclient, but be aware of possible intermittent problems with the
	    underlying commons-httpclient library.
	</description>
</property>
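
To get a feel for which plugins the expression above selects, here is a small Python sketch. It assumes, as the description suggests, that the expression has to match the whole plugin directory name. The helper `is_included` is made up, and the directory names tested are only examples.

```python
import re

# The plugin.includes value from the property above.
PLUGIN_INCLUDES = (
    r"protocol-http|protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)"
    r"|index-(basic|anchor|metadata)|indexer-solr|query-(basic|site|url)"
    r"|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_dir, includes=PLUGIN_INCLUDES):
    """True when the whole directory name matches the plugin.includes expression."""
    return re.fullmatch(includes, plugin_dir) is not None


for name in ["parse-html", "parse-js", "index-metadata", "index-more",
             "indexer-solr", "indexer-elastic", "protocol-httpclient"]:
    print(f"{name:22} {'included' if is_included(name) else 'excluded'}")
```

To enable another plugin, extend the matching group, for example by changing `index-(basic|anchor|metadata)` to `index-(basic|anchor|metadata|more)`.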

5. Nutch's default configuration

All of Nutch's default settings can be found in nutch-default.xml. Read that file to learn what can be configured, then add properties to conf/nutch-site.xml to override the defaults.
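
The override behaviour can be pictured with a short Python sketch: load the properties from the defaults first, then let the site file replace any property with the same name. This only illustrates the idea and is not how Nutch actually loads its configuration. The two XML snippets and their values are invented for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-ins for nutch-default.xml and nutch-site.xml (values are made up).
NUTCH_DEFAULT = """<configuration>
  <property><name>http.agent.name</name><value></value></property>
  <property><name>plugin.includes</name><value>protocol-http|urlfilter-regex</value></property>
</configuration>"""

NUTCH_SITE = """<configuration>
  <property><name>http.agent.name</name><value>my-crawler</value></property>
</configuration>"""


def load_properties(xml_text):
    """Read <property> entries into a dict of name -> value."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def effective_config(default_xml, site_xml):
    """Start from the defaults, then let site properties override them by name."""
    config = load_properties(default_xml)
    config.update(load_properties(site_xml))
    return config


config = effective_config(NUTCH_DEFAULT, NUTCH_SITE)
print(config["http.agent.name"])  # value taken from the site file
print(config["plugin.includes"])  # untouched default
```

Only the properties you actually want to change need to appear in nutch-site.xml. Everything else keeps the value from nutch-default.xml.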

6. Using the metadata plugin

I wrote a separate post on how to use it; see the link below.

Nutch: capturing meta tag data from pages with the metadata plugin (鸭梨的药丸哥's blog, CSDN)

7. Nutch 2.4 storage configuration

I have written a post on this as well; the link is below:

Nutch 2.4 storage configuration (鸭梨的药丸哥's blog, CSDN)
