Introduction to Nutch, Part 2: Searching

 

In part one of this two-part series on Nutch, the open-source Java search engine, we looked at how to crawl websites. Recall that the Nutch crawler system produces three key data structures:

  1. The WebDB containing the web graph of pages and links.
  2. A set of segments containing the raw data retrieved from the Web by the fetchers.
  3. The merged index created by indexing and de-duplicating parsed data from the segments.

In this article, we turn to searching. The Nutch search system uses the index and segments generated during the crawling process to answer users' search queries. We shall see how to get the Nutch search application up and running, and how to customize and extend it for integration into an existing website. We'll also look at how to re-crawl sites to keep your index up to date--a requirement of all real-world search engines.

Running the Search Application

Without further ado, let's run a search using the results of the crawl we did last time. Tomcat seems to be the most popular servlet container for running Nutch, so let's assume you have it installed (although there is some guidance on the Nutch wiki for Resin). The first step is to install the Nutch web app. There are some reported problems with running Nutch (version 0.7.1) as a non-root web app, so it is currently safest to install it as the root web app. This is what the Nutch tutorial advises. If Tomcat's web apps are in ~/tomcat/webapps/, then type the following in the directory where you unpacked Nutch:

rm -rf ~/tomcat/webapps/ROOT*
cp nutch*.war ~/tomcat/webapps/ROOT.war

The second step is to ensure that the web app can find the index and segments that we generated last time. Nutch looks for these in the index and segments subdirectories of the directory defined in the searcher.dir property. The default value for searcher.dir is the current directory (.), which is where you started Tomcat. While this may be convenient during development, often you don't have so much control over the directory in which Tomcat starts up, so you want to be explicit about where the index and segments may be found. Recall from part one that Nutch's configuration files are found in the conf subdirectory of the Nutch distribution. For the web app, these files can be found in WEB-INF/classes/. So we simply create a file called nutch-site.xml in this directory (of the unpacked web app) and set searcher.dir to be the crawl directory containing the index and segments.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<nutch-conf>
<property>
<name>searcher.dir</name>
<value>/Users/tom/Applications/nutch-0.7.1/crawl-tinysite</value>
</property>
</nutch-conf>

After restarting Tomcat, enter the URL of the root web app in your browser (in this example, I'm running Tomcat on port 80, but the default is port 8080) and you should see the Nutch home page. Do a search and you will get a page of search results like Figure 1.

Figure 1. Nutch search results for the query "animals"

The search results are displayed using the format used by all mainstream search engines these days. The explain and anchors links that are shown for each hit are unusual and deserve further comment.

Score Explanation

Clicking the explain link for the page A hit brings up the page shown in Figure 2. It shows some metadata for the page hit (page A), and a score explanation. The score explanation is a Lucene feature that shows all of the factors that contribute to the calculated score for a particular hit. The formula for score calculation is rather technical, so it is natural to ask why Nutch exposes this page at all, when it is clearly unsuitable for the average user.

Figure 2. Nutch's score explanation page for page A, matching the query "animals"

One of Nutch's key selling points is its transparency. Its ranking algorithms are open source, so anyone can see them. Nutch's ability to "explain" its rankings online--via the explain link--takes this one step further and allows an (expert) user to see why one particular hit ranked above another for a given search. In practice, this page is only really useful for diagnostic purposes for people running a Nutch search engine, so there is no need to expose it publicly, except perhaps for PR reasons.

Anchors

The anchors page (not illustrated here) provides a list of the incoming anchor text for the pages that link to the page of interest. In this case, the link to page A from page B had the anchor text "A." Again, this is a feature for Nutch site maintainers rather than the average user of the site.

Integrating Nutch Search

While the Nutch web app is a great way to get started with search, most projects using Nutch require the search function to be more tightly integrated with their application. There are various ways to achieve this, depending on the application. The two ways we'll look at here are using the Nutch API and using the OpenSearch API.

Using the Nutch API

If your application is written in Java, then it is worth considering using Nutch's API directly to add a search capability. Of course, the Nutch web app is written using the Nutch API, so you may find it fruitful to use it as a starting point for your application. If you take this approach, the files to take a look at first are the JSPs in src/web/jsp in the Nutch distribution.

To demonstrate Nutch's API, we'll write a minimal command-line program to perform a search. We'll run the program using Nutch's launcher, so for the search we did above, for the term "animals," we type:

bin/nutch org.tiling.nutch.intro.SearchApp animals

And the output is as follows.

'A' is for Alligator (http://www.java.net/external?url=http://keaton/tinysite/A.html)
  <b> ... </b>Alligators' main prey are smaller <b>animals</b> that they can kill and<b> ... </b>

'C' is for Cow (http://www.java.net/external?url=http://keaton/tinysite/C.html)
  <b> ... </b>leather and as draught <b>animals</b> (pulling carts, plows and<b> ... </b>

Here's the program that achieves this. To get it to run, the compiled class is packaged in a .jar file, which is then placed in Nutch's lib directory. See the Resources section to obtain the .jar file.

package org.tiling.nutch.intro;

import java.io.IOException;

import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;

public class SearchApp {

  private static final int NUM_HITS = 10;

  public static void main(String[] args)
      throws IOException {

    if (args.length == 0) {
      String usage = "Usage: SearchApp query";
      System.err.println(usage);
      System.exit(-1);
    }

    // NutchBean opens the index and segments found via searcher.dir
    NutchBean bean = new NutchBean();
    // Parse the query string using Nutch's own query parser
    Query query = Query.parse(args[0]);
    // Run the search, returning at most NUM_HITS hits
    Hits hits = bean.search(query, NUM_HITS);

    for (int i = 0; i < hits.getLength(); i++) {
      Hit hit = hits.getHit(i);
      // Look up the stored index fields for this hit
      HitDetails details = bean.getDetails(hit);

      String title = details.getValue("title");
      String url = details.getValue("url");
      // Build an HTML summary showing the query terms in context
      String summary =
        bean.getSummary(details, query);

      System.out.print(title);
      System.out.print(" (");
      System.out.print(url);
      System.out.println(")");
      System.out.println("\t" + summary);
    }

  }

}

Although it's a short and simple program, Nutch is doing lots of work for us, so we'll examine it in some detail. The central class here is NutchBean--it orchestrates the search for us. Indeed, the doc comment for NutchBean states that it provides "One-stop shopping for search-related functionality."

Upon construction, the NutchBean object opens the index it is searching against in read-only mode, and reads the set of segment names and filesystem locations into memory. The index and segments locations are configured in the same way as they were for the web app: via the searcher.dir property.

Before we can perform the search, we parse the query string given as the first parameter on the command line (args[0]) into a Nutch Query object. The Query.parse() method invokes Nutch's specialized parser (org.apache.nutch.analysis.NutchAnalysis), which is generated from a grammar using the JavaCC parser generator. Although Nutch relies heavily on Lucene for its text indexing, analysis, and searching capabilities, there are many places where Nutch enhances or provides different implementations of core Lucene functions. This is the case for Query, so be careful not to confuse Lucene's org.apache.lucene.search.Query with Nutch's org.apache.nutch.searcher.Query. The types represent the same concept (a user's query), but they are not type-compatible with one another.

With a Query object in hand, we can now ask the bean to do the search for us. It does this by translating the Nutch Query into an optimized Lucene Query, then carrying out a regular Lucene search. Finally, a Nutch Hits object is returned, which represents the top matches for the query. This object only contains index and document identifiers. To return useful information about each hit, we go back to the bean to get a HitDetails object for each hit we are interested in, which contains the data from the index. We retrieve only the title and URL fields here, but there are more fields available: the field names may be found using the getField(int i) method of HitDetails.
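
For instance, to dump every stored field for a hit, you could iterate over the details rather than asking for fields by name. This is a minimal sketch, assuming HitDetails also exposes getLength() and getValue(int i) accessors mirroring getField(int i):

// Inside the hit loop above: list all index fields for this hit.
// Assumes HitDetails.getLength()/getField(int)/getValue(int); check
// the API docs for the exact accessors in your Nutch version.
HitDetails details = bean.getDetails(hit);
for (int j = 0; j < details.getLength(); j++) {
  System.out.println(details.getField(j) + " = " + details.getValue(j));
}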

The last piece of information that is displayed by the application is a short HTML summary that shows the context of the query terms in each matching document. The summary is constructed by the bean's getSummary() method. The HitDetails argument is used to find the segment and document number for retrieving the document's parsed text, which is then processed to find the first occurrence of any of the terms in the Query argument. Note that the amount of context to show in the summary--that is, the number of terms before and after the matching query terms--and the maximum summary length are both Nutch configuration properties (searcher.summary.context and searcher.summary.length, respectively).
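
For example, to show more context around each match and allow longer summaries, you could override both properties in the web app's nutch-site.xml (the values here are illustrative, not the defaults):

<property>
  <name>searcher.summary.context</name>
  <!-- terms of context shown around each matching term -->
  <value>8</value>
</property>

<property>
  <name>searcher.summary.length</name>
  <!-- maximum total length of the summary, in terms -->
  <value>30</value>
</property>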

That's the end of the example, but you may not be surprised to learn that NutchBean provides access to more of the data stored in the segments, such as cached content and fetch date. Take a look at the API documentation for more details.
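
As a rough sketch of what else is available (the accessors below are assumed from the 0.7 API, which the web app's cached.jsp and explain.jsp pages use; verify the exact signatures against the API documentation):

// Further NutchBean accessors (assumed; verify against the API docs
// before relying on the exact signatures).
byte[] content = bean.getContent(details);    // cached raw content
long fetchDate = bean.getFetchDate(details);  // when the page was fetched
String explanation =
  bean.getExplanation(query, hit);            // HTML score explanation
String[] anchors = bean.getAnchors(details);  // incoming anchor text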

Using the OpenSearch API

OpenSearch is an extension of RSS 2.0 for publishing search engine results, and was developed by A9.com, the search engine owned by Amazon.com. Nutch supports OpenSearch 1.0 out of the box. The OpenSearch results for the search in Figure 1 can be accessed by clicking on the RSS link in the bottom right-hand corner of the page. This is the XML that is returned:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:nutch="http://www.nutch.org/opensearchrss/1.0/"
  xmlns:opensearch="http://a9.com/-/spec/opensearchrss/1.0/">

  <channel>
    <title>Nutch: animals</title>
    <description>Nutch search results for query: animals</description>
    <link>http://localhost/search.jsp?query=animals&amp;start=0&amp;hitsPerDup=2&amp;hitsPerPage=10</link>

    <opensearch:totalResults>2</opensearch:totalResults>
    <opensearch:startIndex>0</opensearch:startIndex>
    <opensearch:itemsPerPage>10</opensearch:itemsPerPage>

    <nutch:query>animals</nutch:query>

    <item>
      <title>'A' is for Alligator</title>
      <description>&lt;b&gt; ... &lt;/b&gt;Alligators'
      main prey are smaller &lt;b&gt;animals&lt;/b&gt;
      that they can kill and&lt;b&gt; ... &lt;/b&gt;</description>
      <link>http://keaton/tinysite/A.html</link>

      <nutch:site>keaton</nutch:site>
      <nutch:cache>http://localhost/cached.jsp?idx=0&amp;id=0</nutch:cache>
      <nutch:explain>http://localhost/explain.jsp?idx=0&amp;id=0&amp;query=animals</nutch:explain>
      <nutch:docNo>0</nutch:docNo>
      <nutch:segment>20051025121334</nutch:segment>
      <nutch:digest>fb8b9f0792e449cda72a9670b4ce833a</nutch:digest>
      <nutch:boost>1.3132616</nutch:boost>
    </item>

    <item>
      <title>'C' is for Cow</title>
      <description>&lt;b&gt; ... &lt;/b&gt;leather
      and as draught &lt;b&gt;animals&lt;/b&gt;
      (pulling carts, plows and&lt;b&gt; ... &lt;/b&gt;</description>
      <link>http://keaton/tinysite/C.html</link>

      <nutch:site>keaton</nutch:site>
      <nutch:cache>http://localhost/cached.jsp?idx=0&amp;id=2</nutch:cache>
      <nutch:explain>http://localhost/explain.jsp?idx=0&amp;id=2&amp;query=animals</nutch:explain>
      <nutch:docNo>1</nutch:docNo>
      <nutch:segment>20051025121339</nutch:segment>
      <nutch:digest>be7e0a5c7ad9d98dd3a518838afd5276</nutch:digest>
      <nutch:boost>1.3132616</nutch:boost>
    </item>

  </channel>
</rss>

This document is an RSS 2.0 document, where each hit is represented by an item element. Notice the two extra namespaces, opensearch and nutch, which allow search-specific data to be included in the RSS document. For example, the opensearch:totalResults element tells you the number of search results available (not just those returned in this page). Nutch also defines its own extensions, allowing consumers of this document to access page metadata or related resources, such as the cached content of a page, via the URL in the nutch:cache element.

Using OpenSearch to integrate Nutch is a great fit if your front-end application is not written in Java. For example, you could build a PHP front end to Nutch with a search page that calls the OpenSearch servlet, parses the RSS response, and displays the results.
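
Whatever the language, the client-side work is the same: fetch the RSS and walk its item elements. Here is a minimal sketch in Java (assuming the OpenSearch servlet is mapped to /opensearch, as in the web app's default web.xml; the host, port, and query are placeholders):

import java.net.URL;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class OpenSearchClient {

  public static void main(String[] args) throws Exception {
    // Fetch the OpenSearch RSS for a query (placeholder host and port)
    URL url = new URL("http://localhost:8080/opensearch?query=animals");
    DocumentBuilder builder =
      DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(url.openStream());

    // Each hit is an RSS item; print its title and link
    NodeList items = doc.getElementsByTagName("item");
    for (int i = 0; i < items.getLength(); i++) {
      Element item = (Element) items.item(i);
      String title =
        item.getElementsByTagName("title").item(0).getTextContent();
      String link =
        item.getElementsByTagName("link").item(0).getTextContent();
      System.out.println(title + " (" + link + ")");
    }
  }

}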

Real-World Nutch Search

The examples we have looked at so far have been very simple in order to demonstrate the concepts behind Nutch. In a real Nutch setup, other considerations come into play. One of the most frequently asked questions on the Nutch newsgroups concerns keeping the index up to date. The rest of this article looks at how to re-crawl pages to keep your search results fresh and relevant.

Re-Crawling

Unfortunately, re-crawling is not as simple as re-running the crawl tool that we saw in part one. Recall that this tool creates a pristine WebDB each time it is run, and starts compiling lists of URLs to fetch from a small set of seed URLs. A re-crawl, by contrast, starts with the WebDB structure from the previous crawl and constructs the fetchlist from there. This is generally a good idea, as most sites have a relatively static URL scheme. It is, however, possible to filter out the transient portions of a site's URL space that should not be crawled by editing the conf/regex-urlfilter.txt configuration file. Don't be confused by the similarity between conf/crawl-urlfilter.txt and conf/regex-urlfilter.txt--while they both have the same syntax, the former is used only by the crawl tool, and the latter by all other tools.
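
Each line in these filter files is a regular expression prefixed with + (fetch URLs matching this pattern) or - (skip them), and the first rule that matches a URL decides its fate. For example, to keep session-tracking URLs out of a re-crawl (the pattern here is hypothetical):

# Skip URLs that carry a session ID (hypothetical pattern)
-[?&]sessionid=

# Accept everything else
+.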

The re-crawl amounts to running the generate/fetch/update cycle, followed by index creation. To accomplish this, we employ the lower-level Nutch tools to which the crawl tool delegates. Here is a simple shell script to do it:

#!/bin/bash

# A simple script to run a Nutch re-crawl

if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi

if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi

if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi

webdb_dir=$crawl_dir/db
segments_dir=$crawl_dir/segments
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done

# Update segments
mkdir tmp
bin/nutch updatesegs $webdb_dir $segments_dir tmp
rm -R tmp

# Index segments
for segment in `ls -d $segments_dir/* | tail -$depth`
do
  bin/nutch index $segment
done

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup $segments_dir bogus

# Merge indexes
ls -d $segments_dir/* | xargs bin/nutch merge $index_dir


To re-crawl the toy site we crawled in part one, we would run:

./recrawl crawl-tinysite 3

The script is practically identical to the crawl tool except that it doesn't create a new WebDB or inject it with seed URLs. Like crawl, the script takes an optional second argument, depth, which controls the number of iterations of the generate/fetch/update cycle to run (the default is five). Here we have specified a depth of three. This allows us to pick up new links that may have been created since the last crawl.

The script supports a third argument, adddays, which is useful for forcing pages to be retrieved even if they are not yet due to be re-fetched. The page re-fetch interval in Nutch is controlled by the configuration property db.default.fetch.interval, and defaults to 30 days. The adddays argument can be used to advance the clock for fetchlist generation (but not for calculating the next fetch time), thereby fetching pages early.
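
The interval itself can also be changed. For example, to re-fetch pages weekly, override the property in nutch-site.xml (the value is in days):

<property>
  <name>db.default.fetch.interval</name>
  <!-- number of days between re-fetches of a page -->
  <value>7</value>
</property>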

Updating the Live Search Index

Even with the re-crawl script, we have a problem with updating the live search index. As mentioned above, the NutchBean class opens the index to search when it is initialized. Since the Nutch web app caches the NutchBean in the application servlet context, updates to the index will never be picked up as long as the servlet container is running.

This problem is recognized by the Nutch community, so it will likely be fixed in an upcoming release (Nutch 0.7.1 was the stable release at the time of writing). Until Nutch provides a way to do it, you can work around the problem--possibly the simplest way is to reload the Nutch web app after the re-crawl completes. More sophisticated ways of solving the problem are discussed on the newsgroups. These typically involve modifying NutchBean and the search JSP to pick up changes to the index.
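
For example, on a stock Tomcat installation with the manager application enabled, the re-crawl script could finish by reloading the root web app (the credentials here are placeholders):

# Reload the root web app so that NutchBean reopens the new index.
# Assumes Tomcat's manager app is enabled; admin/secret are placeholders.
curl -u admin:secret "http://localhost:8080/manager/reload?path=/"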

Conclusion

In this two-article series, we introduced Nutch and discovered how to crawl a small collection of websites and run a Nutch search engine using the results of the crawl. We covered the basics of Nutch, but there are many other aspects to explore, such as the numerous plugins available to customize your setup, the tools for maintaining the search index (type bin/nutch to get a list), or even whole-web crawling and searching. Possibly the best thing about Nutch, though, is its vibrant user and developer community, which is continually coming up with new ideas and ways to do all things search-related.
