Publisher's Note: we are very pleased to feature this valuable research on free and open source search engines on SearchTools.com. It was originally written around 2004 and revised in April 2006.



A Comparison of Free Search Engine Software [1]

Yiling Chen (yilingchen7 [at] yahoo [dot] com)

Abstract: This paper reviews nine search engine software packages that are free to users: Alkaline, Fluid Dynamics, ht://Dig, Juggernautsearch, mnoGoSearch, Perlfect, SWISH-E, Webinator, and Webglimpse. Their features and functionalities are compared and contrasted, with emphasis on searching mechanisms, crawler and indexer features, and searching features.

1. Motivation

The Internet and computer technology have immeasurably increased the availability of information. However, as the size of information systems increases, it becomes harder for users to retrieve relevant information. Search engines have been developed to facilitate fast information retrieval. There are many software packages for search engine construction on the Internet. The website searchtools.com alone lists more than 170 search tools, many of which are free or free for noncommercial use. With so many software packages, selecting suitable search engine software is as hard as, if not harder than, retrieving relevant information efficiently from websites. Motivated by a desire to aid website administrators in choosing a suitable search engine, this paper reviews the basic information, features, and functionalities of nine free search engine software packages: Alkaline, Fluid Dynamics, ht://Dig, Juggernautsearch 1.0.1, mnoGoSearch, Perlfect, SWISH-E, Webinator, and Webglimpse 2.x.

The remainder of the paper starts with an introduction to free search engine software. Then, we summarize basic information such as source code availability and platform compatibility for the nine software packages. After that, their features and functionalities are compared and contrasted. Finally, we conclude with a summary of our comparison results.

2. Introduction to Free Search Engine Software

Free search engine software can be found at websites such as searchtools.com, sourceforge.net, searchenginewatch.com, and codebeach.com. Some packages are freeware distributed only as binary files, while others are open source software. In general, however, free search engine software is not well documented and has undergone few formal tests, which makes it difficult to understand the functionality it provides.

Depending on who provides the actual search service, free search tools can be categorized into remote site search services and server-side search engines. In the former, the indexer and query engine run on a remote server that stores the index file. At search time, a form on a user's local Web page sends a message to the remote search engine, which then sends the query results back to the user. A server-side search engine is what we usually think of as a search engine: it runs on the user's own server, consuming that server's CPU time and disk space. In this paper, the term search engine refers only to server-side search engines.

According to what is indexed, search engines are classified as file system search engines and website search engines. File system search engines index only files in the server's local file system. Website search engines can index remote servers by feeding URLs to web crawlers. Most search engines combine the two functions, and can index both local file systems and remote servers. The nine search engine software packages compared here are all website search engines, some of which can index local file systems.

A fully functional website search engine software package should have the following four blocks:

  • A Web Crawler or Spider that follows HTML links in Web pages to gather documents;
  • An Indexer that indexes the documents crawled using some indexing rules and saves the indexed results for searching;
  • A Query Engine that performs the actual search and returns ranked results;
  • An Interface that allows users to interact with the query engine.

The nine software packages we compare either have all four blocks or allow adding the missing blocks.

3. Basic Information of the Nine Search Engine Software Packages

This section provides some basic information about each search engine software package: licensing, where to find it, source code availability, documentation availability, the implementation language, platform compatibility, completeness of the package, and who built it.

Licensing refers to whether the software is freeware or is free under some conditions. Source code availability gives the website address for downloading the source code, if it is available. Documentation availability indicates where to find the documentation files. What is it written in tells which programming language was used to implement the software. Platform compatibility specifies which operating systems the software can run on. If the software package is fully functional, i.e. it has a web crawler, an indexer, a query engine, and a query interface, we consider the package complete. Who built it identifies the developers of the software.

A website administrator who is looking for a suitable software package can check this information first to decide whether a package is a potential candidate. For example, if the search engine software cannot be installed on the platform on which the web server is running, there is no need for the administrator to look into the software's specific features. We summarize the basic information of the nine software packages in Table 1.

Table 1: Basic Information of Nine Search Engine Software Packages

Alkaline
  Licensing: Free for non-commercial use
  Where to find: alkaline.vestris.com
  Source code availability: Not available in the public domain; can be purchased under license
  Documentation availability: User's Guide (PDF); FAQ (PDF)
  What is it written in: C++
  Platform compatibility: Linux, Solaris, IRIX, BSDI, FreeBSD, Win NT/2000/XP Pro
  Completeness of package: Complete
  Who built it: Daniel Doubrovkine, founder of Vestris Inc., and Hassan Sultan, who developed the cellular expansion algorithm

Fluid Dynamics
  Licensing: Freeware available; free trial shareware
  Where to find: www.xav.com/
  Source code availability: Available @ www.xav.com/scripts/search/install.html
  Documentation availability: Some information
  What is it written in: Perl
  Platform compatibility: Unix, Linux, Win95/98/ME/NT/2000
  Completeness of package: Complete
  Who built it: Copyrighted by Zoltan Milosevic, Fluid Dynamics Software Corporation

ht://Dig
  Licensing: Free
  Where to find: htdig.sourceforge.net or www.htdig.org/
  Source code availability: Stable release 3.1.6 @ www.htdig.org/files/htdig-3.1.6.tar.gz; beta release 3.2.0b6 @ www.htdig.org/files/htdig-3.2.0b6.tar.gz
  Documentation availability: Available @ www.htdig.org/
  What is it written in: C, C++
  Platform compatibility: Solaris, HP/UX, IRIX, SunOS, Linux, Mac OS X, Mac OS 9
  Completeness of package: Complete
  Who built it: Loic Dachary and Geoff Hutchinson, San Diego State University

Juggernautsearch 1.0.1
  Licensing: Free for personal use
  Where to find: juggernautsearch.com
  Source code availability: Available @ juggernautsearch.com/JS.1.0.1.tgz
  Documentation availability: Installation and Operation Guide @ juggernautsearch.com/JSInstall.htm; Executive Summary @ juggernautsearch.com/
  What is it written in: Perl
  Platform compatibility: Unix, Linux; Win NT/2000/XP for the non-free version
  Completeness of package: Complete
  Who built it: Donald Kasper et al. of HyperProject, Inc.

mnoGoSearch
  Licensing: Free Unix version
  Where to find: mnogosearch.org/
  Source code availability: Available @ mnogosearch.org/download.html
  Documentation availability: Reference manual @ mnogosearch.org/doc/
  What is it written in: C
  Platform compatibility: Unix, Linux, FreeBSD, Mac OS X
  Completeness of package: Complete
  Who built it: Alexander Barkov, Mark Napartovich, Ramil Kalimullin, Aleksey Botchkov, Sergei Kartashoff, et al. of Lavtech.Corp.

Perlfect
  Licensing: Free
  Where to find: perlfect.com/freescripts/search/
  Source code availability: Available @ perlfect.com/freescripts/search/
  Documentation availability: Readme, FAQ, and example configuration file @ perlfect.com/freescripts/search/
  What is it written in: Perl
  Platform compatibility: Win NT, Unix, Linux
  Completeness of package: Complete
  Who built it: N. Moraitakis and G. Zervas

SWISH-E
  Licensing: Free
  Where to find: swish-e.org/
  Source code availability: Available @ swish-e.org/download/index.html
  Documentation availability: Available @ swish-e.org/docs/index.html
  What is it written in: C, Perl
  Platform compatibility: SunOS, FreeBSD, NetBSD, Linux, OSF/1, AIX, Windows NT
  Completeness of package: Needs an additional CGI script to invoke searching
  Who built it: The original version, SWISH, was built by Kevin Hughes; in Fall 1996 the Library of UC Berkeley received permission from Kevin Hughes to implement bug fixes and enhancements to the original binary, hence SWISH-E

Webinator
  Licensing: Free for up to 10,000 pages and 10,000 hits per day
  Where to find: www.thunderstone.com/
  Source code availability: Not in the public domain
  Documentation availability: Available @ www.thunderstone.com/site/webinator5man/ or www.thunderstone.com/site/webinator5man/webinator5.pdf
  What is it written in: Vortex, Texis's Web script language
  Platform compatibility: Unix (Solaris SPARC, Linux Intel, SGI Irix 4/5/6, Unixware, Solaris x86, BSDI, AT&T SVR4 386, SunOS 4, SCO 5/5.02, DEC Alpha Unix 3/4, HP-UX 10, IBM AIX 4.2); Windows NT and 2000
  Completeness of package: Complete
  Who built it: Thunderstone - EPI Inc.

Webglimpse 2.x
  Licensing: Free version for educational and governmental use
  Where to find: www.webglimpse.net/
  Source code availability: Glimpse and Webglimpse available @ webglimpse.net/download.php
  Documentation availability: Available @ www.webglimpse.net/subdocs/ or webglimpse.org/pubs/webglimpse.pdf
  What is it written in: Glimpse (the full-text search engine) in C; Webglimpse (the spider and indexer) in Perl
  Platform compatibility: Solaris, SunOS, OpenBSD, AIX, IRIX, Mach, OSF, Rhapsody (Mac OS X)
  Completeness of package: Complete
  Who built it: University of Arizona

4. Comparison and Contrast of the Nine Search Engine Software Packages

4.1 Comparison Criteria

We compare and contrast the nine software packages from the following four perspectives.

4.1.1 Searching Mechanism

We consider the indexing method and the ranking method to be the searching mechanism of a search engine, since these two methods usually determine how much disk space the search engine requires, how fast the indexing process is, and how fast and accurate the search process is.

  • Indexing method

Most search engines operate on the principle that pre-indexed data is easier and faster to search than raw data. The form and quality of the index created from the original documents is of paramount importance to how searches are performed. The most commonly used indexing method is the full-text inverted index. It takes a large amount of disk space and the indexing process is slow, because it keeps most of the information in a document. Another method is to index only the title, keywords, description, and author parts of a document. In this way, the indexing process can be very fast and the resulting index is relatively small. Some search engines have their own novel indexing methods: WebGlimpse uses two-level indexing, which we introduce later, and Alkaline applies its Cellular Expansion Algorithm, which remains a trade secret.

  • Relevance Ranking

The ranking method decides a document's relevance to a query. Factors such as word frequency in the document, word position in the text, and link popularity are usually considered; different search engines weigh different factors. A minimal sketch combining a full-text inverted index with word-weight ranking follows this list.
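
To make these two methods concrete, here is a minimal sketch in Python. It is written for this review rather than taken from any of the packages; the field names and weights are illustrative assumptions.

    import re
    from collections import defaultdict

    # Illustrative field weights (an assumption, not from any reviewed package):
    # a word in the title counts more than the same word in the body.
    FIELD_WEIGHTS = {"title": 5.0, "body": 1.0}

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    class InvertedIndex:
        def __init__(self):
            # word -> {doc_id -> accumulated weight}
            self.postings = defaultdict(lambda: defaultdict(float))

        def add(self, doc_id, fields):
            # Full-text indexing: every word of every field enters the index.
            for field, text in fields.items():
                for word in tokenize(text):
                    self.postings[word][doc_id] += FIELD_WEIGHTS.get(field, 1.0)

        def search(self, query):
            # Word-weight ranking: sum each query word's weight per document.
            scores = defaultdict(float)
            for word in tokenize(query):
                for doc_id, weight in self.postings.get(word, {}).items():
                    scores[doc_id] += weight
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    index = InvertedIndex()
    index.add("a.html", {"title": "Search engines", "body": "Engines crawl and index pages."})
    index.add("b.html", {"title": "Cooking", "body": "A search for recipes."})
    print(index.search("search engines"))  # a.html outranks b.html

The sketch stores one weight per (word, document) pair; a package that indexes only titles and keywords would simply feed fewer fields to add().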

4.1.2 Crawler and Indexer Features

We compare the following functionalities of built-in web crawlers and indexers; a sketch of the first two functionalities follows the list.

  • Robots Exclusion Standard Support - Does the crawler respect the Robots Exclusion Standard, i.e. does it skip documents disallowed by the robots.txt file?
  • Crawler Retrieval Depth Control - Can the administrator control the maximum depth that a crawler follows in a retrieval process?
  • Duplicate Detection - During crawling and indexing, can duplicate documents be detected and excluded from the index?
  • File Formats to be Indexed - Files in which formats can be crawled and indexed by the crawler and indexer?
  • Index Protected Server - Can the crawler retrieve secured pages on password-protected sites?
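
The sketch below illustrates the first two of these features with a minimal depth-limited crawler built on Python's standard library. It assumes a single-site crawl; the start URL is hypothetical, and the link extraction is deliberately crude.

    import re
    import urllib.robotparser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def extract_links(html, base):
        # Crude href extraction; real crawlers use a proper HTML parser.
        return [urljoin(base, h.decode("utf-8", "ignore"))
                for h in re.findall(rb'href="([^"]+)"', html)]

    def crawl(start_url, max_depth=2):
        """Breadth-first crawl honoring robots.txt and a maximum link depth."""
        rp = urllib.robotparser.RobotFileParser(urljoin(start_url, "/robots.txt"))
        rp.read()
        seen, queue, pages = set(), [(start_url, 0)], {}
        while queue:
            url, depth = queue.pop(0)
            if url in seen or depth > max_depth:
                continue                        # retrieval depth control
            if not rp.can_fetch("*", url):
                continue                        # Robots Exclusion Standard
            seen.add(url)
            html = urlopen(url).read()
            pages[url] = html
            for link in extract_links(html, base=url):
                queue.append((link, depth + 1))
        return pages

    # pages = crawl("http://www.example.com/", max_depth=1)  # hypothetical site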

4.1.3 Searching Features

Searching features are considered from ten aspects (a small sketch of the Boolean and phrase-matching features follows the list):

  • Boolean Search - Can the search engine look up pages containing some word and not containing some other word? Does the search engine support the AND and OR logic among query words?
  • Phrase Matching - Can the search engine match only those documents that contain words in exactly the same sequence as that of the query?
  • Attribute Search - Can search engine perform search within only the body, title, description, keywords, URL, or other parts of documents?
  • Fuzzy Search - Can the search engine match documents that contain words similar to the requested query? Is search by soundex, metaphone, or substring supported?
  • Word Forms -- Is word stemming supported?
  • Wild Card - Is there a wild card character that can be used in a search to match one or more arbitrary characters or symbols?
  • Regular Expression - Regular expressions are symbols that users add to their queries to describe complex patterns to match. Is regular expression search supported?
  • Numeric Data Search - Can the search engine deal with numeric queries such as "Quantity > 300"?
  • Case Sensitivity - Is the search engine case sensitive, or can it be configured as case sensitive?
  • Natural Language Query - Does the search engine support natural language queries?
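
To ground the first two aspects, here is an illustrative Python sketch (not taken from any reviewed package) that evaluates a query with '+'/'-' Boolean operators and quoted phrases against a document's raw text.

    import re

    def matches(document, query):
        """Evaluate '+word', '-word', and "quoted phrase" query syntax.

        Bare terms are treated as optional here; real engines differ."""
        text = document.lower()
        words = set(re.findall(r"[a-z0-9]+", text))
        # Phrase matching: every quoted phrase must appear as an exact sequence.
        for phrase in re.findall(r'"([^"]+)"', query.lower()):
            if phrase not in text:
                return False
        # Boolean operators on the remaining single terms.
        for term in re.sub(r'"[^"]+"', " ", query.lower()).split():
            if term.startswith("+") and term[1:] not in words:
                return False    # required word missing
            if term.startswith("-") and term[1:] in words:
                return False    # excluded word present
        return True

    doc = "ht://Dig is a free search engine for small sites."
    print(matches(doc, '+free -commercial "search engine"'))  # True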

4.1.4 Other Features

We only consider three features in this category:

  • International Language - Can the search engine support languages other than English?
  • Page Limit - How many pages can be indexed for the free version of the software? What is the theoretical or empirical limit?
  • Customizable Result Formatting - Can the result pages be customized to have a desired look and feel?

4.2 Features of the Nine Search Engine Software Packages

We compare and contrast the features of the nine search engine software packages according to the above criteria. The main results are summarized in Table 2. Individual analyses are provided in the subsections that follow.

Table 2: Features and Functionalities of the Nine Search Engine Software Packages

(Values are given in the order: Alkaline, Fluid Dynamics, ht://Dig, Juggernautsearch 1.0.1, mnoGoSearch, Perlfect, SWISH-E, Webinator, Webglimpse 2.x.)

Searching Mechanism
  Indexing Method: Cellular Expansion Algorithm | Attribute indexing | Inverted index | Keyword index | Inverted index | Inverted index | Don't know | Inverted index | Two-level query
  Relevance Ranking: Word weight | Word frequency and word weight | Word weight | Word weight | Word weight | Gerald Salton algorithm | Don't know | See Section 4.2.8 | Don't know

Crawler and Indexer Features
  Robots Exclusion Standard Support: Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes
  Crawler Retrieval Depth Control: Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes
  Duplicate Page Detection: Yes | Yes | Yes | Yes | Yes | Don't know | Yes | Yes | Don't know
  File Formats Indexed:
    Alkaline: html, htm, text, shtml, PDF, embedded Shockwave Flash objects, doc, rtf, LaTeX/TeX, WordPerfect, XML, MPEG Layer 3
    Fluid Dynamics: html, htm, shtml, shtm, stm, txt, mp3, PDF
    ht://Dig: html, txt, PDF, MS Word, PowerPoint, PostScript, Excel
    Juggernautsearch: txt, htm, html, shtm, shtml, ppt, doc, xls, ps, rtf, BAT, C, CGI, CXX, CPP, H, Java, PHP, PL
    mnoGoSearch: html, txt, pdf, ps, doc, MP3, SQL database text fields
    Perlfect: html, txt, pdf
    SWISH-E: html, XML, txt, doc, PDF, gzipped files
    Webinator: html, htm, txt, pdf, doc, swf, WordPerfect, asp, jsp, shtml, jhtml, phtml
    Webglimpse: HTML documents, Word, PDF, and any other documents that can be filtered to plain text
  Index Protected Server: Yes | No | Yes | No | Yes | No | No | No | No

Searching Features
  Boolean Search: Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes
  Phrase Matching: No | Yes | No | No | Yes | No | Yes | Yes | No
  Attribute Search: Yes | Yes | Yes | Yes | Yes | No | Yes | No | No
  Fuzzy Search: No | No | Yes | Don't know | Yes | No | Yes | Yes | Yes
  Word Forms: Yes | Yes | Yes | Don't know | Yes | No | Yes | Yes | No
  Wild Card: Yes | Yes | Yes | Don't know | Yes | No | Yes | Yes | Yes
  Regular Expression: No | No | No | Don't know | No | No | No | Yes | Yes
  Numeric Data Search: Yes | No | No | No | No | No | No | Yes | No
  Case Sensitivity: Yes | No | No | Don't know | No | No | No | No | Yes
  Natural Language Query: No | No | No | No | No | No | No | Yes | No

Other Features
  International Language: No | Latin-extended languages | Yes | No | Yes | Yes | Yes | No | Yes
  Page Limit:
    Alkaline: theoretical limit 2 billion documents; recommended 50,000-500,000 pages
    Fluid Dynamics: no "hard" limit; "soft" limit about 100,000 documents
    ht://Dig: no theoretical limit; can exceed 100,000 pages
    Juggernautsearch: unlimited
    mnoGoSearch: several million
    Perlfect: 1,000+ pages
    SWISH-E: Don't know
    Webinator: 10,000 pages for the free version
    Webglimpse: Don't know
  Customizable Result Formatting: Yes | Yes | Yes | Don't know | Yes | Yes | Don't know | Yes | No

4.2.1 Alkaline

Alkaline is a powerful search server. It supports most of the features we discussed here [2] [3].

  • Searching Mechanism

Alkaline uses the concept of "cellular expansion" to index and search documents. The cellular expansion algorithm is a technique for hashing and quickly finding short binary blobs. It is claimed that the algorithm makes searching for incomplete word forms across 500,000 documents blazing fast, but I have not been able to find any published description of the algorithm.

Alkaline uses an adaptive mechanism that is said to be able to closely match the results to the elements searched. The more extensive the search query is, the better the relevance the user gets. The word weight ranking gives different weight to words in title, meta keywords, words in description, and words in text body. Alkaline has the Weight option to modify ranking weights. Another option Alkaline provides for changing the ranking is WeakWords. Words in the WeakWords list are assigned lower weight.

  • Crawler and Indexer Features

Alkaline supports robot directives. AlkalineBOT is the registered robot. Alkaline is compliant with the /robots.txt directives. It will not follow a link if a <meta name="robots" content="nofollow"> tag is found, and it will not index document contents if a <meta name="robots" content="noindex"> tag is found. By specifying Robots=N in the configuration file, Alkaline's robots support can be disabled.

Alkaline allows administrators to define the maximum depth of URLs to follow. The MD5 digest mechanism [4] within Alkaline can identify and ignore symbolic links and duplicate documents, such as http://www.abc.com and http://www.abc.com/index.html.
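
Alkaline's exact implementation is not published; the Python sketch below merely illustrates the general digest approach, flagging two URLs that serve identical content as duplicates.

    import hashlib

    seen = {}  # digest -> first URL seen with that content

    def is_duplicate(url, content):
        """Treat pages with byte-identical content as duplicates, whatever the URL."""
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest in seen:
            print(url, "duplicates", seen[digest])
            return True
        seen[digest] = url
        return False

    page = "<html>Welcome to abc.com</html>"
    is_duplicate("http://www.abc.com", page)             # first sighting: kept
    is_duplicate("http://www.abc.com/index.html", page)  # same content: skipped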

Alkaline can index html, htm, text, and shtml files. To index PDF, embedded Shockwave Flash objects, doc, rtf, LaTeX/TeX, WordPerfect, XML, and MPEG Layer 3 files, Alkaline needs external document filters: a retrieved document of one of these kinds can be passed to an external filter, processed by it, and then indexed based on the filter's HTML output.

Retrieval of secured pages on password-protected sites (HTTP/1.0 BASIC authentication, NTLM support in the Windows NT versions, no SSL support) is supported by Alkaline.

  • Searching Features

Alkaline supports Boolean Search, Attribute Search, Word Forms, Wild Card, Numeric Data Search, and Case Sensitivity. It does not support Phrase Matching, Fuzzy Search, Regular Expression, or Natural Language Query.

- Boolean search: To express the fact that a page must contain a word, a '+' sign is placed in front of the word. To search for all pages not containing a word, a '-' sign is used.

- Attribute Search: Alkaline can define search scopes such as Host Scope (host:abc.com), Path Scope (path:abc/directory), URL Scope (url:www.abc.com/abc/directory), File Extension Scope (ext:cpp,h), and Meta Scope, and allows searching within these scopes.

- Word forms: Alkaline supports word stemming. Searching for light will find all pages containing light, lightning, delighted, etc.

- Wild card: Alkaline can use * to return a list of all indexed documents.

- Numeric Data Search: Alkaline indexes words such as quantity=15 in a special manner, so it can support searches such as quantity < 15, quantity = 15, or quantity > 15 (one possible scheme is sketched after this list).

- Case Sensitivity: Alkaline chooses a case-sensitive search when at least one upper-case letter is present in a word.
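
Alkaline's "special manner" of indexing numeric tokens is not documented in detail. The following Python sketch shows one plausible scheme, assumed for illustration: name=value tokens are parsed into a per-field numeric side index that can answer comparison queries.

    import re
    from collections import defaultdict

    numeric_index = defaultdict(list)   # field name -> [(doc_id, value), ...]

    def index_numerics(doc_id, text):
        """Store tokens such as 'quantity=15' in a numeric side index."""
        for field, value in re.findall(r"(\w+)\s*=\s*(\d+(?:\.\d+)?)", text):
            numeric_index[field].append((doc_id, float(value)))

    def numeric_search(field, op, value):
        ops = {"<": lambda a, b: a < b,
               "=": lambda a, b: a == b,
               ">": lambda a, b: a > b}
        return [doc for doc, v in numeric_index[field] if ops[op](v, value)]

    index_numerics("stock.html", "quantity=15 price=120")
    print(numeric_search("quantity", "<", 20))  # ['stock.html']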

  • Other Features

Alkaline does not support languages other than English. There is a theoretical limit of two billion documents that Alkaline can index, but the recommended usage is to index around 50,000-500,000 pages and 250,000 word forms. The layout of search results is fully customizable.

For detailed features of Alkaline, please refer to Appendix 1, which is the feature summary from the documentation of Alkaline [2].

4.2.2 Fluid Dynamics

  • Searching Mechanism

The Fluid Dynamics search engine uses attribute indexing [5]. A document's text, keywords, description, title, and address are all extracted and used for searching. Basically, this is full-text indexing, but the "Max Characters: File" option allows one to cap the number of bytes read from any document. Keeping it at a low value saves indexing time at the expense of search accuracy.

The ranking of documents is decided by the frequency of query words in the documents. Query words found in the title, keywords, or description parts of a document are given additional weight, which can be modified by changing the values of the "Multiplier: Title", "Multiplier: Keywords", and "Multiplier: Description" settings. Every time a search term is found in the web page text, one point is added to the page's relevance; every time a search term is found in the title, the value of the "Multiplier: Title" setting is added to the relevance, and similar additions are made for the META keywords and description. Results can also be ranked by last modified time, time the page was last indexed, and their inverses.
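
The scoring rule just described is easy to restate in code. Here is a minimal Python sketch; the multiplier values are illustrative assumptions, not the actual Fluid Dynamics defaults.

    import re

    # Illustrative settings mirroring "Multiplier: Title" etc.; the actual
    # Fluid Dynamics default values may differ.
    MULTIPLIERS = {"title": 10, "keywords": 5, "description": 3, "body": 1}

    def tokens(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def relevance(page, query_terms):
        # One point per body occurrence; multiplier points per title/meta occurrence.
        score = 0
        for field, text in page.items():
            field_tokens = tokens(text)
            for term in query_terms:
                score += field_tokens.count(term) * MULTIPLIERS.get(field, 1)
        return score

    page = {"title": "Search tips", "keywords": "search, tips",
            "description": "How to search well", "body": "Search early, search often."}
    print(relevance(page, ["search"]))  # 10 + 5 + 3 + 2 = 20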

  • Crawler and Indexer Features

Fluid Dynamics supports the Robots Exclusion Standard, i.e. it respects both the robots.txt file and the robots meta tags. The crawler can stop after each level of crawling to wait for manual approval, so an administrator is able to control the depth of crawling. It can detect duplicate pages and will not index them.

Fluid Dynamics can index html, htm, shtml, shtm, stm, and mp3 files. To index PDF files, it needs the xpdf helper utility from www.foolabs.com/xpdf. It cannot index servers protected by passwords.

  • Searching Features

Fluid Dynamics supports Boolean Search, Phrase Matching, Attribute Search, Word Forms, and Wild Card [5] [6]. It does not support Fuzzy Search, Regular Expression, Numeric Data Search, Case Sensitivity, or Natural Language Query.

- Boolean search: To express the fact that a page must contain a word, a '+' sign or "and" is placed in front of the word. To search for all pages not containing a word, a '-' sign or "not" is used. "or" or '|' means that the search term is preferred; additional preferred terms increase the ranking.

- Phrase Matching: Enclosing words in quotation marks causes them to be evaluated as a phrase.

- Attribute Search: Fluid Dynamics is able to limit search scopes to URLs, titles, texts, or links by using url:value (host:value or domain:value), title:value, text:value, or link:value.

- Word forms: Fluid Dynamics supports approximate English-language plural forms of words.

- Wild card: Fluid Dynamics uses * to represent one or more characters or symbols.

  • Other Features

Fluid Dynamics is designed to search languages that use the Latin character set, including English, German, and Dutch; all Latin-extended characters are reduced to their English equivalents. The query interface and result display are template-based and thus easy to customize, and it is also easy to translate the user interface into non-English languages. There is no theoretical page limit for Fluid Dynamics, but the soft limit, imposed by disk space and CPU load, is about 100,000 documents.

For detailed features of Fluid Dynamics, please refer to Appendix 2, which is the feature summary from the documentation of Fluid Dynamics [5].

4.2.3 ht://Dig

  • Searching Mechanism

ht://Dig uses the most standard indexing method: a full-text inverted index. The relevance ranking method is word weight; word weights are said to be determined mainly by the importance of the word in a document.

  • Crawler and Indexer Features

The crawler of ht://Dig supports the Robots Exclusion Standard. The depth of crawling can be limited by setting the maxhops option when running the crawling program, htdig. ht://Dig uses the signature of a document to detect duplicate pages, but it was reported that ht://Dig did not remove duplicates [7].

ht://Dig can index html and txt files by default. PDF, MS Word, PowerPoint, PostScript, and Excel files can be indexed with the aid of external parsers or converters; the path of the external parser or converter must be put in the configuration file.

ht://Dig can index protected servers. It can be configured to use a specific username and password when it retrieves documents on a password protected server.

  • Searching Features

ht://Dig supports Boolean Search, Attribute Search, Fuzzy Search, Word Forms, and Wild Card. It does not support Phrase Matching, Regular Expression, Numeric Data Search, Case Sensitivity, or Natural Language Query.

- Boolean Search: AND is used to search for pages containing all keywords. OR is used to search for pages containing any of the keywords.

- Attribute Search: ht://Dig can be set to return only documents whose URL matches a certain pattern. This is different from attribute search proper; we list it here because it is similar to searching within a URL scope.

- Fuzzy Search: ht://Dig supports soundex, metaphone, accents, and synonyms search.

- Word Forms: ht://Dig supports word stemming.

- Wild Card: Wild card usage is not described in any documentation of ht://Dig, but the search engine at the Kennedy Space Center website [8], which is built using ht://Dig, supports powerful wild cards (regular-expression equivalents are sketched after this list). More specifically,

  * : substitutes for one or more characters.

  ? : substitutes for one character.

  [ ] : Entering <WILDCARD>'f[aeo]ster' specifies the characters allowed at that position. This would return pages containing faster, fester, and foster, to name a few.

  { } : Entering <WILDCARD>'{land,launch}ing' specifies possible word fragments. This would return pages containing landing or launching.

  - : Entering <WILDCARD>'f[a-z]ster' specifies the range of characters allowed at that position.
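
These wildcard forms map directly onto ordinary regular expressions. The short Python demonstration below shows what each form matches; it is illustrative only, not ht://Dig code.

    import re

    # Regular-expression equivalents of the wildcard forms listed above.
    patterns = {
        r"f.+ster":          ["faster", "fooster", "fster"],         # * : one or more characters
        r"f.ster":           ["faster", "fester", "fster"],          # ? : exactly one character
        r"f[aeo]ster":       ["faster", "fester", "foster"],         # [ ]: allowed characters
        r"(land|launch)ing": ["landing", "launching", "landings"],   # { }: word fragments
        r"f[a-z]ster":       ["faster", "f3ster"],                   # -  : character range
    }

    for pattern, words in patterns.items():
        hits = [w for w in words if re.fullmatch(pattern, w)]
        print(f"{pattern:20} matches {hits}")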

  • Other Features

Both SGML entities, such as '&agrave;', and ISO-Latin-1 characters can be indexed and searched by ht://Dig. To support a specific language, we need to configure ht://Dig to use dictionary and affix files for the language of our choice by setting the locale attribute. There is no theoretical page limit; usually, ht://Dig can index more than 100,000 pages. The output of a search can be easily customized using HTML templates.

For detailed features of ht://Dig, please refer to Appendix 3 which is the feature summary from the documentation of ht://Dig [9].

4.2.4 Juggernautsearch 1.0.1

In the documentation of Juggernautsearch, I could not find enough information to draw conclusions about whether it supports some of the features we discuss here. But Juggernautsearch uses a special indexing method that makes the indexing and searching process very fast.

  • Searching Mechanism

Juggernautsearch extracts the top keywords from a document and indexes only those keywords. The keywords are assigned word weights according to their frequency of appearance in the document, and the index file stores them in order of decreasing weight. At search time, only the keywords stored in the index file are examined, and the word weights are used to calculate the relevance ranking. Since only the keywords are indexed and searched, indexing and searching are very fast and the index files take little disk space. A sketch of this scheme follows.
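
Juggernautsearch's index format is not public; the Python sketch below only illustrates the idea as described, keeping the top keywords of each document with weights in decreasing order and scoring queries from those weights alone. The cutoff value is an assumption.

    import re
    from collections import Counter

    TOP_K = 20  # assumed cutoff; the real Juggernautsearch cutoff is not published

    def build_entry(doc_id, text):
        """Keep only the TOP_K most frequent words, in decreasing-weight order."""
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        total = sum(counts.values())
        return doc_id, [(w, n / total) for w, n in counts.most_common(TOP_K)]

    def score(entry, query_words):
        # Relevance is the summed stored weight of the query words;
        # words that did not make the top-K list simply score zero.
        doc_id, keywords = entry
        weights = dict(keywords)
        return sum(weights.get(w, 0.0) for w in query_words)

    entry = build_entry("page.html", "search engines index pages; engines rank pages")
    print(score(entry, ["engines", "rank"]))  # 2/7 + 1/7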

  • Crawler and Indexer Features

Juggernautsearch supports the Robots Exclusion Standard. The crawler of Juggernautsearch is called Pagerunner. It does not provide control over the depth of crawling. Juggernautsearch can detect duplicate pages: it pre-scans retrieved URLs to remove unwanted URLs and URLs that have already been visited, and ensures that once indexed, a URL will not be crawled again in later crawl iterations. It cannot index protected sites.

The file formats that Juggernautsearch can index are as follows:

- Text and HTML files (.TXT, .HTM, .HTML, .SHTM, .SHTML, others)

- Microsoft PowerPoint files (.PPT)

- Microsoft Word files (.DOC)

- Microsoft Excel files (.XLS)

- Computer language source files (.BAT, .C, .CGI, .CXX, .CPP, .H, Java, .PHP, .PL, others)

- Postscript files (.PS)

- Rich Text Format files (.RTF)

  • Searching Features

Juggernautsearch supports Attribute Search: it can restrict a search to be performed only on URLs. Juggernautsearch does not support Boolean Search, which is related to its indexing method. A Boolean search that excludes a keyword can work only when the full document is available for searching; since Juggernautsearch extracts only the top few keywords, it cannot guarantee that an excluded word is absent from a document. No Boolean search is the price paid for fast indexing and searching. In addition, Juggernautsearch does not support Phrase Matching, Numeric Data Search, or Natural Language Query.

  • Other Features

Juggernautsearch does not support languages other than English. It does not have a page limit, because the index file is very small.

Juggernautsearch has issued a challenge to ht://Dig in response to criticism from some of ht://Dig's developers. An interesting comparison between Juggernautsearch and ht://Dig can be found in [10]; the comparison table is attached as Appendix 4.

4.2.5 mnoGoSearch

  • Searching Mechanism

mnoGoSearch uses a full-text inverted index. Words in different parts of a document are assigned different weights. To determine the relevance of a document, mnoGoSearch considers several factors: the number of complete phrases, the number of words from the query found in the document, and the number of incomplete phrases, in each case taking word weights into account.

  • Crawler and Indexer Features

mnoGoSearch supports the Robots Exclusion Standard. The crawling depth of the crawler can be limited. By default, it can index html and txt files; with the aid of external parsers, pdf, ps, and doc files can be indexed as well. On servers supporting HTTP 1.1, mnoGoSearch can index mp3 files, and it can also index SQL database text fields. mnoGoSearch has the ability to index password-protected servers.

  • Searching Features

mnoGoSearch supports Boolean Search, Phrase Matching, Attribute Search, Fuzzy Search, Word Forms, and Wild Card. It does not support Regular Expression, Numeric Data Search, Case Sensitivity, or Natural Language Query.

- Boolean Search: '&' represents logical AND; '|' represents logical OR; '~' represents logical NOT.

- Phrase Matching: Words enclosed in double quotation marks are treated as a phrase in searching.

- Attribute Search: mnoGoSearch can limit search within documents with given tags, or with given URL substrings.

- Fuzzy Search: Supports synonyms and substring search.

- Word Forms: Supports word stemming.

- Wild Card: '%' can be used as a wild card to define URL limits, but it cannot be used in ordinary search words.

  • Other Features

mnoGoSearch supports almost all known 8-bit character sets as well as some multi-byte charsets, including Korean euc-kr, Chinese big5 and gb2312, Japanese shift-jis, and utf8. The euc-kr, big5, gb2312, and shift-jis character sets are not supported by default, because their conversion tables are rather large, which increases the size of the executable files [11]. mnoGoSearch also supports the following Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman, MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew, MacCyrillic. In terms of languages rather than character sets, mnoGoSearch can support around 700 languages, including most of the frequently used languages in the world.

mnoGoSearch can index several million documents. It provides PHP3, Perl, and C CGI access to the search engine, offering significant flexibility in arranging search results.

For detailed features of mnoGoSearch, please refer to Appendix 5 which is the feature summary from the documentation of mnoGoSearch [11].

4.2.6 Perlfect

  • Searching Mechanism

Perlfect implements the most standard indexing and ranking algorithms. It uses an inverted index. To calculate word weights, it applies the algorithm of Gerald Salton [12]: the weight W of a term T in a document D is

W(T, D) = tf(T, D) * log(DN / df(T)),

where tf(T, D) is the term frequency of T in D, DN is the total number of documents, and df(T) is the sum of the frequencies of T over every document considered, also called the document frequency of T.
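
The formula is easy to check with a worked example. Here is a direct Python implementation, following the definition of df(T) exactly as stated above; the toy corpus is, of course, made up.

    import math
    import re
    from collections import Counter

    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    def salton_weight(term, doc, corpus):
        """W(T, D) = tf(T, D) * log(DN / df(T)), with df(T) defined as above:
        the summed frequency of T over every document considered."""
        tf = Counter(tokenize(doc))[term]
        dn = len(corpus)
        df = sum(Counter(tokenize(d))[term] for d in corpus)
        return tf * math.log(dn / df) if df else 0.0

    corpus = ["the cat sat", "the dog sat on the cat", "a bird flew"]
    print(salton_weight("cat", corpus[1], corpus))  # 1 * log(3/2) > 0
    print(salton_weight("the", corpus[1], corpus))  # 2 * log(3/3) = 0

As the example shows, a term that appears everywhere ("the") receives zero weight, while a rarer term ("cat") is weighted up.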

  • Crawler and Indexer Features

Perlfect is the only search engine among the nine that does not support the Robots Exclusion Standard; thus, it is mainly designed for adding a search function to a single website. The depth of crawling cannot be controlled, and it cannot index protected servers.

  • Searching Features

Perlfect supports only the Boolean Search feature. A '+' sign is used to include a word, while a '-' sign is used to exclude a word.

  • Other Features

The result page of Perlfect can be shown in many different languages, such as German, French, and Italian. The user interface is fully customizable using the provided templates. Perlfect is a lightweight search engine: it can index only around 1,000 documents.

For detailed features of Perlfect, please refer to Appendix 6, which is the feature summary in the documentation of Perlfect [13].

4.2.7 SWISH-E

  • Searching Mechanism

I have not been able to determine from publicly available documents which indexing and ranking methods SWISH-E uses.

  • Crawler and Indexer Features

The crawler supports the Robots Exclusion Standard, and its maximum depth of crawling can be controlled. SWISH-E cannot index protected servers.

SWISH-E can index html, xml, and txt files. With filters that convert other types of files, such as MS Word documents, pdf, or gzipped files, into one of the file types that SWISH-E understands, SWISH-E can index those as well. Files with extensions gif, xbm, au, mov, and mpg can be indexed, but their contents cannot.

  • Searching Features

SWISH-E supports Boolean Search, Phrase Matching, Attribute Search, Fuzzy Search, Word Forms, and Wild Card. It does not support Regular Expression, Numeric Data Search, Case Sensitivity, or Natural Language Query.

- Boolean Search: and, or, and not are the three logical operators of SWISH-E. The operators are case sensitive.

- Phrase Matching: Words in double quotation marks are treated as a phrase in searching.

- Attribute Search: SWISH-E allows users to specify certain META tags that can be used as document properties. Search can be limited to documents with specified properties.

- Fuzzy Search: SWISH-E supports soundex search.

- Word Forms: SWISH-E supports word stemming.

- Wild Card: * is used to replace single or multiple characters.

  • Other Features

SWISH-E supports all the languages that use single byte characters.

For detailed features of SWISH-E, please refer to Appendix 7 which is the feature summary from the documentation of SWISH-E [14].

4.2.8 Webinator

  • Searching Mechanism

Webinator uses an inverted index. The ranking algorithm takes into consideration relative word ordering, word proximity, database frequency, document frequency, and position in text. The relative importance of these factors in computing the quality of a hit can be altered under the Ranking Factors option.

  • Crawler and Indexer Features

The crawler supports the Robots Exclusion Standard, and its maximum depth of crawling can be controlled. Webinator cannot index protected servers.

Webinator can detect duplicates by hashing the textual content of each page and not storing any page whose hash code is already in the database. Files with extension html, htm, txt, pdf, doc, swf, asp, jsp, shtml, jhtml, or phtml can be indexed by Webinator.

  • Searching Features

Webinator supports Boolean Search, Phrase Matching, Fuzzy Search, Word Forms, Wild Card, Regular Expression, Numeric Data Search, and Natural Language Query. It does not support Attribute Search or Case Sensitivity.

- Boolean Search: A '-' sign is used to exclude a word; a '+' sign is used to include a word.

- Phrase Matching: Enclose the words in double quotation marks or hyphenate them together.

- Fuzzy Search: This lets you find "looks roughly like" or "sounds like" information. To invoke a fuzzy match, precede the word or pattern with the '%' character.

- Word Forms: Word stemming is supported.

- Wild Card: * can be used to match just the prefix of a word or to ignore the middle of something.

- Regular Expression: Users can find items that cannot be located with a simple wildcard search by using the REX regular expression pattern matcher. To invoke REX within a query, precede the expression with a '/'. For example, /19[789][0-9] finds years between 1970 and 1999.

- Numeric Data Search: This allows you to find quantities in textual information however they may be represented. To invoke a numeric value search within a query, precede the value with a '#'. For example, the query #>5000 may match "2.2 million".

- Natural Language Query: A query can be in the form of a sentence or question.

  • Other Features

Webinator does not support languages other than English. The free version of Webinator can index only about 10,000 pages. It has a customizable user interface.

For detailed features of Webinator, please refer to Appendix 8 which is the feature summary from the documentation of Webinator [15].

4.2.9 WebGlimpse 2.x

  • Searching Mechanism

The indexer and query engine of WebGlimpse is Glimpse. Glimpse implements a two-level query method, which leads to small index files and fast index construction, and supports arbitrarily approximate matching. The two-level query method is a hybrid of the inverted index and sequential search with no indexing [16] [17].

The first step of the indexing process is to divide the whole collection into small pieces called blocks. The number of blocks cannot exceed 256, so that the address of a block can be stored in one byte. The whole collection is scanned word by word, and an index is created that is similar to a regular inverted index, with one notable exception. In an inverted index, every occurrence of every word is indexed with a pointer to the exact location of the occurrence. In Glimpse's index, every word is indexed, but not every occurrence: each entry contains a word and the numbers of the blocks in which that word occurs. Since each block can be identified with one byte, and many occurrences of the same word are combined into one entry, the index is typically quite small.

The searching process consists of two phases. First, Glimpse searches the index for a list of all blocks that may contain a match to the query. Then, each such block is searched separately with agrep, which performs a flexible sequential search; the index is small enough that this remains fast. Because of the sequential search, arbitrarily approximate searches such as fuzzy search, word forms, regular expression, and wild card are easily supported. A compressed sketch of the two-level scheme follows.
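
Here is a minimal Python sketch of the two-level scheme as just described. The block size is an arbitrary assumption, and a plain substring scan stands in for agrep.

    import re
    from collections import defaultdict

    BLOCK_SIZE = 4  # documents per block (arbitrary); Glimpse caps blocks at 256

    def build(docs):
        """Level 1: map each word to the block numbers it occurs in (no positions)."""
        blocks = [docs[i:i + BLOCK_SIZE] for i in range(0, len(docs), BLOCK_SIZE)]
        assert len(blocks) <= 256, "a block number must fit in one byte"
        index = defaultdict(set)
        for bno, block in enumerate(blocks):
            for doc in block:
                for word in re.findall(r"[a-z]+", doc.lower()):
                    index[word].add(bno)
        return blocks, index

    def search(blocks, index, word):
        """Level 2: sequentially scan only the candidate blocks (Glimpse uses agrep)."""
        hits = []
        for bno in index.get(word.lower(), ()):
            hits += [d for d in blocks[bno] if word.lower() in d.lower()]
        return hits

    docs = ["Glimpse builds tiny indexes", "agrep scans candidate blocks",
            "search proceeds in two levels", "inverted indexes store every occurrence",
            "small blocks keep the index small"]
    blocks, index = build(docs)
    print(search(blocks, index, "blocks"))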

  • Crawler and Indexer Features

WebGlimpse supports the Robots Exclusion Standard. The crawling depth can be controlled. It cannot index a protected server. By default, it can index html and txt files; with the aid of filters, it can index PDF and any other documents that can be filtered to plain text.

  • Searching Features

WebGlimpse supports Boolean Search, Fuzzy Search, Word Forms, Wild Card, Regular Expression, and Case Sensitivity. It does not support Phrase Matching, Attribute Search, Numeric Data Search, or Natural Language Query.

- Boolean Search: The AND operation is denoted by the symbol ';'; the OR operation is denoted by the symbol ','.

- Fuzzy Search: Supports misspellings and partial word matches.

- Word Forms: Supports common word endings.

- Wild Card: The symbol '#' denotes a sequence of any number (including zero) of arbitrary characters; '*' works too.

- Regular Expression: The union operation '|', Kleene closure '*', and parentheses ( ) are supported to form regular expressions.

- Case Sensitivity: It supports case sensitive search.

  • Other Features

WebGlimpse can index all single-byte languages, but the output interface is not configurable unless the commercial version of the software is purchased.

For detailed features of WebGlimpse, please refer to Appendix 9 which is the feature summary from the documentation of WebGlimpse [18].

5. Conclusion

We compared and contrasted nine free search engine software packages. Each package has its pros and cons. Most of the search engines support Boolean Search, Phrase Matching, and Word Forms. ht://Dig has powerful wild cards. Juggernautsearch and WebGlimpse have small index files and fast indexing processes. Webinator supports natural language queries and can search for numeric values in running text. mnoGoSearch excels in supporting multiple languages. Perl-script search engines such as Perlfect and RuterSearch are usually lightweight; they have less functionality, but they are easy to install and use. In a nutshell, choosing a search engine software package is a decision that should be based on matching requirements to software features.

References and Notes:

[1] The original version of this paper was finished in 2002 as a project paper for the course Information Sciences and Technology 511: Information Management - Information and Technology, taught by Dr. Lee Giles at the College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA. The author was a graduate student at that college at the time. All obsolete content of the original version has been removed from this paper.

[2] Alkaline: a UNIX/NT Search Engine - Alkaline 1.9 Users Guide. Vestris Inc., Switzerland. http://alkaline.vestris.com/docs/pdf/alkaline.pdf

[3] Alkaline: a UNIX/NT Search Engine - Alkaline 1.5 Frequently Asked Questions. Vestris Inc., Switzerland. http://alkaline.vestris.com/docs/pdf/alkaline-faq.pdf

[4] http://www.faqs.org/rfcs/rfc1321.html

[5] http://www.xav.com/scripts/search/features.html

[6] http://www.xav.com/scripts/search/help/

[7] Comparing Open Source Indexers. http://www.infomotions.com/musings/opensource-indexers/

[8] http://kscsearch.ksc.nasa.gov/htdig/

[9] http://www.htdig.org/

[10] Donald T. Kasper, Juggernautsearch Internet Search Engine 1.0.1 Technical Responses and Comparison to HTDIG (HT://DIG), May 2001. http://juggernautsearch.com/htdig.htm.

[11] http://mnogosearch.org/doc/

[12] http://www.perlfect.com/freescripts/search/development.shtml

[13] http://perlfect.com/freescripts/search/

[14] http://swish-e.org/docs/index.html

[15] Webinator WWW Site Indexer Version 5.0. http://www.thunderstone.com/site/webinator5man/webinator5.pdf

[16] Manber, U.; Wu, S., "GLIMPSE: A Tool to Search Through Entire File Systems". TR 93-34, Department of Computer Science, University of Arizona, Tucson, Arizona, 1993.

[17] Udi Manber, Mike Smith, and Burra Gopal, "WebGlimpse: Combining Browsing and Searching", Proceedings of the 1997 USENIX Technical Conference, January 6-10, 1997.

[18] http://www.webglimpse.net/features.html

Appendix to this Paper


SearchTools.com - Copyright © 2006-2007 Search Tools Consulting
This work is provided under a Creative Commons Sampling Plus 1.0 License.