Designing and Implementing a Search Engine in .NET

When developing a site search engine, you have several options. This article will show you how to use Google, Index Server, or a custom search engine. The uses for a search engine suite aren't limited to ASP.NET web site search.

Note: there is no programming in this article! This is an overview to show you how simple it is to create a search engine for a web site. A lot of people are asking for articles on the code, and the Nata1 Unified team will work as hard as possible to show everyone how the API works internally. However, the API was written very carefully, so a developer should be able to figure out how it works from our code comments and self-describing naming conventions. Code articles are coming soon!

This first article will demonstrate using the Nata1 ASP.NET controls. The next article will discuss adding search capabilities to the TaskVision application, where we will build a site health monitor, check site ranking in Google, and create tasks accordingly.

Download source for this article

Using Microsoft Index Server to develop a site search engine four years back was a lot of fun. At the time, the flexibility to control noise words and which parts of a web site get indexed was interesting.

Imagine you wanted to find “all good surfing places” in Costa Rica, but the site copy uses "surf", not "surfing", and words like "all" and "places" are not in the copy and not relevant. You can use Index Server, but it's much better to have your own control. Imagine you needed to collect info on what people are searching for on your site, or you wanted to weight pages, exclude directories, etc.
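To make the surfing example concrete, here is a minimal sketch in Python of the kind of query normalization described above: drop noise words, then reduce what is left to a common stem so that "surfing" matches "surf". This is not Nata1 code. The noise-word list, the suffix list, and the function names are all assumptions made up for illustration, and the stemming is deliberately naive.

```python
# Hypothetical noise words and suffix rules, for illustration only.
NOISE_WORDS = {"all", "places", "in", "the", "a", "of"}
SUFFIXES = ("ing", "es", "s")


def stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_query(query):
    """Lowercase the query, drop noise words, and stem the remaining words."""
    words = [w.strip('.,!?"') for w in query.lower().split()]
    return [stem(w) for w in words if w and w not in NOISE_WORDS]
```

If the same function is applied to page text at index time and to the query at search time, a page that only says "surf" can still be found by a search for "surfing".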

When I got my first experience with .Net during Beta II of Visual Studio, the possibilities jumped out at me and I began working on Nata1.

Using Nata1, you can drag and drop UI search engine components like hit results, relevance, etc., and switch between Google, Nata1, or Index Server without writing any custom code.

This article will go over what you need to build a basic search page. Here is an example: http://www.nata1.com/Photos/324.aspx

For the developer who is more interested in customizing the controls and developing more advanced features, this is a good starting point. Also, computer science students studying algorithms and data structures can get a good grasp of binary search trees and implement their own data structures. A series of articles written by Scott Mitchell is an excellent starting point for the computer science student to understand and analyze the differences between data structures like skip lists, and the properties of balanced trees. Nata1 was first implemented using BSTs on a remote machine, and although you can use SQL Server, you can also make a hybrid with little work that uses both SQL Server and a BST.
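For readers who have not met a binary search tree used as a search index, here is a minimal sketch in Python, assuming the simplest possible design: each node holds one word and the set of URLs that contain it. The class and method names are hypothetical and are not taken from the Nata1 source.

```python
class Node:
    """One indexed word and the pages it appears on."""

    def __init__(self, word, url):
        self.word = word
        self.urls = {url}
        self.left = None
        self.right = None


class WordIndex:
    """Unbalanced binary search tree keyed by word (illustrative only)."""

    def __init__(self):
        self.root = None

    def add(self, word, url):
        """Record that `word` occurs on `url`."""
        if self.root is None:
            self.root = Node(word, url)
            return
        node = self.root
        while True:
            if word == node.word:
                node.urls.add(url)
                return
            branch = "left" if word < node.word else "right"
            child = getattr(node, branch)
            if child is None:
                setattr(node, branch, Node(word, url))
                return
            node = child

    def search(self, word):
        """Return the set of URLs containing `word` (empty if none)."""
        node = self.root
        while node is not None:
            if word == node.word:
                return set(node.urls)
            node = node.left if word < node.word else node.right
        return set()
```

Because this tree is never rebalanced, inserting words in sorted order degrades it into a linked list with linear lookups, which is exactly why balanced trees and skip lists are worth studying.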

This article will show you how to get up and running with Nata1; future articles by me and others will demonstrate developing core search engine components and controls.

Step 1: Add some configuration code to the web.config file. Configuration is read from the database first; if a setting isn't set in the admin tool, Nata1 looks in the web.config, and then uses defaults if nothing is found. For your search engine, using the web.config is fine, but I've found storing this info in the database preferable, and there are many admin controls included that allow you to alter everything from normalization rules to spidering settings.

<Nata1>
  <binPath>
    <!-- if you want to use BSTs and have a web host, the BST has to be serialized, as ASP.NET restarts frequently -->
    <add key="filePath" value="C:/siteName/searchEngine/" />
  </binPath>
  <sites>
    <!-- if you're indexing just one site, put its URL here -->
    <add key="site" value="siteUrlHere" />
    <add key="defaultPage" value="index.aspx" />
  </sites>
  <log>
    <add key="filePath" value="c:/eventLog.txt" />
  </log>
  <database>
    <add key="connectionString" value="cn string stuff here" />
  </database>
  <indexing>
    <add key="hour" value="4" />
    <add key="intervalType" value="daily" />
    <!--
    <add key="interval" value="hourBased" />
    <add key="intervalHours" value="2" />
    -->
  </indexing>
  <indexRequestTimeOut>
    <add key="seconds" value="5" />
  </indexRequestTimeOut>
  <indexService>
    <add key="provider" value="IndexServer" />
  </indexService>
  <google>
    <add key="licenseKey" value="[put your google license key here]" />
  </google>
</Nata1>

If you want to publish exceptions, add this:

<exceptionManagement mode="on">
  <publisher assembly="Nata1" type="Nata1.Engine.Exceptions.ExceptionPublisher" exclude="*" include="Nata1.Engine.Exceptions.DataStructureException, Nata1; Nata1.Engine.Exceptions.QueryException, Nata1; Nata1.Engine.Exceptions.UIException" operatorMail="sedgewick@nata1.com" filename="c:/inetpub/wwwroot/SearchEngine/Nata1ErrorLog.txt" />
</exceptionManagement>

You'll also need to register the section handlers. Note that configSections must be the first child element of configuration:

<configuration>
  <configSections>
    <sectionGroup name="Nata1">
      <section name="binPath" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="sites" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="log" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="database" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="preferedIndexTime" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="indexRequestTimeOut" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="indexing" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="indexService" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="google" type="Nata1.Nata1SectionHandler, Nata1" />
    </sectionGroup>
    <section name="exceptionManagement" type="Microsoft.ApplicationBlocks.ExceptionManagement.ExceptionManagerSectionHandler, Microsoft.ApplicationBlocks.ExceptionManagement" />
  </configSections>
  <!-- the rest of your web.config, including the Nata1 and exceptionManagement sections shown above, goes here -->
</configuration>
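The lookup order described in Step 1 (database first, then web.config, then built-in defaults) can be sketched in a few lines of Python. This is only an illustration of the fallback idea, not how Nata1 implements it: the function name, the dotted key format, and the default values shown are all assumptions.

```python
# Hypothetical built-in defaults; the real defaults are not documented here.
DEFAULTS = {
    "indexRequestTimeOut.seconds": "5",
    "sites.defaultPage": "index.aspx",
}


def get_setting(key, database_settings, web_config_settings, defaults=DEFAULTS):
    """Return the first value found, checking the database, then web.config, then defaults."""
    for source in (database_settings, web_config_settings, defaults):
        value = source.get(key)
        if value is not None:
            return value
    raise KeyError(key)


# Example: the database overrides the timeout, web.config supplies the provider.
db = {"indexRequestTimeOut.seconds": "10"}
web_config = {"indexRequestTimeOut.seconds": "5", "indexService.provider": "IndexServer"}
timeout = get_setting("indexRequestTimeOut.seconds", db, web_config)
provider = get_setting("indexService.provider", db, web_config)
```

The advantage of this ordering is that settings changed in the admin tool take effect without editing web.config, while a fresh install still works from the file alone.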

Step 2: Next, run the database install scripts for SQL Server. The database isn't very complex, so you can easily use MySQL instead. If you don't have a database, you can still use an in-memory binary search tree, but this isn't recommended; you always want to remote your data structures.

Step 3: Add Nata1.dll to your toolbox. Right-click your toolbox, choose “Add/Remove Items”, click Browse, and find Nata1.dll. The Nata1 controls are now added to your toolbox.

There are dozens of controls. Some are container controls, like ResultsRepeater, and others are for individual items. All the ones with a smiley icon, like HitUrl and HitWords, are placed in the Item or Alternating Item template. Controls like QueryTime sit in the Header template. You can get creative with your toolbox icons; I've included some neat ones, like Homestar Runner icons. Some controls are specific to a search provider: Google has many controls, such as spelling suggestions, but Index Server has only a couple, so be careful to make sure the provider supports the controls you use.

Step 4: We'll need a form to get from a search box to the search results page. Go ahead and drag and drop “SearchForm” (the control with the ducky) onto any ascx or aspx page in your site.

To use an image, set SearchButtonText to an image URL (I know, not the most elegant); otherwise, enter text. Either way, make sure to set ButtonType as well as SearchPageUrl. Note that there is a bug in the designer: the image doesn't update.

Step 5: We'll build the search results page. Drag and drop “Search Results Repeater” (the one with the fairy icon) onto an ascx or aspx page.

There are two especially important properties. The first is “Query Provider”: here you want to select Google, Nata1, Index Server, Rss, or ASP.Net Forums. The last two are still in development; if anyone wants to develop them, be my guest.

The other property is called SearchQueryTemplate mode; here you want to select simple or advanced.

Step 6: Right click the template, choose the template you want to edit, and start dragging and dropping controls.

Here I dragged the SearchQuery and TotalHits controls onto the Header template and put an ad banner there too. You can rotate the banner based on keyword if you want.

There are several other templates you'll need to set, like NoResults. There's also a template for a search form, where you can specify which search form controls to place; perhaps you want an advanced search form at the top.

Step 7: Make sure you place this code in your Global.asax! When you restart your web app (I usually just add one space to the web.config), Nata1 will begin indexing and follow the index plan you have specified in the web.config or in the database.

Sub Application_Start(ByVal sender As Object, ByVal e As EventArgs)
    ' Start the Nata1 indexer when the web application starts
    Nata1.Controller.Start()
End Sub
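The index plan itself comes from the indexing section of the web.config in Step 1 (hour 4 with a daily interval, or the commented-out hour-based interval of 2 hours). As a rough sketch of how such a plan could be turned into a next run time, here is a Python example. The semantics are my reading of the config keys, not documented Nata1 behavior, and the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def next_index_time(last_run, interval_type, hour=4, interval_hours=2):
    """Compute the next indexing time after `last_run` (assumed semantics)."""
    if interval_type == "daily":
        # Assumption: "daily" means once a day at the configured hour.
        candidate = last_run.replace(hour=hour, minute=0, second=0, microsecond=0)
        if candidate <= last_run:
            candidate += timedelta(days=1)
        return candidate
    if interval_type == "hourBased":
        # Assumption: "hourBased" means every `interval_hours` hours.
        return last_run + timedelta(hours=interval_hours)
    raise ValueError("unknown interval type: " + interval_type)
```

A daily run at a quiet hour suits most sites; the hour-based mode would fit a site whose content changes throughout the day.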

There are numerous controls for administration if you want those as well. You can manually index your site, manage noise words, see all the words on the site, and manage normalization (which words to normalize or not, plus special rules, e.g. running, ran, and run are treated as the same word).

One important control that is left as an exercise is logging search words, i.e. what are people searching for on the site? How about some info about them?
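As a starting point for that exercise, here is a minimal in-memory sketch in Python of a search-word log. It only counts how often each query is submitted; the class and method names are hypothetical, and a real control would persist the data (for example, to the database) and could record more about the visitor.

```python
from collections import Counter


class SearchLog:
    """Counts how often each query has been searched (illustrative only)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, query):
        """Normalize case and whitespace, then count the query."""
        normalized = " ".join(query.lower().split())
        if normalized:
            self.counts[normalized] += 1

    def top(self, n=10):
        """Return the `n` most frequent queries as (query, count) pairs."""
        return self.counts.most_common(n)
```

Calling record from the search results page on every request and showing top in an admin page would answer the question of what people are searching for.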

Here you have a powerful search engine you can put together in minutes, but the future of Nata1 is up to the community: I would like to see a DNN implementation, a CSK implementation, and I am currently working on a TaskVision implementation.

I hope you enjoyed my article. Let me know if you have any problems downloading the code or have any comments on the article. We are looking for contributors, so if you want to write new data structures, new controls, or new providers, integrate with DNN, CSK, or IBuySpy, or have other ideas, we'd love to hear them!
