NATA1 Walk Through

Original post, October 21, 2004, 10:31:00

My mission: to implement NATA1 to spider and search my .Text site. Props to Paul [Sedgewick@Nata1.com] for sharing his source and walkthrough to get me started on this, and for responding to my pleas for help.

OK, so the NATA1 project is not quite as easy to implement as you were led to believe at (http://www.nata1.com/). NATA2 does not appear to be "Open" yet, and version 1.1, available via the download link at the bottom of the page at (http://www.nata1.com/download/default.aspx), has a couple of bugs, but we can fix this all in a jiffy...

So start by getting the code, and make sure you can build it in your VS.NET. 

Choose or create your SQL database, and run Nata1SqlScripts/tables.SQL to make the required tables. They will all co-exist nicely in your current site's database. Then run Nata1SqlScripts/Sprocs.SQL to make the stored procedures. If your database login is not an admin (sa) or a dbo, remember to give your user execute permissions on the new stored procedures at this point.

Next, on to the web.config...

In the web application where you hope to use NATA1, you are going to need to add a bunch of settings.

For .Text you already have a configSections element, so add the following entries to it:

<configuration>
  <configSections>
    <sectionGroup name="Nata1">
      <section name="binPath" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="sites" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="log" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="database" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="preferedIndexTime" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="indexRequestTimeOut" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="indexing" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="indexService" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="google" type="Nata1.Nata1SectionHandler, Nata1" />
    </sectionGroup>
    <section name="exceptionManagement" type="Microsoft.ApplicationBlocks.ExceptionManagement.ExceptionManagerSectionHandler, Microsoft.ApplicationBlocks.ExceptionManagement" />
  </configSections>

Then, after the close of your system.web section (</system.web>) but before the close of your configuration section (</configuration>), add the following, editing the file path (my implementation does not use this), site URL, SQL connection string, and the Google key (if applicable). I have removed the quotes from the values you need to edit, so if you forget one, your application will tell you your .config is invalid. You need to replace all of my square brackets with quoted strings.

<Nata1>
  <binPath>
    <add key="filePath" value=[FILE PATH HERE eg C:/siteName/searchEngine/] />
  </binPath>
  <sites>
    <add key="site" value=[BASE URL HERE eg http://blog.yoursite.com] />
    <add key="defaultPage" value="index.aspx" />
  </sites>
  <log>
    <add key="filePath" value="c:/eventLog.txt" />
  </log>
  <database>
    <add key="connectionString" value=[CONNECTION STRING HERE] />
  </database>
  <indexing>
    <add key="hour" value="4" />
    <add key="intervalType" value="daily" />
    <!--
    <add value="hourBased" key="interval" />
    <add value="2" key="intervalHours" />
    -->
  </indexing>
  <indexRequestTimeOut>
    <add key="seconds" value="5" />
  </indexRequestTimeOut>
  <indexService>
    <add key="provider" value="IndexServer" />
  </indexService>
  <google>
    <add key="licenseKey" value="[put your google license key here]" />
  </google>
</Nata1>

And finally, you need to add the exception management section, filling in your email address and a valid path where ASP.NET has permission to write the error log file (errors are also logged in SQL, so you don't really need it).

<exceptionManagement mode="on">
  <publisher assembly="Nata1" type="Nata1.Engine.Exceptions.ExceptionPublisher" exclude="*" include="Nata1.Engine.Exceptions.DataStructureException, Nata1; Nata1.Engine.Exceptions.QueryException, Nata1; Nata1.Engine.Exceptions.UIException" operatorMail=[YOUR EMAIL ADDRESS] filename=[YOUR FILE PATH eg.. c:/inetpub/wwwroot/SearchEngine/Nata1ErrorLog.txt] />
</exceptionManagement>

Now things would be working, but for a few bugs in the spider, so let's fix the spider before we implement it on our site. Open the file Engine/Indexing/IndexUtility.cs.

First we need to up the timeout on the HttpWebRequest. Open Engine/Normalization/SitePageUtility.cs, and around line 342, in the GetPage function, you will see a line that looks like this:

wr.Timeout=1;

Change it to 10000: the value is in milliseconds, so a timeout of 1 will cause every spidered page to time out.
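One way to avoid this kind of unit mix-up is to express the timeout in seconds and convert it explicitly. The following is only a minimal sketch, not NATA1's code: the class TimeoutSketch, the helper CreateRequest, and the URL are hypothetical names chosen for illustration. It builds the request without sending it.

```csharp
using System;
using System.Net;

public static class TimeoutSketch
{
    // Hypothetical helper (not part of NATA1): builds a request whose
    // timeout is supplied in seconds instead of raw milliseconds.
    public static HttpWebRequest CreateRequest(string url, int timeoutSeconds)
    {
        HttpWebRequest wr = (HttpWebRequest)WebRequest.Create(url);
        // HttpWebRequest.Timeout is expressed in milliseconds.
        wr.Timeout = (int)TimeSpan.FromSeconds(timeoutSeconds).TotalMilliseconds;
        return wr;
    }

    public static void Main()
    {
        // Placeholder URL; nothing is sent over the network here.
        HttpWebRequest wr = CreateRequest("http://blog.yoursite.com/", 10);
        Console.WriteLine(wr.Timeout); // prints 10000
    }
}
```

On current .NET versions WebRequest.Create compiles with an obsolescence warning, but it matches the HttpWebRequest API the walkthrough is patching.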

We are also going to need to change the default behavior of the spider. It appears to have been designed to follow only relative links, which kept it on the specific site. For .Text all of the links are absolute, so we need this utility to recognize and spider all links that are on the same domain as our base site. This behavior can be changed in the file Engine/Indexing/IndexUtility.cs. Update the buildSiteURLs function, or actually just replace it with the following (careful with the word wrap; the editor really messed up this function):

private static void buildSiteURLs(string url)
{
    //_rawPages = new Hashtable();
    // for debugging -

    // this function will be just like get page urls
    // 1. get the links for a page that don't appear in the root arraylist
    // 2. add the unique URLs to the root arraylist
    // 3. for each unique URL, recursively call the function
    // this function will eventually only return "unique" page urls

    // ok, capture all the URLs on the page
    Regex rx = new Regex("href\\s*=\\s*(?:\"(?<href>[^\"]*)\"|(?<href>\\S+))",
        RegexOptions.IgnoreCase|RegexOptions.Compiled);

    //Regex rx = new Regex("href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]* ))",RegexOptions.IgnoreCase|RegexOptions.Compiled);

    //string hrefPattern = "<(a|A)\\s{1}.*href=\"(?<href>[^\"]+)\"";
    ArrayList pageResults = new ArrayList();

    //Regex rx = new Regex(hrefPattern);
    MatchCollection mc;

    try
    {
        // Perform the Regex Match
        string pageText = null;
        try
        {
            pageText = SitePageUtility.GetPage(url);
        }
        catch
        {
            // ok we've logged this
        }
        if(pageText != null)
        {
            _rawPages[url] = pageText;
        }
        else
        {
            return;
        }
        // System.Diagnostics.EventLog log = new System.Diagnostics.EventLog("vivaCostaRica");
        // log.WriteEntry("error requesting - " + url);
        // log.Close();

        mc = rx.Matches(pageText);

        foreach(Match m in mc)
        {
            Group g = m.Groups["href"];
            // prepare the url
            string linkTest = g.Value.ToUpper();
            //if(linkTest.IndexOf("/")!=0)continue; no need to check for relative vs static
            // skip links that contain no slash at all; this strips anchors and some javascript
            // (the original test "<-1" could never be true, because IndexOf returns -1 at the lowest)
            if(linkTest.IndexOf("/")==-1)continue;
            if(linkTest.IndexOf(".JPG")>-1)continue;
            if(linkTest.IndexOf(".GIF")>-1)continue;
            if(linkTest.IndexOf(".PDF")>-1)continue;
            if(linkTest.IndexOf(".CSS")>-1)continue;
            //                      int testInt = linkTest.IndexOf(".jpg");
            string href = g.Value;

            if(_debug == true)
                LogUtility.LogEvent(EntryType.Info , DateTime.Now, "href : " + href);

            if(href.IndexOf("/")==0)
                href = SiteUtility.GetSiteBaseUrl() + href.Substring(1); //only prepend the base if it is a relative link
            string uri = href.Split('#')[0]; //strip any anchor
            //    string defaultPage = getSiteDefaultPage();
            // ADD SUPPORT FOR DIFFERENT EXTENSIONS
            if(!_siteUrls.Contains(uri)
                && uri!=SiteUtility.GetSiteBaseUrl()
                && uri!=(SiteUtility.GetSiteBaseUrl()+SiteUtility.GetSiteDefaultPage())
                && uri.IndexOf(SiteUtility.GetSiteBaseUrl())==0)
            {
                //if(_debug==true&&_siteUrls.Count<6||_debug==false)
                {
                    _siteUrls.Add(uri);
                    pageResults.Add(uri);
                }
            }
            else if(_debug == true)
            {
                //log why the URL was not added
                if(_siteUrls.Contains(uri))
                    LogUtility.LogEvent(EntryType.Info , DateTime.Now, "The URL has already been added to the list");
                if(uri.IndexOf(SiteUtility.GetSiteBaseUrl())!=0)
                    LogUtility.LogEvent(EntryType.Info , DateTime.Now, "The URL is off of the host domain");
                // no semicolon after this condition, otherwise the message below is logged for every URL
                if(uri==SiteUtility.GetSiteBaseUrl()||uri==(SiteUtility.GetSiteBaseUrl()+SiteUtility.GetSiteDefaultPage()))
                    LogUtility.LogEvent(EntryType.Info , DateTime.Now, "The URL is for the default page, and is already indexed");
            }
        }
        LogUtility.LogEvent(EntryType.Info , DateTime.Now, "RegEx URL Matches for this page : " + mc.Count.ToString());
        LogUtility.LogEvent(EntryType.Info , DateTime.Now, "pageResults for this page : " + pageResults.Count.ToString());
        // now for each unique link on this page make a recursive call
        // TESTING - make sure we only have unique URLs
    }
    catch(Exception ex)
    {
        LogUtility.LogError(EntryType.RecoverableError,DateTime.Now,"Error processing page " + url,ex);
        //throw new ApplicationException("error occured building Site Urls", ex);
    }
    //if(_debug == true)
    //    if(this.siteURLs.Count>5)return;

    foreach(string s in pageResults)
    {
        buildSiteURLs(s);
    }
}
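To make the same-domain rule concrete, here is a minimal, self-contained sketch in modern C# (generics and verbatim strings), not NATA1's own code. It reuses the href pattern from buildSiteURLs; the class LinkFilterSketch, the helper SameSiteLinks, and the assumption that the base URL ends with a slash are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkFilterSketch
{
    // Same href pattern as in buildSiteURLs, written as a verbatim string.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?:""(?<href>[^""]*)""|(?<href>\S+))",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the unique links in pageText that live
    // under baseUrl. Assumes baseUrl ends with "/".
    public static List<string> SameSiteLinks(string pageText, string baseUrl)
    {
        List<string> results = new List<string>();
        foreach (Match m in HrefPattern.Matches(pageText))
        {
            string href = m.Groups["href"].Value;
            // Relative link: prepend the base URL.
            if (href.StartsWith("/", StringComparison.Ordinal))
                href = baseUrl + href.Substring(1);
            // Strip any anchor.
            string uri = href.Split('#')[0];
            // Keep only links on the same site, skipping the base page itself.
            bool onSite = uri.StartsWith(baseUrl, StringComparison.OrdinalIgnoreCase);
            if (onSite && uri != baseUrl && !results.Contains(uri))
                results.Add(uri);
        }
        return results;
    }

    public static void Main()
    {
        string baseUrl = "http://blog.yoursite.com/"; // placeholder base URL
        string html =
            "<a href=\"/archive/1.aspx\">one</a> " +
            "<a href=\"http://blog.yoursite.com/archive/2.aspx#comments\">two</a> " +
            "<a href=\"http://elsewhere.example/\">off-site</a> " +
            "<a href=\"#top\">anchor</a>";
        foreach (string link in SameSiteLinks(html, baseUrl))
            Console.WriteLine(link);
    }
}
```

Run as written, this prints http://blog.yoursite.com/archive/1.aspx and http://blog.yoursite.com/archive/2.aspx; the off-site link and the bare anchor are dropped.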

 

While we are here: there is a bug in the SearchResults Repeater designer. I didn't find it, because I wanted to implement the whole thing as a class in my application directly and avoid all of the carefully crafted but undocumented server controls, but I devised this workaround. There is a problem with the ItemTemplate object, but without it your repeater won't return results. The solution is to use the AlternatingItemTemplate. So in the file UI/Common/BaseRepeater.cs, find the lines around line 460 that read:

// don't do anything if no ItemTemplate

if (_itemTemplate == null)

      return;

and change it to:

// don't do anything if no ItemTemplate

if (_itemTemplate == null && _alternatingItemTemplate == null)

      return;

else if(_itemTemplate == null)

      _itemTemplate = _alternatingItemTemplate;

Now we are ready to build NATA1, add a reference to the DLL in your web site project, and add it to your toolbox in VS.NET (warning: make a new tab, because it will add lots of new controls), following the instructions from the author's original article:

Step 3. Add Nata1.dll to your toolbox.  Right click your toolbox.  Choose “add/remove items” , click browse, and find Nata1.dll.  Nata1 controls are now added to your toolbox.

[Image: r_toolbox.jpg]

There are dozens of controls. Some are container controls, like ResultsRepeater, and others are for individual items; all the ones with a smiley icon are placed in the Item or AlternatingItem template, like HitUrl, HitWords, etc. You can get creative with your toolbox icons; I've included some neat ones like Homestar Runner icons. Controls like QueryTime sit in the header template. Some controls are specific to a search provider, e.g. Google has many controls, like spelling suggestions, but Index Server only has a couple, so you have to be careful to make sure the provider supports the controls.

 After adding the reference, add the following line to your Application_Start function in your global.asax

protected void Application_Start(Object sender, EventArgs e)
{
    Nata1.Controller.Start();
}

Now when your site starts, it will check the SQL logs to see when it was last spidered, and it should begin crawling your site. You should see activity in the log, and after a while you should have a decent collection of site pages in your database. Now we just need to build a search page (some of this content is from the original FAQ). Just make a page called search.aspx for now and follow the author's original step 4:

Step 4: We'll need a form to get from a search box to the search results page. Go ahead and drag and drop “SearchForm” (control with the ducky) onto any ascx or aspx page in your site.

[Image: r_searchForm.jpg]

To use an image, set the SearchButtonText to an image Url (I know, not the most elegant) or enter text and make sure to set the ButtonType as well as SearchPageUrl.  As you can see, there is a bug in the designer as the image isn't updating.

 

And then for step 5, make a page called SearchResults.aspx (you will need to link it to the control you added in step 4) and follow the author's instructions:

Step 5: We'll build the search results page. Drag and drop “Search Results Repeater” (the one with the fairy icon) onto an ascx or aspx page.
[Image: r_ResultsRepeater1.jpg]

The two most important properties: the first is “Query Provider”; here you want to select Nata1. The other property is called SearchQueryTemplate mode; here you want to select simple.

Then finally we follow the author's step 6.

 

 

Step 6: Right click the template, choose the template you want to edit, and start dragging and dropping controls.

[Image: r_HeaderAndFooterTemplates.jpg]

Here I dragged the controls SearchQuery and TotalHits onto the Header template, and put an ad banner there too, you can rotate based on keyword if you want.

There are several other templates you'll need to set, like NoResults, etc. There's also a template for a Search Form, and you can specify what search form controls to place there, perhaps you want an advanced search form to be at the top.

The key is in that last part. Earlier I noted that the Repeater relies on the ItemTemplate, but if you add that to the control, it breaks. So we are going to add some basic output to the AlternatingItemTemplate instead. Go to your HTML and add:

 

<AlternatingItemTemplate>

<P>

      <nata1searchui:HitTitle id="HitTitle1" runat="server"></nata1searchui:HitTitle>

      <nata1searchui:HitRankingIcon id="HitRankingIcon1" runat="server"></nata1searchui:HitRankingIcon></P>

<P>

      <nata1searchui:HitPageWords id="HitPageWords1" runat="server"></nata1searchui:HitPageWords></P>

<P>

      <nata1searchui:HitCategories id="HitCategories1" runat="server"></nata1searchui:HitCategories></P>

</AlternatingItemTemplate>

 

Also add an error template, which cannot be set from the UI:

 

 <ErrorTemplate>

An error has occurred and our support staff has been notified.

</ErrorTemplate>

 

And with this, you should be able to search through the site pages in your database. Not too pretty, but really cool, because if you made it this far you are hard core, and understand the power of what just happened on your site. You might want to explore the controls for the admin area, and add some noise words to your database too. Happy searching!
