A Simple Site Search Engine in ASP.NET

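As everyone knows, building a search engine is tedious and time-consuming. An enterprise-grade search engine needs a good crawler, has to refresh its index regularly, must keep improving search speed and methods (full-text search, for example), and should filter out junk pages. It is a topic well worth studying in depth.

We are of course not going to build something as powerful as Google here (one person can only do so much), nor will we simply call the Google API. Instead, this article shows how to filter and query web page content, so that you can build a simple page search feature and put it on your personal home page as a site search tool.

[Download the complete source code for this example (0 points)] http://download.csdn.net/source/3513103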

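With that out of the way, let's walk through the implementation:

First, create a set of pages for the site, named WebPage0 through WebPage9 here, each containing some simple content.

Here is a sample HTML page: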

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title>Onecode</title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
    Hi, Onecode team.
    </div>
    </form>
</body>
</html>

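Next, create a web page named SearchEngine. It is the main interface of the program and contains a TextBox, a Button and a GridView; it takes the keywords the user enters and returns the matching pages:

The HTML is as follows: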

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
        Key word:
        <asp:TextBox ID="tbKeyWord" runat="server"></asp:TextBox>
        <br />
        <asp:Button ID="btnSearchPage" runat="server" Text="Search your web page" 
            OnClick="btnSearchPage_Click" />
        <asp:GridView ID="gvwResource" runat="server" AutoGenerateColumns="False">
            <Columns>
                <asp:BoundField DataField="Title" HeaderText="Page Name" />
                <asp:HyperLinkField DataNavigateUrlFields="Link" DataTextField="Link" 
                    HeaderText="Page URL" />
            </Columns>
            <EmptyDataTemplate>
                No result
            </EmptyDataTemplate>
        </asp:GridView>
    </div>
    </form>
</body>
</html>

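We need a WebPage entity class to hold the page information and to make LINQ queries and data binding easier. Create a class file named WebPageEntity.cs.

The C# code stores only the most basic information (page name, content (HTML), link, title and content (text)):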

    /// <summary>
    /// Web page entity class that holds a page's basic information,
    /// such as name, content, link, title and body text.
    /// </summary>
    [Serializable]
    public class WebPageEntity
    {
        // File name of the page.
        public string Name { get; set; }

        // Raw HTML of the page.
        public string Content { get; set; }

        // URL of the page.
        public string Link { get; set; }

        // Text of the title element.
        public string Title { get; set; }

        // Body text with the tags removed.
        public string Body { get; set; }
    }

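Create a RegexMethod class that contains the methods for extracting a page's title and body text. You can extend this class to build your own search and ranking methods:

Code (RegexMethod.cs):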

    using System.Text;
    using System.Text.RegularExpressions;

    public class RegexMethod
    {
        /// <summary>
        /// Retrieves the title text of a page.
        /// </summary>
        /// <param name="htmlCode">The raw HTML of the page.</param>
        /// <returns>The text inside the title element, or an empty string.</returns>
        public string GetTitleString(string htmlCode)
        {
            string regexTitle = @"<title[^>]*>([^<]*)</title>";
            Match match = Regex.Match(htmlCode, regexTitle, RegexOptions.IgnoreCase);
            // Group 1 holds the text between the tags; it is empty when nothing matched.
            return match.Groups[1].Value.Trim();
        }

        /// <summary>
        /// Retrieves the body text of a page.
        /// </summary>
        /// <param name="htmlCode">The raw HTML of the page.</param>
        /// <returns>The body content with all tags removed.</returns>
        public string GetBodyString(string htmlCode)
        {
            // [\s\S]*? lazily matches any character, including line breaks.
            string regexBody = @"<body[^>]*>[\s\S]*?</body[^>]*>";
            string tagClean = @"<[^>]*>";
            MatchCollection matches = Regex.Matches(htmlCode, regexBody, RegexOptions.IgnoreCase);
            StringBuilder strPureText = new StringBuilder();
            foreach (Match match in matches)
            {
                // Strip every tag and keep only the text.
                string text = Regex.Replace(match.Value, tagClean, string.Empty, RegexOptions.IgnoreCase);
                strPureText.Append(text);
            }
            return strPureText.ToString();
        }
    }
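As a quick check of what this kind of extraction returns, the following console sketch runs patterns in the same spirit as the ones above against an inline HTML string. The `ExtractionDemo` class, its helper names and the sample markup are made up for illustration and are not part of the original project; unlike the class above, the sketch also collapses whitespace in the body text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-alone demo; not part of the original sample project.
public static class ExtractionDemo
{
    // Returns the text between <title> and </title>, or an empty string.
    public static string GetTitle(string html)
    {
        Match match = Regex.Match(html, @"<title[^>]*>([^<]*)</title>", RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value.Trim() : string.Empty;
    }

    // Returns the body with every tag removed and runs of whitespace collapsed.
    public static string GetBodyText(string html)
    {
        Match match = Regex.Match(html, @"<body[^>]*>([\s\S]*?)</body\s*>", RegexOptions.IgnoreCase);
        if (!match.Success)
        {
            return string.Empty;
        }
        // Replace each tag with a space so that adjacent words do not run together.
        string withoutTags = Regex.Replace(match.Groups[1].Value, @"<[^>]*>", " ");
        return Regex.Replace(withoutTags, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "<html><head><title>Onecode</title></head>"
                    + "<body><div>Hi, <b>Onecode</b> team.</div></body></html>";
        Console.WriteLine(GetTitle(html));    // Onecode
        Console.WriteLine(GetBodyText(html)); // Hi, Onecode team.
    }
}
```

Regular expressions are good enough for simple, well-formed pages like the sample ones; for arbitrary external pages a real HTML parser would likely be more robust.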

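With the groundwork done, we can write the main logic in SearchEngine.aspx.cs. The idea is as follows. First create a List<T> instance and load the page resources into it; to keep it across postbacks, the list is stored in ViewState. The pages are fetched with the HttpWebRequest and HttpWebResponse classes, and the loading code is wrapped in a critical section with the lock keyword.

In the entity class, Name and Link are used to display the page name and URL so that the user can click through to the page, Title and Body are what the search runs against, and Content holds the raw HTML from which the RegexMethod class extracts Title and Body.

Once the page entities have been collected (note that content from external sites can be collected and searched the same way; here I add Bing, www.bing.com), the filtering stage begins: a LINQ query uses string.Contains to check whether a page's title or body contains each keyword, and the matching pages are added to the result list and displayed: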
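The filtering step on its own can be sketched without any web controls. The `PageInfo` class, `SearchSketch.Filter` and the sample data below are hypothetical; the sketch assumes a page counts as a hit when its title or body contains at least one keyword, compared case-insensitively.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down stand-in for the page entity.
public class PageInfo
{
    public string Title { get; set; }
    public string Link { get; set; }
    public string Body { get; set; }
}

public static class SearchSketch
{
    // True when text contains keyword, ignoring case.
    private static bool ContainsIgnoreCase(string text, string keyword)
    {
        return text != null && text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Returns every page whose title or body contains at least one keyword.
    // A query without any keyword returns an empty list in this sketch.
    public static List<PageInfo> Filter(IEnumerable<PageInfo> pages, string query)
    {
        // RemoveEmptyEntries drops the empty strings produced by repeated spaces.
        string[] keys = (query ?? string.Empty).Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return pages
            .Where(page => keys.Any(key => ContainsIgnoreCase(page.Title, key)
                                        || ContainsIgnoreCase(page.Body, key)))
            .ToList();
    }

    public static void Main()
    {
        var pages = new List<PageInfo>
        {
            new PageInfo { Title = "Onecode", Link = "WebPage0.aspx", Body = "Hi, Onecode team." },
            new PageInfo { Title = "Azure", Link = "WebPage1.aspx", Body = "Cloud services." }
        };
        foreach (PageInfo page in Filter(pages, "  azure   hotmail "))
        {
            Console.WriteLine(page.Title + " -> " + page.Link); // Azure -> WebPage1.aspx
        }
    }
}
```

Using `IndexOf` with `StringComparison.OrdinalIgnoreCase` avoids allocating a lowercase copy of every page body for each keyword, which is one reason to prefer it over `ToLower().Contains(...)` when the pages are large.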

The complete code is as follows (SearchEngine.aspx.cs):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Net;
using System.IO;
using System.Text;

namespace CSASPNETDisplayDataStreamResource
{
    public partial class SearchEngine : System.Web.UI.Page
    {
        private List<WebPageEntity> webResources;
        private bool isLoad = true;
        protected void Page_Load(object sender, EventArgs e)
        {
            if (!IsPostBack)
            {
                this.LoadList();
            }
        }

        /// <summary>
        /// Store web resources in ViewState variables.
        /// </summary>
        public List<WebPageEntity> WebResources
        {
            get
            {
                // Load the pages only when nothing has been stored yet.
                if (ViewState["Resource"] == null)
                {
                    this.LoadList();
                }
                return (List<WebPageEntity>)ViewState["Resource"];
            }
        }

        /// <summary>
        /// Loads the resources of the specified web pages.
        /// </summary>
        public void LoadList()
        {
            RegexMethod method = new RegexMethod();
            webResources = new List<WebPageEntity>();
            lock (this)
            {
                for (int i = 0; i < 10; i++)
                {
                    string url = Page.Request.Url.ToString().Replace("SearchEngine", string.Format("WebPage{0}", i));
                    string result = this.LoadResource(url);
                    if (isLoad)
                    {
                        WebPageEntity webEntity = new WebPageEntity();
                        webEntity.Name = Path.GetFileName(url);
                        webEntity.Link = url;
                        webEntity.Content = result;
                        webEntity.Title = method.GetTitleString(result);
                        webEntity.Body = method.GetBodyString(result);
                        webResources.Add(webEntity);
                    }
                }
                string extraUrl = "http://www.bing.com/";
                string bingResult = this.LoadResource(extraUrl);
                if (isLoad)
                {
                    WebPageEntity webEntity = new WebPageEntity();
                    webEntity.Name = Path.GetFileName(extraUrl);
                    webEntity.Link = extraUrl;
                    webEntity.Content = bingResult;
                    webEntity.Title = method.GetTitleString(bingResult);
                    webEntity.Body = method.GetBodyString(bingResult);
                    webResources.Add(webEntity);
                }
                ViewState["Resource"] = webResources;
            }
        }

        /// <summary>
        /// Use HttpWebRequest, HttpWebResponse, StreamReader for retrieving
        /// information of pages, and calling Regex methods to get useful 
        /// information.
        /// </summary>
        /// <param name="url"></param>
        /// <returns></returns>
        public string LoadResource(string url)
        {
            // Reset the flag so that one failed page does not block the pages after it.
            this.isLoad = true;
            try
            {
                HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
                webRequest.Timeout = 30000;
                using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
                {
                    if (webResponse.StatusCode != HttpStatusCode.OK)
                    {
                        this.isLoad = false;
                        return string.Empty;
                    }
                    using (StreamReader reader = new StreamReader(webResponse.GetResponseStream(), Encoding.UTF8))
                    {
                        return reader.ReadToEnd();
                    }
                }
            }
            catch (Exception)
            {
                // Network errors and timeouts end up here; report the failure through the flag.
                this.isLoad = false;
                return string.Empty;
            }
        }

        /// <summary>
        /// Click handler of the search button: compares the key words with the
        /// page resources and selects the related pages.
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        protected void btnSearchPage_Click(object sender, EventArgs e)
        {
            if (!isLoad)
            {
                Response.Write("Resource file load failed, please refresh your page.");
                return;
            }
            if (tbKeyWord.Text.Trim() != string.Empty)
            {
                List<WebPageEntity> allSelectedResources = new List<WebPageEntity>();
                // Drop the empty entries produced by consecutive spaces; an empty
                // keyword would otherwise match every page.
                string[] keys = tbKeyWord.Text.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                foreach(string key in keys)
                {
                    string oneKey = key.ToLower();
                    var webSelectedResources = from entity in this.WebResources
                                               where entity.Body.ToLower().Contains(oneKey)
                                               || entity.Title.ToLower().Contains(oneKey)
                                               select entity;
                    foreach (WebPageEntity entity in webSelectedResources)
                    {
                        if (!allSelectedResources.Contains(entity))
                        {
                            allSelectedResources.Add(entity);
                        }
                    }                 
                }                              
                gvwResource.DataSource = allSelectedResources;
                gvwResource.DataBind();
            }
            else
            {
                var webSelectedResource = from entity in this.WebResources
                                          select new
                                          {
                                              entity.Title,
                                              entity.Link,
                                          };
                gvwResource.DataSource = webSelectedResource;
                gvwResource.DataBind();
            }

        }
    }
}
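The search above only decides whether a page matches; it does not order the hits. Since the RegexMethod class is meant to be extended with custom search and ranking methods, here is one possible ranking as a sketch: count how often the keywords occur in each page's text and list the pages with the most occurrences first. The `RankingSketch` class, its method names and the sample texts are hypothetical and not part of the original sample.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical ranking helper; not part of the original sample.
public static class RankingSketch
{
    // Counts non-overlapping, case-insensitive occurrences of keyword in text.
    public static int CountOccurrences(string text, string keyword)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(keyword))
        {
            return 0;
        }
        int count = 0;
        int index = text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase);
        while (index >= 0)
        {
            count++;
            // Continue searching right after the occurrence that was just found.
            index = text.IndexOf(keyword, index + keyword.Length, StringComparison.OrdinalIgnoreCase);
        }
        return count;
    }

    // Returns the links ordered by total keyword hits, highest first,
    // leaving out pages that contain none of the keywords.
    public static List<string> Rank(IDictionary<string, string> textByLink, string[] keys)
    {
        return textByLink
            .Select(pair => new
            {
                Link = pair.Key,
                Score = keys.Sum(key => CountOccurrences(pair.Value, key))
            })
            .Where(item => item.Score > 0)
            .OrderByDescending(item => item.Score)
            .Select(item => item.Link)
            .ToList();
    }

    public static void Main()
    {
        var textByLink = new Dictionary<string, string>
        {
            { "WebPage0.aspx", "Hi, Onecode team." },
            { "WebPage1.aspx", "Onecode samples from the Onecode team." },
            { "WebPage2.aspx", "Nothing relevant here." }
        };
        // Prints WebPage1.aspx (two hits) and then WebPage0.aspx (one hit).
        foreach (string link in Rank(textByLink, new[] { "onecode" }))
        {
            Console.WriteLine(link);
        }
    }
}
```

A raw occurrence count favors long pages; weighting title hits more heavily than body hits would be a natural next refinement.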

Press Ctrl+F5 to run the site, then type a keyword and start searching, for example onecode, bing, azure or hotmail.