A Simple Site Search Engine in ASP.NET

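As everyone knows, building a search engine is tedious and time-consuming. An enterprise-grade search engine needs a good spider (crawler) that refreshes the search index on a regular schedule, continual tuning of search speed and methods (full-text search, for example), and filtering to keep junk pages out of the results. It is a topic well worth studying in depth.

We are of course not going to teach you how to build something as powerful as Google (one person can only do so much), nor will we simply call Google's API. Instead, this article shows how to filter and query web page content, so that you can build a simple page-search feature and put it on your personal home page as a site search tool.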

[Download the complete source code for this example (0 points)] http://download.csdn.net/source/3513103

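With that out of the way, let's take a quick look at how this feature is implemented:

First, we create a set of pages for the site, named WebPage0 through WebPage9 here, each containing some simple content.

A sample HTML page: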

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title>Onecode</title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
    Hi, Onecode team.
    </div>
    </form>
</body>
</html>


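Next, create a web page named SearchEngine. It provides the main interface of the program and has a TextBox, a Button and a GridView control. It accepts the keywords the user enters and returns the matching pages:

The HTML is as follows: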

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
        Key word:
        <asp:TextBox ID="tbKeyWord" runat="server"></asp:TextBox>
        <br />
        <asp:Button ID="btnSearchPage" runat="server" Text="Search your web page" 
            onclick="btnSearchPage_Click" />
        <asp:GridView ID="gvwResource" runat="server" AutoGenerateColumns="False">
            <Columns>
                <asp:BoundField DataField="Title" HeaderText="Page Name" />
                <asp:HyperLinkField DataNavigateUrlFields="Link" DataTextField="Link" 
                    HeaderText="Page URL" />
            </Columns>
            <EmptyDataTemplate>
                No result
            </EmptyDataTemplate>
        </asp:GridView>
    </div>
    </form>
</body>
</html>


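We need a WebPage entity class to store the page information, which also makes LINQ queries and data binding easier. Create a class file named WebPageEntity.cs.

The C# code stores only the most basic information (page name, content (HTML), link, title, and content (text)):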

    /// <summary>
    /// Web page entity class. Contains a page's basic information,
    /// such as name, content, link, title and body text.
    /// </summary>
    [Serializable]
    public class WebPageEntity
    {
        // File name of the page.
        public string Name { get; set; }

        // Raw HTML of the page.
        public string Content { get; set; }

        // URL of the page.
        public string Link { get; set; }

        // Text of the title element.
        public string Title { get; set; }

        // Body text with the HTML tags removed.
        public string Body { get; set; }
    }


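Create a RegexMethod class containing the methods that extract a page's title and body text. You can extend this class to build your own search and ranking methods:

Code (RegexMethod.cs)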

    using System.Text;
    using System.Text.RegularExpressions;

    public class RegexMethod
    {
        /// <summary>
        /// Retrieves the title text of a page.
        /// </summary>
        /// <param name="htmlCode">HTML source of the page.</param>
        /// <returns>The text inside the title element, or an empty string if none is found.</returns>
        public string GetTitleString(string htmlCode)
        {
            string regexTitle = @"<title>([^<]*)</title>";
            Match match = Regex.Match(htmlCode, regexTitle, RegexOptions.IgnoreCase);
            // Group 1 captures the text between the tags; it is empty when nothing matched.
            return match.Groups[1].Value;
        }

        /// <summary>
        /// Retrieves the body text of a page.
        /// </summary>
        /// <param name="htmlCode">HTML source of the page.</param>
        /// <returns>The content of the body element with all tags removed.</returns>
        public string GetBodyString(string htmlCode)
        {
            // [\s\S] matches any character including newlines, so the body may span several lines.
            string regexBody = @"<body[^>]*>[\s\S]*?</body[^>]*>";
            string tagClean = @"<[^>]*>";
            MatchCollection matches = Regex.Matches(htmlCode, regexBody, RegexOptions.IgnoreCase);
            StringBuilder strPureText = new StringBuilder();
            foreach (Match match in matches)
            {
                string text = Regex.Replace(match.Value, tagClean, string.Empty, RegexOptions.IgnoreCase);
                strPureText.Append(text);
            }
            return strPureText.ToString();
        }
    }
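To see what the two extraction steps produce, here is a minimal stand-alone sketch that can be run as a console program outside ASP.NET. The class name HtmlTextDemo, its method names and the sample HTML string are made up for this illustration and are not part of the downloadable sample; the body pattern is one way to express "everything between the body tags".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names here are illustrative only.
public static class HtmlTextDemo
{
    // Captures the text between <title> and </title>.
    public static string GetTitle(string html)
    {
        Match match = Regex.Match(html, @"<title>([^<]*)</title>", RegexOptions.IgnoreCase);
        return match.Groups[1].Value;
    }

    // Takes the body element and removes every tag from it.
    public static string GetBodyText(string html)
    {
        Match match = Regex.Match(html, @"<body[^>]*>[\s\S]*?</body[^>]*>", RegexOptions.IgnoreCase);
        return Regex.Replace(match.Value, @"<[^>]*>", string.Empty).Trim();
    }

    public static void Main()
    {
        string html = "<html><head><title>Onecode</title></head>" +
                      "<body><form id=\"form1\"><div>Hi, Onecode team.</div></form></body></html>";
        Console.WriteLine(GetTitle(html));     // Onecode
        Console.WriteLine(GetBodyText(html));  // Hi, Onecode team.
    }
}
```

Note that stripping tags this way keeps the text of any script or style elements inside the body, so a real page may need extra cleaning before it is searched.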


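With the preparation done, we can write the main logic in SearchEngine.aspx.cs. The idea is as follows. First, create a List<T> instance to hold the page resources; to keep it across postbacks, the list is stored in ViewState. The pages are fetched with the HttpWebRequest and HttpWebResponse classes, and the lock keyword marks the loading code as a mutually exclusive section.

In the entity class, Name and Link are used to display the page name and link so the user can click through to the page, Title and Body are the search fields, and Content holds the raw HTML from which the RegexMethod class extracts Title and Body.

Once the page entities have been collected, the filtering stage begins. (Note that we can also collect content from external sites and search it in the same way; here I add the Bing site, www.bing.com.) The Contains method is used inside a LINQ query to check whether the title or the page text contains each keyword; matching pages are added to the selected list and displayed:

The complete code is as follows (SearchEngine.aspx.cs)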

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Net;
using System.IO;
using System.Text;

namespace CSASPNETDisplayDataStreamResource
{
    public partial class SearchEngine : System.Web.UI.Page
    {
        private List<WebPageEntity> webResources;
        private bool isLoad = true;
        protected void Page_Load(object sender, EventArgs e)
        {
            if (!IsPostBack)
            {
                this.LoadList();
            }
        }

        /// <summary>
        /// Web resources stored in ViewState; loaded on first access if missing.
        /// </summary>
        public List<WebPageEntity> WebResources
        {
            get
            {
                // Reload only when nothing has been cached in ViewState yet.
                if (ViewState["Resource"] == null)
                {
                    this.LoadList();
                }
                return (List<WebPageEntity>)ViewState["Resource"];
            }
        }

        /// <summary>
        /// Loads the page resources from the specified web pages.
        /// </summary>
        public void LoadList()
        {
            RegexMethod method = new RegexMethod();
            webResources = new List<WebPageEntity>();
            lock (this)
            {
                for (int i = 0; i < 10; i++)
                {
                    string url = Page.Request.Url.ToString().Replace("SearchEngine", string.Format("WebPage{0}", i));
                    string result = this.LoadResource(url);
                    if (isLoad)
                    {
                        WebPageEntity webEntity = new WebPageEntity();
                        webEntity.Name = Path.GetFileName(url);
                        webEntity.Link = url;
                        webEntity.Content = result;
                        webEntity.Title = method.GetTitleString(result);
                        webEntity.Body = method.GetBodyString(result);
                        webResources.Add(webEntity);
                    }
                }
                string extraUrl = "http://www.bing.com/";
                string bingResult = this.LoadResource(extraUrl);
                if (isLoad)
                {
                    WebPageEntity webEntity = new WebPageEntity();
                    // Path.GetFileName returns an empty string for a URL ending in '/', so use the host name.
                    webEntity.Name = new Uri(extraUrl).Host;
                    webEntity.Link = extraUrl;
                    webEntity.Content = bingResult;
                    webEntity.Title = method.GetTitleString(bingResult);
                    webEntity.Body = method.GetBodyString(bingResult);
                    webResources.Add(webEntity);
                }
                ViewState["Resource"] = webResources;
            }
        }

        /// <summary>
        /// Uses HttpWebRequest, HttpWebResponse and StreamReader to retrieve
        /// the HTML of a page. The Regex methods are applied to the result
        /// afterwards to extract the useful information.
        /// </summary>
        /// <param name="url">Address of the page to download.</param>
        /// <returns>The page HTML; an empty string or the error message when loading fails (isLoad is then false).</returns>
        public string LoadResource(string url)
        {
            // Reset the flag so that one failed page does not cause every later page to be skipped.
            this.isLoad = true;
            try
            {
                HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
                webRequest.Timeout = 30000;
                using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
                {
                    if (webResponse.StatusCode != HttpStatusCode.OK)
                    {
                        this.isLoad = false;
                        return string.Empty;
                    }
                    // The pages are assumed to be UTF-8 encoded.
                    using (StreamReader reader = new StreamReader(webResponse.GetResponseStream(), Encoding.UTF8))
                    {
                        return reader.ReadToEnd();
                    }
                }
            }
            catch (Exception ex)
            {
                this.isLoad = false;
                return ex.Message;
            }
        }

        /// <summary>
        /// The search button click event compares the keywords with the
        /// page resources and selects the relevant pages.
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        protected void btnSearchPage_Click(object sender, EventArgs e)
        {
            if (!isLoad)
            {
                Response.Write("Resource file load failed, please refresh your page.");
                return;
            }
            if (tbKeyWord.Text.Trim() != string.Empty)
            {
                List<WebPageEntity> allSelectedResources = new List<WebPageEntity>();
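                // Drop empty entries so that extra spaces do not produce an empty keyword, which would match every page.
                string[] keys = tbKeyWord.Text.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);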
                foreach(string key in keys)
                {
                    string oneKey = key.ToLower();
                    var webSelectedResources = from entity in this.WebResources
                                               where entity.Body.ToLower().Contains(oneKey)
                                               || entity.Title.ToLower().Contains(oneKey)
                                               select entity;
                    foreach (WebPageEntity entity in webSelectedResources)
                    {
                        if (!allSelectedResources.Contains(entity))
                        {
                            allSelectedResources.Add(entity);
                        }
                    }                 
                }                              
                gvwResource.DataSource = allSelectedResources;
                gvwResource.DataBind();
            }
            else
            {
                var webSelectedResource = from entity in this.WebResources
                                          select new
                                          {
                                              entity.Title,
                                              entity.Link,
                                          };
                gvwResource.DataSource = webSelectedResource;
                gvwResource.DataBind();
            }

        }
    }
}
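The filtering step in btnSearchPage_Click can also be tried on its own, without ASP.NET or any network access. The sketch below is a stand-alone console program working on a few in-memory pages; PageInfo, KeywordFilter and the sample data are hypothetical names invented for this illustration. It is a variation rather than a copy of the listing above: it uses a single LINQ query with Any, and a case-insensitive IndexOf in place of ToLower plus Contains.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for WebPageEntity, reduced to the searched fields.
public class PageInfo
{
    public string Title { get; set; }
    public string Body { get; set; }
    public string Link { get; set; }
}

public static class KeywordFilter
{
    // Returns every page whose title or body contains at least one keyword.
    public static List<PageInfo> Search(IEnumerable<PageInfo> pages, string input)
    {
        // Empty entries are removed so that extra spaces never become a keyword.
        string[] keys = input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return pages
            .Where(p => keys.Any(k =>
                p.Title.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0 ||
                p.Body.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0))
            .ToList();
    }

    public static void Main()
    {
        var pages = new List<PageInfo>
        {
            new PageInfo { Title = "Onecode", Body = "Hi, Onecode team.", Link = "WebPage0.aspx" },
            new PageInfo { Title = "Cloud", Body = "Windows Azure notes", Link = "WebPage1.aspx" },
            new PageInfo { Title = "Mail", Body = "Hotmail tips", Link = "WebPage2.aspx" }
        };

        // Prints WebPage0.aspx and WebPage1.aspx, each on its own line.
        foreach (PageInfo page in Search(pages, "  onecode   azure "))
        {
            Console.WriteLine(page.Link);
        }
    }
}
```

Because each page is tested once against all keywords, no duplicate check is needed. Unlike the page above, this sketch returns no results for an empty input instead of listing every page.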


Press Ctrl+F5 to run your site, then enter a keyword and start searching, for example onecode, bing, azure, hotmail, and so on.
