The Road to Search: Extracting Text from HTML in C#

bestdowt1314

The logic is wrapped directly in a single class, which makes it quite convenient to use.

using System;
using System.Text.RegularExpressions;

/// <summary>
/// HtmlExtract extracts the text content from an HTML string.
/// </summary>
public class HtmlExtract
{

        #region private attributes
        private string _strHtml;
        #endregion

        #region public methods
        public HtmlExtract(string inStrHtml)
        {
            _strHtml = inStrHtml;
        }

        public string ExtractText()
        {
            string result = _strHtml;
            result = RemoveComment(result);
            result = RemoveScript(result);
            result = RemoveStyle(result);
            result = RemoveTags(result);
            return result.Trim();
        }
        #endregion

        #region private methods
        private string RemoveComment(string input)
        {
            // Remove HTML comments; Singleline lets '.' match across line breaks.
            return Regex.Replace(input, @"<!--.*?-->", string.Empty, RegexOptions.Singleline);
        }

        private string RemoveStyle(string input)
        {
            // Remove all <style> blocks together with their contents.
            return Regex.Replace(input, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        }

        private string RemoveScript(string input)
        {
            // Remove <script> and <noscript> blocks together with their contents.
            string result = input;
            result = Regex.Replace(result, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            result = Regex.Replace(result, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            return result;
        }

        private string RemoveTags(string input)
        {
            string result = input;
            // Turn <br> into a line break before the remaining tags are stripped.
            result = Regex.Replace(result, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
            // Strip all remaining tags.
            result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty);
            // Decode common entities after stripping, so that escaped markup such as
            // &lt;b&gt; survives as text; &amp; goes last to avoid double decoding.
            result = result.Replace("&nbsp;", " ");
            result = result.Replace("&quot;", "\"");
            result = result.Replace("&#39;", "'");
            result = result.Replace("&lt;", "<");
            result = result.Replace("&gt;", ">");
            result = result.Replace("&amp;", "&");
            return result;
        }
        #endregion
}
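
The class decodes only a handful of entities by hand. As a rough sketch of an alternative, and assuming a runtime that ships System.Net.WebUtility (.NET Framework 4.0 or later, or .NET Core), the decoding step could be handed to WebUtility.HtmlDecode instead. The names HtmlTextSketch and StripAndDecode are made up for this illustration and are not part of the original class.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original HtmlExtract class.
public static class HtmlTextSketch
{
    public static string StripAndDecode(string html)
    {
        // Strip every tag first, then let the framework decode the entities,
        // so that escaped markup such as &lt;3 is kept as plain text.
        string withoutTags = Regex.Replace(html, @"<[\s\S]*?>", string.Empty);
        return WebUtility.HtmlDecode(withoutTags).Trim();
    }

    public static void Main()
    {
        string html = "<p>Tom &amp; Jerry &lt;3 <b>cheese</b></p>";
        Console.WriteLine(StripAndDecode(html)); // prints: Tom & Jerry <3 cheese
    }
}
```

This sketch does not remove comments, scripts, or styles; those steps would still need patterns like the ones in the class above. For untrusted or badly malformed markup, a real HTML parser is usually a safer choice than regular expressions.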
