Search Engine (2): Extracting Text Content from HTML

whucv

Category: Search Engine

Article link: https://blog.csdn.net/archielau/article/details/30127731

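Before extracting text from a web page, you first have to identify the page's encoding and, if necessary, the language it is written in. The overall procedure is:
1. Extract the encoding from the content type returned by the web server. If the encoding is gb2312, treat it as GBK.
2. Identify the character encoding from the page's meta information. If it disagrees with the encoding in the content type, the encoding declared in the meta tag takes precedence.
3. If the character set still cannot be determined, infer it from the raw bytes of the response stream. The language of the page has to be determined as well, because UTF-8 text, for example, can be Chinese, English, Japanese, Korean, or any other language.

(Source: 《自己动手写搜索引擎》, "Write Your Own Search Engine")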
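
The three steps above can be sketched in a few lines of C#. This is only an illustration under stated assumptions, not the book's implementation: the class and method names (CharsetResolver, FindCharset, Normalize, SniffBom, Resolve) are hypothetical, and step 3 is reduced to checking for a byte order mark, whereas a real crawler would need statistical detection for pages without one.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class CharsetResolver
{
    // Matches charset=value (optionally quoted) in a header value or a meta tag.
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    // Returns the first charset token found in the text, or null if there is none.
    public static string FindCharset(string text)
    {
        if (string.IsNullOrEmpty(text)) return null;
        Match m = CharsetPattern.Match(text);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Step 1 rule: a page declared as gb2312 is handled as GBK.
    public static string Normalize(string charset)
    {
        if (charset == null) return null;
        return charset.Equals("gb2312", StringComparison.OrdinalIgnoreCase) ? "GBK" : charset;
    }

    // Step 3, simplified (an assumption of this sketch): only byte order marks are recognised.
    public static string SniffBom(byte[] body)
    {
        if (body == null) return null;
        if (body.Length >= 3 && body[0] == 0xEF && body[1] == 0xBB && body[2] == 0xBF) return "UTF-8";
        if (body.Length >= 2 && body[0] == 0xFF && body[1] == 0xFE) return "UTF-16LE";
        if (body.Length >= 2 && body[0] == 0xFE && body[1] == 0xFF) return "UTF-16BE";
        return null;
    }

    // Combines the steps: the meta declaration wins over the header,
    // and the raw bytes are consulted only when neither declares a charset.
    public static string Resolve(string contentTypeHeader, string htmlHead, byte[] body)
    {
        string fromHeader = FindCharset(contentTypeHeader);
        string fromMeta = FindCharset(htmlHead);
        string chosen = fromMeta ?? fromHeader ?? SniffBom(body);
        return Normalize(chosen);
    }
}
```

Resolve returns null when none of the three sources yields a charset, so the caller can decide on a default.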

Below is a fragment of page source downloaded from Sina (sina.com.cn):

<!DOCTYPE html>
<!--[1,912,1] published at 2014-06-08 00:17:03 from #153 by system-->
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=gb2312" />
<title>新闻中心首页_新浪网</title>
<meta name="keywords" content="新闻,时事,时政,国际,国内,社会,法治,聚焦,评论,文化,教育,新视点,深度,网评,专题,环球,传播,论坛,图片,军事,焦点,排行,环保,校园,法治,奇闻,真情" />
<meta name="description" content="新浪网新闻中心是新浪网最重要的频道之一，24小时滚动报道国内、国际及社会新闻。每日编发新闻数以万计。" />

<link rel="alternate" type="application/rss+xml" href="http://rss.sina.com.cn/news/marquee/ddt.xml" title="新闻中心_新浪网" />
<meta name="stencil" content="PGLS000023" />
<meta name="publishid" content="1,912,1" />
<meta name="verify-v1" content="6HtwmypggdgP1NLw7NOuQBI2TW8+CfkYCoyeB8IDbn8=" />
<meta name="msvalidate.01" content="0EBC6AF737F6405C0F32D73B4AA6A640" />
<link rel="apple-touch-icon" href="http://i0.sinaimg.cn/dy/news3.png" />

Commonly used HTML parsers include HtmlParser and jsoup for Java, and Winista.Htmlparser.Net on the .NET platform.
For more on HTML parsing, see:

http://www.gbtags.com/technology/javautilities/20120720jsoupjquerysnatchpage/

HTML parser

Using HtmlParser.Net to read the encoding from the meta tag:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

using Winista.Text.HtmlParser;
using Winista.Text.HtmlParser.Lex;
using Winista.Text.HtmlParser.Util;
using Winista.Text.HtmlParser.Tags;
using Winista.Text.HtmlParser.Filters;

namespace OpenSearchEngine
{
    public class HtmlParser
    {
        // Extracts the charset value from a Content-Type style string such as
        // "text/html; charset=gb2312". Returns null if no charset is present.
        public static String GetCharset(String content)
        {
            const String CHARSET_STRING = "charset";
            int index;
            String ret;
            ret = null;
            if (null != content)
            {
                // Case-insensitive search, so "Charset=" and "CHARSET=" are found too.
                index = content.IndexOf(CHARSET_STRING, StringComparison.OrdinalIgnoreCase);
                if (index != -1)
                {
                    content = content.Substring(index + CHARSET_STRING.Length).Trim();
                    if (content.StartsWith("="))
                    {
                        content = content.Substring(1).Trim();
                        index = content.IndexOf(";");
                        if (index != -1)
                            content = content.Substring(0, index).Trim();
                        // Remove any double quotes from around the charset string.
                        // The second argument of Substring is a length, so subtract 2 to drop both quotes.
                        if (content.StartsWith("\"") && content.EndsWith("\"") && (1 < content.Length))
                            content = content.Substring(1, content.Length - 2);
                        // Remove any single quotes from around the charset string.
                        if (content.StartsWith("'") && content.EndsWith("'") && (1 < content.Length))
                            content = content.Substring(1, content.Length - 2);
                        ret = content;
                    }
                }
            }
            return (ret);
        }
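        /// <summary>
        /// Gets the encoding declared in a page's meta tags using HtmlParser.Net.
        /// </summary>
        /// <param name="content">The HTML source of the page.</param>
        /// <returns>The declared charset, or an empty string if none is found.</returns>
        public static String GetCharsetFromMeta(string content)
        {
            string result = "";
            Lexer lexer = new Lexer(content);// the Lexer contains the lexical analysis code
            Parser parser = new Parser(lexer);// the parser
            NodeFilter filter = new NodeClassFilter(typeof(Winista.Text.HtmlParser.Tags.MetaTag));// node filter that keeps only meta tags
            NodeList htmlNodes = parser.Parse(filter);// apply the filter to get a NodeList
            /* After parsing, we can use:
             * INode[] nodes = nodeList.toNodeArray();
             * to get the nodes as an array, or access one directly with:
             * INode node = nodeList.elementAt(i);
             * Once a filter has produced a NodeList, we can filter it further with
             * nodeList.extractAllNodesThatMatch(someFilter)
             * or visit its nodes with
             * nodeList.visitAllNodesWith(someVisitor).
             */
            for (int i = 0; i < htmlNodes.Count; i++)
            {
                ITag tag = htmlNodes[i] as ITag;
                if (tag != null)
                {
                    // HTML5 short form: <meta charset="utf-8">
                    string charset = tag.GetAttribute("charset");
                    // Classic form: <meta http-equiv="Content-type" content="text/html; charset=gb2312">
                    if (string.IsNullOrEmpty(charset))
                        charset = GetCharset(tag.GetAttribute("content"));
                    if (!string.IsNullOrEmpty(charset))
                        return charset;
                }
            }
            return result;
        }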
        // Casts a node to ITag; returns null for a null node or a non-tag node.
        private static ITag getTag(INode node)
        {
            return node as ITag;
        }

    }
}

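// Usage: read a saved page and show the charset declared in its meta tag.
// ReadAllText decodes as UTF-8 when the file has no byte order mark. The Chinese text
// of a GBK page comes out garbled, but the ASCII charset declaration stays readable.
string path = @"D:\Docs\Test.htm";
string content = System.IO.File.ReadAllText(path);
tbShow.Text = OpenSearchEngine.HtmlParser.GetCharsetFromMeta(content);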
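
Once the charset name is known, the raw bytes of the page can be decoded with it. The following is a minimal sketch, assuming a hypothetical PageDecoder class that is not part of the original post. It falls back to UTF-8 when the name is missing or unknown to the runtime. Note that on .NET Core and later, legacy code pages such as GBK are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); without that call this sketch would silently fall back to UTF-8 for a GBK page.

```csharp
using System;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class PageDecoder
{
    // Decodes raw page bytes with the named charset.
    // Falls back to UTF-8 when the name is empty or not supported by the runtime.
    public static string Decode(byte[] bytes, string charsetName)
    {
        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrEmpty(charsetName))
        {
            try
            {
                encoding = Encoding.GetEncoding(charsetName);
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported charset name: keep the UTF-8 fallback.
            }
        }
        return encoding.GetString(bytes);
    }
}
```

In practice the bytes would come from System.IO.File.ReadAllBytes or from the HTTP response stream, and the charset name from the detection step.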