Fixing Garbled Text When a Web Crawler Fetches Page Content

JOHNCOOLS


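Problem:
   When a web crawler automatically collects page content, some pages come back as garbled text.
Cause:
   The page was read with the wrong encoding. The encoding information reported by the existing C#/.NET classes is sometimes wrong; in my opinion, for sites that are not ASP.NET applications, the encoding they report is always wrong.
Solution:
   Idea: determine the page's encoding at run time first, and only then read the page content. That way the content will not come out garbled.
   Method:
   1: Read the page content using ASCII encoding.
   2: Use a regular expression to pick the encoding out of the content just read. The text from the previous step may be garbled, but the HTML tags are intact, so the encoding can be taken from the HTML markup.
   3: Read the page again using the correct encoding.
   If anyone has a better approach, please share it!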
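The three steps can also be sketched on an in-memory byte array, so that the page only has to be downloaded once. This is a minimal sketch, not the code from this post: `CharsetSniffer`, `SniffCharset`, `Decode`, the 4096-byte limit, and the UTF-8 fallback are all assumptions made for illustration. On .NET Core and later, legacy encodings such as gb2312 are only available after `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` has been called; without that, the sketch falls back to UTF-8.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: sniffs the charset declared in the HTML markup
// and decodes the raw bytes with it.
public static class CharsetSniffer
{
    // Matches charset=xxx (quotes optional), e.g. inside a <meta> tag.
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?(?<charset>[A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    // Steps 1 and 2: decode as ASCII only to read the markup,
    // then extract the declared charset name ("" if none is found).
    public static string SniffCharset(byte[] raw)
    {
        // Assumption: the declaration sits near the top of the page.
        int count = Math.Min(raw.Length, 4096);
        string head = Encoding.ASCII.GetString(raw, 0, count);
        Match m = CharsetPattern.Match(head);
        return m.Success ? m.Groups["charset"].Value : string.Empty;
    }

    // Step 3: decode the same bytes again with the sniffed encoding.
    public static string Decode(byte[] raw)
    {
        Encoding encoding = Encoding.UTF8; // assumed fallback
        string name = SniffCharset(raw);
        if (name.Length > 0)
        {
            try
            {
                encoding = Encoding.GetEncoding(name);
            }
            catch (ArgumentException)
            {
                // Unknown or unregistered encoding name: keep the fallback.
            }
        }
        return encoding.GetString(raw);
    }
}
```

With the bytes obtained from, for example, `WebClient.DownloadData(url)`, calling `CharsetSniffer.Decode(bytes)` returns the decoded page text.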
   The code is attached below:

Code demo

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace charset
{
    class Program
    {
        static void Main(string[] args)
        {
            string url = "http://www.gdqy.edu.cn";
            GetCharset1(url);
            GetCharset2(url);
            Console.Read();
        }

        // Method 1: read the page encoding directly from HttpWebResponse
        static void GetCharset1(string url)
        {
            try
            {
                WebRequest webRequest = WebRequest.Create(url);
                using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
                {
                    string charset = webResponse.CharacterSet;
                    string contentEncoding = webResponse.ContentEncoding;
                    string contentType = webResponse.ContentType;
                    Console.WriteLine("context type:{0}", contentType);
                    Console.WriteLine("charset:{0}", charset);
                    Console.WriteLine("content encoding:{0}", contentEncoding);
                    // Test whether the fetched page comes out garbled
                    //Console.WriteLine(getHTML(url, charset));
                }
            }
            catch (UriFormatException ex)
            {
                Console.WriteLine(ex.Message);
            }
            catch (WebException ex)
            {
                Console.WriteLine(ex.Message);
            }
        }

        // Method 2: extract the page encoding with a regular expression
        static void GetCharset2(string url)
        {
            try
            {
                // WebName ("us-ascii") is the name Encoding.GetEncoding accepts;
                // EncodingName is a display name and may be localized.
                string html = getHTML(url, Encoding.ASCII.WebName);
                if (html == null)
                {
                    return; // the download failed; getHTML already printed the error
                }
                // The quote after '=' is optional so that both
                // content="text/html; charset=gb2312" and charset="utf-8" match.
                Regex reg_charset = new Regex(
                    @"charset\b\s*=\s*[""']?(?<charset>[^""'\s/>]*)",
                    RegexOptions.IgnoreCase);
                string encoding = null;
                if (reg_charset.IsMatch(html))
                {
                    encoding = reg_charset.Match(html).Groups["charset"].Value;
                    Console.WriteLine("charset:{0}", encoding);
                }
                else
                {
                    encoding = Encoding.Default.WebName;
                }
                // Test whether the fetched page comes out garbled
                //Console.WriteLine(getHTML(url, encoding));
            }
            catch (UriFormatException ex)
            {
                Console.WriteLine(ex.Message);
            }
            catch (WebException ex)
            {
                Console.WriteLine(ex.Message);
            }
        }

        // Reads the page content using the named encoding; returns null on failure
        static string getHTML(string url, string encodingName)
        {
            try
            {
                WebRequest webRequest = WebRequest.Create(url);
                using (WebResponse webResponse = webRequest.GetResponse())
                using (Stream stream = webResponse.GetResponseStream())
                using (StreamReader sr = new StreamReader(stream, Encoding.GetEncoding(encodingName)))
                {
                    return sr.ReadToEnd();
                }
            }
            catch (UriFormatException ex)
            {
                Console.WriteLine(ex.Message);
                return null;
            }
            catch (WebException ex)
            {
                Console.WriteLine(ex.Message);
                return null;
            }
        }
    }
}

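The page at http://www.gdqy.edu.cn uses the encoding gb2312.
The first method prints:
context type:text/html
charset:ISO-8859-1
content encoding:
The second method prints:
charset:gb2312

So the information returned by the first method is wrong, and the second method is right.
Why does the first method report ISO-8859-1 as the encoding?
I used Reflector to decompile the CharacterSet property, and the reason is easy to see from its source: for any content type starting with "text/" it defaults to ISO-8859-1, and it only overrides that default when the Content-Type header itself carries a charset parameter, which the header here (text/html) does not. If the source of the ContentType property could be obtained as well, the cause of the error would not be hard to see either, but after a long search I could not find it. If anyone can fill that in, I would be very grateful.
The decompiled source of the CharacterSet property follows for anyone interested.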

CharacterSet source

public string CharacterSet
{
    get
    {
        this.CheckDisposed();
        string text1 = this.m_HttpResponseHeaders.ContentType;
        if ((this.m_CharacterSet == null) && !ValidationHelper.IsBlankString(text1))
        {
            this.m_CharacterSet = string.Empty;
            string text2 = text1.ToLower(CultureInfo.InvariantCulture);
            if (text2.Trim().StartsWith("text/"))
            {
                this.m_CharacterSet = "ISO-8859-1";
            }
            int num1 = text2.IndexOf(";");
            if (num1 > 0)
            {
                while ((num1 = text2.IndexOf("charset", num1)) >= 0)
                {
                    num1 += 7;
                    if ((text2[num1 - 8] == ';') || (text2[num1 - 8] == ' '))
                    {
                        while ((num1 < text2.Length) && (text2[num1] == ' '))
                        {
                            num1++;
                        }
                        if ((num1 < (text2.Length - 1)) && (text2[num1] == '='))
                        {
                            num1++;
                            int num2 = text2.IndexOf(';', num1);
                            if (num2 > num1)
                            {
                                this.m_CharacterSet = text1.Substring(num1, num2).Trim();
                                break;
                            }
                            this.m_CharacterSet = text1.Substring(num1).Trim();
                            break;
                        }
                    }
                }
            }
        }
        return this.m_CharacterSet;
    }
}
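To see the default in isolation, the behaviour of the decompiled property can be restated as a small standalone function. This is a simplified sketch written for illustration, not the framework's code: `ContentTypeCharset` and `FromContentType` are hypothetical names, and the parsing is deliberately looser than the original loop.

```csharp
using System;

// Hypothetical, simplified restatement of how a charset is derived
// from a Content-Type header value.
public static class ContentTypeCharset
{
    public static string FromContentType(string contentType)
    {
        if (string.IsNullOrWhiteSpace(contentType))
        {
            return string.Empty;
        }

        // Any text/* type starts out as ISO-8859-1, as in the decompiled source.
        string lower = contentType.ToLowerInvariant().TrimStart();
        string charset = lower.StartsWith("text/", StringComparison.Ordinal)
            ? "ISO-8859-1"
            : string.Empty;

        // A charset parameter in the header, if present, overrides the default.
        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            if (p.StartsWith("charset", StringComparison.OrdinalIgnoreCase))
            {
                int eq = p.IndexOf('=');
                if (eq >= 0 && eq < p.Length - 1)
                {
                    charset = p.Substring(eq + 1).Trim().Trim('"');
                }
            }
        }
        return charset;
    }
}
```

Under these assumptions, a header of `text/html` yields ISO-8859-1, while `text/html; charset=gb2312` yields gb2312. That matches the output above: the server sends no charset in its header, and the encoding is declared only inside the HTML, which is where the regular expression in the second method finds it.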
The end!

http://www.cnblogs.com/xuanfeng/archive/2007/01/21/626296.html
