C#: Web Crawler Series

Latest recommended article published on 2021-12-05 17:29:02

weixin_30898109

Tags: web crawler

Original article: http://www.cnblogs.com/shuai7boy/p/7011236.html

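I had experimented with crawler programs before and also tried a few ready-made scraping tools, but the results were never very good. Today someone I know from a training course sent me a video on the topic, so I took the chance to dig into it and learned a lot. Thanks, buddy~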

1. Fetching a single page

I wrote a console program modeled on the one in the video. The code speaks for itself:

using System;
using System.IO;
using System.Net;
using System.Text;

namespace 爬虫
{
    class Program
    {
        static void Main(string[] args)
        {
            // Fetch the page and print its HTML to the console.
            string rec = getContent("http://ryj.shuai7boy.cn/");
            Console.WriteLine(rec);
            Console.ReadKey();
        }

        // Downloads the page at strUrl and returns its HTML as a string.
        public static string getContent(string strUrl)
        {
            try
            {
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(strUrl);
                // The using blocks release the connection and streams even if reading fails.
                using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                using (Stream resStream = response.GetResponseStream())
                using (StreamReader sr = new StreamReader(resStream, Encoding.GetEncoding("utf-8")))
                {
                    string rl;
                    StringBuilder sb = new StringBuilder();
                    // Read the response line by line, keeping the line breaks.
                    while ((rl = sr.ReadLine()) != null)
                    {
                        sb.AppendLine(rl);
                    }
                    return sb.ToString();
                }
            }
            catch (Exception)
            {
                Console.WriteLine("can't open url:" + strUrl);
                throw; // rethrow without resetting the stack trace
            }
        }
    }
}

Running this fetches the page's HTML and prints it to the console.

The principle is simple: request the page, read the response as a stream, and iterate over that stream line by line.

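If you try this yourself, you may have doubts about the encoding. The first thing to say is that UTF-8 is an international standard (an encoding of Unicode), while GB2312 is a national standard that China defined for simplified Chinese.

If the code above decodes the response with gb2312, the exported text is garbled, although the saved page opens fine in a browser. If it decodes with utf-8, the exported text is readable, but the saved page shows garbled characters when opened. The workaround at that point is to change utf-8 to gb2312 by hand inside the page.

As for why this happens and how the encodings get converted, I have not looked into it deeply yet. I will come back to it in a later post.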
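The garbling described above is most likely the usual mismatch between the encoding the bytes were written in and the encoding used to read them. Below is a minimal sketch of that effect, plus one possible way to pick the decoder from a charset name such as the one exposed by `HttpWebResponse.CharacterSet`. The names `EncodingDemo` and `ResolveEncoding` are hypothetical and not part of the original program. ISO-8859-1 stands in for the "wrong" decoder because, as far as I know, gb2312 is only available on .NET Core / .NET 5+ after calling `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: map a charset name (for example the value of
    // HttpWebResponse.CharacterSet) to an Encoding. Falls back to UTF-8
    // when the name is missing or not supported by the runtime, which is
    // an assumption of this sketch, not a rule.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;
        try
        {
            return Encoding.GetEncoding(charset.Trim());
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    public static void Main()
    {
        string original = "爬虫";
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);

        // Decoding with the encoding the bytes were written in round-trips the text.
        string right = ResolveEncoding("utf-8").GetString(utf8Bytes);
        Console.WriteLine(right == original);   // True

        // Decoding the same bytes with a different encoding yields garbled text.
        string wrong = Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
        Console.WriteLine(wrong == original);   // False
    }
}
```

In `getContent`, the same idea would mean passing `ResolveEncoding(response.CharacterSet)` to the `StreamReader` instead of a hard-coded name. This only helps when the server actually declares its charset.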

The code above can also be changed to write straight to a file:

        static void Main(string[] args)
        {
            string rec = getContent("http://www.baidu.com/");
            // Save the fetched HTML to a local file (adjust the path for your machine).
            string strPath = @"E:\c盘搬家\Desktop\1.html";
            File.WriteAllText(strPath, rec);
            Console.WriteLine("ok");
            Console.ReadKey();
        }
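When the text is saved, the encoding matters a second time: `File.WriteAllText(path, text)` writes UTF-8 without a byte order mark, so a page whose HTML declares a different charset may display as garbled in a browser. The sketch below uses the overload that takes an explicit `Encoding` and shows that the file only reads back correctly with the same encoding. `SaveDemo`, `SaveHtml`, the sample HTML, and the temp-file name are all hypothetical, chosen only for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class SaveDemo
{
    // Hypothetical helper: write the HTML using an explicitly chosen encoding
    // so the bytes on disk can be made to match the charset the page declares.
    public static void SaveHtml(string path, string html, Encoding encoding)
    {
        File.WriteAllText(path, html, encoding);
    }

    public static void Main()
    {
        string html = "<html><head><meta charset=\"utf-8\"></head><body>爬虫</body></html>";
        string path = Path.Combine(Path.GetTempPath(), "crawler-sample.html");

        // UTF8Encoding(false) means UTF-8 without a byte order mark.
        SaveHtml(path, html, new UTF8Encoding(false));

        // Reading with the encoding used for writing returns the original text.
        string same = File.ReadAllText(path, Encoding.UTF8);
        Console.WriteLine(same == html);    // True

        // Reading the same bytes with another encoding garbles the Chinese text.
        string other = File.ReadAllText(path, Encoding.GetEncoding("iso-8859-1"));
        Console.WriteLine(other == html);   // False

        File.Delete(path);
    }
}
```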