Scraping Website Data Is No Longer Hard: Fizzler (So Easy) Handles It All - CSDN Blog

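Let me start with the title. Why say that scraping website data is no longer hard (in truth, it does take some effort)? Because it is SO EASY!!! Fizzler handles all of it. I believe most people or companies have scraped someone else's website at some point. For example, every time we publish a post on 博客园 (cnblogs), other sites scrape it; go and look if you don't believe me. Some people also scrape email addresses, phone numbers, QQ numbers, and other useful information from other sites. That information can certainly be sold or put to other uses, which is probably why we all get spam texts and emails every so often. Sound familiar? O(∩_∩)O haha~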

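A while ago I wrote two programs: one scrapes data from a lottery site (双色球, the "Double Color Ball" lottery), and the other scrapes job sites (猎聘, 前程无忧, 智联招聘, and so on). Both were a real headache to write; staring at a pile of HTML tags made me want to give up. First, a look back at how I used to parse HTML. It was the very conventional approach: fetch the HTML with WebRequest, then cut out the content you want step by step, tag by tag. The catch is that as soon as the site's markup changes even slightly, the program probably has to be rewritten, which makes this approach very inconvenient.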
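To make that conventional workflow concrete, here is a minimal sketch of its two steps. It is not the code I actually ran: the class name `ScrapeSketch` and the helpers `FetchHtml` and `SliceAfter` are hypothetical names introduced only for illustration, and the sketch assumes the page can be fetched with a plain GET and decoded with a known encoding.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class ScrapeSketch
{
    // Step 1: download the page as text.
    // WebRequest is the API mentioned in the post; newer .NET versions
    // mark it obsolete in favor of HttpClient.
    public static string FetchHtml(string url, string encodingName = "utf-8")
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream, Encoding.GetEncoding(encodingName)))
        {
            return reader.ReadToEnd();
        }
    }

    // Step 2: return the `length` characters that follow the first
    // occurrence of `marker`, or null when the marker is missing or
    // the text ends too early.
    public static string SliceAfter(string html, string marker, int length)
    {
        int start = html.IndexOf(marker, StringComparison.Ordinal);
        if (start < 0) return null;
        start += marker.Length;
        if (start + length > html.Length) return null;
        return html.Substring(start, length);
    }
}
```

With the marker `<td class="chartBall02">`, `SliceAfter` pulls `07` out of `<td class="chartBall02">07</td>`. If the site later renders the cell as `<td class="chartBall02 hot">07</td>`, the marker no longer matches and the call returns null, which is exactly the fragility described above.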

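Below is the code I used to parse the red balls and the blue ball of the 双色球 lottery. Most of the work is cutting out the content that follows a given tag (located with regular expressions). This code may not look very complicated, but that is only because the amount of data extracted is small and very regular.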

        #region * Parse the TDs inside one TR to get the numbers of one draw
        /// <summary>
        /// Parse the TDs inside one TR to get the numbers of one draw
        /// </summary>
        /// <param name="wn">Result object that receives the draw number and the ball values</param>
        /// <param name="trContent">HTML content of a single TR</param>
        private void ResolveTd(ref WinNo wn, string trContent)
        {
            List<int> redBoxList = null;
            // Pattern that locates the draw number
            string patternQiHao = "<td align=\"center\" title=\"开奖日期";
            Regex regex = new Regex(patternQiHao);
            Match qhMatch = regex.Match(trContent);
            // Fixed offset: skip 17 characters after the pattern, then take the 7-character draw number
            wn.QiHao = trContent.Substring(qhMatch.Index + 17 + patternQiHao.Length, 7);
            // Pattern that locates the blue ball
            string patternChartBall02 = "<td class=\"chartBall02\">";
            regex = new Regex(patternChartBall02);
            Match bMatch = regex.Match(trContent);
            wn.B = Convert.ToInt32(trContent.Substring(bMatch.Index + patternChartBall02.Length, 2));
            // Holds the red ball numbers that were matched
            redBoxList = new List<int>();
            // Pattern that locates red balls (class chartBall01)
            string patternChartBall01 = "<td class=\"chartBall01\">";
            regex = new Regex(patternChartBall01);
            MatchCollection rMatches = regex.Matches(trContent);
            foreach (Match r in rMatches)
            {
                redBoxList.Add(Convert.ToInt32(trContent.Substring(r.Index + patternChartBall01.Length, 2)));
            }
            // Pattern that locates red balls (class chartBall07)
            string patternChartBall07 = "<td class=\"chartBall07\">";
            regex = new Regex(patternChartBall07);
            rMatches = regex.Matches(trContent);
            foreach (Match r in rMatches)
            {
                redBoxList.Add(Convert.ToInt32(trContent.Substring(r.Index + patternChartBall07.Length, 2)));
            }
            // Sort the red ball numbers
            redBoxList.Sort();
            // First red ball
            wn.R1 = redBoxList[0];
            // Second red ball, and so on
            wn.R2 = redBoxList[1];
            wn.R3 = redBoxList[2];
            wn.R4 = redBoxList[3];
            wn.R5 = redBoxList[4];
            wn.R6 = redBoxList[5];
        }
        #endregion

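The next block of code extracts data from a job site. It is one long chain of cutting out the content of HTML tags. Writing it gave me quite a headache, and I wonder whether anyone who has done this kind of work feels the same. It only gets worse once you parse data from several sites (I wrote scrapers for 前程无忧, 猎聘网, and 拉勾网). O(∩_∩)O haha~ It made me want to give up: use regular expressions to cut out the data, then go on to extract the relevant information from it.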

    // Regex filters: { pattern, replacement text }
    private static readonly string[][] Filters =
    {
        new[] { @"(?is)<script.*?>.*?</script>", "" },
        new[] { @"(?is)<style.*?>.*?</style>", "" },
        new[] { @"(?is)<!--.*?-->", "" }, // strip comments from the HTML
        new[] { @"(?is)<footer.*?>.*?</footer>", "" },
        //new[] { "(?is)<div class=\"job-require bottom-job-require\">.*?</div></div>",""}
        new[] { @"(?is)<h3>常用链接：.*?</ul>", "" }
    };

    private void GetJobInfoFromUrl(string url)
    {
        try
        {
            JobInfo info = new JobInfo();
            //--
            string pageStr = GetHtmlCode.GetByget(url, "utf-8");
            if (string.IsNullOrEmpty(pageStr))
            {
                return;
            }
            //--
            pageStr = pageStr.Replace("\r\n", ""); // remove line breaks
            // Get the content of the body tag
            string body = string.Empty;
            string bodyFilter = @"(?is)<body.*?</body>";
            Match m = Regex.Match(pageStr, bodyFilter);
            if (m.Success)
            {
                body = m.ToString().Replace("<tr >", "<tr>").Replace("\r\n", "");
            }
            // Filter out styles, scripts, and other irrelevant tags
            foreach (var filter in Filters)