Scraping data from a web page

joyhen

Categories: ASP.NET, C#. Tags: web scraping, regular expressions

Article link: https://blog.csdn.net/Joyhen/article/details/8771511

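This is a simple approach. It could also be done with threads, but I still can't reliably tell when the threads have finished, so I won't post that part here.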
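On the threading point: one common way to know that worker threads have finished is to call `Thread.Join` on each of them. What follows is only a minimal sketch, not a threaded version of this page. `FetchPage` is a hypothetical stand-in that sleeps instead of downloading, and the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class ThreadJoinSketch
{
    // Hypothetical stand-in for downloading one page;
    // a real version would issue a WebRequest here.
    private static string FetchPage(int page)
    {
        Thread.Sleep(50);
        return "content of page " + page;
    }

    // Starts one thread per page and returns only after all of them have finished.
    public static string[] FetchAll(int pageCount)
    {
        var results = new string[pageCount];
        var threads = new List<Thread>();

        for (int i = 0; i < pageCount; i++)
        {
            int index = i; // copy the loop variable so each thread captures its own value
            var t = new Thread(() => results[index] = FetchPage(index + 1));
            threads.Add(t);
            t.Start();
        }

        // Join blocks the caller until the given thread has ended.
        foreach (var t in threads)
            t.Join();

        return results;
    }

    public static void Main()
    {
        foreach (var line in FetchAll(3))
            Console.WriteLine(line);
    }
}
```

Each thread writes to its own slot of the array, so no locking is needed in this sketch. Once the `Join` loop has completed, every slot is filled.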

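The idea: use WebRequest and WebResponse to fetch the content of a given URL, then use a regular expression to match the part of the HTML we need. This means analyzing the structure of the requested page first and handling it accordingly. Below I use http://bbs.csdn.net/recommend_tech_topics as the example.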
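The matching half of this idea can be tried without any network access. The sketch below runs a regular expression over an in-memory HTML string and pulls out each link's `href` and text. The sample markup is made up and is not the real structure of the CSDN page. The pattern is deliberately simplified (it assumes double-quoted `href` attributes and no nested tags inside the link), and the class and method names are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractionSketch
{
    // Simplified pattern: <a href="...">text</a> with a double-quoted href
    // and plain text (no nested tags) inside the link.
    private static readonly Regex LinkPattern = new Regex(
        "<a\\s+href=\"(?<href>[^\"]*)\"[^>]*>(?<text>[^<]*)</a>",
        RegexOptions.IgnoreCase);

    // Returns (href, text) pairs for every link found in the HTML.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            links.Add(new KeyValuePair<string, string>(
                m.Groups["href"].Value,
                m.Groups["text"].Value.Trim()));
        }
        return links;
    }

    public static void Main()
    {
        // Made-up sample, not the real markup of the target page.
        string html =
            "<div class=\"list_1\"><ul>" +
            "<li><a href=\"/topics/1\">First post</a></li>" +
            "<li><a href=\"/topics/2\">Second post</a></li>" +
            "</ul></div>";

        foreach (var link in ExtractLinks(html))
            Console.WriteLine(link.Key + " -> " + link.Value);
    }
}
```

For anything beyond a quick extraction like this, a real HTML parser is usually more robust than regular expressions, since small markup changes on the target page can silently break a pattern.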

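A screenshot of the http://bbs.csdn.net/recommend_tech_topics page:

For now we only need the post information in the middle of the page, so let's look at the page source to see its structure:

We find that the content we need sits inside the div with class="tit_1". Once we know this pattern the rest is easy, so here is the code first.

To make it easier to page through the results, I added "previous page" and "next page" links: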

<%@ Page Language="C#" AutoEventWireup="true" CodeFile="testcollection.aspx.cs" Inherits="testcollection" %>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title>Test: fetching web page content</title>

    <script src="js/jquery-1.6.min.js" type="text/javascript"></script>
    <style type="text/css">
        a:link, a:visited {color: #335AA4;text-decoration: none;}
        a:hover, a:active {color: #CA0000;text-decoration: underline;}
        
        #pageUrlInfo{ font-family:Arial,宋体; width:960px; margin:0 auto; font-size:14px;} 
        #pageUrlInfo ul, li{ list-style:none;}
        #pageUrlInfo ul li{ line-height:23px;}
        .list_1 .time {float: right;color: #999;font-size: 12px;}
        
        .pageBar{width:960px; margin:0 auto; font-size:14px;}
        .pageBar ul li{ float:right; width:100px;}
    </style>
    
</head>
<body>
    <a name="top"></a>
    <div class="pageBar">
        <ul>
            <li><a href="#bottom">Back to bottom</a></li>
            <li><a href="testcollection.aspx?page=<%=(int.Parse(pageindex)+1).ToString() %>">Next page</a></li>
            <li><a href="testcollection.aspx?page=<%=(int.Parse(pageindex)-(pageindex!="1" ? 1 : 0 )).ToString() %>">Previous page</a></li>
        </ul>
    </div><br />
    <form id="form1" runat="server">
    <div runat="server" id="pageUrlInfo">
        
    </div>
    <a name="bottom"></a>
    <div class="pageBar" style=" width:960px; margin:0 auto;">
        <ul>
            <li><a href="#top">Back to top</a></li>
            <li><a href="testcollection.aspx?page=<%=(int.Parse(pageindex)+1).ToString() %>">Next page</a></li>
            <li><a href="testcollection.aspx?page=<%=(int.Parse(pageindex)-(pageindex!="1" ? 1 : 0 )).ToString() %>">Previous page</a></li>
        </ul>
    </div>
    </form>
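    <script type="text/javascript">

        $(document).ready(function() {
            // The scraped links are expected to be site-relative, so prefix each
            // one with the source domain and open it in a new window.
            $(".list_1 ul li a").each(function() {
                var _urlSuffix = $(this).attr("href");
                $(this).attr({ "href": "<%=HttpUrlDomain %>" + _urlSuffix, "target": "_blank" });
            });
            //...
        });
    </script>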
</body>
</html>
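The script at the end of the page prepends the domain to every `href` on the client, which only works as long as each scraped link is site-relative. As an alternative sketch (not what this page does), the same resolution could be done on the server with `System.Uri`, which also leaves already absolute links untouched. The class name, method name, and sample paths below are made up for illustration.

```csharp
using System;

public static class AbsoluteUrlSketch
{
    // Resolves href against baseUrl. The Uri(Uri, string) constructor
    // combines a relative reference with the base and keeps an
    // already absolute href as it is.
    public static string ToAbsolute(string baseUrl, string href)
    {
        return new Uri(new Uri(baseUrl), href).AbsoluteUri;
    }

    public static void Main()
    {
        // Hypothetical sample links.
        Console.WriteLine(ToAbsolute("http://bbs.csdn.net", "/topics/123"));
        Console.WriteLine(ToAbsolute("http://bbs.csdn.net", "http://example.com/x"));
    }
}
```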

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Code-behind for testcollection.aspx (testcollection.aspx.cs)
public partial class testcollection : System.Web.UI.Page
{
    // Matches from the opening list_1 div through the opening page_nav div.
    // Built once and reused, since RegexOptions.Compiled is costly to construct.
    private static readonly Regex ListRegex = new Regex(
        "<div class=\"list_1\">([\\s\\S]*)</div>([\\s\\S]*)<div class=\"page_nav\">",
        RegexOptions.Compiled);

    protected void Page_Load(object sender, EventArgs e)
    {
        // Target URL, e.g. /recommend_tech_topics?page=2
        WebRequest myReq = WebRequest.Create(HttpUrlDomain + "/recommend_tech_topics?page=" + pageindex);
        string html;

        // The using blocks close the response and streams even if reading throws.
        using (WebResponse myRes = myReq.GetResponse())
        using (Stream resStream = myRes.GetResponseStream())
        using (StreamReader sr = new StreamReader(resStream, Encoding.UTF8))
        {
            html = sr.ReadToEnd();
        }

        Match match = ListRegex.Match(html);
        if (match.Success)
            this.pageUrlInfo.InnerHtml = match.Groups[0].Value; // Groups[0] is the whole match
    }

    /// <summary>
    /// Gets the page number from the query string; falls back to "1".
    /// </summary>
    public string pageindex
    {
        get
        {
            int page;
            // TryParse avoids an exception when "page" is missing or not a number.
            return int.TryParse(Request.QueryString["page"], out page) && page > 0
                ? page.ToString()
                : "1";
        }
    }

    /// <summary>
    /// Base URL of the site being scraped.
    /// </summary>
    public string HttpUrlDomain
    {
        get { return "http://bbs.csdn.net"; }
    }
}
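One detail of the pattern used here is worth seeing in isolation: `[\s\S]*` is greedy, so it runs as far as it can before the closing text has to match. The sketch below compares it with the lazy form `[\s\S]*?` on a made-up HTML string. The sample markup and the names are illustrative only and are not taken from the real page.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazySketch
{
    // Returns the text matched from the list_1 div up to a closing </div>,
    // using either the greedy or the lazy form of [\s\S]*.
    public static string FirstBlock(string html, bool lazy)
    {
        string body = lazy ? "[\\s\\S]*?" : "[\\s\\S]*";
        Match m = Regex.Match(html, "<div class=\"list_1\">" + body + "</div>");
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        // Made-up sample with two sibling divs.
        string html = "<div class=\"list_1\">A</div><div class=\"other\">B</div>";

        // Greedy: extends to the LAST </div>, so both divs are returned.
        Console.WriteLine(FirstBlock(html, false));

        // Lazy: stops at the FIRST </div>, so only the list_1 div is returned.
        Console.WriteLine(FirstBlock(html, true));
    }
}
```

Neither form understands nesting. The lazy version would cut a block short if the list_1 div contained other divs, which is why anchoring on a known marker that follows the block, such as the page_nav div here, can be the more practical choice.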

OK, let's see how it looks: