[C#] Data Scraping

jdhlowforever

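Data scraping here means the following: you come across a page that has no RSS feed and no data API and can only be viewed in a browser, and you want a program to pull its data for your own use.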

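Basic principle:

1. Request the page over HTTP and get its source back as a string;

2. After step one, the data is just a string, the same content you see when you choose View Source in the browser. This is usually where regular expressions come in: extract the useful data and discard the rest;

3. If needed, store the data in your own database; this step can also include processing images and so on.

4. Generate the pages you need from the extracted data.

That completes the process of scraping a page. Below are two versions of the code for step one; the principle is the same.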

------------------------------------------------------------------

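        /// <summary>
        /// Takes a URL and returns the HTML source of the page.
        /// </summary>
        /// <param name="Url">URL of the page to fetch</param>
        /// <returns>The page source as passed through savefile, or an empty string if an exception occurs</returns>
        public string getUrltoHtml(string Url)
        {
            try
            {
                System.Net.WebRequest wReq = System.Net.WebRequest.Create(Url);
                // Dispose the response, stream, and reader once the body has been read.
                using (System.Net.WebResponse wResp = wReq.GetResponse())
                using (System.IO.Stream respStream = wResp.GetResponseStream())
                using (System.IO.StreamReader reader = new System.IO.StreamReader(respStream, System.Text.Encoding.GetEncoding("gb2312")))
                {
                    // gb2312 is hard-coded; change it to match the charset of the target page.
                    // savefile is a helper that is not shown in this article.
                    return savefile(reader.ReadToEnd());
                }
            }
            catch (System.Exception ex)
            {
                // WriteErrFile is an error-logging helper that is not shown in this article.
                WriteErrFile(ex);
            }
            return "";
        }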

----------------------------------------------

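        /// <summary>
        /// Gets the source of a remote file.
        /// </summary>
        /// <param name="Url">Remote URL</param>
        /// <returns>The response body as a string, or an empty string if the request did not complete</returns>
        public string GetRemoteHtmlCode(string Url)
        {
            string s = "";
            // Requires a COM reference to the Microsoft XML (MSXML2) library, so this is Windows only.
            MSXML2.XMLHTTP _xmlhttp = new MSXML2.XMLHTTPClass();
            // The third argument (false) makes the request synchronous.
            _xmlhttp.open("GET", Url, false, null, null);
            _xmlhttp.send("");
            if (_xmlhttp.readyState == 4) // 4 = request completed
            {
                // Decode the raw response bytes with the system default encoding.
                s = System.Text.Encoding.Default.GetString((byte[])_xmlhttp.responseBody);
            }
            return s;
        }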
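
Both versions above rely on older APIs: WebRequest is marked obsolete in recent .NET releases, and MSXML2.XMLHTTP is a Windows-only COM component. As a rough sketch of the same first step with HttpClient, assuming .NET 6 or later, something like the following could work. The names PageFetcher, Decode, and GetHtmlAsync are made up for this illustration and do not come from the original code. Note that on .NET Core and later, legacy charsets such as gb2312 are only available after registering CodePagesEncodingProvider.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper class, not part of the original article.
public static class PageFetcher
{
    // One shared HttpClient for the whole application.
    private static readonly HttpClient Client = new HttpClient();

    // Decodes raw response bytes with the given charset name,
    // falling back to UTF-8 when the name is empty or unknown.
    public static string Decode(byte[] body, string charset)
    {
        // Legacy code pages such as gb2312 must be registered on .NET Core and later.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrWhiteSpace(charset))
        {
            try
            {
                // Some servers wrap the charset name in quotes.
                encoding = Encoding.GetEncoding(charset.Trim('"', ' '));
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the UTF-8 fallback.
            }
        }
        return encoding.GetString(body);
    }

    // Downloads the page and decodes it with the charset from the Content-Type header.
    public static async Task<string> GetHtmlAsync(string url)
    {
        using HttpResponseMessage response = await Client.GetAsync(url);
        response.EnsureSuccessStatusCode();
        byte[] body = await response.Content.ReadAsByteArrayAsync();
        string charset = response.Content.Headers.ContentType?.CharSet ?? "";
        return Decode(body, charset);
    }
}
```

A call would look like `string html = await PageFetcher.GetHtmlAsync("http://www.baidu.com");`. Reading the charset from the response header avoids hard-coding gb2312, although pages that declare their charset only in a meta tag would still need extra handling.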

-----------------------------

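Step two: a small regular-expression example that returns the whole matching div. After that, whether you store the result in your own database or do something else with it is up to you.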

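string Reg = "<div id=m>.+?</div>";
// o is an instance of the class that contains GetRegValue.
string GetValue = o.GetRegValue(Reg, GetRemoteHtmlCode("http://www.baidu.com"));

    // Returns the first match of RegexString in RemoteStr, or "" if nothing matches.
    // Requires: using System.Text.RegularExpressions;
    public string GetRegValue(string RegexString, string RemoteStr)
    {
        string MatchVale = "";
        Regex r = new Regex(RegexString);
        Match m = r.Match(RemoteStr);
        if (m.Success)
        {
            // m.Value is the full matched text, including the div tags.
            MatchVale = m.Value;
        }
        return MatchVale;
    }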
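
Regex.Match returns only the first hit, and by default `.` does not match a newline, so a div that spans several lines would be missed. The sketch below is one possible way to collect every match with Regex.Matches and RegexOptions.Singleline, and then to turn the results into a small HTML list as in step four. The names HtmlExtractor, ExtractAll, and RenderList, as well as the `class="item"` markup in the usage note, are assumptions made for this example. Like the pattern above, this approach stops at the first closing tag and does not handle nested divs.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original article.
public static class HtmlExtractor
{
    // Returns the cleaned text of the first capture group for every match of the pattern.
    public static List<string> ExtractAll(string pattern, string html)
    {
        var results = new List<string>();
        // Singleline lets '.' match newlines; IgnoreCase tolerates upper-case tags.
        var regex = new Regex(pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);
        foreach (Match m in regex.Matches(html))
        {
            // Strip any tags left inside the captured text, then decode entities.
            string text = Regex.Replace(m.Groups[1].Value, "<[^>]+>", "");
            results.Add(WebUtility.HtmlDecode(text).Trim());
        }
        return results;
    }

    // Builds an HTML list from the extracted items, encoding each one again for output.
    public static string RenderList(IEnumerable<string> items)
    {
        var sb = new StringBuilder("<ul>");
        foreach (string item in items)
        {
            sb.Append("<li>").Append(WebUtility.HtmlEncode(item)).Append("</li>");
        }
        return sb.Append("</ul>").ToString();
    }
}
```

For example, `HtmlExtractor.ExtractAll("<div class=\"item\">(.+?)</div>", html)` would return one string per matching div, and passing that list to `HtmlExtractor.RenderList` produces markup that can be written into your own page.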