Scraping Page Information

weixin_30642305

Original link: http://www.cnblogs.com/lihuiping258/archive/2007/04/02/697251.html

I have spent the last couple of days working on this: extracting information from a particular website. Still learning as I go.
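using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Download the HTML source of the page at the given URL.
public string GetSourceHtml(string urlstr)
{
    WebRequest wreq = WebRequest.Create(urlstr);
    // The target site is assumed to serve GB2312-encoded pages.
    Encoding encode = Encoding.GetEncoding("gb2312");

    // The using blocks release the response and stream even if reading fails.
    using (WebResponse wres = wreq.GetResponse())
    using (Stream rece = wres.GetResponseStream())
    // Decoding through a StreamReader avoids corrupting a two-byte GB2312
    // character that happens to be split across two 512-byte reads, and
    // avoids repeated string concatenation in a loop.
    using (StreamReader reader = new StreamReader(rece, encode))
    {
        return reader.ReadToEnd();
    }
}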
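// Extract the relevant URLs from the page source
.......
// Extract the article content from the page source, stripping the page header and footer.
.......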
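The URL-extraction step is elided above, so the following is only a minimal sketch of one way it might look, not the author's code. The class `LinkExtractor`, the method `ExtractUrls`, and the `baseUrl` parameter are hypothetical names introduced for illustration. It assumes links appear as quoted `href` attributes on `<a>` tags; a regex is a rough tool here and will miss unquoted or script-generated links.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Hypothetical helper (not from the original post): returns the distinct
    // http/https links found in <a href="..."> tags, resolved against baseUrl.
    public static List<string> ExtractUrls(string html, string baseUrl)
    {
        var urls = new List<string>();
        var baseUri = new Uri(baseUrl);

        // Matches href="..." or href='...' inside an <a ...> tag.
        var pattern = new Regex(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']+)[\"']",
            RegexOptions.IgnoreCase);

        foreach (Match m in pattern.Matches(html))
        {
            Uri absolute;
            // Resolve relative links; skip values that cannot be parsed.
            if (!Uri.TryCreate(baseUri, m.Groups[1].Value, out absolute))
                continue;
            // Keep only web links (drops mailto:, javascript:, and so on).
            if (absolute.Scheme != Uri.UriSchemeHttp &&
                absolute.Scheme != Uri.UriSchemeHttps)
                continue;
            if (!urls.Contains(absolute.AbsoluteUri))
                urls.Add(absolute.AbsoluteUri);
        }
        return urls;
    }
}
```

A call such as `LinkExtractor.ExtractUrls(GetSourceHtml(url), url)` would then give the list of pages to visit next. Filtering that list down to the "relevant" URLs depends on the site and is left out.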
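Some thoughts on locating the content within a page:
The goal is to extract only the table that holds the article content and discard everything else. Tables containing keywords typical of unwanted content can be dropped first; the title, the position in the page, and fixed elements that always appear inside the content can then be used to pin down exactly where the content sits, which makes the extraction possible.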
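The idea above could be sketched roughly as follows. This is an illustrative assumption rather than the author's implementation: `ContentLocator`, `FindContentTable`, and all of its parameters are hypothetical names. The sketch only inspects innermost tables (tables with no nested `<table>` inside them) and does not model the "position in the page" criterion.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContentLocator
{
    // Hypothetical helper: returns the first innermost <table> block that
    // contains the title and every required marker and none of the unwanted
    // keywords, or null if no table qualifies.
    public static string FindContentTable(
        string html, string title,
        string[] unwantedKeywords, string[] requiredMarkers)
    {
        // The negative lookahead stops a match from running across a nested
        // <table>, so only innermost tables are returned.
        var tablePattern = new Regex(
            @"<table\b[^>]*>(?:(?!<table\b).)*?</table>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        foreach (Match m in tablePattern.Matches(html))
        {
            string table = m.Value;
            if (CountHits(table, unwantedKeywords) > 0) continue;   // navigation, ads, ...
            if (table.IndexOf(title, StringComparison.Ordinal) < 0) continue;
            if (CountHits(table, requiredMarkers) != requiredMarkers.Length) continue;
            return table;
        }
        return null;
    }

    // Counts how many of the given words occur in the text.
    private static int CountHits(string text, string[] words)
    {
        int hits = 0;
        foreach (string w in words)
            if (text.IndexOf(w, StringComparison.Ordinal) >= 0) hits++;
        return hits;
    }
}
```

The keyword and marker lists have to be chosen per site by looking at a few sample pages, which matches the point made above that the fixed elements of the content are what identify it.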

Reposted from: https://www.cnblogs.com/lihuiping258/archive/2007/04/02/697251.html
