Semi-Generic Data Collection (1)

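Let me first explain why this is a semi-generic collection system. The original plan really was to build a fully generic one, but the project was on a tight schedule and there was not enough time, so I stopped halfway and it ended up half generic. I hope that after reading this post you will have a deeper understanding of data collection.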

 

Right after getting the project I quickly put together a first version. But the quality was poor, and it was soon sent back.

 

One look was enough to see why: the interface was not good enough to begin with, and the code was even worse.

Code (early test code):

#region Convert the fetched data into a DataSet according to the given conditions
private static DataSet GetDataSet(string type)
{
    DataTable dt = CreateTbBlockTrade();
    switch (type)
    {
        case "ShangHai":
            // Row (TR) and cell (TD) offsets for this site, read from configuration
            int trstartshanghai = XMLHelper.GetInt("trstartshanghai");
            int trendshanghai = XMLHelper.GetInt("trendshanghai");
            int tdstartshanghai = XMLHelper.GetInt("tdstartshanghai");
            int tdendshanghai = XMLHelper.GetInt("tdendshanghai");
            XmlNodeList xmlnodelist = XMLHelper.GetXmlNodeList(System.Windows.Forms.Application.StartupPath + "//messagexml.xml", "TR");
            int countshanghai = xmlnodelist.Count - tdstartshanghai;
            for (int i = trstartshanghai; i < countshanghai; i++)
            {
                // Each TR node is one table row; its children are the cells
                XmlNodeList xmlnodechildren = xmlnodelist[i].ChildNodes;
                int shanghaixmlnodechildrencount = xmlnodechildren.Count - tdendshanghai;
                DataRow dr = dt.NewRow();
                for (int j = tdstartshanghai; j < shanghaixmlnodechildrencount; j++)
                {
                    string ss = xmlnodechildren[j].InnerText;
                    // The cell position alone decides which column the text goes into
                    switch (j)
                    {
                        case 1:
                            dr["TradeDate"] = ss;
                            break;
                        case 2:
                            // The cell holds "name(id)"; split it into the two fields
                            ss = ss.Replace("&nbsp;", "");
                            string[] nameandid = ss.Split('(', ')');
                            dr["StockName"] = nameandid[0];
                            dr["StockID"] = nameandid[1];
                            break;
                        case 3:
                            dr["TradePrice"] = ss;
                            break;
                        case 4:
                            dr["TradeTotal"] = ss;
                            break;
                        case 5:
                            dr["TradeNum"] = ss;
                            break;
                        case 6:
                            dr["BuyDepartment"] = ss;
                            break;
                        case 7:
                            dr["SaleDepartment"] = ss;
                            break;
                        default:
                            break;
                    }
                }
                // A row created with NewRow() is not part of the table until it is added
                dt.Rows.Add(dr);
            }
            break;
........

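First, to its credit, this code does meet the current requirements. But it works by picking apart the page's source text, and that is a serious liability: it cannot withstand even a small change in the page. In fact it has no generality at all; it is effectively custom-built for one page.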

So this version was dropped.
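
To make the problem concrete, here is a minimal sketch of one way the hard-coded switch could be loosened: the position-to-column mapping is held in a table instead of in code. This is only an illustration under my own assumptions, not the code used in the project. The RowMapper class, its members, and the idea of passing the cells as a list of strings are all hypothetical; only the column names come from the listing above.

```csharp
using System;
using System.Collections.Generic;

public static class RowMapper
{
    // Hypothetical mapping from cell position to column name. In a real
    // system this would come from a per-site configuration file.
    public static readonly Dictionary<int, string> ShanghaiColumns = new Dictionary<int, string>
    {
        { 1, "TradeDate" },
        { 3, "TradePrice" },
        { 4, "TradeTotal" },
        { 5, "TradeNum" },
        { 6, "BuyDepartment" },
        { 7, "SaleDepartment" }
    };

    // Position of the combined "name(id)" cell (assumed from the listing above).
    public const int NameAndIdIndex = 2;

    // Turns one row of cell texts into column/value pairs.
    // Cells with no mapping entry are ignored.
    public static Dictionary<string, string> MapRow(
        IList<string> cells, IDictionary<int, string> columns, int nameAndIdIndex)
    {
        var row = new Dictionary<string, string>();
        for (int i = 0; i < cells.Count; i++)
        {
            // Unlike the original, "&nbsp;" is stripped from every cell here.
            string text = cells[i].Replace("&nbsp;", "").Trim();
            if (i == nameAndIdIndex)
            {
                // Split "name(id)" without assuming both parts are present.
                string[] parts = text.Split(new[] { '(', ')' }, StringSplitOptions.RemoveEmptyEntries);
                row["StockName"] = parts.Length > 0 ? parts[0] : "";
                row["StockID"] = parts.Length > 1 ? parts[1] : "";
            }
            else if (columns.TryGetValue(i, out string column))
            {
                row[column] = text;
            }
        }
        return row;
    }
}
```

With the mapping held as data, a page that adds or reorders columns needs a changed table entry rather than a changed switch statement. It still depends on the page keeping a row-and-cell layout, so it is only a partial answer to the fragility described above.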

 

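To find a new way of collecting the data, I asked a lot of people for advice. In the end I settled on an approach that I thought was decent.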

First, take a look at two configuration files:

config.xml

<?xml version="1.0" encoding="utf-8"?>
<systemconfig>
  <sqlserver>
    <DataSource>192.168.10.104</DataSource>
    <DataBase>InformationCenter</DataBase>
    <UserName>sa</UserName>
    <PassWord>sa</PassWord>
  </sqlserver>
  <LogSaveLongTime>81</LogSaveLongTime>
  <!-- Log retention period, in days -->
  <website>
    <WebSiteMessage>
      <WebSiteURL>http://www.sse.com.cn/sseportal/webapp/datapresent/SSELargeTradeInfoAct?CURSOR=1</WebSiteURL>
      <WebSitePeriod>1</WebSitePeriod>
      <GatherTime>2009-10-16 7:06:01</GatherTime>
      <GatherInterval></GatherInterval>
      <WebSiteName>上交所</WebSiteName>
      <!-- Path to the dataset configuration file -->
      <WebSitePath>Website/633906785473437500.xml</WebSitePath>
      <!-- Whether the number of tables matches -->
      <TableNum>27</TableNum>
      <!-- Whether collection is needed -->
      <NeedsGatDather>0</NeedsGatDather>
      <!-- Whether collection has completed -->
      <GatDatherComplete>1</GatDatherComplete>
    </WebSiteMessage>
    <WebSiteMessage>
      <WebSiteURL>http://www.szse.cn/main/disclosure/news/dzjy/</WebSiteURL>
      <WebSitePeriod>1</WebSitePeriod>
      <GatherTime>2009-10-16 7:00:00</GatherTime>
      <GatherInterval></GatherInterval>
      <WebSiteName>深交所</WebSiteName>
      <WebSitePath>Website/633906787124687500.xml</WebSitePath>
      <TableNum>48</TableNum>
      <NeedsGatDather>0</NeedsGatDather>
      <GatDatherComplete>1</GatDatherComplete>
    </WebSiteMessage>
    <WebSiteMessage>
      <WebSiteURL>http://www.sse.com.cn/sseportal/webapp/datapresent/SSELargeTradeInfoAct?CURSOR=1&amp;STARTDATE=2001-10-14&amp;ENDDATE=2009-10-14&amp;QUERYTYPE=1&amp;byear=2003&amp;bmonth=01&amp;bday=01&amp;eyear=2009&amp;emonth=10&amp;eday=14&amp;STOCKID=&amp;x=8&amp;y=6</WebSiteURL>
      <WebSitePeriod>1</WebSitePeriod>
      <GatherTime>2009-10-16 7:00:00</GatherTime>
      <GatherInterval></GatherInterval>
      <WebSiteName>上交所分页测试数据</WebSiteName>
      <WebSitePath>Website/633911151612968750.xml</WebSitePath>
      <TableNum>27</TableNum>
      <NeedsGatDather>0</NeedsGatDather>
      <GatDatherComplete>1</GatDatherComplete>
    </WebSiteMessage>
  </website>
</systemconfig>

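This system configuration file stores the following information:

the database connection settings, the system log settings, the websites to collect from, and each website's collection status and collection time.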
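
A file like this can be read with the standard System.Xml classes. The sketch below is my own illustration rather than the project's code: the WebSiteInfo and ConfigReader names are hypothetical, only the element names are taken from config.xml, and it assumes the file is well-formed XML.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical holder for one <WebSiteMessage> entry.
public class WebSiteInfo
{
    public string Url = "";
    public string Name = "";
    public string ConfigPath = "";
    public int TableNum;
}

public static class ConfigReader
{
    // Returns the trimmed text of a child element, or "" if it is missing.
    private static string Text(XmlNode parent, string childName)
    {
        XmlNode child = parent.SelectSingleNode(childName);
        return child == null ? "" : child.InnerText.Trim();
    }

    // Collects every website entry under /systemconfig/website.
    public static List<WebSiteInfo> ReadSites(XmlDocument doc)
    {
        var sites = new List<WebSiteInfo>();
        XmlNodeList nodes = doc.SelectNodes("/systemconfig/website/WebSiteMessage");
        foreach (XmlNode node in nodes)
        {
            var site = new WebSiteInfo();
            site.Url = Text(node, "WebSiteURL");
            site.Name = Text(node, "WebSiteName");
            site.ConfigPath = Text(node, "WebSitePath");
            int tableNum;
            int.TryParse(Text(node, "TableNum"), out tableNum); // stays 0 if not a number
            site.TableNum = tableNum;
            sites.Add(site);
        }
        return sites;
    }

    // Builds a SQL Server connection string from the <sqlserver> section.
    public static string BuildConnectionString(XmlDocument doc)
    {
        XmlNode sql = doc.SelectSingleNode("/systemconfig/sqlserver");
        if (sql == null)
            throw new InvalidOperationException("missing <sqlserver> section");
        return "Data Source=" + Text(sql, "DataSource")
             + ";Initial Catalog=" + Text(sql, "DataBase")
             + ";User ID=" + Text(sql, "UserName")
             + ";Password=" + Text(sql, "PassWord");
    }
}
```

Typical use would be to create an XmlDocument, call its Load method with the path to config.xml, and pass the document to both methods. The status fields (NeedsGatDather, GatDatherComplete) and GatherTime could be read the same way with the Text helper.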
