[Notes] Collecting circle user data from NetEase Blog with C#

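Sina Blog and NetEase Blog are both leading blog platforms. NetEase goes a step further by exposing circle information, so ordinary users can enter related groups through a circle or look up a circle's members, which is useful for targeted marketing. Sina keeps this kind of information hidden. By making circles public, NetEase lets more people and more programs work with them, which raises the blog's visibility and usefulness. NetEase Blog's circles can be reached at http://q.163.com/ and are organized into two levels of categories, as shown below.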

 

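Click a category and you will see that each subcategory contains many circles. A circle is similar to a QQ group: it is the blog of an interest group and collects a lot of related material, as shown below: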

 

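I am not concerned here with the study material in the circles. What interests me is how to collect their users. Since they are all NetEase users, each one has a NetEase account under 163.com, 163.net, yeah.net, 126.com and so on. Let's first look at how a circle's user information is displayed.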

 

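The figure above shows that a circle's members are presented as a list. Some circles have many members and some have few, but every member name links to a blog address. Because blog addresses correspond one-to-one with mail addresses, the mail address can be derived from the blog address, and that is the key information we are after.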

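The following program shows how to collect the circle categories, the circle data, and the circle user profiles. The test program looks like this: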

 

Let's look at the code behind the "Refresh category data" button. It fetches the main circle categories and the subcategories and then saves them:

    private void btnRefreshCategory_Click(object sender, EventArgs e)
    {
        string url = "http://q.163.com/";
        string mainTypeReg = "<div\\s*style=\"font-size:14px;\"><b><a\\s*?href=\"(?<value>.*?)\">(?<key>.*?)</a></b></div>";
        string subTypeReg = "<div\\s*class=\"left\"><a\\s*href=\"(?<value>.*?)\">(?<key>.*?)</a></div>";

        #region Get main categories

        httpHelper.Encoding = Encoding.Default;
        string content = httpHelper.GetHtml(url);
        Regex re = new Regex(mainTypeReg, RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
        Match mc = re.Match(content);
        Dictionary<string, string> typeDict = new Dictionary<string, string>();
        if (mc.Success)
        {
            MatchCollection mcs = re.Matches(content);
            foreach (Match me in mcs)
            {
                string strKey = me.Groups["key"].Value;
                string strValue = me.Groups["value"].Value;

                // Keep the leading part of the link as the main category identifier
                string newValue = strValue.TrimEnd('/');
                int eIndex = newValue.LastIndexOf('/');
                newValue = newValue.Substring(0, eIndex) + "/";

                if (!typeDict.ContainsKey(strKey))
                {
                    typeDict.Add(strKey, newValue);
                }
            }
        }

        #endregion

        #region Get subcategories

        Dictionary<string, CircleSubTypeInfo> circleDict = new Dictionary<string, CircleSubTypeInfo>();
        re = new Regex(subTypeReg, RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
        mc = re.Match(content);
        if (mc.Success)
        {
            MatchCollection mcs = re.Matches(content);
            foreach (Match me in mcs)
            {
                string strKey = me.Groups["key"].Value;
                string strValue = me.Groups["value"].Value;

                // Keep the leading part of the link as the main category identifier
                string typeValue = strValue.TrimEnd('/');
                int eIndex = typeValue.LastIndexOf('/');
                typeValue = typeValue.Substring(0, eIndex) + "/";

                if (!circleDict.ContainsKey(strKey))
                {
                    CircleSubTypeInfo info = new CircleSubTypeInfo();
                    info.Name = strKey;
                    info.LinkUrl = strValue;
                    info.TypeUrlValue = typeValue;

                    circleDict.Add(strKey, info);
                }
            }
        }

        #endregion

        #region Save data

        Database db = DatabaseFactory.CreateDatabase();
        DbCommand command = null;
        string sql = "";

        foreach (string key in typeDict.Keys)
        {
            sql = string.Format("Insert into CircleType(TypeName, TypeValue) values('{0}', '{1}')", key, typeDict[key]);
            command = db.GetSqlStringCommand(sql);
            db.ExecuteNonQuery(command);
        }

        foreach (string key in circleDict.Keys)
        {
            CircleSubTypeInfo info = circleDict[key];
            sql = string.Format("Insert into CircleSubType(SubTypeName, LinkUrl, TypeUrlValue) values('{0}', '{1}', '{2}')", info.Name, info.LinkUrl, info.TypeUrlValue);
            command = db.GetSqlStringCommand(sql);
            db.ExecuteNonQuery(command);
        }

        #endregion

        this.lblTips.Text = "Category refresh complete";
    }



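The code mainly uses regular expressions to process the fetched page, then extracts the category data and stores it in the database in preparation for fetching the circle users.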
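The extraction step can be tried in isolation. The following is a minimal sketch, not the article's program: the class name `CategoryParser`, the simplified link pattern, and the sample HTML are assumptions made for illustration, and a guard is added for links that contain no slash.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CategoryParser
{
    // Simplified version of the article's patterns (assumption): the link
    // target is captured as "value" and the link text as "key".
    private static readonly Regex LinkRegex = new Regex(
        "<a\\s+href=\"(?<value>.*?)\">(?<key>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Drops the last path segment, as the article's code does:
    // "/mapCircleList/2/11/" becomes "/mapCircleList/2/".
    public static string ParentPath(string link)
    {
        string trimmed = link.TrimEnd('/');
        int index = trimmed.LastIndexOf('/');
        // Guard added in this sketch: without it Substring(0, -1) would throw.
        return index < 0 ? trimmed : trimmed.Substring(0, index) + "/";
    }

    // Returns category name -> parent path; the first occurrence of a name wins.
    public static Dictionary<string, string> Parse(string html)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in LinkRegex.Matches(html))
        {
            string key = m.Groups["key"].Value;
            if (!result.ContainsKey(key))
            {
                result.Add(key, ParentPath(m.Groups["value"].Value));
            }
        }
        return result;
    }
}
```

Calling `CategoryParser.Parse` on a string such as `<a href="/mapCircleList/2/11/">Music</a>` yields one entry that maps `Music` to `/mapCircleList/2/`.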

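With the category information in hand, the second step is to fetch the circle data, because a circle's users can only be retrieved through the circle's unique ID. This step is required, and it is fairly involved because many parameters have to be assembled. Part of the code follows.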

    foreach (string key in urlDict.Keys)
    {
        string keyNumberReg = "/mapCircleList/(?<d1>[1-9]\\d*)/(?<d2>[1-9]\\d*)*/(?<d3>[1-9]\\d*)/";
        Regex re = new Regex(keyNumberReg, RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

        LogTextHelper.WriteLine(string.Format("Processing type: {0}", urlDict[key]));
        cookie = new System.Net.CookieContainer();

        string urlKey = key;
        Match mc = re.Match(urlKey);
        string d1 = mc.Groups["d1"].Value;
        string d2 = mc.Groups["d2"].Value;
        string d3 = mc.Groups["d3"].Value;
        int pageSize = 30;

        urlKey = urlKey.Trim('/'); // Strip the leading and trailing '/'
        string url = "http://q.163.com/dwr/call/plaincall/CircleMainpageBean.getCircleByType2IdInMemberOrder.dwr";
        // string refUrl = "http://q.163.com/mapCircleList/2/11/48/?fromCircleCircleMap";
        string refUrl = string.Format("http://q.163.com/{0}/?fromCircleCircleMap", urlKey);

        #region Content regular expression
        StringBuilder circleReg = new StringBuilder();
        circleReg.Append("s[0-9]\\d*.circleId=(?<circleId>[0-9]\\d*[^;])");
        circleReg.Append(".*?s[0-9]\\d*.circleType1Str=\"(?<circleType1Str>.*?)\"");
        circleReg.Append(".*?s[0-9]\\d*.circleType2Str=\"(?<circleType2Str>.*?)\"");
        circleReg.Append(".*?s[0-9]\\d*.createDateStr=\"(?<createDateStr>.*?)\"");
        circleReg.Append(".*?s[0-9]\\d*.creatorId=(?<creatorId>[0-9]\\d*[^;])");
        circleReg.Append(".*?s[0-9]\\d*.creatorName=\"(?<creatorName>.*?)\"");
        circleReg.Append(".*?s[0-9]\\d*.creatorSpaceUrl=\"(?<creatorSpaceUrl>.*?)\"");
        circleReg.Append(".*?s[0-9]\\d*.description=\"(?<description>.*?)\"");
        circleReg.Append(".*?s[0-9]\\d*.joinDeclaration=\"(?<joinDeclaration>.*?)\"");
        circleReg.Append(".*?s[0-9]\\d*.linkImgUrl=\"(?<linkImgUrl>.*?)\"");
        circleReg.Append(".*?s[0-9]\\d*.memberNum=(?<memberNum>[0-9]\\d*[^;])");
        circleReg.Append(".*?s[0-9]\\d*.name=\"(?<name>.*?)\"");
        circleReg.Append(".*?s[0-9]\\d*.urlName=\"(?<urlName>.*?)\"");
        circleReg.Append(".*?s[0-9]\\d*.visitNum=(?<visitNum>[0-9]\\d*[^;])");
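The pattern above chains one lazy `.*?` segment per field so that each match covers one record of the response. Below is a reduced sketch of the same idea with three fields only. The sample response text, the `CircleRow` type, and the tightened `\d+` captures are assumptions for illustration. The article's own pattern uses `[0-9]\d*[^;]` and an unescaped `.`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for the three fields parsed in this sketch.
public sealed class CircleRow
{
    public string CircleId;
    public int MemberNum;
    public string Name;
}

public static class DwrResponseParser
{
    // One match per record: each field is reached through a lazy ".*?" skip.
    private static readonly Regex RowRegex = new Regex(
        "s\\d+\\.circleId=(?<circleId>\\d+);" +
        ".*?s\\d+\\.memberNum=(?<memberNum>\\d+);" +
        ".*?s\\d+\\.name=\"(?<name>.*?)\"",
        RegexOptions.Singleline);

    public static List<CircleRow> Parse(string content)
    {
        var rows = new List<CircleRow>();
        foreach (Match m in RowRegex.Matches(content))
        {
            int memberNum;
            // TryParse replaces the try/catch around Convert.ToInt32.
            int.TryParse(m.Groups["memberNum"].Value, out memberNum);
            rows.Add(new CircleRow
            {
                CircleId = m.Groups["circleId"].Value,
                MemberNum = memberNum,
                Name = m.Groups["name"].Value
            });
        }
        return rows;
    }
}
```

One caveat of this style: if a record lacks one of the fields, the lazy skip can run on into the next record, so the fields must appear in a fixed order in every record.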


After assembling the parameters, fetch the page data and analyze it. The main code is:

            if (mc.Success)
            {
                string message = string.Format("Processing type {0}: {1}, batch {2}, {3} processed so far", urlDict[key], url, i + 1, j);
                CallCtrlWithThreadSafety.SetText(this.lblTips, message, this);
                Application.DoEvents();
                Thread.Sleep(10);

                MatchCollection mcs = re.Matches(content);
                foreach (Match me in mcs)
                {
                    #region MyRegion
                    j++;

                    int memberNum = 0;
                    try
                    {
                        memberNum = Convert.ToInt32(me.Groups["memberNum"].Value);
                    }
                    catch { }

                    // Stop at the first circle with fewer than 50 members
                    if (memberNum < 50)
                    {
                        flag = false;
                        break;
                    }

                    sql = string.Format(@"insert into Circle(circleId,circleType1Str,circleType2Str,createDateStr,creatorId,
                    creatorName,creatorSpaceUrl,description,joinDeclaration,linkImgUrl,memberNum,name2,urlName,SubTypeName)
                    values('{0}','{1}','{2}','{3}','{4}','{5}','{6}','{7}','{8}','{9}','{10}','{11}','{12}','{13}')",
                        me.Groups["circleId"].Value,
                        UnicodeHelper.UnicodeToString(me.Groups["circleType1Str"].Value.Replace("'", "")),
                        UnicodeHelper.UnicodeToString(me.Groups["circleType2Str"].Value.Replace("'", "")),
                        me.Groups["createDateStr"].Value,
                        me.Groups["creatorId"].Value,
                        UnicodeHelper.UnicodeToString(me.Groups["creatorName"].Value),
                        me.Groups["creatorSpaceUrl"].Value,
                        UnicodeHelper.UnicodeToString(me.Groups["description"].Value.Replace("'", "")),
                        UnicodeHelper.UnicodeToString(me.Groups["joinDeclaration"].Value.Replace("'", "")),
                        me.Groups["linkImgUrl"].Value,
                        me.Groups["memberNum"].Value,
                        UnicodeHelper.UnicodeToString(me.Groups["name"].Value.Replace("'", "")),
                        me.Groups["urlName"].Value,
                        urlDict[key]);
                    command = db.GetSqlStringCommand(sql);
                    try
                    {
                        db.ExecuteNonQuery(command);
                    }
                    catch (Exception ex)
                    {
                        LogTextHelper.WriteLine(sql);
                        LogTextHelper.WriteLine(ex.ToString());
                    }

                    message = string.Format("Processing {0}: {1}, writing record {2}", urlDict[key], url, j);
                    CallCtrlWithThreadSafety.SetText(this.lblTips, message, this);
                    Application.DoEvents();
                    Thread.Sleep(10);

                    #endregion
                }
            }
            else
            {
                flag = false; // No match, so stop
                break;
            }

        }

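The listing above passes several captured values through `UnicodeHelper.UnicodeToString`, whose source is not shown in the article. Assuming it turns `\uXXXX` escape sequences in the response into the characters they stand for, a helper of that kind could be sketched as follows. The class name `UnicodeEscapes` is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeEscapes
{
    // Matches a literal backslash, 'u', and exactly four hex digits.
    private static readonly Regex Escape = new Regex(@"\\u([0-9a-fA-F]{4})");

    // Replaces every \uXXXX sequence with the UTF-16 code unit it encodes;
    // text without escapes is returned unchanged.
    public static string Decode(string text)
    {
        return Escape.Replace(text, delegate(Match m)
        {
            char decoded = (char)Convert.ToInt32(m.Groups[1].Value, 16);
            return decoded.ToString();
        });
    }
}
```

Decoding each escape on its own also handles surrogate pairs, because the two halves are emitted as adjacent UTF-16 code units.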

 

Fetching the circle user information is also fairly involved and needs even more parameters to be assembled. The main part of the implementation is:

    httpHelper = new HttpHelper();
    httpHelper.Encoding = Encoding.Default;
    cookie = new CookieContainer();
    Regex re = null;
    Match mc = null;
    int pageSize = 30;
    string url = "http://q.163.com/dwr/call/plaincall/CircleBean.getNewCircleUsers.dwr";

    foreach (string key in circlelDict.Keys)
    {
        string circleId = key;
        string urlName = circlelDict[key];
        string refUrl = string.Format("http://q.163.com/{0}/members/", urlName);

        #region Content regular expression
        StringBuilder circleReg = new StringBuilder();
        circleReg.Append("s[0-9]\\d*.ageStr=\"(?<ageStr>.*?)\"");
        circleReg.Append(".*?s[0-9]\\d*.city=\"(?<city>.*?[^;])\"");
        circleReg.Append(".*?s[0-9]\\d*.hometownCity=(\"(?<hometownCity>.*?[^;])\"|(?<hometownCity>null))");
        circleReg.Append(".*?s[0-9]\\d*.hometownProvince=(\"(?<hometownProvince>.*?)\"|(?<hometownProvince>null))");
        circleReg.Append(".*?s[0-9]\\d*.name=(\"(?<name>.*?)\"|(?<name>null))");
        circleReg.Append(".*?s[0-9]\\d*.nickname=(\"(?<nickname>.*?)\"|(?<nickname>null))");
        circleReg.Append(".*?s[0-9]\\d*.profileImage140=\"(?<profileImage140>.*?)\"");
        circleReg.Append(".*?s[0-9]\\d*.profileImage60=\"(?<profileImage60>.*?)\"");
        circleReg.Append(".*?s[0-9]\\d*.province=\"(?<province>.*?)\"");
        circleReg.Append(".*?s[0-9]\\d*.qq=(\"(?<qq>.*?)\"|(?<qq>null))");
        circleReg.Append(".*?s[0-9]\\d*.realName=(\"(?<realName>.*?)\"|(?<realName>null))");
        circleReg.Append(".*?s[0-9]\\d*.spaceName=(\"(?<spaceName>.*?)\"|(?<spaceName>null))");
        circleReg.Append(".*?s[0-9]\\d*.userId=(?<userId>[0-9]\\d*[^;])");
        circleReg.Append(".*?s[0-9]\\d*.userName=\"(?<userName>.*?)\"");
        #endregion

        bool flag = true;
        int i = 0;
        int j = 0;
        List<CircleMemberInfo> entityList = new List<CircleMemberInfo>();
        while (flag)
        {
            #region Build the request parameters
            StringBuilder sb = new StringBuilder();
            sb.AppendFormat("callCount=1");
            sb.AppendFormat("&page=/{0}/members/", urlName);
            sb.AppendFormat("&httpSessionId=");
            sb.AppendFormat("&scriptSessionId=D4DAC4AD9C3BF9B71C82802BDDBA0C25369");
            sb.AppendFormat("&c0-scriptName=CircleBean");
            sb.AppendFormat("&c0-methodName=getNewCircleUsers");
            sb.AppendFormat("&c0-id=0"); // Reserved
            sb.AppendFormat("&c0-param0=number:{0}", circleId); // Circle ID, e.g. 11
            sb.AppendFormat("&c0-param1=number:{0}", pageSize); // Records per request
            sb.AppendFormat("&c0-param2=number:{0}", pageSize * i); // Offset: 0, 30, 60, ...
            sb.AppendFormat("&c0-param3=boolean:true");
            sb.AppendFormat("&batchId={0}", i);

            i++;

            #endregion

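The paging in the listing above comes down to one rule: the third parameter is `pageSize * i`, so successive requests ask for offsets 0, 30, 60 and so on. The sketch below factors that into a function so the rule can be checked on its own. The class and method names are hypothetical, and the script session ID is turned into a parameter instead of the hard-coded value used in the article. The field names and order are copied from the listing.

```csharp
using System;
using System.Text;

public static class DwrRequest
{
    // Builds the request body for one page of circle members.
    // pageIndex is zero-based; the offset sent to the server is pageSize * pageIndex.
    public static string BuildMemberQuery(string circleId, string urlName,
        int pageSize, int pageIndex, string scriptSessionId)
    {
        var sb = new StringBuilder();
        sb.Append("callCount=1");
        sb.AppendFormat("&page=/{0}/members/", urlName);
        sb.Append("&httpSessionId=");
        sb.AppendFormat("&scriptSessionId={0}", scriptSessionId);
        sb.Append("&c0-scriptName=CircleBean");
        sb.Append("&c0-methodName=getNewCircleUsers");
        sb.Append("&c0-id=0");
        sb.AppendFormat("&c0-param0=number:{0}", circleId);
        sb.AppendFormat("&c0-param1=number:{0}", pageSize);
        sb.AppendFormat("&c0-param2=number:{0}", pageSize * pageIndex);
        sb.Append("&c0-param3=boolean:true");
        sb.AppendFormat("&batchId={0}", pageIndex);
        return sb.ToString();
    }
}
```

For example, `DwrRequest.BuildMemberQuery("11", "photo", 30, 2, "ABC")` produces a body that contains `&c0-param2=number:60` and ends with `&batchId=2`.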

Then we fetch the page data in code:

            string content = "";
            try
            {
                httpHelper.ContentType = "text/plain";
                content = httpHelper.GetHtml(url, cookie, sb.ToString(), true, refUrl);
                re = new Regex(circleReg.ToString(), RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
                mc = re.Match(content);
            }
            catch (Exception ex)
            {
                LogTextHelper.WriteLine(ex.ToString());
                break;
            }

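`httpHelper.GetHtml` is the author's own helper and its source is not shown. Judging from the call, it posts a text body with a cookie container and a referer. As a rough sketch of how such a request could be assembled with the standard `System.Net.Http` types, with the class name `DwrHttp` and the choice of UTF-8 being assumptions (the article sets `Encoding.Default`):

```csharp
using System;
using System.Net.Http;
using System.Text;

public static class DwrHttp
{
    // Builds a POST request carrying a text/plain body and a Referer header.
    // Nothing is sent here; the caller decides how to send it.
    public static HttpRequestMessage CreatePost(string url, string body, string referer)
    {
        var request = new HttpRequestMessage(HttpMethod.Post, url);
        request.Headers.Referrer = new Uri(referer);
        request.Content = new StringContent(body, Encoding.UTF8, "text/plain");
        return request;
    }
}
```

To send it, the request can be passed to `HttpClient.SendAsync`. An `HttpClient` built on an `HttpClientHandler` whose `CookieContainer` property is set keeps cookies between calls, which corresponds to the `cookie` argument in the listing.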


Then we use the regular expression to analyze the returned data so that it can be displayed or added to the database for later use:

                MatchCollection mcs = re.Matches(content);
                foreach (Match me in mcs)
                {
                    #region MyRegion
                    j++;

                    sql = string.Format(@"insert into CircleMember(userId,userName,realName,nickname,circleId)
                    values('{0}','{1}','{2}','{3}','{4}')",
                        me.Groups["userId"].Value,
                        httpHelper.RemoveHtml(UnicodeHelper.UnicodeToString(me.Groups["userName"].Value)),
                        httpHelper.RemoveHtml(UnicodeHelper.UnicodeToString(me.Groups["realName"].Value.Replace("'", ""))),
                        httpHelper.RemoveHtml(UnicodeHelper.UnicodeToString(me.Groups["nickname"].Value.Replace("'", ""))),
                        circleId);
                    command = db.GetSqlStringCommand(sql);
                    try
                    {
                        db.ExecuteNonQuery(command);
                    }
                    catch (Exception ex)
                    {
                        LogTextHelper.WriteLine(sql);
                        LogTextHelper.WriteLine(ex.ToString());
                    }

                    message = string.Format("Processing {0}, writing record {1}", urlName, j);
                    CallCtrlWithThreadSafety.SetText(this.lblTips, message, this);
                    Application.DoEvents();
                    Thread.Sleep(10);

                    #endregion
                }

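The listings above share one control flow: request page after page and stop as soon as a page yields nothing. That loop can be separated from the HTTP and database details. The sketch below does so with delegates. The names `Pager` and `FetchAll`, and the `maxPages` cap, are additions for illustration and do not appear in the article.

```csharp
using System;
using System.Collections.Generic;

public static class Pager
{
    // Calls fetchPage(0), fetchPage(1), ... and hands every row to handle.
    // Stops at the first empty (or null) page, or after maxPages pages.
    // Returns the number of rows handled.
    public static int FetchAll<T>(Func<int, IList<T>> fetchPage, Action<T> handle, int maxPages)
    {
        int total = 0;
        for (int page = 0; page < maxPages; page++)
        {
            IList<T> rows = fetchPage(page);
            if (rows == null || rows.Count == 0)
            {
                break; // Same role as setting flag = false in the listings
            }
            foreach (T row in rows)
            {
                handle(row);
                total++;
            }
        }
        return total;
    }
}
```

In the article's terms, `fetchPage` would post the request for page `i` and return the regex matches, and `handle` would run the insert for one match. The page cap guards against a server that never returns an empty page.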


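That is the complete data collection process: fetching the main circle categories, the subcategories, and the circle information, and finally the circle users, with a fairly detailed look at the regular expressions used for each kind of data. If you have a similar application, you can refer to the following software, which includes mail collection from NetEase Blog circles, as shown below.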

 


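Besides collecting the mail addresses of NetEase Blog circle users to supply material for marketing and promotion, the software can also use NetEase Blog's Find Friends module to obtain related user data, which can of course also be converted into mail addresses, as shown below.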

 

 

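That covers the code for this NetEase Blog application and introduces a fairly comprehensive software product. I hope it gives you some ideas and useful knowledge.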

