I've been asked this question many times, so it seems worth a short write-up. Here is a brief introduction to building a search-engine crawler and the basic points to watch out for.
Put simply, a web crawler is much like the "offline reading" tools you may have used. Despite the name "offline", it still has to connect to the network; otherwise how would it fetch anything?
So where are the differences?
1】 A web crawler is highly configurable.
2】 A web crawler can parse the links in the pages it fetches.
3】 A web crawler has a simple storage configuration.
4】 A web crawler can intelligently analyze when pages are updated.
5】 A web crawler is highly efficient.
Given these characteristics, which are really requirements, how do you design a crawler? What steps need attention?
1】 URL traversal and record-keeping
larbin does this very well. Traversing the URLs themselves is actually simple, for example:
cat [what you got] | tr '"' '\n' | gawk '{print $2}' | pcregrep ^http://
gives you a list of all the URLs.
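In the same spirit as that one-liner, here is a rough Python sketch of pulling `http://` links out of a fetched page. It is regex-based, so it is a simplification of what a real HTML parser does, and the function name is mine, not from the original:

```python
import re

def extract_urls(html):
    # Same idea as the tr/gawk/pcregrep pipeline: scan the markup
    # and keep only absolute http:// links found in href attributes.
    return re.findall(r'href\s*=\s*["\']?(http://[^"\'\s>]+)', html, re.I)

page = '<a href="http://booso.com/a">a</a> <a href="/local">b</a>'
print(extract_urls(page))
```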
2】 Multi-process vs. multi-threaded
Each has its advantages. Either way, an ordinary PC today, such as the one behind booso.com, can easily crawl about 5 GB of data a day, roughly 200,000 pages.
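Both models work because crawling is I/O-bound: workers spend most of their time waiting on the network. A minimal thread-pool sketch in Python; the `fetch` here is a placeholder, not a real HTTP client:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # stand-in for a real HTTP GET; while one worker waits on the
    # network, the other threads keep fetching
    return f"<html>{url}</html>"

urls = [f"http://example.com/page{i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, urls))
print(len(pages))
```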
3】 Controlling update timing
The dumbest approach is to have no update weighting at all: crawl everything in one sweep, then turn around and crawl everything again.
Normally you compare each crawl's data against the previous crawl. If a page has not changed for 5 consecutive crawls, double its crawl interval; if a page has been updated on 5 consecutive crawls, cut its crawl interval in half.
Remember, efficiency is one of the keys to winning.
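The doubling/halving rule above can be sketched like this. The helper name and the list-of-booleans representation are my own; the 5-crawl window is the threshold the text describes:

```python
def update_interval(interval, history):
    # history holds recent crawl results, True meaning "page changed"
    if len(history) >= 5:
        last5 = history[-5:]
        if not any(last5):   # unchanged 5 times in a row: back off
            return interval * 2
        if all(last5):       # changed every time: revisit sooner
            return interval / 2
    return interval

print(update_interval(24.0, [False] * 5))   # 48.0
print(update_interval(24.0, [True] * 5))    # 12.0
```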
4】 How deep should you crawl?
It depends. If you are well off enough to have tens of thousands of servers running your crawler, feel free to skip this point.
If, like me, you have only one server, then you should know these statistics:

depth : pages : importance
0     : 1     : 10
1     : 20    : 8
2     : 600   : 5
3     : 2000  : 2
4+    : 6000  : generally not worth computing

So crawling three levels deep is about right. Going deeper both multiplies the data volume three- or four-fold and sharply lowers the importance of what you collect; as the saying goes, "sow dragon seeds, harvest fleas."
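A depth-limited breadth-first crawl captures this "stop at level 3" policy. In this sketch the link graph is an in-memory dict standing in for real fetches:

```python
from collections import deque

def crawl_bfs(start, links, max_depth=3):
    # links maps a url to the urls found on that page
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # don't expand pages at the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

site = {"/": ["/a", "/b"], "/a": ["/a/x"], "/a/x": ["/deep"]}
print(sorted(crawl_bfs("/", site, max_depth=2)))
```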
5】 A crawler generally does not fetch the target site directly; it usually goes out through a proxy, which helps relieve load. When the target page has not been updated, fetching just the response headers is enough; there is no need to transfer the whole page again, which saves a great deal of bandwidth.
The 304 responses recorded in an Apache web server's logs are generally pages that were served from cache.
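The header-only trick relies on HTTP conditional requests: resend the validators from the last fetch, and the server answers "304 Not Modified" with no body if nothing changed. A sketch of building those headers; the helper name is mine, not from the original:

```python
def conditional_headers(etag=None, last_modified=None):
    # Validators saved from the previous fetch of this url; a server
    # that still holds the same version replies 304 with an empty body.
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

print(conditional_headers(etag='"abc123"',
                          last_modified="Tue, 01 Jan 2008 00:00:00 GMT"))
```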
6】 When you have a moment, pay attention to robots.txt.
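Python's standard library ships a robots.txt parser, which shows the check a polite crawler should make before every fetch:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# normally: rp.set_url("http://example.com/robots.txt"); rp.read()
# parse inline here so the example needs no network
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("mybot", "http://example.com/private/x"))
print(rp.can_fetch("mybot", "http://example.com/public"))
```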
7】 Storage structure.
Opinions differ here. Google uses GFS. If you have 7 or 8 servers, I'd suggest NFS; if you have 70 or 80 servers, use AFS; if you have only one server, anything goes.
Here is a code fragment showing how the news search engine I wrote stores its data:
NAME=`echo $URL | perl -p -e 's/([^\w\-\.\@])/$1 eq "\n" ? "\n" : sprintf("%%%2.2x",ord($1))/eg'`
mkdir -p $AUTHOR
newscrawl.pl $URL --user-agent="news.booso.com+(+http://booso.com)" -outfile=$AUTHOR/$NAME
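Python's `urllib.parse.quote` does essentially what the perl one-liner does: percent-encode every character outside a safe set so a URL can double as a filename. The function name here is mine:

```python
from urllib.parse import quote

def url_to_filename(url):
    # keep letters, digits and -_.@ (as in the perl character class);
    # everything else becomes %XX
    return quote(url, safe="-_.@")

print(url_to_filename("http://booso.com/a b"))
```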
<%
' BSD 2.0 license,
' http://www.opensource.org/licenses/bsd-license.php
'
'If you repost or modify this, please keep the information (email address and
'site name) of the bug reporters and bug fixers, unless they request otherwise.
'You can test the code at http://www.vtalkback.com/site-map
'version 0.1.2
'------------------------------variable initialization-----------------------------------------------------------
ver="0.1.2"
'script configuration
'debug =0
'Response.CharSet="gb2312";
'current_charset="utf-8"
current_charset="gb2312" 'must be lowercase
'Url="http://www.vtalkback.com "
'Url="http://www.jwmodel.com "
Url=request("url")
Url=trim(url)
if right(Url,1)="/" then
Url=left(url,len(url)-1)
end if
first_page=Url
'response.write first_page& " "& url&" "
'response.flush
'first_page=""
none_http_url=right(url,len(url)-len("http://")) 'url without the http:// prefix
root_url_len=instr(none_http_url,"/")
if(root_url_len=0) then
root_url_len=len(none_http_url)
end if
root_url="http://" & left(none_http_url,root_url_len) 'strip the trailing '/'
if right(root_url,1)="/" then
root_url=left(root_url,len(root_url)-1)
end if
'response.write root_url & " <br>"
'response.flush
str_depth = request("url_depth")
FinalDepth=CInt(str_depth)
'---------------Depth limit----------------------
'if FinalDepth>2 then
' FinalDepth=2
'end if
'FinalDepth=1
'response.write "str_depth =" & str_depth & " <br>"
'response.flush
'------------------------------------------------
LimitUrl=1000
'leave sitemapDate empty if you want sitemapDate=now
sitemapDate=""
'sitemapPriority possible value are from 0.1 to 1.0
sitemapPriority="0.7"
'sitemapChangefreq possible value are: always, hourly, daily, weekly, monthly, yearly, never
sitemapChangefreq="monthly"
'see http://www.time.gov/ for utcOffset
utcOffset=1
Dim objRegExp,objUrlArchive,strHTML,objMatch,crawledUrlArchive,BytesStream,CharsetRegExp,CharsetUrlArchive,oHttp
Set oHttp=Server.CreateObject("WinHttp.WinHttpRequest.5.1")
Server.ScriptTimeout=300
set xmlhttp = CreateObject("MSXML2.ServerXMLHTTP")
Set BytesStream = Server.CreateObject("ADODB.Stream")
Set objUrlArchive=Server.CreateObject("Scripting.Dictionary")
Set crawledUrlArchive=Server.CreateObject("Scripting.Dictionary")
Set CharsetUrlArchive=Server.CreateObject("Scripting.Dictionary")
Set objRegExp = New RegExp
objRegExp.IgnoreCase = True
objRegExp.Global = True
'you can change these patterns for your needs
'objRegExp.Pattern = "href=(.*?)[\s|>]"
'objRegExp.Pattern = "<!--(.*?)-->|<(\s*)a(\s*)href=(.*?)[" & chr(34) & "](.*?)" & "[" & chr(34)&"]"
objRegExp.Pattern = "<!--(.*?)-->|<(\s*)a(\s*)href=(.*?)[\s|>]"
Set CharsetRegExp = New RegExp
CharsetRegExp.IgnoreCase = True
CharsetRegExp.Global = True
CharsetRegExp.Pattern = "<META(.*?)Content-Type(.*?)>"
'to remove elements from html urls
RemoveText=array("<",">","a href=",chr(34),"'","href=")
'to exclude elements from urls
ExcludeUrl=array("mailto:","javascript:",".css",".ico","file:")
'if you want sitemapDate=now
if sitemapDate="" then filelmdate=now()
sitemapDate=iso8601date(filelmdate,utcOffset)
'------------------------------</variable initialization>-----------------------------------------------------------
crawl first_page,1
For Depth=0 to FinalDepth-1
arrUrl=objUrlArchive.Keys
arrDepth=objUrlArchive.Items
For LoopUrl= 0 to ubound(arrurl)
willCrawlUrl=url&"/"&arrUrl(LoopUrl)
willCrawldepth=arrDepth(LoopUrl)
' response.write "willCraw="& willCrawlUrl &" depth="&willCrawldepth&" <br>"
' response.flush
if crawledUrlArchive.Exists(willCrawlUrl)=false and willCrawldepth < FinalDepth then
' response.write "Craw="& willCrawlUrl &" depth="&willCrawldepth&" <br>"
' response.flush
crawledUrlArchive.add willCrawlUrl,1
'if ubound(arrurl)>max_url_count then
' Exit For
'end if
'debugging
'response.write "<!-- pagefound='"&loopurl&"'-->"
crawl willCrawlUrl,willCrawldepth+1
if objUrlArchive.Count-1>LimitUrl then exit for 'if you want to limit the url number
end if
Next
erase arrUrl
erase arrDepth
Next
' create the xml on the fly
'arrDepth=objUrlArchive.Items
'response.write "<textarea rows=" &chr(34)& "93" &chr(34)& "name=" &chr(34)& "S1" &chr(34)& "cols=" &chr(34)& "138" &chr(34)& ">"
writeHead 'write the file header
arrCharset=CharsetUrlArchive.items
arrurl=objUrlArchive.Keys
For LoopUrl=0 to ubound(arrurl)
' response.write "<loc>"&server.htmlencode(url&"/"&arrUrl(LoopUrl))&"</loc>" 'output the url
' response.write "<loc>&&&&"&server.urlEncode(url&"/"&arrUrl(LoopUrl))&"</loc>" 'output the url
cur_charset=arrCharset(LoopUrl)
cur_url=arrUrl(LoopUrl)
writelink cur_url,cur_charset
Next
response.write Chr(13) & Chr(10)
response.write "</urlset>"
response.write Chr(13) & Chr(10)
'response.write "</textarea>"
'arrUrl=objUrlArchive.Keys
'response.write "<!-- pagefound='"&ubound(arrurl)+2&"'--> "
'---------------------------<environment cleanup>-----------------------------------------------
erase arrUrl
erase arrCharset
'erase arrDepth
objUrlArchive.RemoveAll()
crawledUrlArchive.RemoveAll()
CharsetUrlArchive.RemoveAll()
Set BytesStream = Nothing
set xmlhttp = nothing
set oHttp = nothing
'---------------------------</environment cleanup>-----------------------------------------------
'*************************************************************************************************************
function writeHead()
response.ContentType = "text/xml; charset=gb2312"
response.write "<?xml version='1.0' encoding='gb2312'?>"
response.write Chr(13) & Chr(10)
response.write "<!-- generator='http://www.vtalkback.com/sitemap/' ver='" &ver &"'-->"
response.write Chr(13) & Chr(10)
response.write "<!-- pagefound='"&ubound(objUrlArchive.Keys)+2&"'--> "
response.write Chr(13) & Chr(10)
response.write "<urlset xmlns='http://www.google.com/schemas/sitemap/0.84' >"
response.write Chr(13) & Chr(10)
response.write "<url>"
response.write Chr(13) & Chr(10)
response.write "<loc>"&url&"/</loc>"
response.write Chr(13) & Chr(10)
response.write "<lastmod>"&sitemapDate&"</lastmod>"
response.write Chr(13) & Chr(10)
response.write "<priority>"&sitemapPriority&"</priority>"
response.write Chr(13) & Chr(10)
response.write "<changefreq>"&sitemapChangefreq&"</changefreq>"
response.write Chr(13) & Chr(10)
response.write "</url>"
response.write Chr(13) & Chr(10)
end function
'*************************************************************************************************************
Function writeLink(write_url,write_charset)
' response.write "<loc>"&write_charset&"</loc>"
response.write "<url>"
response.write Chr(13) & Chr(10)
if write_charset="gb2312" then
' write_url=num2gb(write_url)
write_url=urldecode(write_url)
' response.write write_charset
else 'utf8
' response.write write_charset&"aa"
write_url=urldecode(write_url) 'new_url=url2utf(cur_url) new_url=num2gb(cur_url) new_url=gb2utf(new_url)
end if
' write_url=replace(write_url,"&",GB2UTF8("&"))
response.write "<loc>"&url&"/"&write_url&"</loc>" 'output the url
response.write Chr(13) & Chr(10)
'-------------------------------------------------------------------------------------
response.write "<lastmod>"&sitemapDate&"</lastmod>"
response.write Chr(13) & Chr(10)
response.write "<priority>"&sitemapPriority&"</priority>"
response.write Chr(13) & Chr(10)
response.write "<changefreq>"&sitemapChangefreq&"</changefreq>"
response.write Chr(13) & Chr(10)
response.write "</url>"
response.write Chr(13) & Chr(10)
end function
'***********************************crawl**************************************************************************
Function crawl(sub_url,crawl_depth)
sub_url=urldecode(sub_url)
'----------------<sub_url handling>-------------------------------------------
'--------------bug: redirects were not handled---reported by (http://www.xiaoyezi.com )---------------------------------
'----------------patched by (http://www.vtalkback.com , calldefine@sohu.com)----------------------------------------
' sub_url1=sub_url
' response.write "sub_url="& sub_url&"<br> "
sub_url=GetAbsoluteURL(sub_url,1) 'resolve the url after any redirect
' if sub_url1<>sub_url then
' response.write sub_url1&" "&sub_url
' response.flush
' end if
'------------</sub_url handling>-------------------------------------------
'------------<sub_dir handling>-------------------------------------------
' response.write "sub_url="& sub_url&"<br> "&len(sub_url)&" "& len(url) &"</br>"
' response.flush
sub_dir=right(sub_url,len(sub_url) - len(url)) ' http://www.vtalkback.com/blog -> blog
' response.write "sub_dir="& sub_dir&" <br>"
if instr(sub_dir,"?")>0 then
sub_dir=left(sub_dir,InStr(sub_dir,"?")-1)
end if
if instr(sub_dir,".")>0 or instr(sub_dir,"?")>0 or instr(sub_dir,"=")>0 or instr(sub_dir,"#")>0 then
sub_dir=left(sub_dir,InStrRev(sub_dir,"/"))
end if
if sub_dir<>"" and right(sub_dir ,1)<>"/" then
sub_dir=sub_dir&"/"
end if
' response.write "sub_url="& sub_url&" <br>"
' response.flush
'------------</sub_dir handling>----------------------------------------------------
' response.write "sub_url="& sub_url&" <br>"
' response.flush
xmlhttp.open "GET", sub_url, false
xmlhttp.send ""
'------------------------------------<read the page charset>------------------------------------------------
if XmlHttp.readystate <> 4 then
exit function
end if
htmlText = xmlhttp.responseText
if htmlText="" then
exit function
end if
For Each CharsetMatch in CharsetRegExp.Execute(htmlText)
CharsetMatch=lcase(CharsetMatch)
char_index=instr(CharsetMatch,"charset=")
if char_index>0 then
CharsetMatch=right(CharsetMatch,len(CharsetMatch)-char_index-7)
' CharsetMatch=trim(CharsetMatch)
char_index=instr(CharsetMatch,chr(34))
CharsetMatch=left(CharsetMatch,char_index-1) 'strip the double quote
current_charset=trim(CharsetMatch)
' current_charset=CharsetMatch
end if
next
' response.write "--------------" & current_charset &"--------------------- <br>"
' response.flush
'------------------------------------</read the page charset>------------------------------------------------
'-------------------------------<charset-aware read>------------------------------------------------------------------------------
'-------------------bug: gb2312 was not handled---reported by (membership1@163.com)(http://www.sijiholiday.com )-------------------
'-------------------------patched by (http://www.vtalkback.com )(calldefine@sohu.com)------------------------------------------
' strHTML=bytes2BSTR(xmlHttp.responseBody)
BytesStream.Type = 1
BytesStream.Mode =3
BytesStream.Open
BytesStream.Write xmlHttp.responseBody
BytesStream.Position = 0
BytesStream.Type = 2
BytesStream.Charset = current_charset
' strHTML = xmlhttp.responseText
strHTML=BytesStream.ReadText
BytesStream.close
' response.binarywrite htmlbody
' response.write(strHtml)
' response.flush
'----------------------------------</charset-aware read>-----------------------------------------------------------------------------
For Each objMatch in objRegExp.Execute(strHTML)
' response.write objMatch & "<br>"
' response.flush
if left(objMatch,4)<>"<!--" then
for i=0 to ubound(excludeUrl)
if instr(objmatch,excludeUrl(i))>0 then objmatch=""
next
if objmatch<>"" then
' response.write "objmatch1="& objMatch& " <br>"
' response.write "obj match is "&right(objMatch,len(objMatch)-1)&"<br>"
' response.flush
'------------------------<url cleanup>---------------------------------------------------------------
'---------------bug: gb2312 was not handled---reported by (membership1@163.com)(http://www.sijiholiday.com )--------------------
'-------------------------patched by (http://www.vtalkback.com )(calldefine@sohu.com)------------------------------------------
' objMatch=server.htmlencode(objMatch)
' response.write objMatch & "<br>"
' response.flush
' for i=0 to ubound(RemoveText) 'strip invalid characters chr(34),"'"
' objMatch=replace(lcase(objMatch),lcase(RemoveText(i))," ")
' next
objMatch=lcase(objMatch)
objMatch=replace(objMatch,chr(34)," ") 'strip quote marks from the url
objMatch=replace(objMatch,"'"," ")
objMatch=replace(objMatch,">"," ")
' response.write "objmatch2="& objMatch& " <br>"
' str_index=instr(objMatch,chr(34)) 'drop the quote and everything to its left
' objMatch=right(objMatch,len(objMatch)-str_index)
str_index=instr(objMatch,"=") 'drop the first "=" and everything to its left
objMatch=right(objMatch,len(objMatch)-str_index)
objMatch=ltrim(objMatch) 'keep only the significant characters
' response.write objMatch & "<br>"
' response.flush
str_index=instr(objMatch," ") 'drop the space and everything to its right
' response.write objMatch &" "&str_index &"<br>"
' response.flush
if str_index <> 0 then
objMatch=left(objMatch,str_index-1)
end if
' response.write objMatch & str_index &"<br>"
' response.flush
' str_index=instr(objMatch,chr(34)) 'drop the quote and everything to its right
' objMatch=left(objMatch,str_index-1)
'------------------------</url cleanup>---------------------------------------------------------------
'------------------------<root slash handling>---------------------------------------------------------------------------
'--------------------bug: /url-style links were not handled---reported by (cndanxian@gmail.com)(http://www.gamelee.cn )-----------------------
'-------------------------patched by (http://www.vtalkback.com )(calldefine@sohu.com)------------------------------------------
if left(objMatch,1)="/" then '/blog --> http://www.vtalkback.com/blog
objMatch=root_url & objMatch
end if
' response.write objMatch & "<br>"
' response.flush
'------------------------</root slash handling>-------------------------------------------------------------------------
'--------------------------------<strip the root url>----------------------------------------------------------
' response.write "url2="& url& " <br>"
' response.flush
'in some cases this is better if left(objMatch,len(url))=Url then
if left(objMatch,len(url))=Url then
the_url=right(objMatch,len(objMatch) - len(url))
if the_url<>"" and left(the_url,1)="/" then
the_url=right(the_url,len(the_url) - 1) 'strip the leading "/"
end if
objMatch = the_url
elseif left(objMatch,len(none_http_url))=none_http_url then
the_url=right(objMatch,len(objMatch) - len(none_http_url))
if the_url<>"" and left(the_url,1)="/" then
the_url=right(the_url,len(the_url) - 1) 'strip the leading "/"
end if
objMatch = the_url
elseif instr(objMatch,"http://")=0 and objmatch<>"" then
the_url=sub_dir&objMatch
if the_url<>"" and left(the_url,1)="/" then
the_url=right(the_url,len(the_url) - 1) 'strip the leading "/"
end if
objMatch = the_url
' response.write "subdir="& sub_dir& " <br>"
' response.flush
else '(out of domain)
objMatch=""
end if
'--------------------------------</strip the root url>------------------------------------------------------
end if
if objmatch<>"" then
'------------------------<ampersand conversion>--------------------------------------------------------------------------------
objMatch=replace(objMatch,"&","&amp;") 'escape & as &amp; for the xml output
objMatch=replace(objMatch,"&amp;#","&#") 'but keep numeric character entities
objMatch=replace(objMatch,"&amp;amp;","&amp;") 'and undo any double escaping
if right(objMatch,1)="/" then 'strip the trailing "/"
objMatch=left(objMatch,len(objMatch)-1)
end if
' response.write objMatch & "<br>"
' response.flush
'------------------------</ampersand conversion>---------------------------------------------------------------------------
'--------------------------------<charset handling>------------------------------------------------
'note: if the original page already contains %-style url encoding, the % becomes %25 here
if current_charset="gb2312" then
' objMatch=gb2num(objMatch)
' response.write current_charset&" "& sub_url & "<br>"
' response.write objMatch & "<br>"
objMatch= server.urlEncode(objMatch)
' response.write objMatch &" " & "<br>"
' response.flush
else
' response.write "url2 " & current_charset&" "&sub_url & "<br>"
' response.write objMatch &" " & "<br>"
' objMatch= urlDecode(objMatch)
objMatch= server.urlEncode(objMatch)
' response.write objMatch &" " & "<br>"
' response.flush
' objMatch=server.htmlencode(objMatch)
' objMatch=encodeURI(objMatch)
' objMatch=UTF2GB(objMatch)
' objMatch=gb2num(objMatch)
end if
'--------------------------------</charset handling>------------------------------------------------
' if objMatch<>newMatch then
' response.write objMatch & "<br>"
' response.write newMatch & "<br>"
' response.flush
' end if
if objUrlArchive.Exists(objMatch)= false and the_url<>"" then
objUrlArchive.Add objMatch,crawl_depth
CharsetUrlArchive.Add objMatch,current_charset
' response.write objMatch &" "&sub_url& "<br>" 'show the page the url came from----------
' response.flush
' writelink ObjMatch,current_charset
end if
end if
end if
Next
End Function
'*************************************************************************************************************
function gb2num(str)
newStr=""
for i=1 to len(str) 'gb2312 handling
c=mid(str,i,1)
if asc(c)<0 then
gb2312Code=ascW(c)
if gb2312Code <0 then
gb2312Code =gb2312Code+65536
end if
newStr=newStr & "&#" & gb2312Code & ";"
else
newStr=newStr&c
end if
next
gb2num=newStr
end function
'*************************************************************************************************************
function url2utf(str)
url2utf=decodeURI(str)
end function
'*************************************************************************************************************
Function URLDecode(enStr) 'note: %-style encodings become %25 upstream; decoding restores the %
dim deStr
dim c,i,v
deStr=""
for i=1 to len(enStr)
c=Mid(enStr,i,1)
if c="%" then
v=eval("&h"+Mid(enStr,i+1,2)) 'eval computes the value of the expression
if v<128 then
deStr=deStr&chr(v)
i=i+2
else
if isvalidhex(mid(enstr,i,3)) then 'double-byte url escape
if isvalidhex(mid(enstr,i+3,3)) then
v=eval("&h"+Mid(enStr,i+1,2)+Mid(enStr,i+4,2)) '+65536
deStr=deStr& chr(v)
i=i+5
else 'single url escape
v=eval("&h"+Mid(enStr,i+1,2)+cstr(hex(asc(Mid(enStr,i+3,1)))))
deStr=deStr&chr(v)
i=i+3
end if
else
' destr=destr&c
end if
end if
else
if c="+" then
deStr=deStr&" "
else
deStr=deStr&c
end if
end if
next
' response.write Chr(13) & Chr(10)
' response.write "enstr="&enStr
' response.write Chr(13) & Chr(10)
' response.write "destr="&deStr
' response.write Chr(13) & Chr(10)
' response.flush
URLDecode=deStr
end function
'*************************************************************************************************************
function isvalidhex(str)
isvalidhex=true
str=ucase(str)
if len(str)<>3 then isvalidhex=false:exit function
if left(str,1)<>"%" then isvalidhex=false:exit function
c=mid(str,2,1)
if not (((c>="0") and (c<="9")) or ((c>="A") and (c<="Z"))) then isvalidhex=false:exit function
c=mid(str,3,1)
if not (((c>="0") and (c<="9")) or ((c>="A") and (c<="Z"))) then isvalidhex=false:exit function
end function
'*************************************************************************************************************
function num2gb(str)
newStr=""
for i=1 to len(str)
c=mid(str,i,1)
if c="&" and mid(str,i+1,1)="#" then
num=""
for j=i+2 to len(str)
ch=mid(str,j,1)
if ch=";" then
i=j
exit for
end if
num=num &ch
next
newStr=newStr & chrW(CLng(num)) 'GB2UTF8(chrW(CLng(num)))
else
newStr=newStr & c
end if
next
num2gb=newStr
end function
'*************************************************************************************************************
'Function GB2UTF(Chinese)
' For i = 1 to Len (Chinese)
' a = Mid(Chinese, i, 1)
' GB2UTF = GB2UTF & "&#x" & Hex(Ascw(a)) & ";"
' Next
'End Function
'*************************************************************************************************************
Function GB2UTF(Chinese)
For i = 1 to Len (Chinese)
a = Mid(Chinese, i, 1)
GB2UTF = GB2UTF & Ascw(a)
Next
End Function
'*************************************************************************************************************
function UTF2GB(UTFStr)
for Dig=1 to len(UTFStr)
if mid(UTFStr,Dig,1)="%" then
if len(UTFStr) >= Dig+8 then
GBStr=GBStr & ConvChinese(mid(UTFStr,Dig,9))
Dig=Dig+8
else
GBStr=GBStr & mid(UTFStr,Dig,1)
end if
else
GBStr=GBStr & mid(UTFStr,Dig,1)
end if
next
UTF2GB=GBStr
end function
'*************************************************************************************************************
Function iso8601date(dLocal,utcOffset)
Dim d
' convert local time into UTC
d = DateAdd("H",-1 * utcOffset,dLocal)
' compose the date
iso8601date = Year(d) & "-" & Right("0" & Month(d),2) & "-" & Right("0" & Day(d),2) & _
"T" & Right("0" & Hour(d),2) & ":" & Right("0" & Minute(d),2) & ":" & Right("0" & Second(d),2) & "Z"
End Function
'*************************************************************************************************************
Function GetAbsoluteURL(sUrl,iStep) 'resolve the url after redirects
Dim bUrl,bDat
GetAbsoluteURL=sUrl
If iStep>15 Then
Response.Write "recursion nested more than 15 levels" & "<br />"
exit function
End If
If InStr(sUrl,"?")>0 THen
Dim tmpUrl : tmpUrl=split(sUrl,"?")
bUrl=tmpUrl(0)
bDat=tmpUrl(1)
Else
bUrl=sUrl
bDat=""
End If
' Response.Write "<p style=""border-top:solid 1px silver;padding:0px;margin:0px;"">"
' Response.Write "preparing to fetch " & sUrl & "<br />"
' if bDat<>"" Then Response.Write "3 >>params: " & bDat & "<br />"
oHttp.Option(6)=0 'disable automatic redirects
oHttp.SetTimeouts 5000,5000,30000,5000
' Response.Write burl&"<br /> "
oHttp.Open "HEAD",sUrl,False
On Error Resume Next
oHttp.Send bDat
' response.write oHttp.responseText
' response.flush
' Response.Write " <br /> "
If Err.Number<>0 Then
' Response.Write "<font color=""red"">error: " & Err.Description & "</font><br />"
Err.Clear
' GetAbsoluteURL=""
' Set oHttp=Nothing
' Response.Write "</p>"
Exit Function
End If
' Response.Write " <br /> "
On Error Goto 0
' Response.Write " >>HTTP status: " & oHttp.Status & "<br />"
If oHttp.Status<>200 And oHttp.Status<>302 and oHttp.Status<>301 Then
' Response.Write "<font color=""red"">HTTP error: " & oHttp.StatusText & "</font><br />"
Err.Clear
GetAbsoluteURL=""
' Set oHttp=Nothing
' Response.Write "</p>"
Exit Function
End If
Dim sLoca
On Error Resume Next
sLoca=oHttp.getResponseHeader("Location")
If Err.Number<>0 Then
Err.Clear
sLoca=""
End If
On Error Goto 0
' Response.Write " <br /> "
If sLoca = "" Then
' Response.Write " >>Content-Type:" & oHttp.getResponseHeader("Content-Type") & "<br />"
' Response.Write " >>no Location header returned<br />"
GetAbsoluteURL=sUrl
' Set oHttp=Nothing
' Response.Write " </p>"
Exit Function
Else
' Response.Write " >>Content-Type:" & oHttp.getResponseHeader("Content-Type") & "<br />"
' Response.Write " received Location header: " & sLoca & "<br />"
' Response.Write " </p>"
'build the new URL here
If InStr(sLoca,"://")<=0 Then
'no protocol specified; resolve against the current URL's directory
Dim ind : ind=InstrRev(sUrl,"/")
sUrl=Left(sUrl,ind)
sLoca=sUrl & sLoca
End If
GetAbsoluteURL=GetAbsoluteURL(sLoca,iStep+1)
End If
End Function
%>
<script language="javaScript" runat="Server">
function UTF8toGB(str){
return decodeURIComponent(str)
}
function encodeURI(str){
return encodeURIComponent(str)
}
function decodeURI(str){
return decodeURIComponent(str)
}
function GB2UTF8(str){
return encodeURIComponent(str)
}
// note: Java-style pseudocode, not valid JScript; kept for reference, unused
function convert(str) {
return string(str.getBytes("UTF-8"),"gb2312");
}
</script>