特殊网页爬虫——VBA开发文档

标签: VBA 网页爬虫 网抓 Access 办公自动化
12人阅读 评论(0) 收藏 举报
分类:

特殊网页爬虫——VBA开发文档 

作者:AntoniotheFuture

关键词:VBA,Access,网页爬虫,网抓

开发平台:Access

平台版本上限:2010

平台版本下限:尚未出现

开发语言:VBA

简介:目前在一家保险公司上班,统计数据需要经常从一个公司的网页系统中下载报表,操作比较简单,但是要操作的东西太多,比较烦人,对于日常数据的提取,我就想着如果可以定制任务就好了,正好我之前也有过一点点网页爬虫的经验,于是着手开始写,对于这个网页呢,有很多“不太友好”的地方,比如说:登陆进去之后会启动新窗口(可能是公司为了信息安全),点击的控件花样百出,不易定位,格式不尽相同,有些日期是MM-DD-YYYY,有些是YYYY-MM-DD HH:MM,而且网页内存在iframe,用传统的网页爬虫无法实现,在参阅了无数资料之后,今天终于开发出来了,现在分享给大家,共同进步。


功能描述:

  1. 可以预先按步骤设定任务,完全模拟网页中的手工操作,对各种控件进行操作,只需轻轻点击开始,网页即可自动填表,提交,等待报表在网页中产生即可。
  2. 可以重新获取IE的控制权,防止新窗口出现后丢失窗口。
  3. 可以根据预设参数获取时间和日期,如:下一个工作日的前一天的23:59:59。
  4. 加入“工作日表”,可在里面提前设定“补假”,“加班”等特殊工作日,准确判断下一工作日。

表设计:

    首先是报表列表,用于定位网页中报表页面


因为网页中有多处重复的HTMLname,而且无法用其他方法定位控件,特加入“开始搜索位置”用于控件的查找


然后是控件表,用于定位表单页面中的控件,还可以根据预先设定的控件类型做不同的动作。



任务表,用于记录任务基本信息,比较简单



任务流程表,在窗体中定制的流程将会记录到这个表中:


下面是窗体部分:

    控件详情和报表详情窗口,没什么特殊,可用于快速添加网页控件信息。





任务详情窗体,整合了任务创建,流程设置,登陆信息输入和执行功能

新建任务后,在增加流程按钮的左边输入要操作的控件和要输入的值类型和值的本身,完成整个任务定制后,保存即可执行,系统将会打开IE窗口并执行相应操作,省去不少时间,还能避免手动输入出错。



 工作日表,用于记录工作日和更改工作日



下面是部分SQL查询:

报表列表查询 

SELECT 报表列表.ID, 报表列表.报表名称
FROM 报表列表;

更改工作日类型子窗体
用于查找下一个工作日

SELECT 工作日表.*
FROM 工作日表
WHERE (((工作日表.工作日)>=DateAdd("d",-7,Date())))
ORDER BY 工作日表.工作日;

任务流程查询
用于在任务详情界面显示流程

SELECT 任务流程表.任务ID, 任务流程表.流程数, 任务流程表.打开报表, 任务流程表.表ID, [报表列表]![报表名称] AS表名, 任务流程表.控件ID, [element]![名称] AS 控件名, 任务流程表.控件值类型, 任务流程表.控件值
FROM (任务流程表 LEFTJOIN 报表列表 ON 任务流程表.表ID = 报表列表.ID) LEFT JOIN element ON 任务流程表.控件ID = element.ID

WHERE (((任务流程表.任务ID)=[Forms]![任务详情]![ID]));


任务流程转VBA

与VBA对接,包含了执行任务过程中所需的所有控件数据。

SELECT 任务流程表.ID, 任务流程表.任务ID, 任务流程表.流程数, 任务流程表.打开报表, 报表列表.报表名称, 报表列表.层级, 报表列表.一级, 报表列表.开始搜索位置, 报表列表.二级, 报表列表.是否使用二级网页位置, 报表列表.二级网页位置, 报表列表.三级, 报表列表.四级, element.名称, element.控件类型, element.值, element.数据类型, element.HTMLname,element.HTMLID, element.时间类型, 任务流程表.控件值类型, 任务流程表.控件值

FROM (任务流程表 LEFTJOIN 报表列表 ON 任务流程表.表ID = 报表列表.ID) LEFT JOIN element ON 任务流程表.控件ID = element.ID;


下面是VBA代码部分
  

更改工作日类型


'批量修改工作日
Private Sub Command20_Click()
Dim STemp2 As String 
Dim i
If IsNull(Me.Text0) Then
   MsgBox ("请输入开始日期!")
   Exit Sub
ElseIf IsNull(Me.Text4) Then
   MsgBox ("请输入结束日期!")
   Exit Sub
ElseIf IsNull(Me.List40) Then
   MsgBox ("请选择更改类型!")
Else  
Dim Rs2 As ADODB.Recordset
Set Rs2 = New ADODB.Recordset 
STemp2 = "select * From 工作日表 where 工作日 between #" & Me![Text0]& "# and #" & Me![Text4] & "#"
Rs2.Open STemp2, CurrentProject.Connection,adOpenKeyset, adLockOptimistic 
For i = 1 To Rs2.RecordCount
   Rs2("类型") = Me![List40]
   Rs2.Update
   Rs2.MoveNext
Next
Me.Refresh
MsgBox ("成功将"& i - 1 & "天更改为" & Me![List40])
Exit Sub
End If
Exit Sub
Rs2.Close
Set Rs2 = Nothing 
End Sub

自动登录并获取网页

用于对付窗口弹出问题


Private Sub Command268_Click()
'On Error Resume Next
'定义变量
Dim IE As Object
Dim webs, webs2, webs3, webs4, webs5, dmt,dmt1, dmt2, usrno, elements, element1, xxx
Dim vtag   '网页对象
Dim loop1, loop2, loop3   '循环计数器
Dim objIE As Object, myHWND
Dim dWinFolder As New ShellWindows, t
Dim Czpmxurl As String, Czpmxname As String
Dim Czpmxhwnd As Long, aa        '窗口句柄
Dim cifno$, cifcname$, ResultLink$ 
'text9 = 用户名 text11= 密码
'IE清除缓存&打开登录界面 
Call DeleteCacheURLList
Set IE =CreateObject("InternetExplorer.Application")
IE.Navigate"example.com"
IE.Visible = True     '若=0 False不显示 ,=1 True 显示
IE.Silent = True
Do While IE.Busy Or IE.ReadyState <>4
   DoEvents
Loop
delay Me.Combo17  
Set dmt = IE.Document
IE.Document.getElementById("j_username").Value= Me.Text9
IE.Document.getElementById("j_password").Value= Me.Text11
delay 2
IE.Document.getElementById("j_password").focus
SendKeys "{enter}"
Do While IE.Busy Or IE.ReadyState <>4
   DoEvents
Loop
delay Me.Combo17 + 3
   Czpmxhwnd = FindWindow(vbNullString, "来自网页的消息")      '根据窗口标题查找,找到后返回句柄
   If Czpmxhwnd <> 0 Then
       aa = SetForegroundWindow(Czpmxhwnd)   '将网页调到前台
       delay 1
       SendKeys "{ENTER}", True
   End If  
delay 1
Call Command271_Click
End Sub


任务执行

根据设定的任务,按流程对网页中控件进行操作


Private Sub Command271_Click()
'定义变量
Dim IE As Object
Dim webs, webs2, webs3, webs4, webs5, dmt,dmt1, dmt2, dmt3, dmt4, usrno, elements, element1, xxx, departmentNoHTML
Dim vtag, worktype  '网页对象
Dim loop1, loop2, loop3, loop4  '循环计数器  1=网页对象查找,2= ,3=工作日确定,4=流程进行
Dim objIE As Object, myHWND
Dim dWinFolder As New ShellWindows, t
Dim Czpmxurl As String, Czpmxname As String
Dim Czpmxhwnd As Long, aa        '窗口句柄
Dim cifno$, cifcname$, ResultLink$

Dim today0 '今天零点
Dim monthday10000  '当月零点
Dim nworkday '下一工作日
Dim nworkdaypday2359 '下一工作日前一天23点59分
Dim nworkday7  '下一工作日7点

Dim STemp3, STemp4 As String
Dim Rs3 As ADODB.Recordset
Dim Rs4 As ADODB.Recordset
Set Rs3 = New ADODB.Recordset
Set Rs4 = New ADODB.Recordset

workdaytype = "正常"
today0 = Format(Date & "00:00:00", "YYYY/MM/DD HH:MM:SS")
monthday10000 = Format(DateSerial(Year(Date),Month(Date), 1) & " 00:00:00", "YYYY/MM/DD HH:MM:SS")

STemp3 = "select * From 工作日表 where 类型 = " & "'" &workdaytype & "'" & "order by 工作日"
Rs3.Open STemp3, CurrentProject.Connection,adOpenKeyset, adLockOptimistic
For loop3 = 0 To Rs3.RecordCount
   If DateDiff("d", Date, Rs3("工作日"))> 0 Then
       nworkday = Rs3("工作日")
       Exit For
   ElseIf loop3 = Rs3.RecordCount Then
       MsgBox ("请更新工作日表!")
       Exit Sub
       Exit For
   Else
       Rs3.MoveNext
   End If
Next

nworkdaypday2359 =Format(DateAdd("d", -1, nworkday) & " 23:59:59","YYYY/MM/DD HH:MM:SS")
nworkday7 = Format(nworkday & "07:00:00", "YYYY/MM/DD HH:MM:SS")

Do
   For Each objIE In dWinFolder
           If InStr(1, objIE.LocationURL, "elis-lcs.paic") > 0 Then
                Czpmxname =objIE.LocationName            '标题
                Czpmxurl =objIE.LocationURL              '链接
                Exit Do   '通过链接objIE.LocationURL包含的关键字查询,或用objIE.LocationName即窗口标题包含的关键字来查询
           End If
   Next
       DoEvents
Loop

   Set IE = objIE  '转换ie窗口控制权
   Do Until IE.ReadyState = 4 And IE.Busy = False
       DoEvents
   Loop
   Set dmt = IE.Document
STemp4 = "select * From 任务流程转VBA where 任务ID = " & Me![任务ID] & " order by 流程数"
Rs4.Open STemp4, CurrentProject.Connection,adOpenKeyset, adLockOptimistic
For loop4 = 0 To Rs4.RecordCount - 1
   If Rs4("打开报表") = True Then
继续:
       Set elements = dmt.all.tags("a")
       Debug.Print IE.ReadyState
       For loop1 = 0 To elements.length - 1
           If elements.Item(loop1).innerText = Rs4("一级")Then
                elements.Item(loop1).Click
                Exit For
           End If
       Next
'特殊重名控件
           For loop1 = Rs4("开始搜索位置") Toelements.length - 1
                Ifelements.Item(loop1).innerText = Rs4("二级")Then
                   elements.Item(loop1).FireEvent ("onmouseover")
                    Exit For
                End If
           Next

       delay 0.5

       If Rs4("层级") = 3 Then
           Set elements = dmt.all.tags("a")
           Debug.Print IE.ReadyState
           For loop1 = 0 To elements.length - 1
           If elements.Item(loop1).innerText = Rs4("三级")Then
                elements.Item(loop1).Click
                Exit For
           End If
           Next
       ElseIf Rs4("层级") = 4 Then
           Set elements = dmt.all.tags("a")
           Debug.Print IE.ReadyState
           For loop1 = 0 To elements.length - 1
                Ifelements.Item(loop1).innerText = Rs4("三级")Then
                   elements.Item(loop1).FireEvent ("onmouseover")
                    Exit For
                End If
           Next

           For loop1 = 0 To elements.length - 1
            If elements.Item(loop1).innerText =Rs4("四级") Then
                elements.Item(loop1).Click
                Exit For
           End If
           Next
           delay 1
       Else
           MsgBox ("请在报表列表中添加报表层级!!!")
           Exit Sub
       End If
       delay 5

       GoTo 报表操作
   Else                                                                                   '打开报表——结束

网页表单填写操作:
       Set dmt1 = IE.Document.frames(1).Document  'getElementsByTagName("INPUT")(0)
       Set elements = dmt1.all.tags("INPUT")       'Or "SELECT"
       If Rs4("控件类型") = "文本框" Then
           For loop1 = 0 To elements.length - 1
           If IsNull(Rs4("HTMLname")) = False Then
                If elements.Item(loop1).Name =Rs4("HTMLname") Then

ID匹配:
                    Select Case Rs4("控件值类型")
                    Case "预先制定值"
                       elements.Item(loop1).Value = Rs4("控件值")
                    Case "当时"
                       elements.Item(loop1).Value = Format(Date & " " & Time(),Rs4("时间类型"))
                    Case "手动输入"
                       elements.Item(loop1).Value = InputBox("请输入"& Rs4("报表名称") & "中" & Rs4("名称") &"的值:(" & Rs4("时间类型") & ")", "请输入")
                    Case "当月0点"
                        elements.Item(loop1).Value= Format(monthday10000, Rs4("时间类型"))
                    Case "今天0点"
                       elements.Item(loop1).Value = Format(today0, Rs4("时间类型"))
                    Case "下一工作日前一天23点59分"
                        elements.Item(loop1).Value= Format(nworkdaypday2359, Rs4("时间类型"))
                    Case "下一工作日7点"
                       elements.Item(loop1).Value = Format(nworkday7, Rs4("时间类型"))
                    Case "本月份"
                       elements.Item(loop1).Value = Format(Date, Rs4("时间类型"))
                    End Select
                    Exit For
                End If
           Else
                If elements.Item(loop1).ID =Rs4("HTMLID") Then
                    GoTo ID匹配
                End If
           End If
            Next
       ElseIf Rs4("控件类型") = "复选框" Then
           For loop1 = 0 To elements.length - 1
                If elements.Item(loop1).Value =Rs4("值") Then
                    elements.Item(loop1).Click
                    Exit For
                End If
           Next
       ElseIf Rs4("控件类型") = "单选框" Then
           For loop1 = 0 To elements.length - 1
                If elements.Item(loop1).Name =Rs4("HTMLname") Then
                    elements.Item(loop1).Click
                   Exit For
                End If
           Next
       ElseIf Rs4("控件类型") = "按钮" Then
           For loop1 = 0 To elements.length - 1
                If elements.Item(loop1).Value =Rs4("值") Then
                   elements.Item(loop1).FireEvent ("onclick")
                    delay 2
                    Exit For
                End If
           Next
       ElseIf Rs4("控件类型") = "下拉框" Then
           Set elements = dmt1.all.tags("select")
           For loop1 = 0 To elements.length - 1
                If IsNull(Rs4("HTMLname"))= False Then
                    Ifelements.Item(loop1).Name = Rs4("HTMLname") Then
ID匹配2:
                       elements.Item(loop1).Value = Rs4("控件值")
                        Exit For
                    End If
                Else
                    If elements.Item(loop1).ID= Rs4("HTMLID") Then
                    GoTo ID匹配2
                    End If
                End If
           Next
       End If
       Rs4.MoveNext
   End If

下一步:
Next
Me.Refresh
Exit Sub
Rs3.Close
Rs4.Close
Set Rs3 = Nothing
Set Rs4 = Nothing
End Sub

任务控件添加

用于在任务详情界面中添加需要操作的控件。


Private Sub Command45_Click()
Dim STemp As String
Dim Rs As ADODB.Recordset
Set Rs = New ADODB.Recordset
STemp = "select * From 任务流程表 where 任务ID = " & Me![任务ID]
Rs.Open STemp, CurrentProject.Connection,adOpenKeyset, adLockOptimistic
Rs.AddNew
Rs("任务ID")= Me![任务ID]
Rs("流程数")= Rs.RecordCount + 1
Rs("表ID")= Me.Combo60
Rs("表名")= Me.Combo60.Column(1)
Rs("控件ID")= Me.Combo66
Rs("控件名")= Me.Combo66.Column(1)
Rs("控件值类型")= Me.Combo100
Rs("控件值")= Text76
Rs("打开报表")= Me.Check319
Rs.Update
Me.Refresh
Exit Sub
Rs.Close
Set Rs = Nothing
End Sub

寻找已打开IE

Declare Function FindWindow Lib"user32" Alias "FindWindowA" (ByVal lpClassName As String,ByVal lpWindowName As String) As Long
Declare Function SetForegroundWindow Lib"user32" (ByVal HWnd As Long) As Long

窗口寻找


Private Const ERROR_CACHE_FIND_FAIL As Long= 0
Private Const ERROR_CACHE_FIND_SUCCESS AsLong = 1
Private Const ERROR_FILE_NOT_FOUND As Long= 2
Private Const ERROR_ACCESS_DENIED As Long =5
Private Const ERROR_INSUFFICIENT_BUFFER AsLong = 122
Private Const MAX_PATH As Long = 260
Private Const MAX_CACHE_ENTRY_INFO_SIZE AsLong = 4096
Private Const LMEM_FIXED As Long = &H0
Private Const LMEM_ZEROINIT As Long =&H40
Private Const LPTR As Long = (LMEM_FIXED OrLMEM_ZEROINIT)
Private Const NORMAL_CACHE_ENTRY As Long =&H1
Private Const EDITED_CACHE_ENTRY As Long =&H8
Private Const TRACK_OFFLINE_CACHE_ENTRY AsLong = &H10
Private Const TRACK_ONLINE_CACHE_ENTRY AsLong = &H20
Private Const STICKY_CACHE_ENTRY As Long =&H40
Private Const SPARSE_CACHE_ENTRY As Long =&H10000
Private Const COOKIE_CACHE_ENTRY As Long =&H100000
Private Const URLHISTORY_CACHE_ENTRY AsLong = &H200000
Private Const URLCACHE_FIND_DEFAULT_FILTER AsLong = NORMAL_CACHE_ENTRY Or _
                                                   COOKIE_CACHE_ENTRY Or _
                                                   URLHISTORY_CACHE_ENTRY Or _
                                                   TRACK_OFFLINE_CACHE_ENTRY Or _
                                                   TRACK_ONLINE_CACHE_ENTRY Or _
                                                   STICKY_CACHE_ENTRY
Private Type FILETIME
  dwLowDateTime As Long
  dwHighDateTime As Long
End Type
Private Type INTERNET_CACHE_ENTRY_INFO
  dwStructSize As Long
  lpszSourceUrlName As Long
  lpszLocalFileName As Long
  CacheEntryType  As Long
  dwUseCount As Long
  dwHitRate As Long
  dwSizeLow As Long
  dwSizeHigh As Long
  LastModifiedTime As FILETIME
  ExpireTime As FILETIME
  LastAccessTime As FILETIME
  LastSyncTime As FILETIME
  lpHeaderInfo As Long
  dwHeaderInfoSize As Long
  lpszFileExtension As Long
  dwExemptDelta  As Long
End Type
Private Declare FunctionFindFirstUrlCacheEntry Lib "wininet" Alias"FindFirstUrlCacheEntryA" (ByVal lpszUrlSearchPattern As String,lpFirstCacheEntryInfo As Any, lpdwFirstCacheEntryInfoBufferSize As Long) AsLong
Private Declare FunctionFindNextUrlCacheEntry Lib "wininet" Alias "FindNextUrlCacheEntryA"(ByVal hEnumHandle As Long, lpNextCacheEntryInfo As Any,lpdwNextCacheEntryInfoBufferSize As Long) As Long
Private Declare Function FindCloseUrlCacheLib "wininet" (ByVal hEnumHandle As Long) As Long
Private Declare FunctionDeleteUrlCacheEntry Lib "wininet" Alias"DeleteUrlCacheEntryA" (ByVal lpszUrlName As String) As Long
Private Declare Sub CopyMemory Lib"kernel32" Alias "RtlMoveMemory" (pDest As Any, pSource AsAny, ByVal dwLength As Long)
Private Declare Function lstrcpyA Lib"kernel32" (ByVal RetVal As String, ByVal Ptr As Long) As Long
Private Declare Function lstrlenA Lib"kernel32" (ByVal Ptr As Any) As Long
Private Declare Function LocalAlloc Lib"kernel32" (ByVal uFlags As Long, ByVal uBytes As Long) As Long
Private Declare Function LocalFree Lib"kernel32" (ByVal hMem As Long) As Long
Public Sub DeleteCacheURLList()
  Dim icei As INTERNET_CACHE_ENTRY_INFO
  Dim hFile As Long
  Dim cachefile As String
  Dim posUrl As Long
  Dim posEnd As Long
  Dim dwBuffer As Long
  Dim pntrICE As Long
  hFile = FindFirstUrlCacheEntry(0&, ByVal 0, dwBuffer)
   If(hFile = ERROR_CACHE_FIND_FAIL) And _
     (Err.LastDllError = ERROR_INSUFFICIENT_BUFFER) Then
     pntrICE = LocalAlloc(LMEM_FIXED, dwBuffer)
     If pntrICE <> 0 Then
        CopyMemory ByVal pntrICE, dwBuffer, 4
        hFile = FindFirstUrlCacheEntry(vbNullString, _
                                        ByValpntrICE, _
                                       dwBuffer)
        If hFile <> ERROR_CACHE_FIND_FAIL Then
           Do
               CopyMemory icei, ByVal pntrICE,Len(icei)
               If (icei.CacheEntryType And _
                   NORMAL_CACHE_ENTRY) =NORMAL_CACHE_ENTRY Then
                  cachefile =GetStrFromPtrA(icei.lpszSourceUrlName)
                  Call DeleteUrlCacheEntry(cachefile)
               End If
               Call LocalFree(pntrICE)
              dwBuffer = 0
               CallFindNextUrlCacheEntry(hFile, ByVal 0, dwBuffer)
              'allocate and assign the memoryto the pointer
              pntrICE =LocalAlloc(LMEM_FIXED, dwBuffer)
               CopyMemory ByVal pntrICE,dwBuffer, 4
                             DoEvents
           Loop While FindNextUrlCacheEntry(hFile, ByVal pntrICE, dwBuffer)
        End If 'hFile
     End If 'pntrICE
  End If 'hFile
  Call LocalFree(pntrICE)
  Call FindCloseUrlCache(hFile)
End Sub
Private Function GetStrFromPtrA(ByVal lpszAAs Long) As String
  GetStrFromPtrA = String$(lstrlenA(ByVal lpszA), 0)
 Call lstrcpyA(ByVal GetStrFromPtrA, ByVal lpszA)
End Function


查看评论

Guru of the Week 条款13:面向对象程序设计

GotW #13 OOP著者:Herb Sutter      翻译:kingofark[声明]:本文内容取自www.gotw.ca网站上的Guru of the Week栏目,其著作权归原著者本人所...
  • kingofark
  • kingofark
  • 2001-10-24 08:49:00
  • 1165

vba 爬虫常用对象和方法

最近又要小爬一下动态网页,于是复习了一下常用对象。Sub WebCrawler(ByRef Item, ByRef DraftPage)Dim sKey As String Dim k As Inte...
  • u011410413
  • u011410413
  • 2017-01-20 10:56:20
  • 2014

Excel VBA中特殊单元格的定位

Option Explicit '1 已使用的单元格区域 Sub RangesUsed() 'UsedRange属性 Sheets("sheet1").UsedRange.Select ...
  • zch19960629
  • zch19960629
  • 2017-05-14 09:14:36
  • 548

正则表达式--模式匹配--特殊变量(1)

$` | $& | $’ | $1 | $2 | $n   >>>每一次成功的匹配了一个模式(包括替换),操作符都会把变量 $`|$&|$’分别设置为匹配内容左边的内容,匹配的内容和匹配右边的文...
  • bangemantou
  • bangemantou
  • 2012-05-11 20:26:52
  • 1610

VBA爬虫小试

因为进不去数据库今天终于需要实战VBA网页爬虫了。370条记录,用时三分钟。想说其实挺慢的。以后慢慢改进吧。抓下来之后采用Text to Columns 用着刚刚好。 Sub Crawler() ...
  • u011410413
  • u011410413
  • 2015-11-06 16:09:00
  • 4553

vba连接各种数据库字符串

  • 2011年05月12日 16:31
  • 32KB
  • 下载

autocad 2018 VBA帮助文档

  • 2018年03月13日 17:08
  • 5.14MB
  • 下载

网页爬虫工具

  • 2018年04月07日 15:00
  • 157KB
  • 下载

VBA获取系统环境变量及特殊文件夹的各种代码与方法

 这几天帮客户做个系统,需要获到系统环境变量及一些特殊文件夹。收集和研究了各种代码。记录一下,以免自己忘记,同时也分享一下给大家,避免大家遇到同样问题时,再去花费大量时间。 一、我个人整理...
  • amesman
  • amesman
  • 2016-07-07 08:30:09
  • 570

VB实现的一个小的网络爬虫及链接分析程序

  • 2010年01月09日 16:59
  • 137KB
  • 下载
    个人资料
    等级:
    访问量: 0
    积分: 41
    排名: 0
    文章存档