学以致用——ikb知识库英文词条词频分析-Part1-数据提取(VBA)

课题描述:

ikb系统中已录入了数万条记录,这么多数据,能否分析出哪些词出现的次数最多?这些高频词有没有什么业务含义?有什么意义吗?


众里寻他千百度,蓦然回首,那人却在灯火阑珊处。

本来想在网上搜一下Excel中Frequency函数的原代码,却无意中找到了一段代码,稍作修改,即可帮我实现我长久以来的上述想法:词频分析。

话不多说,直接分享代码及数据提取结果。

VBA代码:

Sub MakeWordList()
    Dim InputSheet As Worksheet
    Dim WordListSheet As Worksheet
    Dim PuncChars As Variant, x As Variant
    Dim i As Long, r As Long
    Dim txt As String
    Dim wordCnt As Long
    Dim AllWords As Range
    Dim PC As PivotCache
    Dim PT As PivotTable
    
    Application.ScreenUpdating = False
    Set InputSheet = ActiveSheet
    'Set WordListSheet = Worksheets.Add(after:=Worksheets(Sheets.Count))
    Set WordListSheet = Worksheets.Add(after:=ActiveSheet)
    WordListSheet.Range("A1") = "All Words"
    WordListSheet.Range("A1").Font.Bold = True
    InputSheet.Activate
    wordCnt = 2
    PuncChars = Array(".", ",", ";", ":", "'", "!", "#", _
        "$", "%", "&", "(", ")", " - ", "_", "--", "+", _
        "=", "~", "/", "\", "{", "}", "[", "]", """", "?", "*")
    r = 1

'   Loop until blank cell is encountered
    Do While Cells(r, 1) <> ""
'       covert to UPPERCASE
        txt = UCase(Cells(r, 1))
'       Remove punctuation
        For i = 0 To UBound(PuncChars)
            txt = Replace(txt, PuncChars(i), "")
        Next i
'       Remove excess spaces
        txt = WorksheetFunction.Trim(txt)
'       Extract the words
        x = Split(txt)
        For i = 0 To UBound(x)
            WordListSheet.Cells(wordCnt, 1) = x(i)
            wordCnt = wordCnt + 1
        Next i
    r = r + 1
    Loop
    
'   Create pivot table
    WordListSheet.Activate
    Set AllWords = Range("A1").CurrentRegion
    Set PC = ActiveWorkbook.PivotCaches.Add _
        (SourceType:=xlDatabase, _
        SourceData:=AllWords)
    Set PT = PC.CreatePivotTable _
        (TableDestination:=Range("C1"), _
        TableName:="PivotTable1")
    With PT
        .AddDataField .PivotFields("All Words")
        .PivotFields("All Words").Orientation = xlRowField
    End With
End Sub




运行了几个小时后(数据量确实比较大,新增了20多万行),得到提取结果。

按频数倒序排列并且用空白字符替换掉虚词(a, an, the, and等)等没有多少业务含义的词之后,得到了想要的数据集。如下所示。



虚词替换VBA代码:

Sub RemoveFormWord()
Dim i As Integer  '定义循环变量

i = 2  '表头占用一行,数据从第二行开始


'从第二行开始循环赋值
Do While ActiveSheet.Cells(i, "F").Value <> ""
    Columns("A:A").Select
    Selection.Replace What:=Cells(i, "F").Value, Replacement:="", LookAt:=xlWhole, _
        SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
        ReplaceFormat:=False
    i = i + 1
Loop
End Sub

有了这些数据,想干什么不行呢?哈哈。

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值