课题描述:
ikb系统中已录入了数万条记录,这么多数据,能否分析出哪些词出现的次数最多?这些高频词有没有什么业务含义?有什么意义吗?
众里寻他千百度,蓦然回首,那人却在灯火阑珊处。
本来想在网上搜一下Excel中Frequency函数的原代码,却无意中找到了一段代码,稍作修改,即可帮我实现我长久以来的上述想法:词频分析。
话不多说,直接分享代码及数据提取结果。
VBA代码:
Sub MakeWordList()
Dim InputSheet As Worksheet
Dim WordListSheet As Worksheet
Dim PuncChars As Variant, x As Variant
Dim i As Long, r As Long
Dim txt As String
Dim wordCnt As Long
Dim AllWords As Range
Dim PC As PivotCache
Dim PT As PivotTable
Application.ScreenUpdating = False
Set InputSheet = ActiveSheet
'Set WordListSheet = Worksheets.Add(after:=Worksheets(Sheets.Count))
Set WordListSheet = Worksheets.Add(after:=ActiveSheet)
WordListSheet.Range("A1") = "All Words"
WordListSheet.Range("A1").Font.Bold = True
InputSheet.Activate
wordCnt = 2
PuncChars = Array(".", ",", ";", ":", "'", "!", "#", _
"$", "%", "&", "(", ")", " - ", "_", "--", "+", _
"=", "~", "/", "\", "{", "}", "[", "]", """", "?", "*")
r = 1
' Loop until blank cell is encountered
Do While Cells(r, 1) <> ""
' covert to UPPERCASE
txt = UCase(Cells(r, 1))
' Remove punctuation
For i = 0 To UBound(PuncChars)
txt = Replace(txt, PuncChars(i), "")
Next i
' Remove excess spaces
txt = WorksheetFunction.Trim(txt)
' Extract the words
x = Split(txt)
For i = 0 To UBound(x)
WordListSheet.Cells(wordCnt, 1) = x(i)
wordCnt = wordCnt + 1
Next i
r = r + 1
Loop
' Create pivot table
WordListSheet.Activate
Set AllWords = Range("A1").CurrentRegion
Set PC = ActiveWorkbook.PivotCaches.Add _
(SourceType:=xlDatabase, _
SourceData:=AllWords)
Set PT = PC.CreatePivotTable _
(TableDestination:=Range("C1"), _
TableName:="PivotTable1")
With PT
.AddDataField .PivotFields("All Words")
.PivotFields("All Words").Orientation = xlRowField
End With
End Sub
运行了几个小时后(数据量确实比较大,新增了20多万行),得到提取结果。
按频数倒序排列并且用空白字符替换掉虚词(a, an, the, and等)等没有多少业务含义的词之后,得到了想要的数据集。如下所示。
虚词替换VBA代码:
Sub RemoveFormWord()
Dim i As Integer '定义循环变量
i = 2 '表头占用一行,数据从第二行开始
'从第二行开始循环赋值
Do While ActiveSheet.Cells(i, "F").Value <> ""
Columns("A:A").Select
Selection.Replace What:=Cells(i, "F").Value, Replacement:="", LookAt:=xlWhole, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False
i = i + 1
Loop
End Sub
有了这些数据,想干什么不行呢?哈哈。