An R Function for PDF Word-Frequency Statistics

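This post shows how to compute word frequencies for a PDF file in R while handling the interference caused by meaningless words and numbers. A custom function, `wordstat.pdf`, removes English stop words and filters out tokens that contain digits, so the counts are more meaningful. Using IELTS preparation material as an example, it demonstrates how to call the function; the most frequent word in the result is 'children'.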

1. Introduction

Many blog posts already cover PDF word-frequency statistics in R, but they leave the following problems unsolved:

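  • In English word counts, words such as "a", "an" and "it" carry no real meaning, and numbers also distort the statistics.
  • The relevant code is not wrapped into a reusable function, which makes it inconvenient to use.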
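To make the first problem concrete, here is a minimal sketch in base R. The sentence, the short stop-word list, and the variable names are made up for illustration and are not taken from the function in this post.

```r
# A made-up sentence used only for illustration
txt <- "the children play in the park and the 3 children read a book in 2023"

# Split into lowercase tokens on anything that is not a letter or digit
tokens <- strsplit(tolower(txt), "[^a-z0-9]+")[[1]]

# Raw counts: the stop word "the" comes out on top
raw <- sort(table(tokens), decreasing = TRUE)
print(head(raw, 3))

# Drop a few stop words and every token that contains a digit
stop_words <- c("the", "a", "in", "and")
keep <- !(tokens %in% stop_words) & !grepl("[0-9]", tokens)
clean <- sort(table(tokens[keep]), decreasing = TRUE)
print(head(clean, 3))
```

After filtering, the content word "children" leads the table instead of "the".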

2. Code

# Return TRUE for each element of 'str' that contains at least one digit.
hasdigit <- function(str){
  if(!is.character(str)){
    stop("'str' should be character.")
  }
  # grepl() is vectorized and avoids locale-dependent string comparison
  grepl("[0-9]", str)
}

wordstat.pdf <- function(file, lo=3, simplify=TRUE, del_num=TRUE){
  # lo: frequency threshold; only words that occur more than 'lo' times are kept
  # simplify: whether to delete stop words like "a", "an", "we", etc.
  # del_num: whether to delete tokens that contain digits
  # library() does nothing extra if the package is already attached
  library(pdftools)
  library(jiebaRD)
  library(jiebaR)
  library(wordcloud2)
  
  # pdf_text() returns one string per page; join the pages before segmenting
  text <- paste(pdf_text(file), collapse = "\n")
  seg <- tolower(qseg[text])
  # count each word, then order the counts from most to least frequent
  seg <- sort(table(seg), decreasing = TRUE)
  
  # remove stop words
  if(simplify){
    ex <- c("is","are","be","was","were","become","becomes","do","did","does","a","an","the",
            "can","will","would","could","should","may","might","have","has",
            "and","or","not","but","although","though","no","also","if","against","any",
            "for","on","off","from","to","of","in","by","like","as","at","about","up","down",
            "below","between","above","with",
            "many","more","much","most","better","worse","worst","best","good","bad",
            "it","them","its","their","we","you","our","this","that","these","those",
            "what","when","where","how","which","whose","why",
            "get","some","other","others")
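    # drop stop words directly; this also works when a stop word is absent from the text
    seg <- seg[!names(seg) %in% ex]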
  }
  
  # remove tokens that contain digits
  if(del_num){
    has_num <- vapply(names(seg), hasdigit, logical(1), USE.NAMES = FALSE)
    seg <- seg[!has_num]
  }
  # keep only words that occur more than 'lo' times
  seg <- seg[seg > lo]
  return(seg)
}
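
Note that the final filter `seg[seg > lo]` acts on word counts, not on word length. If a minimum word length is also wanted, it could be applied to the names of the returned table. The sketch below assumes a hypothetical helper `min_length_filter` (not part of the code above) and uses a small made-up table in place of real output.

```r
# Hypothetical helper: keep only words with at least 'min_len' characters.
# 'counts' is a named table such as the one returned by table().
min_length_filter <- function(counts, min_len = 3){
  counts[nchar(names(counts)) >= min_len]
}

# Made-up counts standing in for real output
counts <- table(c("tv", "tv", "tv", "children", "children", "of"))
filtered <- min_length_filter(counts, min_len = 3)
print(filtered)
```

Keeping this as a separate step leaves the frequency threshold and the length threshold independent of each other.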

3. Usage Example

  • Required packages: pdftools, jiebaRD, jiebaR, wordcloud2
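
One possible way to check which of these packages still need installing is sketched below. The helper name `missing_packages` is an assumption made for illustration, and the install line is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Hypothetical helper: return the packages from 'pkgs' that are not installed
missing_packages <- function(pkgs){
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("pdftools", "jiebaRD", "jiebaR", "wordcloud2")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install whatever is missing from CRAN:
# if(length(to_install) > 0) install.packages(to_install)
```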

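Take Ideas for IELTS topics, written by IELTS examiner Simon, as an example (the PDF is available from ielts-simon.com). First run the code in Section 2 to load the two functions into the R environment, then run the following: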

file <- "D:\\Ideas_for_IELTS_topics.pdf"
res <- wordstat.pdf(file)
wordcloud2(res, size = 1, shape = 'circle',color = 'random-light')

In the resulting word cloud, the word "children" appears with high frequency in the IELTS Part 2 material provided by examiner Simon.
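
To read exact counts rather than judging them from a word cloud, the table of frequencies can be turned into a data frame. The sketch below uses a small made-up table (`res_demo`) in place of the real result, and the column names `word` and `freq` are chosen here for illustration.

```r
# Made-up words standing in for the real segmentation result
words <- c("children", "children", "school", "children", "school", "work")
res_demo <- sort(table(words), decreasing = TRUE)

# Convert the table to a data frame and keep the two most frequent words
top <- head(as.data.frame(res_demo, stringsAsFactors = FALSE), 2)
names(top) <- c("word", "freq")
print(top)
```

The same conversion should work on the value returned by `wordstat.pdf`, since it is also a one-dimensional table.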

Reference: https://blog.csdn.net/BEYONDMA/article/details/85465403
