Reading PDFs and Text Mining in R


Here is an example of reading a PDF with R and mining its text:

# here is a pdf for mining
url <- "http://www.noisyroom.net/blog/RomneySpeech072912.pdf"
dest <- tempfile(fileext = ".pdf")
download.file(url, dest, mode = "wb")

# set path to pdftotext.exe and convert pdf to text
exe <- "C:\\Program Files\\xpdfbin-win-3.03\\bin32\\pdftotext.exe"
system(paste("\"", exe, "\" \"", dest, "\"", sep = ""), wait = FALSE)

# get txt-file name and open it
filetxt <- sub(".pdf", ".txt", dest)
shell.exec(filetxt); shell.exec(filetxt)  # strangely the first try always throws an error..

# do something with it, i.e. a simple word cloud
library(tm)
library(wordcloud)
library(Rstem)

txt <- readLines(filetxt)  # don't mind the warning..
txt <- tolower(txt)
txt <- removeWords(txt, c("\\f", stopwords()))

corpus <- Corpus(VectorSource(txt))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))

# stem words
d$stem <- wordStem(row.names(d), language = "english")

# and put the words in a column, otherwise they would be lost when aggregating
d$word <- row.names(d)

# remove web address (very long string):
d <- d[nchar(row.names(d)) < 20, ]

# aggregate frequency by word stem and keep the first word of each group..
agg_freq <- aggregate(freq ~ stem, data = d, sum)
agg_word <- aggregate(word ~ stem, data = d, function(x) x[1])
d <- cbind(freq = agg_freq[, 2], agg_word)

# sort by frequency
d <- d[order(d$freq, decreasing = TRUE), ]

# print wordcloud:
wordcloud(d$word, d$freq)

# remove temporary files
file.remove(dir(tempdir(), full.names = TRUE))
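If you do not have the xpdf binaries installed, a cross-platform alternative is the pdftools package, whose pdf_text() function extracts the page text directly from R. The following is a minimal sketch of the same word-cloud workflow under that assumption (pdftools, tm and wordcloud installed); it is not part of the original example, and it skips the stemming and aggregation steps shown above.

# cross-platform sketch using pdftools instead of pdftotext.exe
library(pdftools)
library(tm)
library(wordcloud)

url  <- "http://www.noisyroom.net/blog/RomneySpeech072912.pdf"
dest <- tempfile(fileext = ".pdf")
download.file(url, dest, mode = "wb")

# pdf_text() returns one character string per page, so no external
# conversion step or temporary .txt file is needed
pages <- pdf_text(dest)
txt   <- tolower(unlist(strsplit(pages, "\n")))
txt   <- removeWords(txt, stopwords())

corpus <- Corpus(VectorSource(txt))
corpus <- tm_map(corpus, removePunctuation)
tdm    <- TermDocumentMatrix(corpus)
freq   <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

wordcloud(names(freq), freq)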


           
