Reading a PDF with R and mining its text. An example follows:
# here is a pdf for mining
url
dest
download.file(url, dest, mode = "wb")
# set path to pdftotext.exe and convert pdf to text
exe
# wait = TRUE blocks until the conversion has finished; with wait = FALSE
# the txt file may not exist yet when it is opened below
system(paste("\"", exe, "\" \"", dest, "\"", sep = ""), wait = TRUE)
# get txt-file name and open it (shell.exec is Windows-only)
filetxt
shell.exec(filetxt)
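The conversion step above builds the command line by pasting escaped quotes around each path. As a minimal sketch of the same idea, base R's shQuote() can do the quoting. Both paths below are hypothetical placeholders (assumptions for illustration), not the values used in the original script:

```r
# Hypothetical placeholder paths (assumptions for illustration only)
exe_path <- "C:/path/to/pdftotext.exe"
pdf_path <- file.path(tempdir(), "example.pdf")

# shQuote(type = "cmd") wraps each argument in double quotes,
# so paths containing spaces survive the Windows shell
cmd <- paste(shQuote(exe_path, type = "cmd"), shQuote(pdf_path, type = "cmd"))

# When no output name is given, pdftotext writes file.txt next to file.pdf
txt_path <- sub("\\.pdf$", ".txt", pdf_path)

# Run the conversion only if the executable is actually present
if (file.exists(exe_path)) {
  system(cmd, wait = TRUE)  # block until the txt file has been written
}
```

Deriving the txt file name from the pdf name, rather than hard-coding it, keeps the two in step when the download goes to a temporary file.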
# do something with it, e.g. a simple word cloud
library(tm)
library(wordcloud)
library(Rstem)
txt
txt
txt
corpus
corpus
tdm
m
d
# Stem words
d$stem
# and copy the words into a column, otherwise they would be lost when aggregating
d$word
# remove web address (very long string):
d
# aggregate frequency by word stem and
# keep the first word of each stem
agg_freq
agg_word
d
# sort by frequency
d
# print wordcloud:
wordcloud(d$word, d$freq)
# remove files
file.remove(dir(tempdir(), full.names = TRUE))
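The counting and stem-aggregation steps of the script can be illustrated without any add-on packages. What follows is only a sketch under stated assumptions: the input text, the stop-word list, and the `toy_stem()` helper are all hypothetical, and tapply() is used here in place of the script's aggregation calls. It is not the tm/Rstem pipeline itself.

```r
# Toy input standing in for the lines read from the converted txt file
txt <- c("Jobs and more jobs.", "A job creates growth; growing matters.")

# Lower-case, strip punctuation, split on whitespace
words <- unlist(strsplit(gsub("[[:punct:]]", "", tolower(txt)), "\\s+"))

# Drop a small, hypothetical stop-word list
stop_words <- c("a", "and", "more")
words <- words[!words %in% stop_words]

# Word frequencies as a data frame with one row per distinct word
tab <- table(words)
d <- data.frame(word = names(tab), freq = as.vector(tab),
                stringsAsFactors = FALSE)

# Hypothetical toy stemmer: strips a trailing "ing" or "s".
# A real stemmer (e.g. Porter) does far more than this.
toy_stem <- function(w) sub("(ing|s)$", "", w)
d$stem <- toy_stem(d$word)

# Sum frequencies per stem and keep the first word seen for each stem;
# both tapply() calls group by the same stems, so their order matches
freq_by_stem <- tapply(d$freq, d$stem, sum)
word_by_stem <- tapply(d$word, d$stem, function(x) x[1])
d <- data.frame(stem = names(freq_by_stem),
                word = as.vector(word_by_stem),
                freq = as.vector(freq_by_stem),
                stringsAsFactors = FALSE)

# Sort by frequency, most frequent first
d <- d[order(d$freq, decreasing = TRUE), ]
```

Here "job" and "jobs" collapse into one row, which is why the script keeps a representative word per stem: the stem itself is often not a readable word, so the word cloud shows the first original word instead.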
Source: "ITPUB博客" (ITPUB Blog), link: http://blog.itpub.net/301743/viewspace-745512/. If you repost this article, please credit the source; otherwise legal action will be pursued.