数据科学与大数据分析项目练习-6在数据集Cora上应用文本分析

在数据集Cora上应用文本分析


首先还是要安装并加载需要使用的包

require("ggplot2")
install.packages("reshape2")
install.packages("lda")


require("reshape2")
require("lda")
#加载文档和词汇表数据(cora.documents)
data(cora.documents)
data(cora.vocab)
# 当前/活动主题将自动应用于您绘制的每个绘图
theme_set(theme_bw())

加载得到的cora documents 如下所示
在这里插入图片描述


# 显示的主题集群数量
K <- 10

# 要显示的文档数量
N <- 9

result <- lda.collapsed.gibbs.sampler(cora.documents,
                                      K,  
                                      cora.vocab,
                                      25,  
                                      0.1,
                                      0.1,
                                      compute.log.likelihood=TRUE) 

得到的result如图:
在这里插入图片描述


# 获取集群中的顶部单词
top.words <- top.topic.words(result$topics, 5, by.score=TRUE)

# 构建主题的比例
topic.props <- t(result$document_sums) / colSums(result$document_sums)

document.samples <- sample(1:dim(topic.props)[1], N)
topic.props <- topic.props[document.samples,]

topic.props[is.na(topic.props)] <-  1 / K

colnames(topic.props) <- apply(top.words, 2, paste, collapse=" ")

topic.props.df <- melt(cbind(data.frame(topic.props),
                             document=factor(1:N)),
                       variable.name="topic",
                       id.vars = "document")  

qplot(topic, value*100, fill=topic, stat="identity",
      ylab="proportion (%)", data=topic.props.df,
      geom="histogram") +
  theme(axis.text.x = element_text(angle=0, hjust=1, size=12)) +
  coord_flip() +
  facet_wrap(~ document, ncol=3)

最终我们得到了下图:
在这里插入图片描述
参考书目

  1. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, EMC Education Services, John Wiley & Sons, 27 Jan. 2015
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
Data Science and Big Data Analytics is about harnessing the power of data for new insights. The book covers the breadth of activities and methods and tools that Data Scientists use. The content focuses on concepts, principles and practical applications that are applicable to any industry and technology environment, and the learning is supported and explained with examples that you can replicate using open-source software. This book will help you: Become a contributor on a data science team Deploy a structured lifecycle approach to data analytics problems Apply appropriate analytic techniques and tools to analyzing big data Learn how to tell a compelling story with data to drive business action Prepare for EMC Proven Professional Data Science Certification Corresponding data sets are available at www.wiley.com/go/9781118876138. Get started discovering, analyzing, visualizing, and presenting data in a meaningful way today! Table of Contents Chapter 1 Introduction to Big Data Analytics Chapter 2 Data Analytics Lifecycle Chapter 3 Review of Basic Data Analytic Methods Using R Chapter 4 Advanced Analytical Theory and Methods: Clustering Chapter 5 Advanced Analytical Theory and Methods: Association Rules Chapter 6 Advanced Analytical Theory and Methods: Regression Chapter 7 Advanced Analytical Theory and Methods: Classification Chapter 8 Advanced Analytical Theory and Methods: Time Series Analysis Chapter 9 Advanced Analytical Theory and Methods: Text Analysis Chapter 10 Advanced Analytics—Technology and Tools: MapReduce and Hadoop Chapter 11 Advanced Analytics—Technology and Tools: In-Database Analytics Chapter 12 The Endgame, or Putting It All Together

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值