用户稿件 | 好家伙，到底谁在用TBtools?

最新推荐文章于 2024-08-24 09:53:16 发布

刘永鑫Adam

最新推荐文章于 2024-08-24 09:53:16 发布

阅读量1.4k

点赞数

文章标签：可视化大数据数据可视化编程语言 java

原文链接：https://www.jianshu.com/p/7dfa81db8308

版权

导语

老友邵扬（与我年龄相仿，尚在海外留学）详细整理了自TBtools预印本上传（2018年）以来，引用了TBtools的文稿信息。这份资料和TBtools本身开发是类似的。一不小心，就可能是工作量大且吃力不讨好。邵扬曾经在他的简书（阅读原文可跳转）提及，大意是“对TBtools是黑转粉”。可以说，绝大多数是路人转黑，毕竟我确实不擅长交际。所以，是有趣的人，一件有趣的事情，我还是后来和他成为朋友之后才刷到那篇推文。他时常很热情的在群里回复TBtools使用问题，而我多少还是劝他别管。我很清楚，对于一个已经对TBtools的使用甚至设计逻辑有透彻了解的人，在社群里，除了燃烧自己，贡献时间精力，感觉已经不大可能有收获。而对于一个博士研究生，更多的精力，应该花费在自己课题上。想想比较矛盾，一方面我期望更多人来社群做贡献，因为这个对TBtools用户来说有益无害，另一方面当我看到有热心做贡献的人时，我又更多地期望以及提醒他们，别这么干，该做点对自己真正有益的事情。尽管，我的言辞往往并不效果。啥也不说了，回到主题。

这是一份有价值的推文，详细总结了用户都用TBtools，干了啥，发表了啥工作。邵扬整理出来的资料，事实上，完全可以作为数据可视化练习，尤其是R语言可视化，如ggplot2入门的练习数据。我个人的导语显得冗长且片面，大伙还是看看邵扬整理的吧。预祝大伙新年2021元旦快乐！

要点概况

越来越多的文章使用并引用TBtools
TBtools的用户都喜欢把文章发在谁家？
用户最爱用热图
使用了TBtools的文章都发在了什么水平杂志上
TBtools用户都喜欢研究什么

写在前面

之前写了publish or perish的使用方法。那么问题来了，我们获取了这些数据能拿来干什么？于是萌生了一个想法，不如统计一下引用TBtools的文章，统计一下每年文章发表数量（截止12月初），发表的层次，用户都用了哪些功能等等。当初的想法是我一个滑铲就全部搞定了，但实现起来嘛……一干就是两个周，并且短期内我是再也不想看基因挖掘相关的任何东西了……统计到最后实在是想吐了，如果你发现你发的文章被错误的统计了，或者没统计到……欢迎私信告诉我，我一定不改（

数据导入和准备工作

这里的话数据我已经整理好了，如果你也感兴趣的话点这里进行下载(https://cowtransfer.com/s/6f63b578a5e34f)。
首先载入数据

#Wipe the environmentrm(list=ls())
#Set the work directory
setwd("/Users/yangshao/OneDrive")
#Import the data
dta <- read.csv("clean_cite.csv", header = T)
#Take a peek at the data
skim(dta)
#Dimension of the data
dim(dta)

PS:这里是我的地址，不要无脑复制……

之后载入需要用到的包，接下来就可以进行分析了！如果没有安装对应的包，先让电脑安装就好了。实在不会可以让你师兄先来学，然后教你

library(skimr)library(tidyverse)
library(ggplot2)
library(ggpubr)
library(ggthemes)
library(tidytext)
library(stringr)
library(wordcloud)
library(networkD3)

越来越多的文章使用并引用TBtools

首先是每年引用文章数的可视化

#Citation per yearcite_trend <- sort(table(dta$Year), decreasing = TRUE)
cite_trend <- as.data.frame(cite_trend)
#Visualization
colnames(cite_trend)  <- c('Year','Num.')
ggbarplot(cite_trend, x = "Year", fill = "Year", order = c('2018','2019','2020'), y = "Num.", palette = 'jco') + 
  scale_y_continuous(expand = c(0, 0))

从2018年至2020年，TBtools的每年引用数呈现上升趋势（废话）。截止2020年12月初，2020年的引用数已经超过500次。需要说明的是，这里的数据是相对比较保守的数据，实际引用数可能远不止这些。

TBtools的用户都喜欢把文章发在谁家？

使用以下代码完成可视化，别问我为啥用ggpubr……

#Sort the top 15 publishercite_publisher <- head(sort(table(dta$Publisher), decreasing = TRUE),15)
#Transform the data to a dataframe
cite_publisher <- as.data.frame(cite_publisher)
#Rename the col names
colnames(cite_publisher)  <- c('Publisher','Count')
#Visualization
ggbarplot(cite_publisher, x = "Publisher", fill = "Publisher", y = "Count") + 
  scale_y_continuous(expand = c(0, 0)) +
  theme_pubr(
  base_family = "",
  border = FALSE,
  margin = TRUE,
  legend = 'none',
  x.text.angle = 45
)

截止2020年12月初，mdpi发表了多数文章，施普林格（包含BMC）和爱斯维尔紧随其后。
三个出版商一共发表了538篇，占总数的65.1%。

cite_publisher %>% head(3) %>% select(Count) %>% sum()
cite_publisher %>% head(3) %>% select(Count) %>% sum()/count(dta)

那么引用了TBtools的文章发表在了哪些杂志上呢？同样的方式完成可视化，别问我为什么不用ggpubr……

#Sort the top 15 Journalcite_source <- head(sort(table(dta$Source), decreasing = TRUE),15)
#Transform the data to a dataframe
cite_source <- as.data.frame(cite_source)
#Rename the col
colnames(cite_source)  <- c('Journal Name','Count')
#Visualization
ggplot(cite_source,aes(x = cite_source$`Journal Name`, y = Count, fill = cite_source$`Journal Name`)) + 
  geom_bar(stat='identity') + 
  scale_y_continuous(expand = c(0, 0)) + 
  xlab("Journal Name") +
  guides(fill=guide_legend(title="Journal Name")) +
  theme_pubr(
  base_family = "",
  border = FALSE,
  margin = TRUE
) + rremove("x.text")

从结果上看，多数文章发在了OA杂志上，例如来自mdpi的International Journal of Molecular Sciences，例如来自Springer的BMC Plant Biology和BMC Genomics，例如来自Frontier的Frontier in Genetics和Frontiers in Physiology，例如PeerJ, PLoS等……按你胃，有钱发OA真好……

用户最爱用热图

每篇文章都会在文章里提到使用了TBtools的哪些功能。很遗憾这部分功能只能手动登录了，因为很多情况下，大家对同一个功能叙述存在差异，例如同样是一个热图功能有的人说'the DEGs were visualized by TBtools'有的人会说'the heatmaps were used to show/draw 。。。。(TBtools， github)'或者是'the correlation coefficients were visualized by TBtools' 等等。还有Circos图，很多时候难以分辨到底是Circos画的还是用TBtools画的，这种情况提到了TBtools索性都用Synteny代替了。还有一些懒得索性就说'the data were visualized by aaa, bbb, and TBtools' 的，这种也只能自己手动找。笔者已经很认真的过滤了，但人工做这方面工作难免会出错，有些时候甚至看到某几个图就把几个常用功能一列完事了……所以如果你发现我列举的功能跟文章提到的有出入，李姐万岁……
在这里分析的时候需要先做成长透视表（pivot_longer），然后选择数据开始作图，这里因为需要做三次同样的图，干脆就把表格的命令做成一个函数，直接调用就好了。这样显得不那么弱智（

#Create a pivot table
dta %>% select(Year,17:25) %>% pivot_longer(!Year,
                                            names_to = 'Name',
                                            values_to = 'Fun') -> Fun_count_pivot
#Function for counting
Fun_count <- function(year, Pivot){
  table(Pivot %>% filter(Year == year) %>% .$Fun) %>% 
  as.data.frame() %>% .[-1,] -> Count_Year
  colnames(Count_Year) <-c('Function', 'Count')
  top_three <- tail(Count_Year[order(Count_Year$Count),],3)
  ggbarplot(Count_Year, x = "Function", fill = "Function", y = "Count") + 
  scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0,.1))) +
  theme_pubr(
  base_family = "",
  border = FALSE,
  margin = TRUE
) + rremove("x.text") + geom_text(top_three, mapping = aes(label = Function), nudge_y = top_three$Count*.2) + geom_text(top_three, mapping = aes(label = Count), nudge_y = top_three$Count*.1)
}
#Visualization the mentioned functions in papers
Fun_count("2018",Fun_count_pivot)
Fun_count("2019",Fun_count_pivot)
Fun_count("2020",Fun_count_pivot)

2018年

2019年

2020年

从2018年到2020年，用户使用TBtools的姿势越来越多：2018年，主要提及的功能只有15个，而到了2020年文章中提及的功能有29个。从2019年开始，TBtools用户群体最多使用的三个功能为:'Heatmap', 'motif'和'Gene structure'。

最多提及的功能可以通过以下代码查看

Fun_sum <- function(year, Pivot){table(Pivot %>% filter(Year == year) %>% .$Fun) %>% 
  as.data.frame() %>% .[-1,] -> Fun_sum
  colnames(Fun_sum) <- c('Function', 'Count')
  Fun_sum <- head(Fun_sum[order(-Fun_sum$Count),],15)
  return(Fun_sum)
}
Count_2018 <- Fun_sum(2018,Fun_count_pivot)
Count_2019 <- Fun_sum(2019,Fun_count_pivot)
Count_2020 <- Fun_sum(2020,Fun_count_pivot)
data.frame("2018" = Count_2018,"2019" = Count_2019, "2020" = Count_2020) %>% view()

与2019年相比，今年的TBtools用户更喜欢用TBtools做进化树，韦恩图和fasta文件的相关操作。在2020年，TBtools的多数功能确实都有用户在用。尤其是陈博士重点照顾的Heatmap，gene structure annotation，motif和phylogenetic tree。常用到你不一起做就觉得少了点什么的程度。当然，这也侧面反映了今年家族挖掘的研究还是很流行……

使用了TBtools的文章都发在了什么水平杂志上

我知道IF这东西不好做横向对比。之前也有想法做个TBtools到底有没有帮助作者把文章发在更好的杂志上。但最后还是放弃了。说到底我认为一篇文章能否发在更好的杂志上还得看你的文章是否有真东西。工具是帮助你更好的展现你的想法，讲好你的故事而不是帮你做一些花里胡哨的图糊弄reviewer的。
所以在这里做一下TBtools这三年的IF趋势吧。

IF_by_year <- dta %>% select(Year, Source, ImpactFactor)IF_by_year$Year <- as.character(IF_by_year$Year)
IF_by_year$ImpactFactor <- as.numeric(IF_by_year$ImpactFactor)
gghistogram(IF_by_year, x = 'ImpactFactor', bins = 100, add = "mean", color = "Year", palette = 'mpg', rug = TRUE,add_density = TRUE, alpha = .8,xticks.by = 1)
gghistogram(IF_by_year, x = 'ImpactFactor', bins = 100, add = "mean", color = "Year", palette = '

有意思的是，这三年TBtools的文章主要落在了IF=3-4的区间。而在2020年，IF的两端出现了凸起。：一方面，拿家族基因挖掘当财富密码的人越来越多。这块相互竞争激烈，一些文章不得不选择低IF的杂志发表文章。而另一方面也有不少文章发表在了老牌期刊上，而这些期刊的IF常常落在6-7的区间。当然这里有个例外：'Genomics'，根据学之策公众号的实时IF统计显示，'Genomics'的实时IF实际在3.2左右。我能说什么呢，套路深啊……

dta %>% select(Year,Source, ImpactFactor) %>%   filter(ImpactFactor < 7 & ImpactFactor > 6 & Source != 'Genomics') %>% group_by(Source) %>%
  summarise(Count = n()) %>% ggplot(aes(x=Source, y=Count, fill = Source)) + 
  geom_bar(stat='identity',show.legend = FALSE) + 
  scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0,.1))) + 
  coord_flip() + theme_few()

为了减少这部分的影响，这里剔除了这个'Genomics'期刊。

IF_by_year %>%   filter(Source != 'Genomics') %>%
gghistogram(., x = 'ImpactFactor', bins = 100, add = "mean", color = "Year", palette = 'mpg', rug

dta %>% select(Year,Source, ImpactFactor,17:25) %>% 
  filter(ImpactFactor > 6& Source != 'Genomics') %>% view()

这里再次看了下6分以上的文章都喜欢用TBtools的什么功能。一看不得了，这可不是随便用的啊，一看就是有备而来。最多用的功能还是Heatmap。广州这个地方肯定很热……嗯，希望没什么奇怪的联想。这里就不做图了，感兴趣的不妨自己试试。

TBtools用户都喜欢研究什么

这个问题说真的我很难回答，毕竟做这个我不专业。大概琢磨了下分词，弄了个词云出来看看。

dta %>% select(Year, Title) %>%   unnest_tokens(output = 'word', input = 'Title', token = "words", to_lower = TRUE) %>% #Text segmentation
  select(word) %>% 
  anti_join(stop_words) %>% #delete the stop words 
  count(word, sort = TRUE) -> word_count


set.seed(2020) #for reproducibility 
wordcloud(words = word_count$word, 
          freq = word_count$n, 
          min.freq = 1,
          max.words = 500, 
          random.order = FALSE, 
          rot.per = 0.35,            
          colors=brewer.pal(8, "Dark2")
          )

hmm,genome-wid and gene family expression analysis/identification...这还能说什么呢。可涨点薪吧。

最后，因为植物口出身所以看一下都用了什么植物做研究……

get_source <- function(name){dta %>% select(Title, ImpactFactor) %>% 
  mutate(lower_title = tolower(Title)) %>% 
  filter(str_detect(lower_title, name)) %>% 
  select(ImpactFactor)
}


list <- c('cotton','arabidopsis', 'tea','potato','rice','tomato','wheat','soybean','cucumber','barley','maize','tobacco','brachypodium','pepper','radish')


list_fruit <- c('pear','apple','citrus','cucumis','sugarcane','strawberry','dimocarpus','grape','litchi','banana','watermelon','jujube','peach','prunus')


index <- data.frame(list.i. = '', n = '')
for (i in 1:length(list)){
  count <- get_source(list[i]) %>% count()
  index_i <- data.frame(list[i],count)
  index <- rbind(index, index_i)
}
index <- index[-1,]
colnames(index) <- c('Species name','Count')
index$Count <- as.numeric(index$Count)
index <- index[order(-index$Count),]
view(index)

嗨，我觉得该改标题了，就叫什么？研究小麦的居然还不知道TBtools?好了……本来还想一个滑铲把家族基因统计了，我怕又铲到沟里去，算了……另外还有个包含水果的list（我知道这么分类不专业，别骂了别骂了），感兴趣的可以自己看看。

写在最后

没啥想说的了，最后搞个山鸡图和热图祝大家科研顺利吧……

#Create a new datasetdta %>% filter(Source != 'Genomics') %>% 
  select(Year, Publisher, Source, ImpactFactor, 17:25) %>% 
  pivot_longer(cols = starts_with('Fun_'),
               names_to = 'Name', 
               values_to = 'Fun') %>% 
  select(Year, Publisher, Source, ImpactFactor, Fun) %>% 
  filter(Fun != '') -> sankey_dta
#Add an IF column
sankey_dta %>% 
  mutate(IF = ifelse(ImpactFactor > 6,'General Impact',
                     ifelse(ImpactFactor < 2, 'Have Impact',
                            ifelse(ImpactFactor > 2 & ImpactFactor < 6, 'Specific Impact',"no")))) %>% 
  select(Year,Publisher,Source,ImpactFactor, IF,Fun) -> sankey_dta
sankey_dta %>% group_by(Source,ImpactFactor) %>% filter(ImpactFactor>5|n()>10) -> sankey_dta


sankey_dta %>% select(Year,Publisher) %>% 
  group_by(Year,Publisher) %>% 
  summarise(value = n()) -> sankey_index
colnames(sankey_index) <- c('source','target','value')
sankey_index$source <- as.character(sankey_index$source)  


sankey_index <- sankey_dta %>% select(Publisher,Source) %>% 
  group_by(Publisher,Source) %>% 
  summarise(value = n()) %>% 
  rename(.,source = Publisher, target = Source) %>% 
  rbind(sankey_index,.)


sankey_dta %>% select(Source,ImpactFactor) %>% 
  mutate(IF = ifelse(ImpactFactor > 6,'General_Impact',
                     ifelse(ImpactFactor < 2, 'Have_Impact',
                            ifelse(ImpactFactor > 2 & ImpactFactor < 6, 'Specific_Impact',"no")))) %>% 
  group_by(Source,IF) %>% 
  summarise(value = n()) %>% rename(.,source = Source, target = IF) %>% 
  rbind(sankey_index,.) -> sankey_index


sankey_dta %>% select(ImpactFactor,Fun) %>% 
  mutate(IF = ifelse(ImpactFactor > 6,'General_Impact',
                     ifelse(ImpactFactor < 2, 'Have_Impact',
                            ifelse(ImpactFactor > 2 & ImpactFactor < 6, 'Specific_Impact',"no")))) %>% 
  group_by(IF,Fun) %>% 
  summarise(value = n()) %>% rename(.,source = IF, target = Fun) %>% 
  rbind(sankey_index,.) -> sankey_index


sankey_index[order(sankey_index$source,sankey_index$target),]


nodes <- data.frame(
  name=c(as.character(sankey_index$source), as.character(sankey_index$target)) %>% 
    unique()
)


sankey_index$IDsource <- match(sankey_index$source, nodes$name)-1 
sankey_index$IDtarget <- match(sankey_index$target, nodes$name)-1
sankey_index[order(sankey_index$IDsource,sankey_index$IDtarget),]
p <- sankeyNetwork(Links = sankey_index, Nodes = nodes, Source = "IDsource", Target = "IDtarget", 
              Value = "value", NodeID = "name", fontSize = 12, nodeWidth = 20,LinkGroup="source")
p