Collapsing text by group in R (collapse text by group in data frame)

Latest recommended article published on 2024-10-07 13:39:05

yuanzhoulvpi

Category: R

Article link: https://blog.csdn.net/yuanzhoulvpi/article/details/100680370

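In an R data frame, we can collapse the values of one column within the groups defined by another column, as in the figure below:

We want to group the rows in Figure 1 by SheetID and turn them into the form shown in Figure 2.

So how is it done? I am a beginner too; I learned this once and have since forgotten all of it. I later searched StackOverflow and was surprised by how many ways there are to do it, so let me share them.

The dataset is one I put together myself. It is in Chinese, which suits local readers better.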

--------------------------------------------------------------------------------------------------------------

Next, load the data into RStudio.

library(dplyr)
df <- read.table("ceshi1.txt", header = TRUE, sep = '\t', 
                 fileEncoding = "gbk", stringsAsFactors = FALSE)

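Now comes the part that shows how powerful R is: the pipe operator, group_by, and summarise.

What impressed me most here is the summarise function. When I read "R for Data Science" I assumed it could only compute numeric statistics such as max and min. It turns out you can pass in any R function that returns one value per group, including functions that operate on strings. Here I deduplicate the text within each group, join it with commas, and put the result on the row for that group.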

new <- df %>% group_by(SheetID) %>%
  summarise(text = paste(unique(CateName), collapse = ","))

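Here is the result:

Really powerful, haha. I may not explain it well; these are also notes for myself.

I also came across other answers, which I list below.

For convenience I will not use the dataset above, just the simplest possible example: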
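Since ceshi1.txt is not included here, the following is a minimal base-R sketch of what the expression inside summarise does for a single group. The vector `cate` is made-up illustration data, not taken from the original dataset.

```r
# Hypothetical values for one group (illustration only, not the original data)
cate <- c("fruit", "vegetable", "fruit", "snack")

# unique() drops repeated values and keeps the order of first occurrence
deduped <- unique(cate)

# paste(..., collapse = ",") joins a whole vector into one string
with_dups    <- paste(cate, collapse = ",")
without_dups <- paste(deduped, collapse = ",")

print(with_dups)     # "fruit,vegetable,fruit,snack"
print(without_dups)  # "fruit,vegetable,snack"
```

Because the expression returns a single string for each group, it fits into summarise in the same way max or min would.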

> df <- read.table(header = TRUE, text = "
+                  group text
+                  a a1
+                  a a2
+                  a a3
+                  b b1
+                  b b2
+                  b b4
+                  c c2 
+                  c c1
+                  c c4
+                  c c6
+                  c c7")
> df
   group text
1      a   a1
2      a   a2
3      a   a3
4      b   b1
5      b   b2
6      b   b4
7      c   c2
8      c   c1
9      c   c4
10     c   c6
11     c   c7
> # Now a simple approach with base R
> sapply(unique(df$group), function(x) {
+   paste(df[df$group == x, "text"], collapse = ",")
+ })
[1] "a1,a2,a3"       "b1,b2,b4"       "c2,c1,c4,c6,c7"
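The sapply result above does not say which string belongs to which group. As a possible alternative (not one of the answers listed in this post), base R's tapply does the split-and-apply in one call and keeps the group labels as names. This is a sketch on a shortened copy of the example data.

```r
# Shortened copy of the example data, built directly for a self-contained run
df <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  text  = c("a1", "a2", "a3", "b1", "b2", "b4"),
  stringsAsFactors = FALSE
)

# tapply() splits df$text by df$group and calls paste() on each piece;
# collapse = "," is passed through to paste()
res <- tapply(df$text, df$group, paste, collapse = ",")

print(res)         # one collapsed string per group
print(names(res))  # the group labels
```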

Next, using the aggregate function (to be honest, I find this way of calling aggregate a bit hard to read; there are too many arguments)

> aggregate(df$text, list(df$group), paste, collapse = ",")
  Group.1              x
1       a       a1,a2,a3
2       b       b1,b2,b4
3       c c2,c1,c4,c6,c7

Or

> aggregate(text ~ group, data = df, FUN = paste, collapse = ",")
  group           text
1     a       a1,a2,a3
2     b       b1,b2,b4
3     c c2,c1,c4,c6,c7
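The aggregate calls above may be easier to read with every argument named. In this sketch, `x` is the data to collapse, `by` is a named list of grouping vectors (the list names become the output column names, which is why the first call printed `Group.1`), and `FUN` is applied once per group. The anonymous function is just a spelled-out equivalent of passing `paste` with `collapse = ","`.

```r
# Small copy of the example data for a self-contained run
df <- data.frame(
  group = c("a", "a", "b"),
  text  = c("a1", "a2", "b1"),
  stringsAsFactors = FALSE
)

out <- aggregate(
  x   = df["text"],                            # column(s) to aggregate
  by  = list(group = df$group),                # named list of grouping vectors
  FUN = function(v) paste(v, collapse = ",")   # applied to each group's values
)

print(out)
```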

Next, using plyr

> library(plyr)
-------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
-------------------------------------------------------------------------------

Attaching package: ‘plyr’

The following objects are masked from ‘package:dplyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

> ddply(df, .(group), summarise, text = paste(text, collapse = ","))
  group           text
1     a       a1,a2,a3
2     b       b1,b2,b4
3     c c2,c1,c4,c6,c7
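The warning printed when plyr was loaded is worth taking seriously: once plyr masks summarise, a dplyr pipeline in the same session may call plyr's version, which ignores the grouping and returns a single row. One way to sidestep this, sketched here under the assumption that dplyr is installed, is to namespace-qualify the calls with `::`.

```r
# Small copy of the example data for a self-contained run
df <- data.frame(
  group = c("a", "a", "b"),
  text  = c("a1", "a2", "b1"),
  stringsAsFactors = FALSE
)

# Guarded so the script still runs where dplyr is not installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  # dplyr:: always reaches dplyr's functions, whatever plyr has masked
  grouped <- dplyr::group_by(df, group)
  res <- dplyr::summarise(grouped, text = paste(text, collapse = ","))
  print(res)
}
```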

Next, using the data.table package

> library(data.table)
data.table 1.12.2 using 6 threads (see ?getDTthreads).  Latest news: r-datatable.com

Attaching package: ‘data.table’

The following objects are masked from ‘package:dplyr’:

    between, first, last

> dt <- as.data.table(df)
> dt[, list(text = paste(text, collapse = ",")), by = group]
   group           text
1:     a       a1,a2,a3
2:     b       b1,b2,b4
3:     c c2,c1,c4,c6,c7

Next, using the dplyr package

> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> df %>% 
+   group_by(group) %>%
+   summarise(t = paste(text, collapse = ","))
# A tibble: 3 x 2
  group t             
  <fct> <chr>         
1 a     a1,a2,a3      
2 b     b1,b2,b4      
3 c     c2,c1,c4,c6,c7

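Having written up these methods, I have to admit I still prefer Python, haha. But so many people in this field use R that there is no way around it.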