R-aggregate()

概述

aggregate函数应该是数据处理中常用到的函数,简单说有点类似sql语言中的group by,可以按照要求把数据打组聚合,然后对聚合以后的数据进行加和、求平均等各种操作。

x=data.frame(name=c("张三","李四","王五","赵六"),sex=c("M","M","F","F"),age=c(20,40,22,30),height=c(166,170,150,155))

构造一个很简单的数据,一组人的性别、年龄和身高,可以用aggregate函数来求不同性别的平均年龄和身高

aggregate(x[,3:4],by=list(sex=x$sex),FUN=mean)

几个注意点:

  • 字符或者factor类型的列不要一起加入计算,会报错
  • by参数要构造成list,如果有多个字段,by就对应队列,和group by多个字段是同样的道理

这个函数的功能比较强大,它首先将数据进行分组(按行),然后对每一组数据进行函数统计,最后把结果组合成一个比较nice的表格返回。根据数据对象不同它有三种用法,分别应用于数据框(data.frame)、公式(formula)和时间序列(ts):

aggregate(x, by, FUN, ..., simplify = TRUE)
aggregate(formula, data, FUN, ..., subset, na.action = na.omit)
aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1, ts.eps = getOption("ts.eps"), ...)  

语法

aggregate(x, ...)
 
## S3 method for class 'default':
aggregate((x, ...))

## S3 method for class 'data.frame':
aggregate((x, by, FUN, ..., simplify = TRUE))

## S3 method for class 'formula':
aggregate((formula, data, FUN, ...,
          subset, na.action = na.omit))

## S3 method for class 'ts':
aggregate((x, nfrequency = 1, FUN = sum, ndeltat = 1,
          ts.eps = getOption("ts.eps"), ...))

###细节查看  ?aggregate

Example1

  我们通过 mtcars 数据集的操作对这个函数进行简单了解。mtcars 是不同类型汽车道路测试的数据框类型数据:

> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...

  先用attach函数把mtcars的列变量名称加入到变量搜索范围内,然后使用aggregate函数按cyl(汽缸数)进行分类计算平均值:

> attach(mtcars)
> aggregate(mtcars, by=list(cyl), FUN=mean)
Group.1 mpg cyl disp hp drat wt qsec vs am gear carb
1 4 26.66364 4 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
2 6 19.74286 6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
3 8 15.10000 8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000

  by参数也可以包含多个类型的因子,得到的就是每个不同因子组合的统计结果:

> aggregate(mtcars, by=list(cyl, gear), FUN=mean)

Group.1 Group.2 mpg cyl disp hp drat wt qsec vs am gear carb
1 4 3 21.500 4 120.1000 97.0000 3.700000 2.465000 20.0100 1.0 0.00 3 1.000000
2 6 3 19.750 6 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00 3 1.000000
3 8 3 15.050 8 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00 3 3.083333
4 4 4 26.925 4 102.6250 76.0000 4.110000 2.378125 19.6125 1.0 0.75 4 1.500000
5 6 4 19.750 6 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50 4 4.000000
6 4 5 28.200 4 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00 5 2.000000
7 6 5 19.700 6 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00 5 6.000000
8 8 5 15.400 8 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00 5 6.000000

  公式(formula)是一种特殊的R数据对象,在aggregate函数中使用公式参数可以对数据框的部分指标进行统计:

> aggregate(cbind(mpg,hp) ~ cyl+gear, FUN=mean)
cyl gear mpg hp
1 4 3 21.500 97.0000
2 6 3 19.750 107.5000
3 8 3 15.050 194.1667
4 4 4 26.925 76.0000
5 6 4 19.750 116.5000
6 4 5 28.200 102.0000
7 6 5 19.700 175.0000
8 8 5 15.400 299.5000

  上面的公式 cbind(mpg,hp) ~ cyl+gear 表示使用 cyl 和 gear 的因子组合对 cbind(mpg,hp) 数据进行操作。aggregate在时间序列数据上的应用请参考R的函数说明文档。

Example2


## Compute the averages for the variables in 'state.x77', grouped
## according to the region (Northeast, South, North Central, West) that
## each state belongs to.
aggregate(state.x77, list(Region = state.region), mean)
 
## Compute the averages according to region and the occurrence of more
## than 130 days of frost.
aggregate(state.x77,
          list(Region = state.region,
               Cold = state.x77[,"Frost"] > 130),
          mean)
## (Note that no state in 'South' is THAT cold.)
 

## example with character variables and NAs
testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
                     v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )
by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
by2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)
aggregate(x = testDF, by = list(by1, by2), FUN = "mean")
 
# and if you want to treat NAs as a group
fby1 <- factor(by1, exclude = "")
fby2 <- factor(by2, exclude = "")
aggregate(x = testDF, by = list(fby1, fby2), FUN = "mean")
 
 
## Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:
aggregate(weight ~ feed, data = chickwts, mean)
aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)
 
## Dot notation:
aggregate(. ~ Species, data = iris, mean)
aggregate(len ~ ., data = ToothGrowth, mean)
 
## Often followed by xtabs():
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
xtabs(len ~ ., data = ag)
 
 
## Compute the average annual approval ratings for American presidents.
aggregate(presidents, nfrequency = 1, FUN = mean)
## Give the summer less weight.
aggregate(presidents, nfrequency = 1,
          FUN = weighted.mean, w = c(1, 1, 0.5, 1))

Example3

#load data
data <- ChickWeight
head(data)
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1
 
#dimension of the data
dim(data)
[1] 578   4
 
#how many chickens
unique(data$Chick)
 [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 < 3 < 1 < 12 < ... < 48
 
#how many diets
unique(data$Diet)
[1] 1 2 3 4
Levels: 1 2 3 4
 
#how many time points
unique(data$Time)
 [1]  0  2  4  6  8 10 12 14 16 18 20 21
 
library(ggplot2)
ggplot(data=data, aes(x=Time, y=weight, group=Chick, colour=Chick)) +
       geom_line() +
       geom_point()

------------------------------------------------------

## S3 method for class 'data.frame'
## aggregate(x, by, FUN, ..., simplify = TRUE)

#find the mean weight depending on diet
aggregate(data$weight, list(diet = data$Diet), mean)
  diet        x
1    1 102.6455
2    2 122.6167
3    3 142.9500
4    4 135.2627
 
#aggregate on time
aggregate(data$weight, list(time=data$Time), mean)
   time         x
1     0  41.06000
2     2  49.22000
3     4  59.95918
4     6  74.30612
5     8  91.24490
6    10 107.83673
7    12 129.24490
8    14 143.81250
9    16 168.08511
10   18 190.19149
11   20 209.71739
12   21 218.68889
 
#use a different function
aggregate(data$weight, list(time=data$Time), sd)
   time         x
1     0  1.132272
2     2  3.688316
3     4  4.495179
4     6  9.012038
5     8 16.239780
6    10 23.987277
7    12 34.119600
8    14 38.300412
9    16 46.904079
10   18 57.394757
11   20 66.511708
12   21 71.510273
 
#we could also aggregate on time and diet
head(aggregate(data$weight,
               list(time = data$Time, diet = data$Diet),
               mean
              )
    )
  time diet        x
1    0    1 41.40000
2    2    1 47.25000
3    4    1 56.47368
4    6    1 66.78947
5    8    1 79.68421
6   10    1 93.05263
tail(aggregate(data$weight,
               list(time = data$Time, diet = data$Diet),
               mean
              )
    )
   time diet        x
43   12    4 151.4000
44   14    4 161.8000
45   16    4 182.0000
46   18    4 202.9000
47   20    4 233.8889
48   21    4 238.5556
 
#to see the weights over time across different diets
ggplot(data) + geom_line(aes(x=Time, y=weight, colour=Chick)) +
             facet_wrap(~Diet) +
             guides(col=guide_legend(ncol=3))

Example4

  The aggregate function is more difficult to use, but it is included in the base R installation and does not require the installation of another package.

# Get a count of number of subjects in each category (sex*condition)
cdata <- aggregate(data["subject"], by=data[c("sex","condition")], FUN=length)
cdata
#>   sex condition subject
#> 1   F   aspirin       5
#> 2   M   aspirin       9
#> 3   F   placebo      12
#> 4   M   placebo       4

# Rename "subject" column to "N"
names(cdata)[names(cdata)=="subject"] <- "N"
cdata
#>   sex condition  N
#> 1   F   aspirin  5
#> 2   M   aspirin  9
#> 3   F   placebo 12
#> 4   M   placebo  4

# Sort by sex first
cdata <- cdata[order(cdata$sex),]
cdata
#>   sex condition  N
#> 1   F   aspirin  5
#> 3   F   placebo 12
#> 2   M   aspirin  9
#> 4   M   placebo  4

# We also keep the __before__ and __after__ columns:
# Get the average effect size by sex and condition
cdata.means <- aggregate(data[c("before","after","change")], 
                         by = data[c("sex","condition")], FUN=mean)
cdata.means
#>   sex condition   before     after    change
#> 1   F   aspirin 11.06000  7.640000 -3.420000
#> 2   M   aspirin 11.26667  5.855556 -5.411111
#> 3   F   placebo 10.13333  8.075000 -2.058333
#> 4   M   placebo 11.47500 10.500000 -0.975000

# Merge the data frames
cdata <- merge(cdata, cdata.means)
cdata
#>   sex condition  N   before     after    change
#> 1   F   aspirin  5 11.06000  7.640000 -3.420000
#> 2   F   placebo 12 10.13333  8.075000 -2.058333
#> 3   M   aspirin  9 11.26667  5.855556 -5.411111
#> 4   M   placebo  4 11.47500 10.500000 -0.975000

# Get the sample (n-1) standard deviation for "change"
cdata.sd <- aggregate(data["change"],
                      by = data[c("sex","condition")], FUN=sd)
# Rename the column to change.sd
names(cdata.sd)[names(cdata.sd)=="change"] <- "change.sd"
cdata.sd
#>   sex condition change.sd
#> 1   F   aspirin 0.8642916
#> 2   M   aspirin 1.1307569
#> 3   F   placebo 0.5247655
#> 4   M   placebo 0.7804913

# Merge
cdata <- merge(cdata, cdata.sd)
cdata
#>   sex condition  N   before     after    change change.sd
#> 1   F   aspirin  5 11.06000  7.640000 -3.420000 0.8642916
#> 2   F   placebo 12 10.13333  8.075000 -2.058333 0.5247655
#> 3   M   aspirin  9 11.26667  5.855556 -5.411111 1.1307569
#> 4   M   placebo  4 11.47500 10.500000 -0.975000 0.7804913

# Calculate standard error of the mean
cdata$change.se <- cdata$change.sd / sqrt(cdata$N)
cdata
#>   sex condition  N   before     after    change change.sd change.se
#> 1   F   aspirin  5 11.06000  7.640000 -3.420000 0.8642916 0.3865230
#> 2   F   placebo 12 10.13333  8.075000 -2.058333 0.5247655 0.1514867
#> 3   M   aspirin  9 11.26667  5.855556 -5.411111 1.1307569 0.3769190
#> 4   M   placebo  4 11.47500 10.500000 -0.975000 0.7804913 0.3902456

  If you have NA’s in your data and wish to skip them, use na.rm=TRUE:

cdata.means <- aggregate(data[c("before","after","change")], 
                         by = data[c("sex","condition")],
                         FUN=mean, na.rm=TRUE)
cdata.means
#>   sex condition   before     after    change
#> 1   F   aspirin 11.06000  7.640000 -3.420000
#> 2   M   aspirin 11.26667  5.855556 -5.411111
#> 3   F   placebo 10.13333  8.075000 -2.058333
#> 4   M   placebo 11.47500 10.500000 -0.975000 

转载于:https://www.cnblogs.com/fengzzi/p/10044519.html

  • 0
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
这段代码的作用是将数据框 `query` 按 `ID` 和 `hour` 列进行分组,然后计算每个组合中 `NO` 列的观测值个数,最后生成一个新的数据框。 需要注意的是,这里使用了 R 中的 `aggregate` 函数,它的基本语法是 `aggregate(formula, data, FUN)`,其中 `formula` 表示需要聚合的列(或变量)和聚合函数的组合,`data` 表示数据框,`FUN` 表示聚合函数。在这个例子中,`formula` 是 `NO~ID+hour`,表示需要聚合的是 `NO` 列,而 `ID` 和 `hour` 列是聚合的条件;`data` 是 `query` 数据框,`FUN` 是 `length` 函数,表示计算每个组合中 `NO` 列的观测值个数。 以下是一个示例代码,用于将数据框 `df` 转换为按 `ID` 列进行分组的数据框,其中每个小时作为新数据框的列名,使用频数填充新数据框中的值: ```R # 假设数据框为df,ID列为id,小时列为hour,频数列为NO # 将数据框按ID和小时列进行分组,并计算每个组合中NO列的观测值个数 hour_freq <- aggregate(NO ~ ID + hour, df, length) # 将小时列转换为每个小时的列,并填充观测值个数 hour_freq_wide <- reshape(hour_freq, idvar = "ID", timevar = "hour", direction = "wide") # 重命名列名 colnames(hour_freq_wide)[-1] <- paste0("hour_", colnames(hour_freq_wide)[-1]) # 最终结果 hour_freq_wide ``` 需要注意的是,上述代码中使用了 `reshape` 函数将小时列转换为每个小时的列,并填充观测值个数。也可以使用其他函数(如 `dcast` 函数)来完成这个操作。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值