Julia DataFrames ---- the groupby/map/combine/aggregate functions

October-

Category: julia机器学习&科学计算 (Julia machine learning & scientific computing). Tags: Julia DataFrame groupby map combine

Article link: https://blog.csdn.net/weixin_41715077/article/details/103756514

1. Supported statistical functions

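All of the functions described below also accept statistical functions. For the specific statistical functions that are supported, see the article on `by`: https://blog.csdn.net/weixin_41715077/article/details/103747504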

2. Function reference

2.1 groupby

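# groupby(df::AbstractDataFrame, cols; sort=false, skipmissing=false)
# Return type: GroupedDataFrame
Return a `GroupedDataFrame` representing a view of an `AbstractDataFrame` split into row groups.

# Arguments
- `df` : an `AbstractDataFrame` to split
- `cols` : the column(s) to group by
- `sort` : whether to sort the groups by the values of the grouping columns
- `skipmissing` : whether to skip groups whose grouping columns contain `missing` values

Functions that can be applied to the result: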
* [`by`](@ref) : split-apply-combine using functions
* [`aggregate`](@ref) : split-apply-combine; applies functions in the form of a cross product
* [`map`](@ref) : apply a function to each group of a `GroupedDataFrame` (without combining)
* [`combine`](@ref) : combine a `GroupedDataFrame`, optionally applying a function to each group
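
To make the `sort` and `skipmissing` arguments concrete, here is a minimal sketch on a made-up table. The data frame `df` and its values are hypothetical (they are not the iris data used later), and the snippet assumes only that DataFrames.jl is installed.

```julia
using DataFrames

# Hypothetical toy table: column names mimic the iris data, values are made up.
df = DataFrame(Species = ["a", "b", "a", missing, "b"],
               PetalLength = [1.0, 4.0, 1.5, 2.0, 5.0])

# sort=true orders the groups by the grouping value;
# skipmissing=true drops the group whose Species is `missing`.
gd = groupby(df, :Species, sort = true, skipmissing = true)

println(length(gd))   # number of groups
println(keys(gd))     # one key per group

# A GroupedDataFrame is iterable; each element is a view of the rows of one group.
for g in gd
    println(first(g.Species), " => ", g.PetalLength)
end
```

Because the result is a view, `parent(gd)` still returns the original table including the row with the `missing` species.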

2.2 map

    map(cols => f, gd::GroupedDataFrame)
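    `cols` must be a column name or a column index (the examples in section 3 also pass a vector of column names)
    map(f, gd::GroupedDataFrame)
    `f` must be a callable; the supported statistical functions are (`sum`, `prod`, `minimum`, `maximum`, `mean`, `var`, `std`, `first`, `last` and `length`)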
    Apply a function to each group of rows and return a [`GroupedDataFrame`](@ref).
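
The `map` signatures above follow the DataFrames.jl release this article was written against. As far as I know, later releases (1.x) no longer allow `map` over a `GroupedDataFrame`, so the sketch below shows the same "apply a function to each group without combining" idea with a plain comprehension, which should work either way. The table `df` is hypothetical.

```julia
using DataFrames

# Hypothetical toy table.
df = DataFrame(Species = ["a", "b", "a", "b"],
               PetalLength = [1.0, 4.0, 1.5, 5.0])
gd = groupby(df, :Species, sort = true)

# One scalar per group, collected into an ordinary Vector.
sums = [sum(sdf.PetalLength) for sdf in gd]

# One NamedTuple per group, for when several statistics are needed at once.
stats = [(species = first(sdf.Species),
          n       = nrow(sdf),
          total   = sum(sdf.PetalLength)) for sdf in gd]

println(sums)
println(stats)
```

Unlike the old `map`, this returns a plain vector rather than a `GroupedDataFrame`; if a table is wanted, `combine` is the better fit.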

2.3 combine

Combine the per-group results
    combine(gd::GroupedDataFrame, cols => f...)
    combine(gd::GroupedDataFrame; (colname = cols => f)...)
    combine(gd::GroupedDataFrame, f)
    combine(f, gd::GroupedDataFrame)
    Combine the groups of a [`GroupedDataFrame`](@ref) into a single `DataFrame`, optionally applying `f` to each group first.

    See also
    - [`by(f, df, cols)`](@ref) is a shorthand for `combine(f, groupby(df, cols))`.
    - [`map`](@ref): `combine(f, groupby(df, cols))` is a more efficient equivalent to `combine(map(f, groupby(df, cols)))`.
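
As a small illustration of the `cols => f` and function forms of `combine`, the sketch below uses a made-up table. The `cols => f => :newname` renaming form is an assumption about the DataFrames.jl 1.x API rather than something taken from the signatures above, and the column names `total`, `avg` and `spread` are hypothetical.

```julia
using DataFrames, Statistics

# Hypothetical toy table.
df = DataFrame(Species = ["a", "b", "a", "b"],
               PetalLength = [1.0, 4.0, 1.5, 5.0])
gd = groupby(df, :Species, sort = true)

# cols => f: one output row per group; the result column is auto-named PetalLength_sum.
res_auto = combine(gd, :PetalLength => sum)

# cols => f => newname (assumed 1.x form): choose the output column names explicitly.
res_named = combine(gd, :PetalLength => sum => :total,
                        :PetalLength => mean => :avg)

# Function form: f receives each group and may return a NamedTuple of results.
res_fun = combine(sdf -> (spread = maximum(sdf.PetalLength) - minimum(sdf.PetalLength),), gd)

println(res_named)
```

In every form the grouping column (`Species` here) is kept as the first column of the result.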

2.4 aggregate

Aggregate computation
    aggregate(df::AbstractDataFrame, fs)
    aggregate(df::AbstractDataFrame, cols, fs; sort=false, skipmissing=false)
    aggregate(gd::GroupedDataFrame, fs; sort=false)
    Split-apply-combine that applies a set of functions over the columns of an `AbstractDataFrame` or [`GroupedDataFrame`](@ref). Return an aggregated data frame.

    # Arguments
    - `df` : an `AbstractDataFrame`
    - `gd` : a `GroupedDataFrame`
    - `cols` : a column indicator (`Symbol`, `Int`, `Vector{Symbol}`, etc.)
    - `fs` : a function or vector of functions to be applied to vectors within groups; expects each argument to be a column vector
    - `sort` : whether to sort rows according to the values of the grouping columns
    - `skipmissing` : whether to skip rows with `missing` values in one of the grouping columns `cols`
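
`aggregate` applies every function in `fs` to every non-grouping column (a cross product). To my knowledge `aggregate` and `by` were removed in DataFrames.jl 1.0, so the sketch below reproduces the same cross product with `combine` and broadcasting, as an assumption about how one might write it on a current install. The table and its column names are hypothetical.

```julia
using DataFrames, Statistics

# Hypothetical toy table: one grouping column and two value columns.
df = DataFrame(g = ["a", "b", "a", "b"],
               x = [1.0, 4.0, 1.5, 5.0],
               y = [10, 20, 30, 40])
gd = groupby(df, :g, sort = true)

# valuecols(gd) lists the non-grouping columns. Broadcasting `.=>` against a
# 1x2 row of functions builds a matrix of `column => function` pairs,
# i.e. every column paired with every function.
res = combine(gd, valuecols(gd) .=> [sum mean])

println(names(res))
println(res)
```

The result has one row per group and one column per (column, function) pair, auto-named like `x_sum` and `y_mean`.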

3. Code examples

using DataFrames, CSV, Statistics

# ---- groupby (signature and arguments: see section 2.1) ----

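# Load the iris data shipped with the DataFrames.jl sources.
# The original used an absolute local path (C:/D/Julia/DataFrames/DataFrames.jl/docs/src/assets/iris.csv);
# adjust the path if your copy of iris.csv lives elsewhere.
iris = DataFrame(CSV.File(joinpath(dirname(pathof(DataFrames)),
"../docs/src/assets/iris.csv")));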

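# Group by species
gd = groupby(iris, :Species, sort=true, skipmissing=true)

# Select a specific group
gd[1]

last(gd)

first(gd)

# Print the petal lengths greater than 2 in each group
for g in gd
    long_petals = filter(x -> x > 2, g.PetalLength)
    println(long_petals)
end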

# Groups are looked up by their key (the values of the grouping columns);
# they cannot be selected with a condition on another column such as SepalLength.
k = first(keys(gd))           # GroupKey of the first group
gd[k]                         # index with the GroupKey
gd[(Species = k.Species,)]    # index with a NamedTuple of grouping-column values

# ---- map (signatures: see section 2.2) ----
# All groups
map(sdf -> sum(sdf.PetalLength), gd)

# A range of groups
map(sdf -> sum(sdf.PetalLength), gd[1:2])

# cols => f with several columns: x holds the selected columns
map([:PetalLength, :SepalLength] =>
              x -> (a=mean(x.PetalLength)/mean(x.SepalLength), b=std(x.PetalLength)), gd)

map([:PetalLength, :SepalLength] =>
              x -> (a=mean(x.PetalLength)/mean(x.SepalLength), b=var(x.PetalLength)), gd)

map(:PetalLength => sum, gd)

# ---- combine (signatures: see section 2.3) ----
combine(gd, :PetalLength => sum)

combine(:PetalLength => sum, gd)

combine(sdf -> sum(sdf.PetalLength), gd)
iris.PetalLength   # the ungrouped column, for comparison

# Keep only petal lengths greater than 1 before taking the mean
combine([:PetalLength, :SepalLength] =>
              x -> (a=mean(filter(v -> v > 1, x.PetalLength))/mean(x.SepalLength), b=var(x.PetalLength)), gd)

# Combine only a range of groups
combine(:PetalLength => sum, gd[1:2])


# ---- aggregate (signatures and arguments: see section 2.4) ----
aggregate(iris, :Species , maximum)

aggregate(iris, :Species, [sum, x->mean(skipmissing(x))])

aggregate(groupby(iris, :Species)[1:2], [sum, x->mean(skipmissing(x))])


# Other useful functions
parent(gd)    # get the parent DataFrame
vcat(gd...)   # rebuilds the original DataFrame, but the row order may change
DataFrame(gd) # returns a new DataFrame
groupvars(gd) # the column name(s) used for grouping, as a vector


eachcol(iris[!,1:end-1], true) # iterate over the columns as name => vector pairs (DataFrameColumns)
# :SepalLength => [4.6, 5.3, 7.0, 6.9, 6.3, 7.1]
# :SepalWidth => [3.2, 3.7, 3.2, 3.1, 3.3, 3.0]
# :PetalLength => [1.4, 1.5, 4.7, 4.9, 6.0, 5.9]
# :PetalWidth => [0.2, 0.2, 1.4, 1.5, 2.5, 2.1]
foreach(c -> println(c[1], ": ", mean(c[2])), eachcol(iris[!,1:end-1], true)) # each iteration returns a Pair with the column name and its values
foreach(c -> println(mean(c[2])), eachcol(iris[!,1:end-1], true)) # same, printing only the mean
map(mean, eachcol(iris[!,1:end-1], false)) # map on groups applies several statistical functions to the same column; with eachcol it applies one function to many columns
mapcols(mean, iris[!,1:end-1]) # equivalent to map combined with eachcol: applies one statistical function to every column



eachrow(iris[!,1:end-1]) # iterate over the rows (DataFrameRows)
# Each iteration yields a DataFrameRow, which works similarly to a one-row DataFrame
foreach(r -> println(r), eachrow(iris[!,1:end-1]))
map(r -> r.SepalLength/r.SepalWidth, eachrow(iris[!,1:end-1]))
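
The two-argument `eachcol(df, true)` form used above belongs to the DataFrames.jl release the article was written against. As a hedged sketch for a more recent install, I am assuming that `pairs(eachcol(df))` plays the same role in 1.x; the table `df` below is hypothetical.

```julia
using DataFrames, Statistics

# Hypothetical toy table with numeric columns only.
df = DataFrame(x = [1.0, 2.0, 3.0], y = [4.0, 5.0, 6.0])

# Column-wise: pairs(eachcol(df)) yields column name => column vector.
col_means = Dict(name => mean(col) for (name, col) in pairs(eachcol(df)))

# mapcols applies one function to every column and returns a DataFrame.
means_df = mapcols(mean, df)

# Row-wise: each element of eachrow(df) is a DataFrameRow with field access.
ratios = [r.x / r.y for r in eachrow(df)]

println(col_means)
println(means_df)
println(ratios)
```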
