71_Pandas.DataFrame ranking

饺子大人

Published 2024-02-08 20:59:02

Article link: https://blog.csdn.net/qq_18351157/article/details/136082369

Use the rank() method to rank the rows or columns of a pandas.DataFrame or a pandas.Series.

sort_values() sorts the columns of a pandas.DataFrame or a pandas.Series in ascending or descending order, whereas rank() returns the rank of each element without reordering the data.
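
As a minimal sketch of that difference, the following compares the two methods on a small Series. The values and index labels are made up for illustration.

```python
import pandas as pd

# Illustrative data (not from the article's DataFrame).
s = pd.Series([50, 80, 100, 70], index=['a', 'b', 'c', 'd'])

# sort_values() returns the data reordered by value.
sorted_s = s.sort_values()

# rank() keeps the original order and returns each element's rank.
ranked_s = s.rank()

print(sorted_s.index.tolist())  # index labels in sorted order
print(ranked_s.tolist())        # ranks in the original order
```

The index of ranked_s is unchanged, so the ranks can be attached to the original data directly.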

See the following article for sort_values().

17_pandas.DataFrame, Series sorting (sort_values, sort_index)

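This article covers the following topics.

Basic usage of rank()
Specifying rows or columns: axis
Ranking numeric values only: numeric_only
Ascending or descending order: ascending
Handling equal (duplicate) values: method
Handling missing values (NaN): na_option
Getting percentile ranks: pct
pandas.Series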

The following pandas.DataFrame is used as an example.

import numpy as np
import pandas as pd

# pd.np was removed in pandas 2.0, so NaN is taken from NumPy directly.
df = pd.DataFrame({'col1': [50, 80, 100, 80],
                   'col2': [0.3, np.nan, 0.1, np.nan],
                   'col3': ['h', 'j', 'i', 'k']},
                  index=['a', 'b', 'c', 'd'])

print(df)
#    col1  col2 col3
# a    50   0.3    h
# b    80   NaN    j
# c   100   0.1    i
# d    80   NaN    k

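Basic usage of rank()

By default, rank() ranks each column in ascending order. Equal (duplicate) values receive their average rank, and strings are compared in alphabetical order.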

print(df.rank())
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   2.5   NaN   3.0
# c   4.0   1.0   2.0
# d   2.5   NaN   4.0
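
To show where the rank 2.5 in col1 comes from, here is a small sketch on the same values as a Series. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([50, 80, 100, 80], index=['a', 'b', 'c', 'd'])

# In sorted order the two 80s occupy positions 2 and 3,
# so with the default method each gets the average of those positions.
tie_rank = (2 + 3) / 2

avg_rank = s.rank()
print(avg_rank.tolist())
print(tie_rank)
```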

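Specifying rows or columns: axis

By default, ranking is done column by column. To rank within each row, set the argument axis to 1. In this example the string column is ignored (see numeric_only below).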

print(df.rank(axis=1))
#    col1  col2
# a   2.0   1.0
# b   1.0   NaN
# c   2.0   1.0
# d   1.0   NaN
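
Row-wise ranking is easier to read on an all-numeric table. The sketch below uses a hypothetical scores DataFrame (the column names and values are invented) and ranks the three subjects within each row.

```python
import pandas as pd

# Hypothetical data: one row per student, one column per subject.
scores = pd.DataFrame({'math': [90, 60, 75],
                       'physics': [70, 85, 75],
                       'art': [80, 95, 60]},
                      index=['x', 'y', 'z'])

# axis=1 ranks the values inside each row instead of each column.
row_ranks = scores.rank(axis=1)
print(row_ranks)
```

In row z the two equal scores share the average of ranks 2 and 3.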

Ranking numeric values only: numeric_only

By default, strings are ranked as well. To rank only the numeric columns, set the argument numeric_only to True.

print(df.rank(numeric_only=True))
#    col1  col2
# a   1.0   2.0
# b   2.5   NaN
# c   4.0   1.0
# d   2.5   NaN

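In the pandas version used for this article the default is numeric_only=None: rows and columns that contain only strings are ranked, but when numbers and strings are mixed, as when ranking the rows of the example pandas.DataFrame, the strings are ignored. In pandas 2.0 and later the default is numeric_only=False, so mixed rows are no longer filtered automatically; pass numeric_only=True explicitly.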

print(df.rank(axis=1))
#    col1  col2
# a   2.0   1.0
# b   1.0   NaN
# c   2.0   1.0
# d   1.0   NaN

If numeric_only=False is used when numbers and strings are mixed, a TypeError is raised.

# print(df.rank(axis=1, numeric_only=False))
# TypeError: '<' not supported between instances of 'str' and 'int'
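
One way to make the numeric filtering explicit is to select the numeric columns before ranking. The sketch below assumes that select_dtypes(include='number') picks the same columns as numeric_only=True for this DataFrame; it is an alternative spelling, not something the original article uses.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [50, 80, 100, 80],
                   'col2': [0.3, np.nan, 0.1, np.nan],
                   'col3': ['h', 'j', 'i', 'k']},
                  index=['a', 'b', 'c', 'd'])

# Let rank() drop the non-numeric column.
by_flag = df.rank(axis=1, numeric_only=True)

# Drop the non-numeric column first, then rank what is left.
by_select = df.select_dtypes(include='number').rank(axis=1)

print(by_flag.equals(by_select))
```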

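Ascending or descending order: ascending

By default, ranks are assigned in ascending order, so the smallest value gets rank 1. To rank in descending order, set the argument ascending to False.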

print(df.rank(ascending=False))
#    col1  col2  col3
# a   4.0   1.0   4.0
# b   2.5   NaN   2.0
# c   1.0   2.0   3.0
# d   2.5   NaN   1.0
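
A typical use of descending ranks is finding the largest value. This is a small sketch on the col1 values as a Series; the use of idxmin() to pick the top entry is an illustration, not part of the original article.

```python
import pandas as pd

s = pd.Series([50, 80, 100, 80], index=['a', 'b', 'c', 'd'])

asc = s.rank()                  # smallest value gets rank 1
desc = s.rank(ascending=False)  # largest value gets rank 1

print(desc.tolist())

# The label with the smallest descending rank holds the largest value.
top_label = desc.idxmin()
print(top_label)
```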

Handling equal (duplicate) values: method

The argument method specifies how equal (duplicate) values are handled. The default is method='average': tied values receive the mean of the ranks they span.

print(df.rank(method='average'))
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   2.5   NaN   3.0
# c   4.0   1.0   2.0
# d   2.5   NaN   4.0

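With method='min', tied values receive the lowest rank in the group. This is the convention commonly seen in sports: first, tied second, tied second, fourth.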

print(df.rank(method='min'))
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   2.0   NaN   3.0
# c   4.0   1.0   2.0
# d   2.0   NaN   4.0

With method='max', tied values receive the highest rank in the group.

print(df.rank(method='max'))
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   3.0   NaN   3.0
# c   4.0   1.0   2.0
# d   3.0   NaN   4.0

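With method='first', tied values are ranked in the order in which they appear. Note that, in the pandas version used for this article, method='first' works only on numeric data. If strings are included, set numeric_only=True.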

# print(df.rank(method='first'))
# ValueError: first not supported for non-numeric data

print(df.rank(method='first', numeric_only=True))
#    col1  col2
# a   1.0   2.0
# b   2.0   NaN
# c   4.0   1.0
# d   3.0   NaN

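With method='dense', tied values receive the lowest rank as with 'min', but the ranks that follow are not skipped: first, tied second, tied second, third, and so on.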

print(df.rank(method='dense'))
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   2.0   NaN   3.0
# c   3.0   1.0   2.0
# d   2.0   NaN   4.0
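
The five methods are easier to compare side by side. The sketch below ranks the col1 values as a Series once per method and collects the results in one table; the comparison DataFrame is an illustrative helper.

```python
import pandas as pd

s = pd.Series([50, 80, 100, 80], index=['a', 'b', 'c', 'd'])

methods = ['average', 'min', 'max', 'first', 'dense']

# One column per tie-breaking method, one row per original element.
comparison = pd.DataFrame({m: s.rank(method=m) for m in methods})
print(comparison)
```

Only the rows for the tied value 80 (b and d) and, with 'dense', the value after the tie (c) differ between columns.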

Handling missing values (NaN): na_option

The argument na_option specifies how missing values (NaN) are handled. The default is na_option='keep': NaN is not ranked and stays NaN.

print(df.rank(na_option='keep'))
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   2.5   NaN   3.0
# c   4.0   1.0   2.0
# d   2.5   NaN   4.0

With na_option='top', NaN is ranked first. When there are several NaN values, ties among them are handled according to the argument method.

print(df.rank(na_option='top'))
#    col1  col2  col3
# a   1.0   4.0   1.0
# b   2.5   1.5   3.0
# c   4.0   3.0   2.0
# d   2.5   1.5   4.0

print(df.rank(na_option='top', method='min'))
#    col1  col2  col3
# a   1.0   4.0   1.0
# b   2.0   1.0   3.0
# c   4.0   3.0   2.0
# d   2.0   1.0   4.0

With na_option='bottom', NaN is ranked last. When there are several NaN values, ties among them are handled according to the argument method.

print(df.rank(na_option='bottom'))
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   2.5   3.5   3.0
# c   4.0   1.0   2.0
# d   2.5   3.5   4.0

print(df.rank(na_option='bottom', method='min'))
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   2.0   3.0   3.0
# c   4.0   1.0   2.0
# d   2.0   3.0   4.0
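
The three na_option settings can be compared on a single column. The sketch below uses the col2 values as a Series with the default method='average'; the comparison DataFrame is an illustrative helper.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.3, np.nan, 0.1, np.nan], index=['a', 'b', 'c', 'd'])

options = ['keep', 'top', 'bottom']

# One column per na_option setting.
comparison = pd.DataFrame({opt: s.rank(na_option=opt) for opt in options})
print(comparison)
```

With 'top' the two NaN values share ranks 1 and 2 (average 1.5); with 'bottom' they share ranks 3 and 4 (average 3.5).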

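Getting percentile ranks: pct

With pct=True, rank() returns each element's rank as a fraction of the total, a value between 0 and 1 rather than a percentage. The other arguments can be combined with it.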

print(df.rank(pct=True))
#     col1  col2  col3
# a  0.250   1.0  0.25
# b  0.625   NaN  0.75
# c  1.000   0.5  0.50
# d  0.625   NaN  1.00

print(df.rank(pct=True, method='min', ascending=False, na_option='bottom'))
#    col1  col2  col3
# a  1.00  0.25  1.00
# b  0.50  0.75  0.50
# c  0.25  0.50  0.75
# d  0.50  0.75  0.25
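
For the default settings (method='average', na_option='keep'), the pct result matches the ordinary rank divided by the number of non-missing values. The sketch below checks this on the col1 and col2 values; it illustrates the outputs above and is not a description of how pandas computes pct internally.

```python
import numpy as np
import pandas as pd

col1 = pd.Series([50, 80, 100, 80], index=['a', 'b', 'c', 'd'])
col2 = pd.Series([0.3, np.nan, 0.1, np.nan], index=['a', 'b', 'c', 'd'])

pct1 = col1.rank(pct=True)
manual1 = col1.rank() / col1.count()  # count() ignores NaN

pct2 = col2.rank(pct=True)
manual2 = col2.rank() / col2.count()  # only 2 non-missing values here

print(pct1.tolist())
print(pct1.equals(manual1), pct2.equals(manual2))
```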

pandas.Series

The examples so far use a pandas.DataFrame, but rank() works the same way on a pandas.Series.

print(df['col1'].rank(method='min', ascending=False))
# a    4.0
# b    2.0
# c    1.0
# d    2.0
# Name: col1, dtype: float64
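
Because the ranked Series keeps the original index, it can be added back to the DataFrame as a new column. In this sketch the column name col1_rank is hypothetical, and the conversion to int assumes the column has no NaN.

```python
import pandas as pd

df = pd.DataFrame({'col1': [50, 80, 100, 80]},
                  index=['a', 'b', 'c', 'd'])

# Rank col1 in descending order, ties sharing the lowest rank.
rank_col = df['col1'].rank(method='min', ascending=False)

# assign() returns a new DataFrame; df itself is not modified.
ranked = df.assign(col1_rank=rank_col.astype(int))
print(ranked)
```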
