A Tour of Cool Python Libraries: The Third-Party Library Pandas (115)

神奇夜光杯

Published 2024-09-02 07:45:00

Column: Myelsa的Python酷库之旅
Tags: python, pandas, programming languages, artificial intelligence, standard and third-party libraries, excel, learning and growth

Article link: https://blog.csdn.net/ygb_1024/article/details/141686107

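I. Usage in Detail

506. The pandas.DataFrame.rank method

506-1. Syntax

506-2. Parameters

506-3. Purpose

506-4. Return value

506-5. Notes

506-6. Usage

506-6-1. Data preparation

506-6-2. Code example

506-6-3. Output

507. The pandas.DataFrame.round method

507-1. Syntax

507-2. Parameters

507-3. Purpose

507-4. Return value

507-5. Notes

507-6. Usage

507-6-1. Data preparation

507-6-2. Code example

507-6-3. Output

508. The pandas.DataFrame.sem method

508-1. Syntax

508-2. Parameters

508-3. Purpose

508-4. Return value

508-5. Notes

508-6. Usage

508-6-1. Data preparation

508-6-2. Code example

508-6-3. Output

509. The pandas.DataFrame.skew method

509-1. Syntax

509-2. Parameters

509-3. Purpose

509-4. Return value

509-5. Notes

509-6. Usage

509-6-1. Data preparation

509-6-2. Code example

509-6-3. Output

510. The pandas.DataFrame.sum method

510-1. Syntax

510-2. Parameters

510-3. Purpose

510-4. Return value

510-5. Notes

510-6. Usage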

I. Usage in Detail

506. The pandas.DataFrame.rank method

506-1. Syntax

# 506. The pandas.DataFrame.rank method
pandas.DataFrame.rank(axis=0, method='average', numeric_only=False, na_option='keep', ascending=True, pct=False)
Compute numerical data ranks (1 through n) along axis.

By default, equal values are assigned a rank that is the average of the ranks of those values.

Parameters:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Index to direct ranking. For Series this parameter is unused and defaults to 0.

method : {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}, default ‘average’
How to rank the group of records that have the same value (i.e. ties):

average: average rank of the group

min: lowest rank in the group

max: highest rank in the group

first: ranks assigned in order they appear in the array

dense: like ‘min’, but rank always increases by 1 between groups.

numeric_only : bool, default False
For DataFrame objects, rank only numeric columns if set to True.

Changed in version 2.0.0: The default value of numeric_only is now False.

na_option : {‘keep’, ‘top’, ‘bottom’}, default ‘keep’
How to rank NaN values:

keep: assign NaN rank to NaN values

top: assign lowest rank to NaN values

bottom: assign highest rank to NaN values

ascending : bool, default True
Whether or not the elements should be ranked in ascending order.

pct : bool, default False
Whether or not to display the returned rankings in percentile form.

Returns:
same type as caller
Return a Series or DataFrame with data ranks as values.

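506-2. Parameters

506-2-1. axis (optional, default 0): {0 or 'index', 1 or 'columns'}. The axis along which ranks are computed. 0 ranks the values within each column independently; 1 ranks the values within each row independently.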
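
To make the two axis settings concrete, the sketch below ranks a small made-up frame both ways. The column names and values are assumptions chosen only for this illustration.

```python
import pandas as pd

# Hypothetical two-row frame; the values are chosen only for illustration.
df = pd.DataFrame({"A": [1, 3], "B": [5, 3], "C": [10, 10]})

# axis=0 (the default): each column is ranked on its own.
by_column = df.rank(axis=0)

# axis=1: each row is ranked on its own, comparing values across columns.
by_row = df.rank(axis=1)

print(by_column)
print(by_row)
```

In the second row, columns A and B hold the same value, so with the default method they share the average rank 1.5.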

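506-2-2. method (optional, default 'average'): {'average', 'min', 'max', 'first', 'dense'}. How tied (equal) values are ranked:

'average': each tied value receives the average of the ranks the group spans.
'min': each tied value receives the lowest rank in the group.
'max': each tied value receives the highest rank in the group.
'first': ties are ranked in the order they appear in the data.
'dense': like 'min', but the rank increases by exactly 1 from one group to the next, so no ranks are skipped.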
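
The five tie-breaking methods are easiest to compare side by side. The following is a minimal sketch on an assumed single column named score (a hypothetical name) that contains two pairs of ties.

```python
import pandas as pd

# Hypothetical column with two ties: 10 and 20 each appear twice.
df = pd.DataFrame({"score": [10, 20, 10, 30, 20]})

# Rank the same column with every tie-breaking method and collect the results,
# one result column per method.
methods = ["average", "min", "max", "first", "dense"]
ranks = pd.DataFrame({m: df["score"].rank(method=m) for m in methods})

print(ranks)
```

Reading across a row shows how one value fares under each method; only 'first' gives the two 10s different ranks, and only 'dense' ends at 3 rather than 5.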

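506-2-3. numeric_only (optional, default False): bool. If True, only numeric columns are ranked and non-numeric columns are left out of the result. If False, every column is ranked, including non-numeric ones (strings, for example, are ranked in lexicographic order).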

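506-2-4. na_option (optional, default 'keep'): {'keep', 'top', 'bottom'}. How NaN values are ranked:

'keep': NaN values receive a rank of NaN.
'top': NaN values receive the lowest ranks, so they come first.
'bottom': NaN values receive the highest ranks, so they come last.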
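
The names 'top' and 'bottom' refer to position in the ranking, not to the size of the value. A small sketch, using an assumed one-column frame, shows where the NaN lands under each option.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a single missing value at position 2.
df = pd.DataFrame({"A": [1.0, 2.0, np.nan, 4.0]})

keep = df.rank(na_option="keep")      # the NaN keeps a NaN rank
top = df.rank(na_option="top")        # the NaN receives the lowest rank
bottom = df.rank(na_option="bottom")  # the NaN receives the highest rank

print(pd.concat({"keep": keep["A"], "top": top["A"], "bottom": bottom["A"]}, axis=1))
```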

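506-2-5. ascending (optional, default True): bool. True ranks in ascending order (the smallest value gets rank 1); False ranks in descending order (the largest value gets rank 1).

506-2-6. pct (optional, default False): bool. If True, each rank is returned as a fraction of the number of ranked values, so results fall in the interval (0, 1]. If False, the ranks themselves are returned; they are floats, because averaged ties can produce values such as 1.5.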
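
The relationship between plain ranks and pct=True can be checked directly. This sketch assumes a column without missing values, in which case the percentile rank should simply be the rank divided by the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical column without missing values.
df = pd.DataFrame({"C": [10, 20, 10, 30, 20]})

ranks = df.rank()              # plain (float) ranks
pct_ranks = df.rank(pct=True)  # the same ranks expressed as fractions

# With no missing values, the fraction is rank / number of rows.
expected = ranks / len(df)

print(pct_ranks)
print(np.allclose(pct_ranks, expected))
```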

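506-3. Purpose

Ranks the values in a DataFrame along the chosen axis. Each value is replaced by its rank (1 through n); the data itself is not reordered.

506-4. Return value

A DataFrame (the same type as the caller) holding the rank of each value.

506-5. Notes

None.

506-6. Usage

506-6-1. Data preparation

None.

506-6-2. Code example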

# 506. The pandas.DataFrame.rank method
import pandas as pd
# Build a sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [10, 20, 10, 30, 20]
}
df = pd.DataFrame(data)
# Rank the values within each column (default settings)
rank_by_column = df.rank()
print("Default column-wise ranks:\n", rank_by_column)
# Use different tie-breaking methods
rank_min = df.rank(method='min')
print("Ranks with method='min':\n", rank_min)
rank_average = df.rank(method='average')
print("Ranks with method='average':\n", rank_average)
# Rank in descending order
rank_descending = df.rank(ascending=False)
print("Descending ranks:\n", rank_descending)
# Handling missing values
data_with_nan = {
    'A': [1, 2, None, 4, 5],
    'B': [5, None, 3, 2, 1]
}
df_nan = pd.DataFrame(data_with_nan)
# na_option='bottom' gives NaN the highest rank in each column
rank_na_option = df_nan.rank(na_option='bottom')
print("Ranks with NaN placed last (na_option='bottom'):\n", rank_na_option)
# Ranks expressed as fractions of the number of values
rank_pct = df.rank(pct=True)
print("Percentile ranks:\n", rank_pct)

506-6-3. Output

# 506. The pandas.DataFrame.rank method
# Default column-wise ranks:
#       A    B    C
# 0  1.0  5.0  1.5
# 1  2.0  4.0  3.5
# 2  3.0  3.0  1.5
# 3  4.0  2.0  5.0
# 4  5.0  1.0  3.5
# Ranks with method='min':
#       A    B    C
# 0  1.0  5.0  1.0
# 1  2.0  4.0  3.0
# 2  3.0  3.0  1.0
# 3  4.0  2.0  5.0
# 4  5.0  1.0  3.0
# Ranks with method='average':
#       A    B    C
# 0  1.0  5.0  1.5
# 1  2.0  4.0  3.5
# 2  3.0  3.0  1.5
# 3  4.0  2.0  5.0
# 4  5.0  1.0  3.5
# Descending ranks:
#       A    B    C
# 0  5.0  1.0  4.5
# 1  4.0  2.0  2.5
# 2  3.0  3.0  4.5
# 3  2.0  4.0  1.0
# 4  1.0  5.0  2.5
# Ranks with NaN placed last (na_option='bottom'):
#       A    B
# 0  1.0  4.0
# 1  2.0  5.0
# 2  5.0  3.0
# 3  3.0  2.0
# 4  4.0  1.0
# Percentile ranks:
#       A    B    C
# 0  0.2  1.0  0.3
# 1  0.4  0.8  0.7
# 2  0.6  0.6  0.3
# 3  0.8  0.4  1.0
# 4  1.0  0.2  0.7

507. The pandas.DataFrame.round method

507-1. Syntax

# 507. The pandas.DataFrame.round method
pandas.DataFrame.round(decimals=0, *args, **kwargs)
Round a DataFrame to a variable number of decimal places.

Parameters:
decimals : int, dict, Series
Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.

*args
Additional keywords have no effect but might be accepted for compatibility with numpy.

**kwargs
Additional keywords have no effect but might be accepted for compatibility with numpy.

Returns:
DataFrame
A DataFrame with the affected columns rounded to the specified number of decimal places.

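507-2. Parameters

507-2-1. decimals (optional, default 0): int, dict, or Series. An int rounds every column to that many decimal places; for example, decimals=2 rounds to two decimal places. A dict or Series sets the number of places per column: decimals={'A': 1, 'B': 2} rounds column A to one decimal place and column B to two. Columns not listed are left unchanged, and keys that are not columns of the DataFrame are ignored.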
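
The dict form appears in the code example of this section; the Series form works the same way. The sketch below uses an assumed frame and an assumed Series named places (both hypothetical). It also rounds a column of exact halves, which NumPy-style rounding sends to the nearest even value rather than always upward.

```python
import pandas as pd

# Hypothetical frame; column C holds exact halves.
df = pd.DataFrame({
    "A": [1.123, 2.456, 3.789],
    "B": [4.321, 5.654, 6.987],
    "C": [0.5, 1.5, 2.5],
})

# The Series index names the columns. 'C' is not listed, so it is left as is;
# 'Z' is not a column of df, so it is ignored.
places = pd.Series({"A": 1, "B": 2, "Z": 3})
by_series = df.round(places)

# Exact halves go to the nearest even integer: 0.5 -> 0.0, 1.5 -> 2.0, 2.5 -> 2.0.
halves = df[["C"]].round()

print(by_series)
print(halves)
```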

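507-2-2. *args (optional): extra positional arguments. They have no effect and are accepted only for compatibility with numpy.

507-2-3. **kwargs (optional): extra keyword arguments. They have no effect and are accepted only for compatibility with numpy.

507-3. Purpose

Rounds the numeric values in a DataFrame to the given number of decimal places and returns the result as a new DataFrame; non-numeric columns are left unchanged. Values exactly halfway between two candidates are rounded to the nearest even value, as in numpy, so 0.5 rounds to 0.0 and 2.5 rounds to 2.0.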

507-4. Return value

A new DataFrame containing the rounded values.

507-5. Notes

None.

507-6. Usage

507-6-1. Data preparation

None.

507-6-2. Code example

# 507. The pandas.DataFrame.round method
import pandas as pd
# Build a sample DataFrame
data = {
    'A': [1.123, 2.456, 3.789],
    'B': [4.321, 5.654, 6.987],
    'C': [7.001, 8.002, 9.003]
}
df = pd.DataFrame(data)
# Round to the nearest integer (decimals defaults to 0)
rounded_int = df.round()
print("Rounded to integers:\n", rounded_int)
# Round to two decimal places
rounded_two_decimals = df.round(decimals=2)
print("Rounded to two decimal places:\n", rounded_two_decimals)
# Use a different number of decimal places per column
rounded_custom_decimals = df.round(decimals={'A': 1, 'B': 1, 'C': 0})
print("Rounded with per-column decimal places:\n", rounded_custom_decimals)

507-6-3. Output

# 507. The pandas.DataFrame.round method
# Rounded to integers:
#       A    B    C
# 0  1.0  4.0  7.0
# 1  2.0  6.0  8.0
# 2  4.0  7.0  9.0
# Rounded to two decimal places:
#        A     B    C
# 0  1.12  4.32  7.0
# 1  2.46  5.65  8.0
# 2  3.79  6.99  9.0
# Rounded with per-column decimal places:
#       A    B    C
# 0  1.1  4.3  7.0
# 1  2.5  5.7  8.0
# 2  3.8  7.0  9.0

508. The pandas.DataFrame.sem method

508-1. Syntax

# 508. The pandas.DataFrame.sem method
pandas.DataFrame.sem(axis=0, skipna=True, ddof=1, numeric_only=False, **kwargs)
Return unbiased standard error of the mean over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:
axis : {index (0), columns (1)}
For Series this parameter is unused and defaults to 0.

Warning

The behavior of DataFrame.sem with axis=None is deprecated; in a future version this will reduce over both axes and return a scalar. To retain the old behavior, pass axis=0 (or do not pass axis).

skipna : bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.

ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

numeric_only : bool, default False
Include only float, int, boolean columns. Not implemented for Series.

Returns:
Series or DataFrame (if level specified).

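508-2. Parameters

508-2-1. axis (optional, default 0): {0 or 'index', 1 or 'columns'}. The axis to reduce over. 0 reduces along the index and yields one value per column; 1 reduces along the columns and yields one value per row.

508-2-2. skipna (optional, default True): bool. Whether to exclude NaN values. True skips them; with False, any column or row that contains NaN yields NaN.

508-2-3. ddof (optional, default 1): int. Delta degrees of freedom. The standard deviation underlying the SEM uses n - ddof as its divisor, where n is the number of non-missing values. The default of 1 gives the sample standard deviation; 0 gives the population standard deviation.

508-2-4. numeric_only (optional, default False): bool. If True, only float, int, and boolean columns are included; non-numeric columns are ignored.

508-2-5. **kwargs (optional): additional keyword arguments passed to the method; usually not needed.

508-3. Purpose

Computes the standard error of the mean (SEM) along the chosen axis. The SEM measures how precisely the sample mean estimates the population mean; it equals the sample standard deviation divided by the square root of the number of observations.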
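
The definition above can be checked by hand. The sketch below, on an assumed two-column frame, recomputes the SEM as std / sqrt(n), with n counting only the non-missing values in each column, and compares it with the result of sem for both ddof=1 and ddof=0.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; column A has one missing value.
df = pd.DataFrame({"A": [10, 20, 30, np.nan], "C": [1, 2, 3, 4]})

# Number of non-missing values per column (3 for A, 4 for C).
n = df.count()

# Manual SEM: standard deviation divided by the square root of n.
manual_sample = df.std(ddof=1) / np.sqrt(n)
manual_population = df.std(ddof=0) / np.sqrt(n)

builtin_sample = df.sem(ddof=1)
builtin_population = df.sem(ddof=0)

print(builtin_sample)
print(np.allclose(manual_sample, builtin_sample))
print(np.allclose(manual_population, builtin_population))
```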

508-4. Return value

A Series holding the standard error of each column (axis=0) or each row (axis=1).

508-5. Notes

None.

508-6. Usage

508-6-1. Data preparation

None.

508-6-2. Code example

# 508. The pandas.DataFrame.sem method
import pandas as pd
import numpy as np
# Build a sample DataFrame
data = {
    'A': [10, 20, 30, np.nan],
    'B': [5, 15, np.nan, np.nan],
    'C': [1, 2, 3, 4]
}
df = pd.DataFrame(data)
# Standard error of each column
sem_columns = df.sem()
print("SEM of each column:\n", sem_columns)
# Standard error of each row
sem_rows = df.sem(axis=1)
print("SEM of each row:\n", sem_rows)
# skipna=True is the default, so this matches df.sem()
sem_skipna = df.sem(skipna=True)
print("SEM with NaN skipped:\n", sem_skipna)

508-6-3. Output

# 508. The pandas.DataFrame.sem method
# SEM of each column:
#  A    5.773503
# B    5.000000
# C    0.645497
# dtype: float64
# SEM of each row:
#  0     2.603417
# 1     5.364492
# 2    13.500000
# 3          NaN
# dtype: float64
# SEM with NaN skipped:
#  A    5.773503
# B    5.000000
# C    0.645497
# dtype: float64

509. The pandas.DataFrame.skew method

509-1. Syntax

# 509. The pandas.DataFrame.skew method
pandas.DataFrame.skew(axis=0, skipna=True, numeric_only=False, **kwargs)
Return unbiased skew over requested axis.

Normalized by N-1.

Parameters:
axis : {index (0), columns (1)}
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

For DataFrames, specifying axis=None will apply the aggregation across both axes.

New in version 2.0.0.

skipna : bool, default True
Exclude NA/null values when computing the result.

numeric_only : bool, default False
Include only float, int, boolean columns. Not implemented for Series.

**kwargs
Additional keyword arguments to be passed to the function.

Returns:
Series or scalar.

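509-2. Parameters

509-2-1. axis (optional, default 0): {0 or 'index', 1 or 'columns'}. The axis to reduce over. 0 computes the skewness of each column; 1 computes the skewness of each row.

509-2-2. skipna (optional, default True): bool. Whether to exclude NaN values. True skips them when computing the skewness; False does not.

509-2-3. numeric_only (optional, default False): bool. If True, only float, int, and boolean columns are included; non-numeric columns are ignored.

509-2-4. **kwargs (optional): additional keyword arguments passed to the method; usually not needed.

509-3. Purpose

Computes the skewness of each column or each row. Skewness measures how asymmetric a distribution is. Positive skew means the right tail is longer: most values sit toward the low end and a few large values stretch the distribution to the right. Negative skew means the left tail is longer, stretched by a few small values.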
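
As a worked check of the sign convention, the sketch below computes the skewness of a small right-skewed sample by hand. It assumes the bias-adjusted Fisher-Pearson formula, G1 = sqrt(n(n-1)) / (n-2) * m3 / m2^1.5, where m2 and m3 are the second and third central moments; that this matches the pandas result is something the code compares rather than takes for granted. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: one large value stretches the right tail.
s = pd.Series([5.0, 15.0, 35.0])

n = s.count()
dev = s - s.mean()
m2 = (dev ** 2).mean()  # second central moment (divides by n)
m3 = (dev ** 3).mean()  # third central moment (divides by n)

# Bias-adjusted Fisher-Pearson coefficient (assumed formula, see lead-in).
g1 = m3 / m2 ** 1.5
G1 = np.sqrt(n * (n - 1)) / (n - 2) * g1

print(G1, s.skew())      # the two values should agree
print((-s).skew())       # mirroring the data flips the sign
```

The formula needs at least three non-missing values, since n - 2 appears in the denominator.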

509-4. Return value

A Series holding the skewness of each column (axis=0) or each row (axis=1).

509-5. Notes

None.

509-6. Usage

509-6-1. Data preparation

None.

509-6-2. Code example

# 509. The pandas.DataFrame.skew method
import pandas as pd
import numpy as np
# Build a sample DataFrame
data = {
    'A': [10, 20, 30, np.nan],
    'B': [5, 15, np.nan, 35],
    'C': [1, 2, 3, 4]
}
df = pd.DataFrame(data)
# Skewness of each column
skew_columns = df.skew()
print("Skewness of each column:\n", skew_columns)
# Skewness of each row (rows with fewer than three valid values give NaN)
skew_rows = df.skew(axis=1)
print("Skewness of each row:\n", skew_rows)
# skipna=True is the default, so this matches df.skew()
skew_skipna = df.skew(skipna=True)
print("Skewness with NaN skipped:\n", skew_skipna)

509-6-3. Output

# 509. The pandas.DataFrame.skew method
# Skewness of each column:
#  A    0.00000
# B    0.93522
# C    0.00000
# dtype: float64
# Skewness of each row:
#  0    0.330832
# 1   -1.185115
# 2         NaN
# 3         NaN
# dtype: float64
# Skewness with NaN skipped:
#  A    0.00000
# B    0.93522
# C    0.00000
# dtype: float64

510. The pandas.DataFrame.sum method

510-1. Syntax

# 510. The pandas.DataFrame.sum method
pandas.DataFrame.sum(axis=0, skipna=True, numeric_only=False, min_count=0, **kwargs)
Return the sum of the values over the requested axis.

This is equivalent to the method numpy.sum.

Parameters:
axis : {index (0), columns (1)}
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

Warning

The behavior of DataFrame.sum with axis=None is deprecated; in a future version this will reduce over both axes and return a scalar. To retain the old behavior, pass axis=0 (or do not pass axis).

New in version 2.0.0.

skipna : bool, default True
Exclude NA/null values when computing the result.

numeric_only : bool, default False
Include only float, int, boolean columns. Not implemented for Series.

min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

**kwargs
Additional keyword arguments to be passed to the function.

Returns:
Series or scalar.

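510-2. Parameters

510-2-1. axis (optional, default 0): {0 or 'index', 1 or 'columns'}. The axis to reduce over. 0 sums each column; 1 sums each row.

510-2-2. skipna (optional, default True): bool. Whether to exclude NaN values. True skips them when summing; with False, any column or row that contains NaN sums to NaN.

510-2-3. numeric_only (optional, default False): bool. If True, only float, int, and boolean columns are summed; non-numeric columns are ignored.

510-2-4. min_count (optional, default 0): int. The minimum number of non-NaN values required to produce a sum. If fewer are present, the result is NaN.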
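
The effect of min_count only shows up when a column or row has few valid values. The sketch below uses an assumed frame with three, one, and zero non-NaN values per column to make the threshold visible.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: A has two valid values, B has one, C has none.
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan],
    "B": [np.nan, np.nan, 5.0],
    "C": [np.nan, np.nan, np.nan],
})

default_sum = df.sum()              # min_count=0: an all-NaN column sums to 0.0
at_least_one = df.sum(min_count=1)  # C has no valid values, so it becomes NaN
at_least_two = df.sum(min_count=2)  # B has only one valid value, so it becomes NaN too

print(pd.concat(
    {"default": default_sum, "min_count=1": at_least_one, "min_count=2": at_least_two},
    axis=1,
))
```

Passing min_count=1 is a common way to tell "everything was missing" apart from a genuine total of zero.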

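510-2-5. **kwargs (optional): additional keyword arguments passed to the method; usually not needed.

510-3. Purpose

Computes the sum of each column or each row of a DataFrame. The parameters control which axis is reduced and how missing and non-numeric values are handled.

510-4. Return value

A Series holding the sum of each column (axis=0) or each row (axis=1).

510-5. Notes

None.

510-6. Usage

510-6-1. Data preparation

None.

510-6-2. Code example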

# 510. The pandas.DataFrame.sum method
import pandas as pd
import numpy as np
# Build a sample DataFrame
data = {
    'A': [1, 2, 3, np.nan],
    'B': [4, 5, np.nan, 7],
    'C': [np.nan, 8, 9, 10]
}
df = pd.DataFrame(data)
# Sum of each column
sum_columns = df.sum()
print("Sum of each column:\n", sum_columns)
# Sum of each row
sum_rows = df.sum(axis=1)
print("Sum of each row:\n", sum_rows)
# skipna=True is the default, so this matches df.sum()
sum_skipna = df.sum(skipna=True)
print("Sum with NaN skipped:\n", sum_skipna)
# Require at least two non-NaN values per column (every column here has three)
sum_min_count = df.sum(min_count=2)
print("Sum requiring at least two non-NaN values:\n", sum_min_count)

510-6-3. Output

# 510. The pandas.DataFrame.sum method
# Sum of each column:
#  A     6.0
# B    16.0
# C    27.0
# dtype: float64
# Sum of each row:
#  0     5.0
# 1    15.0
# 2    12.0
# 3    17.0
# dtype: float64
# Sum with NaN skipped:
#  A     6.0
# B    16.0
# C    27.0
# dtype: float64
# Sum requiring at least two non-NaN values:
#  A     6.0
# B    16.0
# C    27.0
# dtype: float64