Python | Pandas | An Incomplete Summary


This article is an incomplete summary of how to use Pandas 1

Updated: 2023 / 08 / 05



Data Types

See here 2


Data-type problems usually only surface after something has gone wrong. With some experience, the first thing to do after receiving data is to check its types and confirm they match the format you intend to work with; this avoids awkward surprises from the start. If preprocessing is needed, it is generally done through data-type conversion.

This section introduces the data types in pandas (the dtype values you will commonly see), as well as the correspondence between pandas and numpy types:

(Figure: correspondence table between pandas dtypes and numpy types)
It is worth stressing that the pandas object type can actually cover several different underlying types. For example, a column that mixes integers, floats, and strings will all be labeled object, so when processing the data you may need to clean such fields first with methods such as str.replace(), float(), int(), astype(), or apply() 3
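As a quick sketch of this behavior (the values below are made up for illustration), a Series mixing integers, floats, and strings is typed object, and each element keeps its own Python type inside it:

```python
import pandas as pd

# A column mixing integers, floats, and strings is labeled object
s = pd.Series([1, 2.5, "three"])
print(s.dtype)    # object

# Each element keeps its own Python type inside the object column
print([type(v).__name__ for v in s])   # ['int', 'float', 'str']
```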

Next, we mainly cover the object, int64, float64, datetime64, and bool types; the category and timedelta types will be introduced separately.


The raw data file can be found here 4; its contents are shown below:

import pandas as pd

data = pd.read_csv('dtype_RawData.csv')

# The raw data reads as follows:
#
#    Customer Number     Customer Name         2016  ... Month   Day Active
# 0             1002  Quest industries  ¥125,000.00  ...     1  2015      Y
# 1           552278    Smith Plumbing  ¥920,000.00  ...     6  2014      Y
# 2            23477   ACME Industrial   ¥50,000.00  ...     3  2016      Y
# 3            24900        Brekke LTD  ¥350,000.00  ...    10  2015      Y
# 4           651029         Harbor Co   ¥15,000.00  ...     2  2014      N
#
# [5 rows x 9 columns]

Inspecting the Data

  1. Which columns does the data contain, and what is each column's data type?
print(data.columns)
# Index(['Customer Number', 'Customer Name', '2016', '2017', 'Percent Growth',
#        'Jan Units', 'Month', 'Day', 'Active'],
#       dtype='object')

print(type(data.dtypes))
# <class 'pandas.core.series.Series'>

print(data.dtypes)
# Customer Number     int64
# Customer Name      object
# 2016               object
# 2017               object
# Percent Growth     object
# Jan Units          object
# Month               int64
# Day                 int64
# Active             object
# dtype: object

print(data.dtypes.index)
# Index(['Customer Number', 'Customer Name', '2016', '2017', 'Percent Growth',
#        'Jan Units', 'Month', 'Day', 'Active'],
#       dtype='object')

print(data.dtypes.values)
# [dtype('int64') dtype('O') dtype('O') dtype('O') dtype('O') dtype('O')
#  dtype('int64') dtype('int64') dtype('O')]

print(data.dtypes.index[4])
print(data.dtypes[data.dtypes.index[4]])
# Percent Growth
# object

The raw data has 9 columns, named Customer Number, Customer Name, 2016, 2017, Percent Growth, Jan Units, Month, Day, and Active.
Each column's data type is as shown by data.dtypes.


  2. Get basic information about the DataFrame (dimensions, column names, data types, memory usage, etc.)
data.info()

'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Customer Number  5 non-null      int64 
 1   Customer Name    5 non-null      object
 2   2016             5 non-null      object
 3   2017             5 non-null      object
 4   Percent Growth   5 non-null      object
 5   Jan Units        5 non-null      object
 6   Month            5 non-null      int64 
 7   Day              5 non-null      int64 
 8   Active           5 non-null      object
dtypes: int64(3), object(6)
memory usage: 488.0+ bytes
'''

Combining the information above,

  • The Customer Number column is int64, not float64
  • The 2016 and 2017 columns are object, not float64 or int64
  • Percent Growth and Jan Units are also object rather than numeric
  • Month and Day should be converted to datetime64[ns]
  • The Active column should be boolean

Suppose we want the sum of the 2016 and 2017 columns. We can try, but we will not get the answer we need: because both columns have dtype object, the + operation produces an even longer string, as shown below:

print(data['2016'] + data['2017'])

'''
0      ¥125,000.00¥162,500.00
1    ¥920,000.00¥1,012,000.00
2        ¥50,000.00¥62,500.00
3      ¥350,000.00¥490,000.00
4        ¥15,000.00¥12,750.00
dtype: object
'''

  3. Get the dimensions of the DataFrame
print(data.shape)
# (5, 9)

This tells us the data has 5 rows and 9 columns.


Type Conversion

See here 5 6 7

Without data cleaning it is hard to move on to the next stage of analysis. To convert data formats, pandas offers three commonly used approaches:

  • astype(), which force-casts the data type
  • a user-defined conversion function
  • existing functions provided by pandas or numpy, such as to_numeric() and to_datetime()

astype

We can use astype() to convert the Day column to float,

data['Day'].astype("float")
print(data['Day'].dtypes)
# int64

However, this operation does not change the original DataFrame; it only returns a copy.
To actually change the DataFrame, you normally assign the result back, as follows,

data['Day'] = data['Day'].astype("float")
print(data['Day'].dtypes)
# float64

astype basically has two uses: converting numbers into plain strings, and converting purely numeric strings into numbers.
Strings that contain other, non-numeric characters cannot be converted with astype, for example the 2016, 2017, Percent Growth, and Jan Units columns, whose special symbols leave them typed as object.
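A quick sketch of both directions, and of the failure case (the series values are made up for illustration):

```python
import pandas as pd

s = pd.Series([1, 2, 3])
s_str = s.astype(str)                               # numbers -> plain strings
print(s_str.dtype)                                  # object

s_num = pd.Series(["1", "2", "3"]).astype("int64")  # numeric strings -> numbers
print(s_num.dtype)                                  # int64

# A string carrying extra symbols cannot be cast directly:
try:
    pd.Series(["¥125,000.00"]).astype("float64")
except ValueError:
    print("astype cannot parse strings containing extra symbols")
```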

Other conversion methods are therefore needed, which brings us to the custom-function approach below.


Custom Functions

The custom function below converts a column that contains special characters; combined with pandas' apply, it turns all the data in the 2016 column into float64.

def converter(val):
    '''
    Use `replace` to strip the `¥` sign and the commas, then convert the string to a float.
    '''
    new = val.replace(",", "").replace("¥", "")
    return float(new)

data['2016'] = data['2016'].apply(converter)

print(data['2016'])
print(data['2016'].dtype)
'''
0    125000.0
1    920000.0
2     50000.0
3    350000.0
4     15000.0
Name: 2016, dtype: float64

float64
'''

Of course, this simple function can also be written in a single line with a lambda, as shown below:

data['2016'] = data['2016'].apply(lambda x: x.replace(",", "").replace("¥", "")).astype('float64')
print(data['2016'])
print(data['2016'].dtype)
'''
0    125000.0
1    920000.0
2     50000.0
3    350000.0
4     15000.0
Name: 2016, dtype: float64

float64
'''

Built-in Functions

numpy.where

Use numpy.where(condition, x, y) to convert the Active column to boolean values, as follows:

import numpy as np

print(data['Active'])
print(data['Active'].dtype)

'''
0    Y
1    Y
2    Y
3    Y
4    N
Name: Active, dtype: object

object
'''


data["Active"] = np.where(data["Active"] == "Y", True, False)


print(data['Active'])
print(data['Active'].dtype)

'''
0     True
1     True
2     True
3     True
4    False
Name: Active, dtype: bool

bool
'''

pandas.to_numeric
print(data['Jan Units'])
print(data['Jan Units'].dtype)

'''
0       500
1       700
2       125
3        75
4    Closed
Name: Jan Units, dtype: object

object
'''


data['Jan Units'] = pd.to_numeric(data['Jan Units'], errors='coerce')


print(data['Jan Units'])
print(data['Jan Units'].dtype)

'''
0    500.0
1    700.0
2    125.0
3     75.0
4      NaN
Name: Jan Units, dtype: float64

float64
'''

pandas.to_datetime
data['Year'] = 2022
data['Start Date'] = pd.to_datetime(data[['Month', 'Day', 'Year']])

print(data['Start Date'])
print(data['Start Date'].dtype)

'''
0   2022-01-20
1   2022-06-14
2   2022-03-16
3   2022-10-15
4   2022-02-14
Name: Start Date, dtype: datetime64[ns]

datetime64[ns]
'''

Formatting

Using the df below as an example, here is how to modify the format of the values inside a df:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 1, 0.05).reshape(4, 5), columns=['Asia', 'Africa', 'Australia', 'Europe', 'Russia'])
#    Asia  Africa  Australia  Europe  Russia
# 0  0.00    0.05       0.10    0.15    0.20
# 1  0.25    0.30       0.35    0.40    0.45
# 2  0.50    0.55       0.60    0.65    0.70
# 3  0.75    0.80       0.85    0.90    0.95
#
# Index(['Asia', 'Africa', 'Australia', 'Europe', 'Russia'], dtype='object')
#
# RangeIndex(start=0, stop=4, step=1)
#
# Asia         float64
# Africa       float64
# Australia    float64
# Europe       float64
# Russia       float64
# dtype: object

Special Characters

See here 8


Percentages

Methods

See here 9 10 11 12


  • Number ➡️ percentage
    • df.applymap(lambda x: '%.2f%%' % (x*100)): works on a dataframe, but not on a Series
    • df[col].apply(lambda x: '{:.2%}'.format(x)) and df[col].map(lambda n: '{:.2%}'.format(n)): Series only
  • Percentage ➡️ number
    • df[col].str.strip('%').astype(float)/100 11: Series only
    • df.replace('%$', '', regex=True).astype('float')/100.0 and df[['col']].replace('%$', '', regex=True).astype('float')/100.0: work on a dataframe, but not on a Series

Missing Values

Pandas treats the following as missing values:
NaN, #N/A, nan, NULL, None

Missing values can be handled with df.fillna() 13


Syntax

pandas.DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

  • value: the value used to fill the holes.
  • method: {'backfill', 'bfill', 'pad', 'ffill', None}, default None. The fill method:
    pad / ffill fills a hole with the value from the preceding row/column;
    backfill / bfill fills it with the value from the following row/column.
  • axis: the axis along which to fill. 0 or index fills down the rows;
    1 or columns fills across the columns.
  • inplace: whether to fill in place. Boolean, default False. If True, the operation modifies the original DataFrame and returns None.
  • limit: int, default None. If method is specified, at most the first limit missing values of each consecutive run of NaNs are filled (each run is capped independently). If method is not specified, at most limit missing values are filled along the given axis, whether or not they are consecutive.
  • downcast: dict, default None. A mapping of item -> downcast rule, or the string 'infer', which downcasts to an equivalent type where possible (e.g. float64 to int64 if possible).
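To make the interaction of the fill method with limit concrete, here is a small sketch on a made-up Series. It uses Series.ffill(), which behaves like fillna with method='ffill' (and is the form newer pandas releases prefer):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill (fillna with method='ffill') carries the last valid
# value forward through the whole run of NaNs
print(s.ffill().tolist())          # [1.0, 1.0, 1.0, 1.0, 5.0]

# With limit=1, only the first NaN of the consecutive run is filled
print(s.ffill(limit=1).tolist())   # [1.0, 1.0, nan, nan, 5.0]
```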

Examples

Construct a dataframe that contains nan,

dic = {"a": [8, np.nan, np.nan, 6, 9], "b": pd.Series([np.nan, 6, np.nan, np.nan, 10]),
       "c": ["qq", np.nan, np.nan, np.nan, 88]} 
df = pd.DataFrame(dic)
#      a     b    c
# 0  8.0   NaN   qq
# 1  NaN   6.0  NaN
# 2  NaN   NaN  NaN
# 3  6.0   NaN  NaN
# 4  9.0  10.0   88

Query which entries in df are missing, and which are not,

print(pd.isna(df))
#        a      b      c
# 0  False   True  False
# 1   True  False   True
# 2   True   True   True
# 3  False   True   True
# 4  False  False  False

print(pd.notna(df))
#        a      b      c
# 0   True  False   True
# 1  False   True  False
# 2  False  False  False
# 3   True  False  False
# 4   True   True   True

Next, fill the missing values in the specified ways.

  1. df.fillna(value)

inplace defaults to False, meaning df.fillna() returns a new DataFrame and leaves the original df unchanged; if set to True, the original df is filled in place and None is returned.
With df.fillna(value), every nan in df is replaced by the single value and the result is returned.

dffill10 = df.fillna(
    10,
    method=None,
    axis=1,
    limit=2,
    inplace=False
)
#       a     b     c
# 0   8.0    10    qq
# 1  10.0   6.0  10.0
# 2  10.0  10.0   NaN
# 3   6.0  10.0  10.0
# 4   9.0  10.0  88.0
  2. df.fillna(dict)

Fill nan from a dictionary whose keys are column names and whose values are the fill values for the corresponding columns.

dic = {"a": 0, "b": 1, "c": 2}
dffilldict = df.fillna(
    dic,
    axis=0,
    limit=2,
    inplace=False
)
#      a     b    c
# 0  8.0   1.0   qq
# 1  0.0   6.0    2
# 2  0.0   1.0    2
# 3  6.0   NaN  NaN
# 4  9.0  10.0   88
  3. df.fillna(df)

df 填充 nan

filldf = pd.DataFrame(np.arange(10).reshape((5, 2)), columns=["a", "b"])
#    a  b
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9

df = df.fillna(filldf, inplace=False)
#      a     b    c
# 0  8.0   1.0   qq
# 1  2.0   6.0  NaN
# 2  4.0   5.0  NaN
# 3  6.0   7.0  NaN
# 4  9.0  10.0   88

For columns whose names appear in both filldf and df, the values in filldf fill the positions where df holds nan.


Examples

Number -> percentage
Series
df['Russia'] = df['Russia'].map(lambda n: '{:.2%}'.format(n))
#    Asia  Africa  Australia  Europe  Russia
# 0  0.00    0.05       0.10    0.15  20.00%
# 1  0.25    0.30       0.35    0.40  45.00%
# 2  0.50    0.55       0.60    0.65  70.00%
# 3  0.75    0.80       0.85    0.90  95.00%
#
# Asia         float64
# Africa       float64
# Australia    float64
# Europe       float64
# Russia        object
# dtype: object

Or,

df['Russia'] = df['Russia'].apply(lambda n: '{:.2%}'.format(n))

Or alternatively,

df[['Russia']] = df[['Russia']].applymap(lambda x: '%.2f%%' % (x*100))

df
df_percent = df.applymap(lambda x: '%.2f%%' % (x*100))
#      Asia  Africa Australia  Europe  Russia
# 0   0.00%   5.00%    10.00%  15.00%  20.00%
# 1  25.00%  30.00%    35.00%  40.00%  45.00%
# 2  50.00%  55.00%    60.00%  65.00%  70.00%
# 3  75.00%  80.00%    85.00%  90.00%  95.00%
#
# Asia         object
# Africa       object
# Australia    object
# Europe       object
# Russia       object
# dtype: object

Percentage -> number
Series
df['Russia'] = df['Russia'].str.strip('%').astype(float)/100
#    Asia  Africa  Australia  Europe  Russia
# 0  0.00    0.05       0.10    0.15    0.20
# 1  0.25    0.30       0.35    0.40    0.45
# 2  0.50    0.55       0.60    0.65    0.70
# 3  0.75    0.80       0.85    0.90    0.95
#
# Asia         float64
# Africa       float64
# Australia    float64
# Europe       float64
# Russia       float64
# dtype: object

df
df[['Russia', 'Europe']] = df[['Russia', 'Europe']].applymap(lambda x: '%.2f%%' % (x*100))
#    Asia  Africa  Australia  Europe  Russia
# 0  0.00    0.05       0.10  15.00%  20.00%
# 1  0.25    0.30       0.35  40.00%  45.00%
# 2  0.50    0.55       0.60  65.00%  70.00%
# 3  0.75    0.80       0.85  90.00%  95.00%

df[['Russia', 'Europe']] = df[['Russia', 'Europe']].replace('%$', '', regex=True).astype('float')/100.0
#    Asia  Africa  Australia  Europe  Russia
# 0  0.00    0.05       0.10    0.15    0.20
# 1  0.25    0.30       0.35    0.40    0.45
# 2  0.50    0.55       0.60    0.65    0.70
# 3  0.75    0.80       0.85    0.90    0.95

Or alternatively,

df_percent = df_percent.replace('%$', '', regex=True).astype('float') / 100.0
#    Asia  Africa  Australia  Europe  Russia
# 0  0.00    0.05       0.10    0.15    0.20
# 1  0.25    0.30       0.35    0.40    0.45
# 2  0.50    0.55       0.60    0.65    0.70
# 3  0.75    0.80       0.85    0.90    0.95
#
# Asia         float64
# Africa       float64
# Australia    float64
# Europe       float64
# Russia       float64
# dtype: object

Data Structures

Dictionary <-> df

See here 14


Dictionary -> df

The two dictionaries below are used to illustrate how 15 to convert a dictionary into a dataframe structure,

dict1 = {'i': 1, 'love': 2, 'you': 3}
dict2 = {('country2', '2017'): {('prov_a', 'ind_a', 'EUR'): 5.0, ('prov_b', 'ind_b', 'EUR'): 7.0, ('prov_c', 'ind_c', 'EUR'): np.nan},
        ('country2', '2018'): {('prov_a', 'ind_a', 'EUR'): 8.0, ('prov_b', 'ind_b', 'EUR'): np.nan, ('prov_c', 'ind_c', 'EUR'): 15.0},
        ('country2', '2019'): {('prov_a', 'ind_a', 'EUR'): np.nan, ('prov_b', 'ind_b', 'EUR'): 5.0, ('prov_c', 'ind_c', 'EUR'): 60.0}}

pd.DataFrame
Examples
  • For dict1
df1 = pd.DataFrame(dict1)
#   File "/Applications/miniconda3/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 664, in _extract_index
#     raise ValueError("If using all scalar values, you must pass an index")
# ValueError: If using all scalar values, you must pass an index

df1 = pd.DataFrame(dict1, index=['idx'])
#      i  love  you
# idx  1     2    3
#
# Index(['i', 'love', 'you'], dtype='object')
#
# Index(['idx'], dtype='object')

df1_to_dict = df1.to_dict()
# {'i': {'idx': 1}, 'love': {'idx': 2}, 'you': {'idx': 3}}


df1_T = pd.DataFrame(dict1, index=['idx']).T
#       idx
# i       1
# love    2
# you     3
#
# Index(['idx'], dtype='object')
#
# Index(['i', 'love', 'you'], dtype='object')

df1_T_to_dict = df1_T.to_dict()
# {'idx': {'i': 1, 'love': 2, 'you': 3}}

  • For dict2
df2 = pd.DataFrame(dict2)
print(df2)
#                 country2
#                      2017  2018  2019
# prov_a ind_a EUR      5.0   8.0   NaN
# prov_b ind_b EUR      7.0   NaN   5.0
# prov_c ind_c EUR      NaN  15.0  60.0
#
# MultiIndex([('country2', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019')],
#            )
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_b', 'ind_b', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            )

df2_T = pd.DataFrame(dict2).T
#               prov_a prov_b prov_c
#                ind_a  ind_b  ind_c
#                  EUR    EUR    EUR
# country2 2017    5.0    7.0    NaN
#          2018    8.0    NaN   15.0
#          2019    NaN    5.0   60.0
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_b', 'ind_b', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            )
#
# MultiIndex([('country2', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019')],
#            )

pd.DataFrame.from_dict()
Syntax

See here 16

classmethod DataFrame.from_dict(data, orient='columns', dtype=None, columns=None)

Examples
  • For dict1
df1 = pd.DataFrame.from_dict(dict1, orient='index')
#       0
# i     1
# love  2
# you   3
#
# RangeIndex(start=0, stop=1, step=1)
#
# Index(['i', 'love', 'you'], dtype='object')

  • For dict2
df2 = pd.DataFrame.from_dict(dict2, orient='index')
#               prov_a prov_b prov_c
#                ind_a  ind_b  ind_c
#                  EUR    EUR    EUR
# country2 2017    5.0    7.0    NaN
#          2018    8.0    NaN   15.0
#          2019    NaN    5.0   60.0
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_b', 'ind_b', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            )
#
# MultiIndex([('country2', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019')],
#            )

df2 = pd.DataFrame.from_dict(dict2, orient='columns')
#                  country2
#                      2017  2018  2019
# prov_a ind_a EUR      5.0   8.0   NaN
# prov_b ind_b EUR      7.0   NaN   5.0
# prov_c ind_c EUR      NaN  15.0  60.0
#
# MultiIndex([('country2', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019')],
#            )
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_b', 'ind_b', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            )

Indexing

Setting

See here 17

  • df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
df.set_index(["Column"], inplace=True)
# The line above makes column "Column" the index of df
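A minimal sketch of the round trip (the frame and its column names below are made up for illustration): set_index moves a column into the index, and reset_index moves it back.

```python
import pandas as pd

df = pd.DataFrame({"Column": ["x", "y"], "val": [1, 2]})

df.set_index(["Column"], inplace=True)  # "Column" moves from the columns into the index
print(df.index.tolist())                # ['x', 'y']
print(df.columns.tolist())              # ['val']

df = df.reset_index()                   # reset_index moves it back into a regular column
print(df.columns.tolist())              # ['Column', 'val']
```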

Selecting

See here 18


df[col] is generally used to select columns; inside the [] you write column names (which is why columns are usually given explicit names rather than the default numeric labels, to avoid clashing with the index).

  • df['a']
    a single column, returned as a pd.Series;
  • df[['a', 'b']]
    multiple columns, returned as a pd.DataFrame;

When df[] is given numbers it selects rows by default, and only as slices; a single row cannot be selected this way (df[0] fails).
df[] cannot select rows by index label either (df['one'] fails).
df.loc[label] selects rows by index, supporting both explicit index labels and the default numeric index.

  • df.loc['one'], df.loc[1]
    a single row, returned as a pd.Series;
  • df.loc[['one', 'two']], df.loc[1:3]
    multiple rows, returned as a pd.DataFrame;

The difference between iloc and loc: their usage is largely similar, but unlike loc, iloc indexes by integer position and cannot index positions beyond the number of rows.

Boolean indexing works essentially the same as it does for a pd.Series.
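The selection rules above can be sketched on a small frame with labeled rows (the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["one", "two", "three"])

print(type(df["a"]).__name__)         # Series: a single column
print(type(df[["a", "b"]]).__name__)  # DataFrame: a list of columns

print(df[0:2])                        # bare numbers select rows, and only as slices
print(df.loc["one"])                  # loc selects rows by index label
print(df.iloc[0])                     # iloc selects rows by integer position

print(df[df["a"] > 1])                # boolean indexing filters rows
```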


Sorting

See here 19 20


The following example is used to illustrate how to reshape and sort indexes,

df1 = pd.DataFrame(
    data={"data_provider": ["prov_a", "prov_a", "prov_b", "prov_b", "prov_c", "prov_c"],
          "indicator": ["ind_a", "ind_a", "ind_b", "ind_b", "ind_c", "ind_c"],
          "unit": ["EUR", "EUR", "EUR", "EUR", "EUR", "EUR"],
          "year": ["2017", "2018","2019", "2017","2018","2019"],
          "country1": [6, 2, 10, 1, 3, 6],
          "country2": [5, 8, 5, 7, 15, 60]}
)
#   data_provider indicator unit  year  country1  country2
# 0        prov_a     ind_a  EUR  2017         6         5
# 1        prov_a     ind_a  EUR  2018         2         8
# 2        prov_b     ind_b  EUR  2019        10         5
# 3        prov_b     ind_b  EUR  2017         1         7
# 4        prov_c     ind_c  EUR  2018         3        15
# 5        prov_c     ind_c  EUR  2019         6        60
#
# Index(['data_provider', 'indicator', 'unit', 'year', 'country1', 'country2'], dtype='object')
#
# RangeIndex(start=0, stop=6, step=1)


df1MultiIndex = df1.pivot_table(
    index=['data_provider', 'indicator', 'unit'],
    columns='year'
)
#                              country1            country2            
# year                             2017 2018  2019     2017  2018  2019
# data_provider indicator unit                                         
# prov_a        ind_a     EUR       6.0  2.0   NaN      5.0   8.0   NaN
# prov_b        ind_b     EUR       1.0  NaN  10.0      7.0   NaN   5.0
# prov_c        ind_c     EUR       NaN  3.0   6.0      NaN  15.0  60.0
#
# MultiIndex([('country1', '2017'),
#             ('country1', '2018'),
#             ('country1', '2019'),
#             ('country2', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019')],
#            names=[None, 'year'])
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_b', 'ind_b', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])

sort_index

df.sort_index() sorts by index, ascending by default; to sort in descending order, pass ascending=False.

Its syntax is,

DataFrame.sort_index(*, axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)

reindex

DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=None, level=None, fill_value=nan, limit=None, tolerance=None)

  • labels: new labels/index to conform to the axis specified by 'axis'.
  • index, columns: optional; the new labels/index. An Index object is generally preferred, to avoid duplicating data.
  • axis: optional; the axis to target, either by name ('index', 'columns') or by number (0, 1).
  • method: {None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}, optional, default None. Used for filling holes in the reindexed DataFrame; applicable to monotonically increasing/decreasing indexes only.
  • copy: bool, default True. Whether to return a new object (a copy), even if the passed indexes are the same.
  • level: int or name. Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value: default np.NaN. Fills existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation.
  • limit: the maximum number of consecutive elements to forward- or backward-fill.
  • tolerance: optional; the maximum distance between original and new labels for inexact matches. Index values at the matching locations must satisfy abs(index[indexer] - target) <= tolerance.

df
  • index in ascending order (default ordering)
df1MultiIndex = df1MultiIndex.sort_index(ascending=True)
#                              country1            country2
# year                             2017 2018  2019     2017  2018  2019
# data_provider indicator unit
# prov_a        ind_a     EUR       6.0  2.0   NaN      5.0   8.0   NaN
# prov_b        ind_b     EUR       1.0  NaN  10.0      7.0   NaN   5.0
# prov_c        ind_c     EUR       NaN  3.0   6.0      NaN  15.0  60.0
#
# MultiIndex([('country1', '2017'),
#             ('country1', '2018'),
#             ('country1', '2019'),
#             ('country2', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019')],
#            names=[None, 'year'])
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_b', 'ind_b', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])

  • index in descending order (default ordering)
df1MultiIndex = df1MultiIndex.sort_index(ascending=False)
#                              country1            country2
# year                             2017 2018  2019     2017  2018  2019
# data_provider indicator unit
# prov_c        ind_c     EUR       NaN  3.0   6.0      NaN  15.0  60.0
# prov_b        ind_b     EUR       1.0  NaN  10.0      7.0   NaN   5.0
# prov_a        ind_a     EUR       6.0  2.0   NaN      5.0   8.0   NaN
#
# MultiIndex([('country1', '2017'),
#             ('country1', '2018'),
#             ('country1', '2019'),
#             ('country2', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019')],
#            names=[None, 'year'])
#
# MultiIndex([('prov_c', 'ind_c', 'EUR'),
#             ('prov_b', 'ind_b', 'EUR'),
#             ('prov_a', 'ind_a', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])

df
  • column level 1 in ascending order (default ordering)
df1MultiIndex_sort_level1 = df1MultiIndex.sort_index(level=1, axis=1, ascending=True)

#                             country1 country2  ... country1 country2
# year                             2017     2017  ...     2019     2019
# data_provider indicator unit                    ...
# prov_a        ind_a     EUR       6.0      5.0  ...      NaN      NaN
# prov_b        ind_b     EUR       1.0      7.0  ...     10.0      5.0
# prov_c        ind_c     EUR       NaN      NaN  ...      6.0     60.0
#
# [3 rows x 6 columns]
# MultiIndex([('country1', '2017'),
#             ('country2', '2017'),
#             ('country1', '2018'),
#             ('country2', '2018'),
#             ('country1', '2019'),
#             ('country2', '2019')],
#            names=[None, 'year'])
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_b', 'ind_b', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])

With level=1, axis=1, and ascending=True, the column index is sorted ascending on level 1, the level named year.


  • column level 0 in descending order (default ordering)
df1MultiIndex_sort_level0 = df1MultiIndex.sort_index(level=0, axis=1, ascending=False)

#                              country2            country1
# year                             2019  2018 2017     2019 2018 2017
# data_provider indicator unit
# prov_a        ind_a     EUR       NaN   8.0  5.0      NaN  2.0  6.0
# prov_b        ind_b     EUR       5.0   NaN  7.0     10.0  NaN  1.0
# prov_c        ind_c     EUR      60.0  15.0  NaN      6.0  3.0  NaN
#
# MultiIndex([('country2', '2019'),
#             ('country2', '2018'),
#             ('country2', '2017'),
#             ('country1', '2019'),
#             ('country1', '2018'),
#             ('country1', '2017')],
#            names=[None, 'year'])
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_b', 'ind_b', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])

With level=0, axis=1, and ascending=False, the column index is sorted descending on level 0, the level named None.


  • the year level of the columns sorted in the given order ['2018', '2019', '2017']
df1MultiIndex_sort_year = df1MultiIndex.reindex(columns=['2018', '2019', '2017'], level='year')
#                              country1            country2           
# year                             2018  2019 2017     2018  2019 2017
# data_provider indicator unit                                        
# prov_a        ind_a     EUR       2.0   NaN  6.0      8.0   NaN  5.0
# prov_b        ind_b     EUR       NaN  10.0  1.0      NaN   5.0  7.0
# prov_c        ind_c     EUR       3.0   6.0  NaN     15.0  60.0  NaN
#
# MultiIndex([('country1', '2018'),
#             ('country1', '2019'),
#             ('country1', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019'),
#             ('country2', '2017')],
#            names=[None, 'year'])
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_b', 'ind_b', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])

  • each level of the columns sorted in a given order: ['country1', 'country2'] for level=0, ['2019', '2017', '2018'] for level=1 21
Reorder = pd.MultiIndex.from_product([['country1', 'country2'], ['2019', '2017', '2018']])
df1MultiIndex_reindex = df1MultiIndex.reindex(columns=Reorder)
#                              country1           country2
#                                  2019 2017 2018     2019 2017  2018
# data_provider indicator unit
# prov_a        ind_a     EUR       NaN  6.0  2.0      NaN  5.0   8.0
# prov_b        ind_b     EUR      10.0  1.0  NaN      5.0  7.0   NaN
# prov_c        ind_c     EUR       6.0  NaN  3.0     60.0  NaN  15.0
#
# MultiIndex([('country1', '2019'),
#             ('country1', '2017'),
#             ('country1', '2018'),
#             ('country2', '2019'),
#             ('country2', '2017'),
#             ('country2', '2018')],
#            )
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_b', 'ind_b', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])

Series
  • for a single index entry, column level 1 in descending order (default ordering)
df1MultiIndex_prov_b = df1MultiIndex.loc[('prov_b', )]
#                country1           country2
# year               2017 2018 2019     2017 2018 2019
# indicator unit
# ind_b     EUR       1.0  NaN  10.0      7.0  NaN   5.0
#
# MultiIndex([('country1', '2017'),
#             ('country1', '2018'),
#             ('country1', '2019'),
#             ('country2', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019')],
#            names=[None, 'year'])
#
# MultiIndex([('ind_b', 'EUR')],
#            names=['indicator', 'unit'])

df1MultiIndex_sort_prov_b = df1MultiIndex.loc[('prov_b', )].sort_index(level=1, axis=1, ascending=False)
#                country2 country1 country2 country1 country2 country1
# year               2019     2019     2018     2018     2017     2017
# indicator unit
# ind_b     EUR       5.0     10.0      NaN      NaN      7.0      1.0
#
# MultiIndex([('country2', '2019'),
#             ('country1', '2019'),
#             ('country2', '2018'),
#             ('country1', '2018'),
#             ('country2', '2017'),
#             ('country1', '2017')],
#            names=[None, 'year'])
#
# MultiIndex([('ind_b', 'EUR')],
#            names=['indicator', 'unit'])

Deleting

See here 22 23

The df below is used to illustrate how to delete row / column index entries,

import pandas as pd

df = pd.DataFrame(
    data={"data_provider": ["prov_a", "prov_a", "prov_b", "prov_b", "prov_c", "prov_c"],
          "indicator": ["ind_a", "ind_a", "ind_b", "ind_b", "ind_c", "ind_c"],
          "unit": ["EUR", "EUR", "EUR", "EUR", "EUR", "EUR"],
          "year": ["2017", "2018","2019", "2017","2018","2019"],
          "country1": [6, 2, 10, 1, 3, 6],
          "country2": [5, 8, 5, 7, 15, 60]}
)
df = df.pivot_table(
    index=['data_provider', 'indicator', 'unit'],
    columns='year'
)
df.columns.names = ['country', 'year']

# country                      country1            country2            
# year                             2017 2018  2019     2017  2018  2019
# data_provider indicator unit                                         
# prov_a        ind_a     EUR       6.0  2.0   NaN      5.0   8.0   NaN
# prov_b        ind_b     EUR       1.0  NaN  10.0      7.0   NaN   5.0
# prov_c        ind_c     EUR       NaN  3.0   6.0      NaN  15.0  60.0
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_b', 'ind_b', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])
#
# MultiIndex([('country1', '2017'),
#             ('country1', '2018'),
#             ('country1', '2019'),
#             ('country2', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019')],
#            names=['country', 'year'])

df.drop
  • Drop columns by their full multi-level label
df_dropcol = df.drop(('country1', '2017'), axis=1)
# MultiIndex([('country1', '2018'),
#             ('country1', '2019'),
#             ('country2', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019')],
#            names=['country', 'year'])

df_dropcol = df.drop(columns=('country1', '2019'), axis=1)
# MultiIndex([('country1', '2017'),
#             ('country1', '2018'),
#             ('country2', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019')],
#            names=['country', 'year'])
  • Drop columns by a label value at a single level of the column index
df_dropcol = df.drop([('country1', )], axis=1)
# MultiIndex([('country2', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019')],
#            names=['country', 'year'])

df_dropcol = df.drop(columns='country2', axis=1)
# MultiIndex([('country1', '2017'),
#             ('country1', '2018'),
#             ('country1', '2019')],
#            names=['country', 'year'])

df_dropcol = df.drop(columns='2019', level='year')
# MultiIndex([('country1', '2017'),
#             ('country1', '2018'),
#             ('country2', '2017'),
#             ('country2', '2018')],
#            names=['country', 'year'])

df.droplevel
  • Drop an entire level of the column index by its level name
df_dropcol = df.droplevel(level='year', axis=1)
# Index(['country1', 'country1', 'country1', 'country2', 'country2', 'country2'], dtype='object', name='country')

df_dropcol = df.droplevel(level='country', axis=1)
# Index(['2017', '2018', '2019', '2017', '2018', '2019'], dtype='object', name='year')

df.drop
  • Drop rows by their full multi-level label
df_droprow = df.drop([('prov_c', 'ind_c', 'EUR')])
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_b', 'ind_b', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])

df_droprow = df.drop(index=('prov_b', 'ind_b', 'EUR'))
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])
  • Drop rows by a label value at a single level of the row index
df_droprow = df.drop([('prov_b', )])
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])

df_droprow = df.drop(index='prov_a')
# MultiIndex([('prov_b', 'ind_b', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])

df_droprow = df.drop(index='ind_b', level='indicator')
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_c', 'ind_c', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])

df.droplevel
  • Drop an entire level of the row index by its level name
df_droprow = df.droplevel(level='unit')
# MultiIndex([('prov_a', 'ind_a'),
#             ('prov_b', 'ind_b'),
#             ('prov_c', 'ind_c')],
#            names=['data_provider', 'indicator'])

df_droprow = df.droplevel(level='data_provider')
# MultiIndex([('ind_a', 'EUR'),
#             ('ind_b', 'EUR'),
#             ('ind_c', 'EUR')],
#            names=['indicator', 'unit'])

Copying

The raw data file can be found here 4; its contents are shown below:

df = pd.read_csv('dtype_RawData.csv')

#    Customer Number     Customer Name         2016  ... Month   Day Active
# 0             1002  Quest industries  ¥125,000.00  ...     1  2015      Y
# 1           552278    Smith Plumbing  ¥920,000.00  ...     6  2014      Y
# 2            23477   ACME Industrial   ¥50,000.00  ...     3  2016      Y
# 3            24900        Brekke LTD  ¥350,000.00  ...    10  2015      Y
# 4           651029         Harbor Co   ¥15,000.00  ...     2  2014      N
#
# [5 rows x 9 columns]
#
# Customer Number     int64
# Customer Name      object
# 2016               object
# 2017               object
# Percent Growth     object
# Jan Units          object
# Month               int64
# Day                 int64
# Active             object
# dtype: object

Along rows

See here 24


Methods

Method
pd.DataFrame(np.repeat(df.values, reps, axis))
pd.concat([df]*reps)
df.loc[df.index.repeat(reps)]

Examples

newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0))
newdf.columns = df.columns
#    Customer Number     Customer Name         2016  ... Month   Day Active
# 0             1002  Quest industries  ¥125,000.00  ...     1  2015      Y
# 1             1002  Quest industries  ¥125,000.00  ...     1  2015      Y
# 2             1002  Quest industries  ¥125,000.00  ...     1  2015      Y
# 3           552278    Smith Plumbing  ¥920,000.00  ...     6  2014      Y
# 4           552278    Smith Plumbing  ¥920,000.00  ...     6  2014      Y
# 5           552278    Smith Plumbing  ¥920,000.00  ...     6  2014      Y
# 6            23477   ACME Industrial   ¥50,000.00  ...     3  2016      Y
# 7            23477   ACME Industrial   ¥50,000.00  ...     3  2016      Y
# 8            23477   ACME Industrial   ¥50,000.00  ...     3  2016      Y
# 9            24900        Brekke LTD  ¥350,000.00  ...    10  2015      Y
# 10           24900        Brekke LTD  ¥350,000.00  ...    10  2015      Y
# 11           24900        Brekke LTD  ¥350,000.00  ...    10  2015      Y
# 12          651029         Harbor Co   ¥15,000.00  ...     2  2014      N
# 13          651029         Harbor Co   ¥15,000.00  ...     2  2014      N
# 14          651029         Harbor Co   ¥15,000.00  ...     2  2014      N
#
# [15 rows x 9 columns]
#
# Index(['Customer Number', 'Customer Name', '2016', '2017', 'Percent Growth',
#        'Jan Units', 'Month', 'Day', 'Active'],
#       dtype='object')
#
# RangeIndex(start=0, stop=15, step=1)
newdf = pd.concat([df]*3)
#    Customer Number     Customer Name         2016  ... Month   Day Active
# 0             1002  Quest industries  ¥125,000.00  ...     1  2015      Y
# 1           552278    Smith Plumbing  ¥920,000.00  ...     6  2014      Y
# 2            23477   ACME Industrial   ¥50,000.00  ...     3  2016      Y
# 3            24900        Brekke LTD  ¥350,000.00  ...    10  2015      Y
# 4           651029         Harbor Co   ¥15,000.00  ...     2  2014      N
# 0             1002  Quest industries  ¥125,000.00  ...     1  2015      Y
# 1           552278    Smith Plumbing  ¥920,000.00  ...     6  2014      Y
# 2            23477   ACME Industrial   ¥50,000.00  ...     3  2016      Y
# 3            24900        Brekke LTD  ¥350,000.00  ...    10  2015      Y
# 4           651029         Harbor Co   ¥15,000.00  ...     2  2014      N
# 0             1002  Quest industries  ¥125,000.00  ...     1  2015      Y
# 1           552278    Smith Plumbing  ¥920,000.00  ...     6  2014      Y
# 2            23477   ACME Industrial   ¥50,000.00  ...     3  2016      Y
# 3            24900        Brekke LTD  ¥350,000.00  ...    10  2015      Y
# 4           651029         Harbor Co   ¥15,000.00  ...     2  2014      N
# 
# [15 rows x 9 columns]
#
# Index(['Customer Number', 'Customer Name', '2016', '2017', 'Percent Growth',
#        'Jan Units', 'Month', 'Day', 'Active'],
#       dtype='object')
#
# Int64Index([0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4], dtype='int64')
newdf = df.loc[df.index.repeat(3)]
#    Customer Number     Customer Name         2016  ... Month   Day Active
# 0             1002  Quest industries  ¥125,000.00  ...     1  2015      Y
# 0             1002  Quest industries  ¥125,000.00  ...     1  2015      Y
# 0             1002  Quest industries  ¥125,000.00  ...     1  2015      Y
# 1           552278    Smith Plumbing  ¥920,000.00  ...     6  2014      Y
# 1           552278    Smith Plumbing  ¥920,000.00  ...     6  2014      Y
# 1           552278    Smith Plumbing  ¥920,000.00  ...     6  2014      Y
# 2            23477   ACME Industrial   ¥50,000.00  ...     3  2016      Y
# 2            23477   ACME Industrial   ¥50,000.00  ...     3  2016      Y
# 2            23477   ACME Industrial   ¥50,000.00  ...     3  2016      Y
# 3            24900        Brekke LTD  ¥350,000.00  ...    10  2015      Y
# 3            24900        Brekke LTD  ¥350,000.00  ...    10  2015      Y
# 3            24900        Brekke LTD  ¥350,000.00  ...    10  2015      Y
# 4           651029         Harbor Co   ¥15,000.00  ...     2  2014      N
# 4           651029         Harbor Co   ¥15,000.00  ...     2  2014      N
# 4           651029         Harbor Co   ¥15,000.00  ...     2  2014      N
# 
# [15 rows x 9 columns]
#
# Index(['Customer Number', 'Customer Name', '2016', '2017', 'Percent Growth',
#        'Jan Units', 'Month', 'Day', 'Active'],
#       dtype='object')
#
# Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], dtype='int64')
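Note that the `pd.concat` and `df.index.repeat` approaches keep the duplicated row labels, as the Int64Index outputs above show; `reset_index(drop=True)` restores a clean RangeIndex. A minimal sketch with assumed toy data:

```python
import pandas as pd

# Toy frame (assumed data); each row is repeated 3 times
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
newdf = df.loc[df.index.repeat(3)]

# The repeated rows keep duplicated labels 0, 0, 0, 1, 1, 1;
# reset_index(drop=True) replaces them with a fresh RangeIndex
newdf = newdf.reset_index(drop=True)
print(newdf.index)
# RangeIndex(start=0, stop=6, step=1)
```

Alternatively, `pd.concat([df]*reps, ignore_index=True)` produces the clean index directly.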

Along columns

See here


Methods

Method
df[df.columns.repeat(reps)]

Examples

newdf = df[df.columns.repeat(3)]
#   Customer Number  Customer Number  Customer Number  ... Active Active Active
# 0             1002             1002             1002  ...      Y      Y      Y
# 1           552278           552278           552278  ...      Y      Y      Y
# 2            23477            23477            23477  ...      Y      Y      Y
# 3            24900            24900            24900  ...      Y      Y      Y
# 4           651029           651029           651029  ...      N      N      N
# 
# [5 rows x 27 columns]
#
# Index(['Customer Number', 'Customer Number', 'Customer Number',
#        'Customer Name', 'Customer Name', 'Customer Name', '2016', '2016',
#        '2016', '2017', '2017', '2017', 'Percent Growth', 'Percent Growth',
#        'Percent Growth', 'Jan Units', 'Jan Units', 'Jan Units', 'Month',
#        'Month', 'Month', 'Day', 'Day', 'Day', 'Active', 'Active', 'Active'],
#       dtype='object')
#
# RangeIndex(start=0, stop=5, step=1)

Value counts

A specific value

See here 25

  • df[df.columnname == value].shape[0]
df[df.education == '9th'].shape[0]
# number of times the value `9th` appears in the `education` column

All values

See here 26

  • df[col_name].value_counts()
  • df.loc[:,col_name].value_counts()
df['label'].value_counts()
# count the occurrences of each value in the 'label' column
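A minimal runnable sketch of both counting styles, using an assumed toy column:

```python
import pandas as pd

# Toy column standing in for a real data column (assumed values)
s = pd.Series(['9th', '10th', '9th', '11th', '9th'])

# value_counts: occurrences of every distinct value, sorted by count
counts = s.value_counts()
print(counts['9th'])
# 3

# Boolean-mask form: count one specific value
print((s == '9th').sum())
# 3
```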

Adding

See here 27 28


At the end

  • df.loc[]
    Use df.loc[] to add a new row to the end of an existing DataFrame
df.loc[len(df.index)] = [value1, value2, value3, ...]

For example,

df = pd.DataFrame({'points': [10, 12, 12, 14, 13, 18],
                   'rebounds': [7, 7, 8, 13, 7, 4],
                   'assists': [11, 8, 10, 6, 6, 5]})
print(df)
'''
   points  rebounds  assists
0      10         7       11
1      12         7        8
2      12         8       10
3      14        13        6
4      13         7        6
5      18         4        5
'''

# add new row to end of DataFrame
print(len(df.index))
'''
6
'''
df.loc[len(df.index)] = [20, 7, 5]
print(df)
'''
   points  rebounds  assists
0      10         7       11
1      12         7        8
2      12         8       10
3      14        13        6
4      13         7        6
5      18         4        5
6      20         7        5
'''

  • df.append()
    Append the rows of one DataFrame to the end of another. Note that the two DataFrames should have the same column names.
df = df.append(df2, ignore_index=True)

For example,

df = pd.DataFrame({'points': [10, 12, 12, 14, 13, 18],
                   'rebounds': [7, 7, 8, 13, 7, 4],
                   'assists': [11, 8, 10, 6, 6, 5]})
print(df)
'''
   points  rebounds  assists
0      10         7       11
1      12         7        8
2      12         8       10
3      14        13        6
4      13         7        6
5      18         4        5
'''

df2 = pd.DataFrame({'points': [21, 25, 26],
                    'rebounds': [7, 7, 13],
                    'assists': [11, 3, 3]})
print(df2)
'''
   points  rebounds  assists
0      21         7       11
1      25         7        3
2      26        13        3
'''

# add new row to end of DataFrame
df = df.append(df2, ignore_index=True)
print(df)
'''
   points  rebounds  assists
0      10         7       11
1      12         7        8
2      12         8       10
3      14        13        6
4      13         7        6
5      18         4        5
6      21         7       11
7      25         7        3
8      26        13        3
'''
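Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0; `pd.concat` is the drop-in replacement. A minimal sketch on shortened versions of the frames above:

```python
import pandas as pd

df = pd.DataFrame({'points': [10, 12], 'rebounds': [7, 7], 'assists': [11, 8]})
df2 = pd.DataFrame({'points': [21], 'rebounds': [7], 'assists': [11]})

# Equivalent to df.append(df2, ignore_index=True) on modern pandas
df = pd.concat([df, df2], ignore_index=True)
print(df)
#    points  rebounds  assists
# 0      10         7       11
# 1      12         7        8
# 2      21         7       11
```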

Multiple rows

The following example illustrates this section:

import pandas as pd
grade =[[100,95,100,99], [90,98,99,100], [88,95,98,88], [99,98,97,87], [96.5,90,96,85]]
df = pd.DataFrame(grade, columns=['Chinese', 'English', 'Math', 'Politics'])
print(df)
#     Chinese  English  Math  Politics
# 0     100.0       95   100        99
# 1      90.0       98    99       100
# 2      88.0       95    98        88
# 3      99.0       98    97        87
# 4      96.5       90    96        85
  1. Add multiple rows at once, where the new rows are computed from the values of the existing columns
from statistics import geometric_mean

df = pd.concat([df, df.apply(lambda x: (x.sum(), x.mean(), geometric_mean(x)), axis=0, result_type='expand')], axis=0)
print(df)
#       Chinese    English        Math    Politics
# 0  100.000000   95.00000  100.000000   99.000000
# 1   90.000000   98.00000   99.000000  100.000000
# 2   88.000000   95.00000   98.000000   88.000000
# 3   99.000000   98.00000   97.000000   87.000000
# 4   96.500000   90.00000   96.000000   85.000000
# 0  473.500000  476.00000  490.000000  459.000000
# 1   94.700000   95.20000   98.000000   91.800000
# 2   94.575259   95.15429   97.989795   91.582229

See here 29 28

The following example illustrates this section:

import pandas as pd
grade =[[100,95,100,99], [90,98,99,100], [88,95,98,88], [99,98,97,87], [96.5,90,96,85]]
df = pd.DataFrame(grade, columns=['Chinese', 'English', 'Math', 'Politics'])
print(df)
#     Chinese  English  Math  Politics
# 0     100.0       95   100        99
# 1      90.0       98    99       100
# 2      88.0       95    98        88
# 3      99.0       98    97        87
# 4      96.5       90    96        85

Single column

  1. Add a single column derived from an existing column of the df
df[['Sum']] = df[['Math']]
print(df)
#    Chinese  English  Math  Politics  Sum
# 0    100.0       95   100        99  100
# 1     90.0       98    99       100   99
# 2     88.0       95    98        88   98
# 3     99.0       98    97        87   97
# 4     96.5       90    96        85   96

You can use shift(n) to shift the new column forward or backward 30:

df['Sum'] = df['Sum'].shift(3)
print(df)
#    Chinese  English  Math  Politics    Sum
# 0    100.0       95   100        99    NaN
# 1     90.0       98    99       100    NaN
# 2     88.0       95    98        88    NaN
# 3     99.0       98    97        87  100.0
# 4     96.5       90    96        85   99.0

  2. Add a single column unrelated to the existing columns
df['Sum'] = 'Mark'
print(df)
#    Chinese  English  Math  Politics   Sum
# 0    100.0       95   100        99  Mark
# 1     90.0       98    99       100  Mark
# 2     88.0       95    98        88  Mark
# 3     99.0       98    97        87  Mark
# 4     96.5       90    96        85  Mark

Multiple columns

  1. Add multiple columns at once, derived from existing columns
  • The new column labels differ from the existing ones.
df[['New1', 'New2']] = df[['English', 'Math']]
print(df)
#     Chinese  English  Math  Politics  New1  New2
# 0     100.0       95   100        99    95   100
# 1      90.0       98    99       100    98    99
# 2      88.0       95    98        88    95    98
# 3      99.0       98    97        87    98    97
# 4      96.5       90    96        85    90    96

Or

df[['New1', 'New2']] = df.apply(lambda x: ('yes', 'no'), axis=1, result_type='expand')
print(df)
#    Chinese  English  Math  Politics New1 New2
# 0    100.0       95   100        99  yes   no
# 1     90.0       98    99       100  yes   no
# 2     88.0       95    98        88  yes   no
# 3     99.0       98    97        87  yes   no
# 4     96.5       90    96        85  yes   no
  • If a new column label matches an existing one, no duplicate column is added; the existing column is overwritten in place.
df[['English', 'Geography']] = df[['English', 'Math']]
print(df)
#     Chinese  English  Math  Politics  Geography
# 0     100.0       95   100        99        100
# 1      90.0       98    99       100         99
# 2      88.0       95    98        88         98
# 3      99.0       98    97        87         97
# 4      96.5       90    96        85         96

Or

df[['English', 'New2']] = df.apply(lambda x: ('yes', 'no'), axis=1, result_type='expand')
print(df)
#    Chinese English  Math  Politics New2
# 0    100.0     yes   100        99   no
# 1     90.0     yes    99       100   no
# 2     88.0     yes    98        88   no
# 3     99.0     yes    97        87   no
# 4     96.5     yes    96        85   no
  2. Add multiple columns at once, unrelated to the existing columns
df[['Sum', 'Mean', 'Multiply']] = 'ABC', 'DEF', 'GHI'
print(df)
#    Chinese  English  Math  Politics  Sum Mean Multiply
# 0    100.0       95   100        99  ABC  DEF      GHI
# 1     90.0       98    99       100  ABC  DEF      GHI
# 2     88.0       95    98        88  ABC  DEF      GHI
# 3     99.0       98    97        87  ABC  DEF      GHI
# 4     96.5       90    96        85  ABC  DEF      GHI
  3. Add multiple columns at once, computed from the values of the existing columns
from statistics import geometric_mean

df[['Sum', 'Mean', 'Geomean']] = df.apply(lambda x: (x.sum(), x.mean(), geometric_mean(x)), axis=1, result_type='expand')
print(df)
#    Chinese  English  Math  Politics    Sum    Mean    Geomean
# 0    100.0       95   100        99  394.0  98.500  98.478109
# 1     90.0       98    99       100  387.0  96.750  96.666489
# 2     88.0       95    98        88  369.0  92.250  92.146481
# 3     99.0       98    97        87  381.0  95.250  95.123553
# 4     96.5       90    96        85  367.5  91.875  91.751885


Deduplication

drop_duplicates()

Syntax

The syntax is as follows 31:

DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)
Parameter / Description
subset: column label or sequence of labels, optional. Only consider certain columns for identifying duplicates; by default all of the columns are used.
keep: {'first', 'last', False}, default 'first'. Determines which duplicates (if any) to keep.
- first : drop duplicates except for the first occurrence.
- last : drop duplicates except for the last occurrence.
- False : drop all duplicates.
inplace: bool, default False. Whether to modify the DataFrame rather than creating a new one.
ignore_index: bool, default False. If True, the resulting axis will be labeled 0, 1, …, n - 1.
Returns: DataFrame with duplicates removed, or None if inplace=True.

Examples

The following df is used to illustrate deduplication:

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

#      brand style  rating
# 0  Yum Yum   cup     4.0
# 1  Yum Yum   cup     4.0
# 2  Indomie   cup     3.5
# 3  Indomie  pack    15.0
# 4  Indomie  pack     5.0


By default, duplicate rows are removed based on the values of all columns:

df.drop_duplicates()

#      brand style  rating
# 0  Yum Yum   cup     4.0
# 2  Indomie   cup     3.5
# 3  Indomie  pack    15.0
# 4  Indomie  pack     5.0

Use subset to remove duplicates based on the specified columns only:

df.drop_duplicates(subset=['brand'])

#      brand style  rating
# 0  Yum Yum   cup     4.0
# 2  Indomie   cup     3.5
df.drop_duplicates(subset=['brand', 'style'], keep='last')

#      brand style  rating
# 1  Yum Yum   cup     4.0
# 2  Indomie   cup     3.5
# 4  Indomie  pack     5.0
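`keep=False` goes one step further and drops every row that has a duplicate, keeping no representative:

```python
import pandas as pd

# Subset of the brand/rating frame above
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie'],
    'rating': [4.0, 4.0, 3.5]
})

# keep=False removes *all* copies of duplicated rows
deduped = df.drop_duplicates(keep=False)
print(deduped)
#      brand  rating
# 2  Indomie     3.5
```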

Modifying

See here 27

  • df.loc[]
    Use df.loc[] to set the values of an existing row of a DataFrame to target values
df.loc[target_row_index] = [value1, value2, value3, ...]

For example,

df = pd.DataFrame({'points': [10, 12, 12, 14, 13, 18],
                   'rebounds': [7, 7, 8, 13, 7, 4],
                   'assists': [11, 8, 10, 6, 6, 5]})
print(df)
'''
   points  rebounds  assists
0      10         7       11
1      12         7        8
2      12         8       10
3      14        13        6
4      13         7        6
5      18         4        5
'''

df.loc[4] = ['A', 'B', 'C']
print(df)
'''
  points rebounds assists
0     10        7      11
1     12        7       8
2     12        8      10
3     14       13       6
4      A        B       C
5     18        4       5
'''

Specific values

See here 32

This applies only when there are specific values that need to be handled.

[]: How to set a cell to NaN in a pandas dataframe

[]: How to replace a value in pandas, with NaN?

[]: python / pandas: set a DataFrame column to NaN


Methods

df.mask

df.replace

Examples
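As a minimal sketch (the toy column values are assumptions for illustration): `df.mask` blanks out every cell where a condition holds, while `df.replace` targets one specific value:

```python
import numpy as np
import pandas as pd

# Toy frame (assumed data): turn every 12 into NaN
df = pd.DataFrame({'points': [10, 12, 12], 'rebounds': [7, 7, 8]})

# df.mask: set cells to NaN wherever the condition is True
masked = df.mask(df == 12)
print(masked['points'].tolist())
# [10.0, nan, nan]

# df.replace: target one specific value instead of a condition
replaced = df.replace(12, np.nan)
print(replaced['points'].tolist())
# [10.0, nan, nan]
```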



Renaming

See here 33 34 35

The raw data referenced here is shown below:

data = pd.read_excel(io='analysis.xlsx')
print(data.head(3))
'''
          id        uid  ... diqu  city
0  553217640  6376967.0  ...   广西   南宁市
1  553217639  6376967.0  ...   广西   南宁市
2  553217638  6571870.0  ...  浙江省   杭州市

[3 rows x 11 columns]
'''

Methods

Assignment

Using DataFrame.index = [newName] or DataFrame.columns = [newName], renaming can be done easily. However, the new list of index or column labels must have the same length as the original one; otherwise a length-mismatch error is raised.

This approach modifies the original data directly.

For example,

'''
print(len(data.columns.to_list()))
11
'''

colReorder = ['user', 'uid', 'title', 'docid', 'time', 'status', 'score', 'qudao', 'kffs', 'province', 'city']
'''
print(len(colReorder))
11
'''

data.columns = colReorder
print(data.head(3))
'''
        user        uid  ... province  city
0  553217640  6376967.0  ...       广西   南宁市
1  553217639  6376967.0  ...       广西   南宁市
2  553217638  6571870.0  ...      浙江省   杭州市

[3 rows x 11 columns]
'''

rename

Syntax

DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None)

Besides columns, it can also rename the index, which is very convenient.

Parameter / Description
mapper, index, columns: any one of them can be used alone, or index and columns can be combined. index and columns accept a mapper (function) or a dict.
A dict or map-like function maps old labels to new ones, much like Python's built-in map.
axis: int or str, used together with mapper. Axis name ('index' or 'columns') or number (0, 1). Defaults to 'index'.
copy: boolean, default True. Whether to copy the underlying data.
inplace: bool, default False. Whether to modify the DataFrame rather than creating a new one. If True, the copy value is ignored.

Examples

See here 35



  • Partial

Sometimes only some of the column names need renaming. A dict can be used to rename just the specified columns.

### rename dict: key, default field of raw data; value, new field
rename = {
    'id': 'user',
    'diqu': 'province'
}
data = data.rename(columns=rename)
print(data.head(3))
'''
        user        uid  ... province  city
0  553217640  6376967.0  ...       广西   南宁市
1  553217639  6376967.0  ...       广西   南宁市
2  553217638  6571870.0  ...      浙江省   杭州市

[3 rows x 11 columns]
'''
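Since `rename` also accepts an `index=` mapping, row labels can be renamed the same way; a minimal sketch with assumed labels:

```python
import pandas as pd

# Toy frame with string row labels (assumed data)
df = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])

# index= takes a dict mapping old row labels to new ones
df = df.rename(index={'x': 'row0'})
print(df.index.tolist())
# ['row0', 'y']
```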

Math operations

This section considers both a simple single-level-indexed structure and a more complex multi-level-indexed structure, and illustrates common math operations with examples.

The reference data is shown below:

import pandas as pd

df1 = pd.DataFrame(
    data={"data_provider": ["prov_a", "prov_a", "prov_a", "prov_a", "prov_a", "prov_a"],
          "indicator": ["ind_a", "ind_a", "ind_a", "ind_b", "ind_b", "ind_b"],
          "unit": ["EUR", "EUR", "EUR", "EUR", "EUR", "EUR"],
          "year": ["2017", "2018","2019", "2017","2018","2019"],
          "country1": [1, 2, 3, 2, 4, 6],
          "country2": [4, 5, 6, 40, 50, 60]}
)
#   data_provider indicator unit  year  country1  country2
# 0        prov_a     ind_a  EUR  2017         1         4
# 1        prov_a     ind_a  EUR  2018         2         5
# 2        prov_a     ind_a  EUR  2019         3         6
# 3        prov_a     ind_b  EUR  2017         2        40
# 4        prov_a     ind_b  EUR  2018         4        50
# 5        prov_a     ind_b  EUR  2019         6        60
#
# RangeIndex(start=0, stop=6, step=1)
# Index(['data_provider', 'indicator', 'unit', 'year', 'country1', 'country2'], dtype='object')


import numpy as np
df2 = pd.DataFrame(np.arange(1, 51).reshape(10, 5), columns=['Asia', 'Africa', 'Australia', 'Europe', 'Russia'])
#    Asia  Africa  Australia  Europe  Russia
# 0     1       2          3       4       5
# 1     6       7          8       9      10
# 2    11      12         13      14      15
# 3    16      17         18      19      20
# 4    21      22         23      24      25
# 5    26      27         28      29      30
# 6    31      32         33      34      35
# 7    36      37         38      39      40
# 8    41      42         43      44      45
# 9    46      47         48      49      50
# 
# RangeIndex(start=0, stop=10, step=1)
# Index(['Asia', 'Africa', 'Australia', 'Europe', 'Russia'], dtype='object')


df3 = df2.copy()
df3.loc[3, 'Asia'] = np.nan
df3.loc[5, 'Europe'] = np.nan
df3.loc[7, 'Russia'] = np.nan
#    Asia  Africa  Australia  Europe  Russia
# 0   1.0       2          3     4.0     5.0
# 1   6.0       7          8     9.0    10.0
# 2  11.0      12         13    14.0    15.0
# 3   NaN      17         18    19.0    20.0
# 4  21.0      22         23    24.0    25.0
# 5  26.0      27         28     NaN    30.0
# 6  31.0      32         33    34.0    35.0
# 7  36.0      37         38    39.0     NaN
# 8  41.0      42         43    44.0    45.0
# 9  46.0      47         48    49.0    50.0
#
# Index(['Asia', 'Africa', 'Australia', 'Europe', 'Russia'], dtype='object')
#
# RangeIndex(start=0, stop=10, step=1)

Operations

divide

See here 36 37


Syntax

Get floating division of the DataFrame and other, element-wise (binary operator truediv). Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in either input. The reverse version is rtruediv. This is one of the flexible wrappers (add, sub, mul, div, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

DataFrame.divide(other, axis='columns', level=None, fill_value=None)
Parameter / Description
other: scalar, sequence, Series, or DataFrame. Any single- or multi-element data structure, or list-like object.
axis: {0 or 'index', 1 or 'columns'}. Whether to compare by the index (0 or 'index') or columns (1 or 'columns'). For Series input, this is the axis to match the Series index on.
level: int or label. Broadcast across a level, matching index values on the passed MultiIndex level.
fill_value: float or None, default None. Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data is missing in the corresponding locations of both inputs, the result will be missing.
Returns: DataFrame. The result of the arithmetic operation.

gmean

See here


Methods
Method / Notes
np.exp(np.log(df2.prod(axis=1))/df2.notna().sum(1)): NaN values in df are skipped in the computation.
df.apply(gmean, axis=1): requires importing gmean from scipy.stats first; NaN values in df are not skipped (rows containing NaN yield NaN).
df.apply(lambda x: np.exp(np.mean(np.log(x))), axis=1): NaN values in df are skipped in the computation.

subtraction

See here


Methods

Examples
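As a minimal sketch with assumed toy data, subtraction can be written with the `-` operator or with `DataFrame.sub`, which additionally accepts `axis` and `fill_value`:

```python
import pandas as pd

# Toy frame echoing the country columns above (assumed data)
df = pd.DataFrame({'country1': [1, 2], 'country2': [4, 5]})

# Operator form: element-wise difference of two columns
diff = df['country2'] - df['country1']
print(diff.tolist())
# [3, 3]

# Method form: df.sub aligns a Series along the rows with axis=0
res = df.sub(pd.Series([1, 1]), axis=0)
print(res)
#    country1  country2
# 0         0         3
# 1         1         4
```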

RangeIndex

See here 38


divide
  1. When the df contains str or other non-numeric data, take care to select only the int or float columns before doing math.
df1['Normalize'] = df1[df1.columns[-2]].div(df1[df1.columns[-1]])
print(df1)
#   data_provider indicator unit  year  country1  country2  Normalize
# 0        prov_a     ind_a  EUR  2017         1         4       0.25
# 1        prov_a     ind_a  EUR  2018         2         5       0.40
# 2        prov_a     ind_a  EUR  2019         3         6       0.50
# 3        prov_a     ind_b  EUR  2017         2        40       0.05
# 4        prov_a     ind_b  EUR  2018         4        50       0.08
# 5        prov_a     ind_b  EUR  2019         6        60       0.10

  2. When all of the df's columns are numeric, choose the axis direction carefully.
df2SumAxis0 = df2.sum(axis=0)
# Asia         235
# Africa       245
# Australia    255
# Europe       265
# Russia       275
# dtype: int64
#
# Index(['Asia', 'Africa', 'Australia', 'Europe', 'Russia'], dtype='object')
#
# <class 'pandas.core.series.Series'>


df2SumAxis1 = df2.sum(axis=1)
# 0     15
# 1     40
# 2     65
# 3     90
# 4    115
# 5    140
# 6    165
# 7    190
# 8    215
# 9    240
# dtype: int64
#
# RangeIndex(start=0, stop=10, step=1)
#
# <class 'pandas.core.series.Series'>


df = df2.apply(lambda x: x.div(x.sum()), axis=0)
print(df)
#        Asia    Africa  Australia    Europe    Russia
# 0  0.004255  0.008163   0.011765  0.015094  0.018182
# 1  0.025532  0.028571   0.031373  0.033962  0.036364
# 2  0.046809  0.048980   0.050980  0.052830  0.054545
# 3  0.068085  0.069388   0.070588  0.071698  0.072727
# 4  0.089362  0.089796   0.090196  0.090566  0.090909
# 5  0.110638  0.110204   0.109804  0.109434  0.109091
# 6  0.131915  0.130612   0.129412  0.128302  0.127273
# 7  0.153191  0.151020   0.149020  0.147170  0.145455
# 8  0.174468  0.171429   0.168627  0.166038  0.163636
# 9  0.195745  0.191837   0.188235  0.184906  0.181818


df = df2.apply(lambda x: x.div(x.sum()), axis=1)
print(df)
#        Asia    Africa  Australia    Europe    Russia
# 0  0.066667  0.133333        0.2  0.266667  0.333333
# 1  0.150000  0.175000        0.2  0.225000  0.250000
# 2  0.169231  0.184615        0.2  0.215385  0.230769
# 3  0.177778  0.188889        0.2  0.211111  0.222222
# 4  0.182609  0.191304        0.2  0.208696  0.217391
# 5  0.185714  0.192857        0.2  0.207143  0.214286
# 6  0.187879  0.193939        0.2  0.206061  0.212121
# 7  0.189474  0.194737        0.2  0.205263  0.210526
# 8  0.190698  0.195349        0.2  0.204651  0.209302
# 9  0.191667  0.195833        0.2  0.204167  0.208333

gmean

When all of the df's columns are numeric, choose the axis direction carefully.


axis=1
  • np.exp(np.log(df2.prod(axis=1))/df2.notna().sum(1))
gmean_axis1 = np.exp(np.log(df3.prod(axis=1))/df3.notna().sum(1))
gmean_axis1.name = 'gmean'
result_axis1 = pd.concat([df3, gmean_axis1], axis=1, ignore_index=False)
#    Asia  Africa  Australia  Europe  Russia      gmean
# 0   1.0       2          3     4.0     5.0   2.605171
# 1   6.0       7          8     9.0    10.0   7.872567
# 2  11.0      12         13    14.0    15.0  12.922523
# 3   NaN      17         18    19.0    20.0  18.466146
# 4  21.0      22         23    24.0    25.0  22.956423
# 5  26.0      27         28     NaN    30.0  27.711121
# 6  31.0      32         33    34.0    35.0  32.969664
# 7  36.0      37         38    39.0     NaN  37.483325
# 8  41.0      42         43    44.0    45.0  42.976729
# 9  46.0      47         48    49.0    50.0  47.979156
  • df2.apply(gmean, axis=1)
from scipy.stats import gmean

gmean_axis1 = df3.apply(gmean, axis=1)
gmean_axis1.name = 'gmean'
result_axis1 = pd.concat([df3, gmean_axis1], axis=1, ignore_index=False)
#    Asia  Africa  Australia  Europe  Russia      gmean
# 0   1.0       2          3     4.0     5.0   2.605171
# 1   6.0       7          8     9.0    10.0   7.872567
# 2  11.0      12         13    14.0    15.0  12.922523
# 3   NaN      17         18    19.0    20.0        NaN
# 4  21.0      22         23    24.0    25.0  22.956423
# 5  26.0      27         28     NaN    30.0        NaN
# 6  31.0      32         33    34.0    35.0  32.969664
# 7  36.0      37         38    39.0     NaN        NaN
# 8  41.0      42         43    44.0    45.0  42.976729
# 9  46.0      47         48    49.0    50.0  47.979156

  • df2.apply(lambda x: np.exp(np.mean(np.log(x))), axis=1)
gmean_axis1 = df3.apply(lambda x: np.exp(np.mean(np.log(x))), axis=1)
gmean_axis1.name = 'gmean'
result_axis1 = pd.concat([df3, gmean_axis1], axis=1, ignore_index=False)
#    Asia  Africa  Australia  Europe  Russia      gmean
# 0   1.0       2          3     4.0     5.0   2.605171
# 1   6.0       7          8     9.0    10.0   7.872567
# 2  11.0      12         13    14.0    15.0  12.922523
# 3   NaN      17         18    19.0    20.0  18.466146
# 4  21.0      22         23    24.0    25.0  22.956423
# 5  26.0      27         28     NaN    30.0  27.711121
# 6  31.0      32         33    34.0    35.0  32.969664
# 7  36.0      37         38    39.0     NaN  37.483325
# 8  41.0      42         43    44.0    45.0  42.976729
# 9  46.0      47         48    49.0    50.0  47.979156

axis=0
  • np.exp(np.log(df2.prod(axis=0))/df2.notna().sum(0))
gmean_axis0 = np.exp(np.log(df3.prod(axis=0))/df3.notna().sum(0))
gmean_axis0.name = 'gmean'
gmean_axis0 = pd.DataFrame(gmean_axis0)
gmean_axis0 = gmean_axis0.transpose()
result_axis0 = pd.concat([df3, gmean_axis0], axis=0, ignore_index=False)
#             Asia    Africa  Australia     Europe     Russia
# 0       1.000000   2.00000   3.000000   4.000000   5.000000
# 1       6.000000   7.00000   8.000000   9.000000  10.000000
# 2      11.000000  12.00000  13.000000  14.000000  15.000000
# 3            NaN  17.00000  18.000000  19.000000  20.000000
# 4      21.000000  22.00000  23.000000  24.000000  25.000000
# 5      26.000000  27.00000  28.000000        NaN  30.000000
# 6      31.000000  32.00000  33.000000  34.000000  35.000000
# 7      36.000000  37.00000  38.000000  39.000000        NaN
# 8      41.000000  42.00000  43.000000  44.000000  45.000000
# 9      46.000000  47.00000  48.000000  49.000000  50.000000
# gmean  16.176423  18.18965  19.821295  20.561997  21.256376
  • df2.apply(gmean, axis=0)
from scipy.stats import gmean
gmean_axis0 = df3.apply(gmean, axis=0)
gmean_axis0.name = 'gmean'
gmean_axis0 = pd.DataFrame(gmean_axis0)
gmean_axis0 = gmean_axis0.transpose()
result_axis0 = pd.concat([df3, gmean_axis0], axis=0, ignore_index=False)
#        Asia    Africa  Australia  Europe  Russia
# 0       1.0   2.00000   3.000000     4.0     5.0
# 1       6.0   7.00000   8.000000     9.0    10.0
# 2      11.0  12.00000  13.000000    14.0    15.0
# 3       NaN  17.00000  18.000000    19.0    20.0
# 4      21.0  22.00000  23.000000    24.0    25.0
# 5      26.0  27.00000  28.000000     NaN    30.0
# 6      31.0  32.00000  33.000000    34.0    35.0
# 7      36.0  37.00000  38.000000    39.0     NaN
# 8      41.0  42.00000  43.000000    44.0    45.0
# 9      46.0  47.00000  48.000000    49.0    50.0
# gmean   NaN  18.18965  19.821295     NaN     NaN
  • df2.apply(lambda x: np.exp(np.mean(np.log(x))), axis=0)
gmean_axis0 = df3.apply(lambda x: np.exp(np.mean(np.log(x))), axis=0)
gmean_axis0.name = 'gmean'
gmean_axis0 = pd.DataFrame(gmean_axis0)
gmean_axis0 = gmean_axis0.transpose()
result_axis0 = pd.concat([df3, gmean_axis0], axis=0, ignore_index=False)
#             Asia    Africa  Australia     Europe     Russia
# 0       1.000000   2.00000   3.000000   4.000000   5.000000
# 1       6.000000   7.00000   8.000000   9.000000  10.000000
# 2      11.000000  12.00000  13.000000  14.000000  15.000000
# 3            NaN  17.00000  18.000000  19.000000  20.000000
# 4      21.000000  22.00000  23.000000  24.000000  25.000000
# 5      26.000000  27.00000  28.000000        NaN  30.000000
# 6      31.000000  32.00000  33.000000  34.000000  35.000000
# 7      36.000000  37.00000  38.000000  39.000000        NaN
# 8      41.000000  42.00000  43.000000  44.000000  45.000000
# 9      46.000000  47.00000  48.000000  49.000000  50.000000
# gmean  16.176423  18.18965  19.821295  20.561997  21.256376

subtraction
  1. When the df contains str or other non-numeric data, take care to select only the int or float columns before doing math 30
df1['sub'] = df1['country2'] - df1['country1']
#   data_provider indicator unit  year  country1  country2  sub
# 0        prov_a     ind_a  EUR  2017         1         4    3
# 1        prov_a     ind_a  EUR  2018         2         5    3
# 2        prov_a     ind_a  EUR  2019         3         6    3
# 3        prov_a     ind_b  EUR  2017         2        40   38
# 4        prov_a     ind_b  EUR  2018         4        50   46
# 5        prov_a     ind_b  EUR  2019         6        60   54

df1['country1'] = df1['country2'] - df1['country1']
#   data_provider indicator unit  year  country1  country2  sub
# 0        prov_a     ind_a  EUR  2017         3         4    3
# 1        prov_a     ind_a  EUR  2018         3         5    3
# 2        prov_a     ind_a  EUR  2019         3         6    3
# 3        prov_a     ind_b  EUR  2017        38        40   38
# 4        prov_a     ind_b  EUR  2018        46        50   46
# 5        prov_a     ind_b  EUR  2019        54        60   54

SingleIndex

Build a simple single-level row index from the raw data given above, as shown below:

df1SingleIndex = df1.pivot_table(
    index=['indicator'],
    columns='year'
)
#           country1           country2
# year          2017 2018 2019     2017 2018 2019
# indicator
# ind_a            1    2    3        4    5    6
# ind_b            2    4    6       40   50   60
#
# Index(['ind_a', 'ind_b'], dtype='object', name='indicator')
#
# MultiIndex([('country1', '2017'),
#             ('country1', '2018'),
#             ('country1', '2019'),
#             ('country2', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019')],
#            names=[None, 'year'])

For filtering Series data under a MultiIndex, see 39 40


divide

When all of the df's columns are numeric, choose the axis direction carefully.

idx = pd.IndexSlice

ind_a_country1 = df1SingleIndex.loc[['ind_a'], idx['country1', :]]
#           country1
# year          2017 2018 2019
# indicator
# ind_a            1    2    3
#
# Index(['ind_a'], dtype='object', name='indicator')
#
# MultiIndex([('country1', '2017'),
#             ('country1', '2018'),
#             ('country1', '2019')],
#            names=[None, 'year'])

ind_a_2018 = df1SingleIndex.loc[['ind_a'], idx[:, '2018']]
#           country1 country2
# year          2018     2018
# indicator
# ind_a            2        5
#
# Index(['ind_a'], dtype='object', name='indicator')
#
# MultiIndex([('country1', '2018'),
#             ('country2', '2018')],
#            names=[None, 'year'])

df1_2018 = df1SingleIndex.loc[:, idx[:, '2018']]
#          country1 country2
# year          2018     2018
# indicator
# ind_a            2        5
# ind_b            4       50
#
# Index(['ind_a', 'ind_b'], dtype='object', name='indicator')
#
# MultiIndex([('country1', '2018'),
#             ('country2', '2018')],
#            names=[None, 'year'])

df1_2019 = df1SingleIndex.loc[:, idx[:, '2019']]
#           country1 country2
# year          2019     2019
# indicator
# ind_a            3        6
# ind_b            6       60
#
# Index(['ind_a', 'ind_b'], dtype='object', name='indicator')
#
# MultiIndex([('country1', '2019'),
#             ('country2', '2019')],
#            names=[None, 'year'])

df1_2018Div2019 = (df1_2018.div(df1_2019.to_numpy()).rename(columns={'2018': '2018/2019'}))
#            country1  country2
# year      2018/2019 2018/2019
# indicator
# ind_a      0.666667  0.833333
# ind_b      0.666667  0.833333
#
# Index(['ind_a', 'ind_b'], dtype='object', name='indicator')
#
# MultiIndex([('country1', '2018/2019'),
#             ('country2', '2018/2019')],
#            names=[None, 'year'])
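The .to_numpy() call above is doing real work: df.div() aligns on column labels, and since the 2018 and 2019 columns carry different labels, a label-aligned division would yield nothing but NaN. A minimal sketch of the difference, using made-up single-column frames:

```python
import pandas as pd

a = pd.DataFrame({'2018': [2, 4]})
b = pd.DataFrame({'2019': [3, 6]})

# Label-aligned division: the frames share no column labels,
# so the result is the union of columns, filled with NaN.
aligned = a.div(b)

# Positional division: .to_numpy() strips the labels,
# so the values are divided element-wise by position.
positional = a.div(b.to_numpy())
print(positional)
```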

df1_out = pd.concat([df1SingleIndex, df1_2018Div2019], axis=1)
#     country1           country2            country1  country2
# year          2017 2018 2019     2017 2018 2019 2018/2019 2018/2019
# indicator
# ind_a            1    2    3        4    5    6  0.666667  0.833333
# ind_b            2    4    6       40   50   60  0.666667  0.833333
#
# Index(['ind_a', 'ind_b'], dtype='object', name='indicator')
#
# MultiIndex([('country1',      '2017'),
#             ('country1',      '2018'),
#             ('country1',      '2019'),
#             ('country2',      '2017'),
#             ('country2',      '2018'),
#             ('country2',      '2019'),
#             ('country1', '2018/2019'),
#             ('country2', '2018/2019')],
#            names=[None, 'year'])

MultiIndex

A simple multi-level index can be built from the raw data given above, as shown below:

df1MultiIndex = df1.pivot_table(
    index=['data_provider', 'indicator', 'unit'],
    columns='year'
)
#                              country1           country2
# year                             2017 2018 2019     2017 2018 2019
# data_provider indicator unit
# prov_a        ind_a     EUR         1    2    3        4    5    6
#               ind_b     EUR         2    4    6       40   50   60
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_a', 'ind_b', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])
#
# MultiIndex([('country1', '2017'),
#             ('country1', '2018'),
#             ('country1', '2019'),
#             ('country2', '2017'),
#             ('country2', '2018'),
#             ('country2', '2019')],
#            names=[None, 'year'])

For filtering Series data under a multi-level index, see 39 40.


divide

When the data in the df are all of str or float type, be careful when choosing the axis direction.

idx = pd.IndexSlice

ind_a_country1 = df1MultiIndex.loc[idx[:, 'ind_a'], idx['country1', :]]
#                              country1
# year                             2017 2018 2019
# data_provider indicator unit
# prov_a        ind_a     EUR         1    2    3
#
# MultiIndex([('prov_a', 'ind_a', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])
#
# MultiIndex([('country1', '2017'),
#             ('country1', '2018'),
#             ('country1', '2019')],
#            names=[None, 'year'])

ind_a_2018 = df1MultiIndex.loc[idx[:, 'ind_a'], idx[:, '2018']]
#                              country1 country2
# year                             2018     2018
# data_provider indicator unit
# prov_a        ind_a     EUR         2        5
#
# MultiIndex([('prov_a', 'ind_a', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])
#
# MultiIndex([('country1', '2018'),
#             ('country2', '2018')],
#            names=[None, 'year'])

df1_2018 = df1MultiIndex.loc[:, idx[:, '2018']]
#                              country1 country2
# year                             2018     2018
# data_provider indicator unit
# prov_a        ind_a     EUR         2        5
#               ind_b     EUR         4       50

df1_2019 = df1MultiIndex.loc[:, idx[:, '2019']]
#                              country1 country2
# year                             2019     2019
# data_provider indicator unit
# prov_a        ind_a     EUR         3        6
#               ind_b     EUR         6       60

df1_2018Div2019 = (df1_2018.div(df1_2019.to_numpy()).rename(columns={'2018': '2018/2019'}))
#                               country1  country2
# year                         2018/2019 2018/2019
# data_provider indicator unit
# prov_a        ind_a     EUR   0.666667  0.833333
#               ind_b     EUR   0.666667  0.833333
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_a', 'ind_b', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])
#
# MultiIndex([('country1', '2018/2019'),
#             ('country2', '2018/2019')],
#            names=[None, 'year'])

df1_out = pd.concat([df1MultiIndex, df1_2018Div2019], axis=1)
#                              country1            ... country2  country1  country2
# year                             2017 2018 2019  ...     2019 2018/2019 2018/2019
# data_provider indicator unit                     ...
# prov_a        ind_a     EUR         1    2    3  ...        6  0.666667  0.833333
#               ind_b     EUR         2    4    6  ...       60  0.666667  0.833333
#
# [2 rows x 8 columns]
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
#             ('prov_a', 'ind_b', 'EUR')],
#            names=['data_provider', 'indicator', 'unit'])
#
# MultiIndex([('country1',      '2017'),
#             ('country1',      '2018'),
#             ('country1',      '2019'),
#             ('country2',      '2017'),
#             ('country2',      '2018'),
#             ('country2',      '2019'),
#             ('country1', '2018/2019'),
#             ('country2', '2018/2019')],
#            names=[None, 'year'])

Filtering

See another article in this series 41.


Statistical operations

See here 42.


Extrema

See here 43 44.

Take the following df as an example:

import io

data = '''                  
day                code  pe_ratio  pb_ratio  market_cap                                    
2016-03-01  002027.XSHE   34.5035   25.4271   1169.3248
2016-03-02  002027.XSHE   36.0459   26.5638   1221.5966
2016-03-03  002027.XSHE   35.7059   26.3132   1210.0721
2016-03-04  002027.XSHE   36.1309   26.6264   1224.4777
2016-03-07  002027.XSHE   37.1025   27.3424   1257.4049
2016-03-08  002027.XSHE   36.7139   27.0560   1244.2340
2016-03-09  002027.XSHE   35.0986   25.8657   1189.4926
'''

df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
print(df)
#           day         code  pe_ratio  pb_ratio  market_cap
# 0  2016-03-01  002027.XSHE   34.5035   25.4271   1169.3248
# 1  2016-03-02  002027.XSHE   36.0459   26.5638   1221.5966
# 2  2016-03-03  002027.XSHE   35.7059   26.3132   1210.0721
# 3  2016-03-04  002027.XSHE   36.1309   26.6264   1224.4777
# 4  2016-03-07  002027.XSHE   37.1025   27.3424   1257.4049
# 5  2016-03-08  002027.XSHE   36.7139   27.0560   1244.2340
# 6  2016-03-09  002027.XSHE   35.0986   25.8657   1189.4926

Methods

First, the definitions of these two functions:

  • pandas.DataFrame.idxmax
    Return index of first occurrence of maximum over requested axis.
  • pandas.DataFrame.idxmin
    Return index of first occurrence of minimum over requested axis.

In other words, they return the index of the maximum or minimum value.

  • df.idxmax(axis, skipna) / df.idxmin(axis, skipna)
    Return the index label of the maximum / minimum value in the df. axis selects the axis direction; skipna decides whether NaN values are ignored.
  • df.max(axis=None, skipna=None, level=None, numeric_only=None) / df.min(axis=None, skipna=None, level=None, numeric_only=None)
    Return the maximum / minimum value in the df. See the official examples for how to use the parameters.

Examples

Extrema
idxmax = df['market_cap'].idxmax()
# 4

idxmin = df['market_cap'].idxmin()
# 0

First get the indices of the maximum and the minimum, saved as idxmax and idxmin. They correspond to row labels, so printing those two rows gives:

          day         code  pe_ratio  pb_ratio  market_cap
4  2016-03-07  002027.XSHE   37.1025   27.3424   1257.4049
0  2016-03-01  002027.XSHE   34.5035   25.4271   1169.3248
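The two rows above can be retrieved in one call with df.loc. A minimal sketch, using a small stand-in for the market_cap column:

```python
import pandas as pd

# A small stand-in for the market_cap column above.
df = pd.DataFrame({'market_cap': [1169.3248, 1221.5966, 1257.4049]})

idxmax = df['market_cap'].idxmax()  # row label of the maximum
idxmin = df['market_cap'].idxmin()  # row label of the minimum

# Select both extreme rows in a single .loc call.
extremes = df.loc[[idxmax, idxmin]]
print(extremes)
```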

Data comparison

See here 45.


Differences

df.compare

df.compare()s.compare() 方法使您可以分别比较两个 DataFrameSeries,并总结它们之间的差异。V1.1.0 中添加了此功能。

A limitation is that only two df or Series of the same shape, with identical row and column labels, can be compared.

The syntax is as follows:

df.compare(other, align_axis=1, keep_shape=False, keep_equal=False)

where:

  • other — the object to compare against.
  • align_axis=1 — stack the differences on columns (align_axis=0 stacks them on rows).
  • keep_shape=False — keep only the rows and columns that contain differences (True keeps all original rows and columns).
  • keep_equal=False — show equal values as NaN (True keeps the equal values).

Examples

The following DataFrames are used to illustrate how to find the differences between df or Series objects:

df1 = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
    },
    columns=["col1", "col2", "col3"],
)
#   col1  col2  col3
# 0    a   1.0   1.0
# 1    a   2.0   2.0
# 2    b   3.0   3.0
# 3    b   NaN   4.0
# 4    a   5.0   5.0


df2 = df1.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0
#   col1  col2  col3
# 0    c   1.0   1.0
# 1    a   2.0   2.0
# 2    b   3.0   4.0
# 3    b   NaN   4.0
# 4    a   5.0   5.0


df3 = df2.copy()
df3 = df3.iloc[:3]
df3.loc[2, 'col1'] = 'e'
df3.loc[1, 'col3'] = 6.0
#   col1  col2  col3
# 0    c   1.0   1.0
# 1    a   2.0   6.0
# 2    e   3.0   4.0

Same shape
compare12 = df1.compare(df2, keep_equal=False, keep_shape=False)
#   col1       col2       col3
#   self other self other self other
# 0    a     c  NaN   NaN  NaN   NaN
# 2  NaN   NaN  NaN   NaN  3.0   4.0
# 3  NaN   NaN  NaN   NaN  NaN   NaN
#
# MultiIndex([('col1',  'self'),
#             ('col1', 'other'),
#             ('col2',  'self'),
#             ('col2', 'other'),
#             ('col3',  'self'),
#             ('col3', 'other')],
#            )
#
# Int64Index([0, 2, 3], dtype='int64')

compare21 = df2.compare(df1)
#   col1       col2       col3
#   self other self other self other
# 0    c     a  NaN   NaN  NaN   NaN
# 2  NaN   NaN  NaN   NaN  4.0   3.0
# 3  NaN   NaN  NaN   NaN  NaN   NaN
#
# MultiIndex([('col1',  'self'),
#             ('col1', 'other'),
#             ('col2',  'self'),
#             ('col2', 'other'),
#             ('col3',  'self'),
#             ('col3', 'other')],
#            )
# Int64Index([0, 2, 3], dtype='int64')
  • align_axis

Stack the differences on rows:

compare12 = df1.compare(df2, align_axis=0)
#         col1  col2  col3
# 0 self     a   NaN   NaN
#   other    c   NaN   NaN
# 2 self   NaN   NaN   3.0
#   other  NaN   NaN   4.0
# 3 self   NaN   NaN   NaN
#   other  NaN   NaN   NaN
#
# Index(['col1', 'col2', 'col3'], dtype='object')
#
# MultiIndex([(0,  'self'),
#             (0, 'other'),
#             (2,  'self'),
#             (2, 'other'),
#             (3,  'self'),
#             (3, 'other')],
#            )

compare21 = df2.compare(df1, align_axis=0)
#         col1  col2  col3
# 0 self     c   NaN   NaN
#   other    a   NaN   NaN
# 2 self   NaN   NaN   4.0
#   other  NaN   NaN   3.0
# 3 self   NaN   NaN   NaN
#   other  NaN   NaN   NaN

Stack the differences on columns:

compare12 = df1.compare(df2, align_axis=1)
#   col1       col2       col3
#   self other self other self other
# 0    a     c  NaN   NaN  NaN   NaN
# 2  NaN   NaN  NaN   NaN  3.0   4.0
# 3  NaN   NaN  NaN   NaN  NaN   NaN
#
# Int64Index([0, 2, 3], dtype='int64')
#
# MultiIndex([('col1',  'self'),
#             ('col1', 'other'),
#             ('col2',  'self'),
#             ('col2', 'other'),
#             ('col3',  'self'),
#             ('col3', 'other')],
#            )


compare21 = df2.compare(df1, align_axis=1)
#   col1       col2       col3      
#   self other self other self other
# 0    c     a  NaN   NaN  NaN   NaN
# 2  NaN   NaN  NaN   NaN  4.0   3.0
# 3  NaN   NaN  NaN   NaN  NaN   NaN
  • keep_shape

Keep all original rows and columns:

compare12 = df1.compare(df2, keep_shape=True)
#   col1       col2       col3      
#   self other self other self other
# 0    a     c  NaN   NaN  NaN   NaN
# 1  NaN   NaN  NaN   NaN  NaN   NaN
# 2  NaN   NaN  NaN   NaN  3.0   4.0
# 3  NaN   NaN  NaN   NaN  NaN   NaN
# 4  NaN   NaN  NaN   NaN  NaN   NaN
#
# MultiIndex([('col1',  'self'),
#             ('col1', 'other'),
#             ('col2',  'self'),
#             ('col2', 'other'),
#             ('col3',  'self'),
#             ('col3', 'other')],
#            )
#
# RangeIndex(start=0, stop=5, step=1)
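keep_equal has no example above, so here is a minimal sketch (with made-up frames) of how keep_equal=True shows the equal values instead of NaN inside the differing rows and columns:

```python
import pandas as pd

d1 = pd.DataFrame({'col1': ['a', 'b'], 'col2': [1.0, 2.0]})
d2 = pd.DataFrame({'col1': ['c', 'b'], 'col2': [1.0, 9.0]})

# keep_equal=True shows the actual values instead of NaN
# for equal cells inside the differing rows/columns.
result = d1.compare(d2, keep_equal=True)
print(result)
```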

Different shapes
compare23 = df2.compare(df3)
#   File "/Applications/miniconda3/lib/python3.8/site-packages/pandas/core/ops/__init__.py", line 289, in align_method_FRAME
#     raise ValueError(
# ValueError: Can only compare identically-labeled DataFrame objects
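A possible workaround (my addition, not from the original text): restrict both frames to their common row and column labels first, then compare the identically-labeled subsets. A sketch with made-up frames shaped like df2 and df3:

```python
import pandas as pd

d2 = pd.DataFrame({'col1': ['c', 'a', 'b'], 'col3': [1.0, 2.0, 4.0]})
d3 = pd.DataFrame({'col1': ['c', 'a'], 'col3': [1.0, 6.0]})

# Restrict both frames to their common row and column labels,
# then compare the now identically-labeled subsets.
common_idx = d2.index.intersection(d3.index)
common_cols = d2.columns.intersection(d3.columns)
diff = d2.loc[common_idx, common_cols].compare(d3.loc[common_idx, common_cols])
print(diff)
```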

df.merge

pd.mergereduce 结合的方法使您可以同时比较多个 DataFrame 并总结它们之间的差异 46474849

The advantage is that the shapes of the df do not matter, but the row and column labels should be the same.


Examples

The following DataFrames are used to illustrate how to find the differences among multiple df:

df1 = pd.DataFrame({
    "品类":["蔬菜","蔬菜","水果","甜点"],
    "商品":["白菜","土豆","苹果","蛋糕"],
    "销量":[10, 20, 30, 40]
})
#    品类  商品  销量
# 0  蔬菜  白菜  10
# 1  蔬菜  土豆  20
# 2  水果  苹果  30
# 3  甜点  蛋糕  40

df2 = pd.DataFrame({
    "品类":["蔬菜","蔬菜","水果","水产", "甜点"],
    "商品":["白菜","土豆","苹果","河虾", "蛋糕"],
    "销售额":[100, 200, 300, 400, 500]
})
#    品类  商品  销售额
# 0  蔬菜  白菜  100
# 1  蔬菜  土豆  200
# 2  水果  苹果  300
# 3  水产  河虾  400
# 4  甜点  蛋糕  500

df3 = pd.DataFrame({
    "品类":["蔬菜","烧烤","蔬菜","甜点"],
    "商品":["白菜","羊肉串","土豆","蛋糕"],
})
#    品类   商品
# 0  蔬菜   白菜
# 1  烧烤  羊肉串
# 2  蔬菜   土豆
# 3  甜点   蛋糕
  • Intersection of multiple df
from functools import reduce
dfs = [df1, df2, df3]
df_intersec = reduce(lambda left, right: pd.merge(left, right, how='inner'), dfs)
#    品类  商品  销量  销售额
# 0  蔬菜  白菜  10  100
# 1  蔬菜  土豆  20  200
# 2  甜点  蛋糕  40  500
  • Intersection of two df on specified columns
df_intersec = pd.merge(df1, df2, on='商品', how='inner')
#   品类_x  商品  销量 品类_y  销售额
# 0   蔬菜  白菜  10   蔬菜  100
# 1   蔬菜  土豆  20   蔬菜  200
# 2   水果  苹果  30   水果  300
# 3   甜点  蛋糕  40   甜点  500

df_intersec = pd.merge(df1, df2, on='品类', how='inner')
#    品类 商品_x  销量 商品_y  销售额
# 0  蔬菜   白菜  10   白菜  100
# 1  蔬菜   白菜  10   土豆  200
# 2  蔬菜   土豆  20   白菜  100
# 3  蔬菜   土豆  20   土豆  200
# 4  水果   苹果  30   苹果  300
# 5  甜点   蛋糕  40   蛋糕  500

df_intersec = pd.merge(df1, df2, on=['品类', '商品'], how='inner')
#    品类  商品  销量  销售额
# 0  蔬菜  白菜  10  100
# 1  蔬菜  土豆  20  200
# 2  水果  苹果  30  300
# 3  甜点  蛋糕  40  500

intersection

使用 setintersection 的方法使您可以同时比较多个 Series 并总结它们之间的差异 50


Examples

The following DataFrames are used to illustrate how to find the differences among multiple df:

df1 = pd.DataFrame({
    "品类":["蔬菜","蔬菜","水果","甜点"],
    "商品":["白菜","土豆","苹果","蛋糕"],
    "销量":[10, 20, 30, 40]
})
#    品类  商品  销量
# 0  蔬菜  白菜  10
# 1  蔬菜  土豆  20
# 2  水果  苹果  30
# 3  甜点  蛋糕  40

df2 = pd.DataFrame({
    "品类":["蔬菜","蔬菜","水果","水产", "甜点"],
    "商品":["白菜","土豆","苹果","河虾", "蛋糕"],
    "销售额":[100, 200, 300, 400, 500]
})
#    品类  商品  销售额
# 0  蔬菜  白菜  100
# 1  蔬菜  土豆  200
# 2  水果  苹果  300
# 3  水产  河虾  400
# 4  甜点  蛋糕  500

df3 = pd.DataFrame({
    "品类":["蔬菜","烧烤","蔬菜","甜点"],
    "商品":["白菜","羊肉串","土豆","蛋糕"],
})
#    品类   商品
# 0  蔬菜   白菜
# 1  烧烤  羊肉串
# 2  蔬菜   土豆
# 3  甜点   蛋糕
  • Take the intersection of the 商品 columns of the df above:
dfs = [df1['商品'], df2['商品'], df3['商品']]
df_intersec = set(dfs[0]).intersection(*dfs[1:])
# {'土豆', '蛋糕', '白菜'}

Note that the ordering of elements in the intersection result may vary from run to run.
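If a deterministic order is needed, the set can simply be sorted. A minimal sketch, with small stand-ins for the 商品 columns:

```python
import pandas as pd

# Small stand-ins for the 商品 columns above.
s1 = pd.Series(['白菜', '土豆', '苹果'])
s2 = pd.Series(['白菜', '土豆', '蛋糕'])

# set iteration order is not guaranteed; sort for a stable result.
common = sorted(set(s1).intersection(set(s2)))
print(common)
```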


Equality

df.equals

You can also use df1.equals(df2) to test whether two objects contain the same elements.

This allows two Series or DataFrames to be compared against each other to see whether they have the same shape and elements. NaNs in the same location are considered equal. The column headers do not need to have the same type, but the elements within the columns must have the same dtype.


Examples

The following DataFrames are used to illustrate how to compare df or Series objects:

df = pd.DataFrame({1: [10], 2: [20]})
# print(df)
#     1   2
# 0  10  20
#
# print(df.columns.dtype)
# int64
#
# print(df.values.dtype)
# int64

diff_col_dtype = pd.DataFrame({1.0: [10], 2.0: [20]})
# print(diff_col_dtype)
#    1.0  2.0
# 0   10   20
#
# print(diff_col_dtype.columns.dtype)
# float64
#
# print(diff_col_dtype.values.dtype)
# int64

diff_data_dtype = pd.DataFrame({1: [10.0], 2: [20.0]})
# print(diff_data_dtype)
#       1     2
# 0  10.0  20.0
#
# print(diff_data_dtype.columns.dtype)
# int64
#
# print(diff_data_dtype.values.dtype)
# float64

  • df has elements and column labels of the same type and value as itself, so this returns True:
print(df.equals(df))
# True
  • dfdiff_col_dtype 具有相同的元素类型和值,但列标签具有不同的类型,它们仍将返回True
print(df.equals(diff_col_dtype))
# True
  • dfdiff_data_dtype 为其元素的相同值具有不同的类型,即使它们的列标签具有相同的值和类型,它们也将返回 False
print(df.equals(diff_data_dtype))
# False
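A related caveat worth keeping in mind (my addition, not from the original): equals treats NaN at the same position as equal, while element-wise == follows the usual NaN != NaN semantics:

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'x': [1.0, np.nan]})
df_b = pd.DataFrame({'x': [1.0, np.nan]})

# equals treats NaN at the same position as equal...
same = df_a.equals(df_b)

# ...while element-wise == follows the usual NaN != NaN semantics.
elementwise = bool((df_a == df_b).all().all())
print(same, elementwise)
```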

Alignment

df.align

align does not combine two DataFrames, but it can line them up.
If you want to make sure the rows and/or columns of two DataFrames are arranged the same way, without changing any of the data they contain, use DataFrame.align(). Series also supports this method.


Examples

The following two df are used to illustrate how to align DataFrames:

df1 = pd.DataFrame([[1,2,3,4], [6,7,8,9]],
                   columns=['D', 'B', 'E', 'A'],
                   index=[1,2])
#    D  B  E  A
# 1  1  2  3  4
# 2  6  7  8  9

df2 = pd.DataFrame([[10,20,30,40], [60,70,80,90], [600,700,800,900]],
                   columns=['A', 'B', 'C', 'D'],
                   index=[2,3,4])
#      A    B    C    D
# 2   10   20   30   40
# 3   60   70   80   90
# 4  600  700  800  900
  • Align on columns (axis=1) and perform an outer join on the column labels (join='outer'):
a1, a2 = df1.align(df2, join='outer', axis=1)

print(a1)
#    A  B   C  D  E
# 1  4  2 NaN  1  3
# 2  9  7 NaN  6  8

print(a2)
#      A    B    C    D   E
# 2   10   20   30   40 NaN
# 3   60   70   80   90 NaN
# 4  600  700  800  900 NaN

A few things to note here:

  • The columns in df1 have been reordered so that they line up with the columns in df2.

  • A column labeled C has been added to df1, and a column labeled E has been added to df2. These columns are filled with NaN, because we performed an outer join on the column labels.

  • None of the values inside the DataFrames have changed.

  • The row labels are not aligned: df2 has rows 3 and 4 while df1 does not, because we asked to align the columns only (axis=1).

  • Align on both rows and columns, but change the join parameter to right:

a1, a2 = df1.align(df2, join='right', axis=None)

print(a1)
#      A    B   C    D
# 2  9.0  7.0 NaN  6.0
# 3  NaN  NaN NaN  NaN
# 4  NaN  NaN NaN  NaN

print(a2)
#      A    B    C    D
# 2   10   20   30   40
# 3   60   70   80   90
# 4  600  700  800  900
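align also accepts a fill_value parameter, which replaces the NaN placeholders that alignment introduces. A minimal sketch with made-up frames:

```python
import pandas as pd

d1 = pd.DataFrame([[1, 2]], columns=['D', 'B'], index=[1])
d2 = pd.DataFrame([[10, 30]], columns=['B', 'C'], index=[2])

# fill_value replaces the NaN placeholders that alignment introduces.
a1, a2 = d1.align(d2, join='outer', axis=1, fill_value=0)
print(a1)
print(a2)
```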

Combining

pandas provides many powerful features for combining pd.Series and pd.DataFrame 51.

The following df and Series are used to illustrate how to combine DataFrames:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
#     A   B   C   D
# 0  A0  B0  C0  D0
# 1  A1  B1  C1  D1
# 2  A2  B2  C2  D2
# 3  A3  B3  C3  D3

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
#     A   B   C   D
# 4  A4  B4  C4  D4
# 5  A5  B5  C5  D5
# 6  A6  B6  C6  D6
# 7  A7  B7  C7  D7

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])
#       A    B    C    D
# 8    A8   B8   C8   D8
# 9    A9   B9   C9   D9
# 10  A10  B10  C10  D10
# 11  A11  B11  C11  D11

df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])
#     B   D   F
# 2  B2  D2  F2
# 3  B3  D3  F3
# 6  B6  D6  F6
# 7  B7  D7  F7

df5 = pd.DataFrame({'A': ['A2', 'A3', 'A4', 'A7'],
                    'B': ['B2', 'B3', 'B4', 'B7'],
                    'C': ['C2', 'C3', 'C4', 'C7'],
                    'D': ['D2', 'D3', 'D4', 'D7']},
                   index=[2, 3, 4, 7])
#     A   B   C   D
# 2  A2  B2  C2  D2
# 3  A3  B3  C3  D3
# 4  A4  B4  C4  D4
# 7  A7  B7  C7  D7

s1 = pd.Series([0, 1, 2, 3], name='foo')
s2 = pd.Series([0, 1, 2, 3])
s3 = pd.Series([0, 1, 4, 5])

pd.concat

Usage

pd.concat(
    objs: Iterable[NDFrame] | Mapping[Hashable, NDFrame],
    axis: Axis = 0,
    join: str = "outer",
    ignore_index: bool = False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity: bool = False,
    sort: bool = False,
    copy: bool = True
)
  • objs — the contents of objs must be Series/DataFrame objects. To concatenate a plain list, convert it to a Series first (pd.Series(list1)); for vertical concatenation, plain lists can simply be joined with + 52 53.
  • axis — the axis to concatenate along. axis=0 concatenates vertically (adds length, not width); axis=1 concatenates horizontally (adds width, not length).
  • join — {'inner', 'outer'}: how to handle the index on the other axis. outer takes the union, inner takes the intersection.
  • ignore_index — True ignores the original index values and uses 0, 1, …, n-1 instead.
  • copy — whether to copy the data.
  • keys — specify the outermost level of the resulting hierarchical index.

Examples

1. Combining multiple df

df1df2df3 定义了同样的列名和不同的 index。然后他们放在 frames 中构成了一个 dflist,将其作为参数传入 concat 就可以进行 df 的合并。

In code:

frames = [df1, df2, df3]
result = pd.concat(frames)
#       A    B    C    D
# 0    A0   B0   C0   D0
# 1    A1   B1   C1   D1
# 2    A2   B2   C2   D2
# 3    A3   B3   C3   D3
# 4    A4   B4   C4   D4
# 5    A5   B5   C5   D5
# 6    A6   B6   C6   D6
# 7    A7   B7   C7   D7
# 8    A8   B8   C8   D8
# 9    A9   B9   C9   D9
# 10  A10  B10  C10  D10
# 11  A11  B11  C11  D11

columns = result.columns
# Index(['A', 'B', 'C', 'D'], dtype='object')
index = result.index
# Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')

As a schematic:

(figure omitted)


2. Combining multiple df in a specified way

keys assigns a key to each frame in frames; selecting one of these outer keys afterwards returns the corresponding frame.
In code:

frames = [df1, df2, df3]
result = pd.concat(frames, keys=['x', 'y', 'z'])
#         A    B    C    D
# x 0    A0   B0   C0   D0
#   1    A1   B1   C1   D1
#   2    A2   B2   C2   D2
#   3    A3   B3   C3   D3
# y 4    A4   B4   C4   D4
#   5    A5   B5   C5   D5
#   6    A6   B6   C6   D6
#   7    A7   B7   C7   D7
# z 8    A8   B8   C8   D8
#   9    A9   B9   C9   D9
#   10  A10  B10  C10  D10
#   11  A11  B11  C11  D11

columns = result.columns
# Index(['A', 'B', 'C', 'D'], dtype='object')
index = result.index
# MultiIndex([('x',  0),
#             ('x',  1),
#             ('x',  2),
#             ('x',  3),
#             ('y',  4),
#             ('y',  5),
#             ('y',  6),
#             ('y',  7),
#             ('z',  8),
#             ('z',  9),
#             ('z', 10),
#             ('z', 11)],
#            )

newframe = result.loc['y']
#     A   B   C   D
# 4  A4  B4  C4  D4
# 5  A5  B5  C5  D5
# 6  A6  B6  C6  D6
# 7  A7  B7  C7  D7

As a schematic:

(figure omitted)

The concatenation axis defaults to 0, i.e. rows are stacked. To concatenate along columns, specify axis=1.
In code:

frames = [df1, df2, df3]
result = pd.concat(frames, axis=1, keys=['x', 'y', 'z'], sort=False)
#       x                   y                   z
#       A    B    C    D    A    B    C    D    A    B    C    D
# 0    A0   B0   C0   D0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
# 1    A1   B1   C1   D1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
# 2    A2   B2   C2   D2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
# 3    A3   B3   C3   D3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
# 4   NaN  NaN  NaN  NaN   A4   B4   C4   D4  NaN  NaN  NaN  NaN
# 5   NaN  NaN  NaN  NaN   A5   B5   C5   D5  NaN  NaN  NaN  NaN
# 6   NaN  NaN  NaN  NaN   A6   B6   C6   D6  NaN  NaN  NaN  NaN
# 7   NaN  NaN  NaN  NaN   A7   B7   C7   D7  NaN  NaN  NaN  NaN
# 8   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   A8   B8   C8   D8
# 9   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   A9   B9   C9   D9
# 10  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  A10  B10  C10  D10
# 11  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  A11  B11  C11  D11

columns = result.columns
# MultiIndex([('x', 'A'),
#             ('x', 'B'),
#             ('x', 'C'),
#             ('x', 'D'),
#             ('y', 'A'),
#             ('y', 'B'),
#             ('y', 'C'),
#             ('y', 'D'),
#             ('z', 'A'),
#             ('z', 'B'),
#             ('z', 'C'),
#             ('z', 'D')],
#            )
index = result.index
# Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')

newframe = result.loc[:, 'y']
#       A    B    C    D
# 0   NaN  NaN  NaN  NaN
# 1   NaN  NaN  NaN  NaN
# 2   NaN  NaN  NaN  NaN
# 3   NaN  NaN  NaN  NaN
# 4    A4   B4   C4   D4
# 5    A5   B5   C5   D5
# 6    A6   B6   C6   D6
# 7    A7   B7   C7   D7
# 8   NaN  NaN  NaN  NaN
# 9   NaN  NaN  NaN  NaN
# 10  NaN  NaN  NaN  NaN
# 11  NaN  NaN  NaN  NaN

The default join method is join='outer': after combining, positions where an index does not exist are filled with NaN. With join='inner', only rows whose index appears in both are shown.

frames = [df1, df4]
result = pd.concat(frames, axis=1)
#      A    B    C    D    B    D    F
# 0   A0   B0   C0   D0  NaN  NaN  NaN
# 1   A1   B1   C1   D1  NaN  NaN  NaN
# 2   A2   B2   C2   D2   B2   D2   F2
# 3   A3   B3   C3   D3   B3   D3   F3
# 6  NaN  NaN  NaN  NaN   B6   D6   F6
# 7  NaN  NaN  NaN  NaN   B7   D7   F7
columns = result.columns
# Index(['A', 'B', 'C', 'D', 'B', 'D', 'F'], dtype='object')
index = result.index
# Int64Index([0, 1, 2, 3, 6, 7], dtype='int64')

result = pd.concat([df1, df4], axis=1, join='inner')
#     A   B   C   D   B   D   F
# 2  A2  B2  C2  D2  B2  D2  F2
# 3  A3  B3  C3  D3  B3  D3  F3
columns = result.columns
# Index(['A', 'B', 'C', 'D', 'B', 'D', 'F'], dtype='object')
index = result.index
# Int64Index([2, 3], dtype='int64')

As a schematic:

(figure omitted)

If you want to keep the index of one of the original frames after combining, use reindex:

result = pd.concat(frames, axis=1).reindex(df1.index)
#     A   B   C   D    B    D    F
# 0  A0  B0  C0  D0  NaN  NaN  NaN
# 1  A1  B1  C1  D1  NaN  NaN  NaN
# 2  A2  B2  C2  D2   B2   D2   F2
# 3  A3  B3  C3  D3   B3   D3   F3
columns = result.columns
# Index(['A', 'B', 'C', 'D', 'B', 'D', 'F'], dtype='object')
index = result.index
# Int64Index([0, 1, 2, 3], dtype='int64')
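The ignore_index parameter from the table above also deserves a quick example: it discards the original (possibly overlapping) indexes and renumbers the rows. A minimal sketch with made-up frames:

```python
import pandas as pd

df_a = pd.DataFrame({'A': ['A0', 'A1']})
df_b = pd.DataFrame({'A': ['A2', 'A3']})

# ignore_index=True discards the (overlapping) original indexes
# and renumbers the rows 0..n-1.
result = pd.concat([df_a, df_b], ignore_index=True)
print(result)
```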

3. Combining with Series

A df and a Series can be combined:

result = pd.concat([df1, s1], axis=1)
#     A   B   C   D  foo
# 0  A0  B0  C0  D0    0
# 1  A1  B1  C1  D1    1
# 2  A2  B2  C2  D2    2
# 3  A3  B3  C3  D3    3

columns = result.columns
# Index(['A', 'B', 'C', 'D', 'foo'], dtype='object')
index = result.index
# Int64Index([0, 1, 2, 3], dtype='int64')

With multiple Series, concat can assign column names via keys:

result = pd.concat([s1, s2, s3], axis=1, keys=['red', 'blue', 'yellow'])
#    red  blue  yellow
# 0    0     0       0
# 1    1     1       1
# 2    2     2       4
# 3    3     3       5

columns = result.columns
# Index(['red', 'blue', 'yellow'], dtype='object')
index = result.index
# RangeIndex(start=0, stop=4, step=1)

pd.merge

df 最类似的就是数据库的表格,可以使用 merge 来进行类似数据库操作的 df 合并操作。


Usage

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)
  • left, right — the two df or Series to merge.
  • on — the column or index level names to join on.
  • left_on — the column(s) or index level(s) of the left df to join on.
  • right_on — the column(s) or index level(s) of the right df to join on.
  • left_index — use the index of the left df as its join key.
  • right_index — use the index of the right df as its join key.
  • how — the join method: left, right, outer, or inner. Defaults to inner.
  • sort — whether to sort the result.
  • suffixes — suffixes for handling overlapping column names.
  • copy — whether to copy the data.
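The suffixes parameter deserves a quick illustration: when the two frames share non-key column names, the suffixes are appended to disambiguate them. A minimal sketch with made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['K0', 'K1'], 'val': [10, 20]})

# The overlapping non-key column 'val' gets the suffixes appended
# (the default would be '_x' / '_y').
result = pd.merge(left, right, on='key', suffixes=('_left', '_right'))
print(result)
```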

Examples

1. on
Single column

Merge two df on the common values of the specified key column.
In code:

left = pd.DataFrame(
    {'key': ['K0', 'K1', 'K2', 'K3'],
     'A': ['A0', 'A1', 'A2', 'A3'],
     'B': ['B0', 'B1', 'B2', 'B3']}
)
#   key   A   B
# 0  K0  A0  B0
# 1  K1  A1  B1
# 2  K2  A2  B2
# 3  K3  A3  B3

right = pd.DataFrame(
    {'key': ['K0', 'K1', 'K2', 'K3'],
     'C': ['C0', 'C1', 'C2', 'C3'],
     'D': ['D0', 'D1', 'D2', 'D3']}
)
#   key   C   D
# 0  K0  C0  D0
# 1  K1  C1  D1
# 2  K2  C2  D2
# 3  K3  C3  D3

result = pd.merge(left, right, on='key')
#   key   A   B   C   D
# 0  K0  A0  B0  C0  D0
# 1  K1  A1  B1  C1  D1
# 2  K2  A2  B2  C2  D2
# 3  K3  A3  B3  C3  D3

columns = result.columns
# Index(['key', 'A', 'B', 'C', 'D'], dtype='object')

index = result.index
# Int64Index([0, 1, 2, 3], dtype='int64')

As a schematic:

(figure omitted)

Multiple columns

Next, an example joining on multiple keys.
The two df are merged on the common values of the two specified columns.
In code:

left = pd.DataFrame(
    {
        'key1': ['K0', 'K0', 'K1', 'K2'],
        'key2': ['K0', 'K1', 'K0', 'K1'],
        'A': ['A0', 'A1', 'A2', 'A3'],
        'B': ['B0', 'B1', 'B2', 'B3']
     }
)
#   key1 key2   A   B
# 0   K0   K0  A0  B0
# 1   K0   K1  A1  B1
# 2   K1   K0  A2  B2
# 3   K2   K1  A3  B3

right = pd.DataFrame(
    {
        'key1': ['K0', 'K1', 'K1', 'K2'],
        'key2': ['K0', 'K0', 'K0', 'K0'],
        'C': ['C0', 'C1', 'C2', 'C3'],
        'D': ['D0', 'D1', 'D2', 'D3']
    }
)
#   key1 key2   C   D
# 0   K0   K0  C0  D0
# 1   K1   K0  C1  D1
# 2   K1   K0  C2  D2
# 3   K2   K0  C3  D3

result = pd.merge(left, right, on=['key1', 'key2'])
#   key1 key2   A   B   C   D
# 0   K0   K0  A0  B0  C0  D0
# 1   K1   K0  A2  B2  C1  D1
# 2   K1   K0  A2  B2  C2  D2

columns = result.columns
# Index(['key1', 'key2', 'A', 'B', 'C', 'D'], dtype='object')

index = result.index
# Int64Index([0, 1, 2], dtype='int64')

As a schematic:

(figure omitted)


Merge two df on the common values of the index level named key.
In code:

leftindex = pd.MultiIndex.from_tuples(
    [('K0', 'X0'), ('K0', 'X1'),('K1', 'X2')],
    names=['key', 'X']
)
left = pd.DataFrame(
    {
        'A': ['A0', 'A1', 'A2'],
        'B': ['B0', 'B1', 'B2']
     },
    index=leftindex
)
#          A   B
# key X
# K0  X0  A0  B0
#     X1  A1  B1
# K1  X2  A2  B2
leftcolumns = left.columns
# Index(['A', 'B'], dtype='object')
leftindex = left.index
# MultiIndex([('K0', 'X0'),
#             ('K0', 'X1'),
#             ('K1', 'X2')],
#            names=['key', 'X'])
leftindexnames = left.index.names
# ['key', 'X']
leftkey = left.index.get_level_values(level='key')
# Index(['K0', 'K0', 'K1'], dtype='object', name='key')


rightindex = pd.MultiIndex.from_tuples(
    [('K0', 'Y0'), ('K1', 'Y1'),('K2', 'Y2'), ('K2', 'Y3')],
    names=['key', 'Y']
)
right = pd.DataFrame(
    {
        'C': ['C0', 'C1', 'C2', 'C3'],
        'D': ['D0', 'D1', 'D2', 'D3']},
    index=rightindex
)
#          C   D
# key Y
# K0  Y0  C0  D0
# K1  Y1  C1  D1
# K2  Y2  C2  D2
#     Y3  C3  D3
rightcolumns = right.columns
# Index(['C', 'D'], dtype='object')
rightindex = right.index
# MultiIndex([('K0', 'Y0'),
#             ('K1', 'Y1'),
#             ('K2', 'Y2'),
#             ('K2', 'Y3')],
#            names=['key', 'Y'])
rightindexnames = right.index.names
# ['key', 'Y']
rightkey = right.index.get_level_values(level='key')
# Index(['K0', 'K1', 'K2', 'K2'], dtype='object', name='key')


expectedcommonkeys = list(set(list(leftkey)).intersection(set(list(rightkey))))
# ['K0', 'K1']
result = pd.merge(
    left.reset_index(),
    right.reset_index(),
    on=['key'],
    how='inner'
).set_index(['key', 'X', 'Y'])
#             A   B   C   D
# key X  Y
# K0  X0 Y0  A0  B0  C0  D0
#     X1 Y0  A1  B1  C0  D0
# K1  X2 Y1  A2  B2  C1  D1
columns = result.columns
# Index(['A', 'B', 'C', 'D'], dtype='object')
index = result.index
# MultiIndex([('K0', 'X0', 'Y0'),
#             ('K0', 'X1', 'Y0'),
#             ('K1', 'X2', 'Y1')],
#            names=['key', 'X', 'Y'])
realcommonkeys = (result.index.get_level_values(level='key')).unique()
# Index(['K0', 'K1'], dtype='object', name='key')

As a schematic:

(figure omitted)


Rows and columns

Merge two df on the common values of the specified row/column indexes.
In code:

leftindex = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')
left = pd.DataFrame(
    {
        'A': ['A0', 'A1', 'A2', 'A3'],
        'B': ['B0', 'B1', 'B2', 'B3'],
        'key2': ['K0', 'K1', 'K0', 'K1']
    },
    index=leftindex)
#        A   B key2
# key1
# K0    A0  B0   K0
# K0    A1  B1   K1
# K1    A2  B2   K0
# K2    A3  B3   K1
leftcolumns = left.columns
# Index(['A', 'B', 'key2'], dtype='object')
leftindex = left.index
# Index(['K0', 'K0', 'K1', 'K2'], dtype='object', name='key1')

rightindex = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')
right = pd.DataFrame(
    {
        'C': ['C0', 'C1', 'C2', 'C3'],
        'D': ['D0', 'D1', 'D2', 'D3'],
        'key2': ['K0', 'K0', 'K0', 'K1']
    },
    index=rightindex
)
#        C   D key2
# key1
# K0    C0  D0   K0
# K1    C1  D1   K0
# K2    C2  D2   K0
# K2    C3  D3   K1
rightcolumns = right.columns
# Index(['C', 'D', 'key2'], dtype='object')
rightindex = right.index
# Index(['K0', 'K1', 'K2', 'K2'], dtype='object', name='key1')

leftkeys = left['key2'].reset_index()
rightkeys = right['key2'].reset_index()
expectedcommonkeys = pd.merge(leftkeys, rightkeys, how='inner')
#   key1 key2
# 0   K0   K0
# 1   K1   K0
# 2   K2   K1

result = pd.merge(
    left,
    right,
    on=['key1', 'key2'],
)
#        A   B key2   C   D
# key1
# K0    A0  B0   K0  C0  D0
# K1    A2  B2   K0  C1  D1
# K2    A3  B3   K1  C3  D3
columns = result.columns
# Index(['A', 'B', 'key2', 'C', 'D'], dtype='object')
index = result.index
# Index(['K0', 'K1', 'K2'], dtype='object', name='key1')
realcommonkeys = (result['key2'].reset_index())
#   key1 key2
# 0   K0   K0
# 1   K1   K0
# 2   K2   K1

As a schematic:

(figure omitted)


2. how

how specifies the merge method; as in a database, you can choose an inner join, outer join, and so on.
With how='left', the two df are merged based on the specified columns of the left df, keeping all of its rows.

left = pd.DataFrame(
    {
        'key1': ['K0', 'K0', 'K1', 'K2'],
        'key2': ['K0', 'K1', 'K0', 'K1'],
        'A': ['A0', 'A1', 'A2', 'A3'],
        'B': ['B0', 'B1', 'B2', 'B3']
     }
)
#   key1 key2   A   B
# 0   K0   K0  A0  B0
# 1   K0   K1  A1  B1
# 2   K1   K0  A2  B2
# 3   K2   K1  A3  B3

right = pd.DataFrame(
    {
        'key1': ['K0', 'K1', 'K1', 'K2'],
        'key2': ['K0', 'K0', 'K0', 'K0'],
        'C': ['C0', 'C1', 'C2', 'C3'],
        'D': ['D0', 'D1', 'D2', 'D3']
    }
)
#   key1 key2   C   D
# 0   K0   K0  C0  D0
# 1   K1   K0  C1  D1
# 2   K1   K0  C2  D2
# 3   K2   K0  C3  D3

result = pd.merge(left, right, how='left', on=['key1', 'key2'])
#   key1 key2   A   B    C    D
# 0   K0   K0  A0  B0   C0   D0
# 1   K0   K1  A1  B1  NaN  NaN
# 2   K1   K0  A2  B2   C1   D1
# 3   K1   K0  A2  B2   C2   D2
# 4   K2   K1  A3  B3  NaN  NaN

columns = result.columns
# Index(['key1', 'key2', 'A', 'B', 'C', 'D'], dtype='object')

index = result.index
# Int64Index([0, 1, 2, 3, 4], dtype='int64')

Passing indicator=True adds a column that shows how each row was joined:

result = pd.merge(left, right, how='left', on=['key1', 'key2'], indicator=True)
#   key1 key2   A   B    C    D     _merge
# 0   K0   K0  A0  B0   C0   D0       both
# 1   K0   K1  A1  B1  NaN  NaN  left_only
# 2   K1   K0  A2  B2   C1   D1       both
# 3   K1   K0  A2  B2   C2   D2       both
# 4   K2   K1  A3  B3  NaN  NaN  left_only

columns = result.columns
# Index(['key1', 'key2', 'A', 'B', 'C', 'D', '_merge'], dtype='object')

index = result.index
# Int64Index([0, 1, 2, 3, 4], dtype='int64')

If a string is passed to indicator instead, the indicator column is renamed accordingly:

result = pd.merge(left, right, how='left', on=['key1', 'key2'], indicator='comment')
#   key1 key2   A   B    C    D    comment
# 0   K0   K0  A0  B0   C0   D0       both
# 1   K0   K1  A1  B1  NaN  NaN  left_only
# 2   K1   K0  A2  B2   C1   D1       both
# 3   K1   K0  A2  B2   C2   D2       both
# 4   K2   K1  A3  B3  NaN  NaN  left_only

columns = result.columns
# Index(['key1', 'key2', 'A', 'B', 'C', 'D', 'comment'], dtype='object')

index = result.index
# Int64Index([0, 1, 2, 3, 4], dtype='int64')
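The prose above mentions inner and outer joins but demonstrates only how='left'. As a quick sketch on the same two frames (redefined here so the snippet runs standalone), how='outer' also keeps the right-only rows:

```python
import pandas as pd

left = pd.DataFrame(
    {
        'key1': ['K0', 'K0', 'K1', 'K2'],
        'key2': ['K0', 'K1', 'K0', 'K1'],
        'A': ['A0', 'A1', 'A2', 'A3'],
        'B': ['B0', 'B1', 'B2', 'B3']
    }
)
right = pd.DataFrame(
    {
        'key1': ['K0', 'K1', 'K1', 'K2'],
        'key2': ['K0', 'K0', 'K0', 'K0'],
        'C': ['C0', 'C1', 'C2', 'C3'],
        'D': ['D0', 'D1', 'D2', 'D3']
    }
)

# how='outer' keeps every key combination from either frame;
# unmatched rows are padded with NaN on the missing side
result = pd.merge(left, right, how='outer', on=['key1', 'key2'], indicator=True)
# 6 rows in total: the 5 rows of the how='left' result above,
# plus one right-only row (K2, K0, NaN, NaN, C3, D3)

counts = result['_merge'].value_counts()
# both          3
# left_only     2
# right_only    1
```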

df.join

Usage

DataFrame.join(
        other: DataFrame | Series,
        on: IndexLabel | None = None,
        how: str = "left",
        lsuffix: str = "",
        rsuffix: str = "",
        sort: bool = False,
        )

Example

By default, the two df objects are merged on their row indexes, keeping all rows of left, the calling df.
In code:

left = pd.DataFrame(
    {
        'A': ['A0', 'A1', 'A2'],
        'B': ['B0', 'B1', 'B2']
    },
    index=['K0', 'K1', 'K2']
)
#      A   B
# K0  A0  B0
# K1  A1  B1
# K2  A2  B2

right = pd.DataFrame(
    {
        'C': ['C0', 'C2', 'C3'],
        'D': ['D0', 'D2', 'D3']
    },
    index=['K0', 'K2', 'K3']
)
#      C   D
# K0  C0  D0
# K2  C2  D2
# K3  C3  D3

expectedcolumns = set(left.columns.to_list()) | set(right.columns.to_list())
# {'B', 'A', 'C', 'D'}
expectedindex = set(left.index.to_list()) | (set(left.index.to_list()) & set(right.index.to_list()))
# {'K1', 'K2', 'K0'}

result = left.join(right)
#      A   B    C    D
# K0  A0  B0   C0   D0
# K1  A1  B1  NaN  NaN
# K2  A2  B2   C2   D2

columns = result.columns
# Index(['A', 'B', 'C', 'D'], dtype='object')
index = result.index
# Index(['K0', 'K1', 'K2'], dtype='object')

Illustrated in the diagram below:

(image placeholder)

1. how

The how parameter controls how the row indexes are joined.
For example, an outer join (how='outer') merges on the union of the indexes of left and right.


left = pd.DataFrame(
    {
        'A': ['A0', 'A1', 'A2'],
        'B': ['B0', 'B1', 'B2']
    },
    index=['K0', 'K1', 'K2']
)
#      A   B
# K0  A0  B0
# K1  A1  B1
# K2  A2  B2

right = pd.DataFrame(
    {
        'C': ['C0', 'C2', 'C3'],
        'D': ['D0', 'D2', 'D3']
    },
    index=['K0', 'K2', 'K3']
)
#      C   D
# K0  C0  D0
# K2  C2  D2
# K3  C3  D3

expectedcolumns = set(left.columns.to_list()) | set(right.columns.to_list())
# {'C', 'B', 'D', 'A'}
expectedindex = set(left.index.to_list()) | set(right.index.to_list())
# {'K2', 'K3', 'K0', 'K1'}

result = left.join(right, how='outer')
#       A    B    C    D
# K0   A0   B0   C0   D0
# K1   A1   B1  NaN  NaN
# K2   A2   B2   C2   D2
# K3  NaN  NaN   C3   D3

columns = result.columns
# Index(['A', 'B', 'C', 'D'], dtype='object')
index = result.index
# Index(['K0', 'K1', 'K2', 'K3'], dtype='object')
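Conversely, how='inner' keeps only the index labels present in both frames. A minimal sketch on the same left and right (redefined so the snippet is self-contained):

```python
import pandas as pd

left = pd.DataFrame(
    {'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']},
    index=['K0', 'K1', 'K2']
)
right = pd.DataFrame(
    {'C': ['C0', 'C2', 'C3'], 'D': ['D0', 'D2', 'D3']},
    index=['K0', 'K2', 'K3']
)

# how='inner' keeps only the intersection of the two indexes
result = left.join(right, how='inner')
#      A   B   C   D
# K0  A0  B0  C0  D0
# K2  A2  B2  C2  D2
```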

2. on
Single column

The two df objects are merged on the values shared in the specified column k (pd.merge is used here, since both frames carry the key as a regular column).
In code:

left = pd.DataFrame(
    {
        'k': ['K0', 'K1', 'K2'],
        'v': [1, 2, 3]
    }
)

right = pd.DataFrame(
    {
        'k': ['K0', 'K0', 'K3'],
        'v': [4, 5, 6]
    }
)

expectedindex = set(left['k'].to_list()) & set(right['k'].to_list())
# {'K0'}
result = pd.merge(left, right, on='k')
#     k  v_x  v_y
# 0  K0    1    4
# 1  K0    1    5
columns = result.columns
# Index(['k', 'v_x', 'v_y'], dtype='object')
index = result.index
# Int64Index([0, 1], dtype='int64')

也可以通过 suffixes 自定义重复列的列名,

left = pd.DataFrame(
    {
        'k': ['K0', 'K1', 'K2'],
        'v': [1, 2, 3]
    }
)

right = pd.DataFrame(
    {
        'k': ['K0', 'K0', 'K3'],
        'v': [4, 5, 6]
    }
)

expectedindex = set(left['k'].to_list()) & set(right['k'].to_list())
# {'K0'}

result = pd.merge(left, right, on='k', suffixes=('1_l', '2_r'))
#     k  v1_l  v2_r
# 0  K0     1     4
# 1  K0     1     5
columns = result.columns
# Index(['k', 'v1_l', 'v2_r'], dtype='object')
index = result.index
# Int64Index([0, 1], dtype='int64')
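For df.join itself, the analogous parameters are lsuffix and rsuffix (listed in the signature above). Since join aligns on the row index rather than a column, a minimal sketch with the same two frames looks like this:

```python
import pandas as pd

left = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]})
right = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]})

# join aligns on the default RangeIndex (0, 1, 2); the overlapping
# column names k and v must be disambiguated via lsuffix / rsuffix
result = left.join(right, lsuffix='_l', rsuffix='_r')
#   k_l  v_l k_r  v_r
# 0  K0    1  K0    4
# 1  K1    2  K0    5
# 2  K2    3  K3    6
```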

Rows and columns

With on, the left df's key column is matched against the right df's row index.
In code:

left = pd.DataFrame(
    {
        'A': ['A0', 'A1', 'A2', 'A3'],
        'B': ['B0', 'B1', 'B2', 'B3'],
        'key': ['K0', 'K1', 'K0', 'K1']
    }
)
#     A   B key
# 0  A0  B0  K0
# 1  A1  B1  K1
# 2  A2  B2  K0
# 3  A3  B3  K1

right = pd.DataFrame(
    {
        'C': ['C0', 'C1'],
        'D': ['D0', 'D1']
    },
    index=['K0', 'K1']
)
#      C   D
# K0  C0  D0
# K1  C1  D1

result = left.join(right, on='key')
#     A   B key   C   D
# 0  A0  B0  K0  C0  D0
# 1  A1  B1  K1  C1  D1
# 2  A2  B2  K0  C0  D0
# 3  A3  B3  K1  C1  D1

columns = result.columns
# Index(['A', 'B', 'key', 'C', 'D'], dtype='object')
index = result.index
# RangeIndex(start=0, stop=4, step=1)

Illustrated in the diagram below:

(image placeholder)


df.combine_first

Usage

obj1.combine_first(obj2)

Here obj1 is the calling object's dataset and obj2 is the argument's dataset.
The function uses the argument's data to "patch" the caller's missing data, i.e. it fills the NaN values in the caller with the corresponding values from the argument 54.


Example

NaN values in one df are replaced by the non-NaN values at the same positions in another df, as shown below:

a = pd.DataFrame(
    [
        [np.nan, 3., 5.],
        [-4.6, np.nan, np.nan],
        [np.nan, 7., np.nan]
    ]
)
#      0    1    2
# 0  NaN  3.0  5.0
# 1 -4.6  NaN  NaN
# 2  NaN  7.0  NaN

b = pd.DataFrame(
    [
        [-42.6, np.nan, -8.2],
        [-5., 1.6, 4]
    ],
    index=[1, 2]
)
#       0    1    2
# 1 -42.6  NaN -8.2
# 2  -5.0  1.6  4.0

result = a.combine_first(b)
#      0    1    2
# 0  NaN  3.0  5.0
# 1 -4.6  NaN -8.2
# 2 -5.0  7.0  4.0

Illustrated in the diagram below:

(image placeholder)

Alternatively, df.update can be used. Note that update modifies a in place (it returns None) and, unlike combine_first, it overwrites a's existing values wherever b holds non-NaN values:

a.update(b)
# `a` after the in-place update:
#       0    1    2
# 0   NaN  3.0  5.0
# 1 -42.6  NaN -8.2
# 2  -5.0  1.6  4.0
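To make the difference explicit: combine_first gives priority to the caller's existing values, while update gives priority to the argument's non-NaN values and works in place. A small sketch with hypothetical frames a and b:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[np.nan, 3.0], [-4.6, np.nan]])
b = pd.DataFrame([[1.0, 2.0], [5.0, 6.0]])

# combine_first: a's values win; b only fills a's NaN cells
filled = a.combine_first(b)
#      0    1
# 0  1.0  3.0
# 1 -4.6  6.0

# update: b's non-NaN values win; modifies a in place and returns None
a.update(b)
#      0    1
# 0  1.0  2.0
# 1  5.0  6.0
```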

References



  1. Pandas API reference ↩︎

  2. pandas 数据类型转换 ↩︎

  3. pandas中查看数据类型的几种方式 ↩︎

  4. 用于说明dtype的原始数据 ↩︎ ↩︎

  5. 7 ways to convert pandas DataFrame column to float ↩︎

  6. How to Convert Strings to Floats in Pandas DataFrame? ↩︎

  7. Converting string to float - python ↩︎

  8. pandas中去除指定字符 ↩︎

  9. pandas将dataframe的小数修改成百分比 ↩︎

  10. AttributeError: ‘DataFrame’ object has no attribute ‘str’ while trying to fix my dataframe ↩︎

  11. Pandas列百分数转化为小数 ↩︎ ↩︎

  12. 将某些浮动数据框列的格式设置为pandas的百分比 ↩︎

  13. pandas DataFrame.fillna()填充缺失函数的使用 ↩︎

  14. Pandas将字典dict转为Dataframe的3种方法总结 ↩︎

  15. Pandas将字典dict转为Dataframe的3种方法总结 ↩︎

  16. pandas.DataFrame.from_dict ↩︎

  17. Python将DataFrame的某一列作为index ↩︎

  18. Dataframe 索引、切片及技巧 ↩︎

  19. Pandas索引排序 df.sort_index()的实现方法 ↩︎

  20. Pandas DataFrame reindex() Function ↩︎

  21. How to Reindex MultiIndex Dataframe on Multiple Levels? ↩︎

  22. pandas.DataFrame.drop ↩︎

  23. pandas.DataFrame.droplevel ↩︎

  24. 如何在Pandas中复制行? ↩︎

  25. python统计频数_Python Pandas计算特定值的出现次数 ↩︎

  26. pandas统计某一列中各个值的出现次数 ↩︎

  27. 如何向 Pandas DataFrame 添加行 ↩︎ ↩︎

  28. python pandas同时对多列进行赋值 ↩︎ ↩︎

  29. DataFrame 新增列的五种方法 ↩︎

  30. DataFrame数据的平移和绝对值方法小记 ↩︎ ↩︎

  31. pandas.DataFrame.drop_duplicates ↩︎

  32. How to set a cell to NaN in a pandas dataframe ↩︎

  33. pandas DataFrame数据重命名列名的几种方式 ↩︎

  34. 将Series转DataFrame并修改列名 ↩︎

  35. pandas中DataFrame修改index、columns名的方法 ↩︎ ↩︎

  36. pandas.DataFrame.divide ↩︎

  37. Python pandas.DataFrame.divide函数方法的使用 ↩︎

  38. DataFrame各列对应除以最后一列 ↩︎

  39. pandas dataframe select columns in multiindex [duplicate] ↩︎ ↩︎

  40. pandas dataframe select columns in multiindex [duplicate] ↩︎ ↩︎

  41. Python | Pandas | 筛选 ↩︎

  42. Pandas知识点-统计运算函数 ↩︎

  43. pandas获取最大值/最小值对应的index ↩︎

  44. pandas的妙用:快速定位最大值最小值 ↩︎

  45. pandas 数据对比 ↩︎

  46. 【python】对多个dataframe取交集 ↩︎

  47. Pandas中两个dataframe的交集和差集 ↩︎

  48. python中如何对三个、四个dataframe求交集? ↩︎

  49. 如何在两个pandas dataframe里找出某一列或者几列的值相同的行? ↩︎

  50. Python如何对N个列表求交集? ↩︎

  51. Pandas高级教程之:Dataframe的合并 ↩︎

  52. python–concat:series拼接 ↩︎

  53. python 两个series 合并 dataframe ↩︎

  54. pandas之combine_first合并重叠数据(修补) ↩︎
