This article is an incomplete summary of how to use Pandas 1.
Updated: 2023 / 08 / 05
Data Types
See here 2
Problems with data types usually only surface after something has already gone wrong, so with a bit of experience you learn to check the data types as soon as you get the data, to confirm they match the format you intend to work with; this avoids some awkward problems from the start. If preprocessing is needed, it is generally done through data type conversion.
This section introduces the data types in pandas (data types, i.e. the familiar dtype) and the correspondence between pandas and numpy types:
One thing to stress is that the object type in Pandas can actually cover several different types: a column that contains integers, floats, and strings alike is labelled object, so when processing the data you may need some extra steps to clean such fields up front, e.g. str.replace(), float(), int(), astype(), apply(), and so on 3.
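As a minimal sketch of that kind of cleanup (the values here are made up), a mixed column stored as object can be normalized before analysis:

```python
import pandas as pd

# A column mixing integers, floats, and strings is stored as object.
s = pd.Series([1, 2.5, "3"])
print(s.dtype)  # object

# Normalize: make everything a string, strip separators, then convert to float.
cleaned = s.astype(str).str.replace(",", "", regex=False).astype(float)
print(cleaned.tolist())  # [1.0, 2.5, 3.0]
print(cleaned.dtype)     # float64
```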
Next, the object, int64, float64, datetime64, and bool types are introduced; the category and timedelta types will be covered separately.
The raw data file is available here 4; its contents are shown below:
import pandas as pd

data = pd.read_csv('dtype_RawData.csv')
# The raw data read in looks like this:
#
# Customer Number Customer Name 2016 ... Month Day Active
# 0 1002 Quest industries ¥125,000.00 ... 1 2015 Y
# 1 552278 Smith Plumbing ¥920,000.00 ... 6 2014 Y
# 2 23477 ACME Industrial ¥50,000.00 ... 3 2016 Y
# 3 24900 Brekke LTD ¥350,000.00 ... 10 2015 Y
# 4 651029 Harbor Co ¥15,000.00 ... 2 2014 N
#
# [5 rows x 9 columns]
Inspecting the Data
- Which columns the data has, and the dtype of each column
print(data.columns)
# Index(['Customer Number', 'Customer Name', '2016', '2017', 'Percent Growth',
# 'Jan Units', 'Month', 'Day', 'Active'],
# dtype='object')
print(type(data.dtypes))
# <class 'pandas.core.series.Series'>
print(data.dtypes)
# Customer Number int64
# Customer Name object
# 2016 object
# 2017 object
# Percent Growth object
# Jan Units object
# Month int64
# Day int64
# Active object
# dtype: object
print(data.dtypes.index)
# Index(['Customer Number', 'Customer Name', '2016', '2017', 'Percent Growth',
# 'Jan Units', 'Month', 'Day', 'Active'],
# dtype='object')
print(data.dtypes.values)
# [dtype('int64') dtype('O') dtype('O') dtype('O') dtype('O') dtype('O')
# dtype('int64') dtype('int64') dtype('O')]
print(data.dtypes.index[4])
print(data.dtypes[data.dtypes.index[4]])
# Percent Growth
# object
The raw data has 9 columns, named Customer Number, Customer Name, 2016, 2017, Percent Growth, Jan Units, Month, Day, and Active.
The dtype of each column is as shown by data.dtypes.
- Get the basic information about the data (dimensions, column names, dtypes, memory usage, etc.)
data.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Customer Number 5 non-null int64
1 Customer Name 5 non-null object
2 2016 5 non-null object
3 2017 5 non-null object
4 Percent Growth 5 non-null object
5 Jan Units 5 non-null object
6 Month 5 non-null int64
7 Day 5 non-null int64
8 Active 5 non-null object
dtypes: int64(3), object(6)
memory usage: 488.0+ bytes
None
'''
Combining the information obtained above:
- The Customer Number column is int64, not float64.
- The 2016 and 2017 columns are object, not float64 or int64.
- Percent Growth and Jan Units are also object rather than numeric.
- Month and Day should be converted to the datetime64[ns] format.
- The Active column should be boolean.
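A related helper worth knowing is select_dtypes, which pulls out just the columns of a given dtype. A small sketch with a cut-down version of the data above:

```python
import pandas as pd

df = pd.DataFrame({"Customer Number": [1002, 552278],
                   "2016": ["¥125,000.00", "¥920,000.00"],
                   "Month": [1, 6]})

# Keep only the numeric columns, or only the object columns.
nums = df.select_dtypes(include="number")
objs = df.select_dtypes(include="object")
print(list(nums.columns))  # ['Customer Number', 'Month']
print(list(objs.columns))  # ['2016']
```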
Suppose we want the total of the 2016 and 2017 figures. We can try, but we will not get the answer we need: because these two columns have dtype object, the + operation just produces a longer string, as shown below:
print(data['2016'] + data['2017'])
'''
0 ¥125,000.00¥162,500.00
1 ¥920,000.00¥1,012,000.00
2 ¥50,000.00¥62,500.00
3 ¥350,000.00¥490,000.00
4 ¥15,000.00¥12,750.00
dtype: object
'''
- Get the dimensions of the data
print(data.shape)
# (5, 9)
This means the data has 5 rows and 9 columns.
Type Conversion
Without data cleaning it is hard to move on to the next step of analysis. For converting data formats, pandas has 3 commonly used approaches:
- astype() to force a dtype conversion
- a user-defined function to transform the data
- existing functions provided by pandas or numpy, such as to_numeric() and to_datetime()
astype
astype() can be used to convert the Day column to the floating-point float type,
data['Day'].astype("float")
print(data['Day'].dtypes)
# int64
However, this operation does not change the original DataFrame; it only returns a copy.
To really change the DataFrame, you usually assign the result back, as follows,
data['Day'] = data['Day'].astype("float")
print(data['Day'].dtypes)
# float64
astype basically has two uses: turning numbers into plain strings, and turning purely numeric strings into numbers.
Strings that contain other, non-numeric characters cannot be converted with astype, for example the 2016, 2017, Percent Growth, and Jan Units columns, whose dtype is object and whose values carry special symbols.
Other methods are therefore needed, which brings us to the user-defined-function approach below.
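A quick sketch of this failure mode, using made-up currency strings:

```python
import pandas as pd

s = pd.Series(["¥125,000.00", "¥920,000.00"])
try:
    s.astype("float")
    converted = True
except ValueError:
    converted = False  # the currency symbol and commas are not numeric
print(converted)  # False
```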
Custom Functions
The user-defined function below converts columns that contain special characters; applying it with the pandas apply function converts all the data in the 2016 column to the float64 type.
def converter(val):
    '''
    Strip the `¥` sign and the commas with `replace`, then convert the string to a float.
    '''
    new = val.replace(",", "").replace("¥", "")
    return float(new)
data['2016'] = data['2016'].apply(converter)
print(data['2016'])
print(data['2016'].dtype)
'''
0 125000.0
1 920000.0
2 50000.0
3 350000.0
4 15000.0
Name: 2016, dtype: float64
float64
'''
Of course, this fairly simple function can also be written in a single line with a lambda, as shown below:
data['2016'] = data['2016'].apply(lambda x: x.replace(",", "").replace("¥", "")).astype('float64')
print(data['2016'])
print(data['2016'].dtype)
'''
0 125000.0
1 920000.0
2 50000.0
3 350000.0
4 15000.0
Name: 2016, dtype: float64
float64
'''
Built-in Functions
numpy.where
Use numpy.where(condition, x, y) to convert the Active column to boolean values, as follows:
import numpy as np
print(data['Active'])
print(data['Active'].dtype)
'''
0 Y
1 Y
2 Y
3 Y
4 N
Name: Active, dtype: object
object
'''
data["Active"] = np.where(data["Active"] == "Y", True, False)
print(data['Active'])
print(data['Active'].dtype)
'''
0 True
1 True
2 True
3 True
4 False
Name: Active, dtype: bool
bool
'''
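As an alternative sketch, Series.map with a dict produces the same booleans for a clean Y/N column; note that, unlike np.where, values missing from the dict become NaN rather than False:

```python
import pandas as pd

active = pd.Series(["Y", "Y", "N"])
# Map each marker to a boolean; unmapped values would become NaN.
as_bool = active.map({"Y": True, "N": False})
print(as_bool.tolist())  # [True, True, False]
print(as_bool.dtype)     # bool
```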
pandas.to_numeric
print(data['Jan Units'])
print(data['Jan Units'].dtype)
'''
0 500
1 700
2 125
3 75
4 Closed
Name: Jan Units, dtype: object
object
'''
data['Jan Units'] = pd.to_numeric(data['Jan Units'], errors='coerce')
print(data['Jan Units'])
print(data['Jan Units'].dtype)
'''
0 500.0
1 700.0
2 125.0
3 75.0
4 NaN
Name: Jan Units, dtype: float64
float64
'''
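to_numeric also accepts a downcast argument to shrink the resulting dtype; a minimal sketch:

```python
import pandas as pd

s = pd.Series(["500", "700", "125"])
# Default conversion gives int64; downcast picks the smallest fitting integer type.
print(pd.to_numeric(s).dtype)                      # int64
print(pd.to_numeric(s, downcast="integer").dtype)  # int16
```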
pandas.to_datetime
data['Year'] = 2022
data['Start Date'] = pd.to_datetime(data[['Month', 'Day', 'Year']])
print(data['Start Date'])
print(data['Start Date'].dtype)
'''
0 2022-01-20
1 2022-06-14
2 2022-03-16
3 2022-10-15
4 2022-02-14
Name: Start Date, dtype: datetime64[ns]
datetime64[ns]
'''
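Besides the Month/Day/Year column form used above, pd.to_datetime can parse a string column directly, with an explicit format when the layout is known; a minimal sketch with made-up dates:

```python
import pandas as pd

s = pd.Series(["2022-01-20", "2022-06-14"])
# An explicit format avoids ambiguous parsing and speeds up conversion.
dates = pd.to_datetime(s, format="%Y-%m-%d")
print(dates.dtype)   # datetime64[ns]
print(dates[0].day)  # 20
```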
Formatting
Using the df below as an example, this section explains how to modify the format of the values inside a df:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(0, 1, 0.05).reshape(4, 5), columns=['Asia', 'Africa', 'Australia', 'Europe', 'Russia'])
# Asia Africa Australia Europe Russia
# 0 0.00 0.05 0.10 0.15 0.20
# 1 0.25 0.30 0.35 0.40 0.45
# 2 0.50 0.55 0.60 0.65 0.70
# 3 0.75 0.80 0.85 0.90 0.95
#
# Index(['Asia', 'Africa', 'Australia', 'Europe', 'Russia'], dtype='object')
#
# RangeIndex(start=0, stop=4, step=1)
#
# Asia float64
# Africa float64
# Australia float64
# Europe float64
# Russia float64
# dtype: object
Special Characters
See here 8
Percentages
Methods
Purpose | Method | Notes |
---|---|---|
number ➡️ percentage | df.applymap(lambda x: '%.2f%%' % (x*100)) | works on a dataframe, but not on a Series |
| df[col].apply(lambda x: '{:.2%}'.format(x)) or df[col].map(lambda n: '{:.2%}'.format(n)) | Series only |
percentage ➡️ number | df[col].str.strip('%').astype(float)/100 11 | Series only |
| df.replace('%$', '', regex=True).astype('float')/100.0 or df[['col']].replace('%$', '', regex=True).astype('float')/100.0 | works on a dataframe, but not on a Series |
Missing Values
Pandas treats the following as missing values: NaN, #N/A, nan, NULL, None.
Missing values can be handled with fillna() 13.
Syntax
pandas.DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
Parameter | Description |
---|---|
value | The value used to fill the missing entries. |
method | {'backfill', 'bfill', 'pad', 'ffill', None}, default None. Defines how missing entries are filled: pad / ffill fills a missing entry with the value from the preceding row/column; backfill / bfill fills it with the value from the following row/column. |
axis | The axis along which to fill. 0 or index fills along rows; 1 or columns fills along columns. |
inplace | Whether to fill in place. Boolean, default False. If True, the operation is performed on the original DataFrame and the return value is None. |
limit | int, default None. If method is specified, at most the first limit missing values of each contiguous run of missing values are filled (if there are several runs, each run is limited separately). If method is not specified, at most limit missing values are filled along the given axis (whether or not they are consecutive). |
downcast | dict, default None. Each item gives a type downcast rule; alternatively the string infer, in which case values are downcast between suitable equivalent types, e.g. float64 to int64 if possible. |
Examples
Construct a dataframe containing nan,
dic = {"a": [8, np.nan, np.nan, 6, 9], "b": pd.Series([np.nan, 6, np.nan, np.nan, 10]),
"c": ["qq", np.nan, np.nan, np.nan, 88]}
df = pd.DataFrame(dic)
# a b c
# 0 8.0 NaN qq
# 1 NaN 6.0 NaN
# 2 NaN NaN NaN
# 3 6.0 NaN NaN
# 4 9.0 10.0 88
Query which entries of df are missing, and which are not,
print(pd.isna(df))
# a b c
# 0 False True False
# 1 True False True
# 2 True True True
# 3 False True True
# 4 False False False
print(pd.notna(df))
# a b c
# 0 True False True
# 1 False True False
# 2 False False False
# 3 True False False
# 4 True True True
Now, fill the missing values in the specified ways.
df.fillna(value)
inplace defaults to False, meaning fillna() returns a new DataFrame and the original df stays unchanged. If set to True, the filling happens on the original df and None is returned.
df.fillna(value) replaces every nan in df with the single value and returns the result.
dffill10 = df.fillna(
10,
method=None,
axis=1,
limit=2,
inplace=False
)
# a b c
# 0 8.0 10 qq
# 1 10.0 6.0 10.0
# 2 10.0 10.0 NaN
# 3 6.0 10.0 10.0
# 4 9.0 10.0 88.0
df.fillna(dict)
Fill nan from a dict: the keys are column names, and the values are what gets filled into the corresponding columns.
dic = {"a": 0, "b": 1, "c": 2}
dffill10 = df.fillna(
dic,
axis=0,
limit=2,
inplace=False
)
# a b c
# 0 8.0 1.0 qq
# 1 0.0 6.0 2
# 2 0.0 1.0 2
# 3 6.0 NaN NaN
# 4 9.0 10.0 88
df.fillna(df)
Fill nan from another df.
filldf = pd.DataFrame(np.arange(10).reshape((5, 2)), columns=["a", "b"])
# a b
# 0 0 1
# 1 2 3
# 2 4 5
# 3 6 7
# 4 8 9
df = df.fillna(filldf, inplace=False)
# a b c
# 0 8.0 1.0 qq
# 1 2.0 6.0 NaN
# 2 4.0 5.0 NaN
# 3 6.0 7.0 NaN
# 4 9.0 10.0 88
For the columns whose names appear in both filldf and df, values from filldf are filled into the positions of df that hold nan.
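The method parameter described in the table above can be sketched on a similar frame; forward-filling propagates the previous valid value down each column (df.ffill() is the modern spelling of fillna(method='ffill')):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [8, np.nan, np.nan, 6], "b": [np.nan, 6, np.nan, np.nan]})
# Each NaN takes the last valid value above it; a leading NaN stays NaN.
filled = df.ffill()
print(filled["a"].tolist())  # [8.0, 8.0, 8.0, 6.0]
```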
Examples
Number -> Percentage
Series
df['Russia'] = df['Russia'].map(lambda n: '{:.2%}'.format(n))
# Asia Africa Australia Europe Russia
# 0 0.00 0.05 0.10 0.15 20.00%
# 1 0.25 0.30 0.35 0.40 45.00%
# 2 0.50 0.55 0.60 0.65 70.00%
# 3 0.75 0.80 0.85 0.90 95.00%
#
# Asia float64
# Africa float64
# Australia float64
# Europe float64
# Russia object
# dtype: object
Or,
df['Russia'] = df['Russia'].apply(lambda n: '{:.2%}'.format(n))
Or alternatively,
df[['Russia']] = df[['Russia']].applymap(lambda x: '%.2f%%' % (x*100))
df
df_percent = df.applymap(lambda x: '%.2f%%' % (x*100))
# Asia Africa Australia Europe Russia
# 0 0.00% 5.00% 10.00% 15.00% 20.00%
# 1 25.00% 30.00% 35.00% 40.00% 45.00%
# 2 50.00% 55.00% 60.00% 65.00% 70.00%
# 3 75.00% 80.00% 85.00% 90.00% 95.00%
#
# Asia object
# Africa object
# Australia object
# Europe object
# Russia object
# dtype: object
Percentage -> Number
Series
df['Russia'] = df['Russia'].str.strip('%').astype(float)/100
# Asia Africa Australia Europe Russia
# 0 0.00 0.05 0.10 0.15 0.20
# 1 0.25 0.30 0.35 0.40 0.45
# 2 0.50 0.55 0.60 0.65 0.70
# 3 0.75 0.80 0.85 0.90 0.95
#
# Asia float64
# Africa float64
# Australia float64
# Europe float64
# Russia float64
# dtype: object
df
df[['Russia', 'Europe']] = df[['Russia', 'Europe']].applymap(lambda x: '%.2f%%' % (x*100))
# Asia Africa Australia Europe Russia
# 0 0.00 0.05 0.10 15.00% 20.00%
# 1 0.25 0.30 0.35 40.00% 45.00%
# 2 0.50 0.55 0.60 65.00% 70.00%
# 3 0.75 0.80 0.85 90.00% 95.00%
df[['Russia', 'Europe']] = df[['Russia', 'Europe']].replace('%$', '', regex=True).astype('float')/100.0
# Asia Africa Australia Europe Russia
# 0 0.00 0.05 0.10 0.15 0.20
# 1 0.25 0.30 0.35 0.40 0.45
# 2 0.50 0.55 0.60 0.65 0.70
# 3 0.75 0.80 0.85 0.90 0.95
Or alternatively,
df_percent = df_percent.replace('%$', '', regex=True).astype('float') / 100.0
# Asia Africa Australia Europe Russia
# 0 0.00 0.05 0.10 0.15 0.20
# 1 0.25 0.30 0.35 0.40 0.45
# 2 0.50 0.55 0.60 0.65 0.70
# 3 0.75 0.80 0.85 0.90 0.95
#
# Asia float64
# Africa float64
# Australia float64
# Europe float64
# Russia float64
# dtype: object
Data Structures
Dictionary <-> df
See here 14
Dictionary -> df
Using the two dictionaries below as examples, this section explains how 15 to convert a dictionary into a dataframe structure,
dict1 = {'i': 1, 'love': 2, 'you': 3}
dict2 = {('country2', '2017'): {('prov_a', 'ind_a', 'EUR'): 5.0, ('prov_b', 'ind_b', 'EUR'): 7.0, ('prov_c', 'ind_c', 'EUR'): np.nan},
('country2', '2018'): {('prov_a', 'ind_a', 'EUR'): 8.0, ('prov_b', 'ind_b', 'EUR'): np.nan, ('prov_c', 'ind_c', 'EUR'): 15.0},
('country2', '2019'): {('prov_a', 'ind_a', 'EUR'): np.nan, ('prov_b', 'ind_b', 'EUR'): 5.0, ('prov_c', 'ind_c', 'EUR'): 60.0}}
pd.DataFrame
Examples
- Regarding dict1
df1 = pd.DataFrame(dict1)
# File "/Applications/miniconda3/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 664, in _extract_index
# raise ValueError("If using all scalar values, you must pass an index")
# ValueError: If using all scalar values, you must pass an index
df1 = pd.DataFrame(dict1, index=['idx'])
# i love you
# idx 1 2 3
#
# Index(['i', 'love', 'you'], dtype='object')
#
# Index(['idx'], dtype='object')
df1_to_dict = df1.to_dict()
# {'i': {'idx': 1}, 'love': {'idx': 2}, 'you': {'idx': 3}}
df1_T = pd.DataFrame(dict1, index=['idx']).T
# idx
# i 1
# love 2
# you 3
#
# Index(['idx'], dtype='object')
#
# Index(['i', 'love', 'you'], dtype='object')
df1_T_to_dict = df1_T.to_dict()
# {'idx': {'i': 1, 'love': 2, 'you': 3}}
- Regarding dict2
df2 = pd.DataFrame(dict2)
print(df2)
# country2
# 2017 2018 2019
# prov_a ind_a EUR 5.0 8.0 NaN
# prov_b ind_b EUR 7.0 NaN 5.0
# prov_c ind_c EUR NaN 15.0 60.0
#
# MultiIndex([('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019')],
# )
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_b', 'ind_b', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# )
df2_T = pd.DataFrame(dict2).T
# prov_a prov_b prov_c
# ind_a ind_b ind_c
# EUR EUR EUR
# country2 2017 5.0 7.0 NaN
# 2018 8.0 NaN 15.0
# 2019 NaN 5.0 60.0
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_b', 'ind_b', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# )
#
# MultiIndex([('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019')],
# )
pd.DataFrame.from_dict()
Syntax
See here 16
classmethod DataFrame.from_dict(data, orient='columns', dtype=None, columns=None)[source]
Examples
- Regarding dict1
df1 = pd.DataFrame.from_dict(dict1, orient='index')
# 0
# i 1
# love 2
# you 3
#
# RangeIndex(start=0, stop=1, step=1)
#
# Index(['i', 'love', 'you'], dtype='object')
- Regarding dict2
df2 = pd.DataFrame.from_dict(dict2, orient='index')
# prov_a prov_b prov_c
# ind_a ind_b ind_c
# EUR EUR EUR
# country2 2017 5.0 7.0 NaN
# 2018 8.0 NaN 15.0
# 2019 NaN 5.0 60.0
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_b', 'ind_b', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# )
#
# MultiIndex([('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019')],
# )
df2 = pd.DataFrame.from_dict(dict2, orient='columns')
# country2
# 2017 2018 2019
# prov_a ind_a EUR 5.0 8.0 NaN
# prov_b ind_b EUR 7.0 NaN 5.0
# prov_c ind_c EUR NaN 15.0 60.0
#
# MultiIndex([('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019')],
# )
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_b', 'ind_b', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# )
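from_dict additionally accepts a columns argument to name the value columns, but only together with orient='index'; a sketch with dict1:

```python
import pandas as pd

dict1 = {'i': 1, 'love': 2, 'you': 3}
# columns= names the single value column; it requires orient='index'.
df1 = pd.DataFrame.from_dict(dict1, orient='index', columns=['count'])
print(list(df1.columns))         # ['count']
print(df1.loc['love', 'count'])  # 2
```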
Indexing
Setting
See here 17
- df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
df.set_index(["Column"], inplace=True)
# This makes the Column column of df its index
Selecting
See here 18
Columns
df[col] is generally used to select columns, with the column name inside [] (which is why columns are usually given explicit names instead of the default numeric ones, to avoid clashes with the index).
df['a'] selects a single column and outputs a pd.Series; df[['a', 'b']] selects multiple columns and outputs a pd.DataFrame;
Rows
When df[] is given numbers, it selects rows by default, and only slices are allowed, not a single position (df[0]).
df[] also cannot select rows by index label name (df['one']).
df.loc[label] selects rows by index; it supports both an explicitly set index and the default numeric index.
df.loc['one'] or df.loc[1] selects a single row and outputs a pd.Series; df.loc[['one', 'two']] or df.loc[1:3] selects multiple rows and outputs a pd.DataFrame;
The difference between iloc and loc: iloc is used much like loc, but unlike loc it cannot index integer positions beyond the number of rows.
Boolean indexing is essentially the same as for pd.Series.
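A small sketch of the loc / iloc contrast described above, with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30]}, index=['one', 'two', 'three'])
print(df.loc['one', 'a'])            # 10, label-based
print(df.iloc[0, 0])                 # 10, position-based
print(df.loc[['one', 'two']].shape)  # (2, 1), multiple rows give a DataFrame
```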
Sorting
Using the example below, this section shows how to reshape and sort indexes,
df1 = pd.DataFrame(
data={"data_provider": ["prov_a", "prov_a", "prov_b", "prov_b", "prov_c", "prov_c"],
"indicator": ["ind_a", "ind_a", "ind_b", "ind_b", "ind_c", "ind_c"],
"unit": ["EUR", "EUR", "EUR", "EUR", "EUR", "EUR"],
"year": ["2017", "2018","2019", "2017","2018","2019"],
"country1": [6, 2, 10, 1, 3, 6],
"country2": [5, 8, 5, 7, 15, 60]}
)
# data_provider indicator unit year country1 country2
# 0 prov_a ind_a EUR 2017 6 5
# 1 prov_a ind_a EUR 2018 2 8
# 2 prov_b ind_b EUR 2019 10 5
# 3 prov_b ind_b EUR 2017 1 7
# 4 prov_c ind_c EUR 2018 3 15
# 5 prov_c ind_c EUR 2019 6 60
#
# Index(['data_provider', 'indicator', 'unit', 'year', 'country1', 'country2'], dtype='object')
#
# RangeIndex(start=0, stop=6, step=1)
df1MultiIndex = df1.pivot_table(
index=['data_provider', 'indicator', 'unit'],
columns='year'
)
# country1 country2
# year 2017 2018 2019 2017 2018 2019
# data_provider indicator unit
# prov_a ind_a EUR 6.0 2.0 NaN 5.0 8.0 NaN
# prov_b ind_b EUR 1.0 NaN 10.0 7.0 NaN 5.0
# prov_c ind_c EUR NaN 3.0 6.0 NaN 15.0 60.0
#
# MultiIndex([('country1', '2017'),
# ('country1', '2018'),
# ('country1', '2019'),
# ('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019')],
# names=[None, 'year'])
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_b', 'ind_b', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
sort_index
df.sort_index() sorts by index, in ascending order by default; to sort in descending order, pass ascending=False.
Its syntax is,
DataFrame.sort_index(*, axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)
reindex
DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=None, level=None, fill_value=nan, limit=None, tolerance=None)
Parameter | Description |
---|---|
labels | New labels/index to conform to the axis that is specified by ‘axis’. |
index , columns | This is also an optional parameter that refers to the new labels/index. It generally prefers an index object for avoiding duplicate data. |
axis | This is an optional parameter that axis to target. Can be either the axis name (‘index’, ‘columns’) or numbers (0, 1). |
method | {None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’} , optional, default None . Used for filling the holes in the reindexed DataFrame. For increasing/decreasing indexes only. |
copy | bool , default True : Whether to return a new object (a copy), even if the passed indexes are the same. |
level | int or name: It is used to broadcast across the level, matching Index values on the passed MultiIndex level. |
fill_value | Its default value is np.NaN and used to fill existing missing (NaN) values. and any new element needed for successful DataFrame alignment, with this value before computation. |
limit | It defines the maximum number of consecutive elements to forward or backward fill. |
tolerance | This is also an optional parameter that determines the maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance. |
Rows
df
- index in the default ascending order
df1MultiIndex = df1MultiIndex.sort_index(ascending=True)
# country1 country2
# year 2017 2018 2019 2017 2018 2019
# data_provider indicator unit
# prov_a ind_a EUR 6.0 2.0 NaN 5.0 8.0 NaN
# prov_b ind_b EUR 1.0 NaN 10.0 7.0 NaN 5.0
# prov_c ind_c EUR NaN 3.0 6.0 NaN 15.0 60.0
#
# MultiIndex([('country1', '2017'),
# ('country1', '2018'),
# ('country1', '2019'),
# ('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019')],
# names=[None, 'year'])
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_b', 'ind_b', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
- index in descending order
df1MultiIndex = df1MultiIndex.sort_index(ascending=False)
# country1 country2
# year 2017 2018 2019 2017 2018 2019
# data_provider indicator unit
# prov_c ind_c EUR NaN 3.0 6.0 NaN 15.0 60.0
# prov_b ind_b EUR 1.0 NaN 10.0 7.0 NaN 5.0
# prov_a ind_a EUR 6.0 2.0 NaN 5.0 8.0 NaN
#
# MultiIndex([('country1', '2017'),
# ('country1', '2018'),
# ('country1', '2019'),
# ('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019')],
# names=[None, 'year'])
#
# MultiIndex([('prov_c', 'ind_c', 'EUR'),
# ('prov_b', 'ind_b', 'EUR'),
# ('prov_a', 'ind_a', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
Columns
df
- Column level 1 in the default ascending order
df1MultiIndex_sort_level1 = df1MultiIndex.sort_index(level=1, axis=1, ascending=True)
# country1 country2 ... country1 country2
# year 2017 2017 ... 2019 2019
# data_provider indicator unit ...
# prov_a ind_a EUR 6.0 5.0 ... NaN NaN
# prov_b ind_b EUR 1.0 7.0 ... 10.0 5.0
# prov_c ind_c EUR NaN NaN ... 6.0 60.0
#
# [3 rows x 6 columns]
# MultiIndex([('country1', '2017'),
# ('country2', '2017'),
# ('country1', '2018'),
# ('country2', '2018'),
# ('country1', '2019'),
# ('country2', '2019')],
# names=[None, 'year'])
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_b', 'ind_b', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
With level=1, axis=1, ascending=True, the sort is ascending on the column level named year.
- Column level 0 in descending order
df1MultiIndex_sort_level0 = df1MultiIndex.sort_index(level=0, axis=1, ascending=False)
# country2 country1
# year 2019 2018 2017 2019 2018 2017
# data_provider indicator unit
# prov_a ind_a EUR NaN 8.0 5.0 NaN 2.0 6.0
# prov_b ind_b EUR 5.0 NaN 7.0 10.0 NaN 1.0
# prov_c ind_c EUR 60.0 15.0 NaN 6.0 3.0 NaN
#
# MultiIndex([('country2', '2019'),
# ('country2', '2018'),
# ('country2', '2017'),
# ('country1', '2019'),
# ('country1', '2018'),
# ('country1', '2017')],
# names=[None, 'year'])
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_b', 'ind_b', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
With level=0, axis=1, ascending=False, the sort is descending on column level 0, whose name here is None.
- The year level of the columns in the given order ['2018', '2019', '2017']
df1MultiIndex_sort_year = df1MultiIndex.reindex(columns=['2018', '2019', '2017'], level='year')
# country1 country2
# year 2018 2019 2017 2018 2019 2017
# data_provider indicator unit
# prov_a ind_a EUR 2.0 NaN 6.0 8.0 NaN 5.0
# prov_b ind_b EUR NaN 10.0 1.0 NaN 5.0 7.0
# prov_c ind_c EUR 3.0 6.0 NaN 15.0 60.0 NaN
#
# MultiIndex([('country1', '2018'),
# ('country1', '2019'),
# ('country1', '2017'),
# ('country2', '2018'),
# ('country2', '2019'),
# ('country2', '2017')],
# names=[None, 'year'])
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_b', 'ind_b', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
- Every level of the column index in a given order: level=0 in the order ['country1', 'country2'], and level=1 in the order ['2019', '2017', '2018'] 21.
Reorder = pd.MultiIndex.from_product([['country1', 'country2'], ['2019', '2017', '2018']])
df1MultiIndex_reindex = df1MultiIndex.reindex(columns=Reorder)
# country1 country2
# 2019 2017 2018 2019 2017 2018
# data_provider indicator unit
# prov_a ind_a EUR NaN 6.0 2.0 NaN 5.0 8.0
# prov_b ind_b EUR 10.0 1.0 NaN 5.0 7.0 NaN
# prov_c ind_c EUR 6.0 NaN 3.0 60.0 NaN 15.0
#
# MultiIndex([('country1', '2019'),
# ('country1', '2017'),
# ('country1', '2018'),
# ('country2', '2019'),
# ('country2', '2017'),
# ('country2', '2018')],
# )
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_b', 'ind_b', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
Series
- For a given index entry, column level 1 in descending order
df1MultiIndex_prov_b = df1MultiIndex.loc[('prov_b', )]
# country1 country2
# year 2017 2018 2019 2017 2018 2019
# indicator unit
# ind_b EUR 1.0 NaN 10.0 7.0 NaN 5.0
#
# MultiIndex([('country1', '2017'),
# ('country1', '2018'),
# ('country1', '2019'),
# ('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019')],
# names=[None, 'year'])
#
# MultiIndex([('ind_b', 'EUR')],
# names=['indicator', 'unit'])
df1MultiIndex_sort_prov_b = df1MultiIndex.loc[('prov_b', )].sort_index(level=1, axis=1, ascending=False)
# country2 country1 country2 country1 country2 country1
# year 2019 2019 2018 2018 2017 2017
# indicator unit
# ind_b EUR 5.0 10.0 NaN NaN 7.0 1.0
#
# MultiIndex([('country2', '2019'),
# ('country1', '2019'),
# ('country2', '2018'),
# ('country1', '2018'),
# ('country2', '2017'),
# ('country1', '2017')],
# names=[None, 'year'])
#
# MultiIndex([('ind_b', 'EUR')],
# names=['indicator', 'unit'])
Deleting
Using the df below as an example, this section shows how to delete row / column index entries,
import pandas as pd
df = pd.DataFrame(
data={"data_provider": ["prov_a", "prov_a", "prov_b", "prov_b", "prov_c", "prov_c"],
"indicator": ["ind_a", "ind_a", "ind_b", "ind_b", "ind_c", "ind_c"],
"unit": ["EUR", "EUR", "EUR", "EUR", "EUR", "EUR"],
"year": ["2017", "2018","2019", "2017","2018","2019"],
"country1": [6, 2, 10, 1, 3, 6],
"country2": [5, 8, 5, 7, 15, 60]}
)
df = df.pivot_table(
index=['data_provider', 'indicator', 'unit'],
columns='year'
)
df.columns.names = ['country', 'year']
# country country1 country2
# year 2017 2018 2019 2017 2018 2019
# data_provider indicator unit
# prov_a ind_a EUR 6.0 2.0 NaN 5.0 8.0 NaN
# prov_b ind_b EUR 1.0 NaN 10.0 7.0 NaN 5.0
# prov_c ind_c EUR NaN 3.0 6.0 NaN 15.0 60.0
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_b', 'ind_b', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
#
# MultiIndex([('country1', '2017'),
# ('country1', '2018'),
# ('country1', '2019'),
# ('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019')],
# names=['country', 'year'])
Columns
df.drop
- Drop entries by a full (all-level) column label
df_dropcol = df.drop(('country1', '2017'), axis=1)
# MultiIndex([('country1', '2018'),
# ('country1', '2019'),
# ('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019')],
# names=['country', 'year'])
df_dropcol = df.drop(columns=('country1', '2019'), axis=1)
# MultiIndex([('country1', '2017'),
# ('country1', '2018'),
# ('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019')],
# names=['country', 'year'])
- Drop entries by a label value in one level of the column index
df_dropcol = df.drop([('country1', )], axis=1)
# MultiIndex([('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019')],
# names=['country', 'year'])
df_dropcol = df.drop(columns='country2', axis=1)
# MultiIndex([('country1', '2017'),
# ('country1', '2018'),
# ('country1', '2019')],
# names=['country', 'year'])
df_dropcol = df.drop(columns='2019', level='year')
# MultiIndex([('country1', '2017'),
# ('country1', '2018'),
# ('country2', '2017'),
# ('country2', '2018')],
# names=['country', 'year'])
df.droplevel
- Remove a whole level of the column index by its level name
df_dropcol = df.droplevel(level='year', axis=1)
# Index(['country1', 'country1', 'country1', 'country2', 'country2', 'country2'], dtype='object', name='country')
df_dropcol = df.droplevel(level='country', axis=1)
# Index(['2017', '2018', '2019', '2017', '2018', '2019'], dtype='object', name='year')
Rows
df.drop
- Drop entries by a full (all-level) row label
df_droprow = df.drop([('prov_c', 'ind_c', 'EUR')])
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_b', 'ind_b', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
df_droprow = df.drop(index=('prov_b', 'ind_b', 'EUR'))
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
- Drop entries by a label value in one level of the row index
df_droprow = df.drop([('prov_b', )])
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
df_droprow = df.drop(index='prov_a')
# MultiIndex([('prov_b', 'ind_b', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
df_droprow = df.drop(index='ind_b', level='indicator')
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_c', 'ind_c', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
df.droplevel
- Remove a whole level of the row index by its level name
df_droprow = df.droplevel(level='unit')
# MultiIndex([('prov_a', 'ind_a'),
# ('prov_b', 'ind_b'),
# ('prov_c', 'ind_c')],
# names=['data_provider', 'indicator'])
df_droprow = df.droplevel(level='data_provider')
# MultiIndex([('ind_a', 'EUR'),
# ('ind_b', 'EUR'),
# ('ind_c', 'EUR')],
# names=['indicator', 'unit'])
Duplicating
The raw data file is available here 4; its contents are shown below:
df = pd.read_csv('dtype_RawData.csv')
# Customer Number Customer Name 2016 ... Month Day Active
# 0 1002 Quest industries ¥125,000.00 ... 1 2015 Y
# 1 552278 Smith Plumbing ¥920,000.00 ... 6 2014 Y
# 2 23477 ACME Industrial ¥50,000.00 ... 3 2016 Y
# 3 24900 Brekke LTD ¥350,000.00 ... 10 2015 Y
# 4 651029 Harbor Co ¥15,000.00 ... 2 2014 N
#
# [5 rows x 9 columns]
#
# Customer Number int64
# Customer Name object
# 2016 object
# 2017 object
# Percent Growth object
# Jan Units object
# Month int64
# Day int64
# Active object
# dtype: object
Along Rows
See here 24
Methods
Method | Notes |
---|---|
pd.DataFrame(np.repeat(df.values, reps, axis)) | |
pd.concat([df]*reps) | |
df.loc[df.index.repeat(reps)] | |
Examples
newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0))
newdf.columns = df.columns
# Customer Number Customer Name 2016 ... Month Day Active
# 0 1002 Quest industries ¥125,000.00 ... 1 2015 Y
# 1 1002 Quest industries ¥125,000.00 ... 1 2015 Y
# 2 1002 Quest industries ¥125,000.00 ... 1 2015 Y
# 3 552278 Smith Plumbing ¥920,000.00 ... 6 2014 Y
# 4 552278 Smith Plumbing ¥920,000.00 ... 6 2014 Y
# 5 552278 Smith Plumbing ¥920,000.00 ... 6 2014 Y
# 6 23477 ACME Industrial ¥50,000.00 ... 3 2016 Y
# 7 23477 ACME Industrial ¥50,000.00 ... 3 2016 Y
# 8 23477 ACME Industrial ¥50,000.00 ... 3 2016 Y
# 9 24900 Brekke LTD ¥350,000.00 ... 10 2015 Y
# 10 24900 Brekke LTD ¥350,000.00 ... 10 2015 Y
# 11 24900 Brekke LTD ¥350,000.00 ... 10 2015 Y
# 12 651029 Harbor Co ¥15,000.00 ... 2 2014 N
# 13 651029 Harbor Co ¥15,000.00 ... 2 2014 N
# 14 651029 Harbor Co ¥15,000.00 ... 2 2014 N
#
# [15 rows x 9 columns]
#
# Index(['Customer Number', 'Customer Name', '2016', '2017', 'Percent Growth',
# 'Jan Units', 'Month', 'Day', 'Active'],
# dtype='object')
#
# RangeIndex(start=0, stop=15, step=1)
newdf = pd.concat([df]*3)
# Customer Number Customer Name 2016 ... Month Day Active
# 0 1002 Quest industries ¥125,000.00 ... 1 2015 Y
# 1 552278 Smith Plumbing ¥920,000.00 ... 6 2014 Y
# 2 23477 ACME Industrial ¥50,000.00 ... 3 2016 Y
# 3 24900 Brekke LTD ¥350,000.00 ... 10 2015 Y
# 4 651029 Harbor Co ¥15,000.00 ... 2 2014 N
# 0 1002 Quest industries ¥125,000.00 ... 1 2015 Y
# 1 552278 Smith Plumbing ¥920,000.00 ... 6 2014 Y
# 2 23477 ACME Industrial ¥50,000.00 ... 3 2016 Y
# 3 24900 Brekke LTD ¥350,000.00 ... 10 2015 Y
# 4 651029 Harbor Co ¥15,000.00 ... 2 2014 N
# 0 1002 Quest industries ¥125,000.00 ... 1 2015 Y
# 1 552278 Smith Plumbing ¥920,000.00 ... 6 2014 Y
# 2 23477 ACME Industrial ¥50,000.00 ... 3 2016 Y
# 3 24900 Brekke LTD ¥350,000.00 ... 10 2015 Y
# 4 651029 Harbor Co ¥15,000.00 ... 2 2014 N
#
# [15 rows x 9 columns]
#
# Index(['Customer Number', 'Customer Name', '2016', '2017', 'Percent Growth',
# 'Jan Units', 'Month', 'Day', 'Active'],
# dtype='object')
#
# Int64Index([0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4], dtype='int64')
newdf = df.loc[df.index.repeat(3)]
# Customer Number Customer Name 2016 ... Month Day Active
# 0 1002 Quest industries ¥125,000.00 ... 1 2015 Y
# 0 1002 Quest industries ¥125,000.00 ... 1 2015 Y
# 0 1002 Quest industries ¥125,000.00 ... 1 2015 Y
# 1 552278 Smith Plumbing ¥920,000.00 ... 6 2014 Y
# 1 552278 Smith Plumbing ¥920,000.00 ... 6 2014 Y
# 1 552278 Smith Plumbing ¥920,000.00 ... 6 2014 Y
# 2 23477 ACME Industrial ¥50,000.00 ... 3 2016 Y
# 2 23477 ACME Industrial ¥50,000.00 ... 3 2016 Y
# 2 23477 ACME Industrial ¥50,000.00 ... 3 2016 Y
# 3 24900 Brekke LTD ¥350,000.00 ... 10 2015 Y
# 3 24900 Brekke LTD ¥350,000.00 ... 10 2015 Y
# 3 24900 Brekke LTD ¥350,000.00 ... 10 2015 Y
# 4 651029 Harbor Co ¥15,000.00 ... 2 2014 N
# 4 651029 Harbor Co ¥15,000.00 ... 2 2014 N
# 4 651029 Harbor Co ¥15,000.00 ... 2 2014 N
#
# [15 rows x 9 columns]
#
# Index(['Customer Number', 'Customer Name', '2016', '2017', 'Percent Growth',
# 'Jan Units', 'Month', 'Day', 'Active'],
# dtype='object')
#
# Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], dtype='int64')
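Note in the last output that df.index.repeat keeps the original row labels (0, 0, 0, 1, ...); chaining reset_index(drop=True) restores a clean RangeIndex. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2]})
# Repeat each row 3 times, then discard the duplicated labels.
newdf = df.loc[df.index.repeat(3)].reset_index(drop=True)
print(newdf['x'].tolist())   # [1, 1, 1, 2, 2, 2]
print(newdf.index.tolist())  # [0, 1, 2, 3, 4, 5]
```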
Along Columns
See here
Methods
Method | Notes |
---|---|
df[df.columns.repeat(reps)] | |
Examples
newdf = df[df.columns.repeat(3)]
# Customer Number Customer Number Customer Number ... Active Active Active
# 0 1002 1002 1002 ... Y Y Y
# 1 552278 552278 552278 ... Y Y Y
# 2 23477 23477 23477 ... Y Y Y
# 3 24900 24900 24900 ... Y Y Y
# 4 651029 651029 651029 ... N N N
#
# [5 rows x 27 columns]
#
# Index(['Customer Number', 'Customer Number', 'Customer Number',
# 'Customer Name', 'Customer Name', 'Customer Name', '2016', '2016',
# '2016', '2017', '2017', '2017', 'Percent Growth', 'Percent Growth',
# 'Percent Growth', 'Jan Units', 'Jan Units', 'Jan Units', 'Month',
# 'Month', 'Month', 'Day', 'Day', 'Day', 'Active', 'Active', 'Active'],
# dtype='object')
#
# RangeIndex(start=0, stop=5, step=1)
Counting Occurrences
A Specified Value
See here 25
- df[df.columnname == value].shape[0]
df[df.education == '9th'].shape[0]
# number of times the value `9th` appears in the `education` column
All Values
See here 26
- df[col_name].value_counts()
- df.loc[:,col_name].value_counts()
df['label'].value_counts()
# count how many times each value appears in the 'label' column
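value_counts also takes normalize=True to return relative frequencies instead of raw counts; a minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series(['Y', 'Y', 'Y', 'N'])
print(s.value_counts().to_dict())                # {'Y': 3, 'N': 1}
print(s.value_counts(normalize=True).to_dict())  # {'Y': 0.75, 'N': 0.25}
```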
Adding
Rows
At the End
- df.loc[]
Use df.loc[] to add a new row at the end of an existing DataFrame:
df.loc[len(df.index)] = [value1, value2, value3, ...]
For example,
df = pd.DataFrame({'points': [10, 12, 12, 14, 13, 18],
'rebounds': [7, 7, 8, 13, 7, 4],
'assists': [11, 8, 10, 6, 6, 5]})
print(df)
'''
points rebounds assists
0 10 7 11
1 12 7 8
2 12 8 10
3 14 13 6
4 13 7 6
5 18 4 5
'''
#add new row to end of DataFrame
print(len(df.index))
'''
6
'''
df.loc[len(df.index)] = [20, 7, 5]
print(df)
'''
points rebounds assists
0 10 7 11
1 12 7 8
2 12 8 10
3 14 13 6
4 13 7 6
5 18 4 5
6 20 7 5
'''
- df.append()
Appends rows of one DataFrame to the end of another DataFrame. Note that the two DataFrames should have the same column names. (DataFrame.append was deprecated and then removed in pandas 2.0; pd.concat is its replacement.)
df = df.append(df2, ignore_index = True)
For example,
df = pd.DataFrame({'points': [10, 12, 12, 14, 13, 18],
'rebounds': [7, 7, 8, 13, 7, 4],
'assists': [11, 8, 10, 6, 6, 5]})
print(df)
'''
points rebounds assists
0 10 7 11
1 12 7 8
2 12 8 10
3 14 13 6
4 13 7 6
5 18 4 5
'''
df2 = pd.DataFrame({'points': [21, 25, 26],
'rebounds': [7, 7, 13],
'assists': [11, 3, 3]})
print(df2)
'''
points rebounds assists
0 21 7 11
1 25 7 3
2 26 13 3
'''
#add new row to end of DataFrame
df = df.append(df2, ignore_index=True)
print(df)
'''
points rebounds assists
0 10 7 11
1 12 7 8
2 12 8 10
3 14 13 6
4 13 7 6
5 18 4 5
6 21 7 11
7 25 7 3
8 26 13 3
'''
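需要注意,`df.append` 自 pandas 1.4 起被弃用,并在 2.0 中被移除。上面的例子可改用 `pd.concat` 实现同样的效果(这里用简化的 2 行数据作示意):

```python
import pandas as pd

df = pd.DataFrame({'points': [10, 12], 'rebounds': [7, 7], 'assists': [11, 8]})
df2 = pd.DataFrame({'points': [21, 25], 'rebounds': [7, 7], 'assists': [11, 3]})

# pandas 2.0 起 df.append 已被移除,等价写法是 pd.concat
df = pd.concat([df, df2], ignore_index=True)
# 结果为 4 行,index 为 0-3,points 列为 [10, 12, 21, 25]
```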
多行
以下面的例子对本章节进行说明,
import pandas as pd
grade =[[100,95,100,99], [90,98,99,100], [88,95,98,88], [99,98,97,87], [96.5,90,96,85]]
df = pd.DataFrame(grade, columns=['Chinese', 'English', 'Math', 'Politics'])
print(df)
# Chinese English Math Politics
# 0 100.0 95 100 99
# 1 90.0 98 99 100
# 2 88.0 95 98 88
# 3 99.0 98 97 87
# 4 96.5 90 96 85
- 同时添加多行,新增行同当前
df
中的列的值相关
from statistics import geometric_mean
df = pd.concat([df, df.apply(lambda x: (x.sum(), x.mean(), geometric_mean(x)), axis=0, result_type='expand')], axis=0)
print(df)
# Chinese English Math Politics
# 0 100.000000 95.00000 100.000000 99.000000
# 1 90.000000 98.00000 99.000000 100.000000
# 2 88.000000 95.00000 98.000000 88.000000
# 3 99.000000 98.00000 97.000000 87.000000
# 4 96.500000 90.00000 96.000000 85.000000
# 0 473.500000 476.00000 490.000000 459.000000
# 1 94.700000 95.20000 98.000000 91.800000
# 2 94.575259 95.15429 97.989795 91.582229
列
以下面的例子对本章节进行说明,
import pandas as pd
grade =[[100,95,100,99], [90,98,99,100], [88,95,98,88], [99,98,97,87], [96.5,90,96,85]]
df = pd.DataFrame(grade, columns=['Chinese', 'English', 'Math', 'Politics'])
print(df)
# Chinese English Math Politics
# 0 100.0 95 100 99
# 1 90.0 98 99 100
# 2 88.0 95 98 88
# 3 99.0 98 97 87
# 4 96.5 90 96 85
单列
- 添加单列,新增列同当前
df
中的列相关
df[['Sum']] = df[['Math']]
print(df)
# Chinese English Math Politics Sum
# 0 100.0 95 100 99 100
# 1 90.0 98 99 100 99
# 2 88.0 95 98 88 98
# 3 99.0 98 97 87 97
# 4 96.5 90 96 85 96
可以使用 shift(n)
对新增列进行向前/向后平移 30。
Chinese English Math Politics Sum
# 0 100.0 95 100 99 NaN
# 1 90.0 98 99 100 NaN
# 2 88.0 95 98 88 NaN
# 3 99.0 98 97 87 100.0
# 4 96.5 90 96 85 99
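上面的平移结果缺少对应代码,下面给出一个自包含的示意(假设向后平移 3 行):

```python
import pandas as pd

grade = [[100, 95, 100, 99], [90, 98, 99, 100], [88, 95, 98, 88],
         [99, 98, 97, 87], [96.5, 90, 96, 85]]
df = pd.DataFrame(grade, columns=['Chinese', 'English', 'Math', 'Politics'])

# 新增 Sum 列并向后平移 3 行(平移步长 3 为演示假设)
df['Sum'] = df['Math'].shift(3)
# Sum 列为 [NaN, NaN, NaN, 100.0, 99.0]
```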
- 添加单列,新增列同当前
df
中的列无关
df['Sum'] = 'Mark'
print(df)
# Chinese English Math Politics Sum
# 0 100.0 95 100 99 Mark
# 1 90.0 98 99 100 Mark
# 2 88.0 95 98 88 Mark
# 3 99.0 98 97 87 Mark
# 4 96.5 90 96 85 Mark
多列
- 同时添加多列,新增列同当前
df
中的列相关
- 新增列标签与当前
df
中的列标签不同。
df[['New1', 'New2']] = df[['English', 'Math']]
print(df)
# Chinese English Math Politics New1 New2
# 0 100.0 95 100 99 95 100
# 1 90.0 98 99 100 98 99
# 2 88.0 95 98 88 95 98
# 3 99.0 98 97 87 98 97
# 4 96.5 90 96 85 90 96
或者
df[['New1', 'New2']] = df.apply(lambda x: ('yes', 'no'), axis=1, result_type='expand')
print(df)
# Chinese English Math Politics New1 New2
# 0 100.0 95 100 99 yes no
# 1 90.0 98 99 100 yes no
# 2 88.0 95 98 88 yes no
# 3 99.0 98 97 87 yes no
# 4 96.5 90 96 85 yes no
- 如果新增列标签与当前
df
中的列标签有重合,则重合的列会被就地覆盖,不会新增同名列。
df[['English', 'Geography']] = df[['English', 'Math']]
print(df)
# Chinese English Math Politics Geography
# 0 100.0 95 100 99 100
# 1 90.0 98 99 100 99
# 2 88.0 95 98 88 98
# 3 99.0 98 97 87 97
# 4 96.5 90 96 85 96
或者
df[['English', 'New2']] = df.apply(lambda x: ('yes', 'no'), axis=1, result_type='expand')
print(df)
# Chinese English Math Politics New2
# 0 100.0 yes 100 99 no
# 1 90.0 yes 99 100 no
# 2 88.0 yes 98 88 no
# 3 99.0 yes 97 87 no
# 4 96.5 yes 96 85 no
- 同时添加多列,新增列同当前
df
中的列无关
df[['Sum', 'Mean', 'Multiply']] = 'ABC', 'DEF', 'GHI'
print(df)
# Chinese English Math Politics Sum Mean Multiply
# 0 100.0 95 100 99 ABC DEF GHI
# 1 90.0 98 99 100 ABC DEF GHI
# 2 88.0 95 98 88 ABC DEF GHI
# 3 99.0 98 97 87 ABC DEF GHI
# 4 96.5 90 96 85 ABC DEF GHI
- 同时添加多列,新增列同当前
df
中的列的值相关
from statistics import geometric_mean
df[['Sum', 'Mean', 'Geomean']] = df.apply(lambda x: (x.sum(), x.mean(), geometric_mean(x)), axis=1, result_type='expand')
print(df)
# Chinese English Math Politics Sum Mean Geomean
# 0 100.0 95 100 99 394.0 98.500 98.478109
# 1 90.0 98 99 100 387.0 96.750 96.666489
# 2 88.0 95 98 88 369.0 92.250 92.146481
# 3 99.0 98 97 87 381.0 95.250 95.123553
# 4 96.5 90 96 85 367.5 91.875 91.751885
去重
drop_duplicates()
语法
语法如下 31,
DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)
参数 | 说明 |
---|---|
subset | column label or sequence of labels, optional。识别重复行时只考虑这些列,默认使用所有列。 |
keep | {'first', 'last', False} ,默认 'first' 。决定保留哪些重复项:'first' 保留首次出现的行;'last' 保留最后一次出现的行;False 删除所有重复的行。 |
inplace | bool ,默认 False 。是否直接修改原 DataFrame 而不是返回新对象。 |
ignore_index | bool ,默认 False 。若为 True ,结果索引将重新标记为 0, 1, …, n-1 。 |
返回 | 去重后的 DataFrame ;若 inplace=True 则返回 None 。 |
示例
以下面的 df
为例对如何去重进行说明,
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5]
})
# brand style rating
# 0 Yum Yum cup 4.0
# 1 Yum Yum cup 4.0
# 2 Indomie cup 3.5
# 3 Indomie pack 15.0
# 4 Indomie pack 5.0
df
默认情况下,基于所有的列的值移除所有的重复的行
df.drop_duplicates()
# brand style rating
# 0 Yum Yum cup 4.0
# 2 Indomie cup 3.5
# 3 Indomie pack 15.0
# 4 Indomie pack 5.0
列
使用 subset
来移除指定列中的重复值
df.drop_duplicates(subset=['brand'])
# brand style rating
# 0 Yum Yum cup 4.0
# 2 Indomie cup 3.5
df.drop_duplicates(subset=['brand', 'style'], keep='last')
# brand style rating
# 1 Yum Yum cup 4.0
# 2 Indomie cup 3.5
# 4 Indomie pack 5.0
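上文未演示 `keep=False` 的情形:它会把所有出现过重复的行全部删除,只留下完全不重复的行。

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# keep=False:凡在 (brand, style) 上出现过重复的行全部删除
out = df.drop_duplicates(subset=['brand', 'style'], keep=False)
# 只剩索引 2 的一行(Indomie / cup / 3.5)
```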
修改
行
参考这里 27
-
df.loc[]
使用 df.loc[] 将已有 DataFrame
的某一行的值修改为目标值
df.loc[目标行数索引] = [value1, value2, value3, ...]
比如,
df = pd.DataFrame({'points': [10, 12, 12, 14, 13, 18],
'rebounds': [7, 7, 8, 13, 7, 4],
'assists': [11, 8, 10, 6, 6, 5]})
print(df)
'''
points rebounds assists
0 10 7 11
1 12 7 8
2 12 8 10
3 14 13 6
4 13 7 6
5 18 4 5
'''
df.loc[4] = ['A', 'B', 'C']
print(df)
'''
points rebounds assists
0 10 7 11
1 12 7 8
2 12 8 10
3 14 13 6
4 A B C
5 18 4 5
'''
特定值
参考这里 32
仅适用于有特定的值需要被处理时
参考:
- How to set a cell to NaN in a pandas dataframe
- How to replace a value in pandas, with NaN?
- python——pandas——设置dataFrame某列为空值(NaN)
方法
df.mask
df.replace
示例
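下面给出 `df.replace` 与 `df.mask` 把特定值置为 `NaN` 的最小示意(数据为演示假设):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, -1, 3], 'b': [-1, 5, 6]})

# df.replace:把特定值(此处为 -1)替换为 NaN
replaced = df.replace(-1, np.nan)

# df.mask:把满足条件(此处为小于 0)的值替换为 NaN
masked = df.mask(df < 0)

# 两者结果相同:a = [1.0, NaN, 3.0],b = [NaN, 5.0, 6.0]
```

`replace` 针对具体的值,`mask` 针对布尔条件,按需要选择即可。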
重命名
此处参考的原始数据如下所示,
data = pd.read_excel(io='analysis.xlsx')
print(data.head(3))
'''
id uid ... diqu city
0 553217640 6376967.0 ... 广西 南宁市
1 553217639 6376967.0 ... 广西 南宁市
2 553217638 6571870.0 ... 浙江省 杭州市
[3 rows x 11 columns]
'''
方法
赋值
使用 DataFrame.index = [newName]
,DataFrame.columns = [newName]
,这两种方法可以轻松实现。但是给出的新的索引标签或列标签列表的长度必须和原有的长度相等,否则会因不匹配而报错。
这种改变方式是直接改变了原始数据。
如下所示,
'''
print(len(data.columns.to_list()))
11
'''
colReorder = ['user', 'uid', 'title', 'docid', 'time', 'status', 'score', 'qudao', 'kffs', 'province', 'city']
'''
print(len(colReorder))
11
'''
data.columns = colReorder
print(data.head(3))
'''
user uid ... province city
0 553217640 6376967.0 ... 广西 南宁市
1 553217639 6376967.0 ... 广西 南宁市
2 553217638 6571870.0 ... 浙江省 杭州市
[3 rows x 11 columns]
'''
rename
语法
DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None)
除了修改 columns
还可以修改 index
,十分方便。
参数 | 说明 |
---|---|
mapper ,index ,columns | 三者任选其一使用,也可以将 index 和 columns 结合使用。index 和 columns 接收映射函数或字典。可以使用 map 方法进行映射,map 的使用方法和 python 内置的 map 几乎一样。 |
axis | int 或 str ,与 mapper 配合使用。可以是轴名称( index ,columns )或数字( 0 , 1 )。默认为 index 。 |
copy | boolean ,默认为 True ,是否复制基础数据。 |
inplace | 布尔值,默认为 False ,是否直接在原 DataFrame 上修改而不返回新对象。如果为 True ,则忽略 copy 参数。 |
示例
参考此处 35
行
略
列
- 部分
有时候,我们只需要将部分列的列名重命名。可以使用字典的方式,更改指定的列名。
### rename dict: key, default field of raw data; value, new field
rename = {
'id': 'user',
'diqu': 'province'
}
data = data.rename(columns=rename)
print(data.head(3))
'''
user uid ... province city
0 553217640 6376967.0 ... 广西 南宁市
1 553217639 6376967.0 ... 广西 南宁市
2 553217638 6571870.0 ... 浙江省 杭州市
[3 rows x 11 columns]
'''
数学运算
此处考虑单层索引的简单数据结构和多层索引的复杂数据结构的情形,对常见的数学运算给出示例以说明。
参考原始数据如下所示:
import pandas as pd
df1 = pd.DataFrame(
data={"data_provider": ["prov_a", "prov_a", "prov_a", "prov_a", "prov_a", "prov_a"],
"indicator": ["ind_a", "ind_a", "ind_a", "ind_b", "ind_b", "ind_b"],
"unit": ["EUR", "EUR", "EUR", "EUR", "EUR", "EUR"],
"year": ["2017", "2018","2019", "2017","2018","2019"],
"country1": [1, 2, 3, 2, 4, 6],
"country2": [4, 5, 6, 40, 50, 60]}
)
# data_provider indicator unit year country1 country2
# 0 prov_a ind_a EUR 2017 1 4
# 1 prov_a ind_a EUR 2018 2 5
# 2 prov_a ind_a EUR 2019 3 6
# 3 prov_a ind_b EUR 2017 2 40
# 4 prov_a ind_b EUR 2018 4 50
# 5 prov_a ind_b EUR 2019 6 60
#
# RangeIndex(start=0, stop=6, step=1)
# Index(['data_provider', 'indicator', 'unit', 'year', 'country1', 'country2'], dtype='object')
import numpy as np
df2 = pd.DataFrame(np.arange(1, 51).reshape(10, 5), columns=['Asia', 'Africa', 'Australia', 'Europe', 'Russia'])
# Asia Africa Australia Europe Russia
# 0 1 2 3 4 5
# 1 6 7 8 9 10
# 2 11 12 13 14 15
# 3 16 17 18 19 20
# 4 21 22 23 24 25
# 5 26 27 28 29 30
# 6 31 32 33 34 35
# 7 36 37 38 39 40
# 8 41 42 43 44 45
# 9 46 47 48 49 50
#
# RangeIndex(start=0, stop=10, step=1)
# Index(['Asia', 'Africa', 'Australia', 'Europe', 'Russia'], dtype='object')
# 先拷贝一份,保持 df2 不变,再在 df3 中引入缺失值
df3 = df2.copy()
df3.loc[3, 'Asia'] = np.nan
df3.loc[5, 'Europe'] = np.nan
df3.loc[7, 'Russia'] = np.nan
# Asia Africa Australia Europe Russia
# 0 1.0 2 3 4.0 5.0
# 1 6.0 7 8 9.0 10.0
# 2 11.0 12 13 14.0 15.0
# 3 NaN 17 18 19.0 20.0
# 4 21.0 22 23 24.0 25.0
# 5 26.0 27 28 NaN 30.0
# 6 31.0 32 33 34.0 35.0
# 7 36.0 37 38 39.0 NaN
# 8 41.0 42 43 44.0 45.0
# 9 46.0 47 48 49.0 50.0
#
# Index(['Asia', 'Africa', 'Australia', 'Europe', 'Russia'], dtype='object')
#
# RangeIndex(start=0, stop=10, step=1)
运算
divide
语法
获取 DataFrame 与其他对象按元素的浮点除法(二元运算符 truediv ),等价于 dataframe / other ,但支持用 fill_value 填充任一输入中的缺失值;反向版本为 rtruediv 。算术运算符 + ,- ,* ,/ ,// ,% ,** 分别对应灵活的包装器方法 add ,sub ,mul ,div ,mod ,pow 。
DataFrame.divide(other, axis='columns', level=None, fill_value=None)
参数 | 说明 |
---|---|
other | 标量 ( scalar ),序列 ( sequence ),Series 或 DataFrame 。任何单个或多个元素数据结构或类似列表的对象。 |
axis | { 0 or index , 1 or columns }是否通过索引 ( 0 or index ) 或 列 ( 1 或 columns )进行比较。对于 Series 输入,轴匹配 Series 索引。 |
level | int 或 label 跨级别广播,匹配传递的 MultiIndex 级别的索引值。 |
fill_value | float 或 None ,默认为 None 在计算之前使用此值填充现有缺失( NaN )值以及成功 DataFrame 对齐所需的任何新元素。如果缺少相应 DataFrame 位置中的数据,则结果将丢失。 |
返回 | DataFrame ,算术运算的结果。 |
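下面给出一个 `fill_value` 的最小示意(数据为演示假设):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [10, 20]})
other = pd.DataFrame({'x': [2, None], 'y': [5, 4]})  # x 列含一个缺失值

# fill_value=1:other 中的缺失值在计算前先被填充为 1
out = df.divide(other, fill_value=1)
# out: x = [0.5, 2.0], y = [2.0, 5.0]
```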
gmean
参考这里
方法
方法 | 说明 |
---|---|
np.exp(np.log(df2.prod(axis=1))/df2.notna().sum(1)) | 如果 df 中存在空值 NaN ,在计算时会被跳过。 |
df.apply(gmean, axis=1) | 需要先引入 scipy.stats 。如果 df 中存在空值 NaN ,在计算时无法被忽略。 |
df.apply(lambda x: np.exp(np.mean(np.log(x))), axis=1) | 如果 df 中存在空值 NaN ,在计算时会被跳过。 |
subtraction
参考这里
方法
示例
RangeIndex
参考这里 38
divide
- 对于
df
中存在str
或其他类型的数据时,要做数学运算时需注意对int
或者float
类型的数据进行针对性筛选。
df1['Normalize'] = df1[df1.columns[-2]].div(df1[df1.columns[-1]])
print(df1)
# data_provider indicator unit year country1 country2 Normalize
# 0 prov_a ind_a EUR 2017 1 4 0.25
# 1 prov_a ind_a EUR 2018 2 5 0.40
# 2 prov_a ind_a EUR 2019 3 6 0.50
# 3 prov_a ind_b EUR 2017 2 40 0.05
# 4 prov_a ind_b EUR 2018 4 50 0.08
# 5 prov_a ind_b EUR 2019 6 60 0.10
- 对于
df
中数据类型统统为int
或float
类型时,需要小心轴方向的抉择。
df2SumAxis0 = df2.sum(axis=0)
# Asia 235
# Africa 245
# Australia 255
# Europe 265
# Russia 275
# dtype: int64
#
# Index(['Asia', 'Africa', 'Australia', 'Europe', 'Russia'], dtype='object')
#
# <class 'pandas.core.series.Series'>
df2SumAxis1 = df2.sum(axis=1)
# 0 15
# 1 40
# 2 65
# 3 90
# 4 115
# 5 140
# 6 165
# 7 190
# 8 215
# 9 240
# dtype: int64
#
# RangeIndex(start=0, stop=10, step=1)
#
# <class 'pandas.core.series.Series'>
df = df2.apply(lambda x: x.div(x.sum()), axis=0)
print(df)
# Asia Africa Australia Europe Russia
# 0 0.004255 0.008163 0.011765 0.015094 0.018182
# 1 0.025532 0.028571 0.031373 0.033962 0.036364
# 2 0.046809 0.048980 0.050980 0.052830 0.054545
# 3 0.068085 0.069388 0.070588 0.071698 0.072727
# 4 0.089362 0.089796 0.090196 0.090566 0.090909
# 5 0.110638 0.110204 0.109804 0.109434 0.109091
# 6 0.131915 0.130612 0.129412 0.128302 0.127273
# 7 0.153191 0.151020 0.149020 0.147170 0.145455
# 8 0.174468 0.171429 0.168627 0.166038 0.163636
# 9 0.195745 0.191837 0.188235 0.184906 0.181818
df = df2.apply(lambda x: x.div(x.sum()), axis=1)
print(df)
# Asia Africa Australia Europe Russia
# 0 0.066667 0.133333 0.2 0.266667 0.333333
# 1 0.150000 0.175000 0.2 0.225000 0.250000
# 2 0.169231 0.184615 0.2 0.215385 0.230769
# 3 0.177778 0.188889 0.2 0.211111 0.222222
# 4 0.182609 0.191304 0.2 0.208696 0.217391
# 5 0.185714 0.192857 0.2 0.207143 0.214286
# 6 0.187879 0.193939 0.2 0.206061 0.212121
# 7 0.189474 0.194737 0.2 0.205263 0.210526
# 8 0.190698 0.195349 0.2 0.204651 0.209302
# 9 0.191667 0.195833 0.2 0.204167 0.208333
gmean
对于 df
中数据类型统统为 int
或 float
类型时,需要小心轴方向的抉择。
axis=1
-
np.exp(np.log(df2.prod(axis=1))/df2.notna().sum(1))
gmean_axis1 = np.exp(np.log(df3.prod(axis=1))/df3.notna().sum(1))
gmean_axis1.name = 'gmean'
result_axis1 = pd.concat([df3, gmean_axis1], axis=1, ignore_index=False)
# Asia Africa Australia Europe Russia gmean
# 0 1.0 2 3 4.0 5.0 2.605171
# 1 6.0 7 8 9.0 10.0 7.872567
# 2 11.0 12 13 14.0 15.0 12.922523
# 3 NaN 17 18 19.0 20.0 18.466146
# 4 21.0 22 23 24.0 25.0 22.956423
# 5 26.0 27 28 NaN 30.0 27.711121
# 6 31.0 32 33 34.0 35.0 32.969664
# 7 36.0 37 38 39.0 NaN 37.483325
# 8 41.0 42 43 44.0 45.0 42.976729
# 9 46.0 47 48 49.0 50.0 47.979156
-
df2.apply(gmean, axis=1)
from scipy.stats import gmean
gmean_axis1 = df3.apply(gmean, axis=1)
gmean_axis1.name = 'gmean'
result_axis1 = pd.concat([df3, gmean_axis1], axis=1, ignore_index=False)
# Asia Africa Australia Europe Russia gmean
# 0 1.0 2 3 4.0 5.0 2.605171
# 1 6.0 7 8 9.0 10.0 7.872567
# 2 11.0 12 13 14.0 15.0 12.922523
# 3 NaN 17 18 19.0 20.0 NaN
# 4 21.0 22 23 24.0 25.0 22.956423
# 5 26.0 27 28 NaN 30.0 NaN
# 6 31.0 32 33 34.0 35.0 32.969664
# 7 36.0 37 38 39.0 NaN NaN
# 8 41.0 42 43 44.0 45.0 42.976729
# 9 46.0 47 48 49.0 50.0 47.979156
-
df2.apply(lambda x: np.exp(np.mean(np.log(x))), axis=1)
gmean_axis1 = df3.apply(lambda x: np.exp(np.mean(np.log(x))), axis=1)
gmean_axis1.name = 'gmean'
result_axis1 = pd.concat([df3, gmean_axis1], axis=1, ignore_index=False)
# Asia Africa Australia Europe Russia gmean
# 0 1.0 2 3 4.0 5.0 2.605171
# 1 6.0 7 8 9.0 10.0 7.872567
# 2 11.0 12 13 14.0 15.0 12.922523
# 3 NaN 17 18 19.0 20.0 18.466146
# 4 21.0 22 23 24.0 25.0 22.956423
# 5 26.0 27 28 NaN 30.0 27.711121
# 6 31.0 32 33 34.0 35.0 32.969664
# 7 36.0 37 38 39.0 NaN 37.483325
# 8 41.0 42 43 44.0 45.0 42.976729
# 9 46.0 47 48 49.0 50.0 47.979156
axis=0
-
np.exp(np.log(df2.prod(axis=0))/df2.notna().sum(0))
gmean_axis0 = np.exp(np.log(df3.prod(axis=0))/df3.notna().sum(0))
gmean_axis0.name = 'gmean'
gmean_axis0 = pd.DataFrame(gmean_axis0)
gmean_axis0 = gmean_axis0.transpose()
result_axis0 = pd.concat([df3, gmean_axis0], axis=0, ignore_index=False)
# Asia Africa Australia Europe Russia
# 0 1.000000 2.000000 3.0 4.000000 5.000000
# 1 6.000000 7.000000 8.0 9.000000 10.000000
# 2 11.000000 12.000000 13.0 14.000000 15.000000
# 3 NaN 17.000000 18.0 19.000000 20.000000
# 4 21.000000 22.000000 23.0 24.000000 25.000000
# 5 26.000000 27.000000 28.0 NaN 30.000000
# 6 31.000000 32.000000 33.0 34.000000 35.000000
# 7 36.000000 37.000000 38.0 39.000000 NaN
# 8 41.000000 42.000000 43.0 44.000000 45.000000
# 9 46.000000 47.000000 48.0 49.000000 50.000000
# gmean 16.176423 7.715953 NaN 20.561997 21.256376
-
df2.apply(gmean, axis=0)
from scipy.stats import gmean
gmean_axis0 = df3.apply(gmean, axis=0)
gmean_axis0.name = 'gmean'
gmean_axis0 = pd.DataFrame(gmean_axis0)
gmean_axis0 = gmean_axis0.transpose()
result_axis0 = pd.concat([df3, gmean_axis0], axis=0, ignore_index=False)
# Asia Africa Australia Europe Russia
# 0 1.0 2.00000 3.000000 4.0 5.0
# 1 6.0 7.00000 8.000000 9.0 10.0
# 2 11.0 12.00000 13.000000 14.0 15.0
# 3 NaN 17.00000 18.000000 19.0 20.0
# 4 21.0 22.00000 23.000000 24.0 25.0
# 5 26.0 27.00000 28.000000 NaN 30.0
# 6 31.0 32.00000 33.000000 34.0 35.0
# 7 36.0 37.00000 38.000000 39.0 NaN
# 8 41.0 42.00000 43.000000 44.0 45.0
# 9 46.0 47.00000 48.000000 49.0 50.0
# gmean NaN 18.18965 19.821295 NaN NaN
-
df2.apply(lambda x: np.exp(np.mean(np.log(x))), axis=0)
gmean_axis0 = df3.apply(lambda x: np.exp(np.mean(np.log(x))), axis=0)
gmean_axis0.name = 'gmean'
gmean_axis0 = pd.DataFrame(gmean_axis0)
gmean_axis0 = gmean_axis0.transpose()
result_axis0 = pd.concat([df3, gmean_axis0], axis=0, ignore_index=False)
# Asia Africa Australia Europe Russia
# 0 1.000000 2.00000 3.000000 4.000000 5.000000
# 1 6.000000 7.00000 8.000000 9.000000 10.000000
# 2 11.000000 12.00000 13.000000 14.000000 15.000000
# 3 NaN 17.00000 18.000000 19.000000 20.000000
# 4 21.000000 22.00000 23.000000 24.000000 25.000000
# 5 26.000000 27.00000 28.000000 NaN 30.000000
# 6 31.000000 32.00000 33.000000 34.000000 35.000000
# 7 36.000000 37.00000 38.000000 39.000000 NaN
# 8 41.000000 42.00000 43.000000 44.000000 45.000000
# 9 46.000000 47.00000 48.000000 49.000000 50.000000
# gmean 16.176423 18.18965 19.821295 20.561997 21.256376
subtraction
- 对于
df
中存在str
或其他类型的数据时,要做数学运算时需注意对int
或者float
类型的数据进行针对性筛选 30。
df1['sub'] = df1['country2'] - df1['country1']
# data_provider indicator unit year country1 country2 sub
# 0 prov_a ind_a EUR 2017 1 4 3
# 1 prov_a ind_a EUR 2018 2 5 3
# 2 prov_a ind_a EUR 2019 3 6 3
# 3 prov_a ind_b EUR 2017 2 40 38
# 4 prov_a ind_b EUR 2018 4 50 46
# 5 prov_a ind_b EUR 2019 6 60 54
df1['country1'] = df1['country2'] - df1['country1']
# data_provider indicator unit year country1 country2 sub
# 0 prov_a ind_a EUR 2017 3 4 3
# 1 prov_a ind_a EUR 2018 3 5 3
# 2 prov_a ind_a EUR 2019 3 6 3
# 3 prov_a ind_b EUR 2017 38 40 38
# 4 prov_a ind_b EUR 2018 46 50 46
# 5 prov_a ind_b EUR 2019 54 60 54
SingleIndex
对上文给出的原始数据进行简单的单层索引的构建,如下所示:
df1SingleIndex = df1.pivot_table(
index=['indicator'],
columns='year'
)
# country1 country2
# year 2017 2018 2019 2017 2018 2019
# indicator
# ind_a 1 2 3 4 5 6
# ind_b 2 4 6 40 50 60
#
# Index(['ind_a', 'ind_b'], dtype='object', name='indicator')
#
# MultiIndex([('country1', '2017'),
# ('country1', '2018'),
# ('country1', '2019'),
# ('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019')],
# names=[None, 'year'])
对于多层索引下的 Series
数据的筛选,可参考 39、40。
divide
对于 df
中数据类型统统为 int
或 float
类型时,需要小心轴方向的抉择。
idx = pd.IndexSlice
ind_a_country1 = df1SingleIndex.loc[['ind_a'], idx['country1', :]]
# country1
# year 2017 2018 2019
# indicator
# ind_a 1 2 3
#
# Index(['ind_a'], dtype='object', name='indicator')
#
# MultiIndex([('country1', '2017'),
# ('country1', '2018'),
# ('country1', '2019')],
# names=[None, 'year'])
ind_a_2018 = df1SingleIndex.loc[['ind_a'], idx[:, '2018']]
# country1 country2
# year 2018 2018
# indicator
# ind_a 2 5
#
# Index(['ind_a'], dtype='object', name='indicator')
#
# MultiIndex([('country1', '2018'),
# ('country2', '2018')],
# names=[None, 'year'])
df1_2018 = df1SingleIndex.loc[:, idx[:, '2018']]
# country1 country2
# year 2018 2018
# indicator
# ind_a 2 5
# ind_b 4 50
#
# Index(['ind_a', 'ind_b'], dtype='object', name='indicator')
#
# MultiIndex([('country1', '2018'),
# ('country2', '2018')],
# names=[None, 'year'])
df1_2019 = df1SingleIndex.loc[:, idx[:, '2019']]
# country1 country2
# year 2019 2019
# indicator
# ind_a 3 6
# ind_b 6 60
#
# Index(['ind_a', 'ind_b'], dtype='object', name='indicator')
#
# MultiIndex([('country1', '2019'),
# ('country2', '2019')],
# names=[None, 'year'])
df1_2018Div2019 = (df1_2018.div(df1_2019.to_numpy()).rename(columns={'2018': '2018/2019'}))
# country1 country2
# year 2018/2019 2018/2019
# indicator
# ind_a 0.666667 0.833333
# ind_b 0.666667 0.833333
#
# Index(['ind_a', 'ind_b'], dtype='object', name='indicator')
#
# MultiIndex([('country1', '2018/2019'),
# ('country2', '2018/2019')],
# names=[None, 'year'])
df1_out = pd.concat([df1SingleIndex, df1_2018Div2019], axis=1)
# country1 country2 country1 country2
# year 2017 2018 2019 2017 2018 2019 2018/2019 2018/2019
# indicator
# ind_a 1 2 3 4 5 6 0.666667 0.833333
# ind_b 2 4 6 40 50 60 0.666667 0.833333
#
# Index(['ind_a', 'ind_b'], dtype='object', name='indicator')
#
# MultiIndex([('country1', '2017'),
# ('country1', '2018'),
# ('country1', '2019'),
# ('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019'),
# ('country1', '2018/2019'),
# ('country2', '2018/2019')],
# names=[None, 'year'])
# Index(['ind_a', 'ind_b'], dtype='object', name='indicator')
MultiIndex
对上文给出的原始数据进行简单的多层索引的构建,如下所示:
df1MultiIndex = df1.pivot_table(
index=['data_provider', 'indicator', 'unit'],
columns='year'
)
# country1 country2
# year 2017 2018 2019 2017 2018 2019
# data_provider indicator unit
# prov_a ind_a EUR 1 2 3 4 5 6
# ind_b EUR 2 4 6 40 50 60
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_a', 'ind_b', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
#
# MultiIndex([('country1', '2017'),
# ('country1', '2018'),
# ('country1', '2019'),
# ('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019')],
# names=[None, 'year'])
对于多层索引下的 Series
数据的筛选,可参考 39、40。
divide
对于 df
中数据类型统统为 str
或 float
类型时,需要小心轴方向的抉择。
idx = pd.IndexSlice
ind_a_country1 = df1MultiIndex.loc[idx[:, 'ind_a'], idx['country1', :]]
# country1
# year 2017 2018 2019
# data_provider indicator unit
# prov_a ind_a EUR 1 2 3
#
# MultiIndex([('prov_a', 'ind_a', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
#
# MultiIndex([('country1', '2017'),
# ('country1', '2018'),
# ('country1', '2019')],
# names=[None, 'year'])
ind_a_2018 = df1MultiIndex.loc[idx[:, 'ind_a'], idx[:, '2018']]
# country1 country2
# year 2018 2018
# data_provider indicator unit
# prov_a ind_a EUR 2 5
#
# MultiIndex([('prov_a', 'ind_a', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
#
# MultiIndex([('country1', '2018'),
# ('country2', '2018')],
# names=[None, 'year'])
df1_2018 = df1MultiIndex.loc[:, idx[:, '2018']]
# country1 country2
# year 2018 2018
# data_provider indicator unit
# prov_a ind_a EUR 2 5
# ind_b EUR 4 50
df1_2019 = df1MultiIndex.loc[:, idx[:, '2019']]
# country1 country2
# year 2019 2019
# data_provider indicator unit
# prov_a ind_a EUR 3 6
# ind_b EUR 6 60
df1_2018Div2019 = (df1_2018.div(df1_2019.to_numpy()).rename(columns={'2018': '2018/2019'}))
# country1 country2
# year 2018/2019 2018/2019
# data_provider indicator unit
# prov_a ind_a EUR 0.666667 0.833333
# ind_b EUR 0.666667 0.833333
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_a', 'ind_b', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
#
# MultiIndex([('country1', '2018/2019'),
# ('country2', '2018/2019')],
# names=[None, 'year'])
df1_out = pd.concat([df1MultiIndex, df1_2018Div2019], axis=1)
# country1 ... country2 country1 country2
# year 2017 2018 2019 ... 2019 2018/2019 2018/2019
# data_provider indicator unit ...
# prov_a ind_a EUR 1 2 3 ... 6 0.666667 0.833333
# ind_b EUR 2 4 6 ... 60 0.666667 0.833333
#
# [2 rows x 8 columns]
#
# MultiIndex([('prov_a', 'ind_a', 'EUR'),
# ('prov_a', 'ind_b', 'EUR')],
# names=['data_provider', 'indicator', 'unit'])
#
# MultiIndex([('country1', '2017'),
# ('country1', '2018'),
# ('country1', '2019'),
# ('country2', '2017'),
# ('country2', '2018'),
# ('country2', '2019'),
# ('country1', '2018/2019'),
# ('country2', '2018/2019')],
# names=[None, 'year'])
筛选
请参考同一系列的另一篇文章 41。
统计运算
参考这里 42
极值
以下面的 df
为例,
import io
data = '''
day code pe_ratio pb_ratio market_cap
2016-03-01 002027.XSHE 34.5035 25.4271 1169.3248
2016-03-02 002027.XSHE 36.0459 26.5638 1221.5966
2016-03-03 002027.XSHE 35.7059 26.3132 1210.0721
2016-03-04 002027.XSHE 36.1309 26.6264 1224.4777
2016-03-07 002027.XSHE 37.1025 27.3424 1257.4049
2016-03-08 002027.XSHE 36.7139 27.0560 1244.2340
2016-03-09 002027.XSHE 35.0986 25.8657 1189.4926
'''
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
print(df)
# day code pe_ratio pb_ratio market_cap
# 0 2016-03-01 002027.XSHE 34.5035 25.4271 1169.3248
# 1 2016-03-02 002027.XSHE 36.0459 26.5638 1221.5966
# 2 2016-03-03 002027.XSHE 35.7059 26.3132 1210.0721
# 3 2016-03-04 002027.XSHE 36.1309 26.6264 1224.4777
# 4 2016-03-07 002027.XSHE 37.1025 27.3424 1257.4049
# 5 2016-03-08 002027.XSHE 36.7139 27.0560 1244.2340
# 6 2016-03-09 002027.XSHE 35.0986 25.8657 1189.4926
方法
先看这两个函数的释义:
pandas.DataFrame.idxmax
Return index of first occurrence of maximum over requested axis.
pandas.DataFrame.idxmin
Return index of first occurrence of minimum over requested axis.
它们的意思是返回最大值、最小值的索引。
函数 | 说明 |
---|---|
df.idxmax(axis, skipna) df.idxmin(axis, skipna) | 找出 df 中最大值、最小值所处索引。可以通过 axis 指定轴方向。可以通过 skipna 决定是否忽略 NaN 值。 |
df.max(axis=None, skipna=None, level=None, numeric_only=None) df.min(axis=None, skipna=None, level=None, numeric_only=None) | 找出 df 中的最大值、最小值。参数具体使用方法参照官方示例。 |
示例
极值
idxmax = df['market_cap'].idxmax()
# 4
idxmin = df['market_cap'].idxmin()
# 0
先获取最大值和最小值的索引,分别保存为 idxmax
和 idxmin
,idxmax
和 idxmin
对应了行索引,再打印出这两行即可。
day code pe_ratio pb_ratio market_cap
4 2016-03-07 002027.XSHE 37.1025 27.3424 1257.4049
0 2016-03-01 002027.XSHE 34.5035 25.4271 1169.3248
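上面两行可以用 `df.loc` 按索引取出,下面给出一个自包含的简化示意(仅保留演示所需的两列):

```python
import pandas as pd

# 简化版数据,仅为演示按索引取出极值所在行
df = pd.DataFrame({'day': ['2016-03-01', '2016-03-07', '2016-03-09'],
                   'market_cap': [1169.3248, 1257.4049, 1189.4926]})

idxmax = df['market_cap'].idxmax()
idxmin = df['market_cap'].idxmin()
print(df.loc[[idxmax, idxmin]])
# 依次输出索引 1(最大值所在行)和索引 0(最小值所在行)
```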
数据比对
参考这里 45
差别
df.compare
df.compare()
和 s.compare()
方法使您可以分别比较两个 DataFrame
或 Series
,并总结它们之间的差异。V1.1.0
中添加了此功能。
局限性是只能比较两个同维且行、列标签值相同的 df
或 Series
的差别。
语法如下:
DataFrame.compare(other, align_axis=1, keep_shape=False, keep_equal=False)
其中:
参数 | 说明 |
---|---|
other | 被对比的数据 |
align_axis=1 | 差异堆叠在列/行上 |
keep_shape=False | 不保留相等的值 |
keep_equal=False | 不保留所有原始行和列 |
示例
以以下 dataframe
为例,对如何找出 df
或 Series
间的差别进行说明,
df1 = pd.DataFrame(
{
"col1": ["a", "a", "b", "b", "a"],
"col2": [1.0, 2.0, 3.0, np.nan, 5.0],
"col3": [1.0, 2.0, 3.0, 4.0, 5.0]
},
columns=["col1", "col2", "col3"],
)
# col1 col2 col3
# 0 a 1.0 1.0
# 1 a 2.0 2.0
# 2 b 3.0 3.0
# 3 b NaN 4.0
# 4 a 5.0 5.0
df2 = df1.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0
# col1 col2 col3
# 0 c 1.0 1.0
# 1 a 2.0 2.0
# 2 b 3.0 4.0
# 3 b NaN 4.0
# 4 a 5.0 5.0
df3 = df2.copy()
df3 = df3.iloc[:3]
df3.loc[2, 'col1'] = 'e'
df3.loc[1, 'col3'] = 6.0
# col1 col2 col3
# 0 c 1.0 1.0
# 1 a 2.0 6.0
# 2 e 3.0 4.0
同维
compare12 = df1.compare(df2, keep_equal=False, keep_shape=False)
# col1 col2 col3
# self other self other self other
# 0 a c NaN NaN NaN NaN
# 2 NaN NaN NaN NaN 3.0 4.0
# 3 NaN NaN NaN NaN NaN NaN
#
# MultiIndex([('col1', 'self'),
# ('col1', 'other'),
# ('col2', 'self'),
# ('col2', 'other'),
# ('col3', 'self'),
# ('col3', 'other')],
# )
#
# Int64Index([0, 2, 3], dtype='int64')
compare21 = df2.compare(df1)
# col1 col2 col3
# self other self other self other
# 0 c a NaN NaN NaN NaN
# 2 NaN NaN NaN NaN 4.0 3.0
# 3 NaN NaN NaN NaN NaN NaN
#
# MultiIndex([('col1', 'self'),
# ('col1', 'other'),
# ('col2', 'self'),
# ('col2', 'other'),
# ('col3', 'self'),
# ('col3', 'other')],
# )
# Int64Index([0, 2, 3], dtype='int64')
-
align_axis
将差异堆叠在行上:
compare12 = df1.compare(df2, align_axis=0)
# col1 col2 col3
# 0 self a NaN NaN
# other c NaN NaN
# 2 self NaN NaN 3.0
# other NaN NaN 4.0
# 3 self NaN NaN NaN
# other NaN NaN NaN
#
# Index(['col1', 'col2', 'col3'], dtype='object')
#
# MultiIndex([(0, 'self'),
# (0, 'other'),
# (2, 'self'),
# (2, 'other'),
# (3, 'self'),
# (3, 'other')],
# )
compare21 = df2.compare(df1, align_axis=0)
# col1 col2 col3
# 0 self c NaN NaN
# other a NaN NaN
# 2 self NaN NaN 4.0
# other NaN NaN 3.0
# 3 self NaN NaN NaN
# other NaN NaN NaN
将差异堆叠在列上:
compare12 = df1.compare(df2, align_axis=1)
# col1 col2 col3
# self other self other self other
# 0 a c NaN NaN NaN NaN
# 2 NaN NaN NaN NaN 3.0 4.0
# 3 NaN NaN NaN NaN NaN NaN
#
# Int64Index([0, 2, 3], dtype='int64')
#
# MultiIndex([('col1', 'self'),
# ('col1', 'other'),
# ('col2', 'self'),
# ('col2', 'other'),
# ('col3', 'self'),
# ('col3', 'other')],
# )
compare21 = df2.compare(df1, align_axis=1)
# col1 col2 col3
# self other self other self other
# 0 c a NaN NaN NaN NaN
# 2 NaN NaN NaN NaN 4.0 3.0
# 3 NaN NaN NaN NaN NaN NaN
-
keep_shape
保留所有原始行和列,
compare12 = df1.compare(df2, keep_shape=True)
# col1 col2 col3
# self other self other self other
# 0 a c NaN NaN NaN NaN
# 1 NaN NaN NaN NaN NaN NaN
# 2 NaN NaN NaN NaN 3.0 4.0
# 3 NaN NaN NaN NaN NaN NaN
# 4 NaN NaN NaN NaN NaN NaN
#
# MultiIndex([('col1', 'self'),
# ('col1', 'other'),
# ('col2', 'self'),
# ('col2', 'other'),
# ('col3', 'self'),
# ('col3', 'other')],
# )
#
# RangeIndex(start=0, stop=5, step=1)
不同维
compare23 = df2.compare(df3)
# File "/Applications/miniconda3/lib/python3.8/site-packages/pandas/core/ops/__init__.py", line 289, in align_method_FRAME
# raise ValueError(
# ValueError: Can only compare identically-labeled DataFrame objects
df.merge
pd.merge
和 reduce
结合的方法使您可以同时比较多个 DataFrame
并总结它们之间的差异 46、47、48、49。
好处是不在意 df
的维度,但是行、列标签应当相同。
示例
以以下 dataframe
为例,对如何求多个 df
的交集进行说明,
df1 = pd.DataFrame({
"品类":["蔬菜","蔬菜","水果","甜点"],
"商品":["白菜","土豆","苹果","蛋糕"],
"销量":[10, 20, 30, 40]
})
# 品类 商品 销量
# 0 蔬菜 白菜 10
# 1 蔬菜 土豆 20
# 2 水果 苹果 30
# 3 甜点 蛋糕 40
df2 = pd.DataFrame({
"品类":["蔬菜","蔬菜","水果","水产", "甜点"],
"商品":["白菜","土豆","苹果","河虾", "蛋糕"],
"销售额":[100, 200, 300, 400, 500]
})
# 品类 商品 销售额
# 0 蔬菜 白菜 100
# 1 蔬菜 土豆 200
# 2 水果 苹果 300
# 3 水产 河虾 400
# 4 甜点 蛋糕 500
df3 = pd.DataFrame({
"品类":["蔬菜","烧烤","蔬菜","甜点"],
"商品":["白菜","羊肉串","土豆","蛋糕"],
})
# 品类 商品
# 0 蔬菜 白菜
# 1 烧烤 羊肉串
# 2 蔬菜 土豆
# 3 甜点 蛋糕
- 多个
df
的交集
from functools import reduce
dfs = [df1, df2, df3]
df_intersec = reduce(lambda left, right: pd.merge(left, right, how='inner'), dfs)
# 品类 商品 销量 销售额
# 0 蔬菜 白菜 10 100
# 1 蔬菜 土豆 20 200
# 2 甜点 蛋糕 40 500
- 2个
df
的指定列的交集
df_intersec = pd.merge(df1, df2, on='商品', how='inner')
# 品类_x 商品 销量 品类_y 销售额
# 0 蔬菜 白菜 10 蔬菜 100
# 1 蔬菜 土豆 20 蔬菜 200
# 2 水果 苹果 30 水果 300
# 3 甜点 蛋糕 40 甜点 500
df_intersec = pd.merge(df1, df2, on='品类', how='inner')
# 品类 商品_x 销量 商品_y 销售额
# 0 蔬菜 白菜 10 白菜 100
# 1 蔬菜 白菜 10 土豆 200
# 2 蔬菜 土豆 20 白菜 100
# 3 蔬菜 土豆 20 土豆 200
# 4 水果 苹果 30 苹果 300
# 5 甜点 蛋糕 40 蛋糕 500
df_intersec = pd.merge(df1, df2, on=['品类', '商品'], how='inner')
# 品类 商品 销量 销售额
# 0 蔬菜 白菜 10 100
# 1 蔬菜 土豆 20 200
# 2 水果 苹果 30 300
# 3 甜点 蛋糕 40 500
intersection
使用 set
的 intersection
的方法使您可以同时比较多个 Series
并总结它们之间的差异 50。
示例
以以下 dataframe
为例,对如何求多个 df
的指定列的交集进行说明,
df1 = pd.DataFrame({
"品类":["蔬菜","蔬菜","水果","甜点"],
"商品":["白菜","土豆","苹果","蛋糕"],
"销量":[10, 20, 30, 40]
})
# 品类 商品 销量
# 0 蔬菜 白菜 10
# 1 蔬菜 土豆 20
# 2 水果 苹果 30
# 3 甜点 蛋糕 40
df2 = pd.DataFrame({
"品类":["蔬菜","蔬菜","水果","水产", "甜点"],
"商品":["白菜","土豆","苹果","河虾", "蛋糕"],
"销售额":[100, 200, 300, 400, 500]
})
# 品类 商品 销售额
# 0 蔬菜 白菜 100
# 1 蔬菜 土豆 200
# 2 水果 苹果 300
# 3 水产 河虾 400
# 4 甜点 蛋糕 500
df3 = pd.DataFrame({
"品类":["蔬菜","烧烤","蔬菜","甜点"],
"商品":["白菜","羊肉串","土豆","蛋糕"],
})
# 品类 商品
# 0 蔬菜 白菜
# 1 烧烤 羊肉串
# 2 蔬菜 土豆
# 3 甜点 蛋糕
- 求以上
df
的 列标签为商品
的列的交集,
dfs = [df1['商品'], df2['商品'], df3['商品']]
df_intersec = set(dfs[0]).intersection(*dfs[1:])
# {'土豆', '蛋糕', '白菜'}
需要注意的是,每次求交集的结果中元素的排列顺序可能会变化( set 是无序的)。
相同
df.equals
还可以使用 df1.equals(df2)
来对比两个数据是否一致,测试两个对象是否包含相同的元素。
此功能允许将两个 Series
或 DataFrame
相互比较,以查看它们是否具有相同的形状和元素。 相同位置的 NaN
被认为是相等的。 列标题不必具有相同的类型,但是列中的元素必须具有相同的 dtype
。
示例
以以下 dataframe
为例,对如何比对 df
或 Series
进行说明,
df = pd.DataFrame({1: [10], 2: [20]})
# print(df)
# 1 2
# 0 10 20
#
# print(df.columns.dtype)
# int64
#
# print(df.values.dtype)
# int64
diff_col_dtype = pd.DataFrame({1.0: [10], 2.0: [20]})
# print(diff_col_dype)
# 1.0 2.0
# 0 10 20
#
# print(diff_col_dtype.columns.dtype)
# float64
#
# print(diff_col_dtype.values.dtype)
# int64
diff_data_dtype = pd.DataFrame({1: [10.0], 2: [20.0]})
# print(diff_data_dtype)
# 1 2
# 0 10.0 20.0
#
# print(diff_data_dtype.columns.dtype)
# int64
#
# print(diff_data_dtype.values.dtype)
# float64
-
df
和其本身的元素和列标签具有相同的类型和值,它们将返回True
。
print(df.equals(df))
# True
-
df
和diff_col_dtype
具有相同的元素类型和值,但列标签具有不同的类型,它们仍将返回True
。
print(df.equals(diff_col_dtype))
# True
-
df
和diff_data_dtype
为其元素的相同值具有不同的类型,即使它们的列标签具有相同的值和类型,它们也将返回False
。
print(df.equals(diff_data_dtype))
# False
对齐
df.align
align
不会组合两个数据帧,但是可以将它们对齐。
如果要确保两个数据帧之间的行和/或列的排列相同,而不改变两个数据帧中包含的任何数据,请使用 DataFrame.align()
。Series
也支持此方法。
示例
以以下的2个 df
为例对如何对齐 df
进行说明:
df1 = pd.DataFrame([[1,2,3,4], [6,7,8,9]],
columns=['D', 'B', 'E', 'A'],
index=[1,2])
# D B E A
# 1 1 2 3 4
# 2 6 7 8 9
df2 = pd.DataFrame([[10,20,30,40], [60,70,80,90], [600,700,800,900]],
columns=['A', 'B', 'C', 'D'],
index=[2,3,4])
# A B C D
# 2 10 20 30 40
# 3 60 70 80 90
# 4 600 700 800 900
- Align on columns (`axis=1`) and perform an outer join on the column labels (`join='outer'`):
a1, a2 = df1.align(df2, join='outer', axis=1)
print(a1)
# A B C D E
# 1 4 2 NaN 1 3
# 2 9 7 NaN 6 8
print(a2)
# A B C D E
# 2 10 20 30 40 NaN
# 3 60 70 80 90 NaN
# 4 600 700 800 900 NaN
A few things to note here:
- The columns of `df1` were reordered so that they align with the columns of `df2`.
- A column labeled `C` was added to `df1`, and a column labeled `E` was added to `df2`; both were filled with `NaN`. This is because we performed an outer join on the column labels.
- None of the values inside either DataFrame changed.
- The row labels were not aligned: `df2` has rows `3` and `4` that `df1` lacks. This is because we asked to align only the columns (`axis=1`).
- Align on both rows and columns (`axis=None`), but change the `join` parameter to `right`:
a1, a2 = df1.align(df2, join='right', axis=None)
print(a1)
# A B C D
# 2 9.0 7.0 NaN 6.0
# 3 NaN NaN NaN NaN
# 4 NaN NaN NaN NaN
print(a2)
# A B C D
# 2 10 20 30 40
# 3 60 70 80 90
# 4 600 700 800 900
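`align` also accepts a `fill_value` to use instead of `NaN` for the positions introduced by alignment. A minimal sketch with the same `df1`/`df2`:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4], [6, 7, 8, 9]],
                   columns=['D', 'B', 'E', 'A'], index=[1, 2])
df2 = pd.DataFrame([[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
                   columns=['A', 'B', 'C', 'D'], index=[2, 3, 4])

# fill_value=0 replaces the NaN that the outer column join would otherwise create
a1, a2 = df1.align(df2, join='outer', axis=1, fill_value=0)
print(a1['C'].tolist())   # the column added to df1, filled with 0
print(a2['E'].tolist())   # the column added to df2, filled with 0
```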
Combining
`pandas` provides many powerful features for combining `pd.Series` and `pd.DataFrame` objects 51.
The following `df`s and `Series` are used throughout this section to illustrate how to combine `df`s:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
# A B C D
# 0 A0 B0 C0 D0
# 1 A1 B1 C1 D1
# 2 A2 B2 C2 D2
# 3 A3 B3 C3 D3
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
# A B C D
# 4 A4 B4 C4 D4
# 5 A5 B5 C5 D5
# 6 A6 B6 C6 D6
# 7 A7 B7 C7 D7
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])
# A B C D
# 8 A8 B8 C8 D8
# 9 A9 B9 C9 D9
# 10 A10 B10 C10 D10
# 11 A11 B11 C11 D11
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
'D': ['D2', 'D3', 'D6', 'D7'],
'F': ['F2', 'F3', 'F6', 'F7']},
index=[2, 3, 6, 7])
# B D F
# 2 B2 D2 F2
# 3 B3 D3 F3
# 6 B6 D6 F6
# 7 B7 D7 F7
df5 = pd.DataFrame({'A': ['A2', 'A3', 'A4', 'A7'],
'B': ['B2', 'B3', 'B4', 'B7'],
'C': ['C2', 'C3', 'C4', 'C7'],
'D': ['D2', 'D3', 'D4', 'D7']},
index=[2, 3, 4, 7])
# A B C D
# 2 A2 B2 C2 D2
# 3 A3 B3 C3 D3
# 4 A4 B4 C4 D4
# 7 A7 B7 C7 D7
s1 = pd.Series([0, 1, 2, 3], name='foo')
s2 = pd.Series([0, 1, 2, 3])
s3 = pd.Series([0, 1, 4, 5])
pd.concat
Usage
pd.concat(
objs: Iterable[NDFrame] | Mapping[Hashable, NDFrame],
axis: Axis = 0,
join: str = "outer",
ignore_index: bool = False,
keys=None,
levels=None,
names=None,
verify_integrity: bool = False,
sort: bool = False,
copy: bool = True
)
Parameter | Description |
---|---|
objs | The objects in objs must be pandas objects (`Series` or `DataFrame`); to concatenate a plain list, convert it first with `pd.Series(list1)`. For a simple vertical concatenation of lists you can just use `+` 52 53. |
axis | The axis to concatenate along. `axis=0` concatenates vertically (more rows, same columns); `axis=1` concatenates horizontally (more columns, same rows). |
join | {'inner', 'outer'}: how to handle the indexes on the other axis. `outer` takes the union, `inner` takes the intersection. |
ignore_index | If `True`, ignore the original index values and relabel as 0, 1, …, n-1. |
copy | Whether to copy the data. |
keys | Keys for the outermost level of the resulting hierarchical index. |
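For instance, `ignore_index=True` discards the original labels and renumbers the result. A minimal sketch with two small hypothetical frames:

```python
import pandas as pd

a = pd.DataFrame({'v': [1, 2]}, index=[10, 11])
b = pd.DataFrame({'v': [3, 4]}, index=[10, 11])

# Without ignore_index the result would keep the duplicated labels 10, 11, 10, 11
result = pd.concat([a, b], ignore_index=True)
print(result.index.tolist())
# [0, 1, 2, 3]
```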
Example
1. Concatenating multiple DataFrames
`df1`, `df2`, and `df3` define the same column names but different indexes. Putting them into `frames` builds a list of `df`s, and passing that list to `concat` combines them.
In code:
frames = [df1, df2, df3]
result = pd.concat(frames)
# A B C D
# 0 A0 B0 C0 D0
# 1 A1 B1 C1 D1
# 2 A2 B2 C2 D2
# 3 A3 B3 C3 D3
# 4 A4 B4 C4 D4
# 5 A5 B5 C5 D5
# 6 A6 B6 C6 D6
# 7 A7 B7 C7 D7
# 8 A8 B8 C8 D8
# 9 A9 B9 C9 D9
# 10 A10 B10 C10 D10
# 11 A11 B11 C11 D11
columns = result.columns
# Index(['A', 'B', 'C', 'D'], dtype='object')
index = result.index
# Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')
Shown as a diagram.
2. Concatenating multiple DataFrames with keys
Using `keys`, you can assign a key to each frame in `frames`, and then select a particular frame via its outer key.
In code:
frames = [df1, df2, df3]
result = pd.concat(frames, keys=['x', 'y', 'z'])
# A B C D
# x 0 A0 B0 C0 D0
# 1 A1 B1 C1 D1
# 2 A2 B2 C2 D2
# 3 A3 B3 C3 D3
# y 4 A4 B4 C4 D4
# 5 A5 B5 C5 D5
# 6 A6 B6 C6 D6
# 7 A7 B7 C7 D7
# z 8 A8 B8 C8 D8
# 9 A9 B9 C9 D9
# 10 A10 B10 C10 D10
# 11 A11 B11 C11 D11
columns = result.columns
# Index(['A', 'B', 'C', 'D'], dtype='object')
index = result.index
# MultiIndex([('x', 0),
# ('x', 1),
# ('x', 2),
# ('x', 3),
# ('y', 4),
# ('y', 5),
# ('y', 6),
# ('y', 7),
# ('z', 8),
# ('z', 9),
# ('z', 10),
# ('z', 11)],
# )
newframe = result.loc['y']
# A B C D
# 4 A4 B4 C4 D4
# 5 A5 B5 C5 D5
# 6 A6 B6 C6 D6
# 7 A7 B7 C7 D7
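As a side note, passing a dict to `concat` is equivalent to passing `keys`. A small sketch with two hypothetical frames:

```python
import pandas as pd

df_x = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
df_y = pd.DataFrame({'A': ['A4', 'A5']}, index=[4, 5])

# The dict keys become the outer level of the MultiIndex
result = pd.concat({'x': df_x, 'y': df_y})
print(result.loc['y', 'A'].tolist())
# ['A4', 'A5']
```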
Shown as a diagram.
The concatenation axis defaults to `0`, i.e. rows are concatenated. To concatenate by columns instead, specify `axis=1`.
In code:
frames = [df1, df2, df3]
result = pd.concat(frames, axis=1, keys=['x', 'y', 'z'], sort=False)
# x y z
# A B C D A B C D A B C D
# 0 A0 B0 C0 D0 NaN NaN NaN NaN NaN NaN NaN NaN
# 1 A1 B1 C1 D1 NaN NaN NaN NaN NaN NaN NaN NaN
# 2 A2 B2 C2 D2 NaN NaN NaN NaN NaN NaN NaN NaN
# 3 A3 B3 C3 D3 NaN NaN NaN NaN NaN NaN NaN NaN
# 4 NaN NaN NaN NaN A4 B4 C4 D4 NaN NaN NaN NaN
# 5 NaN NaN NaN NaN A5 B5 C5 D5 NaN NaN NaN NaN
# 6 NaN NaN NaN NaN A6 B6 C6 D6 NaN NaN NaN NaN
# 7 NaN NaN NaN NaN A7 B7 C7 D7 NaN NaN NaN NaN
# 8 NaN NaN NaN NaN NaN NaN NaN NaN A8 B8 C8 D8
# 9 NaN NaN NaN NaN NaN NaN NaN NaN A9 B9 C9 D9
# 10 NaN NaN NaN NaN NaN NaN NaN NaN A10 B10 C10 D10
# 11 NaN NaN NaN NaN NaN NaN NaN NaN A11 B11 C11 D11
columns = result.columns
# MultiIndex([('x', 'A'),
# ('x', 'B'),
# ('x', 'C'),
# ('x', 'D'),
# ('y', 'A'),
# ('y', 'B'),
# ('y', 'C'),
# ('y', 'D'),
# ('z', 'A'),
# ('z', 'B'),
# ('z', 'C'),
# ('z', 'D')],
# )
index = result.index
# Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')
newframe = result.loc[:, 'y']
# A B C D
# 0 NaN NaN NaN NaN
# 1 NaN NaN NaN NaN
# 2 NaN NaN NaN NaN
# 3 NaN NaN NaN NaN
# 4 A4 B4 C4 D4
# 5 A5 B5 C5 D5
# 6 A6 B6 C6 D6
# 7 A7 B7 C7 D7
# 8 NaN NaN NaN NaN
# 9 NaN NaN NaN NaN
# 10 NaN NaN NaN NaN
# 11 NaN NaN NaN NaN
The default join mode is `join='outer'`: after concatenation, positions where an index value is missing are filled with `NaN`. With `join='inner'`, only rows whose index values appear in every frame are kept.
frames = [df1, df4]
result = pd.concat(frames, axis=1)
# A B C D B D F
# 0 A0 B0 C0 D0 NaN NaN NaN
# 1 A1 B1 C1 D1 NaN NaN NaN
# 2 A2 B2 C2 D2 B2 D2 F2
# 3 A3 B3 C3 D3 B3 D3 F3
# 6 NaN NaN NaN NaN B6 D6 F6
# 7 NaN NaN NaN NaN B7 D7 F7
columns = result.columns
# Index(['A', 'B', 'C', 'D', 'B', 'D', 'F'], dtype='object')
index = result.index
# Int64Index([0, 1, 2, 3, 6, 7], dtype='int64')
result = pd.concat([df1, df4], axis=1, join='inner')
# A B C D B D F
# 2 A2 B2 C2 D2 B2 D2 F2
# 3 A3 B3 C3 D3 B3 D3 F3
columns = result.columns
# Index(['A', 'B', 'C', 'D', 'B', 'D', 'F'], dtype='object')
index = result.index
# Int64Index([2, 3], dtype='int64')
Shown as a diagram.
If you want to keep the index of one of the original frames after concatenating, you can use `reindex`.
result = pd.concat(frames, axis=1).reindex(df1.index)
# A B C D B D F
# 0 A0 B0 C0 D0 NaN NaN NaN
# 1 A1 B1 C1 D1 NaN NaN NaN
# 2 A2 B2 C2 D2 B2 D2 F2
# 3 A3 B3 C3 D3 B3 D3 F3
columns = result.columns
# Index(['A', 'B', 'C', 'D', 'B', 'D', 'F'], dtype='object')
index = result.index
# Int64Index([0, 1, 2, 3], dtype='int64')
3. Concatenation involving Series
A `df` and a `Series` can also be concatenated.
result = pd.concat([df1, s1], axis=1)
#     A   B   C   D  foo
# 0  A0  B0  C0  D0    0
# 1  A1  B1  C1  D1    1
# 2  A2  B2  C2  D2    2
# 3  A3  B3  C3  D3    3
columns = result.columns
# Index(['A', 'B', 'C', 'D', 'foo'], dtype='object')
index = result.index
# Int64Index([0, 1, 2, 3], dtype='int64')
With multiple `Series`, `concat` can assign column names via `keys`:
result = pd.concat([s1, s2, s3], axis=1, keys=['red', 'blue', 'yellow'])
# red blue yellow
# 0 0 0 0
# 1 1 1 1
# 2 2 2 4
# 3 3 3 5
columns = result.columns
# Index(['red', 'blue', 'yellow'], dtype='object')
index = result.index
# RangeIndex(start=0, stop=4, step=1)
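The `names` parameter from the signature above labels the levels of the resulting MultiIndex. A small sketch reusing `s1` and `s2`:

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], name='foo')
s2 = pd.Series([0, 1, 2, 3])

# keys build the outer level; names label both index levels
result = pd.concat([s1, s2], keys=['a', 'b'], names=['group', 'row'])
print(list(result.index.names))
# ['group', 'row']
print(result.loc['a'].tolist())
# [0, 1, 2, 3]
```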
pd.merge
The closest analogue to a `df` is a database table, and `merge` lets you combine `df`s with database-style join operations.
Usage
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
Parameter | Description |
---|---|
left | `left` and `right` are the two `df`s or `Series` to merge. |
right | `left` and `right` are the two `df`s or `Series` to merge. |
on | Column or index level name(s) to join on; must be present in both objects. |
left_on | Column or index level name(s) from the left object to join on. |
right_on | Column or index level name(s) from the right object to join on. |
left_index | Use the index of the left object as its join key(s). |
right_index | Use the index of the right object as its join key(s). |
how | Join type: `left`, `right`, `outer`, or `inner`. Defaults to `inner`. |
sort | Whether to sort the result by the join keys. |
suffixes | Suffixes applied to overlapping column names. |
copy | Whether to copy the data. |
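The `validate` parameter in the signature above can guard against unexpected duplicate keys: for example, `validate='one_to_one'` raises `MergeError` if either side repeats a join key. A minimal sketch with hypothetical frames:

```python
import pandas as pd

left = pd.DataFrame({'k': ['K0', 'K1'], 'a': [1, 2]})
right = pd.DataFrame({'k': ['K0', 'K0'], 'b': [3, 4]})   # K0 appears twice

try:
    pd.merge(left, right, on='k', validate='one_to_one')
except pd.errors.MergeError:
    print('merge rejected: right side is not unique on k')
```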
Example
1. on
Columns
Single column
Merge two `df`s on the shared values of the specified column `key`.
In code:
left = pd.DataFrame(
{'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']}
)
# key A B
# 0 K0 A0 B0
# 1 K1 A1 B1
# 2 K2 A2 B2
# 3 K3 A3 B3
right = pd.DataFrame(
{'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']}
)
# key C D
# 0 K0 C0 D0
# 1 K1 C1 D1
# 2 K2 C2 D2
# 3 K3 C3 D3
result = pd.merge(left, right, on='key')
# key A B C D
# 0 K0 A0 B0 C0 D0
# 1 K1 A1 B1 C1 D1
# 2 K2 A2 B2 C2 D2
# 3 K3 A3 B3 C3 D3
columns = result.columns
# Index(['key', 'A', 'B', 'C', 'D'], dtype='object')
index = result.index
# Int64Index([0, 1, 2, 3], dtype='int64')
Shown as a diagram.
Multiple columns
Here is an example joining on multiple keys: the two `df`s are merged on the shared values of two specified columns.
In code:
left = pd.DataFrame(
{
'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']
}
)
# key1 key2 A B
# 0 K0 K0 A0 B0
# 1 K0 K1 A1 B1
# 2 K1 K0 A2 B2
# 3 K2 K1 A3 B3
right = pd.DataFrame(
{
'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']
}
)
# key1 key2 C D
# 0 K0 K0 C0 D0
# 1 K1 K0 C1 D1
# 2 K1 K0 C2 D2
# 3 K2 K0 C3 D3
result = pd.merge(left, right, on=['key1', 'key2'])
# key1 key2 A B C D
# 0 K0 K0 A0 B0 C0 D0
# 1 K1 K0 A2 B2 C1 D1
# 2 K1 K0 A2 B2 C2 D2
columns = result.columns
# Index(['key1', 'key2', 'A', 'B', 'C', 'D'], dtype='object')
index = result.index
# Int64Index([0, 1, 2], dtype='int64')
Shown as a diagram.
Rows
Merge two `df`s on the shared values of the index level named `key`.
In code:
leftindex = pd.MultiIndex.from_tuples(
[('K0', 'X0'), ('K0', 'X1'),('K1', 'X2')],
names=['key', 'X']
)
left = pd.DataFrame(
{
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
},
index=leftindex
)
# A B
# key X
# K0 X0 A0 B0
# X1 A1 B1
# K1 X2 A2 B2
leftcolumns = left.columns
# Index(['A', 'B'], dtype='object')
leftindex = left.index
# MultiIndex([('K0', 'X0'),
# ('K0', 'X1'),
# ('K1', 'X2')],
# names=['key', 'X'])
leftindexnames = left.index.names
# ['key', 'X']
leftkey = left.index.get_level_values(level='key')
# Index(['K0', 'K0', 'K1'], dtype='object', name='key')
rightindex = pd.MultiIndex.from_tuples(
[('K0', 'Y0'), ('K1', 'Y1'),('K2', 'Y2'), ('K2', 'Y3')],
names=['key', 'Y']
)
right = pd.DataFrame(
{
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=rightindex
)
# C D
# key Y
# K0 Y0 C0 D0
# K1 Y1 C1 D1
# K2 Y2 C2 D2
# Y3 C3 D3
rightcolumns = right.columns
# Index(['C', 'D'], dtype='object')
rightindex = right.index
# MultiIndex([('K0', 'Y0'),
# ('K1', 'Y1'),
# ('K2', 'Y2'),
# ('K2', 'Y3')],
# names=['key', 'Y'])
rightindexnames = right.index.names
# ['key', 'Y']
rightkey = right.index.get_level_values(level='key')
# Index(['K0', 'K1', 'K2', 'K2'], dtype='object', name='key')
expectedcommonkeys = list(set(list(leftkey)).intersection(set(list(rightkey))))
# ['K0', 'K1']
result = pd.merge(
left.reset_index(),
right.reset_index(),
on=['key'],
how='inner'
).set_index(['key', 'X', 'Y'])
# A B C D
# key X Y
# K0 X0 Y0 A0 B0 C0 D0
# X1 Y0 A1 B1 C0 D0
# K1 X2 Y1 A2 B2 C1 D1
columns = result.columns
# Index(['A', 'B', 'C', 'D'], dtype='object')
index = result.index
# MultiIndex([('K0', 'X0', 'Y0'),
# ('K0', 'X1', 'Y0'),
# ('K1', 'X2', 'Y1')],
# names=['key', 'X', 'Y'])
realcommonkeys = (result.index.get_level_values(level='key')).unique()
# Index(['K0', 'K1'], dtype='object', name='key')
Shown as a diagram.
Rows and columns
Merge two `df`s on the shared values of the specified row/column indexes.
In code:
leftindex = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')
left = pd.DataFrame(
{
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'key2': ['K0', 'K1', 'K0', 'K1']
},
index=leftindex)
# A B key2
# key1
# K0 A0 B0 K0
# K0 A1 B1 K1
# K1 A2 B2 K0
# K2 A3 B3 K1
leftcolumns = left.columns
# Index(['A', 'B', 'key2'], dtype='object')
leftindex = left.index
# Index(['K0', 'K0', 'K1', 'K2'], dtype='object', name='key1')
rightindex = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')
right = pd.DataFrame(
{
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3'],
'key2': ['K0', 'K0', 'K0', 'K1']
},
index=rightindex
)
# C D key2
# key1
# K0 C0 D0 K0
# K1 C1 D1 K0
# K2 C2 D2 K0
# K2 C3 D3 K1
rightcolumns = right.columns
# Index(['C', 'D', 'key2'], dtype='object')
rightindex = right.index
# Index(['K0', 'K1', 'K2', 'K2'], dtype='object', name='key1')
leftkeys = left['key2'].reset_index()
rightkeys = right['key2'].reset_index()
expectedcommonkeys = pd.merge(leftkeys, rightkeys, how='inner')
# key1 key2
# 0 K0 K0
# 1 K1 K0
# 2 K2 K1
result = pd.merge(
left,
right,
on=['key1', 'key2'],
)
# A B key2 C D
# key1
# K0 A0 B0 K0 C0 D0
# K1 A2 B2 K0 C1 D1
# K2 A3 B3 K1 C3 D3
columns = result.columns
# Index(['A', 'B', 'key2', 'C', 'D'], dtype='object')
index = result.index
# Index(['K0', 'K1', 'K2'], dtype='object', name='key1')
realcommonkeys = (result['key2'].reset_index())
# key1 key2
# 0 K0 K0
# 1 K1 K0
# 2 K2 K1
Shown as a diagram.
2. how
`how` specifies the merge type; as in a database, you can choose an inner join, outer join, and so on.
With `how=left`, the two `df`s are merged based on the key values of the `left` `df`.
left = pd.DataFrame(
{
'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']
}
)
# key1 key2 A B
# 0 K0 K0 A0 B0
# 1 K0 K1 A1 B1
# 2 K1 K0 A2 B2
# 3 K2 K1 A3 B3
right = pd.DataFrame(
{
'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']
}
)
# key1 key2 C D
# 0 K0 K0 C0 D0
# 1 K1 K0 C1 D1
# 2 K1 K0 C2 D2
# 3 K2 K0 C3 D3
result = pd.merge(left, right, how='left', on=['key1', 'key2'])
# key1 key2 A B C D
# 0 K0 K0 A0 B0 C0 D0
# 1 K0 K1 A1 B1 NaN NaN
# 2 K1 K0 A2 B2 C1 D1
# 3 K1 K0 A2 B2 C2 D2
# 4 K2 K1 A3 B3 NaN NaN
columns = result.columns
# Index(['key1', 'key2', 'A', 'B', 'C', 'D'], dtype='object')
index = result.index
# Int64Index([0, 1, 2, 3, 4], dtype='int64')
Passing `indicator=True` adds a column showing how each row was joined:
result = pd.merge(left, right, how='left', on=['key1', 'key2'], indicator=True)
# key1 key2 A B C D _merge
# 0 K0 K0 A0 B0 C0 D0 both
# 1 K0 K1 A1 B1 NaN NaN left_only
# 2 K1 K0 A2 B2 C1 D1 both
# 3 K1 K0 A2 B2 C2 D2 both
# 4 K2 K1 A3 B3 NaN NaN left_only
columns = result.columns
# Index(['key1', 'key2', 'A', 'B', 'C', 'D', '_merge'], dtype='object')
index = result.index
# Int64Index([0, 1, 2, 3, 4], dtype='int64')
If you pass a string to `indicator` instead, that indicator column is renamed accordingly:
result = pd.merge(left, right, how='left', on=['key1', 'key2'], indicator='comment')
# key1 key2 A B C D comment
# 0 K0 K0 A0 B0 C0 D0 both
# 1 K0 K1 A1 B1 NaN NaN left_only
# 2 K1 K0 A2 B2 C1 D1 both
# 3 K1 K0 A2 B2 C2 D2 both
# 4 K2 K1 A3 B3 NaN NaN left_only
columns = result.columns
# Index(['key1', 'key2', 'A', 'B', 'C', 'D', 'comment'], dtype='object')
index = result.index
# Int64Index([0, 1, 2, 3, 4], dtype='int64')
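One common use of `indicator` is an "anti-join": keep only the rows of `left` that have no match in `right`. A sketch using the same `left`/`right` as above (the `B`/`D` columns are omitted for brevity):

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3']})

merged = pd.merge(left, right, how='left', on=['key1', 'key2'], indicator=True)
# Rows marked left_only had no partner in right
anti = merged.loc[merged['_merge'] == 'left_only', ['key1', 'key2', 'A']]
print(anti['A'].tolist())
# ['A1', 'A3']
```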
df.join
Usage
DataFrame.join(
self,
other: DataFrame | Series,
on: IndexLabel | None = None,
how: str = "left",
lsuffix: str = "",
rsuffix: str = "",
sort: bool = False,
)
Example
The two `df`s are joined on the row index of the `left` `df`.
In code:
left = pd.DataFrame(
{
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
},
index=['K0', 'K1', 'K2']
)
# A B
# K0 A0 B0
# K1 A1 B1
# K2 A2 B2
right = pd.DataFrame(
{
'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']
},
index=['K0', 'K2', 'K3']
)
# C D
# K0 C0 D0
# K2 C2 D2
# K3 C3 D3
expectedcolumns = set(left.columns.to_list()) | set(right.columns.to_list())
# {'B', 'A', 'C', 'D'}
expectedindex = set(left.index.to_list()) | (set(left.index.to_list()) & set(right.index.to_list()))
# {'K1', 'K2', 'K0'}
result = left.join(right)
# A B C D
# K0 A0 B0 C0 D0
# K1 A1 B1 NaN NaN
# K2 A2 B2 C2 D2
columns = result.columns
# Index(['A', 'B', 'C', 'D'], dtype='object')
index = result.index
# Index(['K0', 'K1', 'K2'], dtype='object')
Shown as a diagram.
1. how
You can specify how the row indexes are joined via `how`. For example, an outer join (`outer`) combines using the union of the `left` and `right` indexes.
Rows
Single level
left = pd.DataFrame(
{
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
},
index=['K0', 'K1', 'K2']
)
# A B
# K0 A0 B0
# K1 A1 B1
# K2 A2 B2
right = pd.DataFrame(
{
'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']
},
index=['K0', 'K2', 'K3']
)
# C D
# K0 C0 D0
# K2 C2 D2
# K3 C3 D3
expectedcolumns = set(left.columns.to_list()) | set(right.columns.to_list())
# {'C', 'B', 'D', 'A'}
expectedindex = set(left.index.to_list()) | set(right.index.to_list())
# {'K2', 'K3', 'K0', 'K1'}
result = left.join(right, how='outer')
# A B C D
# K0 A0 B0 C0 D0
# K1 A1 B1 NaN NaN
# K2 A2 B2 C2 D2
# K3 NaN NaN C3 D3
columns = result.columns
# Index(['A', 'B', 'C', 'D'], dtype='object')
index = result.index
# Index(['K0', 'K1', 'K2', 'K3'], dtype='object')
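Conversely, `how='inner'` keeps only the row labels present in both frames. With the same `left`/`right`:

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

# Only K0 and K2 appear in both indexes
result = left.join(right, how='inner')
print(result.index.tolist())
# ['K0', 'K2']
```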
2. on
Single column
Merge two `df`s on the shared values of the specified column `k`.
In code:
left = pd.DataFrame(
{
'k': ['K0', 'K1', 'K2'],
'v': [1, 2, 3]
}
)
right = pd.DataFrame(
{
'k': ['K0', 'K0', 'K3'],
'v': [4, 5, 6]
}
)
expectedindex = set(left['k'].to_list()) & set(right['k'].to_list())
# {'K0'}
result = pd.merge(left, right, on='k')
# k v_x v_y
# 0 K0 1 4
# 1 K0 1 5
columns = result.columns
# Index(['k', 'v_x', 'v_y'], dtype='object')
index = result.index
# Int64Index([0, 1], dtype='int64')
You can also customize the names of the overlapping columns with `suffixes`:
left = pd.DataFrame(
{
'k': ['K0', 'K1', 'K2'],
'v': [1, 2, 3]
}
)
right = pd.DataFrame(
{
'k': ['K0', 'K0', 'K3'],
'v': [4, 5, 6]
}
)
expectedindex = set(left['k'].to_list()) & set(right['k'].to_list())
# {'K0'}
result = pd.merge(left, right, on='k', suffixes=('1_l', '2_r'))
# k v1_l v2_r
# 0 K0 1 4
# 1 K0 1 5
columns = result.columns
# Index(['k', 'v1_l', 'v2_r'], dtype='object')
index = result.index
# Int64Index([0, 1], dtype='int64')
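Calling `join` directly with overlapping column names requires `lsuffix`/`rsuffix`; note that unlike the `merge` calls above, `join` aligns on the index (here the default `RangeIndex`), not on column `k`. A sketch:

```python
import pandas as pd

left = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]})
right = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]})

# Both frames have columns k and v, so suffixes are mandatory here
result = left.join(right, lsuffix='_l', rsuffix='_r')
print(result.columns.tolist())
# ['k_l', 'v_l', 'k_r', 'v_r']
```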
Rows and columns
Join the `left` `df`'s column `key` against the `right` `df`'s row index.
In code:
left = pd.DataFrame(
{
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'key': ['K0', 'K1', 'K0', 'K1']
}
)
# A B key
# 0 A0 B0 K0
# 1 A1 B1 K1
# 2 A2 B2 K0
# 3 A3 B3 K1
right = pd.DataFrame(
{
'C': ['C0', 'C1'],
'D': ['D0', 'D1']
},
index=['K0', 'K1']
)
# C D
# K0 C0 D0
# K1 C1 D1
result = left.join(right, on='key')
# A B key C D
# 0 A0 B0 K0 C0 D0
# 1 A1 B1 K1 C1 D1
# 2 A2 B2 K0 C0 D0
# 3 A3 B3 K1 C1 D1
columns = result.columns
# Index(['A', 'B', 'key', 'C', 'D'], dtype='object')
index = result.index
# RangeIndex(start=0, stop=4, step=1)
Shown as a diagram.
df.combine_first
Usage
obj1.combine_first(obj2)
Here `obj1` is the dataset of the calling object, and `obj2` is the dataset passed as the argument.
This method uses the data in the argument object to "patch" missing data in the calling object, i.e. it fills in the calling object's missing values 54.
Example
Null values in one `df` are replaced by non-null values at the same positions in another `df`:
a = pd.DataFrame(
[
[np.nan, 3., 5.],
[-4.6, np.nan, np.nan],
[np.nan, 7., np.nan]
]
)
# 0 1 2
# 0 NaN 3.0 5.0
# 1 -4.6 NaN NaN
# 2 NaN 7.0 NaN
b = pd.DataFrame(
[
[-42.6, np.nan, -8.2],
[-5., 1.6, 4]
],
index=[1, 2]
)
# 0 1 2
# 1 -42.6 NaN -8.2
# 2 -5.0 1.6 4.0
result = a.combine_first(b)
# 0 1 2
# 0 NaN 3.0 5.0
# 1 -4.6 NaN -8.2
# 2 -5.0 7.0 4.0
Shown as a diagram.
Alternatively, `df.update` can be used; note that it modifies `a` in place (returning `None`) and, unlike `combine_first`, overwrites `a`'s existing values wherever `b` is non-null:
a.update(b)
# 0 1 2
# 0 NaN 3.0 5.0
# 1 -42.6 NaN -8.2
# 2 -5.0 1.6 4.0
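The two approaches are not interchangeable: `combine_first` only fills `a`'s `NaN` values, while `update` works in place and overwrites `a`'s existing values wherever `b` is non-null. A small sketch with the same `a` and `b`:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[np.nan, 3., 5.],
                  [-4.6, np.nan, np.nan],
                  [np.nan, 7., np.nan]])
b = pd.DataFrame([[-42.6, np.nan, -8.2],
                  [-5., 1.6, 4]],
                 index=[1, 2])

filled = a.combine_first(b)   # keeps a's existing -4.6 and 7.0
a.update(b)                   # overwrites them with b's -42.6 and 1.6

print(filled.loc[1, 0], a.loc[1, 0])
# -4.6 -42.6
print(filled.loc[2, 1], a.loc[2, 1])
# 7.0 1.6
```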
References
AttributeError: ‘DataFrame’ object has no attribute ‘str’ while trying to fix my dataframe ↩︎
[DataFrame divide operation](Usage of the Python pandas.DataFrame.divide method) ↩︎
pandas dataframe select columns in multiindex [duplicate] ↩︎ ↩︎
pandas dataframe select columns in multiindex [duplicate] ↩︎ ↩︎