Merge and Comparison Methods of pandas.DataFrame

Latest recommended article published on 2024-09-02 16:47:03

峡谷的小鱼

Categories: Data Analysis, pandas. Tags: python, data analysis, machine learning, deep learning, pytorch

Article link: https://blog.csdn.net/weixin_43276033/article/details/124113612

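This article covers the main DataFrame operations for combining and comparing data: append adds rows, assign adds new columns, compare reports differences, join and merge combine tables, and update overwrites values in place. The examples show typical calls and parameter settings for each operation.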


Operations That Modify DataFrame Structure

1. `DataFrame.append(other, ignore_index, verify_integrity, sort)`

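Appends the rows of other to the end of self and returns a new object. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions use `pd.concat` instead. The examples in this section require pandas < 2.0.
Parameters:

other : DataFrame or Series/dict-like object, or list of these
The data to append.
ignore_index : bool, default False
If True, discard the original row labels and relabel the result 0, 1, …, n-1.
verify_integrity : bool, default False
If True, raise ValueError when the result would contain duplicate row labels.
sort : bool, default False
Whether to sort the columns when the columns of self and other are not aligned.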

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                                'foo', 'bar', 'foo', 'foo'],
                        'B': ['one', 'one', 'two', 'three',
                                'two', 'two', 'one', 'three'],
                        'C': np.random.randn(8),
                        'D': np.random.randn(8)})
>>> df
	A	B	C	D
0	foo	one	-0.253633	0.062874
1	bar	one	-0.674316	0.735620
2	foo	two	-0.988642	-0.451565
3	bar	three	-0.737381	1.504960
4	foo	two	1.841809	-0.242843
5	bar	two	-0.432108	0.345297
6	foo	one	1.060551	-1.169807
7	foo	three	-1.487355	-1.044460


>>> other = pd.DataFrame({'A': ['oth', 'oth'],
                      'B': ['one', 'two'],
                      'D': np.random.randn(2),
                      'E': np.random.randn(2),
                      'F': np.random.randn(2)})
>>> other
	A	B	D	E	F
0	oth	one	-0.853782	0.225426	-0.374242
1	oth	two	0.965702	1.386243	0.114149


>>> df.append(other)
	A	B	C	D	E	F
0	foo	one	-0.253633	0.062874	NaN	NaN
1	bar	one	-0.674316	0.735620	NaN	NaN
2	foo	two	-0.988642	-0.451565	NaN	NaN
3	bar	three	-0.737381	1.504960	NaN	NaN
4	foo	two	1.841809	-0.242843	NaN	NaN
5	bar	two	-0.432108	0.345297	NaN	NaN
6	foo	one	1.060551	-1.169807	NaN	NaN
7	foo	three	-1.487355	-1.044460	NaN	NaN
0	oth	one	NaN	-0.853782	0.225426	-0.374242
1	oth	two	NaN	0.965702	1.386243	0.114149

>>> df.append(other, ignore_index=True)
	A	B	C	D	E	F
0	foo	one	-0.253633	0.062874	NaN	NaN
1	bar	one	-0.674316	0.735620	NaN	NaN
2	foo	two	-0.988642	-0.451565	NaN	NaN
3	bar	three	-0.737381	1.504960	NaN	NaN
4	foo	two	1.841809	-0.242843	NaN	NaN
5	bar	two	-0.432108	0.345297	NaN	NaN
6	foo	one	1.060551	-1.169807	NaN	NaN
7	foo	three	-1.487355	-1.044460	NaN	NaN
8	oth	one	NaN	-0.853782	0.225426	-0.374242
9	oth	two	NaN	0.965702	1.386243	0.114149
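
Because `DataFrame.append` is no longer available from pandas 2.0 onward, the same results can be obtained with `pd.concat`. The following is a minimal sketch under that assumption; the small frames and the variable names (`stacked`, `renumbered`, `duplicate_labels_rejected`) are made up for illustration and are not from the original examples.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar"], "B": ["one", "two"], "C": [1.0, 2.0]})
other = pd.DataFrame({"A": ["oth"], "B": ["one"], "E": [9.0]})

# Counterpart of df.append(other): the original row labels are kept (0, 1, 0),
# and columns missing on one side are filled with NaN.
stacked = pd.concat([df, other])

# Counterpart of df.append(other, ignore_index=True): rows are relabeled 0..n-1.
renumbered = pd.concat([df, other], ignore_index=True)

# Counterpart of verify_integrity=True: duplicate row labels raise ValueError.
try:
    pd.concat([df, other], verify_integrity=True)
    duplicate_labels_rejected = False
except ValueError:
    duplicate_labels_rejected = True

print(stacked)
print(renumbered)
print(duplicate_labels_rejected)
```

`pd.concat` takes a list, so several frames can be stacked in one call, which is generally cheaper than appending them one at a time in a loop.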

2. `DataFrame.assign(**kwargs)`

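Adds new columns to a DataFrame and returns a new object; the original is not modified. An existing column with the same name is overwritten.
Parameters:

**kwargs : dict of {str: callable or Series}
The keyword is the new column name. A callable is evaluated on the DataFrame and its result is assigned; any other value (Series, scalar, or array) is assigned directly.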

>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                                'foo', 'bar', 'foo', 'foo'],
                        'B': ['one', 'one', 'two', 'three',
                                'two', 'two', 'one', 'three'],
                        'C': np.random.randn(8),
                        'D': np.random.randn(8)})
>>> df
	A	B	C	D
0	foo	one	-1.478852	0.314423
1	bar	one	1.495904	-0.789284
2	foo	two	-0.264907	-0.230382
3	bar	three	-0.287789	-1.907428
4	foo	two	0.808572	-1.537025
5	bar	two	-1.490136	1.249029
6	foo	one	1.049753	-2.579596
7	foo	three	1.961194	1.883564

>>> other = pd.DataFrame({'E': np.arange(len(df))})
>>> other
	E
0	0
1	1
2	2
3	3
4	4
5	5
6	6
7	7

>>> df.assign(E=other)
	A	B	C	D	E
0	foo	one	-1.478852	0.314423	0
1	bar	one	1.495904	-0.789284	1
2	foo	two	-0.264907	-0.230382	2
3	bar	three	-0.287789	-1.907428	3
4	foo	two	0.808572	-1.537025	4
5	bar	two	-1.490136	1.249029	5
6	foo	one	1.049753	-2.579596	6
7	foo	three	1.961194	1.883564	7

>>> df.assign(E=2)
	A	B	C	D	E
0	foo	one	-1.478852	0.314423	2
1	bar	one	1.495904	-0.789284	2
2	foo	two	-0.264907	-0.230382	2
3	bar	three	-0.287789	-1.907428	2
4	foo	two	0.808572	-1.537025	2
5	bar	two	-1.490136	1.249029	2
6	foo	one	1.049753	-2.579596	2
7	foo	three	1.961194	1.883564	2

>>> df.assign(E=df['D']*2)
	A	B	C	D	E
0	foo	one	-1.478852	0.314423	0.628846
1	bar	one	1.495904	-0.789284	-1.578567
2	foo	two	-0.264907	-0.230382	-0.460763
3	bar	three	-0.287789	-1.907428	-3.814855
4	foo	two	0.808572	-1.537025	-3.074051
5	bar	two	-1.490136	1.249029	2.498058
6	foo	one	1.049753	-2.579596	-5.159192
7	foo	three	1.961194	1.883564	3.767129
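
The parameter list mentions callables, which the examples above do not use. A small sketch of that form follows, with made-up data and column names; it assumes a pandas version in which later keyword arguments may refer to columns created earlier in the same call.

```python
import pandas as pd

df = pd.DataFrame({"C": [1.0, 2.0, 3.0], "D": [10.0, 20.0, 30.0]})

# A callable receives the DataFrame being assigned to and returns the new column.
# F is computed from E, which is created earlier in the same assign call.
result = df.assign(
    E=lambda d: d["D"] * 2,
    F=lambda d: d["E"] + d["C"],
)

print(result)
# assign returns a new object; df itself still has only columns C and D.
print(list(df.columns))
```

The callable form is mainly useful in method chains, where the intermediate DataFrame has no variable name to refer to.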

3. `DataFrame.compare(other, align_axis, keep_shape, keep_equal)`

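Compares two DataFrames and shows where they differ. Both frames must have identical row and column labels. Available since pandas 1.1.0.
Parameters:

other: DataFrame
The object to compare with.
align_axis : {0 or 'index', 1 or 'columns'}, default 1
Axis on which to lay out the comparison: with 1 the self and other values appear side by side as columns; with 0 they are stacked as alternating rows.
keep_shape : bool, default False
If True, keep every row and column of self in the result; otherwise only rows and columns containing a difference are kept.
keep_equal : bool, default False
If True, show equal values as they are; otherwise equal values are shown as NaN.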

>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                                'foo', 'bar', 'foo', 'foo'],
                        'B': ['one', 'one', 'two', 'three',
                                'two', 'two', 'one', 'three'],
                        'C': np.random.randn(8),
                        'D': np.random.randn(8)})
>>> df
	A	B	C	D
0	foo	one	-1.054664	-0.899963
1	bar	one	-1.636758	0.055059
2	foo	two	-1.847338	1.036013
3	bar	three	-1.282626	0.238854
4	foo	two	-0.302243	-1.690846
5	bar	two	-0.980312	-0.301000
6	foo	one	1.327013	-0.819923
7	foo	three	-1.172435	0.713987

>>> other = df.copy()
>>> other.iat[1,1] = 'Zero'
>>> other.iat[3,2] = 0
>>> other
	A	B	C	D
0	foo	one	-1.054664	-0.899963
1	bar	Zero	-1.636758	0.055059
2	foo	two	-1.847338	1.036013
3	bar	three	0.000000	0.238854
4	foo	two	-0.302243	-1.690846
5	bar	two	-0.980312	-0.301000
6	foo	one	1.327013	-0.819923
7	foo	three	-1.172435	0.713987

>>> df.compare(other)
	B		C	
	self	other	self	other
1	one	Zero	NaN	NaN
3	NaN	NaN	-1.282626	0.0

>>> df.compare(other, keep_equal=True)
	B		C	
	self	other	self	other
1	one	Zero	-1.636758	-1.636758
3	three	three	-1.282626	0.000000

>>> df.compare(other, keep_shape=True)
	A		B		C		D	
	self	other	self	other	self	other	self	other
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	one	Zero	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	-1.282626	0.0	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
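
The examples above all use the default `align_axis=1`. As an illustrative sketch with made-up data, the following shows the `align_axis=0` layout next to the default one, so the difference in the shape of the result is visible.

```python
import pandas as pd

df = pd.DataFrame({"B": ["one", "one", "two"], "C": [1.0, 2.0, 3.0]})
other = df.copy()
other.loc[1, "B"] = "Zero"
other.loc[2, "C"] = 0.0

# Default layout (align_axis=1): two-level columns of (column name, self/other).
diff_cols = df.compare(other)
print(diff_cols)

# align_axis=0: self and other are stacked as alternating rows,
# giving a two-level row index of (row label, self/other).
diff_rows = df.compare(other, align_axis=0)
print(diff_rows)
```

The stacked layout tends to be easier to read when many columns differ, since the result stays as narrow as the original frame.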

4. `DataFrame.join(other, on, how, lsuffix, rsuffix)`

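Adds the columns of another DataFrame, matching rows by index or by a key column.
Parameters:

other : DataFrame, Series, or list of DataFrame
The object whose columns are added; its index should correspond to the index (or the on column) of self. A Series must have its name attribute set, which becomes the column name.
on : str, list of str, or array-like, optional
Column or index level name(s) in self to match against the index of other; if omitted, the join is index on index.
how : {'left', 'right', 'outer', 'inner'}, default 'left'
Which keys to keep: those of self ('left'), those of other ('right'), their union ('outer'), or their intersection ('inner').
lsuffix : str, default ''
Suffix added to overlapping column names from the left object.
rsuffix : str, default ''
Suffix added to overlapping column names from the right object.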

>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                                'foo', 'bar', 'foo', 'foo'],
                        'B': ['one', 'one', 'two', 'three',
                                'two', 'two', 'one', 'three'],
                        'C': np.random.randn(8),
                        'D': np.random.randn(8)})
>>> df
	A	B	C	D
0	foo	one	-1.437858	0.155025
1	bar	one	1.150565	-0.614996
2	foo	two	0.296236	0.538160
3	bar	three	-1.355619	1.229465
4	foo	two	-0.411405	-1.167204
5	bar	two	-0.178302	-0.451726
6	foo	one	1.127362	-0.407458
7	foo	three	-1.608615	-1.025847

>>> other = pd.DataFrame({'D': np.random.randn(5),
                      'E': np.arange(5),
                      'F': np.arange(5)})
>>> other
	D	E	F
0	0.301840	0	0
1	1.885965	1	1
2	0.925909	2	2
3	0.560766	3	3
4	-0.591023	4	4

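# The C and D values in the outputs below come from a different random draw than the df printed above.
>>> df.join(other, lsuffix='_left', rsuffix='_right', how='left')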
	A	B	C	D_left	D_right	E	F
0	foo	one	-0.334271	-0.984569	0.301840	0.0	0.0
1	bar	one	-0.256735	0.344031	1.885965	1.0	1.0
2	foo	two	-0.832477	0.191283	0.925909	2.0	2.0
3	bar	three	-0.165142	-1.415241	0.560766	3.0	3.0
4	foo	two	2.160271	1.271076	-0.591023	4.0	4.0
5	bar	two	-0.804321	0.090582	NaN	NaN	NaN
6	foo	one	0.090808	0.821145	NaN	NaN	NaN
7	foo	three	1.145618	-0.026550	NaN	NaN	NaN

>>> df.join(other.set_index('D'), on='D', how='left')
	A	B	C	D	E	F
0	foo	one	-0.334271	-0.984569	NaN	NaN
1	bar	one	-0.256735	0.344031	NaN	NaN
2	foo	two	-0.832477	0.191283	NaN	NaN
3	bar	three	-0.165142	-1.415241	NaN	NaN
4	foo	two	2.160271	1.271076	NaN	NaN
5	bar	two	-0.804321	0.090582	NaN	NaN
6	foo	one	0.090808	0.821145	NaN	NaN
7	foo	three	1.145618	-0.026550	NaN	NaN

>>> df.join(other.set_index('D'), on='D', how='right')
	A	B	C	D	E	F
NaN	NaN	NaN	NaN	0.301840	0	0
NaN	NaN	NaN	NaN	1.885965	1	1
NaN	NaN	NaN	NaN	0.925909	2	2
NaN	NaN	NaN	NaN	0.560766	3	3
NaN	NaN	NaN	NaN	-0.591023	4	4
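
In the two `on='D'` examples above the keys are random floats, so nothing matches and the added columns are all NaN. A sketch with made-up data in which the keys do match may make the `on` parameter clearer; the `lookup` frame and its columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "baz"], "C": [1, 2, 3]})
lookup = pd.DataFrame({"A": ["foo", "bar"], "E": [10, 20]})

# on="A" matches df's column A against the *index* of the other frame,
# so the key column of `lookup` is moved into its index first.
joined = df.join(lookup.set_index("A"), on="A", how="left")
print(joined)

# how="inner" drops the "baz" row, which has no match in `lookup`.
inner = df.join(lookup.set_index("A"), on="A", how="inner")
print(inner)
```

In the left join the unmatched row gets NaN in column E, which also turns that column into floats.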

5. `DataFrame.merge(right, how, on, left_on, right_on, left_index, right_index, suffixes)`

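Merges two DataFrames with a database-style join.
Parameters:

right: DataFrame or named Series
The object to merge with.
how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
The type of join to perform.
on : label or list
Column or index level name(s) to join on; they must exist in both objects.
left_on : label or list, or array-like
Column name(s) in the left object to use as join keys.
right_on : label or list, or array-like
Column name(s) in the right object to use as join keys.
left_index : bool, default False
If True, use the index of the left object as the join key.
right_index : bool, default False
If True, use the index of the right object as the join key.
suffixes : list-like, default is ("_x", "_y")
A length-2 sequence of strings appended to overlapping column names from the left and right objects respectively.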

>>> df = pd.DataFrame({'A': ['foo', 'bar', 'bar',
                                'foo', 'bar'],
                        'B': ['one', 'two', 'three',
                                'two', 'one'],
                        'C': np.random.randn(5),
                        'D': np.random.randn(5)})
>>> df

	A	B	C	D
0	foo	one	-0.598380	0.934121
1	bar	two	0.209143	-0.988885
2	bar	three	-0.634051	0.286479
3	foo	two	0.551370	-0.282658
4	bar	one	-0.174126	-0.193955

>>> right = pd.DataFrame({'A': ['foo', 'bar', 'foo', 
                                'foo'],
                      'D': np.random.randn(4)})
>>> right
	A	D
0	foo	0.626931
1	bar	0.033224
2	foo	0.688018
3	foo	0.420344

# Merge the two tables on column "A"; the overlapping column name D gets the default suffixes _x and _y
>>> df.merge(right, left_on='A', right_on='A')
	A	B	C	D_x	D_y
0	foo	one	-0.598380	0.934121	0.626931
1	foo	one	-0.598380	0.934121	0.688018
2	foo	one	-0.598380	0.934121	0.420344
3	foo	two	0.551370	-0.282658	0.626931
4	foo	two	0.551370	-0.282658	0.688018
5	foo	two	0.551370	-0.282658	0.420344
6	bar	two	0.209143	-0.988885	0.033224
7	bar	three	-0.634051	0.286479	0.033224
8	bar	one	-0.174126	-0.193955	0.033224

>>> df.merge(right, on='A', suffixes=('_left', '_right'))
	A	B	C	D_left	D_right
0	foo	one	-0.598380	0.934121	0.626931
1	foo	one	-0.598380	0.934121	0.688018
2	foo	one	-0.598380	0.934121	0.420344
3	foo	two	0.551370	-0.282658	0.626931
4	foo	two	0.551370	-0.282658	0.688018
5	foo	two	0.551370	-0.282658	0.420344
6	bar	two	0.209143	-0.988885	0.033224
7	bar	three	-0.634051	0.286479	0.033224
8	bar	one	-0.174126	-0.193955	0.033224
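
Both examples above use the default inner join with an identically named key column. The sketch below, on made-up data, covers two cases they leave out: key columns with different names (`left_on` / `right_on`) and a non-default `how`. The `indicator=True` argument is an extra `merge` parameter not listed above; it adds a `_merge` column recording where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"A": ["foo", "bar", "baz"], "C": [1, 2, 3]})
right = pd.DataFrame({"key": ["foo", "foo", "qux"], "D": [0.1, 0.2, 0.3]})

# Outer merge keeps keys from both sides; _merge is left_only, right_only, or both.
outer = left.merge(right, how="outer", left_on="A", right_on="key", indicator=True)
print(outer)

# Inner merge keeps only matching keys. "foo" matches two rows on the right,
# so the single left "foo" row appears twice in the result.
inner = left.merge(right, how="inner", left_on="A", right_on="key")
print(inner)
```

The duplicated "foo" row is the same effect that turns 5 and 4 input rows into 9 output rows in the examples above: every left row is paired with every right row that shares its key.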

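6. `Series.update(other)`

Modifies a Series in place using the values of the passed sequence, aligned on index; NaN values in other are skipped. The example below applies it to a single column of a DataFrame, which only changes df when `df['C']` is a view of the frame; in pandas versions with Copy-on-Write enabled, df is left unchanged. `DataFrame.update(other)` also exists and does the same for a whole frame.
Parameters:

other: Series, or object coercible into Series
The values to write; a list is treated as a Series indexed 0, 1, …, n-1.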

>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                                'foo', 'bar', 'foo', 'foo'],
                        'B': ['one', 'one', 'two', 'three',
                                'two', 'two', 'one', 'three'],
                        'C': np.random.randn(8),
                        'D': np.random.randn(8)})
>>> df
	A	B	C	D
0	foo	one	-1.355228	-1.154257
1	bar	one	-1.358041	-0.381076
2	foo	two	-0.802152	0.992619
3	bar	three	-0.862556	-0.428032
4	foo	two	-0.180756	-0.352562
5	bar	two	-1.327571	-0.577039
6	foo	one	-1.520092	-0.278328
7	foo	three	-0.833843	0.644365

>>> other = list(np.random.randn(7))
>>> other
[-0.3970351091229411,
 1.074159729611879,
 -0.5120292849313965,
 1.4501524543239934,
 0.012847492863483247,
 -0.8810326090034996,
 0.8653033389654352]

>>> df['C'].update(other)
>>> df

	A	B	C	D
0	foo	one	-0.397035	-1.154257
1	bar	one	1.074160	-0.381076
2	foo	two	-0.512029	0.992619
3	bar	three	1.450152	-0.428032
4	foo	two	0.012847	-0.352562
5	bar	two	-0.881033	-0.577039
6	foo	one	0.865303	-0.278328
7	foo	three	-0.833843	0.644365
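
As a hedged alternative to updating a column through `df['C']`, the sketch below calls `DataFrame.update` on the frame itself, which does not depend on the column being a view. The data and the names `patch`, `values`, `after_patch`, and `after_list` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"C": [1.0, 2.0, 3.0], "D": [10.0, 20.0, 30.0]})

# Only labels present in `patch` are touched, and NaN in `patch` keeps the old value.
patch = pd.DataFrame({"C": [100.0, np.nan]}, index=[0, 2])
returned = df.update(patch)  # modifies df in place and returns None
after_patch = df["C"].tolist()
print(df)

# A shorter list, wrapped in a one-column frame, updates rows 0 and 1 only.
values = [7.0, 8.0]
df.update(pd.DataFrame({"C": values}))
after_list = df["C"].tolist()
print(df)
```

Because update returns None, it cannot be used in a method chain; assign the result of other operations first and call update as a separate statement.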