Operations that modify the structure of a DataFrame
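1. DataFrame.append(other, ignore_index, verify_integrity, sort)
Appends the rows of other to the end of self and returns a new object. Note that DataFrame.append was deprecated in pandas 1.4.0 and removed in pandas 2.0.0, so the examples below run only on older versions; pd.concat is the replacement.
Parameters:
other : DataFrame or Series/dict-like object, or list of these
The data to append.
ignore_index : bool, default False
If True, discard the original row labels and relabel the result 0, 1, …, n-1.
verify_integrity : bool, default False
If True, raise ValueError when the result would contain duplicate row labels.
sort : bool, default False
Whether to sort the columns when the columns of self and other are not aligned.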
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
...                          'foo', 'bar', 'foo', 'foo'],
...                    'B': ['one', 'one', 'two', 'three',
...                          'two', 'two', 'one', 'three'],
...                    'C': np.random.randn(8),
...                    'D': np.random.randn(8)})
>>> df
A B C D
0 foo one -0.253633 0.062874
1 bar one -0.674316 0.735620
2 foo two -0.988642 -0.451565
3 bar three -0.737381 1.504960
4 foo two 1.841809 -0.242843
5 bar two -0.432108 0.345297
6 foo one 1.060551 -1.169807
7 foo three -1.487355 -1.044460
>>> other = pd.DataFrame({'A': ['oth', 'oth'],
...                       'B': ['one', 'two'],
...                       'D': np.random.randn(2),
...                       'E': np.random.randn(2),
...                       'F': np.random.randn(2)})
>>> other
A B D E F
0 oth one -0.853782 0.225426 -0.374242
1 oth two 0.965702 1.386243 0.114149
>>> df.append(other)
A B C D E F
0 foo one -0.253633 0.062874 NaN NaN
1 bar one -0.674316 0.735620 NaN NaN
2 foo two -0.988642 -0.451565 NaN NaN
3 bar three -0.737381 1.504960 NaN NaN
4 foo two 1.841809 -0.242843 NaN NaN
5 bar two -0.432108 0.345297 NaN NaN
6 foo one 1.060551 -1.169807 NaN NaN
7 foo three -1.487355 -1.044460 NaN NaN
0 oth one NaN -0.853782 0.225426 -0.374242
1 oth two NaN 0.965702 1.386243 0.114149
>>> df.append(other, ignore_index=True)
A B C D E F
0 foo one -0.253633 0.062874 NaN NaN
1 bar one -0.674316 0.735620 NaN NaN
2 foo two -0.988642 -0.451565 NaN NaN
3 bar three -0.737381 1.504960 NaN NaN
4 foo two 1.841809 -0.242843 NaN NaN
5 bar two -0.432108 0.345297 NaN NaN
6 foo one 1.060551 -1.169807 NaN NaN
7 foo three -1.487355 -1.044460 NaN NaN
8 oth one NaN -0.853782 0.225426 -0.374242
9 oth two NaN 0.965702 1.386243 0.114149
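Because DataFrame.append is no longer available in pandas 2.0 and later, the same results can be obtained with pd.concat. The following is a minimal sketch on small, made-up frames (the values are illustrative, not the random data above); it assumes only that pd.concat accepts the ignore_index and verify_integrity keywords, which mirror the append parameters described earlier.

```python
import pandas as pd

# Small illustrative frames; the column sets differ on purpose.
df = pd.DataFrame({"A": ["foo", "bar"], "C": [1.0, 2.0]})
other = pd.DataFrame({"A": ["oth"], "E": [3.0]})

# Counterpart of df.append(other): the original row labels are kept (0, 1, 0),
# and columns missing on one side are filled with NaN.
stacked = pd.concat([df, other])

# Counterpart of df.append(other, ignore_index=True): rows are relabelled 0..n-1.
renumbered = pd.concat([df, other], ignore_index=True)

# verify_integrity=True raises ValueError when row labels would be duplicated.
try:
    pd.concat([df, other], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

print(stacked)
print(renumbered)
print(overlap_detected)
```

Passing a list to pd.concat also avoids calling append repeatedly in a loop, since every call copies all the data collected so far.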
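2. DataFrame.assign(**kwargs)
Adds new columns to a DataFrame and returns a new object; the original is not modified.
Parameters:
**kwargs : dict of {str: callable or Series}
The new columns. Each keyword is a column name; the value is either the column data or a callable that is evaluated on the DataFrame.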
>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
...                          'foo', 'bar', 'foo', 'foo'],
...                    'B': ['one', 'one', 'two', 'three',
...                          'two', 'two', 'one', 'three'],
...                    'C': np.random.randn(8),
...                    'D': np.random.randn(8)})
>>> df
A B C D
0 foo one -1.478852 0.314423
1 bar one 1.495904 -0.789284
2 foo two -0.264907 -0.230382
3 bar three -0.287789 -1.907428
4 foo two 0.808572 -1.537025
5 bar two -1.490136 1.249029
6 foo one 1.049753 -2.579596
7 foo three 1.961194 1.883564
>>> other = pd.DataFrame({'E': np.arange(len(df))})
>>> other
E
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
>>> df.assign(E=other)
A B C D E
0 foo one -1.478852 0.314423 0
1 bar one 1.495904 -0.789284 1
2 foo two -0.264907 -0.230382 2
3 bar three -0.287789 -1.907428 3
4 foo two 0.808572 -1.537025 4
5 bar two -1.490136 1.249029 5
6 foo one 1.049753 -2.579596 6
7 foo three 1.961194 1.883564 7
>>> df.assign(E=2)
A B C D E
0 foo one -1.478852 0.314423 2
1 bar one 1.495904 -0.789284 2
2 foo two -0.264907 -0.230382 2
3 bar three -0.287789 -1.907428 2
4 foo two 0.808572 -1.537025 2
5 bar two -1.490136 1.249029 2
6 foo one 1.049753 -2.579596 2
7 foo three 1.961194 1.883564 2
>>> df.assign(E=df['D']*2)
A B C D E
0 foo one -1.478852 0.314423 0.628846
1 bar one 1.495904 -0.789284 -1.578567
2 foo two -0.264907 -0.230382 -0.460763
3 bar three -0.287789 -1.907428 -3.814855
4 foo two 0.808572 -1.537025 -3.074051
5 bar two -1.490136 1.249029 2.498058
6 foo one 1.049753 -2.579596 -5.159192
7 foo three 1.961194 1.883564 3.767129
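The parameter description mentions callables, but the examples above only pass values. Here is a small sketch of the callable form, using made-up numbers instead of random data; the column names E and F are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({"C": [1.0, 2.0, 3.0], "D": [10.0, 20.0, 30.0]})

# A callable receives the DataFrame being built and returns the new column.
# Later keywords can refer to columns created earlier in the same call.
result = df.assign(
    E=lambda d: d["D"] * 2,
    F=lambda d: d["E"] + d["C"],
)

print(result)
print(list(df.columns))  # df still has only C and D
```

The callable form is handy in method chains, where the intermediate DataFrame has no variable name to refer to.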
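3. DataFrame.compare(other, align_axis, keep_shape, keep_equal)
Compares two DataFrame objects and shows the differences.
Parameters:
other : DataFrame
The object to compare with.
align_axis : {0 or 'index', 1 or 'columns'}, default 1
The axis along which the comparison is laid out: with 0 the self/other values are stacked as alternating rows, with 1 they are placed in adjacent columns.
keep_shape : bool, default False
If True, keep all rows and columns so that the result has the same shape as self; otherwise only the rows and columns that differ are kept.
keep_equal : bool, default False
If True, equal values are shown as they are; otherwise they are shown as NaN.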
>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
...                          'foo', 'bar', 'foo', 'foo'],
...                    'B': ['one', 'one', 'two', 'three',
...                          'two', 'two', 'one', 'three'],
...                    'C': np.random.randn(8),
...                    'D': np.random.randn(8)})
>>> df
A B C D
0 foo one -1.054664 -0.899963
1 bar one -1.636758 0.055059
2 foo two -1.847338 1.036013
3 bar three -1.282626 0.238854
4 foo two -0.302243 -1.690846
5 bar two -0.980312 -0.301000
6 foo one 1.327013 -0.819923
7 foo three -1.172435 0.713987
>>> other = df.copy()
>>> other.iat[1,1] = 'Zero'
>>> other.iat[3,2] = 0
>>> other
A B C D
0 foo one -1.054664 -0.899963
1 bar Zero -1.636758 0.055059
2 foo two -1.847338 1.036013
3 bar three 0.000000 0.238854
4 foo two -0.302243 -1.690846
5 bar two -0.980312 -0.301000
6 foo one 1.327013 -0.819923
7 foo three -1.172435 0.713987
>>> df.compare(other)
B C
self other self other
1 one Zero NaN NaN
3 NaN NaN -1.282626 0.0
>>> df.compare(other, keep_equal=True)
B C
self other self other
1 one Zero -1.636758 -1.636758
3 three three -1.282626 0.000000
>>> df.compare(other, keep_shape=True)
A B C D
self other self other self other self other
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN one Zero NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN -1.282626 0.0 NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN
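The examples above do not exercise align_axis. The sketch below, on a small made-up frame, contrasts the default column layout with align_axis=0; it assumes pandas 1.1 or later, where DataFrame.compare was introduced.

```python
import pandas as pd

df = pd.DataFrame({"B": ["one", "one", "two"], "C": [1.0, 2.0, 3.0]})
other = df.copy()
other.loc[1, "B"] = "Zero"   # change one string cell
other.loc[2, "C"] = 0.0      # change one numeric cell

# Default (align_axis=1): self/other appear as adjacent columns.
by_columns = df.compare(other)

# align_axis=0: self/other appear as alternating rows under each row label.
by_rows = df.compare(other, align_axis=0)

print(by_columns)
print(by_rows)
```

The row layout tends to be easier to read when many columns differ, because the result stays as narrow as the original frame.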
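4. DataFrame.join(other, on, how, lsuffix, rsuffix)
Adds the columns of another DataFrame, aligning rows on the index or on a key column.
Parameters:
other : DataFrame, Series, or list of DataFrame
The object whose columns are added; its index should be similar to that of self. If a Series is passed, its name attribute must be set and is used as the column name.
on : str, list of str, or array-like, optional
Column or index level name(s) in self to match against the index of other. If omitted, the join is index on index.
how : {'left', 'right', 'outer', 'inner'}, default 'left'
How to handle the join: 'left' uses the keys of self, 'right' the keys of other, 'outer' the union of both, 'inner' the intersection.
lsuffix : str, default ''
Suffix to use for overlapping columns from the left object.
rsuffix : str, default ''
Suffix to use for overlapping columns from the right object.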
>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
...                          'foo', 'bar', 'foo', 'foo'],
...                    'B': ['one', 'one', 'two', 'three',
...                          'two', 'two', 'one', 'three'],
...                    'C': np.random.randn(8),
...                    'D': np.random.randn(8)})
>>> df
A B C D
0 foo one -1.437858 0.155025
1 bar one 1.150565 -0.614996
2 foo two 0.296236 0.538160
3 bar three -1.355619 1.229465
4 foo two -0.411405 -1.167204
5 bar two -0.178302 -0.451726
6 foo one 1.127362 -0.407458
7 foo three -1.608615 -1.025847
>>> other = pd.DataFrame({'D': np.random.randn(5),
...                       'E': np.arange(5),
...                       'F': np.arange(5)})
>>> other
D E F
0 0.301840 0 0
1 1.885965 1 1
2 0.925909 2 2
3 0.560766 3 3
4 -0.591023 4 4
>>> df.join(other,lsuffix='_left',rsuffix='_right',how='left')
A B C D_left D_right E F
0 foo one -1.437858 0.155025 0.301840 0.0 0.0
1 bar one 1.150565 -0.614996 1.885965 1.0 1.0
2 foo two 0.296236 0.538160 0.925909 2.0 2.0
3 bar three -1.355619 1.229465 0.560766 3.0 3.0
4 foo two -0.411405 -1.167204 -0.591023 4.0 4.0
5 bar two -0.178302 -0.451726 NaN NaN NaN
6 foo one 1.127362 -0.407458 NaN NaN NaN
7 foo three -1.608615 -1.025847 NaN NaN NaN
>>> df.join(other.set_index('D'), on='D', how='left')
A B C D E F
0 foo one -1.437858 0.155025 NaN NaN
1 bar one 1.150565 -0.614996 NaN NaN
2 foo two 0.296236 0.538160 NaN NaN
3 bar three -1.355619 1.229465 NaN NaN
4 foo two -0.411405 -1.167204 NaN NaN
5 bar two -0.178302 -0.451726 NaN NaN
6 foo one 1.127362 -0.407458 NaN NaN
7 foo three -1.608615 -1.025847 NaN NaN
>>> df.join(other.set_index('D'), on='D', how='right')
A B C D E F
NaN NaN NaN NaN 0.301840 0 0
NaN NaN NaN NaN 1.885965 1 1
NaN NaN NaN NaN 0.925909 2 2
NaN NaN NaN NaN 0.560766 3 3
NaN NaN NaN NaN -0.591023 4 4
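In the on='D' examples above, the random float keys never match, so every joined value is NaN. The following sketch uses made-up string keys so that some rows do match; the names left, right, key, x and y are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"y": [10, 20]}, index=["a", "b"])

# The values in left["key"] are matched against the index of right.
left_join = left.join(right, on="key")                # how='left' is the default
inner_join = left.join(right, on="key", how="inner")  # only keys found on both sides

# Overlapping column names without lsuffix/rsuffix raise ValueError.
try:
    left.join(left)
    overlap_error = False
except ValueError:
    overlap_error = True

print(left_join)
print(inner_join)
print(overlap_error)
```

With how='left', the row for key "c" is kept and its y value is NaN; with how='inner' that row is dropped.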
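5. DataFrame.merge(right, how, on, left_on, right_on, left_index, right_index, suffixes)
Merges DataFrame objects with a database-style join.
Parameters:
right : DataFrame or named Series
The object to merge with.
how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
The type of merge to perform.
on : label or list
Column names to join on; they must be present in both objects.
left_on : label or list, or array-like
Column names in the left object to join on.
right_on : label or list, or array-like
Column names in the right object to join on.
left_index : bool, default False
If True, use the index of the left object as the join key.
right_index : bool, default False
If True, use the index of the right object as the join key.
suffixes : list-like, default is ("_x", "_y")
A length-2 sequence of strings appended to overlapping column names from the left and right objects respectively.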
>>> df = pd.DataFrame({'A': ['foo', 'bar', 'bar',
...                          'foo', 'bar'],
...                    'B': ['one', 'two', 'three',
...                          'two', 'one'],
...                    'C': np.random.randn(5),
...                    'D': np.random.randn(5)})
>>> df
A B C D
0 foo one -0.598380 0.934121
1 bar two 0.209143 -0.988885
2 bar three -0.634051 0.286479
3 foo two 0.551370 -0.282658
4 bar one -0.174126 -0.193955
>>> right = pd.DataFrame({'A': ['foo', 'bar', 'foo',
...                             'foo'],
...                       'D': np.random.randn(4)})
>>> right
A D
0 foo 0.626931
1 bar 0.033224
2 foo 0.688018
3 foo 0.420344
# Merge the two tables on column "A"; the overlapping column name D gets the default suffixes _x and _y
>>> df.merge(right, left_on='A', right_on='A')
A B C D_x D_y
0 foo one -0.598380 0.934121 0.626931
1 foo one -0.598380 0.934121 0.688018
2 foo one -0.598380 0.934121 0.420344
3 foo two 0.551370 -0.282658 0.626931
4 foo two 0.551370 -0.282658 0.688018
5 foo two 0.551370 -0.282658 0.420344
6 bar two 0.209143 -0.988885 0.033224
7 bar three -0.634051 0.286479 0.033224
8 bar one -0.174126 -0.193955 0.033224
>>> df.merge(right, on='A', suffixes=('_left', '_right'))
A B C D_left D_right
0 foo one -0.598380 0.934121 0.626931
1 foo one -0.598380 0.934121 0.688018
2 foo one -0.598380 0.934121 0.420344
3 foo two 0.551370 -0.282658 0.626931
4 foo two 0.551370 -0.282658 0.688018
5 foo two 0.551370 -0.282658 0.420344
6 bar two 0.209143 -0.988885 0.033224
7 bar three -0.634051 0.286479 0.033224
8 bar one -0.174126 -0.193955 0.033224
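The examples above only use the default inner merge. As a rough illustration of the how, left_index and right_index parameters, here is a sketch on small made-up tables; the frame names and values are assumptions chosen so that the row counts are easy to check by hand.

```python
import pandas as pd

left = pd.DataFrame({"A": ["foo", "bar", "baz"], "C": [1, 2, 3]})
right = pd.DataFrame({"A": ["foo", "foo", "qux"], "D": [10, 20, 30]})

# inner: only keys present in both tables ("foo", matched twice).
inner = left.merge(right, on="A")

# left: every left row is kept; "bar" and "baz" get NaN in D.
left_kept = left.merge(right, on="A", how="left")

# outer: union of keys from both tables.
outer = left.merge(right, on="A", how="outer")

# Join on the row index of both tables instead of on a column.
by_index = left.merge(right, left_index=True, right_index=True,
                      suffixes=("_l", "_r"))

print(len(inner), len(left_kept), len(outer))
print(by_index)
```

Since one left row can match several right rows, the result of a merge can have more rows than either input, as the "foo" key shows here.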
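6. DataFrame.update(other)
Modifies the calling object in place using the non-NA values of the passed object, aligned on the index. The example below calls update on a single column, df['C'], so it is Series.update that runs and other is interpreted as a Series.
Parameters:
other : Series, or object coercible into Series
The values to use for the update.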
>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
...                          'foo', 'bar', 'foo', 'foo'],
...                    'B': ['one', 'one', 'two', 'three',
...                          'two', 'two', 'one', 'three'],
...                    'C': np.random.randn(8),
...                    'D': np.random.randn(8)})
>>> df
A B C D
0 foo one -1.355228 -1.154257
1 bar one -1.358041 -0.381076
2 foo two -0.802152 0.992619
3 bar three -0.862556 -0.428032
4 foo two -0.180756 -0.352562
5 bar two -1.327571 -0.577039
6 foo one -1.520092 -0.278328
7 foo three -0.833843 0.644365
>>> other = list(np.random.randn(7))
>>> other
[-0.3970351091229411,
1.074159729611879,
-0.5120292849313965,
1.4501524543239934,
0.012847492863483247,
-0.8810326090034996,
0.8653033389654352]
>>> df['C'].update(other)
>>> df
A B C D
0 foo one -0.397035 -1.154257
1 bar one 1.074160 -0.381076
2 foo two -0.512029 0.992619
3 bar three 1.450152 -0.428032
4 foo two 0.012847 -0.352562
5 bar two -0.881033 -0.577039
6 foo one 0.865303 -0.278328
7 foo three -0.833843 0.644365
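The example above goes through df['C'], which relies on the selected column sharing its data with df; this may not hold when Copy-on-Write is enabled in newer pandas versions. A sketch of calling update on the DataFrame itself, with made-up values, might look as follows. It also shows that NaN values in other are skipped and that rows are matched by index label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"C": [1.0, 2.0, 3.0], "D": [4.0, 5.0, 6.0]})

# other covers only column C and only the row labels 0 and 2.
other = pd.DataFrame({"C": [10.0, np.nan]}, index=[0, 2])

# update works in place and returns None.
result = df.update(other)

print(df)
print(result)
```

Only the cell at row 0, column C changes: row 2 is left alone because the replacement value is NaN, and column D is untouched because other has no such column.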