Data Wrangling: Concatenating Along an Axis and Combining Overlapping Data

Latest recommended article published 2024-07-19 04:01:31

AI路漫漫


Article link: https://blog.csdn.net/weixin_46192930/article/details/106746364


2.3 Concatenating Along an Axis

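Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking. NumPy's concatenate function does this for NumPy arrays.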

import numpy as np
import pandas as pd

arr = np.arange(12).reshape((3, 4))
np.concatenate([arr, arr], axis=1)
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
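For comparison, here is a small sketch of how the axis argument changes the shape of the result. It reuses the same arr as above; the names stacked_rows and stacked_cols are only illustrative.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

# axis=0 appends the second array below the first: 6 rows, 4 columns.
stacked_rows = np.concatenate([arr, arr], axis=0)

# axis=1 appends the second array to the right of the first: 3 rows, 8 columns.
stacked_cols = np.concatenate([arr, arr], axis=1)

print(stacked_rows.shape)  # (6, 4)
print(stacked_cols.shape)  # (3, 8)
```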

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
pd.concat([s1,s2,s3])
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

pd.concat([s1, s2, s3], axis=1)      # axis=1 produces a DataFrame instead of a Series
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0

pd.concat([s1, s2, s3], axis=1, join='inner')   # intersection of the indexes; empty here because the three Series share no labels
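The result above is empty only because s1, s2, and s3 have no labels in common. As a rough sketch of what join='inner' does when the labels do overlap, the example below builds a hypothetical s4 (my own addition, not part of the example above) that shares 'a' and 'b' with s1.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

# Hypothetical Series whose index overlaps with s1 on 'a' and 'b'.
s4 = pd.concat([s1, s3])

outer = pd.concat([s1, s4], axis=1)                 # default: union of the labels
inner = pd.concat([s1, s4], axis=1, join='inner')   # only labels present in both
empty = pd.concat([s1, s2, s3], axis=1, join='inner')

print(list(inner.index))  # ['a', 'b']
print(empty.empty)        # True
```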

result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])   # keys builds a hierarchical index along the concatenation axis
result
one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64

result.unstack()             # pivot the inner index level into columns
         a    b    f    g
one    0.0  1.0  NaN  NaN
two    0.0  1.0  NaN  NaN
three  NaN  NaN  5.0  6.0

pd.concat([s1, s1, s3], keys=['one', 'two', 'three'], axis=1)   # with axis=1 the keys become the DataFrame column headers
   one  two  three
a  0.0  0.0    NaN
b  1.0  1.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0
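One practical use of keys, sketched below, is that each original piece can be pulled back out of the concatenated result by its key. The names piece and wide are only illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])

result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])

# The outer index level holds the keys, so .loc with a key returns that piece.
piece = result.loc['three']
print(piece)

# The index has two levels: the key and the original label.
print(result.index.nlevels)  # 2

# unstack() turns the inner level into columns, one row per key.
wide = result.unstack()
print(wide.shape)  # (3, 4)
```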


# The same logic extends to DataFrame objects.
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'], columns=['one', 'two'])
# 5 + np.arange(4) adds the scalar to every element, giving array([5, 6, 7, 8]); a neat trick
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'], columns=['three', 'four'])
df1
   one  two
a    0    1
b    2    3
c    4    5
df2
   three  four
a      5     6
c      7     8

pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0

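# Passing a dict of objects instead of a list uses the dict keys for the keys option.
# The names argument labels the levels of the resulting column index.
pd.concat({'level1': df1, 'level2': df2}, axis=1, names=['upper', 'lower'])
upper level1     level2
lower    one two  three four
a          0   1    5.0  6.0
b          2   3    NaN  NaN
c          4   5    7.0  8.0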
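To see what the dict keys and names actually end up as, here is a minimal sketch that inspects the column index and selects one group of columns by its top-level key. The variable name combined is my own.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

combined = pd.concat({'level1': df1, 'level2': df2}, axis=1,
                     names=['upper', 'lower'])

# The two column levels carry the names passed via names=.
print(list(combined.columns.names))  # ['upper', 'lower']

# Selecting a top-level key returns the columns that came from that object.
print(combined['level1'])

# df2 has no row 'b', so its columns are NaN there.
print(combined['level2'].loc['b'])
```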


df1=pd.DataFrame(np.random.randn(3,4),columns=['a','b','c','d'])
df2=pd.DataFrame(np.random.randn(2,3),columns=['b','d','a'])
pd.concat([df1, df2], ignore_index=True)    # discard the original row labels and renumber from 0; by default they are kept
          a         b         c         d
0  0.930013 -0.097318 -1.574589 -0.172548
1  0.478817 -0.453965 -0.510146  0.921774
2 -0.076038 -0.279683  1.101760  0.675676
3 -1.227738  0.439467       NaN -0.059227
4 -0.856187 -0.468771       NaN  0.900543
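The difference is easiest to see by comparing the row labels with and without ignore_index. A small sketch, with kept and fresh as illustrative names (the values are random, so only the index is shown):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

kept = pd.concat([df1, df2])                      # default: original labels survive
fresh = pd.concat([df1, df2], ignore_index=True)  # labels replaced by 0..n-1

print(list(kept.index))      # [0, 1, 2, 0, 1]
print(kept.index.is_unique)  # False
print(list(fresh.index))     # [0, 1, 2, 3, 4]
```

With the default, the labels 0 and 1 each appear twice, which can be surprising when selecting rows by label later.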

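The join_axes parameter is gone, though I'm fairly sure I didn't buy a counterfeit book: it was deprecated in pandas 0.25 and removed in pandas 1.0. Reindexing the concatenated result covers the same use case.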
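Since join_axes is not available in current pandas, one way to get a similar result is to call reindex on the output of concat. This is only a sketch: s4 and the label list ['a', 'c', 'b', 'e'] are assumed inputs chosen for illustration.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1, s3])  # illustrative input with index a, b, f, g

# Older code: pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
# Sketch of a replacement: concatenate first, then pick the rows to keep.
aligned = pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])

print(aligned)
# Labels 'c' and 'e' are in neither input, so those rows are all NaN.
```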

2.4 Combining Overlapping Data

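Another data combination scenario is neither a merge nor a concatenation: two datasets have indexes that overlap in full or in part, and values from one are used to fill the missing values in the other.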

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series([0.,np.nan,2.,np.nan,9.,5.],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
np.where(pd.isnull(a),b,a)
array([0. , 2.5, 2. , 3.5, 4.5, 5. ])

b.combine_first(a)           # b's values are kept wherever they are not null; a only fills b's gaps
f    0.0
e    2.5
d    2.0
c    3.5
b    9.0
a    5.0
dtype: float64
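In the example above both Series use the same labels in the same order, so np.where and combine_first look interchangeable. The sketch below, which assumes a different b whose labels run in the opposite order, suggests where they part ways: np.where pairs values by position, while combine_first lines them up by label.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
# Assumed variant of b: same labels as a, but in the opposite order.
b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
              index=['a', 'b', 'c', 'd', 'e', 'f'])

# Position-based: element i of a is patched with element i of b.
positional = np.where(pd.isnull(a), b, a)

# Label-based: a['a'] is patched with b['a'], a['f'] with b['f'], and so on.
by_label = a.combine_first(b)

print(positional)
print(by_label)
```

Here label 'd' stays NaN in by_label because it is missing in both Series.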


df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})
df1
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14
df2
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0

df1.combine_first(df2)
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
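A few details of this result are easy to miss, so here is a short sketch that checks them: the output covers the union of both row indexes and column sets, column c turns into floats because row 4 has no value for it, and a cell stays NaN when both frames are missing it. The name patched is illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})

patched = df1.combine_first(df2)

print(patched.shape)                 # (5, 3)
print(patched['c'].dtype)            # float64
print(pd.isna(patched.loc[0, 'b']))  # True: missing in both df1 and df2
print(patched.loc[1, 'a'])           # 4.0: df1 was NaN here, so df2's value is used
```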