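How to convert the DataFrames of multiple instruments into a single MultiIndex DataFrame.
For example, given data for the constituent stocks of the CSI 500 index, how can those 500 stocks be combined into one MultiIndex DataFrame?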
import pandas as pd
df1 = pd.read_csv(r"C:\Users\songroom\Desktop\600019.csv")
df2 = pd.read_csv(r"C:\Users\songroom\Desktop\600036.csv")
print(df1.head())
print(df2.head())
The output looks like this:
Unnamed: 0 open close low high volume money factor \
0 2010-1-4 9:31 958.39 959.56 958.39 959.95 3615 3465823 39.086
1 2010-1-4 9:32 959.56 957.61 957.61 959.56 2236 2142921 39.086
2 2010-1-4 9:33 957.61 954.87 954.87 957.61 1356 1296027 39.086
3 2010-1-4 9:34 955.26 954.48 954.09 955.65 2085 1990535 39.086
4 2010-1-4 9:35 954.87 954.09 953.70 954.87 2730 2606687 39.086
high_limit low_limit avg pre_close paused open_interest
0 1047.9 857.16 958.78 952.53 0 0
1 1047.9 857.16 958.39 959.56 0 0
2 1047.9 857.16 955.65 957.61 0 0
3 1047.9 857.16 954.48 954.87 0 0
4 1047.9 857.16 954.87 954.48 0 0
Unnamed: 0 open close low high volume money factor \
0 2010-1-4 9:31 1000 1001 1002 1003 3615 3465823 39.086
1 2010-1-4 9:32 1000 1001 1002 1003 2236 2142921 39.086
2 2010-1-4 9:33 1000 1001 1002 1003 1356 1296027 39.086
3 2010-1-4 9:34 1000 1001 1002 1003 2085 1990535 39.086
4 2010-1-4 9:35 1000 1001 1002 1003 2730 2606687 39.086
high_limit low_limit avg pre_close paused open_interest
0 1047.9 857.16 958.78 952.53 0 0
1 1047.9 857.16 958.39 959.56 0 0
2 1047.9 857.16 955.65 957.61 0 0
3 1047.9 857.16 954.48 954.87 0 0
4 1047.9 857.16 954.87 954.48 0 0
First, tidy the data a little to prepare for building the MultiIndex DataFrame:
codes1 = ["600019.XSHG" for i in range(len(df1))]
codes2 = ["600036.XSHG" for i in range(len(df2))]
df1 = df1.rename(columns={"Unnamed: 0":"datetime"})
df2 = df2.rename(columns={"Unnamed: 0":"datetime"})
# df1["code"] = codes1
# df2["code"] = codes2
#print(df1.head())
#print(df2.head())
import numpy as np
# Drop the datetime column and keep only the numeric fields
arr1 = np.array(df1.iloc[:,1:])
#print(arr1.shape)
arr2 = np.array(df2.iloc[:,1:])
def column_hstack(arrs):
    # Interleave columns field by field: field 0 of every array, then field 1, ...
    row,col = arrs[0].shape
    for arr in arrs:
        assert arr.shape == (row,col)  # every array must have the same shape
    _arrs = []
    for i in range(col):
        for j in range(len(arrs)):
            temp = arrs[j][:,i]
            if len(_arrs)!= 0:
                _arrs = np.column_stack((_arrs,temp))
            else:
                _arrs = np.array(temp)
            #print(f"{i} {j} _arrs.shape: {_arrs.shape}")
    return _arrs
arrs = [arr1,arr2]
data = column_hstack(arrs)
print(data.shape)
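The data we need is now ready. For the row index we use only datetime; for the column index we use the stock codes together with the fields.
Note: the shape and the column order of the data must match the index labels. A shape mismatch raises an error later, when the DataFrame is constructed, and a wrong column order silently puts values under the wrong labels.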
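The column interleaving performed by column_hstack above can also be sketched with np.stack followed by a reshape, which avoids re-copying the growing array on every iteration. This is only a sketch: interleave_columns is a hypothetical helper name, and the two small arrays stand in for the per-stock data.

```python
import numpy as np

def interleave_columns(arrs):
    """Interleave the columns of equally shaped 2-D arrays, field by field.

    Hypothetical helper: output column i * len(arrs) + j holds column i of arrs[j].
    """
    rows, cols = arrs[0].shape
    for arr in arrs:
        if arr.shape != (rows, cols):
            raise ValueError(f"shape mismatch: {arr.shape} != {(rows, cols)}")
    # Stack to shape (rows, cols, n_arrays), then flatten the last two axes
    return np.stack(arrs, axis=2).reshape(rows, cols * len(arrs))

a = np.array([[1, 2], [3, 4]])      # stand-in for stock 1: 2 rows, 2 fields
b = np.array([[10, 20], [30, 40]])  # stand-in for stock 2
out = interleave_columns([a, b])
print(out)
```

Each row comes out as field 0 of stock 1, field 0 of stock 2, field 1 of stock 1, field 1 of stock 2, which is the order pd.MultiIndex.from_product([fields, codes]) expects.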
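Note that the layout we want is not layout 1 below (the layouts are illustrative only; the values in them are arbitrary filler).
Layout 1: codes on the top level, fields below
Instead we want layout 2: the price fields on the top level, codes below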
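The two layouts are easier to see than to describe, so here is a minimal sketch that builds both with placeholder numbers. The values are filler, and only two fields and two timestamps are used to keep the printout short.

```python
import numpy as np
import pandas as pd

codes = ["600019.XSHG", "600036.XSHG"]
fields = ["open", "close"]
dates = ["2010-1-4 9:31", "2010-1-4 9:32"]
values = np.arange(8).reshape(2, 4)  # placeholder numbers, not real prices

# Layout 1: codes on the top level, fields below
layout1 = pd.DataFrame(
    values, index=dates,
    columns=pd.MultiIndex.from_product([codes, fields], names=["codes", "fields"]),
)

# Layout 2: fields on the top level, codes below
layout2 = pd.DataFrame(
    values, index=dates,
    columns=pd.MultiIndex.from_product([fields, codes], names=["fields", "codes"]),
)

print(layout1)
print(layout2)
# With layout 2, selecting one field returns one column per stock
print(layout2["close"])
```

Layout 2 is convenient when the usual query is "one field across all stocks", because a single top-level selection such as layout2["close"] returns a frame with one column per code.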
Next, we build the frame with the classic pd.MultiIndex.from_product approach:
dates = df1.datetime
codes = ["600019.XSHG","600036.XSHG" ]
index = pd.MultiIndex.from_product([dates],names=['datetime']) # row
fields = ["open","close","low","high","volume","money","factor","high_limit","low_limit","avg","pre_close","paused","open_interest"]
columns = pd.MultiIndex.from_product([fields,codes],names=['fields','codes']) # the order of the column levels matters
df_data = pd.DataFrame(data,index=index,columns=columns)
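Note the column index above: "fields" comes first, so it becomes the top level, and "codes" is the level below it. The columns of data must be arranged in the same order.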
Output:
df_data.close
codes 600019.XSHG 600036.XSHG
datetime
2010-1-4 9:31 959.56 1001.0
2010-1-4 9:32 957.61 1001.0
2010-1-4 9:33 954.87 1001.0
2010-1-4 9:34 954.48 1001.0
2010-1-4 9:35 954.09 1001.0
2010-1-4 9:36 955.65 1001.0
2010-1-4 9:37 958.00 1001.0
2010-1-4 9:38 955.65 1001.0
2010-1-4 9:39 953.70 1001.0
2010-1-4 9:40 953.70 1001.0
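The same fields-over-codes frame can also be sketched without interleaving arrays by hand, by letting pd.concat build the column levels from a dict of per-stock frames. This is an alternative to the from_product construction, not the method used above. make_frame is a hypothetical stand-in for reading one CSV, and its numbers are made up.

```python
import pandas as pd

def make_frame(close):
    """Hypothetical stand-in for one per-stock CSV (values are made up)."""
    return pd.DataFrame({
        "datetime": ["2010-1-4 9:31", "2010-1-4 9:32"],
        "open": [close - 1.0, close - 2.0],
        "close": [close, close + 1.0],
    })

frames = {"600019.XSHG": make_frame(959.0), "600036.XSHG": make_frame(1001.0)}

# With a dict, concat uses the keys as the outer column level (codes on top)
wide = pd.concat(
    {code: df.set_index("datetime") for code, df in frames.items()},
    axis=1,
)
wide.columns = wide.columns.set_names(["codes", "fields"])

# Move fields to the top level, then sort so each field's columns sit together
wide = wide.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(wide["close"])
```

Because concat aligns rows on the datetime index, this variant does not need every stock to have the same number of rows; missing timestamps become NaN. With a few hundred stocks the dict would simply hold one entry per code.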
Comparing df_data.close with the close columns of the source frames confirms that the values are correct:
[df1.close,df2.close]
[0 959.56
1 957.61
2 954.87
3 954.48
4 954.09
5 955.65
6 958.00
7 955.65
8 953.70
9 953.70
Name: close, dtype: float64, 0 1001
1 1001
2 1001
3 1001
4 1001
5 1001
6 1001
7 1001
8 1001
9 1001
Name: close, dtype: int64]
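Eyeballing one field works for two stocks, but with many codes the comparison can be automated. The sketch below rebuilds a small fields-over-codes frame from stand-in data (only the first two rows of open and close shown earlier) and checks every column against its source. Note that stacking everything into one array upcasts integer columns to float, which is why 1001 prints as 1001.0 in df_data.

```python
import numpy as np
import pandas as pd

codes = ["600019.XSHG", "600036.XSHG"]
fields = ["open", "close"]

# Stand-ins for df1 and df2, using the first two rows shown earlier
df1 = pd.DataFrame({"open": [958.39, 959.56], "close": [959.56, 957.61]})
df2 = pd.DataFrame({"open": [1000, 1000], "close": [1001, 1001]})
sources = dict(zip(codes, [df1, df2]))

# Interleave field by field, the same order as from_product([fields, codes])
data = np.column_stack([sources[c][f].to_numpy() for f in fields for c in codes])
columns = pd.MultiIndex.from_product([fields, codes], names=["fields", "codes"])
df_data = pd.DataFrame(data, columns=columns)

# Every (field, code) column must equal the column it was built from
for f in fields:
    for c in codes:
        assert np.allclose(df_data[f][c].to_numpy(), sources[c][f].to_numpy()), (f, c)
print("all columns match")
```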