本篇将介绍如何对pandas数据进行操作
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
设置一下输出屏幕的宽度
pd.set_option('display.width', 200)
一、数据创建的其他方式
数据结构的创建不止是上篇中介绍的标准形式,本篇再介绍几种。例如,我们可以创建一个以日期为元素的Series:
dates = pd.date_range('20150101', periods=5)
print(dates)
# DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03',
'2015-01-04', '2015-01-05'], dtype='datetime64[ns]', freq='D')
将这个日期Series作为索引赋给一个DataFrame:
df=pd.DataFrame(np.random.randn(5, 4),index=dates,columns=list('ABCD'))
print(df)
# A B C D
# 2015-01-01 -0.3307 -1.1993 2.1873 0.9368
# 2015-01-02 -0.1020 -1.0258 1.4000 -0.4759
# 2015-01-03 0.9739 1.9301 -0.9519 -0.7754
# 2015-01-04 1.1615 0.3040 0.3494 -1.1886
# 2015-01-05 0.9669 0.0103 -0.3264 -0.8607
只要是能转换成Series的对象,都可以用于创建DataFrame:
df2 = pd.DataFrame({ 'A': 1.,
'B': pd.Timestamp('20150214'),
'C': pd.Series(1.6,index=list(range(4)),dtype='float64'),
'D': np.array([4] * 4, dtype='int64'),
'E': 'hello pandas!'})
print(df2)
# A B C D E
# 0 1.0 2015-02-14 1.6 4 hello pandas!
# 1 1.0 2015-02-14 1.6 4 hello pandas!
# 2 1.0 2015-02-14 1.6 4 hello pandas!
# 3 1.0 2015-02-14 1.6 4 hello pandas!
二、数据的查看
获取2015年一月份全部的交易日内十支股票的日行情信息
stock_list = ['000001.XSHE', '000002.XSHE', '000568.XSHE', '000625.XSHE', '000768.XSHE',
'600028.XSHG', '600030.XSHG', '601111.XSHG', '601390.XSHG', '601998.XSHG']
raw_data = DataAPI.MktEqudGet(secID=stock_list, beginDate='20150101', endDate='20150131', pandas='1')
df = raw_data[['secID', 'tradeDate', 'secShortName', 'openPrice', 'highestPrice',
'lowestPrice', 'closePrice', 'turnoverVol']]
print(df.shape)
# (200, 8)
print(df.head(3))
# secID tradeDate secShortName openPrice highestPrice lowestPrice closePrice turnoverVol
# 0 000001.XSHE 2015-01-05 平安银行 15.99 16.28 15.60 16.02 286043643
# 1 000001.XSHE 2015-01-06 平安银行 15.85 16.39 15.55 15.78 216642140
# 2 000001.XSHE 2015-01-07 平安银行 15.56 15.83 15.30 15.48 170012067
dataframe.describe()提供了DataFrame中纯数值数据的统计信息:
print(df.describe())
# openPrice highestPrice lowestPrice closePrice turnoverVol
# count 200.00000 200.000000 200.00000 200.000000 2.000000e+02
# mean 15.17095 15.634000 14.86545 15.242750 2.384811e+08
# std 7.72807 7.997345 7.56136 7.772184 2.330510e+08
# min 6.14000 6.170000 6.02000 6.030000 1.242183e+07
# 25% 8.09500 8.250000 7.98750 8.127500 7.357002e+07
# 50% 13.96000 14.335000 13.75500 13.925000 1.554569e+08
# 75% 19.95000 20.500000 19.46250 20.012500 3.358617e+08
# max 36.40000 37.250000 34.68000 36.150000 1.310855e+09
DataFrame提供了两种形式的排序。一种是按行列排序,即按照索引(行名)或者列名进行排序,可调用dataframe.sort_index,指定axis=0表示按索引(行名)排序,axis=1表示按列名排序,并可指定升序或者降序:
print(df.sort_index(axis=1, ascending=False).head())
# turnoverVol tradeDate secShortName secID openPrice lowestPrice highestPrice closePrice
# 0 286043643 2015-01-05 平安银行 000001.XSHE 15.99 15.60 16.28 16.02
# 1 216642140 2015-01-06 平安银行 000001.XSHE 15.85 15.55 16.39 15.78
# 2 170012067 2015-01-07 平安银行 000001.XSHE 15.56 15.30 15.83 15.48
# 3 140771421 2015-01-08 平安银行 000001.XSHE 15.50 14.90 15.57 14.96
# 4 250850023 2015-01-09 平安银行 000001.XSHE 14.90 14.71 15.87 15.08
第二种排序是按值排序,可指定列名和排序方式,默认的是升序排序:
print(df.sort(columns='tradeDate').head())
df = df.sort(columns=['tradeDate', 'secID'], ascending=[False, True])
print(df.head())
# secID tradeDate secShortName openPrice highestPrice lowestPrice closePrice turnoverVol
# 0 000001.XSHE 2015-01-05 平安银行 15.99 16.28 15.60 16.02 286043643
# 60 000625.XSHE 2015-01-05 长安汽车 16.40 18.07 16.32 18.07 82087982
# 140 601111.XSHG 2015-01-05 中国国航 7.98 8.62 7.98 8.62 231611758
# 20 000002.XSHE 2015-01-05 万科A 14.39 15.29 14.22 14.91 656083570
# 100 600028.XSHG 2015-01-05 中国石化 6.59 7.14 6.45 7.14 1186499645
# secID tradeDate secShortName openPrice highestPrice lowestPrice closePrice turnoverVol
# 19 000001.XSHE 2015-01-30 平安银行 13.93 14.12 13.76 13.93 93011669
# 39 000002.XSHE 2015-01-30 万科A 13.09 13.49 12.80 13.12 209624706
# 59 000568.XSHE 2015-01-30 泸州老窖 19.15 19.51 19.11 19.12 14177179
# 79 000625.XSHE 2015-01-30 长安汽车 19.16 19.45 18.92 19.18 21233495
# 99 000768.XSHE 2015-01-30 中航飞机 25.38 25.65 24.28 24.60 59550293
三、数据的访问和操作
3.1 再谈数据的访问
收盘价在均值以上的数据:
print(df[df.closePrice > df.closePrice.mean()].head())
# secID tradeDate secShortName openPrice highestPrice lowestPrice closePrice turnoverVol
# 59 000568.XSHE 2015-01-30 泸州老窖 19.15 19.51 19.11 19.12 14177179
# 79 000625.XSHE 2015-01-30 长安汽车 19.16 19.45 18.92 19.18 21233495
# 99 000768.XSHE 2015-01-30 中航飞机 25.38 25.65 24.28 24.60 59550293
# 139 600030.XSHG 2015-01-30 中信证券 28.50 28.72 27.78 27.86 304218245
# 58 000568.XSHE 2015-01-29 泸州老窖 19.04 19.23 19.00 19.15 12421826
isin()函数可方便地过滤DataFrame中的数据:
print(df[df['secID'].isin(['601628.XSHG', '000001.XSHE', '600030.XSHG'])].head())
# secID tradeDate secShortName openPrice highestPrice lowestPrice closePrice turnoverVol
# 19 000001.XSHE 2015-01-30 平安银行 13.93 14.12 13.76 13.93 93011669
# 139 600030.XSHG 2015-01-30 中信证券 28.50 28.72 27.78 27.86 304218245
# 18 000001.XSHE 2015-01-29 平安银行 13.82 14.01 13.75 13.90 101675329
# 138 600030.XSHG 2015-01-29 中信证券 28.10 28.58 27.81 28.18 386310957
# 17 000001.XSHE 2015-01-28 平安银行 13.87 14.30 13.80 14.06 124087755
3.2 处理缺失数据
在访问数据的基础上,我们可以更改数据,例如,修改某些元素为缺失值:
df['openPrice'][df['secID'] == '000001.XSHE'] = np.nan
df['highestPrice'][df['secID'] == '601111.XSHG'] = np.nan
df['lowestPrice'][df['secID'] == '601111.XSHG'] = np.nan
df['closePrice'][df['secID'] == '000002.XSHE'] = np.nan
df['turnoverVol'][df['secID'] == '601111.XSHG'] = np.nan
print(df.head(10))
# secID tradeDate secShortName openPrice highestPrice lowestPrice closePrice turnoverVol
# 19 000001.XSHE 2015-01-30 平安银行 NaN 14.12 13.76 13.93 93011669.0
# 39 000002.XSHE 2015-01-30 万科A 13.09 13.49 12.80 NaN 209624706.0
# 59 000568.XSHE 2015-01-30 泸州老窖 19.15 19.51 19.11 19.12 14177179.0
# 79 000625.XSHE 2015-01-30 长安汽车 19.16 19.45 18.92 19.18 21233495.0
# 99 000768.XSHE 2015-01-30 中航飞机 25.38 25.65 24.28 24.60 59550293.0
# 119 600028.XSHG 2015-01-30 中国石化 6.14 6.17 6.02 6.03 502445638.0
# 139 600030.XSHG 2015-01-30 中信证券 28.50 28.72 27.78 27.86 304218245.0
# 159 601111.XSHG 2015-01-30 中国国航 7.92 NaN NaN 7.69 NaN
# 179 601390.XSHG 2015-01-30 中国中铁 8.69 8.69 8.12 8.14 352357431.0
# 199 601998.XSHG 2015-01-30 中信银行 7.10 7.14 6.92 6.95 68146718.0
dataframe.dropna()可以按行丢弃带有nan的数据;若指定how=‘all’(默认是’any’),则只在整行全部是nan时丢弃数据;若指定thresh,则表示当某行数据非缺失列数超过指定数值时才保留;要指定根据某列丢弃可以通过subset完成。
print(df.shape)
print(df.dropna(thresh=6).shape)
print(df.dropna(thresh=6).head())
print(df.dropna(subset=['closePrice']).shape)
print(df.dropna(subset=['closePrice']).head())
# (200, 8)
# (180, 8)
# secID tradeDate secShortName openPrice highestPrice lowestPrice closePrice turnoverVol
# 19 000001.XSHE 2015-01-30 平安银行 NaN 14.12 13.76 13.93 93011669.0
# 39 000002.XSHE 2015-01-30 万科A 13.09 13.49 12.80 NaN 209624706.0
# 59 000568.XSHE 2015-01-30 泸州老窖 19.15 19.51 19.11 19.12 14177179.0
# 79 000625.XSHE 2015-01-30 长安汽车 19.16 19.45 18.92 19.18 21233495.0
# 99 000768.XSHE 2015-01-30 中航飞机 25.38 25.65 24.28 24.60 59550293.0
# (180, 8)
# secID tradeDate secShortName openPrice highestPrice lowestPrice closePrice turnoverVol
# 19 000001.XSHE 2015-01-30 平安银行 NaN 14.12 13.76 13.93 93011669.0
# 59 000568.XSHE 2015-01-30 泸州老窖 19.15 19.51 19.11 19.12 14177179.0
# 79 000625.XSHE 2015-01-30 长安汽车 19.16 19.45 18.92 19.18 21233495.0
# 99 000768.XSHE 2015-01-30 中航飞机 25.38 25.65 24.28 24.60 59550293.0
# 119 600028.XSHG 2015-01-30 中国石化 6.14 6.17 6.02 6.03 502445638.0
有数据缺失时也未必是全部丢弃,dataframe.fillna(value=value)可以指定填补缺失值的数值
3.3 数据操作
Series和DataFrame的类函数,如mean()、sum()等,指定0按列进行,指定1按行进行:
df = raw_data[['secID', 'tradeDate', 'secShortName', 'openPrice', 'highestPrice',
'lowestPrice', 'closePrice', 'turnoverVol']]
print(df.mean(0))
# openPrice 1.517095e+01
# highestPrice 1.563400e+01
# lowestPrice 1.486545e+01
# closePrice 1.524275e+01
# turnoverVol 2.384811e+08
# dtype: float64
value_counts函数可以方便地统计频数:
print(df['closePrice'].value_counts().head())
# 6.58 3
# 13.12 2
# 8.58 2
# 8.12 2
# 6.60 2
# Name: closePrice, dtype: int64
Series可以调用map函数来对每个元素应用一个函数,DataFrame可以调用apply函数对每一列(行)应用一个函数,applymap对每个元素应用一个函数。
将收盘价调整到[0, 1]区间:
print(df[['closePrice']].apply(lambda x: (x-x.min())/(x.max()-x.min())).head())
# closePrice
# 0 0.331673
# 1 0.323705
# 2 0.313745
# 3 0.296481
# 4 0.300465
使用append可以在Series后添加元素,以及在DataFrame尾部添加一行:
dat1 = df[['secID', 'tradeDate', 'closePrice']].head()
dat2 = df[['secID', 'tradeDate', 'closePrice']].iloc[2]
print(dat1)
dat = dat1.append(dat2, ignore_index=True)
print(dat) #dat1没有增加行
# secID tradeDate closePrice
# 0 000001.XSHE 2015-01-05 16.02
# 1 000001.XSHE 2015-01-06 15.78
# 2 000001.XSHE 2015-01-07 15.48
# 3 000001.XSHE 2015-01-08 14.96
# 4 000001.XSHE 2015-01-09 15.08
# secID tradeDate closePrice
# 0 000001.XSHE 2015-01-05 16.02
# 1 000001.XSHE 2015-01-06 15.78
# 2 000001.XSHE 2015-01-07 15.48
# 3 000001.XSHE 2015-01-08 14.96
# 4 000001.XSHE 2015-01-09 15.08
# 5 000001.XSHE 2015-01-07 15.48
DataFrame可以像在SQL中一样进行合并,concat函数就是一种合并的方式。另外一种方式使用merge函数,需要指定依照哪些列进行合并
dat1 = df[['secID', 'tradeDate', 'closePrice']]
dat2 = df[['secID', 'tradeDate', 'turnoverVol']]
dat = dat1.merge(dat2, on=['secID', 'tradeDate'])
print(dat1.head())
print(dat2.head())
print(dat.head())
# secID tradeDate closePrice
# 0 000001.XSHE 2015-01-05 16.02
# 1 000001.XSHE 2015-01-06 15.78
# 2 000001.XSHE 2015-01-07 15.48
# 3 000001.XSHE 2015-01-08 14.96
# 4 000001.XSHE 2015-01-09 15.08
# secID tradeDate turnoverVol
# 0 000001.XSHE 2015-01-05 286043643
# 1 000001.XSHE 2015-01-06 216642140
# 2 000001.XSHE 2015-01-07 170012067
# 3 000001.XSHE 2015-01-08 140771421
# 4 000001.XSHE 2015-01-09 250850023
# secID tradeDate closePrice turnoverVol
# 0 000001.XSHE 2015-01-05 16.02 286043643
# 1 000001.XSHE 2015-01-06 15.78 216642140
# 2 000001.XSHE 2015-01-07 15.48 170012067
# 3 000001.XSHE 2015-01-08 14.96 140771421
# 4 000001.XSHE 2015-01-09 15.08 250850023
DataFrame另一个强大的函数是groupby,对2015年一月内十支股票的开盘价,最高价,最低价,收盘价和成交量求平均值:
df_grp = df.groupby('secID')
grp_mean = df_grp.mean()
print(grp_mean)
# openPrice highestPrice lowestPrice closePrice turnoverVol
# secID
# 000001.XSHE 14.6550 14.9840 14.4330 14.6650 154710615
# 000002.XSHE 13.3815 13.7530 13.0575 13.4100 277459431
# 000568.XSHE 19.7220 20.1015 19.4990 19.7935 29199107
# 000625.XSHE 19.4915 20.2275 19.1040 19.7170 42633332
# 000768.XSHE 22.4345 23.4625 21.8830 22.6905 92781199
# 600028.XSHG 6.6060 6.7885 6.4715 6.6240 531966632
# 600030.XSHG 31.1505 32.0825 30.4950 31.2325 611544509
# 601111.XSHG 8.4320 8.6520 8.2330 8.4505 104143358
# 601390.XSHG 8.4060 8.6625 8.2005 8.4100 362831455
# 601998.XSHG 7.4305 7.6260 7.2780 7.4345 177541066
如果希望取每只股票的最新数据,应该怎么操作呢?drop_duplicates可以实现这个功能,首先对数据按日期排序,再按security ID去重:
df2 = df.sort(columns=['secID', 'tradeDate'], ascending=[True, False])
print(df2.drop_duplicates(subset='secID'))
# secID tradeDate secShortName openPrice highestPrice lowestPrice closePrice turnoverVol
# 19 000001.XSHE 2015-01-30 平安银行 13.93 14.12 13.76 13.93 93011669
# 39 000002.XSHE 2015-01-30 万科A 13.09 13.49 12.80 13.12 209624706
# 59 000568.XSHE 2015-01-30 泸州老窖 19.15 19.51 19.11 19.12 14177179
# 79 000625.XSHE 2015-01-30 长安汽车 19.16 19.45 18.92 19.18 21233495
# 99 000768.XSHE 2015-01-30 中航飞机 25.38 25.65 24.28 24.60 59550293
# 119 600028.XSHG 2015-01-30 中国石化 6.14 6.17 6.02 6.03 502445638
# 139 600030.XSHG 2015-01-30 中信证券 28.50 28.72 27.78 27.86 304218245
# 159 601111.XSHG 2015-01-30 中国国航 7.92 8.03 7.65 7.69 61877792
# 179 601390.XSHG 2015-01-30 中国中铁 8.69 8.69 8.12 8.14 352357431
# 199 601998.XSHG 2015-01-30 中信银行 7.10 7.14 6.92 6.95 68146718
若想要保留最老的数据,可以在降序排列后取最后一个记录,通过指定take_last=True(默认值为False,取第一条记录)
print(df2.drop_duplicates(subset='secID', take_last=True))
四、数据可视化
pandas数据直接可以绘图查看,其中set_index(‘tradeDate’)[‘closePrice’]表示将DataFrame的’tradeDate’这一列作为索引,将’closePrice’这一列作为Series的值,返回一个Series对象,随后调用plot函数绘图,更多的参数可以在matplotlib的文档中查看。
dat = df[df['secID'] == '600028.XSHG'].set_index('tradeDate')['closePrice']
dat.plot(title="Close Price of SINOPEC (600028) during Jan, 2015")
参考文献