第6天:数据处理的瑞士军刀pandas下篇

本篇将介绍如何对pandas数据进行操作

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

设置一下输出屏幕的宽度

pd.set_option('display.width', 200)

一、数据创建的其他方式

数据结构的创建不止是上篇中介绍的标准形式,本篇再介绍几种。例如,我们可以创建一个以日期为元素的Series:

dates = pd.date_range('20150101', periods=5)
print(dates)
# DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', 
'2015-01-04', '2015-01-05'], dtype='datetime64[ns]', freq='D')

将这个日期Series作为索引赋给一个DataFrame:

 

df=pd.DataFrame(np.random.randn(5, 4),index=dates,columns=list('ABCD'))
print(df)
#                  A       B       C       D
# 2015-01-01 -0.3307 -1.1993  2.1873  0.9368
# 2015-01-02 -0.1020 -1.0258  1.4000 -0.4759
# 2015-01-03  0.9739  1.9301 -0.9519 -0.7754
# 2015-01-04  1.1615  0.3040  0.3494 -1.1886
# 2015-01-05  0.9669  0.0103 -0.3264 -0.8607

只要是能转换成Series的对象,都可以用于创建DataFrame:

df2 = pd.DataFrame({ 'A': 1.,
                    'B': pd.Timestamp('20150214'),
                    'C': pd.Series(1.6,index=list(range(4)),dtype='float64'), 
                    'D': np.array([4] * 4, dtype='int64'),
                    'E': 'hello pandas!'})
print(df2)
#      A          B    C  D              E
# 0  1.0 2015-02-14  1.6  4  hello pandas!
# 1  1.0 2015-02-14  1.6  4  hello pandas!
# 2  1.0 2015-02-14  1.6  4  hello pandas!
# 3  1.0 2015-02-14  1.6  4  hello pandas!

二、数据的查看

获取2015年一月份全部的交易日内十支股票的日行情信息

stock_list = ['000001.XSHE', '000002.XSHE', '000568.XSHE', '000625.XSHE', '000768.XSHE', 
              '600028.XSHG', '600030.XSHG', '601111.XSHG', '601390.XSHG', '601998.XSHG']
raw_data = DataAPI.MktEqudGet(secID=stock_list, beginDate='20150101', endDate='20150131', pandas='1')
df = raw_data[['secID', 'tradeDate', 'secShortName', 'openPrice', 'highestPrice', 
'lowestPrice', 'closePrice', 'turnoverVol']]
print(df.shape)
# (200, 8)
print(df.head(3))
#          secID   tradeDate secShortName  openPrice  highestPrice  lowestPrice  closePrice  turnoverVol
# 0  000001.XSHE  2015-01-05         平安银行      15.99         16.28        15.60       16.02    286043643
# 1  000001.XSHE  2015-01-06         平安银行      15.85         16.39        15.55       15.78    216642140
# 2  000001.XSHE  2015-01-07         平安银行      15.56         15.83        15.30       15.48    170012067

dataframe.describe()提供了DataFrame中纯数值数据的统计信息:

print(df.describe())
#        openPrice  highestPrice  lowestPrice  closePrice   turnoverVol
# count  200.00000    200.000000    200.00000  200.000000  2.000000e+02
# mean    15.17095     15.634000     14.86545   15.242750  2.384811e+08
# std      7.72807      7.997345      7.56136    7.772184  2.330510e+08
# min      6.14000      6.170000      6.02000    6.030000  1.242183e+07
# 25%      8.09500      8.250000      7.98750    8.127500  7.357002e+07
# 50%     13.96000     14.335000     13.75500   13.925000  1.554569e+08
# 75%     19.95000     20.500000     19.46250   20.012500  3.358617e+08
# max     36.40000     37.250000     34.68000   36.150000  1.310855e+09

DataFrame提供了两种形式的排序。一种是按行列排序,即按照索引(行名)或者列名进行排序,可调用dataframe.sort_index,指定axis=0表示按索引(行名)排序,axis=1表示按列名排序,并可指定升序或者降序:

print(df.sort_index(axis=1, ascending=False).head())
#    turnoverVol   tradeDate secShortName        secID  openPrice  lowestPrice  highestPrice  closePrice
# 0    286043643  2015-01-05         平安银行  000001.XSHE      15.99        15.60         16.28       16.02
# 1    216642140  2015-01-06         平安银行  000001.XSHE      15.85        15.55         16.39       15.78
# 2    170012067  2015-01-07         平安银行  000001.XSHE      15.56        15.30         15.83       15.48
# 3    140771421  2015-01-08         平安银行  000001.XSHE      15.50        14.90         15.57       14.96
# 4    250850023  2015-01-09         平安银行  000001.XSHE      14.90        14.71         15.87       15.08

第二种排序是按值排序,可指定列名和排序方式,默认的是升序排序:

print(df.sort(columns='tradeDate').head())
df = df.sort(columns=['tradeDate', 'secID'], ascending=[False, True])
print(df.head())
#            secID   tradeDate secShortName  openPrice  highestPrice  lowestPrice  closePrice  turnoverVol
# 0    000001.XSHE  2015-01-05         平安银行      15.99         16.28        15.60       16.02    286043643
# 60   000625.XSHE  2015-01-05         长安汽车      16.40         18.07        16.32       18.07     82087982
# 140  601111.XSHG  2015-01-05         中国国航       7.98          8.62         7.98        8.62    231611758
# 20   000002.XSHE  2015-01-05          万科A      14.39         15.29        14.22       14.91    656083570
# 100  600028.XSHG  2015-01-05         中国石化       6.59          7.14         6.45        7.14   1186499645
#           secID   tradeDate secShortName  openPrice  highestPrice  lowestPrice  closePrice  turnoverVol
# 19  000001.XSHE  2015-01-30         平安银行      13.93         14.12        13.76       13.93     93011669
# 39  000002.XSHE  2015-01-30          万科A      13.09         13.49        12.80       13.12    209624706
# 59  000568.XSHE  2015-01-30         泸州老窖      19.15         19.51        19.11       19.12     14177179
# 79  000625.XSHE  2015-01-30         长安汽车      19.16         19.45        18.92       19.18     21233495
# 99  000768.XSHE  2015-01-30         中航飞机      25.38         25.65        24.28       24.60     59550293

三、数据的访问和操作

3.1 再谈数据的访问

收盘价在均值以上的数据:

print(df[df.closePrice > df.closePrice.mean()].head())
#            secID   tradeDate secShortName  openPrice  highestPrice  lowestPrice  closePrice  turnoverVol
# 59   000568.XSHE  2015-01-30         泸州老窖      19.15         19.51        19.11       19.12     14177179
# 79   000625.XSHE  2015-01-30         长安汽车      19.16         19.45        18.92       19.18     21233495
# 99   000768.XSHE  2015-01-30         中航飞机      25.38         25.65        24.28       24.60     59550293
# 139  600030.XSHG  2015-01-30         中信证券      28.50         28.72        27.78       27.86    304218245
# 58   000568.XSHE  2015-01-29         泸州老窖      19.04         19.23        19.00       19.15     12421826

isin()函数可方便地过滤DataFrame中的数据:

print(df[df['secID'].isin(['601628.XSHG', '000001.XSHE', '600030.XSHG'])].head())
#            secID   tradeDate secShortName  openPrice  highestPrice  lowestPrice  closePrice  turnoverVol
# 19   000001.XSHE  2015-01-30         平安银行      13.93         14.12        13.76       13.93     93011669
# 139  600030.XSHG  2015-01-30         中信证券      28.50         28.72        27.78       27.86    304218245
# 18   000001.XSHE  2015-01-29         平安银行      13.82         14.01        13.75       13.90    101675329
# 138  600030.XSHG  2015-01-29         中信证券      28.10         28.58        27.81       28.18    386310957
# 17   000001.XSHE  2015-01-28         平安银行      13.87         14.30        13.80       14.06    124087755

3.2 处理缺失数据

在访问数据的基础上,我们可以更改数据,例如,修改某些元素为缺失值:

df['openPrice'][df['secID'] == '000001.XSHE'] = np.nan
df['highestPrice'][df['secID'] == '601111.XSHG'] = np.nan
df['lowestPrice'][df['secID'] == '601111.XSHG'] = np.nan
df['closePrice'][df['secID'] == '000002.XSHE'] = np.nan
df['turnoverVol'][df['secID'] == '601111.XSHG'] = np.nan
print(df.head(10))
#            secID   tradeDate secShortName  openPrice  highestPrice  lowestPrice  closePrice  turnoverVol
# 19   000001.XSHE  2015-01-30         平安银行        NaN         14.12        13.76       13.93   93011669.0
# 39   000002.XSHE  2015-01-30          万科A      13.09         13.49        12.80         NaN  209624706.0
# 59   000568.XSHE  2015-01-30         泸州老窖      19.15         19.51        19.11       19.12   14177179.0
# 79   000625.XSHE  2015-01-30         长安汽车      19.16         19.45        18.92       19.18   21233495.0
# 99   000768.XSHE  2015-01-30         中航飞机      25.38         25.65        24.28       24.60   59550293.0
# 119  600028.XSHG  2015-01-30         中国石化       6.14          6.17         6.02        6.03  502445638.0
# 139  600030.XSHG  2015-01-30         中信证券      28.50         28.72        27.78       27.86  304218245.0
# 159  601111.XSHG  2015-01-30         中国国航       7.92           NaN          NaN        7.69          NaN
# 179  601390.XSHG  2015-01-30         中国中铁       8.69          8.69         8.12        8.14  352357431.0
# 199  601998.XSHG  2015-01-30         中信银行       7.10          7.14         6.92        6.95   68146718.0

dataframe.dropna()可以按行丢弃带有nan的数据;若指定how=‘all’(默认是’any’),则只在整行全部是nan时丢弃数据;若指定thresh,则表示当某行数据非缺失列数超过指定数值时才保留;要指定根据某列丢弃可以通过subset完成。

print(df.shape)
print(df.dropna(thresh=6).shape)
print(df.dropna(thresh=6).head())
print(df.dropna(subset=['closePrice']).shape)
print(df.dropna(subset=['closePrice']).head())
# (200, 8)
# (180, 8)
#           secID   tradeDate secShortName  openPrice  highestPrice  lowestPrice  closePrice  turnoverVol
# 19  000001.XSHE  2015-01-30         平安银行        NaN         14.12        13.76       13.93   93011669.0
# 39  000002.XSHE  2015-01-30          万科A      13.09         13.49        12.80         NaN  209624706.0
# 59  000568.XSHE  2015-01-30         泸州老窖      19.15         19.51        19.11       19.12   14177179.0
# 79  000625.XSHE  2015-01-30         长安汽车      19.16         19.45        18.92       19.18   21233495.0
# 99  000768.XSHE  2015-01-30         中航飞机      25.38         25.65        24.28       24.60   59550293.0
# (180, 8)
#            secID   tradeDate secShortName  openPrice  highestPrice  lowestPrice  closePrice  turnoverVol
# 19   000001.XSHE  2015-01-30         平安银行        NaN         14.12        13.76       13.93   93011669.0
# 59   000568.XSHE  2015-01-30         泸州老窖      19.15         19.51        19.11       19.12   14177179.0
# 79   000625.XSHE  2015-01-30         长安汽车      19.16         19.45        18.92       19.18   21233495.0
# 99   000768.XSHE  2015-01-30         中航飞机      25.38         25.65        24.28       24.60   59550293.0
# 119  600028.XSHG  2015-01-30         中国石化       6.14          6.17         6.02        6.03  502445638.0

有数据缺失时也未必是全部丢弃,dataframe.fillna(value=value)可以指定填补缺失值的数值

3.3 数据操作

Series和DataFrame的类函数,如mean()、sum()等,指定0按列进行,指定1按行进行:

df = raw_data[['secID', 'tradeDate', 'secShortName', 'openPrice', 'highestPrice', 
               'lowestPrice', 'closePrice', 'turnoverVol']]
print(df.mean(0))
# openPrice       1.517095e+01
# highestPrice    1.563400e+01
# lowestPrice     1.486545e+01
# closePrice      1.524275e+01
# turnoverVol     2.384811e+08
# dtype: float64

value_counts函数可以方便地统计频数:

print(df['closePrice'].value_counts().head())
# 6.58     3
# 13.12    2
# 8.58     2
# 8.12     2
# 6.60     2
# Name: closePrice, dtype: int64

Series可以调用map函数来对每个元素应用一个函数,DataFrame可以调用apply函数对每一列(行)应用一个函数,applymap对每个元素应用一个函数。

将收盘价调整到[0, 1]区间:

print(df[['closePrice']].apply(lambda x: (x-x.min())/(x.max()-x.min())).head())
#    closePrice
# 0    0.331673
# 1    0.323705
# 2    0.313745
# 3    0.296481
# 4    0.300465

使用append可以在Series后添加元素,以及在DataFrame尾部添加一行:

dat1 = df[['secID', 'tradeDate', 'closePrice']].head()
dat2 = df[['secID', 'tradeDate', 'closePrice']].iloc[2]
print(dat1)
dat = dat1.append(dat2, ignore_index=True)
print(dat)   #dat1没有增加行
#          secID   tradeDate  closePrice
# 0  000001.XSHE  2015-01-05       16.02
# 1  000001.XSHE  2015-01-06       15.78
# 2  000001.XSHE  2015-01-07       15.48
# 3  000001.XSHE  2015-01-08       14.96
# 4  000001.XSHE  2015-01-09       15.08
#          secID   tradeDate  closePrice
# 0  000001.XSHE  2015-01-05       16.02
# 1  000001.XSHE  2015-01-06       15.78
# 2  000001.XSHE  2015-01-07       15.48
# 3  000001.XSHE  2015-01-08       14.96
# 4  000001.XSHE  2015-01-09       15.08
# 5  000001.XSHE  2015-01-07       15.48

DataFrame可以像在SQL中一样进行合并,concat函数就是一种合并的方式。另外一种方式使用merge函数,需要指定依照哪些列进行合并

dat1 = df[['secID', 'tradeDate', 'closePrice']]
dat2 = df[['secID', 'tradeDate', 'turnoverVol']]
dat = dat1.merge(dat2, on=['secID', 'tradeDate'])
print(dat1.head())
print(dat2.head())
print(dat.head())
#          secID   tradeDate  closePrice
# 0  000001.XSHE  2015-01-05       16.02
# 1  000001.XSHE  2015-01-06       15.78
# 2  000001.XSHE  2015-01-07       15.48
# 3  000001.XSHE  2015-01-08       14.96
# 4  000001.XSHE  2015-01-09       15.08
#          secID   tradeDate  turnoverVol
# 0  000001.XSHE  2015-01-05    286043643
# 1  000001.XSHE  2015-01-06    216642140
# 2  000001.XSHE  2015-01-07    170012067
# 3  000001.XSHE  2015-01-08    140771421
# 4  000001.XSHE  2015-01-09    250850023
#          secID   tradeDate  closePrice  turnoverVol
# 0  000001.XSHE  2015-01-05       16.02    286043643
# 1  000001.XSHE  2015-01-06       15.78    216642140
# 2  000001.XSHE  2015-01-07       15.48    170012067
# 3  000001.XSHE  2015-01-08       14.96    140771421
# 4  000001.XSHE  2015-01-09       15.08    250850023

DataFrame另一个强大的函数是groupby,对2015年一月内十支股票的开盘价,最高价,最低价,收盘价和成交量求平均值:

df_grp = df.groupby('secID')
grp_mean = df_grp.mean()
print(grp_mean)
#              openPrice  highestPrice  lowestPrice  closePrice  turnoverVol
# secID                                                                     
# 000001.XSHE    14.6550       14.9840      14.4330     14.6650    154710615
# 000002.XSHE    13.3815       13.7530      13.0575     13.4100    277459431
# 000568.XSHE    19.7220       20.1015      19.4990     19.7935     29199107
# 000625.XSHE    19.4915       20.2275      19.1040     19.7170     42633332
# 000768.XSHE    22.4345       23.4625      21.8830     22.6905     92781199
# 600028.XSHG     6.6060        6.7885       6.4715      6.6240    531966632
# 600030.XSHG    31.1505       32.0825      30.4950     31.2325    611544509
# 601111.XSHG     8.4320        8.6520       8.2330      8.4505    104143358
# 601390.XSHG     8.4060        8.6625       8.2005      8.4100    362831455
# 601998.XSHG     7.4305        7.6260       7.2780      7.4345    177541066

如果希望取每只股票的最新数据,应该怎么操作呢?drop_duplicates可以实现这个功能,首先对数据按日期排序,再按security ID去重:

df2 = df.sort(columns=['secID', 'tradeDate'], ascending=[True, False])
print(df2.drop_duplicates(subset='secID'))
#            secID   tradeDate secShortName  openPrice  highestPrice  lowestPrice  closePrice  turnoverVol
# 19   000001.XSHE  2015-01-30         平安银行      13.93         14.12        13.76       13.93     93011669
# 39   000002.XSHE  2015-01-30          万科A      13.09         13.49        12.80       13.12    209624706
# 59   000568.XSHE  2015-01-30         泸州老窖      19.15         19.51        19.11       19.12     14177179
# 79   000625.XSHE  2015-01-30         长安汽车      19.16         19.45        18.92       19.18     21233495
# 99   000768.XSHE  2015-01-30         中航飞机      25.38         25.65        24.28       24.60     59550293
# 119  600028.XSHG  2015-01-30         中国石化       6.14          6.17         6.02        6.03    502445638
# 139  600030.XSHG  2015-01-30         中信证券      28.50         28.72        27.78       27.86    304218245
# 159  601111.XSHG  2015-01-30         中国国航       7.92          8.03         7.65        7.69     61877792
# 179  601390.XSHG  2015-01-30         中国中铁       8.69          8.69         8.12        8.14    352357431
# 199  601998.XSHG  2015-01-30         中信银行       7.10          7.14         6.92        6.95     68146718

若想要保留最老的数据,可以在降序排列后取最后一个记录,通过指定take_last=True(默认值为False,取第一条记录)

print(df2.drop_duplicates(subset='secID', take_last=True))

四、数据可视化

pandas数据直接可以绘图查看,其中set_index(‘tradeDate’)[‘closePrice’]表示将DataFrame的’tradeDate’这一列作为索引,将’closePrice’这一列作为Series的值,返回一个Series对象,随后调用plot函数绘图,更多的参数可以在matplotlib的文档中查看。

dat = df[df['secID'] == '600028.XSHG'].set_index('tradeDate')['closePrice']
dat.plot(title="Close Price of SINOPEC (600028) during Jan, 2015")

 

参考文献

  1. http://pandas.pydata.org/pandas-docs/version/0.14.1
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值