pandas DatetimeIndex indexing

最新推荐文章于 2022-09-29 15:42:02 发布

twt9628

最新推荐文章于 2022-09-29 15:42:02 发布

阅读量1.4k

点赞数

分类专栏： Pandas 文章标签： pandas python

本文链接：https://blog.csdn.net/twt9628/article/details/86658619

版权

Pandas 专栏收录该内容

2 篇文章 0 订阅

订阅专栏

文章目录

freq参数对应的缩写：Offset aliases
DatetimeIndex 与 Timestamp 的含义
DatetimeIndex 中partial slicing与resolution的问题
index为DatetimeIndex的DF/Seires，同时对index和column等加以判断并组合boolean array 的方法

freq参数对应的缩写：Offset aliases

实际上freq就有点类似于下面DatetimeIndex的resolution，表示Timestamp精确到月（M），天(D)，小时(H)，分钟（T），秒（S）等等
参考：Offset aliases

缩写	对应的含义
B	business day frequency
C	custom business day frequency 可自行定制的工作日（即挑出某几天）
D	calendar day frequency 日历天，即按顺序排下来的
W	weekly frequency 周
M	month end frequency 月份！！注意别和分钟弄混了
SM	semi-month end frequency (15th and end of month)
BM	business month end frequency
CBM	custom business month end frequency
MS	month start frequency
SMS	semi-month start frequency (1st and 15th)
BMS	business month start frequency
CBMS	custom business month start frequency
Q	quarter end frequency
BQ	business quarter end frequency
QS	quarter start frequency
BQS	business quarter start frequency
A, Y	year end frequency
BA, BY	business year end frequency
AS, YS	year start frequency
BAS, BYS	business year start frequency
BH	business hour frequency
H	hourly frequency 小时！！
T, min	minutely frequency 分钟！！别和月份M弄混了
S	secondly frequency 秒！！
L,	ms milliseconds
U,	us microseconds
N	nanoseconds

DatetimeIndex 与 Timestamp 的含义

在这里插入图片描述
DatetimeIndex 实际是对Timestamp的索引，Timestamp虽然都是有有hour, minute, second，但是一旦组合起来，变成了Timestamp，那就不一定了。

参考 exact-indexing

As discussed in previous section, indexing a DatetimeIndex with a partial string depends on the “accuracy” of the period, in other words how specific the interval is in relation to the resolution of the index. In contrast, indexing with Timestamp or datetime objects is exact, because the objects have exact meaning. These also follow the semantics of including both endpoints.

These Timestamp and datetime objects have exact hours, minutes, and seconds

意思是，若Dataframe或者Series的index是DatetimeIndex，则partial string selection可以做到的的最小分辨率（resolution）即最小的区间取决于这个DatetimeIndex的分辨率。具体的resolution可通过

DatetimeIndex.resolution

进行查看。

相反，Timestamp or datetime objects都一定是精确到了second的，即即使是没有指定的情况，下，他们也是包含了精确的hours、minutes 和 seconds

以Timestamp or datetime objects作为index的Dataframe或者Series，是一定能够通过形如’%Y-%m-%d %h:%m:%s’的index进行exact indexing。而DatetimeIndex则没办法，超过了原有的resolution则会出现下一小结中的错误: DatetimeIndex 中partial slicing与resolution的问题

DatetimeIndex 中partial slicing与resolution的问题

参考 slice-vs-exact-match

Warning However, if the string is treated as an exact match, the selection in DataFrame’s [] will be column-wise and not row-wise, see Indexing Basics. For example dft_minute[‘2011-12-31 23:59’] will raise KeyError as ‘2012-12-31 23:59’ has the same resolution as the index and there is no column with such name:
To always have unambiguous selection, whether the row is treated as a slice or a single selection, use .loc.

即DF的[]操作，当string的resolution和DatetimeIndex的resolution相同时，会判断为exact match，因此会优先会判断为column-wise的，i.e.，优先匹配columns names。但是column里面没有这样的name，因此会报错 Keyerror

可以通过loc来避免这个问题。

也可以通过用resolution更小的string如’2011-12-31 23’ 来避免

Example:

In[19]: corr_matrix_hour.head()
Out[19]: 
co2                      pm10     pm1d0     pm2d5
datetime                                         
2017-09-06 02:00:00  0.524205  0.515984  0.535088
2017-09-06 03:00:00  0.271715  0.371295  0.287175
2017-09-06 04:00:00 -0.006424 -0.051545 -0.040707
2017-09-06 05:00:00 -0.146557 -0.329352 -0.205284
2017-09-06 06:00:00  0.190316  0.130562  0.206690
In[20]: corr_matrix_hour.index.resolution
Out[20]: 'hour'

可以看到是精确到hour

直接用相同或者更精确的resolution会报错

In[20]: corr_matrix_hour.index.resolution
Out[20]: 'hour'
In[21]: corr_matrix_hour['2017-09-06 02']
Traceback (most recent call last):
  File "C:\Users\walter\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 3078, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '2017-09-06 02'

但是通过loc则可以解决。或者用resolution更小的

In[23]: corr_matrix_hour.loc['2017-09-06 02:00:00']
Out[23]: 
co2
pm10     0.524205
pm1d0    0.515984
pm2d5    0.535088
Name: 2017-09-06 02:00:00, dtype: float64

In[24]: corr_matrix_hour.loc['2017-09-06']
Out[24]: 
co2                      pm10     pm1d0     pm2d5
datetime                                         
2017-09-06 02:00:00  0.524205  0.515984  0.535088
2017-09-06 03:00:00  0.271715  0.371295  0.287175
2017-09-06 04:00:00 -0.006424 -0.051545 -0.040707
2017-09-06 05:00:00 -0.146557 -0.329352 -0.205284
2017-09-06 06:00:00  0.190316  0.130562  0.206690
2017-09-06 07:00:00  0.320764  0.433057  0.316239
2017-09-06 08:00:00  0.478715  0.427422  0.448940
2017-09-06 09:00:00  0.104344  0.176117  0.172312
2017-09-06 10:00:00  0.681273  0.747524  0.695242
2017-09-06 11:00:00  0.578912  0.695337  0.597895
2017-09-06 12:00:00  0.268466  0.306412  0.274614
2017-09-06 13:00:00  0.469630  0.463735  0.509890
2017-09-06 14:00:00  0.432735  0.441847  0.397852
2017-09-06 15:00:00  0.357006  0.368656  0.457821
2017-09-06 16:00:00  0.356666  0.288574  0.367849
2017-09-06 17:00:00  0.200700  0.146319  0.142676
2017-09-06 18:00:00  0.137906  0.227373  0.203481
2017-09-06 19:00:00  0.092497  0.201569  0.096877
2017-09-06 20:00:00 -0.141685 -0.094120 -0.066167
2017-09-06 21:00:00  0.188640  0.183061  0.201465
2017-09-06 22:00:00  0.040039  0.023205  0.051618
2017-09-06 23:00:00 -0.042926 -0.000429 -0.047177

index为DatetimeIndex的DF/Seires，同时对index和column等加以判断并组合boolean array 的方法

主要是利用index.isin()以及pd.date_range()

其中特别要注意的是，pd.date_range(start=None, end=None, periods=None, freq=None)，中的这四个参数，必须且只能指定其中的三个，空出一个，从而产生出对应DatetimeIndex。

Example:

In[68]: co2.head(10)
Out[68]: 
                     co2  pm10  pm1d0  pm2d5 Location
datetime                                             
2017-09-06 02:19:58  594    24     16     24  C215012
2017-09-06 02:19:58  579    24     14     24  C215019
2017-09-06 02:19:58  615    25     18     23  C215013
2017-09-06 02:19:58  626    25     16     25  C215016
2017-09-06 02:19:58  581    27     17     26  C215017
2017-09-06 02:19:58  564    26     16     23  C215014
2017-09-06 02:19:58  601    25     17     24  C215018
2017-09-06 02:19:58  752    35     19     29  C215011
2017-09-06 02:19:59  615    25     18     23  C215013
2017-09-06 02:19:59  626    26     17     26  C215016

In[69]: pd.date_range('2018-01-14 03:00', '2018-01-14 04:00', freq='S') 
#2018-01-04 3am到4am，抽出其中的每一秒组成一个DatetimeIndex
Out[69]: 
DatetimeIndex(['2018-01-14 03:00:00', '2018-01-14 03:00:01',
               '2018-01-14 03:00:02', '2018-01-14 03:00:03',
               '2018-01-14 03:00:04', '2018-01-14 03:00:05',
               '2018-01-14 03:00:06', '2018-01-14 03:00:07',
               '2018-01-14 03:00:08', '2018-01-14 03:00:09',
               ...
               '2018-01-14 03:59:51', '2018-01-14 03:59:52',
               '2018-01-14 03:59:53', '2018-01-14 03:59:54',
               '2018-01-14 03:59:55', '2018-01-14 03:59:56',
               '2018-01-14 03:59:57', '2018-01-14 03:59:58',
               '2018-01-14 03:59:59', '2018-01-14 04:00:00'],
              dtype='datetime64[ns]', length=3601, freq='S')

以下代码就是Location列为’C215012’，时间在2018-01-14 3AM到4AM之间的数据抽取出来

In[70]: datetime_boolean = co2.index.isin(pd.date_range('2018-01-14 03:00', '2018-01-14 04:00', freq='S'))
   ...: co2[datetime_boolean & (co2['Location'] == 'C215012')]
   ...: 


Out[70]: 
                     co2  pm10  pm1d0  pm2d5 Location
datetime                                             
2018-01-14 03:40:56  505    22     13     20  C215012
2018-01-14 03:40:58  505    22     14     21  C215012
2018-01-14 03:40:59  505    22     14     21  C215012
2018-01-14 03:41:00  505    22     14     21  C215012
2018-01-14 03:41:01  505    21     14     21  C215012
2018-01-14 03:41:02  505    21     14     21  C215012
2018-01-14 03:41:03  505    21     14     21  C215012
2018-01-14 03:41:04  505    21     13     20  C215012
2018-01-14 03:41:05  505    21     13     20  C215012
2018-01-14 03:41:06  505    21     13     20  C215012
2018-01-14 03:41:07  505    20     14     20  C215012
2018-01-14 03:41:08  505    20     14     20  C215012