Python Data Processing: Reading, Writing, and Querying HDF5 with pandas

This article introduces the use of the hierarchical data format with Python's pandas library: the read/write API, the characteristics of the fixed and table formats, and how hierarchical keys, mixed-type data, and MultiIndex DataFrames are stored. The focus is on query operations, such as querying tables, querying a MultiIndex, and querying via data columns, as well as advanced queries and multiple-table queries.

Preface

HDF (Hierarchical Data Format) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data.

HDF5 lets you store large amounts of numerical data while allowing easy and fast access to it. Thousands of datasets can be stored in a single file, categorized and tagged however you like.

Usage

HDFStore is a dict-like object which reads and writes pandas objects in the high-performance HDF5 format using the PyTables library.

In [345]: store = pd.HDFStore("store.h5")

In [346]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

Objects can be written to the file, just like adding key-value pairs to a dict:

In [347]: index = pd.date_range("1/1/2000", periods=8)

In [348]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

In [349]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

# store.put('s', s) is an equivalent method
In [350]: store["s"] = s

In [351]: store["df"] = df

In [352]: store
Out[352]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

In a current or later Python session, you can retrieve stored objects:

# store.get('df') is an equivalent method
In [353]: store["df"]
Out[353]: 
                   A         B         C
2000-01-01  1.334065  0.521036  0.930384
2000-01-02 -1.613932  1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04  0.632369 -1.249657  0.975593
2000-01-05  1.060617 -0.143682  0.218423
2000-01-06  3.050329  1.317933 -0.963725
2000-01-07 -0.539452 -0.771133  0.023751
2000-01-08  0.649464 -1.736427  0.197288

# dotted (attribute) access provides get as well
In [354]: store.df
Out[354]: 
                   A         B         C
2000-01-01  1.334065  0.521036  0.930384
2000-01-02 -1.613932  1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04  0.632369 -1.249657  0.975593
2000-01-05  1.060617 -0.143682  0.218423
2000-01-06  3.050329  1.317933 -0.963725
2000-01-07 -0.539452 -0.771133  0.023751
2000-01-08  0.649464 -1.736427  0.197288

Deleting the object specified by a key:

# store.remove('df') is an equivalent method
In [355]: del store["df"]

In [356]: store
Out[356]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

Closing a store, and using a context manager:

In [357]: store.close()

In [358]: store
Out[358]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

In [359]: store.is_open
Out[359]: False

# Working with, and automatically closing the store using a context manager
In [360]: with pd.HDFStore("store.h5") as store:
   .....:     store.keys()
   .....: 

1 Read/write API

HDFStore supports a top-level API, with read_hdf for reading and to_hdf for writing, which works similarly to read_csv and to_csv:

In [361]: df_tl = pd.DataFrame({"A": list(range(5)), "B": list(range(5))})

In [362]: df_tl.to_hdf("store_tl.h5", "table", append=True)

In [363]: pd.read_hdf("store_tl.h5", "table", where=["index>2"])
Out[363]: 
   A  B
3  3  3
4  4  4

HDFStore by default does not drop rows that are all missing; this behavior can be changed by setting dropna=True:

In [364]: df_with_missing = pd.DataFrame(
   .....:     {
   .....:         "col1": [0, np.nan, 2],
   .....:         "col2": [1, np.nan, np.nan],
   .....:     }
   .....: )
   .....: 

In [365]: df_with_missing
Out[365]: 
   col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN

In [366]: df_with_missing.to_hdf("file.h5", "df_with_missing", format="table", mode="w")

In [367]: pd.read_hdf("file.h5", "df_with_missing")
Out[367]: 
   col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN

In [368]: df_with_missing.to_hdf(
   .....:     "file.h5", "df_with_missing", format="table", mode="w", dropna=True
   .....: )
   .....: 

In [369]: pd.read_hdf("file.h5", "df_with_missing")
Out[369]: 
   col1  col2
0   0.0   1.0
2   2.0   NaN

2 fixed format

fixed is the default format for put and to_hdf. It supports fast reading and writing, but the stored data can be neither appended to nor searched.

It is specified with format='fixed' or format='f'.
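
These trade-offs can be seen in a minimal sketch; the file name is just an illustration, and appending to or querying a fixed-format dataset raises an error:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 2), columns=["A", "B"])

df.to_hdf("store_fixed.h5", key="df")  # fixed is the default format
pd.read_hdf("store_fixed.h5", "df")    # reading the object back in full works

# Neither of the following works on a fixed-format dataset:
# df.to_hdf("store_fixed.h5", key="df", append=True)    # raises an error
# pd.read_hdf("store_fixed.h5", "df", where="index>2")  # raises an error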

3 table format

HDFStore supports another PyTables format on disk, the table format.

Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions.

In addition, delete and query type operations are supported. This format is specified by passing format='table' or format='t' to append, put, or to_hdf.

This format can also be set as a global option via pd.set_option('io.hdf.default_format', 'table'), making table the default storage format for put/append/to_hdf.
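
A brief sketch of the global option; it only needs to be set once per session:

import pandas as pd

pd.set_option("io.hdf.default_format", "table")

# put/to_hdf now create table-format (appendable, queryable) datasets
# even without an explicit format= argument.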

In [370]: store = pd.HDFStore("store.h5")

In [371]: df1 = df[0:4]

In [372]: df2 = df[4:]

# append data (creates a table automatically)
In [373]: store.append("df", df1)

In [374]: store.append("df", df2)

In [375]: store
Out[375]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

# select the entire object
In [376]: store.select("df")
Out[376]: 
                   A         B         C
2000-01-01  1.334065  0.521036  0.930384
2000-01-02 -1.613932  1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04  0.632369 -1.249657  0.975593
2000-01-05  1.060617 -0.143682  0.218423
2000-01-06  3.050329  1.317933 -0.963725
2000-01-07 -0.539452 -0.771133  0.023751
2000-01-08  0.649464 -1.736427  0.197288

# the type of stored data
In [377]: store.root.df._v_attrs.pandas_type
Out[377]: 'frame_table'

4 Hierarchical keys

Keys to a store can be specified as strings in a path-name-like, hierarchical format. This generates a hierarchy of sub-stores (or Groups in PyTables parlance).

Keys can be specified without the leading '/' and are always absolute (e.g. 'foo' refers to '/foo').

Removal operations remove the sub-store and everything below it, so be careful.

In [378]: store.put("foo/bar/bah", df)

In [379]: store.append("food/orange", df)

In [380]: store.append("food/apple", df)

In [381]: store
Out[381]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

# a list of keys are returned
In [382]: store.keys()
Out[382]: ['/df', '/food/apple', '/food/orange', '/foo/bar/bah']

# remove all nodes under this level
In [383]: store.remove("food")

In [384]: store
Out[384]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

You can walk through the group hierarchy using the walk method:

In [385]: for (path, subgroups, subkeys) in store.walk():
   .....:     for subgroup in subgroups:
   .....:         print("GROUP: {}/{}".format(path, subgroup))
   .....:     for subkey in subkeys:
   .....:         key = "/".join([path, subkey])
   .....:         print("KEY: {}".format(key))
   .....:         print(store.get(key))
   .....: 
GROUP: /foo
KEY: /df
                   A         B         C
2000-01-01  1.334065  0.521036  0.930384
2000-01-02 -1.613932  1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04  0.632369 -1.249657  0.975593
2000-01-05  1.060617 -0.143682  0.218423
2000-01-06  3.050329  1.317933 -0.963725
2000-01-07 -0.539452 -0.771133  0.023751
2000-01-08  0.649464 -1.736427  0.197288
GROUP: /foo/bar
KEY: /foo/bar/bah
                   A         B         C
2000-01-01  1.334065  0.521036  0.930384
2000-01-02 -1.613932  1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04  0.632369 -1.249657  0.975593
2000-01-05  1.060617 -0.143682  0.218423
2000-01-06  3.050329  1.317933 -0.963725
2000-01-07 -0.539452 -0.771133  0.023751
2000-01-08  0.649464 -1.736427  0.197288

Hierarchical keys cannot be retrieved with dotted (attribute) access, as described above for items stored under the root node:

In [8]: store.foo.bar.bah
AttributeError: 'HDFStore' object has no attribute 'foo'

# you can directly access the actual PyTables node but using the root node
In [9]: store.root.foo.bar.bah
Out[9]:
/foo/bar/bah (Group) ''
  children := ['block0_items' (Array), 'block0_values' (Array), 'axis0' (Array), 'axis1' (Array)]

But they can be accessed with an explicit key string:

In [386]: store["foo/bar/bah"]
Out[386]: 
                   A         B         C
2000-01-01  1.334065  0.521036  0.930384
2000-01-02 -1.613932  1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04  0.632369 -1.249657  0.975593
2000-01-05  1.060617 -0.143682  0.218423
2000-01-06  3.050329  1.317933 -0.963725
2000-01-07 -0.539452 -0.771133  0.023751
2000-01-08  0.649464 -1.736427  0.197288

5 Storage types

5.1 Storing mixed types

Mixed-type values can also be stored. String columns are stored with a fixed width, set to the length of the longest string in the data; trying to append a longer string later will raise an exception.

Setting min_itemsize={"values": size} specifies a minimum width for string columns. Storing floats, strings, ints, bools and datetime64 is also supported.

For string columns, nan_rep='nan' changes the default representation of NaN on disk:

In [387]: df_mixed = pd.DataFrame(
   .....:     {
   .....:         "A": np.random.randn(8),
   .....:         "B": np.random.randn(8),
   .....:         "C": np.array(np.random.randn(8), dtype="float32"),
   .....:         "string": "string",
   .....:         "int": 1,
   .....:         "bool": True,
   .....:         "datetime64": pd.Timestamp("20010102"),
   .....:     },
   .....:     index=list(range(8)),
   .....: )
   .....: 

In [388]: df_mixed.loc[df_mixed.index[3:5], ["A", "B", "string", "datetime64"]] = np.nan

In [389]: store.append("df_mixed", df_mixed, min_itemsize={"values": 50})

In [390]: df_mixed1 = store.select("df_mixed")

In [391]: df_mixed1
Out[391]: 
          A         B         C  string  int  bool datetime64
0 -0.116008  0.743946 -0.398501  string    1  True 2001-01-02
1  0.592375 -0.533097 -0.677311  string    1  True 2001-01-02
2  0.476481 -0.140850 -0.874991  string    1  True 2001-01-02
3       NaN       NaN -1.167564     NaN    1  True        NaT
4       NaN       NaN -0.593353     NaN    1  True        NaT
5  0.852727  0.463819  0.146262  string    1  True 2001-01-02
6 -1.177365  0.793644 -0.131959  string    1  True 2001-01-02
7  1.236988  0.221252  0.089012  string    1  True 2001-01-02

In [392]: df_mixed1.dtypes.value_counts()
Out[392]: 
float64           2
int64             1
float32           1
bool              1
datetime64[ns]    1
object            1
dtype: int64

# we have provided a minimum string column size
In [393]: store.root.df_mixed.table
Out[393]: 
/df_mixed/table (Table(8,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
  "values_block_1": Float32Col(shape=(1,), dflt=0.0, pos=2),
  "values_block_2": Int64Col(shape=(1,), dflt=0, pos=3),
  "values_block_3": Int64Col(shape=(1,), dflt=0, pos=4),
  "values_block_4": BoolCol(shape=(1,), dflt=False, pos=5),
  "values_block_5": StringCol(itemsize=50, shape=(1,), dflt=b'', pos=6)}
  byteorder := 'little'
  chunkshape := (689,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

6 Storing a MultiIndex DataFrame

In [394]: index = pd.MultiIndex(
   .....:     levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
   .....:     codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
   .....:     names=["foo", "bar"],
   .....: )
   .....: 

In [395]: df_mi = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])

In [396]: df_mi
Out[396]: 
                  A         B         C
foo bar                                
foo one    0.667450  0.169405 -1.358046
    two   -0.105563  0.492195  0.076693
    three  0.213685 -0.285283 -1.210529
bar one   -1.408386  0.941577 -0.342447
    two    0.222031  0.052607  2.093214
baz two    1.064908  1.778161 -0.913867
    three -0.030004 -0.399846 -1.234765
qux one    0.081323 -0.268494  0.168016
    two   -0.898283 -0.218499  1.408028
    three -1.267828 -0.689263  0.520995

In [397]: store.append("df_mi", df_mi)

In [398]: store.select("df_mi")
Out[398]: 
                  A         B         C
foo bar                                
foo one    0.667450  0.169405 -1.358046
    two   -0.105563  0.492195  0.076693
    three  0.213685 -0.285283 -1.210529
bar one   -1.408386  0.941577 -0.342447
    two    0.222031  0.052607  2.093214
baz two    1.064908  1.778161 -0.913867
    three -0.030004 -0.399846 -1.234765
qux one    0.081323 -0.268494  0.168016
    two   -0.898283 -0.218499  1.408028
    three -1.267828 -0.689263  0.520995

# the levels are automatically included as data columns
In [399]: store.select("df_mi", "foo=bar")
Out[399]: 
                A         B         C
foo bar                              
bar one -1.408386  0.941577 -0.342447
    two  0.222031  0.052607  2.093214

Data types

HDFStore maps object dtypes to PyTables underlying dtypes. The following types are supported:

Type                                                 Represents missing values
floating: float64, float32, float16                  np.nan
integer: int64, int32, int8, uint64, uint32, uint8
boolean
datetime64[ns]                                       NaT
timedelta64[ns]                                      NaT
categorical: see the section below
object: strings                                      np.nan

unicode columns are not supported.

Categorical data

You can write data that contains category dtypes to an HDFStore. Queries work the same as if it were an object array,

but the category-dtyped data is stored in a more efficient manner:

In [473]: dfcat = pd.DataFrame(
   .....:     {"A": pd.Series(list("aabbcdba")).astype("category"), "B": np.random.randn(8)}
   .....: )
   .....: 

In [474]: dfcat
Out[474]: 
   A         B
0  a  0.477849
1  a  0.283128
2  b -2.045700
3  b -0.338206
4  c -0.423113
5  d  2.314361
6  b -0.033100
7  a -0.965461

In [475]: dfcat.dtypes
Out[475]: 
A    category
B     float64
dtype: object

In [476]: cstore = pd.HDFStore("cats.h5", mode="w")

In [477]: cstore.append("dfcat", dfcat, format="table", data_columns=["A"])

In [478]: result = cstore.select("dfcat", where="A in ['b', 'c']")

In [479]: result
Out[479]: 
   A         B
2  b -2.045700
3  b -0.338206
4  c -0.423113
6  b -0.033100

In [480]: result.dtypes
Out[480]: 
A    category
B     float64
dtype: object

String columns

The underlying implementation of HDFStore uses a fixed column width (itemsize) for string columns. The itemsize is computed as the length of the longest string in the column, at the time of the first call to append.

If a subsequent append passes a string longer than the established itemsize for that column, an exception is raised.

Passing min_itemsize when the table is first created specifies a minimum width for particular string columns.

min_itemsize can be an integer, or a dict mapping column names to integers, to set the minimum width per string column.

Passing a min_itemsize dict will cause all passed columns to be created as data_columns automatically.

In [481]: dfs = pd.DataFrame({"A": "foo", "B": "bar"}, index=list(range(5)))

In [482]: dfs
Out[482]: 
     A    B
0  foo  bar
1  foo  bar
2  foo  bar
3  foo  bar
4  foo  bar

# A and B have a size of 30
In [483]: store.append("dfs", dfs, min_itemsize=30)

In [484]: store.get_storer("dfs").table
Out[484]: 
/dfs/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=30, shape=(2,), dflt=b'', pos=1)}
  byteorder := 'little'
  chunkshape := (963,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

# A is created as a data_column with a size of 30
# B is size is calculated
In [485]: store.append("dfs2", dfs, min_itemsize={"A": 30})

In [486]: store.get_storer("dfs2").table
Out[486]: 
/dfs2/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=3, shape=(1,), dflt=b'', pos=1),
  "A": StringCol(itemsize=30, shape=(), dflt=b'', pos=2)}
  byteorder := 'little'
  chunkshape := (1598,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False}

String columns will serialize np.nan (a missing value) with the string given by the nan_rep parameter, which defaults to 'nan'.

This default could inadvertently turn an actual 'nan' string value into a missing value:

In [487]: dfss = pd.DataFrame({"A": ["foo", "bar", "nan"]})

In [488]: dfss
Out[488]: 
     A
0  foo
1  bar
2  nan

In [489]: store.append("dfss", dfss)

In [490]: store.select("dfss")
Out[490]: 
     A
0  foo
1  bar
2  NaN

# here you need to specify a different nan rep
In [491]: store.append("dfss2", dfss, nan_rep="_nan_")

In [492]: store.select("dfss2")
Out[492]: 
     A
0  foo
1  bar
2  nan

7 Querying

7.1 Querying a table

select and delete operations take an optional criterion that can be specified to select/delete only a subset of the data.

This allows one to retrieve a small portion of data from a very large on-disk table.

Comparison operators:

=, ==, !=, >, >=, <, <=

Boolean expressions:

  • |: or
  • &: and
  • (): for grouping

Notes:

  • = will be automatically expanded to the comparison operator ==
  • ~ is the not operator, but can only be used in very limited circumstances
  • If a list or tuple of expressions is passed, they will be combined via & (see the sketch after the examples below)

The following are valid expressions:

'index >= date'
"columns = ['A', 'D']"
"columns in ['A', 'D']"
'columns = A'
'columns == A'
"~(columns = ['A', 'B'])"
'index > df.index[3] & string = "bar"'
'(index > df.index[3] & index <= df.index[6]) | string = "bar"'
"ts >= Timestamp('2012-02-01')"
"major_axis>=20130101"

The indexers are on the left-hand side of the sub-expression:

  • columns, major_axis, ts

The right-hand side of the sub-expression can be:

  • functions that will be evaluated, e.g. Timestamp('2012-02-01')
  • strings, e.g. "bar"
  • date-likes, e.g. 20130101, or "20130101"
  • lists, e.g. "['A', 'B']"
  • variables defined in the local namespace, e.g. date

Note

Do not build the query expression via string interpolation; instead, assign the string of interest to a variable and reference that variable directly in the expression. For example, this works:

string = "HolyMoly'"
store.select("df", "index == string")

whereas if it is replaced with

string = "HolyMoly'"
store.select('df', f'index == {string}')

a SyntaxError will be raised, because the string variable contains a single quote inside the double quotes.

If you must interpolate, use the '%r' format specifier:

store.select("df", "index == %r" % string)

Suppose we have the following data:

In [400]: dfq = pd.DataFrame(
   .....:     np.random.randn(10, 4),
   .....:     columns=list("ABCD"),
   .....:     index=pd.date_range("20130101", periods=10),
   .....: )
   .....: 

In [401]: store.append("dfq", dfq, format="table", data_columns=True)

Use boolean expressions, with inline function evaluation:

In [402]: store.select("dfq", "index>pd.Timestamp('20130104') & columns=['A', 'B']")
Out[402]: 
                   A         B
2013-01-05 -1.083889  0.811865
2013-01-06 -0.402227  1.618922
2013-01-07  0.948196  0.183573
2013-01-08 -1.043530 -0.708145
2013-01-09  0.813949  1.508891
2013-01-10  1.176488 -1.246093

Use inline column references:

In [403]: store.select("dfq", where="A>0 or C>0")
Out[403]: 
                   A         B         C         D
2013-01-01  0.620028  0.159416 -0.263043 -0.639244
2013-01-04 -0.536722  1.005707  0.296917  0.139796
2013-01-05 -1.083889  0.811865  1.648435 -0.164377
2013-01-07  0.948196  0.183573  0.145277  0.308146
2013-01-08 -1.043530 -0.708145  1.430905 -0.850136
2013-01-09  0.813949  1.508891 -1.556154  0.187597
2013-01-10  1.176488 -1.246093 -0.002726 -0.444249

The columns keyword can be supplied to select a list of columns to be returned; this is equivalent to passing 'columns=list_of_columns_to_filter':

In [404]: store.select("df", "columns=['A', 'B']")
Out[404]: 
                   A         B
2000-01-01  1.334065  0.521036
2000-01-02 -1.613932  1.088104
2000-01-03 -0.585314 -0.275038
2000-01-04  0.632369 -1.249657
2000-01-05  1.060617 -0.143682
2000-01-06  3.050329  1.317933
2000-01-07 -0.539452 -0.771133
2000-01-08  0.649464 -1.736427

start and stop parameters can be specified to limit the rows returned; these are in terms of the total number of rows in the table:

>>> store.select("dfq", "columns=['A', 'B']", start=3, stop=5)
                   A         B
2013-01-04 -0.483155  1.143564
2013-01-05  0.218290 -1.391789

7.2 Querying timedelta64[ns]

You can store and query using the timedelta64[ns] type.

Terms can be specified in the format <float>(<unit>), where float may be signed (and fractional), and unit can be D, s, ms, us, or ns.

Here is an example:

In [405]: from datetime import timedelta

In [406]: dftd = pd.DataFrame(
   .....:     {
   .....:         "A": pd.Timestamp("20130101"),
   .....:         "B": [
   .....:             pd.Timestamp("20130101") + timedelta(days=i, seconds=10)
   .....:             for i in range(10)
   .....:         ],
   .....:     }
   .....: )
   .....: 

In [407]: dftd["C"] = dftd["A"] - dftd["B"]

In [408]: dftd
Out[408]: 
           A                   B                  C
0 2013-01-01 2013-01-01 00:00:10  -1 days +23:59:50
1 2013-01-01 2013-01-02 00:00:10  -2 days +23:59:50
2 2013-01-01 2013-01-03 00:00:10  -3 days +23:59:50
3 2013-01-01 2013-01-04 00:00:10  -4 days +23:59:50
4 2013-01-01 2013-01-05 00:00:10  -5 days +23:59:50
5 2013-01-01 2013-01-06 00:00:10  -6 days +23:59:50
6 2013-01-01 2013-01-07 00:00:10  -7 days +23:59:50
7 2013-01-01 2013-01-08 00:00:10  -8 days +23:59:50
8 2013-01-01 2013-01-09 00:00:10  -9 days +23:59:50
9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50

In [409]: store.append("dftd", dftd, data_columns=True)

In [410]: store.select("dftd", "C<'-3.5D'")
Out[410]: 
           A                   B                  C
4 2013-01-01 2013-01-05 00:00:10  -5 days +23:59:50
5 2013-01-01 2013-01-06 00:00:10  -6 days +23:59:50
6 2013-01-01 2013-01-07 00:00:10  -7 days +23:59:50
7 2013-01-01 2013-01-08 00:00:10  -8 days +23:59:50
8 2013-01-01 2013-01-09 00:00:10  -9 days +23:59:50
9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50

7.3 Querying a MultiIndex

Selecting from a MultiIndex can be achieved by using the name of the level:

In [411]: df_mi.index.names
Out[411]: FrozenList(['foo', 'bar'])

In [412]: store.select("df_mi", "foo=baz and bar=two")
Out[412]: 
                A         B         C
foo bar                              
baz two  1.064908  1.778161 -0.913867

If the MultiIndex level names are None, the levels are accessible through the default names level_n, where n is the level of the MultiIndex you want to select from:

In [413]: index = pd.MultiIndex(
   .....:     levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
   .....:     codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
   .....: )
   .....: 

In [414]: df_mi_2 = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])

In [415]: df_mi_2
Out[415]: 
                  A         B         C
foo one    0.856838  1.491776  0.001283
    two    0.701816 -1.097917  0.102588
    three  0.661740  0.443531  0.559313
bar one   -0.459055 -1.222598 -0.455304
    two   -0.781163  0.826204 -0.530057
baz two    0.296135  1.366810  1.073372
    three -0.994957  0.755314  2.119746
qux one   -2.628174 -0.089460 -0.133636
    two    0.337920 -0.634027  0.421107
    three  0.604303  1.053434  1.109090

In [416]: store.append("df_mi_2", df_mi_2)

# the levels are automatically included as data columns with keyword level_n
In [417]: store.select("df_mi_2", "level_0=foo and level_1=two")
Out[417]: 
                A         B         C
foo two  0.701816 -1.097917  0.102588

7.4 Indexing

You can create/modify an index for a table with create_table_index after data is already in the table.

Creating a table index is highly encouraged: it can speed up your queries a great deal when a select is performed.

Note

Indexes are automatically created on the indexables and on any data columns you specify. This behavior can be turned off by passing index=False to append.

# we have automagically already created an index (in the first section)
In [418]: i = store.root.df.table.cols.index.index

In [419]: i.optlevel, i.kind
Out[419]: (6, 'medium')

# change an index by passing new parameters
In [420]: store.create_table_index("df", optlevel=9, kind="full")

In [421]: i = store.root.df.table.cols.index.index

In [422]: i.optlevel, i.kind
Out[422]: (9, 'full')

Oftentimes when appending large amounts of data to a store, it is useful to turn off index creation for each append, then recreate the index at the end:

In [423]: df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))

In [424]: df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))

In [425]: st = pd.HDFStore("appends.h5", mode="w")

In [426]: st.append("df", df_1, data_columns=["B"], index=False)

In [427]: st.append("df", df_2, data_columns=["B"], index=False)

In [428]: st.get_storer("df").table
Out[428]: 
/df/table (Table(20,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)

Then create the index when finished appending:

In [429]: st.create_table_index("df", columns=["B"], optlevel=9, kind="full")

In [430]: st.get_storer("df").table
Out[430]: 
/df/table (Table(20,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "B": Index(9, full, shuffle, zlib(1)).is_csi=True}

In [431]: st.close()

7.5 Query via data columns

Sometimes you want to run queries on only certain columns; you can specify data_columns=True to force all columns to be data_columns:

In [432]: df_dc = df.copy()

In [433]: df_dc["string"] = "foo"

In [434]: df_dc.loc[df_dc.index[4:6], "string"] = np.nan

In [435]: df_dc.loc[df_dc.index[7:9], "string"] = "bar"

In [436]: df_dc["string2"] = "cool"

In [437]: df_dc.loc[df_dc.index[1:3], ["B", "C"]] = 1.0

In [438]: df_dc
Out[438]: 
                   A         B         C string string2
2000-01-01  1.334065  0.521036  0.930384    foo    cool
2000-01-02 -1.613932  1.000000  1.000000    foo    cool
2000-01-03 -0.585314  1.000000  1.000000    foo    cool
2000-01-04  0.632369 -1.249657  0.975593    foo    cool
2000-01-05  1.060617 -0.143682  0.218423    NaN    cool
2000-01-06  3.050329  1.317933 -0.963725    NaN    cool
2000-01-07 -0.539452 -0.771133  0.023751    foo    cool
2000-01-08  0.649464 -1.736427  0.197288    bar    cool

# on-disk operations
In [439]: store.append("df_dc", df_dc, data_columns=["B", "C", "string", "string2"])

In [440]: store.select("df_dc", where="B > 0")
Out[440]: 
                   A         B         C string string2
2000-01-01  1.334065  0.521036  0.930384    foo    cool
2000-01-02 -1.613932  1.000000  1.000000    foo    cool
2000-01-03 -0.585314  1.000000  1.000000    foo    cool
2000-01-06  3.050329  1.317933 -0.963725    NaN    cool

# getting creative
In [441]: store.select("df_dc", "B > 0 & C > 0 & string == foo")
Out[441]: 
                   A         B         C string string2
2000-01-01  1.334065  0.521036  0.930384    foo    cool
2000-01-02 -1.613932  1.000000  1.000000    foo    cool
2000-01-03 -0.585314  1.000000  1.000000    foo    cool

# this is in-memory version of this type of selection
In [442]: df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == "foo")]
Out[442]: 
                   A         B         C string string2
2000-01-01  1.334065  0.521036  0.930384    foo    cool
2000-01-02 -1.613932  1.000000  1.000000    foo    cool
2000-01-03 -0.585314  1.000000  1.000000    foo    cool

# we have automagically created this index and the B/C/string/string2
# columns are stored separately as ``PyTables`` columns
In [443]: store.root.df_dc.table
Out[443]: 
/df_dc/table (Table(8,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2),
  "C": Float64Col(shape=(), dflt=0.0, pos=3),
  "string": StringCol(itemsize=3, shape=(), dflt=b'', pos=4),
  "string2": StringCol(itemsize=4, shape=(), dflt=b'', pos=5)}
  byteorder := 'little'
  chunkshape := (1680,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "B": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "C": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "string": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "string2": Index(6, medium, shuffle, zlib(1)).is_csi=False}

7.6 Iterator

You can pass iterator=True or chunksize=number_in_a_chunk to select and select_as_multiple to return an iterator over the results. The default is 50,000 rows per chunk:

In [444]: for df in store.select("df", chunksize=3):
   .....:     print(df)
   .....: 
                   A         B         C
2000-01-01  1.334065  0.521036  0.930384
2000-01-02 -1.613932  1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
                   A         B         C
2000-01-04  0.632369 -1.249657  0.975593
2000-01-05  1.060617 -0.143682  0.218423
2000-01-06  3.050329  1.317933 -0.963725
                   A         B         C
2000-01-07 -0.539452 -0.771133  0.023751
2000-01-08  0.649464 -1.736427  0.197288

Note
You can also use an iterator with read_hdf, which will automatically close the store when finished iterating:

for df in pd.read_hdf("store.h5", "df", chunksize=3):
    print(df)
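
iterator=True, mentioned above, is an alternative to chunksize that yields chunks of the default size; a brief sketch:

# returns an iterator over default-sized chunks rather than the whole table
for chunk in store.select("df", iterator=True):
    print(chunk.shape)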

Note

The chunksize keyword applies to the source rows: when you run a query, all rows are subdivided into chunks and the query is applied to each, so the chunks the iterator returns may vary in size.

Here is a recipe for generating equal-sized chunks:

In [445]: dfeq = pd.DataFrame({"number": np.arange(1, 11)})

In [446]: dfeq
Out[446]: 
   number
0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8       9
9      10

In [447]: store.append("dfeq", dfeq, data_columns=["number"])

In [448]: def chunks(l, n):
   .....:     return [l[i: i + n] for i in range(0, len(l), n)]
   .....: 

In [449]: evens = [2, 4, 6, 8, 10]

In [450]: coordinates = store.select_as_coordinates("dfeq", "number=evens")

In [451]: for c in chunks(coordinates, 2):
   .....:     print(store.select("dfeq", where=c))
   .....: 
   number
1       2
3       4
   number
5       6
7       8
   number
9      10

7.7 Advanced queries

7.7.1 Select a single column

To retrieve a single indexable or data column, use the select_column method. This will, for example, let you get the index very quickly. It returns a Series of the results, indexed by the incrementing row number.

This method does not (yet) accept the where selector:

In [452]: store.select_column("df_dc", "index")
Out[452]: 
0   2000-01-01
1   2000-01-02
2   2000-01-03
3   2000-01-04
4   2000-01-05
5   2000-01-06
6   2000-01-07
7   2000-01-08
Name: index, dtype: datetime64[ns]

In [453]: store.select_column("df_dc", "string")
Out[453]: 
0    foo
1    foo
2    foo
3    foo
4    NaN
5    NaN
6    foo
7    bar
Name: string, dtype: object

7.7.2 Selecting coordinates

Sometimes you want to get the coordinates (a.k.a. the index locations) of a query. This returns an Int64Index of the resulting locations, which can be passed to a subsequent where operation:

In [454]: df_coord = pd.DataFrame(
   .....:     np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
   .....: )
   .....: 

In [455]: store.append("df_coord", df_coord)

In [456]: c = store.select_as_coordinates("df_coord", "index > 20020101")

In [457]: c
Out[457]: 
Int64Index([732, 733, 734, 735, 736, 737, 738, 739, 740, 741,
            ...
            990, 991, 992, 993, 994, 995, 996, 997, 998, 999],
           dtype='int64', length=268)

In [458]: store.select("df_coord", where=c)
Out[458]: 
                   0         1
2002-01-02 -0.165548  0.646989
2002-01-03  0.782753 -0.123409
2002-01-04 -0.391932 -0.740915
2002-01-05  1.211070 -0.668715
2002-01-06  0.341987 -0.685867
...              ...       ...
2002-09-22  1.788110 -0.405908
2002-09-23 -0.801912  0.768460
2002-09-24  0.466284 -0.457411
2002-09-25 -0.364060  0.785367
2002-09-26 -1.463093  1.187315

[268 rows x 2 columns]

7.7.3 Selecting using a where mask

Sometimes your query can involve creating a list of rows to select. Usually this mask is an index returned by a prior indexing operation:

In [459]: df_mask = pd.DataFrame(
   .....:     np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
   .....: )
   .....: 

In [460]: store.append("df_mask", df_mask)

In [461]: c = store.select_column("df_mask", "index")

In [462]: where = c[pd.DatetimeIndex(c).month == 5].index

In [463]: store.select("df_mask", where=where)
Out[463]: 
                   0         1
2000-05-01  1.735883 -2.615261
2000-05-02  0.422173  2.425154
2000-05-03  0.632453 -0.165640
2000-05-04 -1.017207 -0.005696
2000-05-05  0.299606  0.070606
...              ...       ...
2002-05-27  0.234503  1.199126
2002-05-28 -3.021833 -1.016828
2002-05-29  0.522794  0.063465
2002-05-30 -1.653736  0.031709
2002-05-31 -0.968402 -0.393583

[93 rows x 2 columns]

7.7.4 Storer object

If you want to inspect the stored object, retrieve it via get_storer. You can use this programmatically, say, to get the number of rows in an object:

In [464]: store.get_storer("df_dc").nrows
Out[464]: 8
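
For example, nrows can be combined with the start/stop parameters from section 7.1 to walk a table in fixed-size blocks; a rough sketch, with an arbitrary block size:

nrows = store.get_storer("df_dc").nrows
chunksize = 3

for i in range(0, nrows, chunksize):
    block = store.select("df_dc", start=i, stop=i + chunksize)
    print(block.shape)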

7.8 Multiple table queries

The methods append_to_multiple and select_as_multiple can append to and select from multiple tables at once.

The idea is to have one table (call it the selector table) that indexes most or all of the columns, and to perform your queries on it. The other tables are data tables whose index matches the selector table's index.

You can then perform a very fast query on the selector table, yet get lots of data back. This method is similar to having a very wide table, but enables more efficient queries.

The append_to_multiple method splits a given single DataFrame into multiple tables according to a dict that maps table names to the columns you want in each table. If None is used in place of a list, that table will receive the remaining, unspecified columns of the given DataFrame.

The selector argument defines which table is the selector table (i.e. the one you can make queries from). The dropna argument drops rows from the input DataFrame to keep the tables synchronized: if a row to be written is entirely missing in one of the tables, that row is dropped from all tables.

If dropna=False, the user is responsible for synchronizing the tables. Remember that entirely np.nan rows are not written to the HDFStore, so if you choose dropna=False, some tables may end up with more rows than others, and select_as_multiple may then not work, or may return unexpected results.

In [465]: df_mt = pd.DataFrame(
   .....:     np.random.randn(8, 6),
   .....:     index=pd.date_range("1/1/2000", periods=8),
   .....:     columns=["A", "B", "C", "D", "E", "F"],
   .....: )
   .....: 

In [466]: df_mt["foo"] = "bar"

In [467]: df_mt.loc[df_mt.index[1], ("A", "B")] = np.nan

# you can also create the tables individually
In [468]: store.append_to_multiple(
   .....:     {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
   .....: )
   .....: 

In [469]: store
Out[469]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

# individual tables were created
In [470]: store.select("df1_mt")
Out[470]: 
                   A         B
2000-01-01  1.251079 -0.362628
2000-01-02       NaN       NaN
2000-01-03  0.719421 -0.448886
2000-01-04  1.140998 -0.877922
2000-01-05  1.043605  1.798494
2000-01-06 -0.467812 -0.027965
2000-01-07  0.150568  0.754820
2000-01-08 -0.596306 -0.910022

In [471]: store.select("df2_mt")
Out[471]: 
                   C         D         E         F  foo
2000-01-01  1.602451 -0.221229  0.712403  0.465927  bar
2000-01-02 -0.525571  0.851566 -0.681308 -0.549386  bar
2000-01-03 -0.044171  1.396628  1.041242 -1.588171  bar
2000-01-04  0.463351 -0.861042 -2.192841 -1.025263  bar
2000-01-05 -1.954845 -1.712882 -0.204377 -1.608953  bar
2000-01-06  1.601542 -0.417884 -2.757922 -0.307713  bar
2000-01-07 -1.935461  1.007668  0.079529 -1.459471  bar
2000-01-08 -1.057072 -0.864360 -1.124870  1.732966  bar

# as a multiple
In [472]: store.select_as_multiple(
   .....:     ["df1_mt", "df2_mt"],
   .....:     where=["A>0", "B>0"],
   .....:     selector="df1_mt",
   .....: )
   .....: 
Out[472]: 
                   A         B         C         D         E         F  foo
2000-01-05  1.043605  1.798494 -1.954845 -1.712882 -0.204377 -1.608953  bar
2000-01-07  0.150568  0.754820 -1.935461  1.007668  0.079529 -1.459471  bar