xarray (Tutorial), Chapter 3: Using Indexes

西大·阿志

Last modified 2024-01-12 15:23:53

Category: xarray usage tutorials; tags: python

First published 2024-01-12 11:28:59

Article link: https://blog.csdn.net/qq_46608464/article/details/135548507

Indexing and selecting data

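Xarray offers extremely flexible indexing routines that combine the best features of NumPy and pandas for data selection.
The most basic way to access elements of a DataArray object is Python's [] syntax, such as array[i, j], where i and j are both integers. Because xarray objects can store coordinates corresponding to each dimension of an array, label-based indexing similar to pandas.DataFrame.loc is also possible. In label-based indexing, the element position i is looked up automatically from the coordinate values.
Dimensions of xarray objects have names, so you can also look up dimensions by name instead of remembering their positional order.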

Quick overview

[Image: quick overview of the indexing methods (not available)]
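
As a rough stand-in for an overview, the sketch below shows the four ways a DataArray can be indexed: the dimension is found either by position or by name, and the element either by integer or by label. The array and its values are made up for illustration; the dimension names "time" and "space" simply mirror the examples used later in this chapter.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative array with known values (not the random data used elsewhere).
da = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    coords=[
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)

by_position_int = da[:, 0]            # dimension by position, element by integer
by_position_label = da.loc[:, "IA"]   # dimension by position, element by label
by_name_int = da.isel(space=0)        # dimension by name, element by integer
by_name_label = da.sel(space="IA")    # dimension by name, element by label

# All four expressions select the same column.
print(by_position_int.values)  # [0. 3. 6. 9.]
```

Which form to prefer is mostly a matter of what you know about the data: integer positions, coordinate labels, or dimension names.
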

Positional indexing

Indexing a DataArray directly works (mostly) just like indexing a NumPy array, except that the returned object is always another DataArray:

import xarray as xr
import numpy as np
import pandas as pd

da = xr.DataArray(
    np.random.rand(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)

print(da[:2])
"""<xarray.DataArray (time: 2, space: 3)>
array([[0.21404789, 0.916436  , 0.9182366 ],
       [0.11161617, 0.72361921, 0.28669233]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
  * space    (space) <U2 'IA' 'IL' 'IN'
  """
print(da[0, 0])
"""<xarray.DataArray ()>
array(0.21404789)
Coordinates:
    time     datetime64[ns] 2000-01-01
    space    <U2 'IA'
    """
print(da[:, [2, 1]])
"""
<xarray.DataArray (time: 4, space: 2)>
array([[0.9182366 , 0.916436  ],
       [0.28669233, 0.72361921],
       [0.05506156, 0.36749797],
       [0.92949472, 0.20359987]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IN' 'IL'
  """

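One place where the "mostly" qualifier matters is indexing with several integer lists at once. As far as I understand xarray's behavior, each dimension is indexed independently (orthogonal indexing), whereas NumPy pairs the lists element by element. The small array below is invented purely to show the difference in shape.

```python
import numpy as np
import xarray as xr

values = np.arange(12).reshape(4, 3)
da = xr.DataArray(values, dims=("time", "space"))

# NumPy pairs the two index lists element-wise: it picks (0, 0) and (1, 1).
numpy_result = values[[0, 1], [0, 1]]

# xarray indexes each dimension on its own: rows 0-1 crossed with columns 0-1.
xarray_result = da[[0, 1], [0, 1]]

print(numpy_result.shape)            # (2,)
print(xarray_result.shape)           # (2, 2)
print(type(xarray_result).__name__)  # DataArray
```
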
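Xarray also supports label-based indexing, just like pandas. Because xarray uses a pandas.Index under the hood, label-based indexing is very fast. To do label-based indexing, use the loc attribute: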

da.loc["2000-01-01":"2000-01-02", "IA"]
"""Out[5]: 
<xarray.DataArray (time: 2)>
array([0.127, 0.897])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
    space    <U2 'IA'"""
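
The loc attribute can also be used on the left-hand side of an assignment to set values by label. Some outputs later in this chapter show -10 in the first row, which would be consistent with an assignment like the one sketched here; that connection is my assumption, since the original text does not show the step.

```python
import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(
    np.random.rand(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)

# Overwrite two columns of the first time step, both selected by label.
da.loc["2000-01-01", ["IL", "IN"]] = -10

# The "IA" entry keeps its random value; the other two are now -10.
print(da.loc["2000-01-01"].values)
```
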

Indexing with dimension names

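With dimension names, we do not have to rely on dimension order and can use the names explicitly to slice data. There are two ways to do this:

Use the sel() and isel() convenience methods: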

# index by integer array indices
da.isel(space=0, time=slice(None, 2))
"""Out[8]: 
<xarray.DataArray (time: 2)>
array([0.127, 0.897])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
    space    <U2 'IA'"""

# index by dimension coordinate labels
da.sel(time=slice("2000-01-01", "2000-01-02"))
"""Out[9]: 
<xarray.DataArray (time: 2, space: 3)>
array([[  0.127, -10.   , -10.   ],
       [  0.897,   0.377,   0.336]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
  * space    (space) <U2 'IA' 'IL' 'IN'"""

Use a dictionary as the argument for positional or label-based array indexing:

# index by integer array indices
da[dict(space=0, time=slice(None, 2))]
"""Out[10]: 
<xarray.DataArray (time: 2)>
array([0.127, 0.897])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
    space    <U2 'IA'"""

# index by dimension coordinate labels
da.loc[dict(time=slice("2000-01-01", "2000-01-02"))]
"""Out[11]: 
<xarray.DataArray (time: 2, space: 3)>
array([[  0.127, -10.   , -10.   ],
       [  0.897,   0.377,   0.336]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
  * space    (space) <U2 'IA' 'IL' 'IN'"""
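
To make the point about dimension order concrete, here is a small sketch with made-up values. It checks that the keyword and dictionary forms give the same result, and that name-based selection is unaffected by transposing the array while plain positional indexing is not.

```python
import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)

# Keyword form and dictionary form are interchangeable.
same_int = da.isel(space=0, time=slice(None, 2)).equals(
    da[dict(space=0, time=slice(None, 2))]
)
same_label = da.sel(time=slice("2000-01-01", "2000-01-02")).equals(
    da.loc[dict(time=slice("2000-01-01", "2000-01-02"))]
)
print(same_int, same_label)  # True True

# Name-based selection ignores dimension order.
flipped = da.transpose("space", "time")
print(flipped.isel(space=0).equals(da.isel(space=0)))  # True

# Positional indexing, by contrast, now addresses a different axis.
print(flipped[0].dims, da[0].dims)  # ('time',) ('space',)
```
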

Nearest neighbor lookups

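The label-based selection methods sel(), reindex() and reindex_like() all support the method and tolerance keyword arguments. The method argument enables nearest-neighbor (inexact) lookups via the 'pad', 'backfill' or 'nearest' methods: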

da = xr.DataArray([1, 2, 3], [("x", [0, 1, 2])])

da.sel(x=[1.1, 1.9], method="nearest")
"""Out[13]: 
<xarray.DataArray (x: 2)>
array([2, 3])
Coordinates:
  * x        (x) int64 1 2"""

da.sel(x=0.1, method="backfill")
"""Out[14]: 
<xarray.DataArray ()>
array(2)
Coordinates:
    x        int64 1"""

da.reindex(x=[0.5, 1, 1.5, 2, 2.5], method="pad")
"""Out[15]: 
<xarray.DataArray (x: 5)>
array([1, 2, 2, 3, 3])
Coordinates:
  * x        (x) float64 0.5 1.0 1.5 2.0 2.5"""

method : {None, "nearest", "pad", "ffill", "backfill", "bfill"}, optional
Method to use for inexact matches:

        - None (default): only exact matches
        - pad / ffill: propagate last valid index value forward
        - backfill / bfill: propagate next valid index value backward
        - nearest: use nearest valid index value
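
A quick way to see how the options differ is to request one label that falls between two existing ones and try each method in turn. The sketch below reuses the three-element array from above; the lookup value 1.4 is an arbitrary choice.

```python
import xarray as xr

da = xr.DataArray([1, 2, 3], [("x", [0, 1, 2])])

# The requested label 1.4 lies between the existing labels 1 and 2.
results = {}
for method in ("pad", "backfill", "nearest"):
    picked = da.sel(x=1.4, method=method)
    # Record (matched label, selected value) for each method.
    results[method] = (picked.x.item(), picked.item())
    print(method, results[method])

# pad (1, 2)       -> falls back to the previous label
# backfill (2, 3)  -> jumps ahead to the next label
# nearest (1, 2)   -> 1.4 is closer to 1 than to 2
```
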

The tolerance argument limits the maximum distance for valid matches in an inexact lookup:

da.reindex(x=[1.1, 1.5], method="nearest", tolerance=0.2)
"""Out[16]: 
<xarray.DataArray (x: 2)>
array([ 2., nan])
Coordinates:
  * x        (x) float64 1.1 1.5"""
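
The effect of tolerance differs between reindex() and sel(). In the xarray versions I am aware of, reindex() fills unmatched labels with NaN, while sel() raises a KeyError because it has nothing to fill with. Treat the exact exception type as an assumption about current behavior and not as a documented guarantee.

```python
import numpy as np
import xarray as xr

da = xr.DataArray([1, 2, 3], [("x", [0, 1, 2])])

# reindex(): 1.1 is within 0.2 of label 1, but 1.5 has no match and becomes NaN.
filled = da.reindex(x=[1.1, 1.5], method="nearest", tolerance=0.2)
print(filled.values)

# sel(): a label with no match inside the tolerance is reported as an error.
try:
    da.sel(x=1.5, method="nearest", tolerance=0.2)
    outcome = "found"
except KeyError:
    outcome = "KeyError"
print(outcome)
```
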

The method argument is not yet supported if any of the indexers passed to sel() is a slice object:

da.sel(x=slice(1, 3), method="nearest")
NotImplementedError

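However, you do not need method to do inexact slicing. As long as the index labels are monotonically increasing, slicing already returns all values within the range (inclusive of both endpoints):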

da.sel(x=slice(0.9, 3.1))
"""Out[18]: 
<xarray.DataArray (x: 2)>
array([2, 3])
Coordinates:
  * x        (x) int64 1 2"""
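
The "inclusive" part is easy to overlook, because integer slices in Python exclude the stop value. The following sketch contrasts a label-based slice with an integer-based slice that uses the same numbers.

```python
import xarray as xr

da = xr.DataArray([1, 2, 3], [("x", [0, 1, 2])])

# Label-based slice: both endpoint labels (0 and 1) are included.
by_label = da.sel(x=slice(0, 1))

# Integer-based slice: the stop position is excluded, as in plain Python.
by_position = da.isel(x=slice(0, 1))

print(by_label.values)     # [1 2]
print(by_position.values)  # [1]
```
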

Dataset indexing

We can also use these methods to index all variables in a dataset simultaneously, returning a new dataset:

da = xr.DataArray(
    np.random.rand(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)

ds = da.to_dataset(name="foo")

ds.isel(space=[0], time=[0])
"""Out[23]: 
<xarray.Dataset>
Dimensions:  (time: 1, space: 1)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01
  * space    (space) <U2 'IA'
Data variables:
    foo      (time, space) float64 0.1294"""

ds.sel(time="2000-01-01")
"""Out[24]: 
<xarray.Dataset>
Dimensions:  (space: 3)
Coordinates:
    time     datetime64[ns] 2000-01-01
  * space    (space) <U2 'IA' 'IL' 'IN'
Data variables:
    foo      (space) float64 0.1294 0.8599 0.8204"""
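
With only one variable in the dataset, "all variables simultaneously" is hard to see. The sketch below builds a dataset with a second variable, "bar", which is hypothetical and not part of the original example. It shows that an indexer is applied to every variable that has the named dimension and leaves the others alone.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2000-01-01", periods=4)
ds = xr.Dataset(
    {
        "foo": (("time", "space"), np.arange(12.0).reshape(4, 3)),
        "bar": (("time",), np.arange(4.0)),  # hypothetical second variable
    },
    coords={"time": times, "space": ["IA", "IL", "IN"]},
)

# "space" only exists on foo; "time" exists on both variables.
subset = ds.isel(space=0, time=slice(0, 2))

print(subset["foo"].dims, subset["foo"].values)  # ('time',) [0. 3.]
print(subset["bar"].dims, subset["bar"].values)  # ('time',) [0. 1.]
```
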

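Positional indexing on a dataset is not supported because the ordering of dimensions in a dataset is somewhat ambiguous (it can vary between different arrays). However, you can do normal indexing with dimension names: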

ds[dict(space=[0], time=[0])]
"""Out[25]: 
<xarray.Dataset>
Dimensions:  (time: 1, space: 1)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01
  * space    (space) <U2 'IA'
Data variables:
    foo      (time, space) float64 0.1294"""

ds.loc[dict(time=