xarray (Tutorial) Chapter 2: Data Structures

Data Structures

DataArray

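DataArray is xarray's implementation of a labeled, multi-dimensional array.
It has several key properties:

  • values: a numpy.ndarray or numpy-like array holding the array's values
  • dims: dimension names for each axis (e.g., ('x', 'y', 'z'))
  • coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
  • attrs: a dict holding arbitrary metadata (attributes)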
Creating a DataArray

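The DataArray constructor takes:

  • data: a multi-dimensional array of values (e.g., a numpy ndarray, a numpy-like array, Series, DataFrame or pandas.Panel)
  • coords: a list or dictionary of coordinates. If a list, it should be a list of tuples where the first element is the dimension name and the second element is the corresponding coordinate array_like object.
  • dims: a list of dimension names. If omitted and coords is a list of tuples, dimension names are taken from coords.
  • attrs: a dictionary of attributes to add to the instance
  • name: a string that names the instance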
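
The two coords forms described in the list above can be sketched as follows. This is a minimal, self-contained example with made-up random data; the names foo2 and foo3 are illustrative only and do not appear elsewhere in this chapter.

```python
import numpy as np
import pandas as pd
import xarray as xr

data = np.random.rand(4, 3)
locs = ["IA", "IL", "IN"]
times = pd.date_range("2000-01-01", periods=4)

# coords as a list of (dimension name, coordinate values) tuples;
# dims is omitted, so the dimension names are taken from the tuples
foo2 = xr.DataArray(data, coords=[("time", times), ("space", locs)])

# coords as a dictionary, with dims given explicitly to fix the axis order
foo3 = xr.DataArray(
    data, coords={"time": times, "space": locs}, dims=["time", "space"]
)

print(foo2.dims)
print(foo3.dims)
```

Both calls should describe the same labeled array; the tuple form simply saves repeating the dimension names.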
import xarray as xr
import numpy as np
import pandas as pd
data = np.random.rand(4, 3)

locs = ["IA", "IL", "IN"]

times = pd.date_range("2000-01-01", periods=4)

foo = xr.DataArray(data, coords=[times, locs], dims=["time", "space"])

print(foo)
"""
<xarray.DataArray (time: 4, space: 3)>
array([[0.35200563, 0.569955  , 0.98816514],
       [0.95311654, 0.8268097 , 0.15125512],
       [0.6361332 , 0.35353079, 0.33760348],
       [0.28808098, 0.19150729, 0.79793025]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
"""
DataArray properties
import xarray as xr
import numpy as np
import pandas as pd
data = np.random.rand(4, 3)

locs = ["IA", "IL", "IN"]

times = pd.date_range("2000-01-01", periods=4)

foo = xr.DataArray(data, coords=[times, locs], dims=["time", "space"])
print(foo.values)

print(foo.dims)

print(foo.coords)

print(foo.attrs)

print(foo.name)
"""
[[0.90006724 0.52320969 0.54733588]
 [0.1018204  0.65948629 0.52749082]
 [0.73632638 0.50559387 0.57415316]
 [0.24010929 0.52222784 0.39407124]]
('time', 'space')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
{}
None
"""

foo.name = "foo"

foo.attrs["units"] = "meters"

foo

"""
<xarray.DataArray 'foo' (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
Attributes:
    units:    meters
"""
foo.rename("bar")
"""
<xarray.DataArray 'bar' (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
Attributes:
    units:    meters
"""
DataArray Coordinates

foo.coords["time"]
"""
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
       '2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
"""
foo["time"]
"""
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
       '2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
"""
foo["ranking"] = ("space", [1, 2, 3])

foo.coords
"""
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
    ranking  (space) int64 1 2 3
"""
del foo["ranking"]

foo.coords
"""
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
"""

Dataset

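A Dataset is xarray's multi-dimensional equivalent of a DataFrame. It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the netCDF file format.
In addition to the dict-like interface of the dataset itself, which can be used to access any variable in the dataset, datasets have four key properties:

  • dims: a dictionary mapping from dimension names to the fixed length of each dimension (e.g., {'x': 6, 'y': 6, 'time': 8})
  • data_vars: a dict-like container of DataArrays corresponding to variables
  • coords: another dict-like container of DataArrays, intended to label the points used in data_vars (e.g., arrays of numbers, datetime objects or strings)
  • attrs: a dict holding arbitrary metadata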
Creating a Dataset

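To create a Dataset from scratch, supply dictionaries for any variables (data_vars), coordinates (coords) and attributes (attrs).

  • data_vars should be a dictionary with each key as the name of a variable and each value as one of:
    • a DataArray or Variable
    • a tuple of the form (dims, data[, attrs]), which is converted into arguments for Variable
    • a pandas object, which is converted into a DataArray
    • a 1D array or list, which is interpreted as values for a one-dimensional coordinate variable along the same dimension as its name
  • coords should be a dictionary of the same form as data_vars.
  • attrs should be a dictionary.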
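
Several of the value forms listed above can be mixed in a single constructor call. The sketch below is an illustration only: the variable names (height, weight, age, small) and the numbers are made up, and it assumes the pandas Series has a named index so that the index name can serve as the dimension name.

```python
import pandas as pd
import xarray as xr

# a DataArray value is used as-is
height = xr.DataArray([1.0, 2.0, 3.0], dims="x")

# a (dims, data, attrs) tuple is converted into a variable
weight = ("x", [10.0, 20.0, 30.0], {"units": "kg"})

# a pandas object is converted into a DataArray;
# here the index name "x" becomes the dimension name
age = pd.Series([5, 6, 7], index=pd.Index([0, 1, 2], name="x"))

small = xr.Dataset({"height": height, "weight": weight, "age": age})
print(small)
```

All three variables end up sharing the dimension x, which is what "aligned dimensions" means for a Dataset.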
temp = 15 + 8 * np.random.randn(2, 2, 3)

precip = 10 * np.random.rand(2, 2, 3)

lon = [[-99.83, -99.32], [-99.79, -99.23]]

lat = [[42.25, 42.21], [42.63, 42.59]]

ds = xr.Dataset(
    {
        "temperature": (["x", "y", "time"], temp),
        "precipitation": (["x", "y", "time"], precip),
    },
    coords={
        "lon": (["x", "y"], lon),
        "lat": (["x", "y"], lat),
        "time": pd.date_range("2014-09-06", periods=3),
        "reference_time": pd.Timestamp("2014-09-05"),
    },
)
ds
"""
<xarray.Dataset>
Dimensions:         (x: 2, y: 2, time: 3)
Coordinates:
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 11.04 23.57 20.77 ... 6.301 9.61 15.91
    precipitation   (x, y, time) float64 5.904 2.453 3.404 ... 3.435 1.709 3.947
"""
Dataset contents
"temperature" in ds
True
ds["temperature"]
"""
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[11.041, 23.574, 20.772],
        [ 9.346,  6.683, 17.175]],

       [[11.6  , 19.536, 17.21 ],
        [ 6.301,  9.61 , 15.909]]])
Coordinates:
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
"""
ds.data_vars
"""
Data variables:
    temperature    (x, y, time) float64 11.04 23.57 20.77 ... 6.301 9.61 15.91
    precipitation  (x, y, time) float64 5.904 2.453 3.404 ... 3.435 1.709 3.947

"""
ds.coords
"""
Coordinates:
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
"""
ds.attrs["title"] = "example attribute"

ds
"""
<xarray.Dataset>
Dimensions:         (x: 2, y: 2, time: 3)
Coordinates:
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 11.04 23.57 20.77 ... 6.301 9.61 15.91
    precipitation   (x, y, time) float64 5.904 2.453 3.404 ... 3.435 1.709 3.947
Attributes:
    title:    example attribute
"""
Dictionary like methods
ds = xr.Dataset()

ds["temperature"] = (("x", "y", "time"), temp)

ds["temperature_double"] = (("x", "y", "time"), temp * 2)

ds["precipitation"] = (("x", "y", "time"), precip)

ds.coords["lat"] = (("x", "y"), lat)

ds.coords["lon"] = (("x", "y"), lon)

ds.coords["time"] = pd.date_range("2014-09-06", periods=3)

ds.coords["reference_time"] = pd.Timestamp("2014-09-05")
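
Beyond item assignment, a Dataset also supports the other usual mapping operations. A minimal sketch on a tiny, made-up dataset (ds_small and its variables a, b, c are hypothetical names used only here):

```python
import numpy as np
import xarray as xr

ds_small = xr.Dataset()
ds_small["a"] = (("x",), np.arange(3))
ds_small["b"] = (("x",), np.arange(3) * 2)

# iterating over a dataset yields its data variable names
print(list(ds_small))

# "in" checks whether a variable with that name exists
print("a" in ds_small)

# update() adds or replaces variables in place, like dict.update
ds_small.update({"c": (("x",), np.arange(3) * 3)})

# del removes a variable in place
del ds_small["b"]
print(list(ds_small.data_vars))
```

Note that these operations modify the dataset in place, unlike the transformation methods, which return new objects.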
Transforming datasets

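In addition to the dictionary-like methods described above, xarray has additional methods (like pandas) for transforming datasets into new objects.
To remove variables, you can select an explicit list of variables by indexing with a list of names, or drop them with the drop_vars() method, which returns a new Dataset. In both cases the coordinates are kept: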

ds[["temperature"]]
"""
<xarray.Dataset>
Dimensions:         (x: 2, y: 2, time: 3)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 11.04 23.57 20.77 ... 6.301 9.61 15.91
    """
ds[["temperature", "temperature_double"]]
"""
<xarray.Dataset>
Dimensions:             (x: 2, y: 2, time: 3)
Coordinates:
    lat                 (x, y) float64 42.25 42.21 42.63 42.59
    lon                 (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time                (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time      datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature         (x, y, time) float64 11.04 23.57 20.77 ... 9.61 15.91
    temperature_double  (x, y, time) float64 22.08 47.15 41.54 ... 19.22 31.82

"""
ds.drop_vars("temperature")
"""
<xarray.Dataset>
Dimensions:             (x: 2, y: 2, time: 3)
Coordinates:
    lat                 (x, y) float64 42.25 42.21 42.63 42.59
    lon                 (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time                (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time      datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature_double  (x, y, time) float64 22.08 47.15 41.54 ... 19.22 31.82
    precipitation       (x, y, time) float64 5.904 2.453 3.404 ... 1.709 3.947
    
"""

To remove a dimension, you can use the drop_dims() method. Any variables that use that dimension are dropped as well:

ds.drop_dims("time")
"""Out[58]: 
<xarray.Dataset>
Dimensions:         (x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    *empty*
    """

As an alternative to dictionary-like modifications, you can use assign() and assign_coords(). These methods return a new dataset with additional (or replaced) values:

ds.assign(temperature2=2 * ds.temperature)
"""Out[59]: 
<xarray.Dataset>
Dimensions:             (x: 2, y: 2, time: 3)
Coordinates:
    lat                 (x, y) float64 42.25 42.21 42.63 42.59
    lon                 (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time                (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time      datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature         (x, y, time) float64 11.04 23.57 20.77 ... 9.61 15.91
    temperature_double  (x, y, time) float64 22.08 47.15 41.54 ... 19.22 31.82
    precipitation       (x, y, time) float64 5.904 2.453 3.404 ... 1.709 3.947
    temperature2        (x, y, time) float64 22.08 47.15 41.54 ... 19.22 31.82"""
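
The example above covers assign(); assign_coords() works the same way for coordinates. The following sketch uses a small made-up dataset (base, labeled and doubled are hypothetical names) and also checks that the original object is left untouched:

```python
import numpy as np
import xarray as xr

base = xr.Dataset({"temperature": (("x", "time"), np.zeros((2, 3)))})

# assign_coords returns a new dataset carrying the extra coordinate
labeled = base.assign_coords(x=[10, 20])

# assign also accepts a callable that is evaluated on the dataset itself
doubled = labeled.assign(temperature2=lambda d: d.temperature * 2)

print("x" in base.coords)      # the original dataset is unchanged
print("x" in labeled.coords)
print(list(doubled.data_vars))
```

Passing a callable is convenient inside a method chain, where the intermediate dataset has no name of its own.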

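There is also the pipe() method, which lets you call an external function as if it were a method (e.g., ds.pipe(func)) instead of calling it directly (e.g., func(ds)). This allows you to write pipelines for transforming your data (using "method chaining") instead of writing hard-to-follow nested function calls: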

import matplotlib.pyplot as plt

# these lines are equivalent, but with pipe we can make the logic flow
# entirely from left to right
plt.plot((2 * ds.temperature.sel(x=0)).mean("y"))
#Out[60]: [<matplotlib.lines.Line2D at 0x7f6e862a7460>]

(ds.temperature.sel(x=0).pipe(lambda x: 2 * x).mean("y").pipe(plt.plot))
#Out[61]: [<matplotlib.lines.Line2D at 0x7f6e862a6230>]
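
The same equivalence can be checked without plotting. In this sketch, scale is a hypothetical helper written for the example (it is not part of xarray), and the input values are made up; extra positional arguments given to pipe() are forwarded to the function.

```python
import numpy as np
import xarray as xr


def scale(obj, factor):
    """Hypothetical helper: multiply an xarray object by a constant."""
    return obj * factor


temperature = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("y", "time"))

# nested call: reads inside-out
nested = scale(temperature, 2).mean("y")

# method chain: reads left to right
chained = temperature.pipe(scale, 2).mean("y")

print(chained.values)
print(nested.equals(chained))
```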

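With xarray, there is no performance penalty for creating new datasets, even if variables are lazily loaded from a file on disk. Creating new objects instead of mutating existing ones often results in code that is easier to understand, so we encourage using this approach.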

Renaming variables

ds.rename({"temperature": "temp", "precipitation": "precip"})
"""Out[62]: 
<xarray.Dataset>
Dimensions:             (x: 2, y: 2, time: 3)
Coordinates:
    lat                 (x, y) float64 42.25 42.21 42.63 42.59
    lon                 (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time                (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time      datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temp                (x, y, time) float64 11.04 23.57 20.77 ... 9.61 15.91
    temperature_double  (x, y, time) float64 22.08 47.15 41.54 ... 19.22 31.82
    precip              (x, y, time) float64 5.904 2.453 3.404 ... 1.709 3.947"""

Coordinates

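Coordinates are ancillary variables stored for DataArray and Dataset objects in the coords attribute:

# not shown in the original post; inferred from the "day" entry in the outputs below
ds.coords["day"] = ("time", [6, 7, 8])

ds.coords
"""Out[65]: 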
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
    day             (time) int64 6 7 8"""
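  • Dimension coordinates are one-dimensional coordinates with a name equal to their sole dimension (marked by * when printing a dataset or data array). They are used for label-based indexing and alignment, like the index found on a pandas DataFrame or Series. Indeed, these dimension coordinates use a pandas.Index internally to store their values.
  • Non-dimension coordinates are variables that contain coordinate data but are not dimension coordinates. They can be multidimensional (see Working with Multidimensional Coordinates), and there is no relationship between the name of a non-dimension coordinate and the name(s) of its dimension(s). Non-dimension coordinates can be useful for indexing or plotting; otherwise, xarray does not make any direct use of the values associated with them. They are not used for alignment or automatic indexing, nor are they required to match when doing arithmetic (see Coordinates).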
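
The difference between the two kinds of coordinates shows up in indexing and arithmetic. In the sketch below (all names and values are made up for illustration), x is a dimension coordinate and label is a non-dimension coordinate along the same dimension:

```python
import xarray as xr

a = xr.DataArray(
    [1, 2, 3],
    dims="x",
    coords={"x": [10, 20, 30], "label": ("x", ["a", "b", "c"])},
)
b = xr.DataArray([10, 20, 30], dims="x", coords={"x": [20, 30, 40]})

# label-based indexing uses the dimension coordinate "x"
print(a.sel(x=20).item())

# arithmetic aligns on "x": only the shared labels 20 and 30 remain;
# the non-dimension coordinate "label" plays no part in the alignment
total = a + b
print(total.x.values)
print(total.values)
```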
Modifying coordinates

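To entirely add or remove coordinate arrays, you can use dictionary-like syntax, as shown above. To convert back and forth between data variables and coordinates, you can use the set_coords() and reset_coords() methods: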

ds.reset_coords()
'''Out[66]: 
<xarray.Dataset>
Dimensions:             (x: 2, y: 2, time: 3)
Coordinates:
  * time                (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Dimensions without coordinates: x, y
Data variables:
    temperature         (x, y, time) float64 11.04 23.57 20.77 ... 9.61 15.91
    temperature_double  (x, y, time) float64 22.08 47.15 41.54 ... 19.22 31.82
    precipitation       (x, y, time) float64 5.904 2.453 3.404 ... 1.709 3.947
    lat                 (x, y) float64 42.25 42.21 42.63 42.59
    lon                 (x, y) float64 -99.83 -99.32 -99.79 -99.23
    reference_time      datetime64[ns] 2014-09-05
    day                 (time) int64 6 7 8
'''
ds.set_coords(["temperature", "precipitation"])
'''Out[67]: 
<xarray.Dataset>
Dimensions:             (x: 2, y: 2, time: 3)
Coordinates:
    temperature         (x, y, time) float64 11.04 23.57 20.77 ... 9.61 15.91
    precipitation       (x, y, time) float64 5.904 2.453 3.404 ... 1.709 3.947
    lat                 (x, y) float64 42.25 42.21 42.63 42.59
    lon                 (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time                (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time      datetime64[ns] 2014-09-05
    day                 (time) int64 6 7 8
Dimensions without coordinates: x, y
Data variables:
    temperature_double  (x, y, time) float64 22.08 47.15 41.54 ... 19.22 31.82
'''
ds["temperature"].reset_coords(drop=True)
'''Out[68]: 
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[11.041, 23.574, 20.772],
        [ 9.346,  6.683, 17.175]],

       [[11.6  , 19.536, 17.21 ],
        [ 6.301,  9.61 , 15.909]]])
Coordinates:
  * time     (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Dimensions without coordinates: x, y'''
Coordinates methods

Coordinates objects also have a few useful methods, mostly for converting them into dataset objects:

ds.coords.to_dataset()
'''Out[69]: 
<xarray.Dataset>
Dimensions:         (x: 2, y: 2, time: 3)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
    day             (time) int64 6 7 8
Dimensions without coordinates: x, y
Data variables:
    *empty*'''

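The merge method is particularly interesting, because it implements the same logic used for merging coordinates in arithmetic operations (see Computation):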

alt = xr.Dataset(coords={"z": [10], "lat": 0, "lon": 0})

ds.coords.merge(alt.coords)
'''Out[71]: 
<xarray.Dataset>
Dimensions:         (time: 3, z: 1)
Coordinates:
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
    day             (time) int64 6 7 8
  * z               (z) int64 10
Data variables:
    *empty*'''
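
In the output above, lat and lon are missing because their values conflict between the two sets of coordinates. The same rule can be observed directly in arithmetic. A minimal sketch with made-up scalar coordinates (height and station are hypothetical names):

```python
import xarray as xr

a = xr.DataArray([1, 2], dims="x", coords={"height": 0, "station": 1})
b = xr.DataArray([3, 4], dims="x", coords={"height": 0, "station": 2})

result = a + b

# "height" agrees in both operands and is kept;
# "station" conflicts and is dropped from the result
print(list(result.coords))
```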
Indexes

To convert a coordinate (or any DataArray) into an actual pandas.Index, use the to_index() method:

ds["time"].to_index()
#Out[72]: DatetimeIndex(['2014-09-06', '2014-09-07', '2014-09-08'], dtype='datetime64[ns]', name='time', freq='D')

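A useful shortcut is the indexes property (on both DataArray and Dataset), which lazily constructs a dictionary whose keys are given by each dimension and whose values are Index objects: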

ds.indexes
'''Out[73]: 
Indexes:
    time     DatetimeIndex(['2014-09-06', '2014-09-07', '2014-09-08'], dtype='datetime64[ns]', name='time', freq='D')'''
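
Both routes should give the same pandas object. A short self-contained check, using a made-up DataArray named da:

```python
import pandas as pd
import xarray as xr

da = xr.DataArray(
    [1.0, 2.0, 3.0],
    dims="time",
    coords={"time": pd.date_range("2014-09-06", periods=3)},
)

# explicit conversion of one coordinate
idx_from_coord = da["time"].to_index()

# lookup through the indexes mapping
idx_from_mapping = da.indexes["time"]

print(type(idx_from_coord).__name__)
print(idx_from_coord.equals(idx_from_mapping))
```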
MultiIndex coordinates
midx = pd.MultiIndex.from_arrays(
    [["R", "R", "V", "V"], [0.1, 0.2, 0.7, 0.9]], names=("band", "wn")
)


mda = xr.DataArray(np.random.rand(4), coords={"spec": midx}, dims="spec")

mda
'''Out[76]: 
<xarray.DataArray (spec: 4)>
array([0.642, 0.275, 0.462, 0.871])
Coordinates:
  * spec     (spec) object MultiIndex
  * band     (spec) object 'R' 'R' 'V' 'V'
  * wn       (spec) float64 0.1 0.2 0.7 0.9'''

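For convenience, multi-index levels are directly accessible as "virtual" or "derived" coordinates (older xarray versions mark them with - when printing a dataset or data array; in the output shown here they carry the same * marker as other indexed coordinates):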

mda["band"]
'''Out[77]: 
<xarray.DataArray 'band' (spec: 4)>
array(['R', 'R', 'V', 'V'], dtype=object)
Coordinates:
  * spec     (spec) object MultiIndex
  * band     (spec) object 'R' 'R' 'V' 'V'
  * wn       (spec) float64 0.1 0.2 0.7 0.9'''

mda.wn
'''Out[78]: 
<xarray.DataArray 'wn' (spec: 4)>
array([0.1, 0.2, 0.7, 0.9])
Coordinates:
  * spec     (spec) object MultiIndex
  * band     (spec) object 'R' 'R' 'V' 'V'
  * wn       (spec) float64 0.1 0.2 0.7 0.9'''
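
Level names can also be used for label-based selection. The sketch below repeats the construction from this section with fixed values so the result is predictable; it assumes an xarray version that still accepts a pandas.MultiIndex passed directly as a coordinate (recent versions may emit a warning for this).

```python
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_arrays(
    [["R", "R", "V", "V"], [0.1, 0.2, 0.7, 0.9]], names=("band", "wn")
)

# same construction as above, but with fixed values instead of random ones
mda = xr.DataArray([1.0, 2.0, 3.0, 4.0], coords={"spec": midx}, dims="spec")

# select every entry whose "band" level equals "R"
r_band = mda.sel(band="R")
print(r_band.values)
```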