Data Structures
DataArray
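DataArray is xarray's implementation of a labeled, multi-dimensional array.
It has several key properties:
- values: a numpy.ndarray or numpy-like array holding the array's values
- dims: dimension names for each axis (e.g., ('x', 'y', 'z'))
- coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
- attrs: a dict to hold arbitrary metadata (attributes)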
Creating a DataArray
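The DataArray constructor takes:
- data: a multi-dimensional array of values (e.g., a numpy ndarray, a numpy-like array, Series, DataFrame or pandas.Panel)
- coords: a list or dictionary of coordinates. If a list, it should be a list of tuples where the first element is the dimension name and the second element is the corresponding coordinate array_like object.
- dims: a list of dimension names. If omitted and coords is a list of tuples, dimension names are taken from coords.
- attrs: a dictionary of attributes to add to the instance
- name: a string that names the instance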
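The list-of-tuples form of coords described above can be sketched as follows. This is a minimal illustration that reuses the same kind of sample data as the example in this section; the names from_tuples and from_dict are assumptions made for this sketch only.

```python
import numpy as np
import pandas as pd
import xarray as xr

data = np.random.rand(4, 3)
locs = ["IA", "IL", "IN"]
times = pd.date_range("2000-01-01", periods=4)

# coords as a list of (dimension name, values) tuples:
# dims is omitted, so the dimension names are taken from coords
from_tuples = xr.DataArray(data, coords=[("time", times), ("space", locs)])

# coords as a dictionary: dims is passed explicitly here to fix the axis order
from_dict = xr.DataArray(
    data,
    coords={"time": times, "space": locs},
    dims=["time", "space"],
)

print(from_tuples.dims)  # ('time', 'space')
print(from_tuples.equals(from_dict))
```

Both calls should describe the same labeled array; the tuple form is simply more compact when every dimension has a coordinate.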
import xarray as xr
import numpy as np
import pandas as pd
data = np.random.rand(4, 3)
locs = ["IA", "IL", "IN"]
times = pd.date_range("2000-01-01", periods=4)
foo = xr.DataArray(data, coords=[times, locs], dims=["time", "space"])
print(foo)
"""
<xarray.DataArray (time: 4, space: 3)>
array([[0.35200563, 0.569955 , 0.98816514],
[0.95311654, 0.8268097 , 0.15125512],
[0.6361332 , 0.35353079, 0.33760348],
[0.28808098, 0.19150729, 0.79793025]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) <U2 'IA' 'IL' 'IN'
"""
DataArray properties
import xarray as xr
import numpy as np
import pandas as pd
data = np.random.rand(4, 3)
locs = ["IA", "IL", "IN"]
times = pd.date_range("2000-01-01", periods=4)
foo = xr.DataArray(data, coords=[times, locs], dims=["time", "space"])
print(foo.values)
print(foo.dims)
print(foo.coords)
print(foo.attrs)
print(foo.name)
"""
[[0.90006724 0.52320969 0.54733588]
[0.1018204 0.65948629 0.52749082]
[0.73632638 0.50559387 0.57415316]
[0.24010929 0.52222784 0.39407124]]
('time', 'space')
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) <U2 'IA' 'IL' 'IN'
{}
None
"""
foo.name = "foo"
foo.attrs["units"] = "meters"
foo
"""
<xarray.DataArray 'foo' (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
[0.897, 0.377, 0.336],
[0.451, 0.84 , 0.123],
[0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) <U2 'IA' 'IL' 'IN'
Attributes:
units: meters
"""
foo.rename("bar")
"""
<xarray.DataArray 'bar' (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
[0.897, 0.377, 0.336],
[0.451, 0.84 , 0.123],
[0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) <U2 'IA' 'IL' 'IN'
Attributes:
units: meters
"""
DataArray Coordinates
foo.coords["time"]
"""
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
'2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
"""
foo["time"]
"""
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
'2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
"""
foo["ranking"] = ("space", [1, 2, 3])
foo.coords
"""
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) <U2 'IA' 'IL' 'IN'
ranking (space) int64 1 2 3
"""
del foo["ranking"]
foo.coords
"""
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) <U2 'IA' 'IL' 'IN'
"""
Dataset
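Dataset is xarray's multi-dimensional equivalent of a DataFrame. It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the netCDF file format.
In addition to the dict-like interface of the dataset itself, which can be used to access any variable in a dataset, datasets have four key properties:
- dims: a dictionary mapping from dimension names to the fixed length of each dimension (e.g., {'x': 6, 'y': 6, 'time': 8})
- data_vars: a dict-like container of DataArrays corresponding to variables
- coords: another dict-like container of DataArrays intended to label points used in data_vars (e.g., arrays of numbers, datetime objects or strings)
- attrs: a dict to hold arbitrary metadata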
Creating a Dataset
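To make a Dataset from scratch, supply dictionaries for any variables (data_vars), coordinates (coords) and attributes (attrs).
- data_vars should be a dictionary with each key as the name of the variable and each value as one of:
  - a DataArray or Variable
  - a tuple of the form (dims, data[, attrs]), which is converted into arguments for Variable
  - a pandas object, which is converted into a DataArray
  - a 1D array or list, which is interpreted as values for a one dimensional coordinate variable along the same dimension as its name
- coords should be a dictionary of the same form as data_vars.
- attrs should be a dictionary.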
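The accepted value forms listed above can be tried side by side. The following is a small sketch under stated assumptions: the variable names a, b, c and the dataset name small are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

x = [0, 1, 2]
small = xr.Dataset(
    data_vars={
        # a DataArray
        "a": xr.DataArray(np.arange(3.0), dims="x"),
        # a (dims, data, attrs) tuple
        "b": ("x", [10, 20, 30], {"units": "m"}),
        # a pandas object; its named index supplies the dimension name
        "c": pd.Series([1.5, 2.5, 3.5], index=pd.Index(x, name="x")),
    },
    # a 1D list becomes a coordinate along the dimension of the same name
    coords={"x": x},
    attrs={"title": "illustrative dataset"},
)
print(small)
```

All three variables end up sharing the single dimension x, which is what allows them to live in one Dataset.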
temp = 15 + 8 * np.random.randn(2, 2, 3)
precip = 10 * np.random.rand(2, 2, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]
ds = xr.Dataset(
{
"temperature": (["x", "y", "time"], temp),
"precipitation": (["x", "y", "time"], precip),
},
coords={
"lon": (["x", "y"], lon),
"lat": (["x", "y"], lat),
"time": pd.date_range("2014-09-06", periods=3),
"reference_time": pd.Timestamp("2014-09-05"),
},
)
ds
"""
<xarray.Dataset>
Dimensions: (x: 2, y: 2, time: 3)
Coordinates:
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
lat (x, y) float64 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
temperature (x, y, time) float64 11.04 23.57 20.77 ... 6.301 9.61 15.91
precipitation (x, y, time) float64 5.904 2.453 3.404 ... 3.435 1.709 3.947
"""
Dataset contents
"temperature" in ds
True
ds["temperature"]
"""
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[11.041, 23.574, 20.772],
[ 9.346, 6.683, 17.175]],
[[11.6 , 19.536, 17.21 ],
[ 6.301, 9.61 , 15.909]]])
Coordinates:
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
lat (x, y) float64 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
"""
ds.data_vars
"""
Data variables:
temperature (x, y, time) float64 11.04 23.57 20.77 ... 6.301 9.61 15.91
precipitation (x, y, time) float64 5.904 2.453 3.404 ... 3.435 1.709 3.947
"""
ds.coords
"""
Coordinates:
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
lat (x, y) float64 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
"""
ds.attrs["title"] = "example attribute"
ds
"""
<xarray.Dataset>
Dimensions: (x: 2, y: 2, time: 3)
Coordinates:
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
lat (x, y) float64 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
temperature (x, y, time) float64 11.04 23.57 20.77 ... 6.301 9.61 15.91
precipitation (x, y, time) float64 5.904 2.453 3.404 ... 3.435 1.709 3.947
Attributes:
title: example attribute
"""
Dictionary like methods
ds = xr.Dataset()
ds["temperature"] = (("x", "y", "time"), temp)
ds["temperature_double"] = (("x", "y", "time"), temp * 2)
ds["precipitation"] = (("x", "y", "time"), precip)
ds.coords["lat"] = (("x", "y"), lat)
ds.coords["lon"] = (("x", "y"), lon)
ds.coords["time"] = pd.date_range("2014-09-06", periods=3)
ds.coords["reference_time"] = pd.Timestamp("2014-09-05")
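The assignments above use standard dictionary syntax on the dataset and on its coords. As a rough sketch of a few other mapping-style operations (membership tests, update and del), assuming a small throwaway dataset named demo that is not part of the original example:

```python
import numpy as np
import xarray as xr

demo = xr.Dataset()
demo["a"] = ("x", np.array([1, 2, 3]))          # add a variable by key
demo.update({"b": ("x", np.array([4, 5, 6]))})  # add several at once, in place
print(list(demo.data_vars))  # ['a', 'b']
print("a" in demo)           # True

del demo["a"]                # remove a variable by key
print(list(demo.keys()))     # ['b']
```

These calls modify demo in place, in contrast to the transforming methods, which return new objects.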
Transforming datasets
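In addition to dictionary-like methods (described above), xarray has additional methods (like pandas) for transforming datasets into new objects.
To remove variables, you can either select an explicit list of variables by indexing with a list of names, or use the drop_vars() method, which returns a new Dataset. Both operations keep the coordinates: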
ds[["temperature"]]
"""
<xarray.Dataset>
Dimensions: (x: 2, y: 2, time: 3)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
temperature (x, y, time) float64 11.04 23.57 20.77 ... 6.301 9.61 15.91
"""
ds[["temperature", "temperature_double"]]
"""
<xarray.Dataset>
Dimensions: (x: 2, y: 2, time: 3)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
temperature (x, y, time) float64 11.04 23.57 20.77 ... 9.61 15.91
temperature_double (x, y, time) float64 22.08 47.15 41.54 ... 19.22 31.82
"""
ds.drop_vars("temperature")
"""
<xarray.Dataset>
Dimensions: (x: 2, y: 2, time: 3)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
temperature_double (x, y, time) float64 22.08 47.15 41.54 ... 19.22 31.82
precipitation (x, y, time) float64 5.904 2.453 3.404 ... 1.709 3.947
"""
To remove a dimension, you can use the drop_dims() method. Any variables using that dimension are dropped:
ds.drop_dims("time")
"""Out[58]:
<xarray.Dataset>
Dimensions: (x: 2, y: 2)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
*empty*
"""
As an alternative to dictionary-like modifications, you can use assign() and assign_coords(). These methods return a new dataset with additional (or replaced) values:
ds.assign(temperature2=2 * ds.temperature)
"""Out[59]:
<xarray.Dataset>
Dimensions: (x: 2, y: 2, time: 3)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
temperature (x, y, time) float64 11.04 23.57 20.77 ... 9.61 15.91
temperature_double (x, y, time) float64 22.08 47.15 41.54 ... 19.22 31.82
precipitation (x, y, time) float64 5.904 2.453 3.404 ... 1.709 3.947
temperature2 (x, y, time) float64 22.08 47.15 41.54 ... 19.22 31.82"""
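The example above covers assign(); assign_coords() behaves the same way for coordinates. A minimal sketch, assuming a small stand-in dataset (ds_demo, with_time and with_kelvin are hypothetical names used only here), that also shows the original object is left unchanged:

```python
import numpy as np
import pandas as pd
import xarray as xr

ds_demo = xr.Dataset(
    {"temperature": (("x", "time"), np.arange(6.0).reshape(2, 3))}
)

# assign_coords returns a new dataset; ds_demo itself is not modified
with_time = ds_demo.assign_coords(time=pd.date_range("2014-09-06", periods=3))

# assign also accepts a callable, which is evaluated on the dataset
with_kelvin = with_time.assign(temperature_k=lambda d: d["temperature"] + 273.15)

print("time" in ds_demo.coords)      # False
print("time" in with_time.coords)    # True
print(list(with_kelvin.data_vars))   # ['temperature', 'temperature_k']
```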
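There is also the pipe() method, which lets you apply an external function through a method call (e.g., ds.pipe(func)) instead of calling the function directly (e.g., func(ds)). This allows you to write pipelines for transforming your data (using "method chaining") instead of writing hard-to-follow nested function calls:
import matplotlib.pyplot as plt
# these lines are equivalent, but with pipe we can make the logic flow
# entirely from left to right
plt.plot((2 * ds.temperature.sel(x=0)).mean("y"))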
#Out[60]: [<matplotlib.lines.Line2D at 0x7f6e862a7460>]
(ds.temperature.sel(x=0).pipe(lambda x: 2 * x).mean("y").pipe(plt.plot))
#Out[61]: [<matplotlib.lines.Line2D at 0x7f6e862a6230>]
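The same equivalence can be checked without plotting. This is a sketch only: to_fahrenheit is a hypothetical external function, and the small temperature array is made up for the illustration.

```python
import numpy as np
import xarray as xr


def to_fahrenheit(celsius):
    """Hypothetical external function applied to an xarray object."""
    return celsius * 9 / 5 + 32


temperature = xr.DataArray(
    np.array([[10.0, 20.0, 30.0], [0.0, 5.0, 15.0]]),
    dims=("x", "time"),
    name="temperature",
)

# nested call: reads inside-out
nested = to_fahrenheit(temperature.isel(x=0)).mean("time")

# method chaining with pipe: reads left to right
chained = temperature.isel(x=0).pipe(to_fahrenheit).mean("time")

print(float(nested), float(chained))  # 68.0 68.0
```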
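With xarray, there is no performance penalty for creating new datasets, even if variables are lazily loaded from a file on disk. Creating new objects instead of mutating existing objects often results in easier to understand code, so we encourage using this approach.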
Renaming variables
ds.rename({"temperature": "temp", "precipitation": "precip"})
"""Out[62]:
<xarray.Dataset>
Dimensions: (x: 2, y: 2, time: 3)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
temp (x, y, time) float64 11.04 23.57 20.77 ... 9.61 15.91
temperature_double (x, y, time) float64 22.08 47.15 41.54 ... 19.22 31.82
precip (x, y, time) float64 5.904 2.453 3.404 ... 1.709 3.947"""
Coordinates
Coordinates are ancillary variables stored for DataArray and Dataset objects in the coords attribute:
ds.coords["day"] = ("time", [6, 7, 8])
ds.coords
"""
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8"""
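- dimension coordinates are one-dimensional coordinates with a name equal to their sole dimension (marked by * when printing a dataset or data array). They are used for label-based indexing and alignment, like the index found on a pandas DataFrame or Series. Indeed, these "dimension" coordinates use a pandas.Index internally to store their values.
- non-dimension coordinates are variables that contain coordinate data, but are not a dimension coordinate. They can be multidimensional (see Working with Multidimensional Coordinates), and there is no relationship between the name of a non-dimension coordinate and the name(s) of its dimension(s). Non-dimension coordinates can be useful for indexing or plotting; otherwise, xarray does not make any direct use of the values associated with them. They are not used for alignment or automatic indexing, nor are they required to match when doing arithmetic (see Coordinates).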
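The difference between the two kinds of coordinates can be observed directly. The following sketch assumes a small array with a time dimension coordinate and a day non-dimension coordinate; da, other and total are names introduced only for this illustration.

```python
import pandas as pd
import xarray as xr

times = pd.date_range("2014-09-06", periods=3)
da = xr.DataArray(
    [10.0, 20.0, 30.0],
    dims="time",
    coords={"time": times, "day": ("time", [6, 7, 8])},
)

# the dimension coordinate is backed by a pandas index
print(type(da.indexes["time"]).__name__)  # DatetimeIndex
# the non-dimension coordinate has no index of its own
print("day" in da.indexes)                # False

# label-based selection goes through the dimension coordinate
print(da.sel(time=pd.Timestamp("2014-09-07")).item())  # 20.0

# non-dimension coordinates need not match in arithmetic;
# here the conflicting "day" values are dropped from the result
other = da.assign_coords(day=("time", [1, 2, 3]))
total = da + other
print(total.values)            # [20. 40. 60.]
print("day" in total.coords)   # False
```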
Modifying coordinates
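To entirely add or remove coordinate arrays, you can use dictionary-like syntax, as shown above. To convert back and forth between data and coordinates, you can use the set_coords() and reset_coords() methods: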
ds.reset_coords()
'''
<xarray.Dataset>
Dimensions: (x: 2, y: 2, time: 3)
Coordinates:
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Dimensions without coordinates: x, y
Data variables:
temperature (x, y, time) float64 11.04 23.57 20.77 ... 9.61 15.91
temperature_double (x, y, time) float64 22.08 47.15 41.54 ... 19.22 31.82
precipitation (x, y, time) float64 5.904 2.453 3.404 ... 1.709 3.947
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
'''
ds.set_coords(["temperature", "precipitation"])
'''
<xarray.Dataset>
Dimensions: (x: 2, y: 2, time: 3)
Coordinates:
temperature (x, y, time) float64 11.04 23.57 20.77 ... 9.61 15.91
precipitation (x, y, time) float64 5.904 2.453 3.404 ... 1.709 3.947
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
Dimensions without coordinates: x, y
Data variables:
temperature_double (x, y, time) float64 22.08 47.15 41.54 ... 19.22 31.82
'''
ds["temperature"].reset_coords(drop=True)
'''
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[11.041, 23.574, 20.772],
[ 9.346, 6.683, 17.175]],
[[11.6 , 19.536, 17.21 ],
[ 6.301, 9.61 , 15.909]]])
Coordinates:
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Dimensions without coordinates: x, y'''
Coordinates methods
Coordinates objects also have a few useful methods, mostly for converting them into dataset objects:
ds.coords.to_dataset()
'''
<xarray.Dataset>
Dimensions: (x: 2, y: 2, time: 3)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
Dimensions without coordinates: x, y
Data variables:
*empty*'''
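The merge method is particularly interesting, because it implements the same logic used for merging coordinates in arithmetic operations (see Computation):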
alt = xr.Dataset(coords={"z": [10], "lat": 0, "lon": 0})
ds.coords.merge(alt.coords)
'''
<xarray.Dataset>
Dimensions: (time: 3, z: 1)
Coordinates:
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
* z (z) int64 10
Data variables:
*empty*'''
Indexes
To convert a coordinate (or any DataArray) into an actual pandas.Index, use the to_index() method:
ds["time"].to_index()
#Out[72]: DatetimeIndex(['2014-09-06', '2014-09-07', '2014-09-08'], dtype='datetime64[ns]', name='time', freq='D')
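A useful shortcut is the indexes property (on both DataArray and Dataset), which lazily constructs a dictionary whose keys are given by each dimension and whose values are Index objects: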
ds.indexes
'''
Indexes:
time DatetimeIndex(['2014-09-06', '2014-09-07', '2014-09-08'], dtype='datetime64[ns]', name='time', freq='D')'''
MultiIndex coordinates
midx = pd.MultiIndex.from_arrays(
[["R", "R", "V", "V"], [0.1, 0.2, 0.7, 0.9]], names=("band", "wn")
)
mda = xr.DataArray(np.random.rand(4), coords={"spec": midx}, dims="spec")
mda
'''
<xarray.DataArray (spec: 4)>
array([0.642, 0.275, 0.462, 0.871])
Coordinates:
* spec (spec) object MultiIndex
* band (spec) object 'R' 'R' 'V' 'V'
* wn (spec) float64 0.1 0.2 0.7 0.9'''
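For convenience, multi-index levels are directly accessible as "virtual" or "derived" coordinates (marked by - when printing a dataset or data array in older xarray releases; the output shown here marks them with *):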
mda["band"]
'''
<xarray.DataArray 'band' (spec: 4)>
array(['R', 'R', 'V', 'V'], dtype=object)
Coordinates:
* spec (spec) object MultiIndex
* band (spec) object 'R' 'R' 'V' 'V'
* wn (spec) float64 0.1 0.2 0.7 0.9'''
mda.wn
'''
<xarray.DataArray 'wn' (spec: 4)>
array([0.1, 0.2, 0.7, 0.9])
Coordinates:
* spec (spec) object MultiIndex
* band (spec) object 'R' 'R' 'V' 'V'
* wn (spec) float64 0.1 0.2 0.7 0.9'''
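Because the levels are exposed as coordinates, they can also be used for label-based selection. The sketch below is an assumption-laden variation on the example above: it builds the multi-index with stack() from a small 2x2 grid (grid, stacked and r_only are hypothetical names) rather than passing a pandas.MultiIndex to the constructor.

```python
import numpy as np
import pandas as pd
import xarray as xr

grid = xr.DataArray(
    np.arange(4.0).reshape(2, 2),
    dims=("band", "wn"),
    coords={"band": ["R", "V"], "wn": [0.1, 0.2]},
)

# stack the two dimensions into one multi-indexed dimension "spec"
stacked = grid.stack(spec=("band", "wn"))
print(isinstance(stacked.indexes["spec"], pd.MultiIndex))  # True

# each level is available as a coordinate
print(stacked["band"].values)  # ['R' 'R' 'V' 'V']

# select by a single level of the multi-index
r_only = stacked.sel(band="R")
print(r_only.values)  # [0. 1.]
```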