Spatial data modelling
In real-world datasets, particularly in climate and environmental problems, we often have measurements at only a finite set of locations (in space and/or time), and need to predict the output at unmeasured locations (for example, at the next several time steps, or across a spatial region).
We can approach the spatial problem by constructing a fine grid over the area in question and then predicting at each point on the grid, interpolating between the observed data points. If we have a set of explanatory/predictor variables at the locations of the measurements and for each point of the grid, then we could use a regression model to produce predictions. Our model should also provide a measure of uncertainty around any predictions, as we don’t observe the desired output at all possible locations.
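The grid-and-regression idea above can be sketched in a few lines of R. This is only an illustration on simulated data: the locations, the predictor called covariate, and the coefficients are all made up, and an ordinary linear model stands in for whatever regression model is eventually chosen.

```r
# Simulated example: all data below are made up for illustration.
set.seed(1)

# "Measurements" at 50 random locations in the unit square
n <- 50
obs <- data.frame(x = runif(n), y = runif(n))
obs$covariate <- obs$x + obs$y   # a predictor assumed known everywhere
obs$z <- 2 - 1.5 * obs$covariate + rnorm(n, sd = 0.3)

# Fine prediction grid over the same area, with the predictor at each point
grid <- expand.grid(x = seq(0, 1, length.out = 40),
                    y = seq(0, 1, length.out = 40))
grid$covariate <- grid$x + grid$y

# Fit a linear regression and predict at every grid point,
# using 95% prediction intervals as the measure of uncertainty
fit <- lm(z ~ covariate, data = obs)
pred <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)
grid <- cbind(grid, pred)   # adds the columns fit, lwr and upr

head(grid)
```

Note that these intervals rely on the usual linear-model assumption of independent errors, which is exactly the kind of assumption that may not hold for spatial data.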
In this course, we’ll learn and apply methods for handling spatial and temporal measurement data. In this practical, we’ll consider a spatial dataset and start with methods we already know: we’ll attempt to fit linear models to the spatial data, identify any issues with this approach, and so motivate the need to go beyond such models, as we will later in the course.
The Meuse dataset
We’re going to look at the meuse dataset, contained in the sp package. From its description (in ?meuse): “This data set gives locations and topsoil heavy metal concentrations, along with a number of soil and landscape variables at the observation locations, collected in a flood plain of the river Meuse, near the village of Stein (NL). Heavy metal concentrations are from composite samples of an area of approximately 15m x 15m.”
Let’s load in the data, and do some initial data analysis:
# Load packages and data
library(sp)
library(ggplot2)
data(meuse)
head(meuse)
# Rescale the coordinates from metres to kilometres
meuse$x <- meuse$x / 1000
meuse$y <- meuse$y / 1000
summary(meuse)
We have (x, y) locations (in some non-standard coordinate system, we won’t worry about this), and want to model the soil concentration of various metals (cadmium, copper, lead, zinc) in $\mathrm{mg\,kg^{-1}}$. The dataset also contains some other variables which may be useful for prediction, such as elevation, and two measures of distance to the river (dist, dist.m). We’ll use the normalized version, dist, as this choice will come in useful later on. In all that follows, we’ll be attempting to model zinc concentration.
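Before modelling, it can help to look briefly at the response and at the two distance measures. The snippet below is a minimal sketch of such a check, not part of the original analysis; it reloads the data so that it runs on its own.

```r
library(sp)
data(meuse)

# Response variable: topsoil zinc concentration
summary(meuse$zinc)

# Two measures of distance to the river:
# dist is the normalized version, dist.m is in metres
range(meuse$dist)
range(meuse$dist.m)

# Rank correlation between the two distance measures
cor(meuse$dist, meuse$dist.m, method = "spearman")
```

If the two measures are strongly rank-correlated, they carry much the same information, so using only dist in what follows loses little.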
A sensible place to start is plotting the spatial locations of the points, to see the domain we’re working with (we’ll generally be using ggplot2, but everything here could be done similarly with base graphics):