Inductive Graph Neural Networks for Spatiotemporal Kriging (IGNNK)
This is the code corresponding to the experiments conducted for the AAAI 2021 paper “Inductive Graph Neural Networks for Spatiotemporal Kriging” (Yuankai Wu, Dingyi Zhuang, Aurélie Labbe and Lijun Sun).
Motivations
In many applications, placing sensors with full spatial coverage is impractical. Installation and maintenance costs also limit the number of sensors that can be deployed in a network. A better kriging model achieves higher estimation accuracy and reliability with fewer sensors, thus reducing the operation and maintenance cost of a sensor network. The kriging results provide a fine-grained, high-resolution realization of spatiotemporal data, which can be used to enhance real-world applications such as travel time estimation and disaster evaluation.
A limitation of traditional methods is that they are essentially transductive: when new sensors/nodes are introduced to the network, a previously trained model cannot be applied directly; instead, the full model has to be retrained, even after only minor changes. In contrast, this work develops an Inductive Graph Neural Network Kriging (IGNNK) model.
Tasks
The goal of spatiotemporal kriging is to perform signal interpolation for unsampled locations given the observed signals from sampled locations during the same period. We first randomly select a subset of nodes from all available sensors and create the corresponding subgraph. We then mask some of these nodes as missing and train the GNN to reconstruct the full signals of all nodes (both the observed and the masked nodes) on the subgraph.
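The sampling-and-masking step described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the repository's training code: the function name, the (time x nodes) layout of X, zero-filling of masked nodes, and the reading of n_o, n_m and h as "nodes per subgraph", "masked nodes" and "time window length" are all assumptions made for this example.

```python
import numpy as np

def sample_masked_subgraph(A, X, n_o, n_m, h, rng):
    """Draw one training sample: a random subgraph with some nodes masked.

    A: (N, N) adjacency matrix; X: (T, N) signals (time x nodes).
    n_o: nodes kept in the subgraph; n_m: of those, nodes masked as missing;
    h: length of the time window. All names are hypothetical.
    """
    T, N = X.shape
    nodes = rng.choice(N, size=n_o, replace=False)       # random subset of sensors
    t0 = rng.integers(0, T - h + 1)                       # random start of the window
    target = X[t0:t0 + h][:, nodes]                       # (h, n_o) full signals
    A_sub = A[np.ix_(nodes, nodes)]                       # (n_o, n_o) subgraph adjacency

    mask = np.ones(n_o, dtype=X.dtype)
    mask[rng.choice(n_o, size=n_m, replace=False)] = 0    # 0 marks a masked node
    inputs = target * mask                                # masked nodes are zeroed out
    return inputs, target, A_sub, mask

rng = np.random.default_rng(0)
A = rng.random((20, 20))                                  # toy adjacency matrix
X = rng.random((100, 20)).astype(np.float32) + 1.0        # strictly positive toy signals
inputs, target, A_sub, mask = sample_masked_subgraph(A, X, n_o=10, n_m=3, h=24, rng=rng)
print(inputs.shape, A_sub.shape, int(mask.sum()))
```

A model trained this way would take inputs and A_sub and be penalized on its reconstruction of target for every node in the subgraph, masked or not.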
Datasets
The datasets used in this code can be downloaded from the following locations:
- the METR-LA traffic data: https://github.com/liyaguang/DCRNN
- the NREL solar energy: https://www.nrel.gov/grid/solar-power-data.html
- the USHCN weather condition: https://www.ncdc.noaa.gov/ushcn/introduction
- the SeData traffic data: https://github.com/zhiyongc/Seattle-Loop-Data
- the PeMS traffic data: https://github.com/liyaguang/DCRNN
Dependencies
numpy
pytorch: a Python-based scientific computing library whose main feature is that computations can run on the GPU.
matplotlib
pandas
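scipy: a Python library of algorithms and mathematical tools.
scikit-learn: covers six areas: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.
geopandas: aims to make working with geospatial data in Python easier. It combines the capabilities of pandas and Shapely, providing geospatial operations on pandas objects and a high-level interface to Shapely. With GeoPandas you can do in Python what would otherwise require a spatial database such as PostGIS.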
pytorch
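1. torch:
Tensor operations such as creation, indexing, concatenation, transposition, arithmetic, and slicing.
2. torch.nn:
Modules for building neural network layers, plus a set of loss functions: fully connected layers, convolution, batch normalization, dropout, CrossEntropyLoss, MSELoss, and so on.
3. torch.nn.functional:
Functional forms of common operations, including the activation functions relu, leaky_relu, and sigmoid.
4. torch.autograd:
Automatic differentiation for Tensor operations.
5. torch.optim:
Parameter optimization methods such as SGD, AdaGrad, Adam, and RMSProp.
6. torch.utils.data:
Utilities for loading data.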
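7. torchvision package
torchvision is the PyTorch library dedicated to image processing. Its most commonly used modules are:
torchvision.datasets: loads datasets.
torchvision.models: provides pretrained models that can be loaded and used directly, including AlexNet, VGG, and ResNet.
torchvision.transforms: provides common image transformation classes.
torchvision.utils: helper functions, for example saving a given Tensor as an image file.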
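8. from PIL import Image
Images handled in PyTorch usually come in one of these formats:
PIL: the format produced when reading an image with the PIL (Pillow) image library, a third-party package rather than part of Python itself
numpy: the format produced when reading an image with the python-opencv library
tensor: the tensor format used during training in PyTorch
from PIL import Image imports the PIL image class; converting between PIL images and Tensors, that is, between image formats, is then done with torchvision.transforms.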
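9. matplotlib
A general-purpose Python plotting library and one of the most commonly used visualization tools in Python; it makes it easy to create 2D charts and some basic 3D charts.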
scipy
import numpy as np
import scipy as sp
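SciPy is built on NumPy. Older SciPy releases re-exported many basic NumPy functions in the scipy namespace, but those aliases are deprecated, so call NumPy functions through np instead.
SciPy is a widely used package for mathematics, science, and engineering. It handles interpolation, integration, optimization, image processing, numerical solution of ordinary differential equations, signal processing, and more. It operates efficiently on NumPy arrays, so NumPy and SciPy work together to solve problems efficiently.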
Scikit-learn
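Scikit-learn's functionality falls into six main areas: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.
Classification: identifying the category of a given object. It belongs to supervised learning; the most common applications include image recognition and spam detection. Algorithms implemented in Scikit-learn include support vector machines (SVM), logistic regression, random forests, nearest neighbors, and decision trees.
Regression: predicting a continuous-valued attribute associated with a given object; common applications include predicting stock prices and drug response. Implemented algorithms include support vector regression (SVR), Elastic Net, least-angle regression (LARS), and Bayesian regression.
Clustering: automatically grouping objects with similar attributes into sets. It belongs to unsupervised learning; common applications include customer segmentation and grouping experimental results. Implemented algorithms include K-means, spectral clustering, mean shift, and hierarchical clustering.
Dimensionality reduction: reducing the number of random variables to consider, using techniques such as principal component analysis (PCA), non-negative matrix factorization (NMF), or feature selection. Its main applications are visualization and improved efficiency.
Model selection: comparing, validating, and choosing parameters and models, mainly to improve accuracy through parameter tuning. Implemented modules include grid search and cross-validation.
Data preprocessing: feature extraction and normalization, the first and one of the most important steps in a machine learning pipeline. Normalization covers two distinct operations: standardization, which transforms the input into new variables with zero mean and unit variance, and min-max scaling, which rescales the values into a fixed range, usually 0 to 1. Feature extraction converts text or image data into numerical variables usable for machine learning.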
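The two normalization conventions mentioned under data preprocessing are easy to confuse, so here is a minimal NumPy sketch of both on a toy matrix. It computes by hand what scikit-learn's StandardScaler and MinMaxScaler do with their default settings; the data values are made up for illustration.

```python
import numpy as np

# Toy data: 3 samples, 2 features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Standardization: each column gets zero mean and unit variance
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: each column is mapped linearly onto [0, 1]
col_min, col_max = X.min(axis=0), X.max(axis=0)
min_max = (X - col_min) / (col_max - col_min)

print(standardized.mean(axis=0), standardized.std(axis=0))
print(min_max.min(axis=0), min_max.max(axis=0))
```

Standardized values are not confined to any fixed interval, whereas min-max scaled values always fall between 0 and 1.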
Files
utils.py file: dataset preprocessing utilities;
basic_structure.py file: PyTorch implementation of the basic graph neural network structures
IGNNK_D_METR_LA.ipynb file: a training example on the METR-LA dataset
IGNNK_U_Central_missing.ipynb file: presents the kriging of central US precipitation (USHCN weather condition)
This notebook presents the performance of kriging in a continuous area in the central US.
Basic GNNs implementation (basic_structure.py)
Graph convolutional networks - K_GCN in basic_structure.py
Kipf, Thomas N., and Max Welling. “Semi-Supervised Classification with Graph Convolutional Networks.” (ICLR 2017).
Chebynet - C_GCN
Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. “Convolutional neural networks on graphs with fast localized spectral filtering.” (NIPS 2016).
Diffusion convolutional networks - D_GCN
Li, Y., Yu, R., Shahabi, C., & Liu, Y. “Diffusion convolutional recurrent neural network: Data-driven traffic forecasting.” (ICLR 2018).
Our IGNNK structure is based on the diffusion convolutional networks; you can always build your own structure from these basic building blocks. We will continue implementing more GNN structures that are suitable for kriging tasks.
Graph attention networks - GAT
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. “Graph attention networks.” (ICLR 2018).
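As a rough illustration of the diffusion convolution that IGNNK builds on, the sketch below follows the bidirectional random-walk formulation from the Li et al. paper: the output sums, over K diffusion steps, the features propagated along the forward and backward transition matrices, each with its own weights. It is a dense NumPy simplification and not the D_GCN code in basic_structure.py; the function name, argument names, and weight layout are assumptions made for this example.

```python
import numpy as np

def diffusion_conv(A, X, theta_fwd, theta_bwd):
    """K-step bidirectional diffusion convolution (simplified, dense NumPy).

    A: (N, N) non-negative adjacency matrix with nonzero row and column sums.
    X: (N, F_in) node features.
    theta_fwd, theta_bwd: (K, F_in, F_out) weights, one slice per diffusion step.
    """
    P_fwd = A / A.sum(axis=1, keepdims=True)         # D_out^{-1} A: forward random walk
    P_bwd = A.T / A.T.sum(axis=1, keepdims=True)     # D_in^{-1} A^T: backward random walk
    out = np.zeros((X.shape[0], theta_fwd.shape[2]))
    Z_fwd, Z_bwd = X, X                              # step k = 0: P^0 X = X
    for k in range(theta_fwd.shape[0]):
        out += Z_fwd @ theta_fwd[k] + Z_bwd @ theta_bwd[k]
        Z_fwd, Z_bwd = P_fwd @ Z_fwd, P_bwd @ Z_bwd  # advance one diffusion step
    return out

# Toy example: a 4-node ring, a constant signal, and all-ones weights with K = 3
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.ones((4, 1))
theta = np.ones((3, 1, 1))
out = diffusion_conv(A, X, theta, theta)
print(out.ravel())
```

Because the transition matrices are row-stochastic, a constant signal stays constant under diffusion, which makes this toy case easy to check by hand.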
Training on the METR-LA dataset
You can train IGNNK on METR-LA from the command line with
python IGNNK_train.py "metr" --n_o 150 --h 24 --n_m 50 --n_u 50 --max_iter 750
For other datasets:
NREL
python IGNNK_train.py "nrel" --n_o 100 --h 24 --n_m 30 --n_u 30 --max_iter 750
USHCN
python IGNNK_train.py "ushcn" --n_o 900 --h 6 --n_m 300 --n_u 300 --max_iter 750 --z 350
SeData
python IGNNK_train.py "sedata" --n_o 240 --h 24 --n_m 80 --n_u 80 --max_iter 750
haversine: computes the distance between two latitude/longitude points
from haversine import haversine, Unit
lyon = (45.7597, 4.8422)  # (lat, lon)
paris = (48.8567, 2.3508)
haversine(lyon, paris)
# 392.2172595594006 (kilometers by default)
haversine(lyon, paris, unit=Unit.MILES)
# unit set to miles
# 243.71250609539814
haversine(lyon, paris, unit='mi')
# 243.71250609539814
utils.py
Dataset preprocessing utilities
from __future__ import division
import os
import zipfile
import numpy as np
import scipy.sparse as sp
import pandas as pd
from math import radians, cos, sin, asin, sqrt
import joblib  # sklearn.externals.joblib was removed from scikit-learn; use the standalone package
import scipy.io
import torch
from torch import nn
Read the sensor file to get longitude and latitude
def get_long_lat(sensor_index, loc=None):
    """
    Input a sensor index in 0-206 to access the longitude and latitude of that node
    """
    if loc is None:
        locations = pd.read_csv('D:/PaperCode/IGNNKmaster/data/metr/graph_sensor_locations.csv')
    else:
        locations = loc
    lng = locations['longitude'].loc[sensor_index]  # longitude
    lat = locations['latitude'].loc[sensor_index]  # latitude
    return lng, lat
graph_sensor_locations.csv
pandas.DataFrame.to_numpy(dtype=None, copy=False, na_value=NoDefault.no_default)
# converts the DataFrame to a NumPy array
Compute the great-circle distance between two longitude/latitude pairs
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # Convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    r = 6371  # Earth radius in kilometers
    return c * r * 1000  # distance in meters
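One detail worth checking is that this helper differs from the haversine package in two ways: it takes longitude before latitude, and it returns meters rather than kilometers. The sketch below repeats the function so that it runs on its own and evaluates it on the Lyon and Paris coordinates quoted earlier; the variable names are only for illustration.

```python
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters; arguments in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371 * 1000

lyon = (45.7597, 4.8422)    # (lat, lon), the order used by the haversine package
paris = (48.8567, 2.3508)

# This helper expects (lon, lat) pairs, so the tuple elements are swapped
d_m = haversine(lyon[1], lyon[0], paris[1], paris[0])
print(d_m / 1000)           # distance in kilometers
```

The result is about 392 km, in line with the package's value; the small remaining difference comes from the slightly different Earth radius each uses.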
Load the data
def load_metr_la_rdata():
    # Extract the archive if either data file is missing
    if (not os.path.isfile("D:/PaperCode/IGNNKmaster/data/metr/adj_mat.npy")
            or not os.path.isfile("D:/PaperCode/IGNNKmaster/data/metr/node_values.npy")):
        with zipfile.ZipFile("D:/PaperCode/IGNNKmaster/data/metr/METR-LA.zip", 'r') as zip_ref:
            zip_ref.extractall("D:/PaperCode/IGNNKmaster/data/metr/")
    A = np.load("D:/PaperCode/IGNNKmaster/data/metr/adj_mat.npy")
    X = np.load("D:/PaperCode/IGNNKmaster/data/metr/node_values.npy").transpose((1, 2, 0))
    X = X.astype(np.float32)
    return A, X
transpose((1, 2, 0))
# reorders the axes: (x, y, z) --> (y, z, x)
astype()
# changes the data type
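The effect of these two calls can be seen on a small toy array. The shape below is arbitrary and chosen only for illustration; it says nothing about the actual layout of node_values.npy.

```python
import numpy as np

# Toy array with axes (x, y, z) of sizes (5, 3, 2)
raw = np.arange(30, dtype=np.float64).reshape(5, 3, 2)

# Move axis 0 to the end, then cast to 32-bit floats
X = raw.transpose((1, 2, 0)).astype(np.float32)

print(X.shape, X.dtype)
# Element (i, j, k) of raw ends up at position (j, k, i) of X
print(raw[4, 1, 0], X[1, 0, 4])
```

Note that transpose returns a view with permuted axes, while astype makes a new array with the requested data type.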
Generate the NREL data
def generate_nerl_data():
    # %% Obtain all the file names
    filepath = 'D:/PaperCode/IGNNKmaster/data/nrel/al-pv-2006'
    files = os.listdir(filepath)  # list of files and folders under the given path
    # %% Parse the file names and store them in a pandas DataFrame
    tp = []  # Type
    lat = []  # Latitude
    lng = []  # Longitude
    yr = []  # Year
    pv_tp = []  # PV_type
    cap = []  # Capacity MW
    time_itv = []  # Time interval
    file_names = []
    for _file in files:
        parse = _file.split('_')
        # Keep only the files whose second-to-last name field is '5'
        if parse[-2] == '5':
            tp.append(parse[0])
            lat.append(np.double(parse[1]))
            lng.append(np.double(parse[2]))
            yr.append(int(parse[3]))  # np.int was removed from NumPy; use the built-in int
            pv_tp.append(parse[4])
            cap.append(int(parse[5].split('MW')[0]))
            time_itv.append(parse[6])
            file_names.append(_file)
        else:
            pass
    files_info = pd.DataFrame(
        np.array([tp, lat, lng, yr, pv_tp, cap, time_itv, file_names]).T,
        columns=['type', 'latitude', 'longitude', 'year', 'pv_type', 'capacity', 'time_interval', 'file_name']
    )
    # %% Read the time series into a numpy 2-D array with 137x105120 size
    X = np.zeros((len(files_info), 365 * 24 * 12))
    for i in range(files_info.shape[0]):
        f = filepath + '/' + files_info['file_name'].loc[i]
        d = pd.read_csv(f)
        assert d.shape[0] == 365 * 24 * 12, 'Data missing!'
        X[i, :] = d['Power(MW)']
        print(i / files_info.shape[0] * 100, '%')
    np.save('D:/PaperCode/IGNNKmaster/data/nrel/nerl_X.npy', X)
    files_info.to_pickle('D:/PaperCode/IGNNKmaster/data/nrel/nerl_file_infos.pkl')
    # %% Get the adjacency matrix based on the inverse of distance between two nodes
    A = np.zeros((files_info.shape[0], files_info.shape[0]))
    # shape[0] is the number of rows; shape[1] is the number of columns
    for i in range(files_info.shape[0]):
        for j in range(i + 1, files_info