Getting Started with Wireless AI Calls for a Good Dataset
Anyone working on wireless AI has probably run into the same problem: wireless datasets are hard to obtain and hard to use. Today I will introduce a wireless AI dataset developed in China, with a hands-on, beginner-friendly walkthrough. It is well suited for people just getting into wireless AI, and I hope it helps your research!
The dataset is called WAIR-D (Wireless AI Research Dataset), a joint research effort of Huawei and Zhejiang University, formerly named DoraSet. It is generated from real-world maps from around the globe and contains two scenario types: one with five base stations and up to 30 users per environment, and one with a single base station and up to 10,000 users per environment, corresponding to sparse and dense user drops respectively, as shown below:
Internal structure of the dataset
After downloading the dataset from the link at the end of this article and extracting it, you get the following file structure:
The download consists of two archives, DoraSet_code.rar and DoraSet_data.rar, containing the task files and the channel data respectively. The former holds the code for a wireless AI positioning task (runnable with one click); the latter contains a data folder plus a few necessary Python files.
Let's start with the data folder. It holds two scenarios, each with many subfolders; each subfolder contains the corresponding environment image and the channel files (saved in .npy format). When you run generator.py to generate custom channels, the resulting data is also saved under the data folder.
Next, the tasks folder. It contains a wireless AI positioning task that estimates a user's location from the wireless channel and related information: dataset.py implements the Dataloader for reading the channel data, model.py defines the neural network to be trained, and train.py contains the training and validation code.
Finally, the remaining files: generator.py generates the channels with a single run; parameters.py configures the channel generation, with many parameters to tune; utils.py provides utility functions and the required APIs.
Parameter walkthrough and a wireless positioning example
parameters.py is shown in the listing below.
numCores is the number of CPU cores to use; if your machine is short on memory, lower this value;
carrierFreq is the carrier frequency; '28_0' stands for 28.0 GHz;
BWGHz is the bandwidth; 0.04608 stands for 46.08 MHz;
subcarriers and carrierSampleInterval are OFDM-related: the former is the number of generated OFDM subcarriers and the latter is the sampling interval, so the actual number of sampled subcarriers, sampledCarriers, is their integer quotient (384 / 6 = 64);
Nt is the number of transmit antennas along the three coordinate axes, and Nr is the number of receive antennas along the three axes;
spacing_t is the transmit antenna spacing in wavelengths, and spacing_r is the receive antenna spacing in wavelengths;
saveAsArray controls whether the channel is saved to disk in .npy format, and saveAsImage whether it is saved as an image; in the listing below the default is to save as images, which is the more space-efficient option;
maxPathNum caps the number of multipath components per BS-UE link;
scenario selects which scenario to generate;
scenarioFolder is the folder holding the scenario files, and generatedFolder is where the generated channel data is stored; the generated folder's name encodes the parameter settings of the saved channels;
ENVnum is the number of environments to pick, BSlist the list of selected base-station IDs, and UElist the list of selected user IDs.
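As a quick standalone check of the subcarrier arithmetic described above (the numbers are the defaults from the listing that follows):

```python
subcarriers = 384             # defaults from parameters.py
carrierSampleInterval = 6
BWGHz = 0.04608

# every 6th subcarrier is kept for deep learning
sampledCarriers = int(subcarriers / carrierSampleInterval)
print(sampledCarriers)                 # 64
print(round(BWGHz * 1000, 2))          # bandwidth in MHz: 46.08
```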
'''
Copyright (C) 2021. Huawei Technologies Co., Ltd.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
'''
import warnings
import numpy as np
import multiprocessing
from PIL import Image
warnings.filterwarnings('ignore')
numCores = multiprocessing.cpu_count() # number of cores to do generation job, decrease this number if your computer has less memory or available cores
# parameters for generating dataset
carrierFreq = '28_0' # for example, 2_6 for 2.6G, 60_0 for 60.0G
BWGHz = 0.04608 # bandwidth in GHz
subcarriers = 384 # number of subcarriers
carrierSampleInterval = 6 # sample subcarriers with this interval to save computation time
sampledCarriers = int(subcarriers / carrierSampleInterval) # number of sampled subcarriers for deep learning
Nt = [32, 1, 1] # BS antenna array in [x,y,z] axis, e.g., [1, 8, 8], [1, 32, 4]
Nr = [2, 1, 1] # UE antenna array in [x,y,z] axis, e.g., [2, 2, 1], [4, 2, 1]
spacing_t = [0.5, 0.5, 0.5] # transmitter antenna spacing in wavelength
spacing_r = [0.5, 0.5, 0.5] # receiver antenna spacing in wavelength
Pattern_t = {'Power': 0} # omni antenna type for default, transmitter power 0 dBm
Basis_t = np.eye(3) # antenna basis rotation, no rotation for default
Basis_r = np.eye(3) # antenna basis rotation, no rotation for default
saveAsArray = False # save channel as numpy array if True
saveAsImage = True # save channel as image if True
maxPathNum = 1000 # should be >0, max Path number for every BS-UE link, a large number such as 1000 means no limits
scenario = 1 # select a scenario to generate channel, the detailed description of scenarios are listed below
scenarioFolder = f'data/scenario_{scenario}/' # folder of scenario
generatedFolder = f'data/generated_{scenario}_{carrierFreq}_{maxPathNum}_{Nt[0]}_{Nt[1]}_{Nt[2]}_{Nr[0]}_{Nr[1]}_{Nr[2]}_{int(BWGHz * 1000)}_{sampledCarriers}/'
if scenario == 1:
    # scenario_1: sparse UE drop in lots of environments
    # max 10000 envs, 5 BS and 30 UE drops can be selected for every environment
    ENVnum = 1000  # number of environments to pick, max is 10000
    BSlist = list(range(5))  # BS index range 0~4 per environment, e.g., [0] picks BS_0, [2,4] picks BS_2 and BS_4
    UElist = list(range(30))  # UE index range 0~29 per environment, e.g., [0] picks UE_0, [2,17,26] picks UE_2, UE_17 and UE_26
    BSnum = len(BSlist)  # number of BS per environment, max is 5
    UEnum = len(UElist)  # number of UE per environment, max is 30
elif scenario == 2:
    # scenario_2: dense UE drop in some environments
    # max 100 envs, 1 BS and 10000 UE drops can be selected for every environment
    ENVnum = 30  # number of environments to pick, max is 100
    BSlist = list(range(1))  # BS index range 0~0 per environment, e.g., [0] picks BS_0
    UElist = list(range(10000))  # UE index range 0~9999 per environment, e.g., [0] picks UE_0, [2,170,2600] picks UE_2, UE_170 and UE_2600
    BSnum = len(BSlist)  # number of BS per environment, max is 1
    UEnum = len(UElist)  # number of UE per environment, max is 10000
else:
    # raising a bare string is invalid in Python 3; raise a proper exception instead
    raise NotImplementedError('More scenarios are in preparation.')
With parameters.py configured as above, running generator.py produces the channel-data folder we need under data, its name starting with generated. It contains 1000 environments, each consisting entirely of images: 30 images per environment, one per user. Here is a screenshot from when generation finishes:
Now switch to the tasks folder and try running train.py. I used Python 3.11, torch 2.0.1, and numpy 1.23.5. One pitfall I hit: a numpy version that is too new makes the code error out immediately, while the versions of the other packages do not seem to matter much.
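The crash on newer NumPy comes from dataset.py calling astype(np.int): the np.int alias was removed in NumPy 1.24. If you want to stay on a recent NumPy, replacing np.int with the builtin int is equivalent here:

```python
import numpy as np

a = np.array([720.0, 540.0])   # e.g., a map size in pixels
print(a.astype(int))           # works on every NumPy version
# on NumPy >= 1.24, a.astype(np.int) raises AttributeError instead
```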
Looking at the main function in train.py, the positioning task uses the first 900 of the 1000 generated cases as the training set and the last 100 as the validation set.
def main():
    seed_everything(42)
    valid_dataset = DoraSet(cases=100, start=900, set='valid')
    valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=batchSize, shuffle=False,
                                               num_workers=num_workers)
    imageSize = valid_dataset.imageSize
    device = torch.device(cudaIdx if torch.cuda.is_available() else "cpu")
    model = DoraNet().to(device)
    criterion = torch.nn.MSELoss().to(device)
    Valid = []
    if not evaluation:
        train_dataset = DoraSet(cases=900, start=0, set='train')
        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batchSize, shuffle=True,
                                                   num_workers=num_workers)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        Train = []
        runCase = f'runCase_{casesToTrain}'
        makeDIRs(runCase)
    else:
        model.load_state_dict(torch.load(f'models{loadRUNcase}'))
        optimizer = []
    for epoch in range(1, epochs + 1):
        with torch.no_grad():
            lossV, envImg, p_UEloc, UEloc, BSloc = run(False, valid_loader, model, criterion, [], epoch, device,
                                                       imageSize)
        Valid.append(lossV)
        if not evaluation:
            lossT, _, _, _, _ = run(True, train_loader, model, criterion, optimizer, epoch, device, imageSize)
            Train.append(lossT)
            checkPoint(runCase, epoch, epochs, model, Train, Valid, saveModelInterval, saveLossInterval)
        else:
            break
    showResults(envImg, p_UEloc, UEloc, BSloc, 0, imageSize)
In dataset.py, find the DoraSet class and look at its __getitem__ method. It returns nine values: envImg, distance, channel, BSloc, UEloc, sight, scale, angle, gain. These are the environment image, the user-to-BS distance, the channel, the BS location, the UE location, a line-of-sight flag, a scaling factor, the angle, and the gain. The positioning task uses a subset of these to predict the user location UEloc.
def __getitem__(self, idx):
    curEnv = idx // (BSnum * UEnum)
    curLink = idx % (BSnum * UEnum)
    bsIdx = curLink // UEnum
    ueIdx = curLink % UEnum
    envImg = self.envs[curEnv]
    linklocA = self.linklocAs[curEnv]
    ImgSize = linklocA[:2].numpy().astype(int)  # np.int was removed in NumPy 1.24; the builtin int is equivalent
    if scenario == 1:
        linklocA = torch.reshape(linklocA[2:], (150, 4))
        BSloc = linklocA[::30, :2][BSlist[bsIdx], :]
        UEloc = linklocA[UElist[ueIdx], 2:]
    else:
        BSloc = linklocA[2:4]
        linklocA = torch.reshape(linklocA[4:], (10000, 2))
        UEloc = linklocA[UElist[ueIdx], :]
    distance = self.distances[curEnv][0:1, BSlist[bsIdx] * 30 + UElist[ueIdx]]
    gain = self.gains[curEnv][0:1, BSlist[bsIdx] * 30 + UElist[ueIdx]]
    angle = self.angles[curEnv][0, BSlist[bsIdx] * 30 + UElist[ueIdx]]
    channel = readChannel(readImage, generatedFolder, self.paths[curEnv], BSlist[bsIdx], UElist[ueIdx], scenario)
    channel = channel / np.max(np.abs(channel))
    channel = torch.FloatTensor(np.concatenate((np.real(channel)[None, :, :], np.imag(channel)[None, :, :]), axis=0))
    sight = self.sights[curEnv][0:1, BSlist[bsIdx] * 30 + UElist[ueIdx]]
    scale = torch.FloatTensor([imageSize / np.max(ImgSize)])
    return envImg, distance, channel, BSloc, UEloc, sight, scale, angle, gain
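The first four lines of __getitem__ decompose the flat dataset index into (environment, BS, UE). A standalone check of that arithmetic with scenario 1's defaults (BSnum = 5, UEnum = 30, so 150 links per environment):

```python
BSnum, UEnum = 5, 30  # scenario 1 defaults

def decompose(idx):
    # same arithmetic as DoraSet.__getitem__
    curEnv = idx // (BSnum * UEnum)
    curLink = idx % (BSnum * UEnum)
    bsIdx = curLink // UEnum
    ueIdx = curLink % UEnum
    return curEnv, bsIdx, ueIdx

print(decompose(0))    # (0, 0, 0): first environment, BS_0, UE_0
print(decompose(157))  # (1, 0, 7): second environment, BS_0, UE_7
```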
At the end of training, the loss comes down to about 0.0001, which corresponds to a positioning error of only two to three meters.
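As a rough sanity check that an MSE of about 0.0001 is consistent with a 2-3 m error: assuming (my assumption, not stated in the code) the loss is computed on (x, y) coordinates normalized by the map side length, the RMS error scales back up with that side length:

```python
import math

mse = 1e-4                     # final training loss from the run above
rms_norm = math.sqrt(2 * mse)  # RMS over the two (x, y) coordinates
for side_m in (150, 200):      # hypothetical map side lengths in meters
    print(f"map side {side_m} m -> ~{rms_norm * side_m:.1f} m positioning error")
```

With map sides in the 150-200 m range this gives roughly 2-3 m, matching the observed accuracy.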
How to use this dataset in your own research
To combine your own wireless AI task with this dataset, create a new folder inside tasks, copy dataset.py from the positioning folder into the new folder, and add the following to the data-loading code there:
train_dataset = DoraSet(cases=900, start=0, set='train')
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batchSize, shuffle=True, num_workers=num_workers)
where the number passed to cases can be chosen to suit your needs.
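The cases/start arguments select a contiguous range of the generated environments, so a train/validation split is just two non-overlapping ranges. A small sketch of how the numbers fit together (the 90/10 split here is illustrative):

```python
ENVnum = 1000                       # environments generated earlier
train_cases = int(ENVnum * 0.9)     # 900
valid_cases = ENVnum - train_cases  # 100

# DoraSet(cases=..., start=...) reads environments [start, start + cases)
train_args = dict(cases=train_cases, start=0, set='train')
valid_args = dict(cases=valid_cases, start=train_cases, set='valid')
print(train_args)
print(valid_args)
```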
Related resources
Official website:
移动通信开放数据平台 (Mobile Communication Open Data Platform, mobileai-dataset.com)
Official paper:
[2212.02159] WAIR-D: Wireless AI Research Dataset (arxiv.org)
Example applications of WAIR-D: