How to generate a NILMTK-format dataset from your own data

Neighbor_W

Last modified 2022-05-04 13:01:03

First published 2022-05-04 13:00:24

Article link: https://blog.csdn.net/Neighbor_W/article/details/124566941

Contents

Preface
I. Introduction to the NILMTK Dataset
- 1. Time Series Data
- 2. Meta Data
II. Creation steps
Summary

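Preface

My project needed load disaggregation, so my first instinct was to look for open-source projects on GitHub. After a quick survey I found that most projects use datasets in the NILMTK format. In other words, if you want to build on their algorithms, one of the easier routes is to supply a dataset in NILMTK format.
The official tooling only ships conversion scripts for some common load disaggregation datasets (such as ukdale and redd); for your own dataset you have to write the conversion yourself. I searched online and did not find much material on this, so I am recording the conversion process here.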

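I. Introduction to the NILMTK Dataset

NILMTK is a toolkit dedicated to load disaggregation research, and it defines a fairly popular standard dataset format (the NILMTK Dataset). The dataset is stored as an HDF5 file, and since an HDF5 file consists of a data part and a metadata part, the dataset is likewise made up of these two parts.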

1. Time Series Data

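Illustration of the time series data
This part is essentially a table, with timestamps as the index and hierarchical labels as the column names. The first-level column name gives the measured physical quantity (physical_quantity), and the second-level column name gives the type (type). Both take their values from a controlled vocabulary.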
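To make the layout concrete, here is a minimal pandas sketch of such a table. The sample values, the 3-second spacing, and the UTC time zone are made up for illustration; only the two column levels (physical_quantity and type) come from the description above.

```python
import pandas as pd

# Time index: three samples, 3 seconds apart (spacing and time zone are assumptions).
index = pd.date_range(
    "2022-05-04 13:00:00",
    periods=3,
    freq=pd.Timedelta(seconds=3),
    tz="UTC",
    name="time",
)

# Two-level column labels: level 1 is the physical quantity, level 2 is the type.
columns = pd.MultiIndex.from_tuples(
    [("power", "active"), ("power", "apparent")],
    names=["physical_quantity", "type"],
)

# Made-up readings in watts and volt-amperes.
table = pd.DataFrame(
    [[100.0, 110.0], [102.0, 113.0], [98.0, 109.0]],
    index=index,
    columns=columns,
)
print(table)
```

Selecting `table[("power", "apparent")]` then returns the apparent power series on its own.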

2. Meta Data

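Figure: structure of the metadata (Central metadata and Shipped metadata)
As the figure shows, the metadata consists of a relatively stable Central metadata part and a Shipped metadata part that changes with each dataset. I will not go through the meaning of each block here; interested readers can find the details in the NILM Metadata documentation. Overall, the metadata serves two purposes:
1) it indexes the series data, so that a program can use the index to fetch the data it wants;
2) it records factual information about how the dataset was produced, such as the location, the authors, and a description of the acquisition devices.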

II. Creation steps

1. Create metadata.hdf5

1.1 Create the yaml files

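(Note that this hdf5 file is not the final, complete dataset. It is only a skeleton, and the series data still has to be added to it later.)
The official documentation says the metadata can first be written into several yaml files and then converted into an hdf5 file, and it provides a script for this, convert_yaml_to_hdf5.py.
There are three kinds of yaml file: dataset.yaml, meter_devices.yaml, and building{i}.yaml (one per building).
1) An example of dataset.yaml is shown below. It mainly records the dataset name, the authors, the creation date, the purpose of the dataset, and similar information.

dataset.yaml (example):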

name: REDD
long_name: The Reference Energy Disaggregation Data set
creators:
- Kolter, Zico
- Johnson, Matthew
publication_date: 2011
institution: Massachusetts Institute of Technology (MIT)
contact: zkolter@cs.cmu.edu   # Zico moved from MIT to CMU
description: Several weeks of power data for 6 different homes.
subject: Disaggregated power demand from domestic buildings.
number_of_buildings: 6
timezone: US/Eastern   # MIT is on the east coast
geo_location:
  locality: Massachusetts   # village, town, city or state
  country: US   # standard two-letter country code defined by ISO 3166-1 alpha-2
  latitude: 42.360091 # MIT's coordinates
  longitude: -71.09416
related_documents:
- http://redd.csail.mit.edu
- >
  J. Zico Kolter and Matthew J. Johnson.
  REDD: A public data set for energy disaggregation research.
  In proceedings of the SustKDD workshop on
  Data Mining Applications in Sustainability, 2011.
  http://redd.csail.mit.edu/kolter-kddsust11.pdf
schema: https://github.com/nilmtk/nilm_metadata/tree/v0.2

2) meter_devices.yaml records information about the devices used to collect the data.

meter_devices.yaml (example):

eMonitor:
  model: eMonitor
  manufacturer: Powerhouse Dynamics
  manufacturer_url: http://powerhousedynamics.com
  description: >
    Measures circuit-level power demand.  Comes with 24 CTs.
    This FAQ page suggests the eMonitor measures real (active)
    power: http://www.energycircle.com/node/14103  although the REDD
    readme.txt says all channels record apparent power.
  sample_period: 3   # the interval between samples. In seconds.
  max_sample_period: 50   # Max allowable interval between samples. Seconds.
  measurements:
  - physical_quantity: power   # power, voltage, energy, current?
    type: active   # active (real power), reactive or apparent?
    upper_limit: 5000
    lower_limit: 0
  wireless: false

REDD_whole_house:
  description: >
    REDD's DIY power meter used to measure whole-home AC waveforms
    at high frequency.  To quote from their paper: "CTs from TED
    (http://www.theenergydetective.com) to measure current in the
    power mains, a Pico TA041 oscilloscope probe
    (http://www.picotechnologies.com) to measure voltage for one of
    the two phases in the home, and a National Instruments NI-9239
    analog to digital converter to transform both these analog
    signals to digital readings. This A/D converter has 24 bit
    resolution with noise of approximately 70 µV, which determines
    the noise level of our current and voltage readings: the TED CTs
    are rated for 200 amp circuits and a maximum of 3 volts, so we
    are able to differentiate between currents of approximately
    (200)(70 × 10^-6)/(3) = 4.66mA, corresponding to power changes
    of about 0.5 watts. Similarly, since we use a 1:100 voltage
    stepdown in the oscilloscope probe, we can detect voltage
    differences of about 7mV."
  sample_period: 1
  max_sample_period: 30
  measurements:
  - physical_quantity: power
    type: apparent
    upper_limit: 50000
    lower_limit: 0
  wireless: false

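3) building{i}.yaml contains three parts: building instance information, meter instance information, and appliance instance information. The meter instance information is the core part: it holds the hierarchy between the meters and the index under which the corresponding data is stored in the hdf5 file.

building1.yaml (example):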

instance: 1   # this is the first building in the dataset
original_name: house_1   # original name from REDD dataset
elec_meters:
  1:
    site_meter: true
    device_model: REDD_whole_house  # keys into meter_devices dictionary
    data_location: house_1/channel_1.dat
  2:
    site_meter: true
    device_model: REDD_whole_house
    data_location: house_1/channel_2.dat
  3:
    submeter_of: 0 # '0' means 'one of the site_meters'. We don't know
                   # which site meter feeds which appliance in REDD.
    device_model: eMonitor
    data_location: house_1/channel_3.dat
  4:
    submeter_of: 0
    device_model: eMonitor
    data_location: house_1/channel_4.dat

appliances:

- original_name: kitchen_outlets
  room: kitchen
  type: sockets   # sockets is treated as an appliance
  instance: 1
  multiple: true   # likely to be more than 1 socket
  meters: [7]

- original_name: kitchen_outlets
  room: kitchen
  type: sockets
  instance: 2   # 2nd instance of 'sockets' in this building
  multiple: true   # likely to be more than 1 socket
  meters: [8]

- original_name: lighting
  type: light
  instance: 1
  multiple: true   # likely to be more than 1 light
  meters: [9]

- original_name: lighting
  type: light
  instance: 2   # 2nd instance of 'light' in this building
  multiple: true
  meters: [17]

- original_name: lighting
  type: light
  instance: 3   # 3rd instance of 'light' in this building
  multiple: true
  meters: [18]

- original_name: bathroom_gfi   # ground fault interrupter
  room: bathroom
  type: unknown
  instance: 1
  multiple: true
  meters: [12]
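Because the meter entries tie everything together, it is easy to end up with an appliance that points at a meter that was never declared (the shortened example above lists meters 7, 8, 9 and so on under appliances without defining them under elec_meters). The sketch below is a hand-written consistency check on plain Python dicts, shaped like what you would get after loading the yaml files. It is not the validation that nilm_metadata itself performs, and `check_building` is a hypothetical helper name.

```python
def check_building(building, meter_devices):
    """Return a list of human-readable problems found in one building's metadata."""
    problems = []
    meters = building.get("elec_meters", {})

    for meter_id, meter in meters.items():
        # Every meter must reference a device described in meter_devices.
        model = meter.get("device_model")
        if model not in meter_devices:
            problems.append(f"meter {meter_id}: unknown device_model {model!r}")
        # submeter_of is either 0 (one of the site meters) or another meter id.
        parent = meter.get("submeter_of")
        if parent is not None and parent != 0 and parent not in meters:
            problems.append(f"meter {meter_id}: submeter_of {parent} is not defined")

    for appliance in building.get("appliances", []):
        # Every meter listed by an appliance must exist under elec_meters.
        for meter_id in appliance.get("meters", []):
            if meter_id not in meters:
                label = f"{appliance.get('type')},{appliance.get('instance')}"
                problems.append(f"appliance {label}: meter {meter_id} is not defined")
    return problems


# Illustrative data only, loosely modelled on the example files above.
meter_devices = {"eMonitor": {}, "REDD_whole_house": {}}
building = {
    "instance": 1,
    "elec_meters": {
        1: {"site_meter": True, "device_model": "REDD_whole_house"},
        3: {"submeter_of": 0, "device_model": "eMonitor"},
    },
    "appliances": [
        {"type": "light", "instance": 1, "meters": [3]},
        {"type": "sockets", "instance": 1, "meters": [7]},
    ],
}
problems = check_building(building, meter_devices)
for problem in problems:
    print(problem)
```

Here the only reported problem is the sockets appliance, whose meter 7 has no entry under elec_meters.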

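1.2 Run the conversion script

Note that to use the script above, the yaml file names, their internal format, and the allowed field values must all follow certain rules (simple examples can be found in the documentation). Otherwise the script reports errors during its validation checks, unless you modify its central metadata.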

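from nilm_metadata import convert_yaml_to_hdf5  # provided by the nilm_metadata package
convert_yaml_to_hdf5(yaml_dir, hdf_filename)  # yaml_dir: folder holding the yaml files; hdf_filename: output hdf5 path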
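Before running the conversion it can help to confirm that the directory really holds the three kinds of yaml file listed earlier. The helper below is a small sketch of my own, not part of nilm_metadata: `check_yaml_dir` is a hypothetical name, and it only looks at file names, not at their contents.

```python
import re
import tempfile
from pathlib import Path


def check_yaml_dir(yaml_dir):
    """Return the names of expected metadata files that are missing from yaml_dir."""
    names = {path.name for path in Path(yaml_dir).glob("*.yaml")}
    missing = [
        name for name in ("dataset.yaml", "meter_devices.yaml") if name not in names
    ]
    # At least one per-building file such as building1.yaml is expected.
    if not any(re.fullmatch(r"building\d+\.yaml", name) for name in names):
        missing.append("building<i>.yaml")
    return missing


# Demo on a throwaway directory that deliberately lacks meter_devices.yaml.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dataset.yaml").write_text("name: demo\n")
    (Path(tmp) / "building1.yaml").write_text("instance: 1\n")
    missing_files = check_yaml_dir(tmp)

print(missing_files)
```

An empty list means all three kinds of file are present and the conversion script can be tried.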

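2. Create the DataFrames for the series data

Assuming you already have the consumption data for each sub-metered circuit, the next step is to create, for each of them, a DataFrame that meets the format requirements above:
1) Convert the timestamps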

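import time

a = str(testdf.index[0])  # first index entry of testdf as a string, e.g. "2022-05-04 13:00:24"
timeArray = time.strptime(a, "%Y-%m-%d %H:%M:%S")  # parsed into a time.struct_time
timeStamp = int(time.mktime(timeArray))  # 10-digit Unix timestamp in seconds (mktime assumes local time)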
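One thing to watch in the snippet above: time.mktime interprets the parsed time in the time zone of the machine running the script, so the result changes if the conversion runs on a computer in a different zone from where the data was recorded. The sketch below shows one way to make the offset explicit using only the standard library. `to_unix_seconds` is a hypothetical helper, and the UTC+8 offset is just an assumed example.

```python
from datetime import datetime, timedelta, timezone


def to_unix_seconds(text, utc_offset_hours, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a wall-clock string recorded at a fixed UTC offset to Unix seconds."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    # Attach the offset explicitly instead of relying on the machine's local zone.
    return int(datetime.strptime(text, fmt).replace(tzinfo=tz).timestamp())


# Assumed example: the data was logged in a UTC+8 region.
stamp = to_unix_seconds("2022-05-04 13:00:24", utc_offset_hours=8)
print(stamp)
```

A fixed offset is enough for regions without daylight saving time; for other regions a full time zone database would be the safer choice.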

2) Set the multi-level column labels

test.columns = [['power'],['apparent']]
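The two steps above can also be wrapped into one helper with pandas. This is only a sketch under a few assumptions: the raw data has one time column and one value column, the times are naive local times that should be localized to a known zone, and the two column levels are named physical_quantity and type as in the table description earlier. `prepare_meter_frame` and the column names in the demo are hypothetical, and whether your loader needs a timezone-aware index is worth checking against the NILMTK documentation.

```python
import pandas as pd


def prepare_meter_frame(raw, time_col, value_col, tz,
                        quantity="power", ac_type="apparent"):
    """Return a one-column frame with a time index and two-level column labels."""
    frame = raw[[time_col, value_col]].copy()
    frame[time_col] = pd.to_datetime(frame[time_col])
    frame = frame.set_index(time_col).sort_index()
    # Make the index timezone-aware (assumes the raw times are naive local times).
    frame.index = frame.index.tz_localize(tz)
    frame.index.name = "time"
    frame.columns = pd.MultiIndex.from_tuples(
        [(quantity, ac_type)], names=["physical_quantity", "type"]
    )
    return frame.astype("float32")


# Made-up raw data, deliberately out of order to show the sorting.
raw = pd.DataFrame(
    {
        "timestamp": ["2022-05-04 13:00:03", "2022-05-04 13:00:00"],
        "fridge": [85, 80],
    }
)
fridge = prepare_meter_frame(raw, "timestamp", "fridge", tz="Asia/Shanghai")
print(fridge)
```

The returned frame can then be appended to the hdf5 store in the same way as the `test` frame in the next step.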

3. Add the DataFrames to the hdf5 file

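import pandas as pd

store = pd.HDFStore(r'test.hdf5')  # open the hdf5 file
for i, subset in enumerate(subsets, start=1):  # subsets: names of the sub-metered circuits
    test = Energys[subset][['时间', subset]].set_index('时间')  # '时间' is the time column in the source data
    test.columns = [['power'], ['apparent']]
    test.index.name = 'time'
    store.append(f'building1/elec/meter{i}', test)  # this key must match elec_meters in building<i>.yaml
store.close()  # close once, after all meters have been written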