NILMTK: Digging into the Details of Combinatorial Optimisation (CO) and FHMM

The previous post covered the concrete usage; this one digs into the implementation details of the algorithm code!

1. CO

(1) The train function

As the following code shows, CO first trains on each appliance's power data; to understand the native implementation, let's dig into the code:

classifiers = {'CO': CombinatorialOptimisation()}
predictions = {}
sample_period = 120  # the sampling period is two minutes (120 s)
for clf_name, clf in classifiers.items():
    print("*" * 20)
    print(clf_name)
    print("*" * 20)
    clf.train(top_5_train_elec, sample_period=sample_period)  # training step

Stepping into the code:

def train(self, metergroup, num_states_dict=None, **load_kwargs):
        """Train using 1D CO. Places the learnt model in the `model` attribute.

        Parameters
        ----------
        metergroup : a nilmtk.MeterGroup object
        num_states_dict : dict
        **load_kwargs : keyword arguments passed to `meter.power_series()`

        Notes
        -----
        * only uses first chunk for each meter (TODO: handle all chunks).
        """
        if num_states_dict is None:
            num_states_dict = {}

        if self.model:
            raise RuntimeError(
                "This implementation of Combinatorial Optimisation"
                " does not support multiple calls to `train`.")
       

        num_meters = len(metergroup.meters)
        if num_meters > 12:
            max_num_clusters = 2
        else:
            max_num_clusters = 3
#        print(max_num_clusters)

        for i, meter in enumerate(metergroup.submeters().meters):
            print("Training model for submeter '{}'".format(meter))
            power_series = meter.power_series(**load_kwargs)
            chunk = next(power_series)
            num_total_states = num_states_dict.get(meter)
            if num_total_states is not None:
                num_on_states = num_total_states - 1
            else:
                num_on_states = None
            self.train_on_chunk(chunk, meter, max_num_clusters, num_on_states)
            

            # Check to see if there are any more chunks.
            # TODO handle multiple chunks per appliance.
            try:
                next(power_series)
            except StopIteration:
                pass
            else:
                warn("The current implementation of CombinatorialOptimisation"
                     " can only handle a single chunk.  But there are multiple"
                     " chunks available.  So have only trained on the"
                     " first chunk!")

        print("Done training!")

In brief, the arguments passed in are:

metergroup --> top_5_train_elec (the five appliances with the highest energy consumption)

num_states_dict=None (a dict mapping each meter to a preset number of states; it stays empty here, so each lookup returns None)

**load_kwargs --> sample_period=sample_period (a sampling period of 120 seconds, i.e. two minutes)

Parameters set inside: if the number of appliances being trained exceeds 12, the number of clusters is set to 2, otherwise 3. Since we have 5 appliances here, max_num_clusters = 3.

Overall process: iterate over the 5 appliances, cluster each appliance's data separately, and record each appliance's cluster result, which gives the power values of that appliance in its different states.

Training process for each appliance:

If the meter already appears in model, a RuntimeError is raised since the same meter cannot be trained twice; otherwise the appliance's data is clustered, with inputs chunk (the appliance's power data) and max_num_clusters (the number of clusters). Each appliance's cluster result is recorded as states and, together with the meter as training_metadata, saved into model.

def train_on_chunk(self, chunk, meter, max_num_clusters, num_on_states):
        # Check if we've already trained on this meter
        meters_in_model = [d['training_metadata'] for d in self.model]
        
        if meter in meters_in_model:
            raise RuntimeError(
                "Meter {} is already in model!"
                "  Can't train twice on the same meter!"
                .format(meter))
        
        states = cluster(chunk, max_num_clusters, num_on_states)
#        print('states',states)
        self.model.append({
            'states': states,
            'training_metadata': meter})

The result is a model list with one dict per appliance, each holding that appliance's sorted state powers (states) and its meter (training_metadata).

Clustering in detail:

_transform_data(X) first converts the data format, turning a pd.Series or single-column pd.DataFrame into an ndarray; the data is then clustered to obtain each cluster's centroid, and the appliance's 'off' state (0 W) is appended. In theory, with 3 clusters plus the added off state there should be four state values, yet in practice only three appear. Read on to see why!

def cluster(X, max_num_clusters=3, exact_num_clusters=None):
    '''Applies clustering on reduced data,
    i.e. data where power is greater than threshold.

    Parameters
    ----------
    X : pd.Series or single-column pd.DataFrame
    max_num_clusters : int

    Returns
    -------
    centroids : ndarray of int32s
        Power in different states of an appliance, sorted
    '''
    # Find where power consumption is greater than 10
    data = _transform_data(X)

    # Find clusters
    centroids = _apply_clustering(data, max_num_clusters, exact_num_clusters)
#    print('centroids',centroids)
    centroids = np.append(centroids, 0)  # add 'off' state
    centroids = np.round(centroids).astype(np.int32)
    centroids = np.unique(centroids)  # np.unique also sorts
    # TODO: Merge similar clusters
    return centroids
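
To see this post-processing concretely, here is a self-contained sketch on synthetic data (my own toy example, reproducing the centroid post-processing with plain sklearn rather than calling nilmtk's cluster directly):

import numpy as np
from sklearn.cluster import KMeans

# Synthetic appliance power: two "on" levels around 150 W and 2000 W
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(150, 10, 100),
                       rng.normal(2000, 50, 100)]).reshape(-1, 1)

# Cluster with n_clusters=2, which is what the silhouette selection ends up picking
k_means = KMeans(init='k-means++', n_clusters=2).fit(data)
centroids = k_means.cluster_centers_.flatten()

centroids = np.append(centroids, 0)               # add 'off' state
centroids = np.round(centroids).astype(np.int32)
centroids = np.unique(centroids)                  # np.unique also sorts
print(centroids)  # roughly [0 150 2000]: three states, not four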

Why one appliance state is missing:

From the loop for n_clusters in range(1, max_num_clusters) in _apply_clustering, with max_num_clusters = 3 the values actually tried for n_clusters are only 1 and 2, hence the missing state! Additionally, each appliance's data is clustered into 1 and 2 clusters, and the silhouette coefficient (sklearn.metrics.silhouette_score) is used to pick the best n_clusters. Since the silhouette score is undefined for a single cluster, n_clusters = 1 falls into the except branch, and the winner ends up being n_clusters = 2.

def _apply_clustering(X, max_num_clusters, exact_num_clusters=None):
    '''
    Parameters
    ----------
    X : ndarray
    max_num_clusters : int

    Returns
    -------
    centroids : list of numbers
        List of power in different states of an appliance
    '''
    # If we import sklearn at the top of the file then it makes autodoc fail

    from sklearn import metrics

    # Finds whether 2 or 3 gives better Silhouette coefficient
    # Whichever is higher serves as the number of clusters for that
    # appliance
    num_clus = -1
    sh = -1
    k_means_labels = {}
    k_means_cluster_centers = {}
    k_means_labels_unique = {}

    # If the exact number of clusters are specified, then use that
    if exact_num_clusters is not None:
        labels, centers = _apply_clustering_n_clusters(X, exact_num_clusters)
        return centers.flatten()

    # Exact number of clusters are not specified, use the cluster validity measures
    # to find the optimal number
    for n_clusters in range(1, max_num_clusters):

        try:
            labels, centers = _apply_clustering_n_clusters(X, n_clusters)
#            print('labels, centers',labels, centers)
            k_means_labels[n_clusters] = labels
            k_means_cluster_centers[n_clusters] = centers
            k_means_labels_unique[n_clusters] = np.unique(labels)
            try:
                sh_n = metrics.silhouette_score(
                    X, k_means_labels[n_clusters], metric='euclidean')

                if sh_n > sh:
                    sh = sh_n
                    num_clus = n_clusters
            except Exception:
                num_clus = n_clusters
        except Exception:
            if num_clus > -1:
                return k_means_cluster_centers[num_clus]
            else:
                return np.array([0])
    return k_means_cluster_centers[num_clus].flatten()

The clustering algorithm chosen is KMeans:

def _apply_clustering_n_clusters(X, n_clusters):
    """
    :param X: ndarray
    :param n_clusters: exact number of clusters to use
    :return:
    """
    from sklearn.cluster import KMeans
    k_means = KMeans(init='k-means++', n_clusters=n_clusters)
    k_means.fit(X)
    return k_means.labels_, k_means.cluster_centers_

That is the entire training process!

(2) disaggregate_chunk

Disaggregation uses the disaggregate_chunk function: it takes the house's aggregate power curve and disaggregates it into the 5 appliances.

First the centroids from the train() step are extracted and enumerated. The cartesian function performs the enumeration: with 5 models and 3 states per model, it produces a state-combination matrix of 3*3*3*3*3 = 243 rows and 5 columns.

def _set_state_combinations_if_necessary(self):
        """Get centroids"""
        # If we import sklearn at the top of the file then auto doc fails.
        # Enumerate all possible state combinations
        if (self.state_combinations is None or
                self.state_combinations.shape[1] != len(self.model)):
            from sklearn.utils.extmath import cartesian
            centroids = [model['states'] for model in self.model]
#            print(len(centroids),len(centroids[0]))
            self.state_combinations = cartesian(centroids)
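
A quick sketch of what cartesian produces, with toy centroids of my own:

from sklearn.utils.extmath import cartesian

# Two hypothetical appliances: states [0, 100] and [0, 50, 200]
centroids = [[0, 100], [0, 50, 200]]
print(cartesian(centroids))
# [[  0   0]
#  [  0  50]
#  [  0 200]
#  [100   0]
#  [100  50]
#  [100 200]]

With 5 appliances and 3 states each, the same call yields the 243-row, 5-column matrix described above.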

Next, each row of the state-combination matrix is summed (one total power per combination), and find_nearest is called to obtain, for each mains reading, the index of the closest combination and the residual. The inputs to find_nearest are the row-summed state combinations and the household's aggregate power data.

def find_nearest(known_array, test_array):
    """Find closest value in `known_array` for each element in `test_array`.

    Parameters
    ----------
    known_array : numpy array
        consisting of scalar values only; shape: (m, 1)
    test_array : numpy array
        consisting of scalar values only; shape: (n, 1)

    Returns
    -------
    indices : numpy array; shape: (n, 1)
        For each value in `test_array` finds the index of the closest value
        in `known_array`.
    residuals : numpy array; shape: (n, 1)
        For each value in `test_array` finds the difference from the closest
        value in `known_array`.
    """
    # from http://stackoverflow.com/a/20785149/732596
    # Sort the known values, keeping the sorting index
    index_sorted = np.argsort(known_array)
    known_array_sorted = known_array[index_sorted]

    # Find where each mains value would be inserted into the sorted array
    idx1 = np.searchsorted(known_array_sorted, test_array)
    idx2 = np.clip(idx1 - 1, 0, len(known_array_sorted)-1)
    idx3 = np.clip(idx1,     0, len(known_array_sorted)-1)

    # upper neighbour minus actual value
    diff1 = known_array_sorted[idx3] - test_array
    # actual value minus lower neighbour
    diff2 = test_array - known_array_sorted[idx2]

    # pick whichever neighbour is closer
    indices = index_sorted[np.where(diff1 <= diff2, idx3, idx2)]
    # residuals relative to the matched value
    residuals = test_array - known_array[indices]
    return indices, residuals
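
A worked example with toy numbers of my own, using the find_nearest quoted above (and assuming numpy is imported as np):

import numpy as np

state_combinations = np.array([
    [0,    0],
    [0,   50],
    [100,  0],
    [100, 50],
])
# One total power value per state combination
summed_power_of_each_combination = state_combinations.sum(axis=1)
# -> [  0  50 100 150]

mains_values = np.array([10, 95, 160])
indices, residuals = find_nearest(
    summed_power_of_each_combination, mains_values)
print(indices)    # [0 2 3]: the closest totals are 0, 100 and 150
print(residuals)  # [ 10  -5  10]: mains minus the matched total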

With the indices and residuals in hand, each model's share can be computed; at this point indices_of_state_combinations is simply the indices returned above.

appliance_powers_dict = {}
for i, model in enumerate(self.model):
    print("Estimating power demand for '{}'"
          .format(model['training_metadata']))

    predicted_power = state_combinations[
        indices_of_state_combinations, i].flatten()
    column = pd.Series(predicted_power, index=mains.index, name=i)
#    print(column)
    appliance_powers_dict[self.model[i]['training_metadata']] = column
appliance_powers = pd.DataFrame(appliance_powers_dict, dtype='float32')

The core of the disaggregation lies in:

state_combinations[
    indices_of_state_combinations, i].flatten()

Here is a small example, with toy numbers of my own:
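
import numpy as np

# Hypothetical state combinations for 2 appliances with states [0, 100]
# and [0, 50]; rows are combinations, columns are appliances
state_combinations = np.array([
    [0,    0],
    [0,   50],
    [100,  0],
    [100, 50],
])

# Suppose find_nearest matched the three mains readings to rows 3, 1, 2
indices_of_state_combinations = np.array([3, 1, 2])

# Column i holds appliance i's power in every combination, so indexing
# with the matched rows yields that appliance's disaggregated series
print(state_combinations[indices_of_state_combinations, 0].flatten())
# -> [100   0 100]
print(state_combinations[indices_of_state_combinations, 1].flatten())
# -> [ 50  50   0]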

As this shows, the code walks the columns one by one and, for each appliance, picks the values at the rows given by indices_of_state_combinations as that appliance's disaggregation result.

2. FHMM

(1) The train function

The train function calls GaussianHMM several times over, and the intent is not obvious at first glance. Let's take it from the top: the meters are passed in, and a check is made that system memory can hold the combined matrix of transition probabilities.

def _check_memory(num_appliances):
    """
    Checks if the maximum resident memory is enough to handle the 
    combined matrix of transition probabilities
    """
    # Each transmat is small (usually 2x2 or 3x3) but the combined
    # matrix is dense, using much more memory
    
    # Get the approximate memory in MB
    try:
        # If psutil is installed, we can get the correct total 
        # physical memory of the system
        import psutil
        available_memory = psutil.virtual_memory().total >> 20
    except ImportError:
        # Otherwise use a crude approximation
        available_memory = 16 << 10
    
    
    # We use (num_appliances + 1) here to get a pessimistic approximation:
    # 8 bytes * (2 ** (num_appliances + 1)) ** 2
    required_memory = ((1 << (2 * (num_appliances + 1))) << 3) >> 20
    
    if required_memory >= available_memory:
        warn("The required memory for the model may be more than the total system memory!"
             " Try using fewer appliances if the training fails.")

Then, iterating over each appliance, the code determines each appliance's states; after a series of branches, it turns out that clustering (the same cluster function we saw above) is used to determine each appliance's states.

if num_total_states is None:
    states = cluster(meter_data, max_num_clusters)
    num_total_states = len(states)

Then hmmlearn's GaussianHMM is called to train a model for each appliance.

print("Training model for submeter '{}' with {} states".format(meter, num_total_states))
learnt_model[meter] = hmm.GaussianHMM(num_total_states, "full")
# Fit
learnt_model[meter].fit(X)
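
A minimal self-contained sketch of that call on synthetic data (my own assumption of what X looks like: the appliance's power readings reshaped into a column vector):

import numpy as np
from hmmlearn import hmm

# Synthetic power series for a 2-state appliance: ~0 W off, ~100 W on
rng = np.random.default_rng(0)
off = rng.normal(1, 0.5, 200)
on = rng.normal(100, 5, 200)
X = np.concatenate([off, on, off]).reshape(-1, 1)  # shape (n_samples, 1)

model = hmm.GaussianHMM(n_components=2, covariance_type="full")
model.fit(X)
print(model.means_)     # roughly [[1.], [100.]], in arbitrary order
print(model.transmat_)  # learnt state-transition probabilities

Note that the learnt states come back in arbitrary order, which is exactly why the next step sorts everything by means_.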

The means_ learnt by GaussianHMM are sorted, the corresponding startprob, covars and transmat are reordered by the resulting index mapping, and then a fresh GaussianHMM is created and its parameters are assigned directly.

self.meters = []
new_learnt_models = OrderedDict()
for meter in learnt_model:
    startprob, means, covars, transmat = sort_learnt_parameters(
            learnt_model[meter].startprob_, learnt_model[meter].means_,
            learnt_model[meter].covars_, learnt_model[meter].transmat_)
                
    new_learnt_models[meter] = hmm.GaussianHMM(startprob.size, "full")
    new_learnt_models[meter].startprob_ = startprob
    new_learnt_models[meter].transmat_ = transmat
    new_learnt_models[meter].means_ = means
    new_learnt_models[meter].covars_ = covars
    # UGLY! But works.
    self.meters.append(meter)

The code that computes the sorting mapping of the means is as follows:

def return_sorting_mapping(means):
    means_copy = deepcopy(means)
    means_copy = np.sort(means_copy, axis=0)

    # Finding mapping
    mapping = {}
    for i, val in enumerate(means_copy):
        mapping[i] = np.where(val == means)[0][0]
    return mapping

Example: let a = [1, 5, 3, 4, 2]; return_sorting_mapping returns, for each sorted position, the index of that mean in the original order, as the sketch below shows.
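
A sketch of that example, calling the return_sorting_mapping defined above (which itself needs numpy and copy.deepcopy imported):

import numpy as np

# means_ come back from GaussianHMM as a column vector, shape (k, 1)
a = np.array([[1], [5], [3], [4], [2]])
print(return_sorting_mapping(a))
# -> {0: 0, 1: 4, 2: 2, 3: 3, 4: 1}
# i.e. the smallest mean (1) sits at original index 0, the next (2) at
# index 4, and so on: sorted position -> original index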

The remaining parameters are all reordered according to the sorted means, with the code below:

def sort_startprob(mapping, startprob):
    """ Sort the startprob according to power means; as returned by mapping
    """
    num_elements = len(startprob)
    new_startprob = np.zeros(num_elements)
    for i in range(len(startprob)):
        new_startprob[i] = startprob[mapping[i]]
    return new_startprob


def sort_covars(mapping, covars):
    new_covars = np.zeros_like(covars)
    for i in range(len(covars)):
        new_covars[i] = covars[mapping[i]]
    return new_covars


def sort_transition_matrix(mapping, A):
    """Sorts the transition matrix according to increasing order of
    power means; as returned by mapping

    Parameters
    ----------
    mapping :
    A : numpy.array of shape (k, k)
        transition matrix
    """
    num_elements = len(A)
    A_new = np.zeros((num_elements, num_elements))
    for i in range(num_elements):
        for j in range(num_elements):
            A_new[i, j] = A[mapping[i], mapping[j]]
    return A_new

A further GaussianHMM is then built on top of the per-appliance results:

def create_combined_hmm(model):
    list_pi = [model[appliance].startprob_ for appliance in model]
    list_A = [model[appliance].transmat_ for appliance in model]
    list_means = [model[appliance].means_.flatten().tolist()
                  for appliance in model]

    pi_combined = compute_pi_fhmm(list_pi)
    A_combined = compute_A_fhmm(list_A)
    [mean_combined, cov_combined] = compute_means_fhmm(list_means)

    combined_model = hmm.GaussianHMM(n_components=len(pi_combined), covariance_type='full')
    combined_model.startprob_ = pi_combined
    combined_model.transmat_ = A_combined
    combined_model.covars_ = cov_combined
    combined_model.means_ = mean_combined

The means, transmat and startprob are each combined across appliances as follows:

def compute_A_fhmm(list_A):
    """
    Parameters
    -----------
    list_pi : List of PI's of individual learnt HMMs

    Returns
    --------
    result : Combined Pi for the FHMM
    """
    result = list_A[0]
    for i in range(len(list_A) - 1):
        result = np.kron(result, list_A[i + 1])
    return result


def compute_means_fhmm(list_means):
    """
    Returns
    -------
    [mu, cov]
    """
    states_combination = list(itertools.product(*list_means))
    num_combinations = len(states_combination)
    means_stacked = np.array([sum(x) for x in states_combination])
    means = np.reshape(means_stacked, (num_combinations, 1))
    cov = np.tile(5 * np.identity(1), (num_combinations, 1, 1))
    return [means, cov]


def compute_pi_fhmm(list_pi):
    """
    Parameters
    -----------
    list_pi : List of PI's of individual learnt HMMs

    Returns
    -------
    result : Combined Pi for the FHMM
    """
    result = list_pi[0]
    for i in range(len(list_pi) - 1):
        result = np.kron(result, list_pi[i + 1])
    return result
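
A quick sketch, with toy numbers of my own, of what the Kronecker product and the state-combination sums do here:

import itertools
import numpy as np

# Two appliances, each with a 2-state HMM
pi_fridge = np.array([0.9, 0.1])
pi_kettle = np.array([0.8, 0.2])

# Combined initial distribution over the 4 joint states:
# (off, off), (off, on), (on, off), (on, on)
print(np.kron(pi_fridge, pi_kettle))
# -> [0.72 0.18 0.08 0.02]

# Combined means: one sum per combination of per-appliance means
list_means = [[0, 100], [0, 2000]]
print([sum(x) for x in itertools.product(*list_means)])
# -> [0, 2000, 100, 2100]

The same Kronecker construction applied to the transition matrices gives the joint transition matrix, which is where the exponential memory cost checked earlier comes from.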

This yields the combined model.

(2) disaggregate_chunk

First the aggregate (mains) power data is obtained; then the combined GaussianHMM's predict function infers the joint state sequence, and the decode_hmm function unpacks it back into per-appliance states. A dizzying chain of operations at first read; the decoding is unpacked at the end of this section.

def disaggregate_chunk(self, test_mains):
        """Disaggregate the test data according to the model learnt previously

        Performs 1D FHMM disaggregation.

        For now assuming there is no missing data at this stage.
        """
        # See v0.1 code
        # for ideas of how to handle missing data in this code if needs be.

        # Array of learnt states
        learnt_states_array = []
        test_mains = test_mains.dropna()
        length = len(test_mains.index)
        temp = test_mains.values.reshape(length, 1)
        learnt_states_array.append(self.model.predict(temp))
        print(learnt_states_array[0].shape)

        # Model
        means = OrderedDict()
        for elec_meter, model in self.individual.items():
            means[elec_meter] = (
                model.means_.round().astype(int).flatten().tolist())
            means[elec_meter].sort()

        decoded_power_array = []
        decoded_states_array = []
        print(means.keys())
        for learnt_states in learnt_states_array:
            [decoded_states, decoded_power] = decode_hmm(
                len(learnt_states), means, means.keys(), learnt_states)
            decoded_states_array.append(decoded_states)
            decoded_power_array.append(decoded_power)

        prediction = pd.DataFrame(
            decoded_power_array[0], index=test_mains.index)

        return prediction

The decoding function:

def decode_hmm(length_sequence, centroids, appliance_list, states):
    """
    Decodes the HMM state sequence
    """
    hmm_states = {}
    hmm_power = {}
    total_num_combinations = 1

    for appliance in appliance_list:
        print(centroids[appliance])
        total_num_combinations *= len(centroids[appliance])
    print(total_num_combinations)
    for appliance in appliance_list:
        hmm_states[appliance] = np.zeros(length_sequence, dtype=int)  # np.int in the original; removed in recent NumPy
        hmm_power[appliance] = np.zeros(length_sequence)

    for i in range(length_sequence):

        factor = total_num_combinations
        for appliance in appliance_list:
            # integer division; the original used '/' here, which breaks
            # under Python 3, so '//' restores the intended behaviour
            factor = factor // len(centroids[appliance])

            temp = int(states[i]) // factor
            hmm_states[appliance][i] = temp % len(centroids[appliance])
            hmm_power[appliance][i] = centroids[
                appliance][hmm_states[appliance][i]]
    return [hmm_states, hmm_power]

Each appliance has 3 state means, so there are 3**5 = 243 joint combinations in total; the integer division and modulo then recover each appliance's own state index from the joint state index, which is how the data is disaggregated.
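
A toy sketch of my own of that division-and-modulo decoding:

# 3 appliances with 2 states each, so joint states run from 0 to 7;
# the joint index encodes the per-appliance states as s0*4 + s1*2 + s2
num_states = [2, 2, 2]
joint_state = 6  # i.e. s0=1, s1=1, s2=0

factor = 2 * 2 * 2  # total_num_combinations
for i, k in enumerate(num_states):
    factor //= k
    print("appliance", i, "state:", (joint_state // factor) % k)
# appliance 0 state: 1
# appliance 1 state: 1
# appliance 2 state: 0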

This rough walkthrough can only go so far; fully understanding these operations probably requires going back to HMM theory!

The code is taken from the source files of the nilmtk package.
