Gnocchi: 6. A demo implementation of the Gnocchi-based time series aggregation algorithm

Article link: https://blog.csdn.net/qingyuanluofeng/article/details/80329492
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @File    : scipy_demo.py
# @Software: PyCharm

'''
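Reference:
https://github.com/gnocchixyz/gnocchi/tree/3.1.4

A demo that aggregates a time series following the approach used by Gnocchi.
Gnocchi aggregation algorithm:
Step 1: Group the index of the time series, ts.index, by the sampling interval to obtain indexes.
Step 2: Deduplicate the grouped indexes with numpy.unique to obtain uniqueIndexes.
Step 3: Call ndimage.mean as follows:
ndimage.mean(ts.values, labels=indexes, index=uniqueIndexes)
This yields the aggregated result, aggregatedValues.
Step 4: Convert uniqueIndexes back into a numpy array of type datetime64[ns], called timestamps.
Step 5: Build a new time series from the aggregatedValues of step 3 and the timestamps of step 4.
This series, newTimeSerie, is the final aggregated time series.
Step 6: Slice according to the number of points n to keep, and take newTimeSerie[-n:]
as the time series that is finally stored.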
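
The six steps above can be sketched as one small function. This is only an illustrative sketch, not Gnocchi's actual code: the helper name aggregate_mean and its parameters granularity_seconds and points are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from scipy import ndimage


def aggregate_mean(ts, granularity_seconds, points):
    # Hypothetical helper: mean-aggregate ts into buckets and keep the last `points`.
    freq = granularity_seconds * 1e9  # bucket width in nanoseconds
    # Timestamps as float nanoseconds since the epoch.
    stamps = ts.index.values.astype("datetime64[ns]").astype("int64").astype("float64")
    indexes = (stamps // freq) * freq                      # step 1: bucket start per point
    unique_indexes = np.unique(indexes)                    # step 2: distinct buckets
    values = ndimage.mean(ts.values, labels=indexes,
                          index=unique_indexes)            # step 3: mean per bucket
    timestamps = unique_indexes.astype("int64").astype("datetime64[ns]")  # step 4
    aggregated = pd.Series(values, index=pd.to_datetime(timestamps))      # step 5
    return aggregated.iloc[-points:]                       # step 6: keep the last points


# Twelve one-minute samples starting at 11:20:30, aggregated into 5-minute buckets.
dates = pd.date_range("2018-04-18 11:20:30", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=dates)
result = aggregate_mean(ts, 300.0, 2)
print(result)
```

With points=2 only the two most recent buckets (11:25:00 and 11:30:00) are returned; the 11:20:00 bucket is dropped by the slice in step 6.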


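Explanation:
scipy.ndimage.measurements.mean(input, labels=None, index=None)
Purpose: compute the mean of the array values over each labeled region.
Parameters:
input: array_like; the array whose elements are averaged.
labels: array_like, optional; one label per element.
It must have the same shape as input, or be broadcastable to that shape.
All elements that share the same label form one region, and the mean is computed over that region.
index: the labels of the regions for which the mean is computed.
Return value: a list with one mean per entry in index.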
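
A minimal one-dimensional sketch of how labels and index interact may help here. The sample values and labels are made up for illustration; the call goes through ndimage.mean, which is the same function as the one documented above.

```python
import numpy as np
from scipy import ndimage

# Five values; the first three share label 0, the last two share label 3.
values = np.array([1.0, 2.0, 3.0, 10.0, 20.0])
labels = np.array([0, 0, 0, 3, 3])

# The distinct labels, used to request one mean per region.
index = np.unique(labels)

# One mean per label listed in index: (1+2+3)/3 and (10+20)/2.
means = ndimage.mean(values, labels=labels, index=index)
print(index)
print(means)
```

Because index is passed explicitly, the region labeled 0 is included in the result; it is only skipped when index is left as None.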




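Analysis of the grouping step:
 (a // b) * b: rounds a down to a multiple of b, i.e. it yields the largest multiple of b that does not exceed a
 (floatIndexes // freq) * freq, where floatIndexes holds the ts.index timestamps as floats:
 this applies the operation to every element of the array, mapping each element to the largest multiple of freq that does not exceed it
 Example:
 1,2,3,4,5,6,7,8,9
 freq=3
 The result of the operation is
 0 0 3 3 3 6 6 6 9
 This is effectively a grouping operation whose buckets start at multiples of freq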
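
 The example above can be checked with a few lines of numpy. This is a small sketch; the function name floor_to_multiple is an assumption made for this illustration.

```python
import numpy as np


def floor_to_multiple(values, freq):
    # Round every element down to the largest multiple of freq not exceeding it.
    values = np.asarray(values, dtype="float64")
    return (values // freq) * freq


groups = floor_to_multiple([1, 2, 3, 4, 5, 6, 7, 8, 9], 3)
print(groups)             # bucket start for each element
print(np.unique(groups))  # the distinct buckets
```

 Elements 1 and 2 fall into bucket 0 even though 2 is closer to 3, because the operation always rounds down.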
'''

'''
ref:
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.ndimage.measurements.mean.html
scipy.ndimage.measurements.mean

scipy.ndimage.measurements.mean(input, labels=None, index=None)
Calculate the mean of the values of an array at labels.

Parameters:	
input : array_like
Array on which to compute the mean of elements over distinct regions.
labels : array_like, optional
Array of labels of same shape, or broadcastable to the same shape as input. All elements sharing the same label form one region over which the mean of the elements is computed.
index : int or sequence of ints, optional
Labels of the objects over which the mean is to be computed. Default is None, in which case the mean for all values where label is greater than 0 is calculated.
Returns:	
out : list
Sequence of same length as index, with the mean of the different regions labeled by the labels in index.
See also
ndimage.variance, ndimage.standard_deviation, ndimage.minimum, ndimage.maximum, ndimage.sum, ndimage.label

Examples

>>> import numpy as np
>>> from scipy import ndimage
>>> a = np.arange(25).reshape((5,5))
>>> labels = np.zeros_like(a)
>>> labels[3:5,3:5] = 1
>>> index = np.unique(labels)
>>> labels
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 0, 0, 1, 1]])
>>> index
array([0, 1])
>>> ndimage.mean(a, labels=labels, index=index)
[10.285714285714286, 21.0]

'''

import numpy as np
import pandas as pd
from scipy import ndimage


def aggregateGnocchiTimeSerie():
    # Step 0: build the sample time series, one point per minute.
    dates = pd.DatetimeIndex(['2018-04-18 11:20:30', '2018-04-18 11:21:30',
                              '2018-04-18 11:22:30', '2018-04-18 11:23:30',
                              '2018-04-18 11:24:30', '2018-04-18 11:25:30',
                              '2018-04-18 11:26:30', '2018-04-18 11:27:30',
                              '2018-04-18 11:28:30', '2018-04-18 11:29:30',
                              '2018-04-18 11:30:30', '2018-04-18 11:31:30'])
    print(dates)
    ts = pd.Series(np.arange(12), index=dates)
    print("step 0 ############ time series:")
    print(ts)
    # Sampling interval in seconds, converted to nanoseconds because
    # datetime64[ns] timestamps count nanoseconds since the epoch.
    granularity = 300.0
    freq = granularity * 1e9
    # Convert the index to float nanoseconds, going through int64 explicitly.
    floatIndexes = ts.index.values.astype('datetime64[ns]').astype('int64').astype('float64')
    print("############ float indexes:")
    print(floatIndexes)
    # Step 1: group the index of the time series by the sampling interval.
    indexes = (floatIndexes // freq) * freq
    print("step 1 ############ group indexes:")
    print(indexes)
    # Step 2: deduplicate the grouped indexes.
    uniqueIndexes = np.unique(indexes)
    print("step 2 ############ unique indexes:")
    print(uniqueIndexes)
    print("############ values")
    print(ts.values)
    # Step 3: compute the aggregated result from the values of the time
    # series, the grouped indexes and the deduplicated indexes.
    values = ndimage.mean(ts.values, labels=indexes, index=uniqueIndexes)
    print("step 3 ############ gnocchi mean aggregated result")
    print(values)
    # Step 4: convert the deduplicated indexes back into timestamps.
    timestamps = uniqueIndexes.astype('int64').astype('datetime64[ns]')
    print("step 4 ############ recover unique indexes")
    print(timestamps)
    # Step 5: build the new time series from the aggregated values and
    # the recovered timestamps.
    timestamps = pd.to_datetime(timestamps)
    print(timestamps)
    newTimeSerie = pd.Series(values, index=timestamps)
    print("step 5 ############ get aggregated time serie")
    print(newTimeSerie)

if __name__ == "__main__":
    aggregateGnocchiTimeSerie()