Quantile of a Value Within a Container

guotianqing

Category: cpp  Tags: stl

Article link: https://blog.csdn.net/guotianqing/article/details/118527094

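Background

There is an existing data sequence, and new data keeps arriving. For each newly arrived value, we need to know its quantile within the existing sequence.

Assume the data is of type double and the sequence is stored in a vector.

Implementation approaches

There are several ways to implement this.

Brute force

The most intuitive way is brute force.

Sort the vector, then walk through the container until the first element that is not less than the given value. The index of that element divided by vector.size() is the quantile.

The code is as follows: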

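void GetQuantileForLoop(const vector<double>& ori, const double val)
{
    // Work on a sorted copy so the caller's data is left untouched.
    vector<double> v(ori.begin(), ori.end());
    sort(v.begin(), v.end());

    // Linear scan: stop at the first element that is not less than val.
    size_t index = 0;
    for (index = 0; index < v.size(); ++index) {
        if (val <= v[index])
            break;
    }

    // index now equals the number of elements strictly less than val.
    double quantile = static_cast<double>(index) / v.size();
    cout << "GetQuantileForLoop|" << fixed << index << "|" << quantile << endl;
}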
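
Since the brute-force version only needs the number of elements strictly less than the value, the sort is not strictly required. The following is a minimal sketch of that idea using std::count_if, which is a swapped-in technique and not one of the approaches from this post. The name QuantileByCount is hypothetical, and returning 0.0 for an empty input is an assumption made here to avoid the 0/0 division that the printing versions would hit.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): fraction of elements
// strictly less than val, computed in one pass without sorting.
// Assumption: an empty input yields 0.0 instead of dividing by zero.
double QuantileByCount(const std::vector<double>& data, double val)
{
    if (data.empty()) {
        return 0.0;
    }
    // Count the elements that compare less than val.
    const auto below = std::count_if(data.begin(), data.end(),
                                     [val](double x) { return x < val; });
    return static_cast<double>(below) / static_cast<double>(data.size());
}
```

For a single query this is O(n) instead of O(n log n). It returns the value rather than printing it, so it is also easier to check in a test.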

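Using a set

A set keeps its elements sorted, so the explicit sort of the vector is no longer needed. Because the data may contain duplicates, the code below uses multiset rather than set.

The set containers provide a lower_bound member function, which returns an iterator to the first element that is not less than the given value, i.e. the first position where the value could be inserted.

The code is as follows: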

void GetQuantileSetLowerBound(const vector<double>& ori, const double val)
{
    // multiset keeps duplicates and stores the elements in sorted order.
    multiset<double> s(ori.begin(), ori.end());
    // First element that is not less than val.
    auto low_iter = s.lower_bound(val);
    // Number of elements before that position.
    size_t location = distance(s.begin(), low_iter);

    double quantile = static_cast<double>(location) / s.size();
    cout << "GetQuantileSetLowerBound|" << fixed << location << "|" << quantile << endl;
}
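
When the data contains duplicates, it matters whether ties count as "below" the value. As a rough illustration, the sketch below contrasts the lower_bound and upper_bound members of multiset. The helper names CountLess and CountLessOrEqual are hypothetical and only serve this example.

```cpp
#include <cstddef>
#include <iterator>
#include <set>

// Hypothetical helper: number of stored elements strictly less than val.
std::size_t CountLess(const std::multiset<double>& s, double val)
{
    // Member lower_bound: first element that is not less than val.
    return static_cast<std::size_t>(std::distance(s.begin(), s.lower_bound(val)));
}

// Hypothetical helper: number of stored elements less than or equal to val.
std::size_t CountLessOrEqual(const std::multiset<double>& s, double val)
{
    // Member upper_bound: first element that is greater than val.
    return static_cast<std::size_t>(std::distance(s.begin(), s.upper_bound(val)));
}
```

With the contents {1.2, 2.1, 2.1, 3.3} and the value 2.1, CountLess gives 1 and CountLessOrEqual gives 3. The functions in this post follow the first convention.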

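Using the STL lower_bound algorithm

lower_bound is a member function of the set containers, but other containers can use the generic lower_bound algorithm provided by the STL, as long as the range is sorted.

The code is as follows: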

void GetQuantileStlLowerBound(const vector<double>& ori, const double val)
{
    // std::lower_bound requires a sorted range, so sort a copy first.
    vector<double> v(ori.begin(), ori.end());
    sort(v.begin(), v.end());
    // Binary search for the first element that is not less than val.
    auto low_iter = lower_bound(v.begin(), v.end(), val);
    size_t location = distance(v.begin(), low_iter);
    double quantile = static_cast<double>(location) / v.size();

    cout << "GetQuantileStlLowerBound|" << fixed << location << "|" << quantile << endl;
}
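
The background states that new data keeps arriving, so copying and sorting on every query repeats work. One possible variation, sketched here under the assumption that each value is added to the sequence after it has been queried, is to keep the vector sorted at all times. The class SortedSeries and its member names are hypothetical and do not appear in the original post.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical class (not from the original post): keeps the values sorted so
// that each query is a single binary search instead of copy + sort.
class SortedSeries
{
public:
    // Fraction of stored values strictly less than val.
    // Assumption: an empty series yields 0.0.
    double Quantile(double val) const
    {
        if (data_.empty()) {
            return 0.0;
        }
        const auto it = std::lower_bound(data_.begin(), data_.end(), val);
        return static_cast<double>(it - data_.begin()) /
               static_cast<double>(data_.size());
    }

    // Insert val at its sorted position (binary search, then an O(n) shift).
    void Add(double val)
    {
        data_.insert(std::upper_bound(data_.begin(), data_.end(), val), val);
    }

    std::size_t Size() const { return data_.size(); }

private:
    std::vector<double> data_;  // invariant: sorted in ascending order
};
```

A query costs O(log n) and an insertion O(n) because of the element shift. Whether that beats re-sorting depends on how often values are queried versus added.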

Performance analysis

All three implementations are complete. The next question is which one is fastest.

The test program is as follows:

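#include <set>
#include <string>     // std::stod
#include <vector>
#include <iostream>
#include <iterator>
#include <algorithm>

// TimeSpanStats is the timing helper listed further below.
// The header name is an assumption based on its include guard.
#include "time_span_stats.h"

using namespace std;

// The three GetQuantile* functions shown above must be defined here, before main.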

int main(int argc, char *argv[])
{
    if (argc != 2) {
        cout << "Usage: a.out val" << endl;
        return -1;
    }

    double val = stod(argv[1]);
    vector<double> ori {1.2, 2.1, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1, 
        1.2, 2.1, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1,
        1.2, 2.1, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1,
        1.2, 2.1, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1,
        1.2, 2.1, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1,
        1.2, 2.1, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1,
        1.2, 2.1, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1,
        1.2, 2.1, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1,
        1.2, 2.1, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1,
        1.2, 2.1, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1,
        1.2, 2.1, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1,
        1.2, 2.1, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1};
    
    {
        TimeSpanStats tss("GetQuantileForLoop", 1);
        GetQuantileForLoop(ori, val);
    }
    
    {
        TimeSpanStats tss("GetQuantileSetLowerBound", 1);
        GetQuantileSetLowerBound(ori, val);
    }
    
    {
        TimeSpanStats tss("GetQuantileStlLowerBound", 1);
        GetQuantileStlLowerBound(ori, val);
    }

    return 0;
}

The helper that measures the elapsed time of a function is implemented as follows:

#ifndef TIME_SPAN_STATS_H_
#define TIME_SPAN_STATS_H_

#include <cstdint>   // int32_t
#include <string>
#include <chrono>
#include <iostream>

using namespace std::chrono;
using std::string;

class TimeSpanStats
{
public:
    TimeSpanStats(const string& msg, const int32_t threshold): msg_(msg), threshold_(threshold) {
        t_start_ = high_resolution_clock::now();
    }
    ~TimeSpanStats() {
        t_end_ = high_resolution_clock::now();
        duration<double, std::micro> time_span = t_end_ - t_start_;
        // Truncate to whole microseconds for the threshold comparison.
        const auto ts = static_cast<int32_t>(time_span.count());
        if (ts >= threshold_) {
            std::cout << msg_ << "|tooks|" << std::fixed << time_span.count() << "|us" << std::endl;
        }
    }
    
private:
    string msg_;
    int32_t threshold_;
    high_resolution_clock::time_point t_start_;
    high_resolution_clock::time_point t_end_;
};

#endif

The test results are as follows:

./a.out 10
GetQuantileForLoop|108|0.900000
GetQuantileForLoop|tooks|68.560000|us
GetQuantileSetLowerBound|108|0.900000
GetQuantileSetLowerBound|tooks|120.666000|us
GetQuantileStlLowerBound|108|0.900000
GetQuantileStlLowerBound|tooks|42.301000|us

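In this run, the third approach (sort plus the lower_bound algorithm) is the fastest and the multiset version is the slowest. Each function is timed only once on 120 elements, and the measured span includes the console output, so the figures are indicative rather than precise.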
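
A single timed call is easily dominated by noise. As a rough sketch of a steadier measurement, the helper below averages the wall-clock time over many calls. The name AverageMicros is hypothetical, and the use of std::chrono::steady_clock is a choice made for this sketch rather than what TimeSpanStats does.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper (not from the original post): average wall-clock time
// of fn() in microseconds over `iterations` calls.
// Assumption: zero iterations yields 0.0.
template <typename Fn>
double AverageMicros(Fn&& fn, std::size_t iterations)
{
    using Clock = std::chrono::steady_clock;  // monotonic clock
    const auto start = Clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        fn();
    }
    const std::chrono::duration<double, std::micro> elapsed = Clock::now() - start;
    return iterations == 0 ? 0.0 : elapsed.count() / static_cast<double>(iterations);
}
```

The callable should do something with its result, for example add it to a variable that is read afterwards, and should not print inside the timed loop. Otherwise the compiler may optimize the work away or the output may dominate the measurement.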

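The multiset version is slowed down by distance, which counts the steps between two iterators. With random-access iterators, such as those of vector, this is a constant-time subtraction. With the bidirectional iterators of set and multiset it has to walk from node to node, which takes time linear in the distance. Building the multiset also allocates one node per element.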
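
The iterator category that decides the cost of distance can be checked at compile time. The sketch below uses std::iterator_traits for this. The helper name IsRandomAccess is hypothetical.

```cpp
#include <iterator>
#include <set>
#include <type_traits>
#include <vector>

// Hypothetical helper: true when It is a random-access iterator, in which
// case std::distance is a constant-time subtraction.
template <typename It>
constexpr bool IsRandomAccess()
{
    return std::is_base_of<std::random_access_iterator_tag,
                           typename std::iterator_traits<It>::iterator_category>::value;
}

// vector iterators support subtraction directly.
static_assert(IsRandomAccess<std::vector<double>::iterator>(),
              "vector iterators are random access");
// multiset iterators are bidirectional, so distance must step through nodes.
static_assert(!IsRandomAccess<std::multiset<double>::iterator>(),
              "multiset iterators are not random access");
```

Both assertions are checked at compile time, so the block has no runtime cost.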

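Summary

The containers and algorithms provided by the STL are well tested in practice and do well in both functionality and performance.

Still, each use case calls for the container and algorithm that fit it best. Only then do the strengths of the STL pay off.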
