Modern C++ 内存篇1 - std::allocator VS pmr

深山老宅

已于 2024-02-11 19:05:45 修改

阅读量1k

点赞数 21

分类专栏： modern C++ 文章标签： c++ modern C++ allocator polymorphic pmr

于 2024-02-08 21:47:06 首次发布

本文链接：https://blog.csdn.net/zhaiminlove/article/details/136077845

版权

modern C++ 专栏收录该内容

43 篇文章 3 订阅

订阅专栏

大年三十所写，看到就点个赞吧！祝读者们龙年大吉！当然有问题欢迎评论指正。
在这里插入图片描述

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. 前言

从今天起我们开始内存相关的话题，内存是个很大的话题，一时不知从何说起。内存离不开allocator，我们就从allocator开始吧。allocator目前有两种：std::allocator, std::pmr::polymorphic_allocator，各有优缺点。
上来就长篇大论容易显得枯燥，我们还是抛出一个例子然后提出问题，通过问题慢慢深入吧。

2. 分配器例子

下面这个例子是我很久以前从一个网站上copy下来的。是个不错的用来快速学习的例子。作者当时留了个疑问没解决：为什么预分配内存的pmr反而效率更低哪？
这也是本节我们要解决的问题，从中也可以学到allocator和polymorphic_allocator的优缺点对比。

#include<iostream>
#include<memory_resource>
#include<vector>
#include "../PerfSum.hpp"
using namespace std;

void TestPmrVec(){
    char buffer[1000000*4] = {0};
    std::pmr::monotonic_buffer_resource mbr{ std::data(buffer), std::size(buffer) };
    std::pmr::polymorphic_allocator<int> pa{&mbr};
    std::pmr::vector<int> vec{pa};
    //vec.push_back(0);
    //vec.push_back(1);
    PerfSum t;
     for(int i=0;i<1000000;i++){
         vec.push_back(i);
     }
     std::cout<<"End"<<std::endl;
 }

 void TestStdVec(){
     std::vector<int> vec ;
     PerfSum t;
         //vec.push_back(0);
         //vec.push_back(1);
     for(int i=0;i<1000000;i++){
         vec.push_back(i);
     }
     std::cout<<"End"<<std::endl;
 }

int main() {
    std::cout<<"std vector cost:"<<std::endl;
    TestStdVec();
    std::cout<<"pmr vector cost:"<<std::endl;
    TestPmrVec();
}

其中PerfSum.hpp在《Modern C++ idiom3：RAII》中有提到。编译运行结果：

[mzhai@std_polymorphic_pmr]$ g++ compare_speed.cpp -std=c++17 -g
[mzhai@std_polymorphic_pmr]$ ./a.out
std vector cost:
End
 took 19171 microseconds.
pmr vector cost:
End
 took 56134 microseconds.

可见pmr反而比普通的vector慢了大约3倍。
这里我还是坚持我一贯的写作风格：先preview结果给大家，尽量一句话说明白，没时间的读者可以节约时间去干点别的，有时间且有兴趣了解细节的读者可以慢慢往下看。
preview：虽然pmr预分配的内存空间，但是后面vector既有capacity不够时需要copy/move旧的数据到新分配的空间去，pmr::vector是一个个元素move过去的；而普通vector是调用memmove把所有数据一股脑move过去的。

注意：pmr是c++17开始才有的standard library features, gcc从9.1开始支持。

3. pmr慢的原因

启动perf, 查热点：

[mzhai@std_polymorphic_pmr]$ sudo sysctl -w kernel.kptr_restrict=0
sudo sysctl -w kernel.perf_event_paranoid=0
[sudo] password for mzhai:
kernel.kptr_restrict = 0
kernel.perf_event_paranoid = 0
[mzhai@std_polymorphic_pmr]$ perf record -a -g ./a.out
std vector cost:
End
 took 17302 microseconds.
pmr vector cost:
End
 took 58350 microseconds.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.100 MB perf.data (369 samples) ]
[mzhai@std_polymorphic_pmr]$ perf report

在这里插入图片描述
找到__uninitialized_copy_a的实现，我的机器在目录/usr/include/c++/11/bits/stl_uninitialized.h中：

请添加图片描述
从perf report能隐约看出调用栈，__uninitialized_copy_a是从push_back -> _M_realloc_insert 调过来的，从名字猜也能猜到是vector旧的分配的空间不够了需要reallocate, 分配完新的空间后需要调用__uninitialized_copy_a把旧的数据copy或move过来，但是重点是：这里竟然是for循环，是一个个copy或move过来的！

4. std::allocator快的原因

作为对比，我们查下std::vector 空间不够是怎么做的？
读者可自行调试TestStdVec，我这里直接上代码：

#0  std::__relocate_a_1<int, int> (__first=0x41b2e8, __last=0x41b2e8, __result=0x41b2cc) at /usr/include/c++/11/bits/stl_uninitialized.h:1012
#1  0x000000000040451f in std::__relocate_a<int*, int*, std::allocator<int> > (__first=0x41b2e8, __last=0x41b2e8, __result=0x41b2cc, __alloc=...)
    at /usr/include/c++/11/bits/stl_uninitialized.h:1046
#2  0x000000000040423f in std::vector<int, std::allocator<int> >::_S_do_relocate (__first=0x41b2e8, __last=0x41b2e8, __result=0x41b2cc, __alloc=...)
    at /usr/include/c++/11/bits/stl_vector.h:456
#3  0x0000000000403e5d in std::vector<int, std::allocator<int> >::_S_relocate (__first=0x41b2e8, __last=0x41b2e8, __result=0x41b2cc, __alloc=...)
    at /usr/include/c++/11/bits/stl_vector.h:469
#4  0x000000000040376a in std::vector<int, std::allocator<int> >::_M_realloc_insert<int const&> (this=0x7fffffffdb70, __position=0)
    at /usr/include/c++/11/bits/vector.tcc:468
#5  0x0000000000402f24 in std::vector<int, std::allocator<int> >::push_back (this=0x7fffffffdb70, __x=@0x7fffffffdb3c: 2)

请添加图片描述
直接调用__builtin_memmove把旧数据一股脑memmove过去，能不快吗？！
可能有读者有一点点疑问：想__builtin_memmove真的调用memmove吗？简单看下汇编就知道啦。

5. 何时调用memmove何时调用for循环

通过上面的分析，我们现在知道了pmr慢而普通allocator快的原因了，接着新的问题来了：为什么pmr不走memmove? 什么条件下走memmove哪?

5.1 relocation - _M_realloc_insert

当capacity不够时，会调用_M_realloc_insert做几件事：

分配新空间
把旧数据move到新空间
释放旧空间

看下_M_realloc_insert的庐山真面目：
在这里插入图片描述

/usr/include/c++/11/bits/vector.tcc
423   template<typename _Tp, typename _Alloc>
 424     template<typename... _Args>
 425       void
 426       vector<_Tp, _Alloc>::
 427       _M_realloc_insert(iterator __position, _Args&&... __args)
 
 434     {

 458 #if __cplusplus >= 201103L
 459       if _GLIBCXX17_CONSTEXPR (_S_use_relocate())
 460         {
 461           __new_finish = _S_relocate(__old_start, __position.base(),
 462                      __new_start, _M_get_Tp_allocator());
 463
 464           ++__new_finish;
 465
 466           __new_finish = _S_relocate(__position.base(), __old_finish,
 467                      __new_finish, _M_get_Tp_allocator());
 468         }
 469       else
 470 #endif
 471         {
 472           __new_finish
 473         = std::__uninitialized_move_if_noexcept_a
 474         (__old_start, __position.base(),
 475          __new_start, _M_get_Tp_allocator());
 476
 477           ++__new_finish;
 478
 479           __new_finish
 480         = std::__uninitialized_move_if_noexcept_a
 481         (__position.base(), __old_finish,
 482          __new_finish, _M_get_Tp_allocator());
 483         }

5.2 初次判断是否使用memmove - _S_use_relocate

关键点在_S_use_relocate()的值，此函数的定义如下：

/usr/include/c++/11/bits/stl_vector.h
 430       static constexpr bool
 431       _S_nothrow_relocate(true_type)
 432       {
 433     return noexcept(std::__relocate_a(std::declval<pointer>(),
 434                       std::declval<pointer>(),
 435                       std::declval<pointer>(),
 436                       std::declval<_Tp_alloc_type&>()));
 437       }
 438
 439       static constexpr bool
 440       _S_nothrow_relocate(false_type)
 441       { return false; }
 442
 443       static constexpr bool
 444       _S_use_relocate()
 445       {
 446     // Instantiating std::__relocate_a might cause an error outside the
 447     // immediate context (in __relocate_object_a's noexcept-specifier),
 448     // so only do it if we know the type can be move-inserted into *this.
 449     return _S_nothrow_relocate(__is_move_insertable<_Tp_alloc_type>{});
 450       }

5.2.1 __is_move_insertable<_Tp_alloc_type>

对_S_use_relocate的判断首先看__is_move_insertable<_Tp_alloc_type>{}，无论_Tp_alloc_type是std::allocator 还是std::pmr::polymorphic_allocator，结果都是true.

785   template<typename _Alloc>
786     struct __is_move_insertable
787     : __is_alloc_insertable_impl<_Alloc, typename _Alloc::value_type>::type
788     { };
789
790   // std::allocator<_Tp> just requires MoveConstructible
791   template<typename _Tp>
792     struct __is_move_insertable<allocator<_Tp>>
793     : is_move_constructible<_Tp>
794     { };

std::allocator匹配后者（791行），is_move_constructible为true;
pmr匹配前者（785行）, 匹配下面的两者之一。
此处用了SFINAE思想，如果_Alloc能用_Tp做参数类型构造一个_ValueT*对象，则匹配true的这个模板，否则false, 分别对应__is_move_insertable的结果。std::allocator及polymorphic_allocator都有construct函数，__is_move_insertable都为true。

5.2.2 __relocate_a是否抛出异常

5.2.2.1 异常判断链

对_S_use_relocate的判断其次看std::__relocate_a是否抛出异常，__relocate_a会看__relocate_a_1是否抛出异常，而**__relocate_a_1**会看__relocate_object_a是否抛出异常，__relocate_object_a是否抛出异常取决于（__relocate_a_1还有个特例，请看“__relocate_a_1特例”), 这长长的链我暂时称之为“异常判断链”

984   template<typename _Tp, typename _Up, typename _Allocator>
 985     inline void
 986     __relocate_object_a(_Tp* __restrict __dest, _Up* __restrict __orig,
 987             _Allocator& __alloc)
 988     noexcept(noexcept(std::allocator_traits<_Allocator>::construct(__alloc,
 989              __dest, std::move(*__orig)))
 990          && noexcept(std::allocator_traits<_Allocator>::destroy(
 991                 __alloc, std::__addressof(*__orig))))

我们接下来只用construct来说明noexcept的结果：
（1）std::allocator

/usr/include/c++/11/bits/alloc_traits.h
509       template<typename _Up, typename... _Args>
510     static _GLIBCXX20_CONSTEXPR void
511     construct(allocator_type& __a __attribute__((__unused__)), _Up* __p,
512           _Args&&... __args)
513     noexcept(std::is_nothrow_constructible<_Up, _Args...>::value)
514     {
515 #if __cplusplus <= 201703L
516       __a.construct(__p, std::forward<_Args>(__args)...);

noexcept(std::is_nothrow_constructible<int,int>::value) 为true，故std::allocator此处noexcept为true
（2）pmr

/usr/include/c++/11/bits/alloc_traits.h
│   357               */                                                                      
│   358               template<typename _Tp, typename... _Args>                               
│   359                 static _GLIBCXX20_CONSTEXPR auto                                      
│   360                 construct(_Alloc& __a, _Tp* __p, _Args&&... __args)                   
│   361                 noexcept(noexcept(_S_construct(__a, __p,                              
│   362                                                std::forward<_Args>(__args)...)))      
│   363                 -> decltype(_S_construct(__a, __p, std::forward<_Args>(__args)...))   
│  >364                 { _S_construct(__a, __p, std::forward<_Args>(__args)...); }           
│   365                                                             

 247               template<typename _Tp, typename... _Args>                                  
│   248                 static _GLIBCXX14_CONSTEXPR _Require<__has_construct<_Tp, _Args...>>  
│   249                 _S_construct(_Alloc& __a, _Tp* __p, _Args&&... __args)                
│   250                 noexcept(noexcept(__a.construct(__p, std::forward<_Args>(__args)...)))
│  >251                 { __a.construct(__p, std::forward<_Args>(__args)...); }      

/usr/include/c++/11/memory_resource
240     #if ! __cpp_lib_make_obj_using_allocator                                      
241        template<typename _Tp1, typename... _Args>                              
242        __attribute__((__nonnull__))                                          
243        typename __not_pair<_Tp1>::type                                       
244        construct(_Tp1* __p, _Args&&... __args)                               
245       {   
246       // _GLIBCXX_RESOLVE_LIB_DEFECTS
247       // 2969. polymorphic_allocator::construct() shouldn't pass resource()
248       using __use_tag
249         = std::__uses_alloc_t<_Tp1, polymorphic_allocator, _Args...>;
250       if constexpr (is_base_of_v<__uses_alloc0, __use_tag>)
251         ::new(__p) _Tp1(std::forward<_Args>(__args)...);
252       else if constexpr (is_base_of_v<__uses_alloc1_, __use_tag>)
253         ::new(__p) _Tp1(allocator_arg, *this,
254                 std::forward<_Args>(__args)...);
255       else
256         ::new(__p) _Tp1(std::forward<_Args>(__args)..., *this);
257     }

一路追踪下来发现取决于polymorphic_allocator::construct没有noexcept specifier，故pmr情况下__relocate_object_a为un-noexcept.

5.2.2.2 __relocate_a_1特例

不过除此之外__relocate_a_1还有一个特例：

1000   template<typename _Tp, typename = void>
1001     struct __is_bitwise_relocatable
1002     : is_trivial<_Tp> { }; 
1003
1004   template <typename _Tp, typename _Up>
1005     inline __enable_if_t<std::__is_bitwise_relocatable<_Tp>::value, _Tp*>
1006     __relocate_a_1(_Tp* __first, _Tp* __last,
1007            _Tp* __result, allocator<_Up>&) noexcept
1008     {
1009       ptrdiff_t __count = __last - __first;
1010       if (__count > 0)
1011     __builtin_memmove(__result, __first, __count * sizeof(_Tp));
1012       return __result + __count;
1013     }

如果_Tp（我的例子里是int）是trivial的且分配器是std::allocator，则__relocate_a_1是noexcept的，则走_S_relocate的分支（不走__uninitialized_move_if_noexcept_a）

5.3 再次判断是否使用memmove - _S_use_relocate后

上面两条捋了一遍_S_use_relocate()的结果，但并不是它是true就一定用memmove，还要继续判断。

/usr/include/c++/11/bits/stl_vector.h
 452       static pointer
 453       _S_do_relocate(pointer __first, pointer __last, pointer __result,
 454              _Tp_alloc_type& __alloc, true_type) noexcept
 455       {
 456     return std::__relocate_a(__first, __last, __result, __alloc);
 457       }
 458
 459       static pointer
 460       _S_do_relocate(pointer, pointer, pointer __result,
 461              _Tp_alloc_type&, false_type) noexcept
 462       { return __result; }
 463
 464       static pointer
 465       _S_relocate(pointer __first, pointer __last, pointer __result,
 466           _Tp_alloc_type& __alloc) noexcept
 467       {
 468     using __do_it = __bool_constant<_S_use_relocate()>;
 469     return _S_do_relocate(__first, __last, __result, __alloc, __do_it{});
 470       }

459行永远也走不到，因为_S_use_relocate()位true才会调用到这，而其值为true则一定匹配452行的函数特化版本。
__relocate_a最终调用到__relocate_a_1，上面提到过它有两个版本：
只有_Tp是trivial 且用std::allocator 才会调用memmove。

1004   template <typename _Tp, typename _Up>
1005     inline __enable_if_t<std::__is_bitwise_relocatable<_Tp>::value, _Tp*>
1006     __relocate_a_1(_Tp* __first, _Tp* __last,
1007            _Tp* __result, allocator<_Up>&) noexcept
1008     {
1009       ptrdiff_t __count = __last - __first;
1010       if (__count > 0)
1011     __builtin_memmove(__result, __first, __count * sizeof(_Tp));
1012       return __result + __count;
1013     }
1014
1015   template <typename _InputIterator, typename _ForwardIterator,
1016         typename _Allocator>
1017     inline _ForwardIterator
1018     __relocate_a_1(_InputIterator __first, _InputIterator __last,
1019            _ForwardIterator __result, _Allocator& __alloc)
1020     noexcept(noexcept(std::__relocate_object_a(std::addressof(*__result),
1021                            std::addressof(*__first),
1022                            __alloc)))
1023     {
1024       typedef typename iterator_traits<_InputIterator>::value_type
1025     _ValueType;
1026       typedef typename iterator_traits<_ForwardIterator>::value_type
1027     _ValueType2;
1028       static_assert(std::is_same<_ValueType, _ValueType2>::value,
1029       "relocation is only possible for values of the same type");
1030       _ForwardIterator __cur = __result;
1031       for (; __first != __last; ++__first, (void)++__cur)

综上所述，竟然只有一种情况用memmove: 只有_Tp是trivial 且用std::allocator 才会调用memmove。

6. 看一个简单的class的例子

上面我用的是int，下面用一个简单的类看看，验证下上面的流程图。
我就不分析了，大家执行代码看结果来理解吧。

#include<iostream>
#include<memory_resource>
#include<vector>
#include "../PerfSum.hpp"
using namespace std;

struct MyClass{
        MyClass(int _i):i(_i) {}
        int i;
};

void TestPmrVec(){
    char buffer[1000000*4] = {0};
    std::pmr::monotonic_buffer_resource pool{
        std::data(buffer), std::size(buffer)
    };
    std::pmr::vector<MyClass> vec{&pool};
    PerfSum t;
     for(int i=0;i<1000000;i++){
         vec.push_back(MyClass{i});
     }
     std::cout<<"End"<<std::endl;
 }

 void TestStdVec(){
     std::vector<MyClass> vec ;
     PerfSum t;
     for(int i=0;i<1000000;i++){
         vec.push_back(MyClass{i});
     }
     std::cout<<"End"<<std::endl;
 }

int main() {
        std::cout<<"is_move_constructible<MyClass>: "<<std::is_move_constructible_v<MyClass><<std::endl;
        std::cout<<"is_nothrow_constructible<MyClass>: "<<std::is_nothrow_constructible_v<MyClass,MyClass&&><<std::endl;
        std::cout<<"is_nothrow_destructible<MyClass>: "<<std::is_nothrow_destructible_v<MyClass><<std::endl;
        std::cout<<"trivail<MyClass>: "<<std::is_trivial_v<MyClass><<std::endl;

    std::cout<<"std vector cost:"<<std::endl;
    TestStdVec();
    std::cout<<"pmr vector cost:"<<std::endl;
    TestPmrVec();
}

另一个例子：

    using Pa1 = std::pmr::polymorphic_allocator<int> ;
    Pa1 pa1;
    bool imi1 = std::__is_move_insertable<Pa1>{};
    int itemp1;
    bool c1 =   noexcept(std::allocator_traits<Pa1>::construct(pa1,&itemp1,10));
    bool d1 =   noexcept(std::allocator_traits<Pa1>::destroy(pa1,std::__addressof(itemp1)));
    std::cout<<std::boolalpha<<imi1<<c1<<d1<<std::endl;

    using Pa2 = std::allocator<int>;
    Pa2 pa2;
    bool imi2 = std::__is_move_insertable<Pa2>{};
    int itemp2;
    bool c2 =   noexcept(std::allocator_traits<Pa2>::construct(pa2,&itemp2,20));
    bool d2 =   noexcept(std::allocator_traits<Pa2>::destroy(pa2,std::__addressof(itemp2)));
    std::cout<<std::boolalpha<<imi2<<c2<<d2<<std::endl;

$ ./a.out
truefalsefalse
truetruetrue

7. release版本的差距没那么大

我们废了很大的经历才捋明白何时用memmove何时不用，而且debug版本之间的性能差距达3倍之多，确实值得我们调查一番。但令人失望又惊喜的是：release版本的性能差距竟然只有1.1倍左右：

[mzhai@std_polymorphic_pmr]$ g++ compare_speed.cpp -std=c++17 -O
std vector cost:
End
 took 5349 microseconds.
pmr vector cost:
End
 took 6207 microseconds.
[mzhai@std_polymorphic_pmr]$ ./a.out
std vector cost:
End
 took 4822 microseconds.
pmr vector cost:
End
 took 5160 microseconds.

不由得感叹：现在的编译器真厉害！