Writing a Copy-On-Write Holder Template

By Liu, Weili



This article presents some in-depth research on the Copy-On-Write (COW) technique, based on Herb Sutter’s book. It extends the discussion of the COW string to a more general COW holder, points out some pitfalls a beginner may fall into, and discusses the COW holder in a multithreaded world.


1. Background

Copy-On-Write (COW for short; some call the same concept “lazy copy”) is a trick to make your program run faster. It avoids unnecessary data copies while retaining data consistency. The technique is commonly used in string types. Herb Sutter describes a COW string type in his book [1]. In this article, I will describe a compact COW holder that works not only for strings but also for other types and classes.

Let’s start with the idea of COW. In some cases, the contents of an object may change inside a function. To ensure the modification does not affect the original data, programmers typically copy a new instance from the old one. If the function is called frequently and the object is larger than a few bytes, the copy operation can greatly reduce performance. COW solves this problem. As long as the data is not modified, we do not copy anything; we just pass a reference to the old instance. Only when the data is about to be modified do we create a new instance by copying, before the modification takes place.
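The idea can be tried out before writing any holder class. The following is only a minimal sketch under stated assumptions: it uses std::shared_ptr from the standard library as the reference-counted handle instead of the holder developed in this article, and WritableCopy and CowIdeaDemo are hypothetical names chosen for illustration. It is meant for single-threaded use only.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical helper, not part of the article's holder: returns a pointer
// that is safe to write through, cloning the string first if it is shared.
std::string* WritableCopy(std::shared_ptr<std::string>& p)
{
    if (p.use_count() > 1) {                      // another handle still reads it
        p = std::make_shared<std::string>(*p);    // copy before the write
    }
    return p.get();
}

// Returns true if a write through one handle leaves the other untouched.
bool CowIdeaDemo()
{
    std::shared_ptr<std::string> a = std::make_shared<std::string>("hello");
    std::shared_ptr<std::string> b = a;           // "copy": only a reference
    assert(a.get() == b.get());                   // same storage, no copy yet
    WritableCopy(b)->append(" world");            // first write triggers the copy
    return *a == "hello" && *b == "hello world" && a.get() != b.get();
}
```

Reading through either handle never copies; only the first write through a shared handle pays for the copy.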

I don’t want to talk about the COW string, because Herb Sutter has already given a good solution (though it may not be the best). I just want to write a general Copy-On-Write template that can hold user-defined classes.


2. Start from the beginning

The secret of COW is that it uses a shared buffer. The shared buffer holds the data (of course), a reference count and other information (I will talk about it later):

template<class X>
class  CopyOnWriteBuffer
{
public:
    CopyOnWriteBuffer();             //Create the buffer with new data
    ~CopyOnWriteBuffer();            //Release the buffer
    CopyOnWriteBuffer(const X&);     //Create a buffer from existing data.

    int      RefCount;               //Reference count
    X*       Data;                   //The actual data object.

private:
    //disable copying the buffer
    CopyOnWriteBuffer (const CopyOnWriteBuffer &);
    CopyOnWriteBuffer & operator=(const CopyOnWriteBuffer &);
};



We need a COW holder to hold the buffer. It also manages the buffer and provides an interface to access the buffer. The holder should be “smart”. It knows when to copy.

template<class X>
class CopyOnWriteHolder
{
public:
    CopyOnWriteHolder();                               //Create an empty holder
    CopyOnWriteHolder(const X&);                       //Create from existing data
    CopyOnWriteHolder(const CopyOnWriteHolder<X>&);    //Copy constructor
    ~CopyOnWriteHolder();                              //Release buffer if RefCount==0

    //……other functions to add later

    const X* GetConstData()const;                      //Get a const pointer to the data
    X* GetData();                                      //Get a writable pointer to the data

    CopyOnWriteBuffer<X>* Data;                        //The buffer
};





3. Adding functions

We add the constructors and destructors first. The code is simple and self-explanatory.



template<class X>
CopyOnWriteBuffer<X>::CopyOnWriteBuffer(): RefCount(1) , Data(new X){}

template<class X>
CopyOnWriteBuffer<X>::~CopyOnWriteBuffer()
{
    delete Data;
}

template<class X>
CopyOnWriteBuffer<X>::CopyOnWriteBuffer(const X& data): RefCount(1)
{
    Data = new X(data);
}

template<class X>
CopyOnWriteHolder<X>::~CopyOnWriteHolder()
{
    if (Data && (--Data->RefCount) < 1){   //last holder releases the buffer
        delete Data;
    }
}

template<class X>
CopyOnWriteHolder<X>::CopyOnWriteHolder()
{
    Data = NULL;
}

template<class X>
CopyOnWriteHolder<X>::CopyOnWriteHolder(const X& data)
{
    Data = new CopyOnWriteBuffer<X>(data);
}



I use new X(data) to copy-construct the input data. For the copy to be correct, the user-defined class X must have a copy constructor that copies its contents correctly.

Then I will add the copy-constructor for CopyOnWriteHolder:


template<class X>
CopyOnWriteHolder<X>::CopyOnWriteHolder (const CopyOnWriteHolder<X>& holder)
    : Data(NULL)                         //stay empty if holder is empty
{
    if (this != &holder && holder.Data){ //avoid copying itself
        Data = holder.Data;              //set the buffer
        ++Data->RefCount;                //increase reference count
    }
}



The function const X* GetConstData() const leaves the data intact, so no copy happens. X* GetData() assumes that the data will be modified, so it copies the buffer if the buffer is shared.

template<class X>
X* CopyOnWriteHolder<X>::GetData()
{
    if (Data==NULL){
        return NULL;
    }
    if (Data->RefCount > 1){
        CopyOnWriteBuffer<X>* newdata = new CopyOnWriteBuffer<X>(*Data->Data);
        --Data->RefCount; //copy is done, decrease the reference count
        Data = newdata;   //set the data to the new data
    }
    return Data->Data;
}

template<class X>
const X* CopyOnWriteHolder<X>::GetConstData()const
{
    return Data?Data->Data:NULL;
}


OK. Now a simple COW holder is done. We can use it to do the job.
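To see the holder at work, here is a compact, self-contained sketch that condenses the classes from sections 2 and 3 and exercises them. It is an illustration rather than the definitive implementation: the member functions are written inline, the constructors are marked explicit, operator = is declared private because it has not been introduced yet, and SimpleHolderDemo is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

template<class X>
struct CopyOnWriteBuffer {
    explicit CopyOnWriteBuffer(const X& data) : RefCount(1), Data(new X(data)) {}
    ~CopyOnWriteBuffer() { delete Data; }
    int RefCount;                                  // number of holders sharing Data
    X*  Data;                                      // the actual object
private:
    CopyOnWriteBuffer(const CopyOnWriteBuffer&);   // copying disabled
    CopyOnWriteBuffer& operator=(const CopyOnWriteBuffer&);
};

template<class X>
class CopyOnWriteHolder {
public:
    explicit CopyOnWriteHolder(const X& data) : Data(new CopyOnWriteBuffer<X>(data)) {}
    CopyOnWriteHolder(const CopyOnWriteHolder& holder) : Data(holder.Data) {
        if (Data) ++Data->RefCount;                // share, do not copy
    }
    ~CopyOnWriteHolder() {
        if (Data && --Data->RefCount < 1) delete Data;
    }
    const X* GetConstData() const { return Data ? Data->Data : NULL; }
    X* GetData() {
        if (Data == NULL) return NULL;
        if (Data->RefCount > 1) {                  // shared: copy before writing
            CopyOnWriteBuffer<X>* newdata = new CopyOnWriteBuffer<X>(*Data->Data);
            --Data->RefCount;
            Data = newdata;
        }
        return Data->Data;
    }
private:
    CopyOnWriteHolder& operator=(const CopyOnWriteHolder&);  // not used in this sketch
    CopyOnWriteBuffer<X>* Data;
};

// Returns true if the two holders share data until the first write.
bool SimpleHolderDemo()
{
    CopyOnWriteHolder<std::string> cow1(std::string("abc"));
    CopyOnWriteHolder<std::string> cow2(cow1);
    bool sharedBefore = cow1.GetConstData() == cow2.GetConstData();
    cow2.GetData()->append("def");                 // triggers the copy
    bool separateAfter = cow1.GetConstData() != cow2.GetConstData();
    return sharedBefore && separateAfter
        && *cow1.GetConstData() == "abc"
        && *cow2.GetConstData() == "abcdef";
}
```

Calling SimpleHolderDemo() from main() shows that the copy constructor only shares the buffer and that the first GetData() on a shared holder performs the real copy.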


4. Improving the Copy-On-Write holder

Although the above COW holder is complete and usable, we can add more functionality to it.

First of all, we would like to write “holder2 = holder1;” instead of “holder2 = *(new CopyOnWriteHolder<X> (holder1))”. If we don’t overload operator =, the compiler-generated assignment only copies the buffer pointer and never touches the reference counts, so the simple assignment cannot be used safely. Besides, if operator = is not overloaded correctly, both forms are buggy (why? I will discuss it later). So it’s necessary to overload operator =.

First, we introduce a simple operator = function.

template<class X>
CopyOnWriteHolder<X>& CopyOnWriteHolder<X>::operator =(const CopyOnWriteHolder<X>& holder)
{
    if (this->Data  != holder.Data  && holder.Data){   //avoid copying the same data
        Data = holder.Data;                            //set the buffer
        ++Data->RefCount;                              //increase reference count
    }
    return *this;
}


We can now put all the above code together. The COW holder seems to work well. In fact, it is buggy. There are two bugs in the following code:


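CopyOnWriteHolder<string> cow1("a");
CopyOnWriteHolder<string> cow2("b");
CopyOnWriteHolder<string> cow3("c");
CopyOnWriteHolder<string> cow4(cow3);   //the "c" buffer is shared: RefCount==2

//memory leak!!!!!
cow2 = cow1; //or cow2=*(new CopyOnWriteHolder<string>(cow1)); will do the same
             //the "b" buffer is never released

//Incorrect reference count
cow4 = cow1; //the "c" buffer still has RefCount==2, although only cow3 uses it now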


It’s clear that the first bug is a memory leak. It appears whenever you assign one COW holder to another, because the old CopyOnWriteBuffer is never released. The other bug is related to the first one: the reference count of the old CopyOnWriteBuffer must be updated when the assignment happens. We rewrite the overloaded operator =:

template<class X>
CopyOnWriteHolder<X>& CopyOnWriteHolder<X>::operator =(const CopyOnWriteHolder<X>& holder)
{
    if (this->Data  != holder.Data  && holder.Data){   //avoid copying the same data
        this->~CopyOnWriteHolder();                    //release the old buffer: solves both bugs
        Data = holder.Data;                            //set the buffer
        ++Data->RefCount;                              //increase reference count
    }
    return *this;
}


From the above samples, we can see that the COW string in [1] is buggy because it does not overload operator =. A line of code such as “cow2 = *(new String(cow1));” will result in a memory leak.
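The rewritten operator = releases the old buffer by calling the destructor on the live object. Because an explicit destructor call formally ends the object's lifetime, some readers may prefer to move the release logic into a private helper and call that instead. The following is a minimal sketch of this variation, not the article's own code: Buffer and Holder are shortened stand-ins for CopyOnWriteBuffer and CopyOnWriteHolder, and Release(), UseCount() and AssignmentDemo() are hypothetical names. Unlike the version above, this sketch also lets an empty holder be assigned.

```cpp
#include <cassert>
#include <cstddef>

template<class X>
struct Buffer {                        // stand-in for CopyOnWriteBuffer<X>
    explicit Buffer(const X& data) : RefCount(1), Data(new X(data)) {}
    ~Buffer() { delete Data; }
    int RefCount;
    X*  Data;
private:
    Buffer(const Buffer&);             // copying disabled
    Buffer& operator=(const Buffer&);
};

template<class X>
class Holder {                         // stand-in for CopyOnWriteHolder<X>
public:
    explicit Holder(const X& data) : Data(new Buffer<X>(data)) {}
    Holder(const Holder& holder) : Data(holder.Data) { if (Data) ++Data->RefCount; }
    ~Holder() { Release(); }

    Holder& operator=(const Holder& holder) {
        if (Data != holder.Data) {     // avoid re-assigning the same buffer
            Release();                 // drop our reference to the old buffer
            Data = holder.Data;
            if (Data) ++Data->RefCount;
        }
        return *this;
    }
    int UseCount() const { return Data ? Data->RefCount : 0; }  // for the demo only
    const X* GetConstData() const { return Data ? Data->Data : NULL; }
private:
    void Release() {                   // hypothetical helper shared with the destructor
        if (Data && --Data->RefCount < 1) delete Data;
        Data = NULL;
    }
    Buffer<X>* Data;
};

// Returns true if the reference counts stay correct across assignments.
bool AssignmentDemo()
{
    Holder<int> cow1(1);
    Holder<int> cow2(2);
    Holder<int> cow3(cow2);            // the "2" buffer is shared: count 2
    cow3 = cow1;                       // "2" buffer drops to 1, "1" buffer rises to 2
    bool ok = cow2.UseCount() == 1 && cow1.UseCount() == 2;
    cow2 = cow1;                       // "2" buffer reaches 0 and is deleted
    cow2 = cow1;                       // same buffer already: nothing changes
    return ok && cow1.UseCount() == 3 && *cow2.GetConstData() == 1;
}
```

Both the destructor and operator = go through the same helper, so the decrement-and-delete rule lives in one place.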

Now we can use operator = to assign one COW holder to another. But there is another problem for us. See the following code:

CopyOnWriteHolder<int> cow1(1);
int *pInt = cow1.GetData();
CopyOnWriteHolder<int> cow2(cow1);
*pInt=2;//oops! We change cow2 too


The problem is that once we have handed out a writable pointer from cow1, the COW holder cannot know when the content of cow1 will change. A solution is to mark cow1 unsharable [1]. Afterwards, any copy-construction or assignment from cow1 actually copies the content instead of only passing a reference. To reduce overhead, the unsharable marker is embedded in RefCount. The modified code follows:

template<class X>
X* CopyOnWriteHolder<X>::GetData()
{
    if (Data==NULL){
        return NULL;
    }
    if (Data->RefCount > 1 && Data->RefCount != UNSHARABLE){
        CopyOnWriteBuffer<X>* newdata = new CopyOnWriteBuffer<X>(*Data->Data);
        --Data->RefCount; //copy is done, decrease the reference count
        Data = newdata;   //set the data to the new data
    }
    Data->RefCount = UNSHARABLE;  //a writable pointer is handed out from now on
    return Data->Data;
}


template<class X>
CopyOnWriteHolder<X>& CopyOnWriteHolder<X>::operator =(const CopyOnWriteHolder<X>& holder)
{
    if (this->Data!= holder.Data && holder.Data){      //avoid copying the same data
        this->~CopyOnWriteHolder();                    //destroy old data
        if (holder.Data->RefCount == UNSHARABLE){      //copy if unsharable
            Data = new CopyOnWriteBuffer<X>(*holder.Data->Data);
        }else{
            Data = holder.Data;                        //set the buffer
            ++Data->RefCount;                          //increase reference count
        }
    }
    return *this;
}


template<class X>
CopyOnWriteHolder<X>::CopyOnWriteHolder (const CopyOnWriteHolder<X>& holder)
    : Data(NULL)                                  //stay empty if holder is empty
{
    if (this != &holder && holder.Data){          //avoid copying itself
        if (holder.Data->RefCount == UNSHARABLE){ //copy if unsharable
            Data = new CopyOnWriteBuffer<X>(*holder.Data->Data);
        }else{
            Data = holder.Data;                   //set the buffer
            ++Data->RefCount;                     //increase reference count
        }
    }
}




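template<class X>
CopyOnWriteHolder<X>::~CopyOnWriteHolder()
{
    // order of the condition in the following if is very important:
    // an unsharable buffer has exactly one owner and is deleted directly,
    // a shared buffer is deleted only when its count drops to zero
    if (Data && (Data->RefCount == UNSHARABLE || (--Data->RefCount) < 1)){
        delete Data;
    }
}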



You can make UNSHARABLE a macro definition or a static constant. It is best set to the minimum value of int (INT_MIN).
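The effect of the unsharable marker can be checked with a small self-contained sketch. It rests on several assumptions: UNSHARABLE is defined as INT_MIN from <climits>, Buffer and Holder are shortened stand-ins for CopyOnWriteBuffer and CopyOnWriteHolder, operator = is left out, and UnsharableDemo is a hypothetical name. The destructor treats an unsharable buffer as exclusively owned and deletes it without touching the count.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

const int UNSHARABLE = INT_MIN;        // assumed value: the minimum of int

template<class X>
struct Buffer {                        // stand-in for CopyOnWriteBuffer<X>
    explicit Buffer(const X& data) : RefCount(1), Data(new X(data)) {}
    ~Buffer() { delete Data; }
    int RefCount;                      // share count, or UNSHARABLE
    X*  Data;
private:
    Buffer(const Buffer&);
    Buffer& operator=(const Buffer&);
};

template<class X>
class Holder {                         // stand-in for CopyOnWriteHolder<X>
public:
    explicit Holder(const X& data) : Data(new Buffer<X>(data)) {}
    Holder(const Holder& holder) : Data(NULL) {
        if (holder.Data == NULL) return;
        if (holder.Data->RefCount == UNSHARABLE) {
            Data = new Buffer<X>(*holder.Data->Data);   // real copy
        } else {
            Data = holder.Data;                         // share
            ++Data->RefCount;
        }
    }
    ~Holder() {
        // an unsharable buffer has a single owner, so it is deleted directly
        if (Data && (Data->RefCount == UNSHARABLE || --Data->RefCount < 1)) delete Data;
    }
    X* GetData() {
        if (Data == NULL) return NULL;
        if (Data->RefCount > 1) {      // shared (UNSHARABLE is negative, never > 1)
            Buffer<X>* newdata = new Buffer<X>(*Data->Data);
            --Data->RefCount;
            Data = newdata;
        }
        Data->RefCount = UNSHARABLE;   // a writable pointer escapes from now on
        return Data->Data;
    }
    const X* GetConstData() const { return Data ? Data->Data : NULL; }
private:
    Holder& operator=(const Holder&);  // omitted in this sketch
    Buffer<X>* Data;
};

// Returns true if writing through pInt no longer affects the later copy.
bool UnsharableDemo()
{
    Holder<int> cow1(1);
    int* pInt = cow1.GetData();        // cow1 is now marked unsharable
    Holder<int> cow2(cow1);            // therefore a real copy is made
    *pInt = 2;                         // changes cow1 only
    return *cow1.GetConstData() == 2 && *cow2.GetConstData() == 1;
}
```

Because INT_MIN is negative, the ordinary "RefCount > 1" test never mistakes the marker for a shared buffer.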

Like other C++ classes, the COW holder cannot make the following code safe:

CopyOnWriteHolder<int> cow1(1);
int *pInt = cow1.GetData();
CopyOnWriteHolder<int> cow2(2);
cow1 = cow2; //cow1 releases its old buffer
*pInt=1;     //oops! The data that pInt points to is actually deleted!!!

This problem is common to all C++ programs; STL iterators can be invalidated in the same way. Solving it would require garbage collection, which I don’t want to discuss because it is not what this article is about.


5. The multithreaded world

It is clear that this COW holder is not thread-safe. If more than one thread shares the same CopyOnWriteBuffer, the reference count can easily go wrong. The first reason is that the increment and decrement operations are not atomic. If one thread performs an access (get value, set value, increment or decrement) between the beginning and the end of another thread’s access to the same object, the result may not be what the user expects (for one or both threads). We need synchronization techniques to protect the operations on the reference count, although these techniques are time-consuming.

Here is a brief example of how a simple “a++” goes wrong in a multithreaded environment. The actual assembly code may differ from this example, depending on the compiler and the target machine.

//The actual assembly code of a++ may look like the following
mov eax,dword ptr [a]
add eax,1
//If another thread runs here and changes a, its change is lost when the following line executes.
mov dword ptr [a],eax

The same problem exists for decrement, assignment, read and comparison. So it is necessary to introduce atomic integer operations.
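On a compiler with C++11 support, atomic integer helpers can be sketched portably with std::atomic. This is a swapped-in technique, not the hand-written assembly this article uses. The names IntAtomicInc, IntAtomicDec and IntAtomicGet follow the article, but the signatures below (taking std::atomic<int>&) are assumptions made for this sketch, and AtomicHelpersDemo is a hypothetical name. Increment and decrement return the new value.

```cpp
#include <atomic>
#include <cassert>

// Assumed signatures: std::atomic<int> replaces the plain int counter here.
// fetch_add/fetch_sub return the old value, so the new value is old +/- 1.
inline int IntAtomicInc(std::atomic<int>& value) { return value.fetch_add(1) + 1; }
inline int IntAtomicDec(std::atomic<int>& value) { return value.fetch_sub(1) - 1; }
inline int IntAtomicGet(const std::atomic<int>& value) { return value.load(); }

// Single-threaded check of the return-value convention.
bool AtomicHelpersDemo()
{
    std::atomic<int> refCount(1);
    bool ok = IntAtomicInc(refCount) == 2;      // 1 -> 2, returns the new value
    ok = ok && IntAtomicGet(refCount) == 2;
    ok = ok && IntAtomicDec(refCount) == 1;     // 2 -> 1
    ok = ok && IntAtomicDec(refCount) == 0;     // 1 -> 0: the last owner would delete
    return ok;
}
```

Each call is a single indivisible read-modify-write, so no update from another thread can be lost between the load and the store.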

Are atomic operations enough? Definitely not. Threads can interleave anywhere, and certain interleavings result in a wrong reference count and a memory leak.

Here is an example to show how the above code goes wrong in a multithreaded environment, regardless of whether you use atomic integer operations or not.

//In the following code, cow1 and cow2 share the same data, so RefCount==2
CopyOnWriteHolder<int> cow1(1);

//thread 1
int* p=cow1.GetData();

//thread 2
void f()
{
    CopyOnWriteHolder<int> cow2(cow1);
    //cow2 will be destroyed before this function returns, so ~CopyOnWriteHolder() is called
}

If the following sequence is the actual execution sequence, a memory leak will happen:

//Thread 1. RefCount==2 at the beginning
if (Data->RefCount > 1 && Data->RefCount != UNSHARABLE){
    CopyOnWriteBuffer<X>* newdata = new CopyOnWriteBuffer<X>(*Data->Data);

//Thread 2 execution injects here
if (Data && (Data->RefCount == UNSHARABLE || (--Data->RefCount) < 1)){
    delete Data; //RefCount==1, so this line will not be executed.
}

//Thread 1 goes on executing
    --Data->RefCount; //RefCount==0!!! And the Data is not released!!!!
    Data = newdata;    //memory leak
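The sequence above can be replayed deterministically on a single thread, which makes the lost release easy to see. This is only a sketch: FakeBuffer is a hypothetical stand-in that records the count and whether a delete would have happened, and ReplayLeaksBuffer is a hypothetical name.

```cpp
#include <cassert>

// Stand-in for a shared CopyOnWriteBuffer: only the fields the replay needs.
struct FakeBuffer {
    int  RefCount;
    bool Deleted;
};

// Replays the interleaving step by step and reports whether the buffer ends
// with RefCount==0 without ever being deleted (the leak).
bool ReplayLeaksBuffer()
{
    FakeBuffer shared = { 2, false };   // cow1 and cow2 share it

    // Thread 1, GetData(): sees a shared buffer and decides to copy.
    bool thread1Copies = shared.RefCount > 1;

    // Thread 2, destructor of cow2 runs in between.
    if (--shared.RefCount < 1) {
        shared.Deleted = true;          // not reached: RefCount is 1 here
    }

    // Thread 1 resumes: drops its reference and switches to the new buffer.
    if (thread1Copies) {
        --shared.RefCount;              // RefCount is 0, but nobody deletes
    }
    return shared.RefCount == 0 && !shared.Deleted;
}
```

The function returns true: each thread decrements once, neither observes itself as the last owner at the moment it checks, and the old buffer is orphaned.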

To solve this problem, we need to make sure the reference count is still greater than zero before switching to the new data. Combined with atomic integer operations, we have the following revised code:

template<class X>
X* CopyOnWriteHolder<X>::GetData()
{
    if (Data==NULL){
        return NULL;
    }
    int ref=IntAtomicGet(Data->RefCount);
    if (ref > 1 && ref != UNSHARABLE){
        CopyOnWriteBuffer<X>* newdata = new CopyOnWriteBuffer<X>(*Data->Data);
        if (IntAtomicDec(Data->RefCount) < 1){
            delete newdata;    //we became the only owner meanwhile: keep the old buffer
        }else{
            Data =newdata;
        }
    }
    Data->RefCount = UNSHARABLE;
    return Data->Data;
}


template<class X>
CopyOnWriteHolder<X>& CopyOnWriteHolder<X>::operator =(const CopyOnWriteHolder<X>& holder)
{
    if (this->Data!= holder.Data && holder.Data){       //avoid copying the same data
        this->~CopyOnWriteHolder();                     //destroy old data
        if (IntAtomicGet(holder.Data->RefCount) == UNSHARABLE){      //copy if unsharable
            Data = new CopyOnWriteBuffer<X>(*holder.Data->Data);
        }else{
            Data = holder.Data;                         //set the buffer
            IntAtomicInc(Data->RefCount);               //increase reference count
        }
    }
    return *this;
}


template<class X>
CopyOnWriteHolder<X>::CopyOnWriteHolder (const CopyOnWriteHolder<X>& holder)
    : Data(NULL)                                   //stay empty if holder is empty
{
    if (this != &holder && holder.Data){           //avoid copying itself
        if (IntAtomicGet(holder.Data->RefCount) == UNSHARABLE){ //copy if unsharable
            Data = new CopyOnWriteBuffer<X>(*holder.Data->Data);
        }else{
            Data = holder.Data;                    //set the buffer
            IntAtomicInc(Data->RefCount);          //increase reference count
        }
    }
}




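template<class X>
CopyOnWriteHolder<X>::~CopyOnWriteHolder()
{
    // order of the condition in the following if is very important:
    // an unsharable buffer has exactly one owner and is deleted directly,
    // a shared buffer is deleted only when its count drops to zero
    if (Data && (IntAtomicGet(Data->RefCount) == UNSHARABLE || IntAtomicDec(Data->RefCount) < 1)){
        delete Data;
    }
}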




6. Performance

Using a COW holder is not free; every operation costs some CPU time. So it is necessary to compare the performance of using the object directly with that of using a COW holder.

Unlike with a COW string, I don’t know what kind of object the user wants to hold, so it is impossible to judge the overall performance of the COW holder. I chose to compare individual operations instead. The tests were built with the default release options of the VC 7 compiler. The test machine has an Intel Pentium D 2.66GHz CPU with 1GB of DDR2 memory. See the Appendix for the test code. The test results are in the following table. In the Average column, I rounded the average time up. Every test function repeats its operation 100,000,000 times.












[Table: timing results for the comparison function f0() and the test functions t1f1(), t1f2(), t1f3(), t2f1(), t2f2(), t3f1(), t3f2(), t3f3() and t3f4()]

Test 0 is for comparison. Its loop is written so that the compiler cannot optimize it away completely.

Tests 1-X measure the performance of integer increment.

Test 1-1 tests the InterlockedIncrement function provided by Windows.

Test 1-2 tests the IntAtomicInc function written by myself.

Test 1-3 tests the ++ operator of C++.

Comparing Test 1-1 with Test 1-2 shows that my code is a bit faster than the Windows function. The results of Test 1-3 and Test 0 are not what I expected; I think it is mainly because the compiler optimizes the ++ operator. So I decided to use the ++ operator in Test 3-X to see its performance in a real case.

Tests 2-X measure the performance of getting a writable pointer to the data the COW holder contains.

Test 2-1 gets the pointer directly, as a member variable.

Test 2-2 runs the complete logic, but still avoids copying the data.

We can see that getting the pointer directly is about 4 times faster.

Tests 3-X compare the assignment operations. Because the copy-constructor is much the same as operator =, I only test operator =.

Test 3-1 is the default operator = of the COW holder. I use 3 assignments in the loop because I need the assignment to really take place (if both operands already share the same data, the operation is only one comparison).

Test 3-2 is a complete copy of the data, which is an int in this test (4 bytes).

Test 3-3 is a complete copy of the data, which is a 200-byte structure.

Test 3-4 is the operator = of the COW holder without the IntAtomicInc function.

The results of Test 3-1, Test 3-2 and Test 3-3 show that operator = consumes much more time than memcpy() when the data is small. When the data size increases to about 200 bytes, their performance is very close. But it is clear that if the two operands already share the same data, the assignment is much faster. I don’t know the actual ratio between assigning the same data, assigning different data and copying the data, because it is case dependent. In Test 3-4, I was surprised to see that performance increases by nearly 100% without IntAtomicInc().


7. Conclusion

If the object you want to hold in a COW holder is small, it’s better to use a plain copy instead. In my experiment, an object larger than 200 bytes, or one with a complex copy function, is worth holding in a COW holder. If you know that your program runs in a single-threaded environment, or you can ensure that the COW holder will not be shared between threads, you can use the single-thread version in section 4 for better performance.


Appendix: Test code

The following are some helper functions I used in the test. They help to compare raw copy assignment with the assignment function of the COW holder. The assembly code was written by me with some optimization; you need some knowledge of IA-32 assembly language to read it.

template<class X>
X* CopyOnWriteHolder<X>::GetData1()
{
    return Data->Data;
}

template<class X>
void CopyOnWriteHolder<X>::Assign(const CopyOnWriteHolder<X>& holder)
{
    memcpy(Data->Data,holder.Data->Data,sizeof(X)); //pure memory copy
}

template<class X>
void CopyOnWriteHolder<X>::Assign1(const CopyOnWriteHolder<X>& holder)
{
    if (this->Data!= holder.Data && holder.Data){      //avoid copying the same data
        this->~CopyOnWriteHolder();                    //destroy old data
        if (IntAtomicGet(holder.Data->RefCount) == UNSHARABLE){      //copy if unsharable
            Data = new CopyOnWriteBuffer<X>(*holder.Data->Data);
        }else{
            Data = holder.Data;                        //set the buffer
            ++(Data->RefCount);    //Not using atomic operation here
        }
    }
}





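__declspec(naked) inline int __fastcall IntAtomicInc(volatile int& )
{
    __asm {
        mov         eax,1
        lock xadd   dword ptr [ecx],eax   ;__fastcall passes the argument address in ecx
        inc         eax                   ;eax held the old value, return the new one
        ret
    }
}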




I call QueryPerformanceCounter() before and after each test function to measure its running time.
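For readers without the Windows API, the same before-and-after measurement can be sketched with the C++11 <chrono> library. This is a portable substitute, not the timing code used for the results in this article. TimeMicroseconds and SampleWorkload are hypothetical names, and the workload is only a small stand-in loop.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs fn once and returns the elapsed time in
// microseconds, measured with std::chrono::steady_clock.
template<class Fn>
long long TimeMicroseconds(Fn fn)
{
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    fn();
    std::chrono::steady_clock::time_point stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Small stand-in workload; volatile keeps the loop from being optimized away.
void SampleWorkload()
{
    volatile int counter = 0;
    for (int c = 0; c < 100000; c++) {
        counter = counter + 1;
    }
}
```

A call such as TimeMicroseconds(SampleWorkload) returns the elapsed time of one run; any of the test functions taking no arguments could be passed in the same way.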

struct teststruct
{
    char a[100];
};

//global variables
CopyOnWriteHolder<int> cow1(1);
CopyOnWriteHolder<int> cow2(2);
CopyOnWriteHolder<int> cow3(cow1);
CopyOnWriteHolder<int> cow4(4);
CopyOnWriteHolder<teststruct> cow5(*(new teststruct()));
CopyOnWriteHolder<teststruct> cow6(*(new teststruct()));
CopyOnWriteHolder<teststruct> cow7(cow5);
CopyOnWriteHolder<teststruct> cow8(*(new teststruct()));


void f0()
{   //for comparison and to reduce error due to overhead of the non-functional code
    long longint=1;
    int shortint=1;
    int intt=1;
    int *pint;

    for (int c=0;c<100000000;c++){

    }
}




//test 1, test the performance of different increment functions
void t1f1()
{
    long longint=1;
    int shortint=1;
    int intt=1;
    int *pint;

    for (int c=0;c<100000000;c++){

    }
}

void t1f2()
{
    long longint=1;
    int shortint=1;
    int intt=1;
    int *pint;

    for (int c=0;c<100000000;c++){

    }
}

void t1f3()
{
    long longint=1;
    int shortint=1;
    int intt=1;
    int *pint;

    for (int c=0;c<100000000;c++){
        ++longint;//this line may be optimized by compiler
    }
}




//Test 2, test performance of getting a writable pointer
void t2f1()
{
    long longint=1;
    int shortint=1;
    int intt=1;
    int *pint;

    for (int c=0;c<100000000;c++){
        pint = cow1.GetData1();
    }
}

void t2f2()
{
    long longint=1;
    int shortint=1;
    int intt=1;
    int *pint;

    for (int c=0;c<100000000;c++){
        pint = cow1.GetData();
    }
}




//Test 3, test assignment performance
void t3f1()
{
    long longint=1;
    int shortint=1;
    int intt=1;
    int *pint;

    for (int c=0;c<100000000;c++){
        cow2=cow1;//assign data
        cow2=cow3;//will not actually assign anything
        cow2=cow4;//assign data
    }
}

void t3f2()
{
    long longint=1;
    int shortint=1;
    int intt=1;
    int *pint;

    for (int c=0;c<100000000;c++){

    }
}

void t3f3()
{
    long longint=1;
    int shortint=1;
    int intt=1;
    int *pint;

    for (int c=0;c<100000000;c++){

    }
}

void t3f4()
{
    long longint=1;
    int shortint=1;
    int intt=1;
    int *pint;

    for (int c=0;c<100000000;c++){

    }
}









[1] More Exceptional C++, Herb Sutter, Addison-Wesley, 2002, ISBN 0-201-70434-X.


