Performance Problems of Read-Write Locks and an Alternative

SIGXXL

Article link: https://blog.csdn.net/sigxxl/article/details/23598805

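Over the past couple of days I read some material on the performance problems of read-write locks. It recommends avoiding read-write locks and using other schemes instead.

This article first explains why read-write locks are often a poor fit, and then presents an alternative. The details follow: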

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

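Section 2.3 of 《Linux多线程服务端编程》 puts it this way:

The read-write lock (rwlock) is an abstraction that looks beautiful on the surface: it explicitly distinguishes between read and write operations.

A common beginner move is to replace a mutex with an rwlock as soon as a shared data structure is read frequently and written rarely, or even to reach for an rwlock first when protecting shared state. This is a mistake.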

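1. In terms of correctness, a typical mistake is modifying shared data while holding only the read lock. This usually happens during maintenance: to add a feature, a programmer inadvertently calls a state-modifying function from a function that is protected by the read lock. The consequence is the same as reading and writing shared data concurrently with no protection at all.

2. In terms of performance, a read-write lock is not necessarily more efficient than a plain mutex. Acquiring the reader lock can never be cheaper than locking a mutex, because it has to update the current reader count. If the critical section is small and lock contention is light, a mutex is often faster. (XXL: if the critical section is large, the program itself has a design problem.)

3. A reader lock may or may not be upgradable to a writer lock (Pthread rwlocks do not allow upgrading). If upgrading is allowed, the consequences are like those of using a recursive (reentrant) mutex, which leads to other problems in the program. If upgrading is not allowed, the consequences are like those of using a non-recursive mutex: deadlock. I would rather have the program deadlock and leave an "intact corpse" that is easy to examine.

4. Reader locks are usually reentrant and writer locks are not. However, to prevent writer starvation, a pending writer lock usually blocks later reader locks, so a reader lock can deadlock when it is re-entered. Read-write locks are also unsuitable where low-latency reads are required.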

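XXL: To elaborate on the rwlock deadlock: thread 1 acquires the read lock and is executing in the critical section. Thread 2 then requests the write lock and blocks, waiting for thread 1 to finish reading; while it waits, it also blocks any later read-lock requests. Thread 1, still in the middle of its read, requests the same read lock again indirectly, for example through a function call. Thread 1 now blocks because thread 2's write request is pending, while thread 2 is still waiting for thread 1 to release its read lock. The result is a deadlock, and its root cause is the recursive acquisition of the read lock.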
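
The interleaving described above can be sketched with a POSIX rwlock. This is only an illustration: the class `Table` and its member names are hypothetical and do not come from the book. `SumRisky()` takes the read lock and then calls `Size()`, which takes it again. Run from a single thread this works, but if a writer queues up between the two acquisitions on a writer-preferring implementation, the second `pthread_rwlock_rdlock` may block forever. One common way to avoid the pattern, shown in `SumSafe()`, is to take the lock once in the public function and call an unlocked private helper.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical class, used only to illustrate the recursive read-lock hazard.
class Table {
public:
    Table() { int rc = pthread_rwlock_init(&lock_, nullptr); assert(rc == 0); (void)rc; }
    ~Table() { pthread_rwlock_destroy(&lock_); }
    Table(const Table&) = delete;
    Table& operator=(const Table&) = delete;

    void Add(int v) {
        pthread_rwlock_wrlock(&lock_);
        data_.push_back(v);
        pthread_rwlock_unlock(&lock_);
    }

    std::size_t Size() const {
        pthread_rwlock_rdlock(&lock_);
        std::size_t n = SizeUnlocked();
        pthread_rwlock_unlock(&lock_);
        return n;
    }

    // Hazard: holds the read lock and then re-acquires it through Size().
    // If a writer starts waiting in between, the inner rdlock may block.
    long SumRisky() const {
        pthread_rwlock_rdlock(&lock_);
        long sum = 0;
        std::size_t n = Size();  // recursive read-lock acquisition
        for (std::size_t i = 0; i < n; ++i) sum += data_[i];
        pthread_rwlock_unlock(&lock_);
        return sum;
    }

    // Same result, but the lock is taken exactly once.
    long SumSafe() const {
        pthread_rwlock_rdlock(&lock_);
        long sum = 0;
        std::size_t n = SizeUnlocked();
        for (std::size_t i = 0; i < n; ++i) sum += data_[i];
        pthread_rwlock_unlock(&lock_);
        return sum;
    }

private:
    std::size_t SizeUnlocked() const { return data_.size(); }  // caller holds lock_

    mutable pthread_rwlock_t lock_;
    std::vector<int> data_;
};
```

Whether the risky version actually deadlocks depends on the rwlock implementation's reader/writer preference and on timing, which is exactly what makes the bug hard to reproduce.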

The article "Real-World Concurrency" (http://queue.acm.org/detail.cfm?id=1454462) makes the same point:

Be wary of readers/writer locks. If there is a novice error when trying to break up a lock, it is this: seeing that a data structure is frequently accessed for reads and infrequently accessed for writes, one may be tempted to replace a mutex guarding the structure with a readers/writer lock to allow for concurrent readers. This seems reasonable, but unless the hold time for the lock is long, this solution will scale no better (and indeed, may scale worse) than having a single lock. Why? Because the state associated with the readers/writer lock must itself be updated atomically, and in the absence of a more sophisticated (and less space-efficient) synchronization primitive, a readers/writer lock will use a single word of memory to store the number of readers. Because the number of readers must be updated atomically, acquiring the lock as a reader requires the same bus transaction—a read-to-own—as acquiring a mutex, and contention on that line can hurt every bit as much.

There are still many situations where long hold times (e.g., performing I/O under a lock as reader) more than pay for any memory contention, but one should be sure to gather data to make sure that it is having the desired effect on scalability. Even in those situations where a readers/writer lock is appropriate, an additional note of caution is warranted around blocking semantics. If, for example, the lock implementation blocks new readers when a writer is blocked (a common paradigm to avoid writer starvation), one cannot recursively acquire a lock as reader: if a writer blocks between the initial acquisition as reader and the recursive acquisition as reader, deadlock will result when the recursive acquisition is blocked. All of this is not to say that readers/writer locks shouldn’t be used—just that they shouldn’t be romanticized.
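
To make the "single word of memory" argument concrete, here is a toy spinning readers/writer lock. It is a deliberately naive sketch and not how any real library implements its rwlock; the name `ToyRWSpinLock` and its methods are made up for this example. The point is that `ReadLock()` has to perform an atomic read-modify-write on the shared state word, which is the same kind of cache-line traffic that acquiring a mutex causes.

```cpp
#include <atomic>
#include <cassert>

// Toy lock for illustration only: it spins, has no fairness and no blocking.
// state_ >= 0 : number of active readers
// state_ == -1: one writer holds the lock
class ToyRWSpinLock {
public:
    void ReadLock() {
        int s = state_.load(std::memory_order_relaxed);
        for (;;) {
            if (s < 0) {  // a writer holds the lock; re-read and retry
                s = state_.load(std::memory_order_relaxed);
                continue;
            }
            // Atomic read-modify-write on the shared word: every reader pays this.
            if (state_.compare_exchange_weak(s, s + 1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
                return;
            }
        }
    }

    void ReadUnlock() { state_.fetch_sub(1, std::memory_order_release); }

    void WriteLock() {
        int expected = 0;
        while (!state_.compare_exchange_weak(expected, -1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            expected = 0;  // wait until there are no readers and no writer
        }
    }

    void WriteUnlock() { state_.store(0, std::memory_order_release); }

    int State() const { return state_.load(std::memory_order_relaxed); }

private:
    std::atomic<int> state_{0};
};
```

With a short critical section, the cost of that atomic update dominates, so allowing readers to overlap buys little.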

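In fact, I also read some blog posts that use experiments to show that a mutex outperforms an rwlock, for example: http://blog.chinaunix.net/uid-28852942-id-3756043.html

Frankly, the experiment in that post is not convincing. It only compares a mutex against the read side of an rwlock, and with just two threads it can hardly show any effect of concurrency. Concluding from that setup that "a read-write lock improves parallelism but is no faster than a mutex" is unjustified. The conclusion itself is not necessarily wrong; it is the experiment that fails to support it.

In my view, reaching that conclusion requires simulating a workload like this: many reader threads (at least a few hundred; with a 32-bit address space a single process is usually limited to 300-odd threads, although the thread stack size can be tuned to raise that limit) and relatively few writer threads. Under those conditions, measure both a very short and a very long read-side critical section. Only then does the mutex's advantage with short critical sections become visible. With a long critical section the read-write lock may well perform better, but a long critical section is itself a design problem: why not make it shorter?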
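
A measurement along those lines might be structured as in the sketch below. It is only a starting point under several assumptions: it uses `std::mutex` and C++17 `std::shared_mutex` rather than the wrappers defined later in this article, and the names `MutexPolicy`, `RWLockPolicy`, `BenchResult` and `RunBench` are invented here. The length of the read-side critical section is controlled by `scan_len`, which must be greater than zero.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

struct MutexPolicy {  // readers and writers all take the same exclusive lock
    std::mutex m;
    void ReadLock() { m.lock(); }
    void ReadUnlock() { m.unlock(); }
    void WriteLock() { m.lock(); }
    void WriteUnlock() { m.unlock(); }
};

struct RWLockPolicy {  // readers share, writers are exclusive
    std::shared_mutex m;
    void ReadLock() { m.lock_shared(); }
    void ReadUnlock() { m.unlock_shared(); }
    void WriteLock() { m.lock(); }
    void WriteUnlock() { m.unlock(); }
};

struct BenchResult {
    double millis;       // wall-clock time, including thread start-up
    long writes;         // total write operations performed
    long long checksum;  // sum of everything the readers read
};

template <typename Policy>
BenchResult RunBench(int readers, int writers, int iters, std::size_t scan_len) {
    Policy lock;
    std::vector<int> data(scan_len, 1);
    std::vector<long long> sums(static_cast<std::size_t>(readers), 0LL);
    long writes = 0;  // protected by the write lock
    std::vector<std::thread> threads;

    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < readers; ++r) {
        threads.emplace_back([&, r] {
            long long local = 0;
            for (int i = 0; i < iters; ++i) {
                lock.ReadLock();
                for (int v : data) local += v;  // read-side critical section
                lock.ReadUnlock();
            }
            sums[static_cast<std::size_t>(r)] = local;
        });
    }
    for (int w = 0; w < writers; ++w) {
        threads.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.WriteLock();
                data[static_cast<std::size_t>(i) % data.size()] = 1;  // keeps values at 1
                ++writes;
                lock.WriteUnlock();
            }
        });
    }
    for (auto& t : threads) t.join();
    auto end = std::chrono::steady_clock::now();

    BenchResult res;
    res.millis = std::chrono::duration<double, std::milli>(end - start).count();
    res.writes = writes;
    res.checksum = 0;
    for (long long s : sums) res.checksum += s;
    return res;
}
```

For example, `RunBench<MutexPolicy>(300, 4, 10000, 8)` against `RunBench<RWLockPolicy>(300, 4, 10000, 8)` would model the short critical section, and a `scan_len` of 100000 the long one. The numbers depend heavily on hardware, core count and the library's rwlock implementation, so they should be measured rather than assumed.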

Here is a concrete example.

Suppose Mutex and RWLock are wrapped as the following classes, which provide a few basic operations:

class Mutex  // Just get the idea; the implementation does not matter here, it only serves to illustrate the point
{
public:
    void Init();
    void Destroy();
    void Lock();
    void Unlock();
};

class RWLock
{
public:
    void Init();
    void Destroy();
    void ReadLock();
    void WriteLock();
    void Unlock();
};
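
For readers who want something they can compile, one possible way to fill in these two interfaces on top of Pthreads is sketched below. This is an assumption about the implementation, since the article deliberately leaves it open. `TryLock()` and `TryWriteLock()` are extra members that are not part of the original interface; they are added here only so the behaviour can be checked.

```cpp
#include <pthread.h>
#include <cassert>

// A possible Pthreads-backed implementation of the Mutex interface above.
class Mutex {
public:
    void Init()    { int rc = pthread_mutex_init(&m_, nullptr); assert(rc == 0); (void)rc; }
    void Destroy() { pthread_mutex_destroy(&m_); }
    void Lock()    { pthread_mutex_lock(&m_); }
    void Unlock()  { pthread_mutex_unlock(&m_); }
    // Not in the original interface: returns true if the lock was acquired.
    bool TryLock() { return pthread_mutex_trylock(&m_) == 0; }
private:
    pthread_mutex_t m_;
};

// A possible Pthreads-backed implementation of the RWLock interface above.
class RWLock {
public:
    void Init()      { int rc = pthread_rwlock_init(&rw_, nullptr); assert(rc == 0); (void)rc; }
    void Destroy()   { pthread_rwlock_destroy(&rw_); }
    void ReadLock()  { pthread_rwlock_rdlock(&rw_); }
    void WriteLock() { pthread_rwlock_wrlock(&rw_); }
    void Unlock()    { pthread_rwlock_unlock(&rw_); }  // releases either kind of lock
    // Not in the original interface: returns true if the write lock was acquired.
    bool TryWriteLock() { return pthread_rwlock_trywrlock(&rw_) == 0; }
private:
    pthread_rwlock_t rw_;
};
```

Error handling is omitted for brevity; production code would check the return value of every Pthreads call.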

Now consider an object that holds a vector internally; many threads read the data and few write it. Implemented with RWLock, it looks like this:

class RaceData_rwlock
{
public:
    RaceData_rwlock()
    {
        vec.resize(100000);
        rwlock.Init();
    }
    ~RaceData_rwlock()
    {
        rwlock.Destroy();
    }
    void Read() const
    {
        rwlock.ReadLock();
        for (vector<int>::const_iterator it = vec.begin(); it != vec.end(); ++it)
        {
            // read (*it) or other read operation
        }
        rwlock.Unlock();
    }
    void Write(int i)
    {
        rwlock.WriteLock();
        vec.push_back(i);
        rwlock.Unlock();
    }
private:
    mutable RWLock rwlock;
    vector<int> vec;
};

Implemented with Mutex, it looks like this:

class RaceData_mutex
{
public:
    RaceData_mutex()
    {
        vec.resize(100000);
        mutex.Init();
    }
    ~RaceData_mutex()
    {
        mutex.Destroy();
    }
    void Read() const
    {
        mutex.Lock();
        for (vector<int>::const_iterator it = vec.begin(); it != vec.end(); ++it)
        {
            // read (*it) or other read operation
        }
        mutex.Unlock();
    }
    void Write(int i)
    {
        mutex.Lock();
        vec.push_back(i);
        mutex.Unlock();
    }
private:
    mutable Mutex mutex;
    vector<int> vec;
};

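As shown above, the critical section has to traverse vec, so it is quite long. Since many experts say rwlock performs poorly, what can replace rwlock in this situation?

We can implement copy-on-write with shared_ptr (from C++ TR1 or Boost; std::shared_ptr since C++11) plus a mutex.

Before that, we need an RAII-style wrapper around Mutex, as follows: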

class MutexLockGuard
{
public:
    explicit MutexLockGuard(Mutex &m)
        : mutex(m)
    {
        m.Lock();
    }
    ~MutexLockGuard()
    {
        mutex.Unlock();
    }
private:
    Mutex &mutex;
};

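This way we never call the mutex's Lock and Unlock directly. The MutexLockGuard object is destroyed automatically when its enclosing scope ends, which unlocks the mutex.

The replacement is shown below; the important points are in the code comments: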
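
Since C++11 the standard library ships the same idea as `std::lock_guard`. The short sketch below shows why the RAII form matters: the function has an early return, and the mutex is still released on every path. The class `BoundedList` and its members are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical container used only to demonstrate scope-based unlocking.
class BoundedList {
public:
    explicit BoundedList(std::size_t capacity) : capacity_(capacity) {}

    // Returns false without inserting when the list is full.
    bool TryPush(int v) {
        std::lock_guard<std::mutex> guard(mutex_);  // locks here
        if (items_.size() >= capacity_) {
            return false;  // early return: guard's destructor still unlocks
        }
        items_.push_back(v);
        return true;
    }  // normal return: guard's destructor unlocks

    std::size_t Size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;
    std::size_t capacity_;
    std::vector<int> items_;
};
```

If `TryPush` had used manual lock and unlock calls and forgotten the unlock before the early return, the next call on the same object would block forever.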

class RaceData_sharedptr
{
public:
    RaceData_sharedptr()
        : dataPtr(new DataType)
    {
        dataPtr->resize(100000);
        mutex.Init();
    }
    ~RaceData_sharedptr()
    {
        mutex.Destroy();
    }
    void Read() const
    {
        DataPtr dataPtrCopy = GetData(); // Create a local shared_ptr; at least two smart pointers now refer to the data, so the reference count is at least 2
        // No lock is held while reading the data
        for (vector<int>::const_iterator it = dataPtrCopy->begin(); it != dataPtrCopy->end(); ++it)
        {
            // read (*it) or other read operation
        }
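    } // On exit from Read(), dataPtrCopy is destroyed and the reference count drops by 1
    void Write(int i) // Is not blocked by readers for long, because a reader holds the mutex only while copying the pointer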
    {
        MutexLockGuard lock(mutex);
        if (!dataPtr.unique())
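        { // Some reader thread is still using the data behind dataPtr
            DataPtr newDataPtr(new DataType(*dataPtr)); // This copy is not cheap either
            dataPtr.swap(newDataPtr); // Switch to the new copy
        } // newDataPtr, which now holds the old data, is destroyed here; the old object itself may survive, because reader threads may still hold shared_ptrs to it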
        dataPtr->push_back(i);
    }
private:
    typedef vector<int> DataType;
    typedef shared_ptr<DataType> DataPtr;
    mutable Mutex mutex;
    DataPtr dataPtr;

    DataPtr GetData() const
    {
        MutexLockGuard lock(mutex);
        return dataPtr;
    }
};
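
For reference, the same scheme can be written against the C++11 standard library. The sketch below is a port under assumptions, not the book's code: `CowVector`, `GetSnapshot` and `Copies` are names invented here, and the copy counter is added only to make the behaviour observable. It replaces `unique()`, which was deprecated in C++17 and removed in C++20, with `use_count() != 1`. Note that `use_count()` is only an approximate value while other threads are releasing their references concurrently, so treat this as an illustration of the idea rather than a proof of correctness under the C++ memory model.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

class CowVector {
public:
    typedef std::vector<int> DataType;
    typedef std::shared_ptr<const DataType> Snapshot;

    CowVector() : data_(std::make_shared<DataType>()), copies_(0) {}

    // Readers call this, then iterate over the snapshot without any lock.
    Snapshot GetSnapshot() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return data_;  // shares ownership, so the reference count goes up
    }

    void Write(int v) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (data_.use_count() != 1) {
            // Some reader still holds a snapshot: modify a private copy instead.
            data_ = std::make_shared<DataType>(*data_);
            ++copies_;
        }
        data_->push_back(v);  // nobody else can see this vector while we hold the mutex
    }

    // How many times a write had to copy the data.
    std::size_t Copies() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return copies_;
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<DataType> data_;
    std::size_t copies_;
};
```

A reader would write `CowVector::Snapshot s = c.GetSnapshot();` and then loop over `*s`. A write that happens while `s` is alive goes to a fresh copy, so `s` keeps seeing the old contents.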

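As the example shows, reader threads may read slightly stale data, so this scheme fits cases where the consistency requirements are not that strict.

To be honest, I have not tested this program. However, Chen Shuo (陈硕), the author of 《Linux多线程服务端编程》, says: "According to our tests, in most cases the update is performed on the original data, and copies happen less than 1% of the time, which is very efficient. More precisely, this is not copy-on-write but copy-on-other-reading."

The last sentence means that the writer thread makes a copy only when other threads are reading. Some schemes create a full copy on every write, whether or not anyone is reading, which is fairly expensive. The implementation here copies only while someone is reading and skips the copy otherwise, so its overhead is comparatively small.

Another technique, read-copy-update (RCU), is similar to the approach in this article. I have not studied it closely, and it is said to be hard to understand. See: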
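
For contrast, the "copy on every write" variant mentioned above could look like the following sketch. The class name `AlwaysCopyVector` and its members are hypothetical. Every write pays for a full copy, but published data is never modified in place, so readers always see an immutable snapshot.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

class AlwaysCopyVector {
public:
    typedef std::vector<int> DataType;
    typedef std::shared_ptr<const DataType> Snapshot;

    AlwaysCopyVector() : data_(std::make_shared<const DataType>()), copies_(0) {}

    Snapshot Get() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return data_;
    }

    void Write(int v) {
        std::lock_guard<std::mutex> guard(mutex_);
        // Always copy, even if no reader currently holds a snapshot.
        std::shared_ptr<DataType> next = std::make_shared<DataType>(*data_);
        next->push_back(v);
        data_ = std::move(next);  // publish the new version
        ++copies_;
    }

    std::size_t Copies() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return copies_;
    }

private:
    mutable std::mutex mutex_;
    Snapshot data_;  // published versions are const and never change
    std::size_t copies_;
};
```

With a 100000-element vector, as in the examples in this article, each write here copies the whole vector, which is the cost the copy-on-other-reading approach tries to avoid in the common case.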

http://www.ibm.com/developerworks/cn/linux/l-rcu/

en.wikipedia.org/wiki/Read-copy-update

References:

《Linux多线程服务端编程》 by 陈硕 (Chen Shuo)