A Hard Lesson from a QMutex Deadlock

爱摸鱼的满满爸

Original link: https://blog.csdn.net/weixin_43941028/article/details/119999511

Problem

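A colleague in our service department recently reported a problem at a customer site. On the touchscreen (Qt) project, disconnecting the CANopen cable froze the screen, and it stopped responding to touch. Only a reboot brought it back.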

Analysis

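Possible cause 1: disconnecting CANopen triggers some operation that exhausts system resources (CPU, memory, etc.).
Possible cause 2: a frozen UI always puts thread deadlock on the suspect list (the more likely cause here).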

Investigation

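1. Does disconnecting CANopen trigger an operation that exhausts system resources (CPU, memory, etc.)?

We checked system resource usage with the top command.

About half of the memory was still free and CPU usage was low, so this possibility was ruled out.
2. Thread deadlock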

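The CANopen code was written by our Italian colleagues, and the system had run stably for many years. This freeze, caused by pulling the CANopen cable while the system was running, was found by chance.
Testing showed:

If the CANopen cable is unplugged first and the system is then powered on, everything runs normally.
If the cable is unplugged after power-on, while the system is running, the UI freezes, and a reboot restores it.
The heartbeat between the screen and the main board is carried over CANopen. If no CANopen connection is detected at startup, no heartbeat is sent.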

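Summary:
The problem almost certainly occurs while the system is running and the screen-to-board heartbeat is being maintained over CANopen. Unplugging the cable at that point makes something go wrong while a heartbeat packet is being sent.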

Resolution

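The project uses QMutex for mutual exclusion between threads. We searched the whole project for locking code and cross-checked it against the heartbeat-sending code. That narrowed the problem down to the following snippet (business logic removed):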

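int Canopen::uploadSdo(quint8 node_id, quint16 index, quint8 subindex, bool wait)
{
    if (wait) // wait defaults to true
    {
        m_sdo.mutex.lock(); // first lock()
    }
    // msg is prepared in the omitted logic.
    // If the CANopen cable is pulled at runtime, this write fails and returns -1.
    int ret = m_candev.CANDeviceWrite(msg);
    if (ret <= 0)
        endSdo(COP_SDO_RC_WRITEERR); // error path: called while m_sdo.mutex is still held
    else if (wait)
    {
        m_sdo.waitcond.wait(&m_sdo.mutex); // releases m_sdo.mutex while waiting for the SDO response
    }
    if (wait)
        m_sdo.mutex.unlock();
    return ret;
}

void Canopen::endSdo(enumCopErrorCode result)
{
    m_sdo.mutex.lock(); // second lock(): same thread, same non-recursive QMutex, so it never returns
    m_sdo.waitcond.wakeAll();
    m_sdo.mutex.unlock();
}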

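On the error path, the same thread tries to acquire the same mutex twice. If no other thread holds it, the first lock() succeeds. What happens on the second lock()? The key point is that both calls come from the same thread. (In Java this would be fine, because synchronized locks are reentrant by default.)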

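QMutex is a mutual-exclusion lock. It protects an object, a data structure, or a section of code so that only one thread can access it at a time. If thread A holds the mutex, thread B blocks in lock() until A releases it. So what happens when one thread locks the same mutex twice in a row?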
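Here is a minimal sketch of the cross-thread behaviour in standard C++, so that it runs without Qt. `std::mutex` stands in for a default QMutex, and `demoExclusion` and `ExclusionResult` are illustrative names. Thread B's flag can only be set after B gets past `lock()`, so the first check is false however the threads are scheduled:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

struct ExclusionResult {
    bool bLockedWhileAHeldIt;   // did thread B get the lock while A held it?
    bool bLockedAfterARelease;  // did thread B get it once A unlocked?
};

// std::mutex is used here as a stand-in for a default (non-recursive) QMutex.
ExclusionResult demoExclusion() {
    std::mutex m;
    std::atomic<bool> bHasLocked{false};

    m.lock();                      // thread A (this thread) takes the lock
    std::thread b([&] {
        m.lock();                  // thread B blocks here until A unlocks
        bHasLocked = true;
        m.unlock();
    });

    // Give B time to reach lock(); whatever the timing, B cannot pass it yet.
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    bool whileHeld = bHasLocked.load();

    m.unlock();                    // A releases the mutex
    b.join();                      // B acquires it, sets the flag and exits
    return {whileHeld, bHasLocked.load()};
}
```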

From the Qt documentation:

void QMutex::lock()
Locks the mutex. If another thread has locked the mutex then this call will block until that thread has unlocked it.
Calling this function multiple times on the same mutex from the same thread is allowed if this mutex is a recursive mutex. If this mutex is a non-recursive mutex, this function will dead-lock when the mutex is locked recursively.

The key sentence is the last one. Calling lock() repeatedly from the same thread is allowed only on a recursive mutex. On a non-recursive mutex, locking recursively deadlocks.

Mutexes come in two kinds: recursive mutexes and non-recursive mutexes.
A recursive mutex is also called a reentrant mutex. A non-recursive mutex is also called a non-reentrant mutex.

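The only difference between them is this. The same thread can acquire a recursive mutex several times without deadlocking, but a thread that acquires a non-recursive mutex it already holds deadlocks.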
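The recursive case can be sketched the same way. Again this is standard C++, with `std::recursive_mutex` standing in for QRecursiveMutex, and the helper names are hypothetical. The owning thread locks twice without blocking. Another thread gets in only after the matching second `unlock()`:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

struct RecursiveResult {
    bool otherGotLockAfterOneUnlock;
    bool otherGotLockAfterTwoUnlocks;
};

// std::recursive_mutex is used here as a stand-in for QRecursiveMutex.
RecursiveResult demoRecursive() {
    std::recursive_mutex m;
    std::atomic<bool> otherHasLocked{false};

    m.lock();
    m.lock();                 // same thread again: allowed, lock count is now 2

    std::thread other([&] {
        m.lock();             // blocks until the owner's count drops to 0
        otherHasLocked = true;
        m.unlock();
    });

    m.unlock();               // count 1: this thread still owns the mutex
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    bool afterOne = otherHasLocked.load();  // always false at this point

    m.unlock();               // count 0: the mutex is released
    other.join();
    return {afterOne, otherHasLocked.load()};
}
```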

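So this is a deadlock. On the error path, uploadSdo() calls endSdo() while the calling thread still holds m_sdo.mutex. That mutex is an ordinary, non-recursive QMutex, so the lock() in endSdo() never returns.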
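The diagnosis can also be checked without CAN hardware. The sketch below uses a hypothetical `CheckedMutex`, a non-recursive mutex that records its owning thread and reports a re-lock instead of hanging. It drives a simplified model of uploadSdo()/endSdo(), where a flag simulates the failing write and the response-wait path is left out. This is a debugging aid under those assumptions, not how Qt implements QMutex:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical debugging aid (not part of Qt or the original project):
// a non-recursive mutex that remembers which thread owns it, so a second
// lock() from the owning thread is reported instead of blocking forever.
class CheckedMutex {
public:
    // Returns false (without blocking) if the calling thread already owns
    // the mutex; a plain QMutex or std::mutex would wait for itself here.
    bool lock() {
        if (owner_.load() == std::this_thread::get_id())
            return false;
        m_.lock();
        owner_.store(std::this_thread::get_id());
        return true;
    }

    void unlock() {
        owner_.store(std::thread::id());  // default id means "no owner"
        m_.unlock();
    }

private:
    std::mutex m_;
    std::atomic<std::thread::id> owner_{std::thread::id()};
};

// Simplified model of uploadSdo()/endSdo(): the CAN write is replaced by a
// flag, and the wait-for-response path is omitted.
struct SdoModel {
    CheckedMutex mutex;
    bool relockDetected = false;

    void endSdo() {
        if (!mutex.lock()) {       // second lock() on the same thread
            relockDetected = true; // with a real QMutex the thread hangs here
            return;
        }
        // waitcond.wakeAll() would go here
        mutex.unlock();
    }

    int uploadSdo(bool writeSucceeds) {
        mutex.lock();                      // first lock()
        int ret = writeSucceeds ? 1 : -1;  // stand-in for CANDeviceWrite(msg)
        if (ret <= 0)
            endSdo();                      // error path, lock still held
        mutex.unlock();
        return ret;
    }
};
```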

Fixes

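1. Use a recursive mutex, QRecursiveMutex.
QRecursiveMutex (added in Qt 5.14) is used exactly like QMutex, through lock() and unlock(). In Qt 5.14/5.15 it is built on QMutex via private inheritance. In Qt 6 it is a separate class.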

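The older way selects recursion through a QMutex constructor argument (this constructor no longer exists in Qt 6):
QMutex::QMutex(QMutex::RecursionMode mode)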

Constant	Value	Description
QMutex::Recursive	1	In this mode, a thread can lock the same mutex multiple times and the mutex won’t be unlocked until a corresponding number of unlock() calls have been made. You should use QRecursiveMutex for this use-case.
QMutex::NonRecursive	0	In this mode, a thread may only lock a mutex once.

The default is QMutex::NonRecursive, i.e. non-reentrant.

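The Qt documentation also advises: "QRecursiveMutex is much more expensive to construct and operate on, so use a plain QMutex whenever you can."
This code has one more catch. The same mutex is also passed to m_sdo.waitcond.wait(), and QWaitCondition does not support waiting on a recursive mutex. In Qt 5, wait() prints a warning and returns false without waiting. In Qt 6, wait() accepts only a plain QMutex. Making m_sdo.mutex recursive would therefore break the normal SDO wait path, so option 1 cannot simply be dropped in here.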

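2. Restructure the code so that the first lock is released with unlock() before the second lock(). This is safe only if nothing between that unlock() and the next lock() touches shared state that the mutex protects.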
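Option 2 can be sketched with standard C++ primitives, with `std::mutex` and `std::condition_variable` standing in for QMutex and QWaitCondition. `SdoChannel`, `kWriteError` and the `write` callback are hypothetical stand-ins for the Canopen class, COP_SDO_RC_WRITEERR and CANDeviceWrite(). Only the default `wait == true` path is modelled. The `done_` predicate is an addition that the original code does not have:

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical std:: analogue of the Canopen SDO code (not the project's
// real class). Only the default wait == true path is modelled.
class SdoChannel {
public:
    static constexpr int kWriteError = -1;  // stand-in for COP_SDO_RC_WRITEERR

    // `write` stands in for m_candev.CANDeviceWrite(msg).
    int uploadSdo(const std::function<int()>& write) {
        std::unique_lock<std::mutex> lock(mutex_);  // first lock()
        done_ = false;
        int ret = write();
        if (ret <= 0) {
            // Option 2: release the mutex before endSdo() locks it again.
            // Nothing between this unlock() and endSdo()'s lock() touches
            // the protected state.
            lock.unlock();
            endSdo(kWriteError);
            return ret;
        }
        // Releases mutex_ while waiting; the predicate guards against
        // spurious wake-ups.
        cv_.wait(lock, [this] { return done_; });
        return ret;
    }

    // Called by the receive thread when the SDO answer arrives, or on error.
    void endSdo(int result) {
        std::lock_guard<std::mutex> guard(mutex_);
        result_ = result;
        done_ = true;
        cv_.notify_all();
    }

    int lastResult() {
        std::lock_guard<std::mutex> guard(mutex_);
        return result_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool done_ = false;
    int result_ = 0;
};
```

`std::unique_lock` makes the early release on the error path explicit, and it still unlocks automatically on the normal return path. Every exit from uploadSdo() therefore leaves the mutex free. In a success-path test, a responder thread started inside the write callback can play the CAN receive thread. Its endSdo() call blocks on the mutex until uploadSdo() enters wait(), so the wake-up cannot be lost.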
