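This investigation started from a deadlock. To work around a Google issue, the customer widened the scope of a lock, and that change caused the deadlock. After analyzing the related null-pointer exception, I avoided both problems by posting every operation on the affected object to a single thread, so that access to the resource is synchronized. The detailed analysis follows.
- The TvHome main thread is stuck at the position below, inside getHardwareList; the service side has not returned.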
"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x751cd000 self=0xf01d1000
| sysTid=8973 nice=-10 cgrp=default sched=0/0 handle=0xf43c6494
| state=S schedstat=( 0 0 0 ) utm=158 stm=16 core=2 HZ=100
| stack=0xff285000-0xff287000 stackSize=8MB
| held mutexes=
kernel: __switch_to+0xa4/0xc4
kernel: binder_thread_read+0x3e0/0x1104
kernel: binder_ioctl+0x8e0/0xac4
kernel: compat_SyS_ioctl+0xd4/0xed8
kernel: __sys_trace+0x4c/0x4c
native: #00 pc 00053b8c /system/lib/libc.so (__ioctl+8)
native: #01 pc 00021b63 /system/lib/libc.so (ioctl+30)
native: #02 pc 0003d3f5 /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+204)
native: #03 pc 0003dde3 /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+26)
native: #04 pc 0003713d /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+36)
native: #05 pc 000c2d4f /system/lib/libandroid_runtime.so (android_os_BinderProxy_transact(_JNIEnv*, _jobject*, int, _jobject*, _jobject*, int)+82)
at android.os.BinderProxy.transactNative(Native method)
at android.os.BinderProxy.transact(Binder.java:1127)
at android.media.tv.ITvInputManager$Stub$Proxy.getHardwareList(ITvInputManager.java:1389)
at android.media.tv.TvInputManager.getHardwareList(TvInputManager.java:1572)
- On the service side, the thread handling this call is waiting for lock 0x03c7c6d5:
"Binder:3770_2" prio=5 tid=9 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x130003c0 self=0xebd7ee00
| sysTid=3792 nice=-10 cgrp=default sched=0/0 handle=0xd7718970
| state=S schedstat=( 0 0 0 ) utm=57 stm=21 core=1 HZ=100
| stack=0xd761d000-0xd761f000 stackSize=1010KB
| held mutexes=
at com.android.server.tv.TvInputHardwareManager.getHardwareList(TvInputHardwareManager.java:253)
- waiting to lock <0x03c7c6d5> (a java.lang.Object) held by thread 30
at com.android.server.tv.TvInputManagerService$BinderService.getHardwareList(TvInputManagerService.java:1823)
at android.media.tv.ITvInputManager$Stub.onTransact(ITvInputManager.java:541)
at android.os.Binder.execTransact(Binder.java:731)
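- Looking for why the lock is never released: Binder:3770_4 holds 0x03c7c6d5 and 0x03bb9f8c and is waiting for 0x051f1cbf, while AudioPortEventHandler holds 0x051f1cbf and is waiting for 0x03bb9f8c. The two threads therefore form a deadlock.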
"Binder:3770_4" prio=5 tid=30 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x13501988 self=0xd55e8e00
| sysTid=4081 nice=0 cgrp=default sched=0/0 handle=0xd5479970
| state=S schedstat=( 0 0 0 ) utm=52 stm=18 core=1 HZ=100
| stack=0xd537e000-0xd5380000 stackSize=1010KB
| held mutexes=
at android.media.AudioPortEventHandler.unregisterListener(AudioPortEventHandler.java:186)
- waiting to lock <0x051f1cbf> (a java.lang.Object) held by thread 99
at android.media.AudioManager.unregisterAudioPortUpdateListener(AudioManager.java:4463)
at com.android.server.tv.TvInputHardwareManager$TvInputHardwareImpl.release(TvInputHardwareManager.java:835)
- locked <0x03bb9f8c> (a java.lang.Object)
at com.android.server.tv.TvInputHardwareManager$Connection.resetLocked(TvInputHardwareManager.java:642)
at com.android.server.tv.TvInputHardwareManager.releaseHardware(TvInputHardwareManager.java:417)
- locked <0x03c7c6d5> (a java.lang.Object)
at com.android.server.tv.TvInputManagerService$BinderService.releaseTvInputHardware(TvInputManagerService.java:1863)
at android.media.tv.ITvInputManager$Stub.onTransact(ITvInputManager.java:576)
at android.os.Binder.execTransact(Binder.java:731)
"AudioPortEventHandler" prio=5 tid=99 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x13f40520 self=0xcf798200
| sysTid=7366 nice=0 cgrp=default sched=0/0 handle=0xce301970
| state=S schedstat=( 0 0 0 ) utm=1 stm=0 core=3 HZ=100
| stack=0xce1fe000-0xce200000 stackSize=1042KB
| held mutexes=
at com.android.server.tv.TvInputHardwareManager$TvInputHardwareImpl$1.onAudioPortListUpdate(TvInputHardwareManager.java:755)
- waiting to lock <0x03bb9f8c> (a java.lang.Object) held by thread 30
at android.media.AudioPortEventHandler$1.handleMessage(AudioPortEventHandler.java:120)
- locked <0x051f1cbf> (a java.lang.Object)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loop(Looper.java:193)
at android.os.HandlerThread.run(HandlerThread.java:65)
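The lock cycle in the two traces above can be reproduced in miniature on a desktop JVM. The following is only a sketch: the class name, thread names, and lock objects are hypothetical stand-ins for <0x03bb9f8c> and <0x051f1cbf>, and it uses the standard ThreadMXBean API rather than anything from the Android framework.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockCycleDemo {
    // Starts a daemon thread that takes `first`, waits until both threads hold
    // their first lock, and then tries to take `second`.
    static void start(String name, Object first, Object second, CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await();
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // Never reached: the other thread already owns `second`.
                }
            }
        }, name);
        t.setDaemon(true); // lets the JVM exit although the threads stay blocked
        t.start();
    }

    // Builds the opposite-order lock cycle and returns how many threads the JVM
    // reports as monitor-deadlocked (0 if none is found within about 5 seconds).
    static int countDeadlockedThreads() {
        Object hardwareImplLock = new Object(); // stand-in for <0x03bb9f8c>
        Object handlerLock = new Object();      // stand-in for <0x051f1cbf>
        CountDownLatch latch = new CountDownLatch(2);
        start("binder-like", hardwareImplLock, handlerLock, latch);
        start("handler-like", handlerLock, hardwareImplLock, latch);

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        try {
            for (int i = 0; i < 100; i++) {
                long[] ids = mx.findMonitorDeadlockedThreads();
                if (ids != null) {
                    return ids.length;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + countDeadlockedThreads());
    }
}
```

Each thread takes the two locks in the opposite order, which is the same pattern as release() calling unregisterListener while onAudioPortListUpdate runs inside the widened synchronized block.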
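- After checking the customer's code, I confirmed that the deadlock above was introduced by the customer's change, and asked the customer to think about how to correct it. I was still curious why they had made the change, which is equivalent to extending the scope of synchronized (this) at the location below to cover the whole of handleMessage.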
http://androidxref.com/9.0.0_r3/xref/frameworks/base/media/java/android/media/AudioPortEventHandler.java#67
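- While waiting for the customer's reply, I started on another issue, a low-probability system_server crash that makes Android restart, and it explained the reason for the customer's change. When this exception occurs, a null-pointer dereference happens at line 119: listeners.get(i) returns null, and system_server crashes.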
http://androidxref.com/9.0.0_r3/xref/frameworks/base/media/java/android/media/AudioPortEventHandler.java#119
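- Looking back at the analysis in point 3 above and combining the two traces, we can confirm the two points below. The customer widened synchronized (this) to the whole of handleMessage precisely to avoid this exception.
- While AudioPortEventHandler.java is in handleMessage, another thread calls unregisterListener, that is, mListeners.remove(). listeners is only a reference to mListeners, so any operation on mListeners is also an operation on listeners.
- When the incoming message is not AUDIOPORT_EVENT_NEW_LISTENER, listeners is read element by element to notify each listener. Because of the first point, listeners.get(i) can, with low probability, return null.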
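The aliasing in the first point can be shown without any threads. This is a minimal sketch with hypothetical names and string stand-ins for the listener objects; the independent copy appears only for contrast and is not the fix described in this post.

```java
import java.util.ArrayList;
import java.util.List;

public class AliasVsCopyDemo {
    // Returns {size seen through the alias, size seen through the copy}
    // after one element is removed from the original list.
    static int[] sizesAfterRemove() {
        List<String> mListeners = new ArrayList<>();
        mListeners.add("listenerA");
        mListeners.add("listenerB");

        // Same assignment shape as `listeners = mListeners` in handleMessage:
        // no new list is created, both names refer to one object.
        List<String> listeners = mListeners;
        // An independent copy, shown only for contrast.
        List<String> copy = new ArrayList<>(mListeners);

        // Stand-in for unregisterListener() calling mListeners.remove(...).
        mListeners.remove("listenerB");

        return new int[] { listeners.size(), copy.size() };
    }

    public static void main(String[] args) {
        int[] sizes = sizesAfterRemove();
        System.out.println("alias sees " + sizes[0] + " element(s), copy sees " + sizes[1]);
        // prints "alias sees 1 element(s), copy sees 2"
    }
}
```

Because the loop in handleMessage runs on the alias after the synchronized block has been left, a removal on another thread can change the list in the middle of the iteration.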
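- With the cause clear, I had an idea for a fix; this is yet another Google bug. Both failure scenarios above share one root cause: operations on mListeners are not synchronized. Once they are synchronized the problem is avoided, and the approach is to perform all operations on that resource in one place.
A new AUDIOPORT_EVENT_REMOVE_LISTENER message is added to AudioPortEventHandler.java to carry out the listener removal. Because unregisterListener is then no longer a synchronous operation, TvInputHardwareManager.java has to be adjusted accordingly. The details are as follows: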
1) frameworks/base/media/java/android/media/AudioPortEventHandler.java
2) frameworks/base/services/core/java/com/android/server/tv/TvInputHardwareManager.java
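The idea of routing every listener operation through one thread can be sketched in plain Java. This is not the actual patch: a single-threaded ExecutorService stands in for the Android handler thread, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SingleThreadListenerHub {
    interface Listener {
        void onUpdate(String event);
    }

    // Touched only on the worker thread, so no lock is needed around it.
    private final List<Listener> listeners = new ArrayList<>();
    // Stand-in for the handler thread (assumption: one worker, FIFO order).
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    void register(Listener l) {
        worker.execute(() -> listeners.add(l));
    }

    // Asynchronous on purpose: the removal is queued like any other message,
    // similar in spirit to posting a "remove listener" message.
    void unregister(Listener l) {
        worker.execute(() -> listeners.remove(l));
    }

    void dispatch(String event) {
        worker.execute(() -> {
            for (int i = 0; i < listeners.size(); i++) {
                listeners.get(i).onUpdate(event);
            }
        });
    }

    void shutdownAndWait() {
        worker.shutdown();
        try {
            worker.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Registers two listeners, removes one between two dispatches,
    // and returns the callbacks in the order they ran.
    static List<String> demo() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        SingleThreadListenerHub hub = new SingleThreadListenerHub();
        Listener a = e -> log.add("a:" + e);
        Listener b = e -> log.add("b:" + e);
        hub.register(a);
        hub.register(b);
        hub.dispatch("first");
        hub.unregister(b);
        hub.dispatch("second");
        hub.shutdownAndWait();
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a:first, b:first, a:second]
    }
}
```

The trade-off matches the note above: unregister returns before the listener is actually removed, so a caller that used to rely on synchronous removal has to tolerate a callback that is already queued.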