1.Analysis
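A watchdog restart occurred because system_server ran out of binder threads: every binder thread was waiting for a provider to be published.
Calling Object.wait releases the monitor lock and puts the thread to sleep; the thread must then be woken by notify() or notifyAll() on the same lock object. Both wait and notify must be called while holding that lock, otherwise an IllegalMonitorStateException is thrown. For details, see Java synchronization.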
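The wait/notify rules above can be illustrated with a minimal, self-contained sketch. The class and method names (WaitNotifyDemo, awaitPublished, publish) are hypothetical and are not AMS code; only the java.lang.Object calls are real.

```java
class WaitNotifyDemo {
    private final Object lock = new Object();
    private boolean published = false;

    /** Blocks until publish() runs; wait() releases the monitor while sleeping. */
    void awaitPublished() throws InterruptedException {
        synchronized (lock) {
            while (!published) {   // loop guards against spurious wakeups
                lock.wait();
            }
        }
    }

    /** Sets the flag and wakes every waiter; must hold the same monitor. */
    void publish() {
        synchronized (lock) {
            published = true;
            lock.notifyAll();
        }
    }

    /** Returns true if wait() without holding the monitor throws IllegalMonitorStateException. */
    static boolean waitWithoutLockThrows() {
        try {
            new Object().wait(1);
            return false;
        } catch (IllegalMonitorStateException e) {
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Runs one waiter and one publisher; returns true once the waiter has been woken. */
    static boolean demo() {
        WaitNotifyDemo d = new WaitNotifyDemo();
        Thread publisher = new Thread(d::publish);
        publisher.start();
        try {
            d.awaitPublished();
            publisher.join();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

If publish() is never called, awaitPublished() sleeps forever, which is the situation the binder threads in the trace below are in.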
//swt blocked thread (binder full)
at android.os.Binder.blockUntilThreadAvailable(Native method)
//system_server thread Object.wait
"Binder:971_1D" prio=5 tid=122 Waiting
| group="main" sCount=1 dsCount=0 flags=1 obj=0x12c56460 self=0x7c823fe600
| sysTid=9487 nice=0 cgrp=default sched=1073741824/0 handle=0x7c83c054f0
| state=S schedstat=( 138183285126 25399460384 143755 ) utm=9529 stm=4288 core=3 HZ=100
| stack=0x7c83b0b000-0x7c83b0d000 stackSize=1005KB
| held mutexes=
at java.lang.Object.wait(Native method)
- waiting on <0x031914fd> (a com.android.server.am.ContentProviderRecord)
at com.android.server.am.ActivityManagerService.getContentProviderImpl(ActivityManagerService.java:12103)
- locked <0x031914fd> (a com.android.server.am.ContentProviderRecord)
The provider being waited on is com.android.providers.media/.MediaProvider, and the process accessing it is pid=13461
//from caller=android.app.ApplicationThreadProxy@6a6d5a4 (pid=13461, userId=0) com.android.providers.media/.MediaProvider
01-01 19:19:24.091 834 1582 D ActivityManager: getContentProviderImpl: from caller=android.app.ApplicationThreadProxy@af91f0d (pid=13461, userId=0) to get content provider media cpr=ContentProviderRecord{10948d12 u0 com.android.providers.media/.MediaProvider}
01-01 19:19:24.092 834 1580 D ActivityManager: getContentProviderImpl: from caller=android.app.ApplicationThreadProxy@24965ec2 (pid=13461, userId=0) to get content provider media cpr=ContentProviderRecord{10948d12 u0 com.android.providers.media/.MediaProvider}
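You can also search for binderinfo to determine which process is making binder calls into system_server, or try searching for ContentProviderRecord to see whether it yields any clues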
from 13461:xxxxx to 834:xxx
From the logs above, the caller is com.google.android.apps.photos
u0_a105 13461 316 1031048 44860 2 20 0 0 0 fg ffffffff f7658938 S 32 com.google.android.apps.photos
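When the log is large, counting the getContentProviderImpl callers per pid can make the offending process stand out. The following is a rough sketch that assumes the log lines have the same shape as the two ActivityManager lines quoted above; ProviderCallerCounter and countByPid are hypothetical names, not an existing tool.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProviderCallerCounter {
    // Assumed line shape: "getContentProviderImpl: from caller=<obj> (pid=N, userId=M)"
    private static final Pattern CALLER = Pattern.compile(
            "getContentProviderImpl: from caller=\\S+ \\(pid=(\\d+), userId=\\d+\\)");

    /** Returns a map from caller pid to the number of matching log lines. */
    static Map<Integer, Integer> countByPid(List<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = CALLER.matcher(line);
            if (m.find()) {
                counts.merge(Integer.parseInt(m.group(1)), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The pid with the highest count can then be looked up in the ps output to get the package name.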
2.Fix
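This issue can be pursued from two sides. On the app side, speed up provider startup and strictly avoid having multiple threads of the same app acquire the content provider concurrently. On the AMS side, introduce a timeout when acquiring a provider, so that a binder thread can never wait indefinitely and hang the system.
Please modify the AMS code to add a timeout mechanism.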
private final ContentProviderHolder getContentProviderImpl(IApplicationThread caller, ......
// Wait for the provider to be published
synchronized (cpr) {
//yulong modify for binder death by jiaerdong@yulong.com 20150608
//mtk71029 add for resolve dead binder death can not notify AMS issue.
+ int wait_count = 0;
//mtk71029 add end.
while (cpr.provider == null) {
if (cpr.launchingApp == null) {
Slog.w(TAG, "Unable to launch app "
+ cpi.applicationInfo.packageName + "/"
+ cpi.applicationInfo.uid + " for provider "
+ name + ": launching app became null");
EventLog.writeEvent(EventLogTags.AM_PROVIDER_LOST_PROCESS,
UserHandle.getUserId(cpi.applicationInfo.uid),
cpi.applicationInfo.packageName,
cpi.applicationInfo.uid, name);
return null;
}
+ //mtk71029 add for resolve binder death can not notify AMS issue.
+ //if we check the process doesn't exist, return and release binder thread.
+ //then the binder death will come, and AMS clear the app state.
+ if(!mANRManager.isJavaProcess(cpr.launchingApp.pid)
+ || Process.getUidForPid(cpr.launchingApp.pid) != cpr.launchingApp.uid){
+ //TODO maybe more action to clean content provider state
+ return null;
+ }
+ //if the app wait the provider more than 4*5000 = 20s, then return null and release the binder.
+ if (wait_count >= 4) {
+ return null;
+ }
+ //mtk71029 add end
try {
if (DEBUG_MU) {
Slog.v(TAG_MU, "Waiting to start provider " + cpr + " launchingApp="
+ cpr.launchingApp);
}
if (conn != null) {
conn.waiting = true;
}
// cpr.wait();
//mtk71029 update for resolve binder death can not notify AMS issue.
//wait 5s, then check state.
cpr.wait(5*1000); //yulong.zhangjian modified MTK patch
wait_count ++;
//mtk71029 update end.
//cpr.wait();
} catch (InterruptedException ex) {
} finally {
if (conn != null) {
conn.waiting = false;
}
}
}
}
return cpr != null ? cpr.newHolder(conn) : null;
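Stripped of the AMS specifics, the patch above boils down to a bounded wait loop. The sketch below shows that pattern in plain Java so it can be run outside the framework; BoundedProviderWait, awaitProvider and publish are hypothetical names, and the `provider` field only stands in for cpr.provider.

```java
class BoundedProviderWait {
    private Object provider;  // stand-in for cpr.provider (assumption, not AMS code)

    /**
     * Waits for publish() in rounds of roundMillis each.
     * Returns the provider, or null after maxRounds rounds without one.
     */
    synchronized Object awaitProvider(int maxRounds, long roundMillis) {
        int rounds = 0;
        while (provider == null) {
            if (rounds >= maxRounds) {
                return null;  // give up so the calling thread is released
            }
            try {
                wait(roundMillis);  // timed wait: returns on notify or timeout
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
            rounds++;
        }
        return provider;
    }

    /** Publishes the provider and wakes all waiters. */
    synchronized void publish(Object p) {
        provider = p;
        notifyAll();
    }
}
```

With maxRounds = 4 and roundMillis = 5000 this gives the same upper bound of about 20 s as the patch. Because a round also ends on a notify or a spurious wakeup, the total wait is at most, not exactly, maxRounds * roundMillis.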
3.Solution
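To summarize, cases where getContentProviderImpl exhausts the binder threads have come up fairly often. The causes are:
1.The provider host process fails to start
2.The provider host process takes a very long time to publish the provider, while the client happens to be accessing the database frequently
On Android P, Google has already added a timeout patch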