Scheduling restart of crashed service: Solution and Source Code Analysis

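Testing turned up a bug: a null pointer exception in one of a service's methods killed the process. The app's keep-alive mechanism then restarted the process, but the faulty service started first and touched resources that had not been initialized yet, so the process crashed and restarted in an endless loop.
The code below simulates the crash by showing a Toast from a worker thread inside the service.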

import android.app.Service;
import android.content.Intent;
import android.os.IBinder;
import android.widget.Toast;

// LogUtil is the project's own logging helper.
public class MyService extends Service {
    @Override
    public int onStartCommand(Intent intent, int flags, int startId) {
        // Log what the base implementation returns, for comparison with START_NOT_STICKY.
        LogUtil.d("onStartCommand,flags="+super.onStartCommand(intent, flags, startId)+",START_NOT_STICKY="+START_NOT_STICKY);
        LogUtil.d("onStartCommand,super.onStartCommand(intent, flags, startId)="+super.onStartCommand(intent, flags, startId)+",super.onStartCommand(intent, Service.START_NOT_STICKY, startId)="+super.onStartCommand(intent, Service.START_NOT_STICKY, startId));
        // return super.onStartCommand(intent, flags, startId);
        //super.onStartCommand(intent, Service.START_NOT_STICKY, startId);
        return START_NOT_STICKY;
    }

    @Override
    public void onCreate() {
        LogUtil.d("onCreate");
        super.onCreate();
        new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    Thread.sleep(10_000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                LogUtil.d("run crash before");
                // Showing a Toast from a thread without a Looper throws and kills the process.
                Toast.makeText(MyService.this,"Demo: updating the UI from a worker thread crashes",Toast.LENGTH_SHORT).show();
                LogUtil.d("run crash after");
            }
        }).start();
    }

    public MyService() {
        LogUtil.d("MyService");
    }

    // Service.onBind is abstract; this started-only service does not support binding.
    @Override
    public IBinder onBind(Intent intent) {
        return null;
    }

    @Override
    public void onDestroy() {
        LogUtil.d("onDestroy");
        super.onDestroy();
    }
}

One message in the log is the key clue.

07-17 09:52:57.674  1022  1037 I ActivityManager: Process com.shan.mvvm (pid 13678) has died: prcp SVC 
07-17 09:52:57.675  1022  1037 W ActivityManager: Scheduling restart of crashed service com.shan.mvvm/.MyService in 1000ms for start-requested

1. Solution

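The system restarted the service as requested when it was started (the "start-requested" in the log). This brings us to the start modes that the onStartCommand method of Service can return.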

/**
 * Constant to return from {@link #onStartCommand}: compatibility
 * version of {@link #START_STICKY} that does not guarantee that
 * {@link #onStartCommand} will be called again after being killed.
 */
public static final int START_STICKY_COMPATIBILITY = 0;

/**
 * Constant to return from {@link #onStartCommand}: if this service's
 * process is killed while it is started (after returning from
 * {@link #onStartCommand}), then leave it in the started state but
 * don't retain this delivered intent.  Later the system will try to
 * re-create the service.  Because it is in the started state, it will
 * guarantee to call {@link #onStartCommand} after creating the new
 * service instance; if there are not any pending start commands to be
 * delivered to the service, it will be called with a null intent
 * object, so you must take care to check for this.
 * 
 * <p>This mode makes sense for things that will be explicitly started
 * and stopped to run for arbitrary periods of time, such as a service
 * performing background music playback.
 */
public static final int START_STICKY = 1;

/**
 * Constant to return from {@link #onStartCommand}: if this service's
 * process is killed while it is started (after returning from
 * {@link #onStartCommand}), and there are no new start intents to
 * deliver to it, then take the service out of the started state and
 * don't recreate until a future explicit call to
 * {@link Context#startService Context.startService(Intent)}.  The
 * service will not receive a {@link #onStartCommand(Intent, int, int)}
 * call with a null Intent because it will not be restarted if there
 * are no pending Intents to deliver.
 * 
 * <p>This mode makes sense for things that want to do some work as a
 * result of being started, but can be stopped when under memory pressure
 * and will explicit start themselves again later to do more work.  An
 * example of such a service would be one that polls for data from
 * a server: it could schedule an alarm to poll every N minutes by having
 * the alarm start its service.  When its {@link #onStartCommand} is
 * called from the alarm, it schedules a new alarm for N minutes later,
 * and spawns a thread to do its networking.  If its process is killed
 * while doing that check, the service will not be restarted until the
 * alarm goes off.
 */
public static final int START_NOT_STICKY = 2;

/**
 * Constant to return from {@link #onStartCommand}: if this service's
 * process is killed while it is started (after returning from
 * {@link #onStartCommand}), then it will be scheduled for a restart
 * and the last delivered Intent re-delivered to it again via
 * {@link #onStartCommand}.  This Intent will remain scheduled for
 * redelivery until the service calls {@link #stopSelf(int)} with the
 * start ID provided to {@link #onStartCommand}.  The
 * service will not receive a {@link #onStartCommand(Intent, int, int)}
 * call with a null Intent because it will only be restarted if
 * it is not finished processing all Intents sent to it (and any such
 * pending events will be delivered at the point of restart).
 */
public static final int START_REDELIVER_INTENT = 3;

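There are four modes in total:
START_STICKY (1): after the service's process is killed, the system recreates the service automatically, but the previous Intent is not retained. START_STICKY_COMPATIBILITY (0) is the compatibility version of START_STICKY; it does not guarantee that onStartCommand will be called again after the service is killed.
START_NOT_STICKY (2): after the process is killed, the system does not recreate the service unless there are pending start Intents to deliver.
START_REDELIVER_INTENT (3): after the process is killed, the system recreates the service and redelivers the last Intent.
Knowing what the four values mean, I passed START_NOT_STICKY to onStartCommand, yet the service was still restarted. Why? It turned out that I had indeed passed START_NOT_STICKY, but in the wrong place. My first, incorrect attempt looked like this: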

public int onStartCommand(Intent intent, int flags, int startId) {
    // Wrong: START_NOT_STICKY is passed in as the flags argument instead of being returned.
    return super.onStartCommand(intent, Service.START_NOT_STICKY, startId);
}

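Experienced readers will already see the mistake: the second parameter, flags, is an input, and super.onStartCommand(intent, Service.START_NOT_STICKY, startId) still returns START_STICKY, as the log confirms. What the system uses is the return value, so it is enough to return START_NOT_STICKY directly.
The correct approach looks like this: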

public int onStartCommand(Intent intent, int flags, int startId) {
    LogUtil.d("onStartCommand,flags="+super.onStartCommand(intent, flags, startId)+",START_NOT_STICKY="+START_NOT_STICKY);
    LogUtil.d("onStartCommand,super.onStartCommand(intent, flags, startId)="+super.onStartCommand(intent, flags, startId)+",super.onStartCommand(intent, Service.START_NOT_STICKY, startId)="+super.onStartCommand(intent, Service.START_NOT_STICKY, startId));
    // return super.onStartCommand(intent, flags, startId);
    //super.onStartCommand(intent, Service.START_NOT_STICKY, startId);
    return START_NOT_STICKY;
}

The log prints:

onStartCommand,flags=1,START_NOT_STICKY=2
onStartCommand,super.onStartCommand(intent, flags, startId)=1,super.onStartCommand(intent, Service.START_NOT_STICKY, startId)=1
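
Both values are 1 because the base class never looks at flags when choosing its return value. The following is a minimal plain-Java model of that behavior, not the framework class itself: BaseServiceModel and its startCompatibility field are names made up for this illustration, and the sketch assumes the base implementation only chooses between START_STICKY_COMPATIBILITY and START_STICKY.

```java
final class BaseServiceModel {
    // Values copied from the Service constants quoted earlier in this article.
    static final int START_STICKY_COMPATIBILITY = 0;
    static final int START_STICKY = 1;
    static final int START_NOT_STICKY = 2;

    // Hypothetical stand-in for the framework's "legacy app" switch.
    private final boolean startCompatibility;

    BaseServiceModel(boolean startCompatibility) {
        this.startCompatibility = startCompatibility;
    }

    // Models the base implementation: flags describes the start request
    // and has no influence on the start mode that is returned.
    int onStartCommand(Object intent, int flags, int startId) {
        return startCompatibility ? START_STICKY_COMPATIBILITY : START_STICKY;
    }

    public static void main(String[] args) {
        BaseServiceModel service = new BaseServiceModel(false);
        // Same result whether flags is 0 or START_NOT_STICKY.
        System.out.println(service.onStartCommand(null, 0, 1));
        System.out.println(service.onStartCommand(null, START_NOT_STICKY, 1));
    }
}
```

In this model, the only way for a subclass to get START_NOT_STICKY is to return it itself, which matches the fix shown above.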

2. Source Code Analysis

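As the article on the startService process explains, starting a service eventually reaches realStartServiceLocked, which calls sendServiceArgsLocked to deliver the arguments that end up in onStartCommand.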

2.1 ActiveServices.realStartServiceLocked

 //ActiveServices.java
private final void realStartServiceLocked(ServiceRecord r,
        ProcessRecord app, boolean execInFg) throws RemoteException {
……
        app.thread.scheduleCreateService(r, r.serviceInfo,
                mAm.compatibilityInfoForPackageLocked(r.serviceInfo.applicationInfo),
                app.repProcState); // create the service
 ……
    sendServiceArgsLocked(r, execInFg, true); // deliver the service's start arguments
……
}

2.2 ActiveServices.sendServiceArgsLocked

sendServiceArgsLocked calls scheduleServiceArgs on the application's ApplicationThread (r.app.thread), which is defined inside ActivityThread.

 //ActiveServices.java
private final void sendServiceArgsLocked(ServiceRecord r, boolean execInFg,
        boolean oomAdjusted) throws TransactionTooLargeException {
……
        r.app.thread.scheduleServiceArgs(r, slice);
……
}

2.3 ActivityThread.scheduleServiceArgs

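scheduleServiceArgs is defined in ApplicationThread, an inner class of ActivityThread.java. It takes the list of start arguments for the service, iterates over it, and sends an H.SERVICE_ARGS message through the Handler for each entry.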

//ActivityThread$ApplicationThread
public final void scheduleServiceArgs(IBinder token, ParceledListSlice args) {
    List<ServiceStartArgs> list = args.getList();

    for (int i = 0; i < list.size(); i++) {
        ServiceStartArgs ssa = list.get(i);
        ServiceArgsData s = new ServiceArgsData();
        s.token = token;
        s.taskRemoved = ssa.taskRemoved;
        s.startId = ssa.startId;
        s.flags = ssa.flags;
        s.args = ssa.args;

        sendMessage(H.SERVICE_ARGS, s);
    }
}

2.4 H.handleMessage

H is an inner class of ActivityThread.java that extends Handler. The H.SERVICE_ARGS message is processed in H's handleMessage method, which then calls handleServiceArgs.

//ActivityThread$H
   public void handleMessage(Message msg) {
        ……
            case SERVICE_ARGS:
                if (Trace.isTagEnabled(Trace.TRACE_TAG_ACTIVITY_MANAGER)) {
                    Trace.traceBegin(Trace.TRACE_TAG_ACTIVITY_MANAGER,
                            ("serviceStart: " + String.valueOf(msg.obj)));
                }
                handleServiceArgs((ServiceArgsData)msg.obj);
                Trace.traceEnd(Trace.TRACE_TAG_ACTIVITY_MANAGER);
                break;
            ……
    }	

2.5 ActivityThread.handleServiceArgs

handleServiceArgs in ActivityThread.java first stores the return value of the service's onStartCommand in the local int variable res, then passes res to AMS through serviceDoneExecuting.

//ActivityThread.java
private void handleServiceArgs(ServiceArgsData data) {
    Service s = mServices.get(data.token);
    if (s != null) {
        try {
            if (data.args != null) {
                data.args.setExtrasClassLoader(s.getClassLoader());
                data.args.prepareToEnterProcess();
            }
            int res;
            if (!data.taskRemoved) {
                // the return value of the service's onStartCommand is captured here
                res = s.onStartCommand(data.args, data.flags, data.startId); 
            } else {
                s.onTaskRemoved(data.args);
                res = Service.START_TASK_REMOVED_COMPLETE;
            }

            QueuedWork.waitToFinish();

            try {
                // report the onStartCommand return value (res) to AMS
                ActivityManager.getService().serviceDoneExecuting(
                        data.token, SERVICE_DONE_EXECUTING_START, data.startId, res);
            } catch (RemoteException e) {
                throw e.rethrowFromSystemServer();
            }
        } catch (Exception e) {
            if (!mInstrumentation.onException(s, e)) {
                throw new RuntimeException(
                        "Unable to start service " + s
                        + " with " + data.args + ": " + e.toString(), e);
            }
        }
    }
}

2.5 ActivityManagerService.serviceDoneExecuting

serviceDoneExecuting in AMS calls serviceDoneExecutingLocked in ActiveServices.java.

//ActivityManagerService.java
public void serviceDoneExecuting(IBinder token, int type, int startId, int res) {
    synchronized(this) {
        if (!(token instanceof ServiceRecord)) {
            Slog.e(TAG, "serviceDoneExecuting: Invalid service token=" + token);
            throw new IllegalArgumentException("Invalid service token");
        }
        mServices.serviceDoneExecutingLocked((ServiceRecord)token, type, startId, res);
    }
}

2.6 ActiveServices.serviceDoneExecutingLocked

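serviceDoneExecutingLocked in ActiveServices.java handles each kind of onStartCommand return value. The variable to watch is r.stopIfKilled: for START_STICKY it is set to false, meaning the service is restarted after being killed; for START_NOT_STICKY it is set to true, meaning the service stays stopped once killed. The ServiceRecord is then passed on to another serviceDoneExecutingLocked overload.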

//ActiveServices.java
void serviceDoneExecutingLocked(ServiceRecord r, int type, int startId, int res) {
    boolean inDestroying = mDestroyingServices.contains(r);
    if (r != null) {
        // For the start phase, only the ActivityThread.SERVICE_DONE_EXECUTING_START type matters
        if (type == ActivityThread.SERVICE_DONE_EXECUTING_START) {
            // This is a call from a service start...  take care of
            // book-keeping.
            r.callStart = true;
            switch (res) {
                case Service.START_STICKY_COMPATIBILITY:
                case Service.START_STICKY: {
                    // We are done with the associated start arguments.
                    r.findDeliveredStart(startId, false, true);
                    // Don't stop if killed.
                    // START_STICKY: stopIfKilled is false, so the service is restarted after being killed
                    r.stopIfKilled = false;
                    break;
                }
                case Service.START_NOT_STICKY: {
                    // We are done with the associated start arguments.
                    r.findDeliveredStart(startId, false, true);
                    if (r.getLastStartId() == startId) {
                        // There is no more work, and this service
                        // doesn't want to hang around if killed.
                        // START_NOT_STICKY: stopIfKilled is true, so the service stays stopped once killed
                        r.stopIfKilled = true;
                    }
                    break;
                }
                case Service.START_REDELIVER_INTENT: {
                    // We'll keep this item until they explicitly
                    // call stop for it, but keep track of the fact
                    // that it was delivered.
                    ServiceRecord.StartItem si = r.findDeliveredStart(startId, false, false);
                    if (si != null) {
                        si.deliveryCount = 0;
                        si.doneExecutingCount++;
                        // Don't stop if killed.
                        r.stopIfKilled = true;
                    }
                    break;
                }
                case Service.START_TASK_REMOVED_COMPLETE: {
                    // Special processing for onTaskRemoved().  Don't
                    // impact normal onStartCommand() processing.
                    r.findDeliveredStart(startId, true, true);
                    break;
                }
                default:
                    throw new IllegalArgumentException(
                            "Unknown service start result: " + res);
            }
            if (res == Service.START_STICKY_COMPATIBILITY) {
                r.callStart = false;
            }
        }
   ……
}
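
The switch above can be condensed into a small plain-Java model to make the mapping easier to see. This is only a sketch under stated assumptions: StartResultModel, Record and stopIfKilledAfter are names invented for the illustration, Record keeps just two of ServiceRecord's fields, the START_REDELIVER_INTENT branch assumes the delivered start item was found, and START_TASK_REMOVED_COMPLETE is left out.

```java
final class StartResultModel {
    static final int START_STICKY_COMPATIBILITY = 0;
    static final int START_STICKY = 1;
    static final int START_NOT_STICKY = 2;
    static final int START_REDELIVER_INTENT = 3;

    /** Hypothetical, trimmed-down stand-in for ServiceRecord. */
    static final class Record {
        boolean stopIfKilled;
        int lastStartId;
    }

    // Mirrors the branches of the listing that touch stopIfKilled.
    static void applyStartResult(Record r, int startId, int res) {
        switch (res) {
            case START_STICKY_COMPATIBILITY:
            case START_STICKY:
                r.stopIfKilled = false;
                break;
            case START_NOT_STICKY:
                // Only the most recent start request may mark the service as stoppable.
                if (r.lastStartId == startId) {
                    r.stopIfKilled = true;
                }
                break;
            case START_REDELIVER_INTENT:
                // Assumes the delivered start item exists, as in the non-null branch above.
                r.stopIfKilled = true;
                break;
            default:
                throw new IllegalArgumentException("Unknown service start result: " + res);
        }
    }

    // Convenience helper: apply one result to a fresh record and read the flag back.
    static boolean stopIfKilledAfter(int lastStartId, int startId, int res) {
        Record r = new Record();
        r.lastStartId = lastStartId;
        applyStartResult(r, startId, res);
        return r.stopIfKilled;
    }

    public static void main(String[] args) {
        System.out.println("START_STICKY -> " + stopIfKilledAfter(1, 1, START_STICKY));
        System.out.println("START_NOT_STICKY -> " + stopIfKilledAfter(1, 1, START_NOT_STICKY));
        System.out.println("START_NOT_STICKY, stale startId -> " + stopIfKilledAfter(2, 1, START_NOT_STICKY));
    }
}
```

The stale start ID case is worth noting: if a newer start request has arrived in the meantime, returning START_NOT_STICKY for the older one leaves stopIfKilled unchanged.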

2.7 ActivityManagerService.appDiedLocked

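Searching for uses of r.stopIfKilled leads, for example, to canStopIfKilled in ServiceRecord.java, whose name already suggests a link to restarts. The first line of the crash log shown earlier, "Process com.shan.mvvm (pid 13678) has died: prcp SVC", is printed by appDiedLocked in AMS. From there, look at handleAppDiedLocked: its third argument, allowRestart, is true, which means a restart is allowed.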

//ActivityManagerService.java
final void appDiedLocked(ProcessRecord app, int pid, IApplicationThread thread,
        boolean fromBinderDied, String reason) {
   ……
    // Clean up already done if the process has been re-started.
    if (app.pid == pid && app.thread != null &&
            app.thread.asBinder() == thread.asBinder()) {
        boolean doLowMem = app.getActiveInstrumentation() == null;
        boolean doOomAdj = doLowMem;
        if (!app.killedByAm) {
            // log that the app process has died
            reportUidInfoMessageLocked(TAG,
                    "Process " + app.processName + " (pid " + pid + ") has died: "
                            + ProcessList.makeOomAdjString(app.setAdj, true) + " "
                            + ProcessList.makeProcStateString(app.setProcState), app.info.uid);
            mAllowLowerMemLevel = true;
        } else {
            // Note that we always want to do oom adj to update our state with the
            // new number of procs.
            mAllowLowerMemLevel = false;
            doLowMem = false;
        }
        EventLogTags.writeAmProcDied(app.userId, app.pid, app.processName, app.setAdj,
                app.setProcState);
        if (DEBUG_CLEANUP) Slog.v(TAG_CLEANUP,
            "Dying app: " + app + ", pid: " + pid + ", thread: " + thread.asBinder());
        // handle the death of the app process
        handleAppDiedLocked(app, false, true);
……
}

2.8 ActivityManagerService.handleAppDiedLocked

handleAppDiedLocked calls cleanUpApplicationRecordLocked to clean up the application record; allowRestart is still true at this point.

//ActivityManagerService.java
final void handleAppDiedLocked(ProcessRecord app,
        boolean restarting, boolean allowRestart) {
    int pid = app.pid;
    boolean kept = cleanUpApplicationRecordLocked(app, restarting, allowRestart, -1,
            false /*replacingPid*/);
……
}

2.9 ActivityManagerService.cleanUpApplicationRecordLocked

cleanUpApplicationRecordLocked in turn calls killServicesLocked in ActiveServices.

//ActivityManagerService.java
final boolean cleanUpApplicationRecordLocked(ProcessRecord app,
        boolean restarting, boolean allowRestart, int index, boolean replacingPid) {
    ……
    mServices.killServicesLocked(app, allowRestart);
	……
}	

2.10 ActiveServices.killServicesLocked

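killServicesLocked in ActiveServices.java checks how many times the service has crashed. Because allowRestart is true here, as long as the crash count is below 16 (and the user is still running) the code falls through to the final else branch and calls scheduleServiceRestartLocked.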

//ActiveServices.java
final void killServicesLocked(ProcessRecord app, boolean allowRestart) {
    // Report disconnected services.
    ……
        // Any services running in the application may need to be placed
        // back in the pending list.
        // allowRestart is true; BOUND_SERVICE_MAX_CRASH_RETRY is 16
        if (allowRestart && sr.crashCount >= mAm.mConstants.BOUND_SERVICE_MAX_CRASH_RETRY
                && (sr.serviceInfo.applicationInfo.flags
                    &ApplicationInfo.FLAG_PERSISTENT) == 0) {
            Slog.w(TAG, "Service crashed " + sr.crashCount
                    + " times, stopping: " + sr);
            EventLog.writeEvent(EventLogTags.AM_SERVICE_CRASHED_TOO_MUCH,
                    sr.userId, sr.crashCount, sr.shortInstanceName, app.pid);
            bringDownServiceLocked(sr);
        } else if (!allowRestart
                || !mAm.mUserController.isUserRunning(sr.userId, 0)) {
            bringDownServiceLocked(sr);
        } else {
            // try to restart the service here
            final boolean scheduled = scheduleServiceRestartLocked(sr, true /* allowCancel */);
          ……
        }
    }
……
}
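
The three-way branch above can be expressed as a small decision function. The sketch below is a plain-Java model, not framework code: KillDecisionModel, Action and decide are hypothetical names, the boolean parameters stand in for the checks on FLAG_PERSISTENT and isUserRunning, and the limit of 16 is the BOUND_SERVICE_MAX_CRASH_RETRY value mentioned in the listing's comment.

```java
final class KillDecisionModel {
    enum Action { BRING_DOWN, SCHEDULE_RESTART }

    // Value of BOUND_SERVICE_MAX_CRASH_RETRY as noted in the listing above.
    static final int MAX_CRASH_RETRY = 16;

    // Mirrors the if / else-if / else chain of killServicesLocked for one service.
    static Action decide(boolean allowRestart, int crashCount,
                         boolean persistentApp, boolean userRunning) {
        if (allowRestart && crashCount >= MAX_CRASH_RETRY && !persistentApp) {
            // Crashed too often: give up on this service.
            return Action.BRING_DOWN;
        } else if (!allowRestart || !userRunning) {
            return Action.BRING_DOWN;
        } else {
            // Corresponds to the call to scheduleServiceRestartLocked.
            return Action.SCHEDULE_RESTART;
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(true, 1, false, true));
        System.out.println(decide(true, 16, false, true));
        System.out.println(decide(true, 16, true, true));
    }
}
```

Note that in this model a persistent app is never stopped by the crash limit; it only ends up in BRING_DOWN when restarts are disallowed or its user is no longer running.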

2.11 ActiveServices.scheduleServiceRestartLocked

scheduleServiceRestartLocked is where canStopIfKilled comes into play. As noted above, the result is false for START_STICKY and true for START_NOT_STICKY. For START_STICKY the method therefore continues into the restart logic below and prints the "Scheduling restart of crashed service" log line.

//ActiveServices.java
/** @return {@code true} if the restart is scheduled. */
private final boolean scheduleServiceRestartLocked(ServiceRecord r, boolean allowCancel) {
    ……
        if (allowCancel) {
            // canStopIfKilled returns false for START_STICKY and true for START_NOT_STICKY
            final boolean shouldStop = r.canStopIfKilled(canceled);
            // if the service should stop, return here and skip the restart logic below
            if (shouldStop && !r.hasAutoCreateConnections()) {
                // Nothing to restart.
                return false;
            }
            reason = (r.startRequested && !shouldStop) ? "start-requested" : "connection";
        } else {
            reason = "always";
        }
        // the service restart logic starts here
        r.totalRestartCount++;
        if (r.restartDelay == 0) {
            r.restartCount++;
            r.restartDelay = minDuration;
        } else if (r.crashCount > 1) {
            r.restartDelay = mAm.mConstants.BOUND_SERVICE_CRASH_RESTART_DURATION
                    * (r.crashCount - 1);
        } else {
            // If it has been a "reasonably long time" since the service
            // was started, then reset our restart duration back to
            // the beginning, so we don't infinitely increase the duration
            // on a service that just occasionally gets killed (which is
            // a normal case, due to process being killed to reclaim memory).
            if (now > (r.restartTime+resetTime)) {
                r.restartCount = 1;
                r.restartDelay = minDuration;
            } else {
                r.restartDelay *= mAm.mConstants.SERVICE_RESTART_DURATION_FACTOR;
                if (r.restartDelay < minDuration) {
                    r.restartDelay = minDuration;
                }
            }
        }

        r.nextRestartTime = now + r.restartDelay;

        // Make sure that we don't end up restarting a bunch of services
        // all at the same time.
        boolean repeat;
        do {
            repeat = false;
            final long restartTimeBetween = mAm.mConstants.SERVICE_MIN_RESTART_TIME_BETWEEN;
            for (int i=mRestartingServices.size()-1; i>=0; i--) {
                ServiceRecord r2 = mRestartingServices.get(i);
                if (r2 != r && r.nextRestartTime >= (r2.nextRestartTime-restartTimeBetween)
                        && r.nextRestartTime < (r2.nextRestartTime+restartTimeBetween)) {
                    r.nextRestartTime = r2.nextRestartTime + restartTimeBetween;
                    r.restartDelay = r.nextRestartTime - now;
                    repeat = true;
                    break;
                }
            }
        } while (repeat);

    } else {
        // Persistent processes are immediately restarted, so there is no
        // reason to hold of on restarting their services.
        r.totalRestartCount++;
        r.restartCount = 0;
        r.restartDelay = 0;
        r.nextRestartTime = now;
        reason = "persistent";
    }

    if (!mRestartingServices.contains(r)) {
        r.createdFromFg = false;
        mRestartingServices.add(r);
        r.makeRestarting(mAm.mProcessStats.getMemFactorLocked(), now);
    }

    cancelForegroundNotificationLocked(r);

    mAm.mHandler.removeCallbacks(r.restarter);
    // schedule the restart task
    mAm.mHandler.postAtTime(r.restarter, r.nextRestartTime);
    r.nextRestartTime = SystemClock.uptimeMillis() + r.restartDelay;
    // this prints the second line of the crash log shown earlier
    Slog.w(TAG, "Scheduling restart of crashed service "
            + r.shortInstanceName + " in " + r.restartDelay + "ms for " + reason);
    EventLog.writeEvent(EventLogTags.AM_SCHEDULE_SERVICE_RESTART,
            r.userId, r.shortInstanceName, r.restartDelay);

    return true;
}
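
To get a feel for the back-off in the non-persistent branch, here is a rough plain-Java model of how restartDelay evolves. It is a sketch, not the framework code: RestartDelayModel and nextDelay are invented names, and the constant values (1000 ms minimum, 60 s reset window, factor 4, 30 min per extra crash) are assumptions about the defaults; the listing elides where minDuration and resetTime are computed, and they may differ by Android version. The model also skips the loop that spaces restarts of different services apart.

```java
final class RestartDelayModel {
    // Assumed defaults, not taken from the listing above.
    static final long MIN_DURATION_MS = 1_000L;
    static final long RESET_TIME_MS = 60_000L;
    static final long DURATION_FACTOR = 4L;
    static final long CRASH_RESTART_DURATION_MS = 30L * 60_000L;

    // Mirrors the if / else-if / else chain that assigns r.restartDelay.
    static long nextDelay(long previousDelay, int crashCount, long now, long restartTime) {
        if (previousDelay == 0) {
            // First restart: use the minimum delay.
            return MIN_DURATION_MS;
        }
        if (crashCount > 1) {
            // Repeated crashes: wait much longer, scaled by the crash count.
            return CRASH_RESTART_DURATION_MS * (crashCount - 1);
        }
        if (now > restartTime + RESET_TIME_MS) {
            // The service ran long enough: start over with the minimum delay.
            return MIN_DURATION_MS;
        }
        // Otherwise grow the delay geometrically, never below the minimum.
        return Math.max(previousDelay * DURATION_FACTOR, MIN_DURATION_MS);
    }

    public static void main(String[] args) {
        long delay = 0;
        // A service that is killed again right after every restart.
        for (int i = 0; i < 3; i++) {
            delay = nextDelay(delay, 0, 0, 0);
            System.out.println(delay + "ms");
        }
    }
}
```

Under these assumptions the first restart is scheduled after 1000 ms, which is consistent with the "in 1000ms" in the log at the top of this article, and later restarts are spaced further apart.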