[Android稳定性] 第048篇 [原理篇] Android SWT机制介绍

SWT-Android Software Watchdog Timeout

一、为什么需要SWT

System Server进程是Android的一个核心进程，里面为APP运行提供了核心的服务。如果System Server的一些核心服务和重要线程卡住，就会导致相应的功能异常。

如手机发生hang机，输入无响应，无法启动APP等一些不正常的情况。而且，如果没有一种机制，让这些服务复位的话，那么将严重影响客户体验。尤其是当前大多数手机把电池封装在手机里面的这种，想拨电池重启都很难。

所以有必要在核心服务和核心线程卡住的时候，让系统有自动复位的机会。于是，google引入了Sytem Server watchdog机制。这个机制来监控核心服务和核心线程是否卡住。

二、watchdog概述

当应用程序长时间没有响应（ANR）时，系统为了不让应用长时间处于不可操作的状态，会弹出一个“应用无响应”的窗口，供用户选择强制关闭该进程。

ANR机制时针对应用的，而当系统进程长时间没有响应时，Android系统提供了Watchdog机制来管控。

当分析死机重启问题时，经常看到这样一行日志：WATCHDOG KILLING SYSTEM PROCESS . 也就是看门狗杀死了系统进程。

Watchdog机制的实现主要是通过添加两种checker，每隔30s去检测一下是否出现了死锁或和线程池阻塞的情况，如果出现了，则kill掉系统。

Checker主要是如下两类：

MonitorChecker 检查系统核心服务是否被锁时间过长。 HandlerChecker 检查系统核心线程创建的Looper管理的消息队列是否阻塞。实际上，MonitorChecker也是一种线程是FgThread的HandlerChecker。

确认trace有效性：

SWT/ANR Timeout

Exception Type	Condition	Timeout(s)
SWT	Blocked in monitor main/fg/ui/io/display thread, ActivityManagerService/ PowerManagerService handler	60
Blocked in monitor PackageManagerService Handler	600
ANR –InputDispatching Timeout	Input Event (Key/Touch)分发超时	DEFAULT_INPUT_DISPATCHING_TIMEOUT = 5
ANR –Service Timeout	Service在指定时间内未执行完成	SERVICE_TIMEOUT = 20 SERVICE_BACKGROUND_TIMEOUT = 200
ANR –BroadcastQueue Timeout	Broadcast在指定时间内未执行完成	BROADCAST_FG_TIMEOUT = 10 BROADCAST_BG_TIMEOUT = 60
ANR –ContentProvider Timeout	Client在访问某个ContentProvider时设定timeout时间，规定时间内没返回会提示ContentProvider所在AP发生ANR	Client自定义，例如PROVIDER_ANR_TIMEOUT = 20

三、watchdog的启动

Watchdog在SystemServer系统服务进程中被初始化和启动，接下来看看SystemServer怎么把它初始化和启动起来：

/mssi/frameworks/base/services/java/com/android/server/SystemServer.java

/**
       * The main entry point from zygote.
       */
      public static void main(String[] args) {
         new SystemServer().run();
      }

private void run() {
...
  // Start services.
         try {
             t.traceBegin("StartServices");
             startBootstrapServices(t);
             /// M: For mtk systemserver
             sMtkSystemServerIns.startMtkBootstrapServices();
             startCoreServices(t);
             /// M: for mtk other service.
             sMtkSystemServerIns.startMtkCoreServices();
             startOtherServices(t);
             startApexServices(t);
         } catch (Throwable ex) {
             Slog.e("System", "******");
             Slog.e("System", "** Failure starting system services", ex);
             throw ex;
         } finally {
             t.traceEnd(); // StartServices
         }

private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
        t.traceBegin("startBootstrapServices");

        // Start the watchdog as early as possible so we can crash the system server
        // if we deadlock during early boot
        t.traceBegin("StartWatchdog");
        final Watchdog watchdog = Watchdog.getInstance();
        watchdog.start();//启动Watchdog
        mDumper.addDumpable(watchdog);
        t.traceEnd();

当SystemServer进程启动之后，就会进入它的main()方法，而它的main()方法有调用了startOtherService()方法，这个方法就把Watchdog启动起来了。startOtherService()方法首先通过Watchdog的getInstance()方法获得Watchdog对象，使用的是单例模式。接着调用init()方法来做进一步操作，最后调用Watchdog的start()方法启动线程进行监控。

3.1 watchdog重要变量

/mssi/frameworks/base/services/core/java/com/android/server/Watchdog.java


/**
  * This class calls its monitor every minute. Killing this process if they don't return
  **/
 public class Watchdog implements Dumpable {
     static final String TAG = "Watchdog";
 
     /** Debug flag. */
     public static final boolean DEBUG = false;
 
     // Set this to true to use debug default values.
     private static final boolean DB = false;
 
     // 注意 1：如果不收紧 com.android.internal.os.ZygoteConnection 中的 invoke-with 超时，请勿将此值降低到三十秒以下，否则应用程序会触发看门狗。
     // 注2：调试值已经低于ZygoteConnection中的等待时间。 应用程序可能不适用于调试版本。 CTS 将失败。
     private static final long DEFAULT_TIMEOUT = DB ? 10 * 1000 : 60 * 1000;
  
      // These are temporally ordered: larger values as lateness increases
      private static final int COMPLETED = 0;
      private static final int WAITING = 1;
      private static final int WAITED_HALF = 2;
      private static final int OVERDUE = 3;
      private final int TIME_SF_WAIT = 20000;
  
      // Track watchdog timeout history and break the crash loop if there is.
      private static final String TIMEOUT_HISTORY_FILE = "/data/system/watchdog-timeout-history.txt";
      private static final String PROP_FATAL_LOOP_COUNT = "framework_watchdog.fatal_count";
      private static final String PROP_FATAL_LOOP_WINDOWS_SECS =
              "framework_watchdog.fatal_window.second";

3.2 watchdog构造方法


private Watchdog() {
          mThread = new Thread(this::run, "watchdog");
  
          // Initialize handler checkers for each common thread we want to check.  Note
          // that we are not currently checking the background thread, since it can
          // potentially hold longer running operations with no guarantees about the timeliness
          // of operations there.
          //
          // The shared foreground thread is the main checker.  It is where we
          // will also dispatch monitor checks and do other work.
           // 创建mMonitorChecker，用于监测前台线程，用于添加需要监测的任务
          mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                  "foreground thread");
          mHandlerCheckers.add(withDefaultTimeout(mMonitorChecker));
          // Add checker for main thread.  We only do a quick check since there
          // can be UI running on the thread.
          mHandlerCheckers.add(withDefaultTimeout(
                  new HandlerChecker(new Handler(Looper.getMainLooper()), "main thread")));
          // Add checker for shared UI thread.
          mHandlerCheckers.add(withDefaultTimeout(
                  new HandlerChecker(UiThread.getHandler(), "ui thread")));
          // And also check IO thread.
          mHandlerCheckers.add(withDefaultTimeout(
                  new HandlerChecker(IoThread.getHandler(), "i/o thread")));
          // And the display thread.
          mHandlerCheckers.add(withDefaultTimeout(
                  new HandlerChecker(DisplayThread.getHandler(), "display thread")));
          // And the animation thread.
          mHandlerCheckers.add(withDefaultTimeout(
                   new HandlerChecker(AnimationThread.getHandler(), "animation thread")));
          // And the surface animation thread.
          mHandlerCheckers.add(withDefaultTimeout(
                  new HandlerChecker(SurfaceAnimationThread.getHandler(),
                      "surface animation thread")));
          // Initialize monitor for Binder threads.
          addMonitor(new BinderThreadMonitor()); // 添加到MonitorChecker，用于检查binder线程
  
          mInterestingJavaPids.add(Process.myPid());
  
          // See the notes on DEFAULT_TIMEOUT.
          assert DB ||
                  DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
  
          //mtk enhance
          exceptionHang = ExceptionLog.getInstance();
  
          mTraceErrorLogger = new TraceErrorLogger();
      }

接下来看看Watchdog的init()方法：

init()方法的实现比较简单，主要是给mActivity对象赋值，mActivity是一个全局的AMS对象，init()方法中会注册重启广播接收器RebotRequestReceiver，用来负责接收系统内部发出的系统重启请求。


  /**
       * Registers a {@link BroadcastReceiver} to listen to reboot broadcasts and trigger reboot.
       * Should be called during boot after the ActivityManagerService is up and registered
       * as a system service so it can handle registration of a {@link BroadcastReceiver}.
       */
      public void init(Context context, ActivityManagerService activity) {
          mActivity = activity;
          context.registerReceiver(new RebootRequestReceiver(),
                  new IntentFilter(Intent.ACTION_REBOOT),
                  android.Manifest.permission.REBOOT, null);
          if (exceptionHang != null) {
              exceptionHang.WDTMatterJava(0);
          }
      }

四、watchdog的监听

Watchdog继承Thread，所以调用start()方法之后，就会进入Watchdog的run()方法，它来做监控工作。

Watchdog主要提供了addMonitor()方法来添加监控服务对象，而在添加这些服务对象到Watchdog监控之前，这些服务必须要实现Watchdog.Monitor接口。比如AMS就首先实现了Watchdog.Monitor接口，然后在它的构造方法里把自己添加到Watchdog中，让Watchdog检测自己是否死锁，代码如下：


     // 注意：此方法在主线程上调用，但可能需要附加各种
     // 其他线程的处理程序。 因此，请注意对 looper 进行明确说明。
     public ActivityManagerService(Context systemContext, ActivityTaskManagerService atm) {


.....
         Watchdog.getInstance().addMonitor(this);
         Watchdog.getInstance().addThread(mHandler);

通过addMonitor()方法和addThread()方法，分别把AMS和mHandler添加到Watchdog中。mHandler是一个AMS中的Handler对象，意思是Watchdog不仅要监控AMS是否死锁，还要监控mHandler分发消息的时候是否阻塞。接下来看看Watchdog的addMonitor()方法：


     public void addMonitor(Monitor monitor) {
         synchronized (mLock) {
             mMonitorChecker.addMonitorLocked(monitor);
         }
     }

AMS实现了Watchdog.Monitor接口，所以这个monitor就是AMS对象。mMontorChecker是一个HandlerChecker对象，HandlerChecker类实现了java的Runnable类：


      /**
       * Used for checking status of handle threads and scheduling monitor callbacks.
       */
      public final class HandlerChecker implements Runnable {
		//...

          private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
		//...

          void addMonitorLocked(Monitor monitor) {
              //当 Handler 正在检查时，我们不更新 mMonitors
              //所有监视器。 如果安全，将在下一个时间表更新 mMonitors
              mMonitorQueue.add(monitor);
          }

mMontors是一个Monitor类型的ArrayList，系统启动完成之后，系统服务都会添加到这个mMonitors中。AMS中还调用了Watchdog的addThread()方法，下面了解一下：


      public void addThread(Handler thread) {
          synchronized (mLock) {
              final String name = thread.getLooper().getThread().getName();
              mHandlerCheckers.add(withDefaultTimeout(new HandlerChecker(thread, name)));
          }
      }

通过mHandlerCheckers的add()方法把Handler添加进去，从而监控那些重要的线程里的Handler是否阻塞。

当调用Watchdog.getInstance().start()时，则进入线程“watchdog”的run()方法, 该方法分成两部分:

前半段用于监测是否触发超时
后半段当触发超时时输出各种信息，最后杀死SystemServer进程

接下来进入run方法，watchdog通过run方法来进行监控

/mssi/frameworks/base/services/core/java/com/android/server/Watchdog.java


private void run() {
          boolean waitedHalf = false;
          while (true) {
              List<HandlerChecker> blockedCheckers = Collections.emptyList();
              String subject = "";
              boolean allowRestart = true;
              int debuggerWasConnected = 0;
              boolean doWaitedHalfDump = false;
              //当我们执行循环时，mWatchdogTimeoutMillis 的值可能会改变。
              //我们存储当前值以对所有处理程序使用一致的值。
              final long watchdogTimeoutMillis = mWatchdogTimeoutMillis;
              final long checkIntervalMillis = watchdogTimeoutMillis / 2;// 30s检查一次
              final ArrayList<Integer> pids;
  
              mSfHang = false;
              if (exceptionHang != null) {
                  exceptionHang.WDTMatterJava(300);
              }
              synchronized (mLock) {
                  long sfHangTime;
                  long timeout = checkIntervalMillis;
                  // Make sure we (re)spin the checkers that have become idle within
                  // this wait-and-check interval
                  for (int i=0; i<mHandlerCheckers.size(); i++) {
                      HandlerCheckerAndTimeout hc = mHandlerCheckers.get(i);
                      // We pick the watchdog to apply every time we reschedule the checkers. The
                      // default timeout might have changed since the last run.
                      hc.checker().scheduleCheckLocked(hc.customTimeoutMillis()
                              .orElse(watchdogTimeoutMillis * Build.HW_TIMEOUT_MULTIPLIER));
                  }
  
                  if (debuggerWasConnected > 0) {
                      debuggerWasConnected--;
                  }

该方法主要功能:

执行所有的Checker的监控方法scheduleCheckLocked()
当mMonitor个数为0(除了android.fg线程之外都为0)且处于poll状态,则设置mCompleted = true;
当上次check还没有完成, 则直接返回.
等待30s后, 再调用evaluateCheckerCompletionLocked来评估Checker状态;
根据waitState状态来执行不同的操作:
当COMPLETED或WAITING,则相安无事;
当WAITED_HALF(超过30s)且为首次, 则输出system_server和3个Native进程的traces;
当OVERDUE, 则输出更多信息.

接下来看看scheduleCheckLocked()方法：


      public final class HandlerChecker implements Runnable {


   public void scheduleCheckLocked(long handlerCheckerTimeoutMillis) {
              mWaitMax = handlerCheckerTimeoutMillis;
              if (mCompleted) {
                  // Safe to update monitors in queue, Handler is not in the middle of work
                  mMonitors.addAll(mMonitorQueue);
                  mMonitorQueue.clear();
              }
              if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
                      || (mPauseCount > 0)) {
                  // Don't schedule until after resume OR
                  // If the target looper has recently been polling, then
                  // there is no reason to enqueue our checker on it since that
                  // is as good as it not being deadlocked.  This avoid having
                  // to do a context switch to check the thread. Note that we
                  // only do this if we have no monitors since those would need to
                  // be executed at this point.
                  mCompleted = true; //当目标looper正在轮询状态则返回。
                  return;
              }
              if (!mCompleted) {
                  // we already have a check in flight, so no need //有一个check正在处理中，则无需重复发送
                  return;
              }

              mCompleted = false;
              mCurrentMonitor = null;
              mStartTime = SystemClock.uptimeMillis(); // 记录当下的时间
              mHandler.postAtFrontOfQueue(this);
          }

          @Override
          public void run() {
              // Once we get here, we ensure that mMonitors does not change even if we call
              // #addMonitorLocked because we first add the new monitors to mMonitorQueue and
              // move them to mMonitors on the next schedule when mCompleted is true, at which
              // point we have completed execution of this method.
              final int size = mMonitors.size();
              for (int i = 0 ; i < size ; i++) {
                  synchronized (mLock) {
                      mCurrentMonitor = mMonitors.get(i);
                  }
                  mCurrentMonitor.monitor(); //回调具体服务的monitor方法
              }
  
              synchronized (mLock) {
                  mCompleted = true;
                  mCurrentMonitor = null;
              }
          }

该方法主要功能: 向Watchdog的监控线程的Looper池的最头部执行该HandlerChecker.run()方法, 在该方法中调用monitor(),执行完成后会设置mCompleted = true. 那么当handler消息池当前的消息, 导致迟迟没有机会执行monitor()方法, 则会触发watchdog

其中postAtFrontOfQueue(this)，该方法输入参数为Runnable对象，根据消息机制，最终会回调HandlerChecker中的run方法，该方法会循环遍历所有的Monitor接口，具体的服务实现该接口的monitor()方法。而monitor()方法只是简单获取锁，如AMS的monitor方法：

public void monitor() { synchronized (this) { }}

AMS判断是否死锁就是去看能不能拿到自己的锁

run()方法后面的实现则是当触发超时时输出各种信息，最后杀死SystemServer进程

五、流程总结

Watchdog是一个运行在system_server进程的名为”watchdog”的线程:：

Watchdog运作过程，当阻塞时间超过1分钟则触发一次watchdog，会杀死system_server,触发上层重启；
mHandlerCheckers：记录所有的HandlerChecker对象的列表，包括foreground, main, ui, i/o, display线程的handler;
mHandlerChecker.mMonitors：记录所有Watchdog目前正在监控Monitor，所有的这些monitors都运行在foreground线程。
有两种方式加入Watchdog的监控：
- addThread()：用于监测Handler对象，默认超时时长为60s.这种超时往往是所对应的handler线程消息处理得慢；
- addMonitor(): 用于监控实现了Watchdog.Monitor接口的服务.这种超时可能是”android.fg”线程消息处理得慢，也可能是monitor迟迟拿不到锁；

watchdog检查进程，会得到四种状态。COMPLETED检查完成未阻塞；WAITING检查尚未完成，等待中；WAITED_HALF检查尚未完成，等待中，时间过去了一半；OVERDUE检查完成，未返回，发生了阻塞。

六、工具使用

工具名称：hang_SWT_analysis-v6.1

版本说明：

    6.1(2022/10/26)

            1、修复只抓了一次trace而没有显示的问题

            2、addr2line改用llvm-addr2line.exe

            3、兼容DB中不存在\_\_exp\_main.txt的情况

    6.0(2022/03/31)

            命令行增加解析所有native层trace的选项

    5.1(2021/05/10)

            适配Dump native maps files格式的修改

    5.0(2020/12/17)

            1、增加自动更新功能

            2、工具改名为hang\_SWT\_analyze

            3、hang DB分析中增加根据vmlinux解析kernel trace的功能，方便定位到具体文件名及行号

            4、hang DB分析中增加命令行模式，以便单独解析指定的进程名。

    4.0(2020/10/25)

            增加扫描SWT DB的功能

    v3.0(2020/10/14)

            1、解除先必须用NE/KE分析工具的限制，可以直接在hang DB上解析\~

            2、在分析报告最后新增kernel trace统计，即将kernel调用栈相同的进程聚集在一起，便于分析kernel底层卡死问题！

    v2.0(2020/5/25)

            修复加了symbol文件却检查不到的问题

    v1.0(2020/4/3)

使用说明：

    1、安装：

            直接点击该目录中的install.bat。(win10 请选择“以管理员身份运行”) 

            如果有成功安装，则选择DB文件夹后右键会多出“hang\_SWT\_analyze”选项！

    2、使用：

            在hang or SWT DB文件夹，点击右键，选择hang\_SWT\_analyze；

            分析完后，会在DB目录下生成分析报告“hang\_SWT\_report.txt”，该工具也会直接打开该分析报告。

    3、卸载：

            直接点击安装目录中的uninstall.bat。(win10 请选择“以管理员身份运行”)

七、常见SWT/ANR原因有如下几种：

7.1 等锁

线程状态为“Blocked”，通过关键字“held by”进一步确认哪个线程拿住了锁，如有死锁检查code逻辑进行解锁；

线程状态为“Waiting”，表示当前线程需要另外一个线程来notify()，需要根据callstack结合code来做分析，以找到是另外的某个线程拿住了锁。

如果很多线程在等同一把锁，可能产生资源竞争问题，导致某些线程可能拿不到锁。

7.2 SurfaceFlinger卡住

SF hang Time > 40s（Service.sf.status值），sf hang，

直接在”SYS_ANDROID_LOG”搜索”I watchdog”，看是否有“surfaceflinger hang”关键字。如果有，请进一步确认main_log里有"SF-WD"相关log打印，或者与SWT相关的thread block在android.view.SurfaceControl.XXXX，更进一步分析请参考如下链接内容：

Quick Start > SurfaceFlinger Hang issue Guide

7.3 Native方法执行时间过长

线程状态为”Native”，根据native方法找到对应模块的owner，进一步确认该native方法为何执行时间过长，例如是否等待硬件返回或者硬件本身存在问题等。

7.4 Binder Server卡住

线程状态为“Native”，且含有如下callstack：

IPCThreadState::waitForResponse-->IPCThreadState::talkWithDriver，

表示卡在binder对端，下一步要找到对端，找到对端后，从对端thread的callstack中确认卡住的接口，并请对端相关的owner帮忙解决。

怎么寻找binder对端信息？

a. 根据binder thread的sysTid在SYS_BINDER_INFO/SWT_JBT_TRACES/ kernel_log中查找binder通信对端，关键字“outgoing transaction”

b.在SYS_PROCESSES_AND_THREADS通过对端的sysTid查找process name

c. 如果对端是Monkey，比较特别，可以不用关注，除非是严重影响Monkey Test。

d. 上述方法找不到binder对端，请参考“[FAQ22212] 如何根据binder client端查找binder server端？”

e. 如果按照上述方法依然找不到对端，只能按照code逻辑关系请相关module owner帮忙排查。

SystemServer binder used up问题：

Backtrace卡在： at android.os.Binder.blockUntilThreadAvailable(Native method) at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:271)

请参考“[FAQ21386] 怎么初步定位binder used up导致的SWT问题？”

7.5 Zygote fork进程时卡住

线程状态为”Native”，确认callstack中有”Process.zygoteSendArgsAndGetResult”，

对于Zygote fork进程时卡住的问题，一般是由于底层memory问题引起的，请检查是否有memory不足或者memory leak的问题

7.6 Dump时间过长

前面有ANR发生：

a. 前面有ANR发生，”ActivityManager” 线程正在做dump操作，通过如下 callstack确认：

appNotResponding, dumpStackTraces

b. 从“SYS_ANDROID_LOG”中确认是dump哪一个进程花的时间过长

c. 搜索关键字”dumpStackTraces process”来确认ANR发生时所dump的进程

d. 通过dump的上一个进程与下一个进程来确认时间差是否大于60s，或者所有的 dump时间加起来远远超过60s

e. ANR所引起的dump时间过长，需要看是否某个进程dump时间过长，并确认其原因

前面有fatal JE、NE、KE等Exception发生；
自动化测试脚本主动call “dumpsys”去dump系统信息：

a. 通过callstack中确认是其他进程通过binder call发起AMS进行dump

b. 自动化测试脚本一般是进程“adb” call进来的，该类SWT一般也不用关注，只会在eng/userdebug版本下，或者开启mtklog后才会发生

注意：这种dump所导致的SWT，一般来说是系统loading过重，或者需要dump的信息确实过多所引起。终端用户不会发生这类问题，建议不用过多关注。

八、SWT Dump 检查

有时会在SWT db中找不到所需要时间点的trace，需要check swt dump flow是否有问题。

8.1 正常的SWT dump flow：

正常的SWT dump包含两次ART的Pre-Dump，log pattern如下:

/30s，system_server发起1st Pre-Dump请求
03-04 04:38:57.545452 1209 3466 I system_server: libdebuggerd_client: started dumping process 1209

//"Signal Catcher"接收到signal 3开始dump process
03-04 04:38:57.554717 1209 1214 I system_server: Thread[2,tid=1214,WaitingInMainSignalCatcherLoop,Thread*=0x7c1520f400,peer=0x14b80250,"Signal Catcher"]: reacting to signal 3

//ART开始做1st真正dump process
03-04 04:38:57.698137 1209 1214 I system_server: Start dump backtrace tid:1209, name:main

//60s，SWT发生，main log信息打印
03-04 04:39:33.503209 1209 3466 E Watchdog: SWT happen Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on main thread (main), Blocked in handler on ui thread (android.ui), Blocked in handler on display thread (android.display), Blocked in handler on ActivityManager (ActivityManager)

//event log信息打印
03-04 04:39:33.513 1209 3466 I watchdog: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on main thread (main), Blocked in handler on ui thread (android.ui), Blocked in handler on display thread (android.display), Blocked in handler on ActivityManager (ActivityManager)

//system_server发起2nd Pre-Dump请求
03-04 04:39:33.584351 1209 3466 I system_server: libdebuggerd_client: started dumping process 1209

//"Signal Catcher"接收到signal 3开始dump process
03-04 04:39:33.586273 1209 1214 I system_server: Thread[2,tid=1214,WaitingInMainSignalCatcherLoop,Thread*=0x7c1520f400,peer=0x14b80250,"Signal Catcher"]: reacting to signal 3

//ART开始做2nd真正dump process
03-04 04:39:33.651755 1209 1214 I system_server: Start dump backtrace tid:1209, name:main

//AEE开始打包temp db：db.c0TqNO
03-04 04:39:41.775003 3749 3749 D AEE_AED : aed_report_dump_open: Using /data/aee_exp/temp/db.c0TqNO
03-04 04:39:53.240054 3959 3959 D AEE_ARCHIVE: aed_archive: pack files in /data/aee_exp/temp/db.c0TqNO
03-04 04:39:53.565818 3959 3959 I AEE_ARCHIVE: aed_archive: pack /data/aee_exp/temp/db.c0TqNO/CURRENT.dbg completed
03-04 04:39:53.566222 3959 3959 I AEE_ARCHIVE: archive done

//temp db打包完成，重命名为db.fatal.00.SWT
03-04 04:39:53.570918 3749 3749 I AEE_AED : * INDEX: '/data/aee_exp/db.fatal.00.SWT' from /data/aee_exp/temp/db.c0TqNO *
03-04 04:39:53.571262 3749 3749 I AEE_AED : aed_report_complete: OK find db slot /data/aee_exp/db.fatal.00.SWT

//db抓取完成，SystemServer Exit
03-04 04:39:46.879632 1209 3466 I Process : Sending signal. PID: 1209 SIG: 9

两次Pre-Dump对应的Real-Dump是在db解压出来的SWT_JBT_TRACES中
----- pid 1209 at 2019-03-04 04:38:57 -----
Cmd line: system_server
……
----- pid 1209 at 2019-03-04 04:39:33 -----
Cmd line: system_server
……

8.2 不完整SWT DB/log的说明和处理方案

1）SWT DB中只有一次有效trace

a) 仅有一次trace，缺失30s predump的backtrace

b) 60s dump backtrace滞后严重

此类情况，还请check是否存在performance问题。如果performance checkpoint均正常，则信息不足需要复现并提供完整SWT DB。

2）SWT DB不生成

a) 有data/anr文件

b) 无data/anr文件

这两种情况，还请排查：

a) AEE起来的比较晚，导致还没有打包完SystemServer就被kill掉，实际上trace还是有dump出来的，需要提供 data/anr 文件进行初步分析

b) ART dump process的时候出了问题，可能是因为Performance问题发生ART dump timeout，通常会有如下timeout log打印：           system\_server: dump timeout : XXX           system\_server: libdebuggerd\_client: received packet of unexpected length from tombstoned: expected 128, received -1         此种情况下建议同步排查系统Performance状况。

c) 存在存储问题，比如说Storage剩余空间不够 

d) 存在watchdog的厂商定制，如果厂商无定制，可以结合mobilelog，有些厂商的**feedback机器**没有mobilelog但是有定制的完整现场log，可以提供这些现场信息具体问题具体分析。

3）关键Native callstack无dump

比如StorageManagerService monitor timeout，但是vold native callstack无打印；

比如PackageManagerService 卡dexopt，但是installd native callstack无打印。

此类情况，可以查看DB中的其他文件来借助分析，比如kernellog。

九、Monkey Test注意事项

1，Monkey Test重点关注系统Stability问题，performance因素引发的ANR无需关注，同时为了增强系统健壮性，要避免debug机制引入的SWT问题（比如ANR dump）。

2，MTBF blacklist

 新版本的MTBF都应该是支持自定义blacklist

  adb push .\\blacklist.txt /sdcard/

com.debug.loggerui
com.longcheertel.cit
com.longcheertel.fasttest
com.longcheertel.proximity
com.longcheertel.runin
com.longcheertel.fsninfo

3，测试工具是否有更新，\resource\MTBFStress\resource

FileFill_CORGI.apk和FileFill_SYSTEM.apk

十、性能问题导致的SWT

10.1 检查 CPU Loading 状况：

DB	关键log	备注
__exp_main.txt	CPU usage from 31076ms to 238ms ago...98% TOTAL: 97% user + 1.1% kernel + 0% iowait + 0% softirq	查看TOTAL CPU usage超过95%，检查CPU usage排名状况
SYS_ANDROID_LOG	CPU usage from	如果__exp_main.txt中没有CPU usage，可以参考本文件中对应时间区间的Total CPU usage
SYS_ANDROID_EVENT_LOG	cpu : [85,20,45,16,0,0]	(total),(user),(system),(iowait),(irq),(softirq)Check：am_anr 以及 watchdog 之前时间点此event的打印取决于”events.cpu”该系统属性设置为”true”
SYS_KERNEL_LOG	set_adaptive_cpu_power_limit	Thermal limitation cpufreq指的是平台ATM启动后，每5次打印出对应的temp和limit power(限制CPU对应最大频率使用的)
SYS_KERNEL_LOG	[Power/PPM] (0x70)(2712)(0)(0-7)(15)(10)(4)(4) (15)(14)(4)(4)	红色部分不是0，代表有限频

10.2 Low Memory

https://online.mediatek.com/QuickStart/QS00152#QSS01632

DB	关键log	备注
SYS_CPU_INFO	Kswapd0	排名前3
KERNEL_LOG	Out of memory: Kill process 1297 (system_server) score Zygote : Zygote failed to write to system_server FD: Connection refused Zygote : Process 1297 exited due to signal 9 (Killed) Zygote : Exit zygote because system server (pid 1297) has t	adj越小说明系统lowmemory越严重 < 100（红色两个加起来，指明当前可用memory，越小说明系统lowmemory越严重）< 200M decrease N level是kernel4.9的AMR机制，指明当前是否根据ZRAM 消耗值是否越级查杀进程，值越大越激进 ≥ 2Alloc buffer fail ,疑似泄露 > 500M
SYS_ANDROID_LOG	• lowmemorykiller: Kill 'com.google.android.apps.wellbeing' (17392), uid 10155, oom_adj 945 to free 11248kB• lowmemorykiller: Reclaimed 11248kB, cache(220756kB) and free(140600kB)-reserved(95248kB) below min(221184kB) for oom_adj 900	adj越小说明系统lowmemory越严重 < 100 cache(xxxx) , ... below min(xxxx) xxxx < 130

出现上述信息，可以怀疑是memory 导致的，先解决low memory问题再进行复测排查。

10.3 IO 检查

https://online.mediatek.com/QuickStart/QS00152#QSS01633

DB	关键log	备注
SYS_CPU_INFO	exe_cq + mmcqd	exe_cq + mmcqd loading 之和占top5
SYS_BLOCKIO	wl:**% , for example:mmc.q:0.rt:30803,195854336,6209.wl:88%...mmc.q:0.wl:71%,0,1003492079,0.vm:……	IO workload 要一直大于 70%，才表示IO loading高必要时借助Perlyzer工具进行解析
SYS_FILE_SYSTEMS	64% /data	1). Data 分区使用建议小于 90%2). Data 分区快满的时候，IO read/write性能会下降，此时IO loading高也是正常的
SYS_ANDROID_LOG	CPU usage98% TOTAL: 3.4% user + 19% kernel + 75% iowait + 0% softirq	务必找到有效时间段的CPU usage

loading高，以及找出具体process信息。

十一，案例

1， Binder Server卡住

https://online.mediatek.com/QuickStart/QS00152#QSS01640

案例1

2，Deadlock

https://online.mediatek.com/QuickStart/QS00152#QSS01634

3，ANR/JE引发SWT

https://online.mediatek.com/QuickStart/QS00152#QSS01639

4，StorageManagerService卡住

https://online.mediatek.com/QuickStart/QS00152#QSS02993

5，Dump时间过长

https://online.mediatek.com/QuickStart/QS00152#QSS01638

2025 年 7 月
日	一	二	三	四	五	六
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31