Linux server crash investigation notes

Author: 比比东传承

Category: Linux systems. Tags: linux, java, redis

Article link: https://blog.csdn.net/qq_28175477/article/details/106363403

Investigating Linux server crashes

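I recently took back a project's code from an outsourcing company, and the server started crashing frequently. The project uses the dubbo framework and is deployed on Linux. Before the handover it already had in-app cloud live streaming plus Alipay and WeChat Pay; after taking over, we added paid content inside live streams. Once that went live, the server crashed repeatedly. At first I suspected the code, but the server error logs showed nothing useful, and checking Redis, the cloud live-streaming service, and several other components in turn also led nowhere. The cause turned out to be that when a Linux server runs short of memory, the kernel's OOM killer kills the "worst" process, which made the whole app unusable. Through inexperience I overlooked what should have been checked first, so here is a short summary of what I learned while investigating the crash.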

Monitoring processes in real time from the Linux shell

https://blog.csdn.net/chen1415886044/article/details/103000827

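Preface

A program running on the system is called a process. The ps command is the basic tool for inspecting processes, and it is very useful for collecting information about what is running. Its limitation is that it only shows a snapshot at one point in time, so it cannot reveal trends for processes that are frequently swapped in and out of memory. To monitor process state in real time, use the top command.

Usage

First, run the top command in a terminal:
The first part of the output is a summary of the system:
The first line shows the current time, the system uptime, the number of logged-in users, and the load average.
The second line shows a summary of processes:
top calls processes tasks; the line shows how many are running, sleeping, stopped, or zombie (the process has finished, but its parent has not yet collected its exit status).
The third line shows a CPU summary:
top splits CPU utilization into several categories according to the owner of the process and the state of the process.
The last two lines describe system memory:
The first covers physical memory: total, used, and free. The second gives the same information for swap space.
The rest of the output is a detailed list of the currently running processes: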

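Name	Meaning
PID	Process ID
USER	Name of the process owner
PR	Process priority
NI	Nice value of the process
VIRT	Total virtual memory used by the process
RES	Total physical (resident) memory used by the process
SHR	Total memory the process shares with other processes
S	Process state (D: uninterruptible sleep, R: running, S: sleeping, T: traced or stopped, Z: zombie)
%CPU	Share of CPU time used by the process
%MEM	Share of physical memory used by the process
TIME+	Total CPU time used since the process started
COMMAND	Command name of the process (the program that was started)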

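Each interactive command is a single character; typing it while top is running changes top's behavior.
For example, typing f lets you choose the field used to sort the output.
Typing d lets you change the polling interval, and typing q quits top.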
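
For a non-interactive snapshot, the same kind of ranking can be scripted. The sketch below is an illustration rather than something from the original investigation: `top_rss` is a hypothetical helper that reads "PID RSS COMMAND" lines on standard input and prints the N entries with the largest resident memory.

```shell
# top_rss N: print the N lines with the largest RSS (second column, in KiB).
# Hypothetical helper; expects "PID RSS COMMAND" lines on stdin. N defaults to 5.
top_rss() {
    sort -k2,2nr | head -n "${1:-5}"
}

# Typical use on a live system (ps reports RSS in KiB; "=" suppresses the headers):
#   ps -eo pid=,rss=,comm= | top_rss 5
```

top itself can also be run once without the interactive screen using top -b -n 1, which is convenient for writing periodic snapshots to a log file.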

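The Linux OOM Killer mechanism

Introduction

The OOM Killer is a Linux kernel mechanism that deals with processes using too much memory. When memory is about to run out, it scores the processes to decide which one is the "worst" and kills the one with the highest score.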
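
The kernel exposes each process's current score in /proc/<pid>/oom_score, so the likely victims can be listed before anything is killed. The following is a minimal sketch under that assumption; `oom_ranking` is a hypothetical helper name, and its optional argument exists only so it can be pointed at a fake directory tree for experimentation.

```shell
# oom_ranking [PROC_ROOT]: print "score<TAB>pid<TAB>command", highest score first.
# Hypothetical helper. PROC_ROOT defaults to /proc.
oom_ranking() {
    for dir in "${1:-/proc}"/[0-9]*; do
        [ -r "$dir/oom_score" ] || continue
        # A process may exit between the glob and the read, so tolerate failures.
        score="$(cat "$dir/oom_score" 2>/dev/null)" || continue
        comm="$(cat "$dir/comm" 2>/dev/null)" || comm="?"
        printf '%s\t%s\t%s\n' "$score" "${dir##*/}" "$comm"
    done | sort -k1,1nr
}

# Typical use: show the five most likely victims.
#   oom_ranking | head -n 5
```

On a server like the one described here, a large Java process would be expected near the top of this list.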

How to check

grep "Out of memory" /var/log/messages
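
To see which processes were actually killed, the matching log lines can be reduced to a PID and a name. This is only a sketch: `oom_victims` is a hypothetical helper, and the "Killed process <pid> (<name>)" wording it looks for is an assumption about the kernel's message format, which differs between kernel versions.

```shell
# oom_victims: read kernel log lines on stdin and print "pid name" for each
# process the OOM killer reports as killed. Hypothetical helper; the message
# format matched here is an assumption and may not fit every kernel.
oom_victims() {
    sed -n -E 's/.*Killed process ([0-9]+) \(([^)]*)\).*/\1 \2/p'
}

# Typical use (the log path depends on the distribution):
#   oom_victims < /var/log/messages
#   dmesg | oom_victims
```

On Debian and Ubuntu systems the kernel messages usually land in /var/log/syslog or /var/log/kern.log rather than /var/log/messages.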

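When it is triggered

When the kernel triggers the OOM mechanism it calls out_of_memory(). The call chain is:

__alloc_pages // called on memory allocation
	|-->__alloc_pages_nodemask
		|-->__alloc_pages_slowpath
			|-->__alloc_pages_may_oom	// the oom_killer_disabled flag is checked first; it defaults to 0, meaning the OOM killer is enabled
				|-->out_of_memory	// triggered

Linux manages all memory in pages, so every allocation request, however it is made, goes through the page allocator (alloc_page() and related functions). When the allocator cannot satisfy a request, it ends up calling out_of_memory(), which triggers the OOM mechanism.
The source of out_of_memory(), which the kernel calls once it detects that memory is exhausted, follows: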

/**
 * out_of_memory - kill the "best" process when we run out of memory
 * @oc: pointer to struct oom_control
 *
 * If we run out of memory, we have the choice between either
 * killing a random task (bad), letting the system crash (worse)
 * OR try to be smart about which process to kill. Note that we
 * don't have to be perfect here, we just have to be good.
 */
bool out_of_memory(struct oom_control *oc)
{
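	// amount of memory released by the OOM notifier chain
	unsigned long freed = 0;
	// constraint on the failed allocation (none, cpuset, memory policy, or memcg)
	enum oom_constraint constraint = CONSTRAINT_NONE;

	// nothing to do if the OOM killer has been disabled
	if (oom_killer_disabled)
		return false;

	// memcg is the memory cgroup controller; is_memcg_oom() is true when a cgroup limit caused the OOM
	if (!is_memcg_oom(oc)) {
		// run the blocking notifier chain so registered callbacks can release memory
		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
		// something was freed, so no kill is needed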
		if (freed > 0)
			/* Got some memory back in the last second. */
			return true;
	}

	/*
	 * If current has a pending SIGKILL or is exiting, then automatically
	 * select it.  The goal is to allow it to allocate so that it may
	 * quickly exit and free its memory.
	 */
	if (task_will_free_mem(current)) {
		// mark the current task as the OOM victim
		mark_oom_victim(current);
		// wake the OOM reaper (quite a dramatic name) so it can reclaim the victim's memory
		wake_oom_reaper(current);
		return true;
	}

	/*
	 * The OOM killer does not compensate for IO-less reclaim.
	 * pagefault_out_of_memory lost its gfp context so we have to
	 * make sure exclude 0 mask - all other users should have at least
	 * ___GFP_DIRECT_RECLAIM to get here.
	 */
	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
		return true;

	/*
	 * Check if there were limitations on the allocation (only relevant for
	 * NUMA and memcg) that may require different handling.
	 */
	constraint = constrained_alloc(oc);
	if (constraint != CONSTRAINT_MEMORY_POLICY)
		oc->nodemask = NULL;
	check_panic_on_oom(oc, constraint);

	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
		get_task_struct(current);
		oc->chosen = current;
		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
		return true;
	}

	select_bad_process(oc); // pick the "worst" process to kill
	/* Found nothing?!?! */
	if (!oc->chosen) {
		dump_header(oc, NULL);
		pr_warn("Out of memory and no killable processes...\n");
		/*
		 * If we got here due to an actual allocation at the
		 * system level, we cannot survive this and will enter
		 * an endless loop in the allocator. Bail out now.
		 */
		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
			panic("System is deadlocked on memory\n");
	}
	if (oc->chosen && oc->chosen != (void *)-1UL)
		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
				 "Memory cgroup out of memory");
	return !!oc->chosen;
}

Selecting the worst process

/*
 * Simple selection loop. We choose the process with the highest number of
 * 'points'. In case scan was aborted, oc->chosen is set to -1.
 */
static void select_bad_process(struct oom_control *oc)
{
	if (is_memcg_oom(oc))
		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
	else {
		struct task_struct *p;

		// take the RCU (Read-Copy-Update) read-side lock
		rcu_read_lock();
		// walk every process
		for_each_process(p)
			if (oom_evaluate_task(p, oc))
				break;
		rcu_read_unlock();
	}

	oc->chosen_points = oc->chosen_points * 1000 / oc->totalpages;
}
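
The last line of select_bad_process() rescales the winner's points to thousandths of oc->totalpages. The arithmetic can be checked with a tiny sketch; `oom_points_permille` is a hypothetical name, and the page counts below are made-up illustration values, not measurements from the crashed server.

```shell
# oom_points_permille POINTS TOTALPAGES: the rescaling done on the last line of
# select_bad_process(). Shell integer arithmetic truncates, like C's.
oom_points_permille() {
    echo $(( $1 * 1000 / $2 ))
}

# A task charged 262144 pages out of 1048576 total pages (illustrative numbers).
oom_points_permille 262144 1048576   # prints 250
```

A result of 250 means the chosen task accounts for a quarter of the pages counted in oc->totalpages.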

Killing the process

static void oom_kill_process(struct oom_control *oc, const char *message)
{
	struct task_struct *victim = oc->chosen;
	struct mem_cgroup *oom_group;
	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
					      DEFAULT_RATELIMIT_BURST);

	/*
	 * If the task is already exiting, don't alarm the sysadmin or kill
	 * its children or threads, just give it access to memory reserves
	 * so it can die quickly
	 */
	task_lock(victim);
	if (task_will_free_mem(victim)) {
		mark_oom_victim(victim);
		wake_oom_reaper(victim);
		task_unlock(victim);
		put_task_struct(victim);
		return;
	}
	task_unlock(victim);

	if (__ratelimit(&oom_rs))
		dump_header(oc, victim);

	/*
	 * Do we need to kill the entire memory cgroup?
	 * Or even one of the ancestor memory cgroups?
	 * Check this out before killing the victim task.
	 */
	oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);

	__oom_kill_process(victim, message);

	/*
	 * If necessary, kill all tasks in the selected memory cgroup.
	 */
	if (oom_group) {
		mem_cgroup_print_oom_group(oom_group);
		mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member,
				      (void*)message);
		mem_cgroup_put(oom_group);
	}
}

Checking disk, memory, and CPU; creating, copying, and deleting directories

Check disk usage: # df -h

Name	Meaning
Size	Total capacity
Used	Space used
Avail	Space available
Use%	Percentage used
Mounted on	Mount point
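
When checking many servers, the Use% column can be filtered automatically. The sketch below assumes the POSIX output format of df -P (usage percentage in column 5, mount point in column 6); `disks_over` is a hypothetical helper name.

```shell
# disks_over LIMIT: read `df -P` output on stdin and print "mountpoint use%"
# for every filesystem whose usage is at or above LIMIT percent.
# Hypothetical helper; mount points containing spaces are not handled.
disks_over() {
    awk -v limit="$1" '
        NR > 1 {
            use = $5
            sub(/%$/, "", use)          # strip the trailing percent sign
            if (use + 0 >= limit + 0) print $6, $5
        }
    '
}

# Typical use: list filesystems that are at least 90% full.
#   df -P | disks_over 90
```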

Check memory: # free -m, or free
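
Since the crashes here came from memory running out, it may help to reduce this check to a single number that a script can watch. The sketch below reads text in /proc/meminfo format and assumes the kernel provides the MemAvailable field (older kernels do not); `mem_available_pct` is a hypothetical helper name.

```shell
# mem_available_pct: read /proc/meminfo-style text on stdin and print the
# percentage of total memory that is still available (integer, truncated).
# Hypothetical helper; prints 0 if MemAvailable is missing.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { if (total > 0) printf "%d\n", avail * 100 / total }
    '
}

# Typical use:
#   mem_available_pct < /proc/meminfo
```

A low value on a machine running several JVMs is an early hint that the OOM killer may soon be invoked.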

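Check the number of CPUs: # cat /proc/cpuinfo

Look at every entry named processor in the output. If the highest one is processor : 3, the machine has 4 logical CPUs (numbering starts at 0).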
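
Counting the processor entries by eye gets tedious on larger machines, so the count can be left to grep. A minimal sketch, with `count_cpus` as a hypothetical helper whose optional argument only exists so it can be tried on a sample file:

```shell
# count_cpus [FILE]: count the "processor" entries in a cpuinfo-style file.
# Hypothetical helper. FILE defaults to /proc/cpuinfo.
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Typical use:
#   count_cpus
```

The nproc command from coreutils reports the number of processing units available to the current process and is a shorter alternative when it is installed.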

Create a bak directory under "/home":

# cd /home
# mkdir bak

Copy a directory and all files under it: # cp -r <source path> <destination path>

eg: back up (copy) all directories and files under "/soft/bak" into the "/home/bak" directory
# cp -r /soft/bak/* /home/bak
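
One caveat with the /* form is that the shell glob skips hidden files (names starting with a dot). The sketch below shows a variant that copies them as well by using the "source/." form; `copy_tree` is a hypothetical helper name.

```shell
# copy_tree SRC DST: copy everything under SRC, including hidden files, into DST.
# Hypothetical helper. DST is created if it does not exist yet.
copy_tree() {
    mkdir -p "$2" && cp -r "$1"/. "$2"/
}

# Typical use, matching the backup example in this section:
#   copy_tree /soft/bak /home/bak
```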

Delete a file or a directory and everything under it: # rm -rf <name>

eg: delete the "/soft/bak" directory together with its contents
# rm -rf /soft/bak

Checking Redis memory usage on Linux

Connect to the local Redis, then run auth and info at the redis-cli prompt:
# redis-cli -p <port>
> auth <password>
> info

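Name	Meaning
used_memory	Memory allocated by Redis for its data (bytes)
used_memory_human	The same value with a unit, easier to read
used_memory_rss	Memory occupied by the Redis process as seen by the operating system (bytes)
used_memory_peak	Peak memory usage (bytes)
used_memory_peak_human	Peak memory usage with a unit, easier to read
used_memory_lua	Memory used by the Lua engine (bytes)
mem_fragmentation_ratio	Memory fragmentation ratio (used_memory_rss divided by used_memory)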
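
The relationship between these fields can be checked by recomputing the ratio from the raw values. The sketch below parses the "key:value" lines that INFO prints; `redis_frag_ratio` is a hypothetical helper name, and the two-decimal rounding is a presentation choice rather than anything Redis prescribes.

```shell
# redis_frag_ratio: read Redis INFO output on stdin and print
# used_memory_rss / used_memory with two decimals.
# Hypothetical helper; prints nothing if used_memory is missing or zero.
redis_frag_ratio() {
    awk -F: '
        { sub(/\r$/, "") }                  # INFO lines end with CRLF
        $1 == "used_memory"     { used = $2 }
        $1 == "used_memory_rss" { rss = $2 }
        END { if (used > 0) printf "%.2f\n", rss / used }
    '
}

# Typical use (add authentication options as your setup requires):
#   redis-cli -p <port> info memory | redis_frag_ratio
```

A value well above 1 suggests fragmentation, while a value below 1 usually means part of Redis's memory has been swapped out.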