Linux Kernel Boot Analysis
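During Linux kernel boot, the start_kernel() function plays a central role. It performs much of the operating system's initialization, including interrupts, memory, the process scheduler, and trap gates. While start_kernel() is running, the system has only one process: the current one, process 0. Yet once booted, the system runs many processes and tasks concurrently. So at what point does it begin to support multiple processes?
The last statement of start_kernel() is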
rest_init();
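It is in this function that the kernel creates the thread that goes on to become the first user-mode process. Let us look at what rest_init() does: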
393 static noinline void __init_refok rest_init(void)
394 {
395     int pid;
396
397     rcu_scheduler_starting();
398     /*
399      * We need to spawn init first so that it obtains pid 1, however
400      * the init task will end up wanting to create kthreads, which, if
401      * we schedule it before we create kthreadd, will OOPS.
402      */
403     kernel_thread(kernel_init, NULL, CLONE_FS);
404     numa_default_policy();
405     pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
406     rcu_read_lock();
407     kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
408     rcu_read_unlock();
409     complete(&kthreadd_done);
410
411     /*
412      * The boot idle thread must execute schedule()
413      * at least once to get things moving:
414      */
415     init_idle_bootup_task(current);
416     schedule_preempt_disabled();
417     /* Call into cpu_idle with preempt disabled */
418     cpu_startup_entry(CPUHP_ONLINE);
419 }
Line 403,
kernel_thread(kernel_init, NULL, CLONE_FS);
creates the kernel thread that later becomes the first user-mode process. Let us look at what kernel_thread() does:
pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
    return do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
        (unsigned long)arg, NULL, NULL);
}
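As the code shows, kernel_thread() creates a new process through do_fork(), and the function pointer fn determines where that process starts executing. In the call above, fn is kernel_init. So what does kernel_init() do?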
930 static int __ref kernel_init(void *unused)
931 {
932     int ret;
933
934     kernel_init_freeable();
935     /* need to finish all async __init code before freeing the memory */
936     async_synchronize_full();
937     free_initmem();
938     mark_rodata_ro();
939     system_state = SYSTEM_RUNNING;
940     numa_default_policy();
941
942     flush_delayed_fput();
943
944     if (ramdisk_execute_command) {
945         ret = run_init_process(ramdisk_execute_command);
946         if (!ret)
947             return 0;
948         pr_err("Failed to execute %s (error %d)\n",
949                ramdisk_execute_command, ret);
950     }
951
952     /*
953      * We try each of these until one succeeds.
954      *
955      * The Bourne shell can be used instead of init if we are
956      * trying to recover a really broken machine.
957      */
958     if (execute_command) {
959         ret = run_init_process(execute_command);
960         if (!ret)
961             return 0;
962         pr_err("Failed to execute %s (error %d). Attempting defaults...\n",
963                execute_command, ret);
964     }
965     if (!try_to_run_init_process("/sbin/init") ||
966         !try_to_run_init_process("/etc/init") ||
967         !try_to_run_init_process("/bin/init") ||
968         !try_to_run_init_process("/bin/sh"))
969         return 0;
970
971     panic("No working init found. Try passing init= option to kernel. "
972           "See Linux Documentation/init.txt for guidance.");
973 }
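Line 945 calls run_init_process(ramdisk_execute_command). The argument is the path of the first user-mode program to execute, and run_init_process() is responsible for running that program. If this attempt fails, or if ramdisk_execute_command is not set, kernel_init() next tries execute_command when it is set, and then falls back to /sbin/init, /etc/init, /bin/init, and /bin/sh, in that order. If none of them can be started, the kernel panics.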
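The fallback order described above can be sketched in ordinary user-space C. This is only an illustration of the control flow, not kernel code: pick_init(), run_fn, only_sh_exists() and nothing_exists() are hypothetical names, and the callback merely stands in for run_init_process(), which in the kernel actually executes the program.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for run_init_process(): 0 on success,
 * a negative value on failure. */
typedef int (*run_fn)(const char *path);

/* Try the candidates in the same order as kernel_init():
 * ramdisk_execute_command, then execute_command, then the defaults.
 * Returns the path that started, or NULL if none did
 * (the point at which the kernel would panic). */
static const char *pick_init(const char *ramdisk_cmd,
                             const char *init_cmd, run_fn run)
{
    static const char *const defaults[] = {
        "/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
    };
    size_t i;

    assert(run != NULL);

    if (ramdisk_cmd && run(ramdisk_cmd) == 0)
        return ramdisk_cmd;
    if (init_cmd && run(init_cmd) == 0)
        return init_cmd;
    for (i = 0; i < sizeof(defaults) / sizeof(defaults[0]); i++)
        if (run(defaults[i]) == 0)
            return defaults[i];
    return NULL;
}

/* Fake runner for experiments: pretend only /bin/sh can be started. */
static int only_sh_exists(const char *path)
{
    return strcmp(path, "/bin/sh") == 0 ? 0 : -1;
}

/* Fake runner for experiments: pretend nothing can be started. */
static int nothing_exists(const char *path)
{
    (void)path;
    return -1;
}
```

Calling pick_init("/init", NULL, only_sh_exists) walks the whole chain and ends at "/bin/sh", while pick_init(NULL, NULL, nothing_exists) returns NULL, which corresponds to the "No working init found" panic.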
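Meanwhile, process 0 does not wait for the user-mode program to start: kernel_thread() returns as soon as the new thread has been created, and process 0 carries on in rest_init(). It then creates a second thread, which runs the kthreadd function. Once that thread exists, rest_init() continues until it reaches cpu_startup_entry(CPUHP_ONLINE). The key code in cpu_startup_entry() is: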
arch_cpu_idle_prepare();
cpu_idle_loop();
So process 0 ends up in cpu_idle_loop(). cpu_idle_loop() is a while (1) { ... } loop that never returns, so process 0 never exits.
static void cpu_idle_loop(void)
{
    while (1) {
        /*
         * If the arch has a polling bit, we maintain an invariant:
         *
         * Our polling bit is clear if we're not scheduled (i.e. if
         * rq->curr != rq->idle). This means that, if rq->idle has
         * the polling bit set, then setting need_resched is
         * guaranteed to cause the cpu to reschedule.
         */

        __current_set_polling();
        tick_nohz_idle_enter();

        while (!need_resched()) {
            check_pgt_cache();
            rmb();

            if (cpu_is_offline(smp_processor_id()))
                arch_cpu_idle_dead();

            local_irq_disable();
            arch_cpu_idle_enter();

            /*
             * In poll mode we reenable interrupts and spin.
             *
             * Also if we detected in the wakeup from idle
             * path that the tick broadcast device expired
             * for us, we don't want to go deep idle as we
             * know that the IPI is going to arrive right
             * away
             */
            if (cpu_idle_force_poll || tick_check_broadcast_expired())
                cpu_idle_poll();
            else
                cpuidle_idle_call();

            arch_cpu_idle_exit();
        }

        /*
         * Since we fell out of the loop above, we know
         * TIF_NEED_RESCHED must be set, propagate it into
         * PREEMPT_NEED_RESCHED.
         *
         * This is required because for polling idle loops we will
         * not have had an IPI to fold the state for us.
         */
        preempt_set_need_resched();
        tick_nohz_idle_exit();
        __current_clr_polling();

        /*
         * We promise to call sched_ttwu_pending and reschedule
         * if need_resched is set while polling is set. That
         * means that clearing polling needs to be visible
         * before doing these things.
         */
        smp_mb__after_atomic();

        sched_ttwu_pending();
        schedule_preempt_disabled();
    }
}
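The job of this loop is simple. While no other process needs the CPU (need_resched() is false), process 0 stays in the inner loop and idles. As soon as another process needs to run, it leaves the inner loop and calls schedule_preempt_disabled() to switch to that process. Process 0 is therefore the idle process, which runs only when there is nothing else to run.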
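The two-level structure of the idle loop (an inner wait until rescheduling is needed, then a hand-off to the scheduler) can be mimicked with a toy user-space model. Everything here is an assumption made for illustration: struct toy_rq, toy_need_resched() and toy_idle_loop() are hypothetical, the outer loop is bounded so the function returns, and a "wakeup" is simulated by a counter rather than by an interrupt.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy run queue (hypothetical): just counters. */
struct toy_rq {
    int runnable;    /* tasks waiting for the CPU */
    int idle_polls;  /* iterations the idle task spent with nothing to do */
    int switches;    /* times the idle task handed the CPU to another task */
};

/* Plays the role of need_resched(): is anyone else waiting to run? */
static bool toy_need_resched(const struct toy_rq *rq)
{
    return rq->runnable > 0;
}

/* Bounded imitation of the idle loop. The real loop is while (1);
 * here it runs `rounds` times. After `polls_until_wakeup` idle polls
 * in a round, a simulated interrupt makes one task runnable. */
static void toy_idle_loop(struct toy_rq *rq, int rounds,
                          int polls_until_wakeup)
{
    assert(polls_until_wakeup > 0);

    for (int r = 0; r < rounds; r++) {
        int polls = 0;

        /* Inner loop: idle until some task needs the CPU. */
        while (!toy_need_resched(rq)) {
            rq->idle_polls++;
            if (++polls == polls_until_wakeup)
                rq->runnable++;  /* simulated wakeup */
        }

        /* Stand-in for schedule_preempt_disabled(): the woken task
         * runs (assumed to finish), then idle gets the CPU back. */
        rq->runnable--;
        rq->switches++;
    }
}
```

With an empty queue, toy_idle_loop(&rq, 3, 2) spends two polls idling in each of the three rounds and gives up the CPU three times. If a task is already runnable on entry, the inner loop is skipped entirely, which matches the behaviour described above.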
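At this point, system boot is complete.
Summary of the Linux Boot Process
Boot begins in start_kernel(), which performs the various initialization steps and finally calls rest_init(). There, kernel_thread() creates a new process that runs kernel_init(). This is process 1, and kernel_init() turns it into the first user-mode process by executing the init program. Process 0 continues through rest_init() until it enters cpu_idle_loop(), after which it runs only when the CPU is idle.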
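The ordering in this summary can be pictured with a small user-space model. It is a sketch under simplifying assumptions, not how the kernel allocates PIDs: the toy_* names are hypothetical, and PIDs are simply handed out in creation order, which is enough to show why kernel_init gets PID 1 (as the comment in rest_init() requires) and kthreadd gets the next one.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_TASKS 8

/* Same shape as the fn parameter of kernel_thread(). */
typedef int (*toy_entry_fn)(void *arg);

struct toy_task {
    int pid;
    toy_entry_fn fn;  /* where the task starts executing */
    void *arg;
};

struct toy_system {
    struct toy_task tasks[TOY_MAX_TASKS];
    int count;
};

/* The boot task (process 0) exists before any thread is created. */
static void toy_boot(struct toy_system *sys)
{
    sys->tasks[0].pid = 0;
    sys->tasks[0].fn = NULL;
    sys->tasks[0].arg = NULL;
    sys->count = 1;
}

/* Loosely modelled on kernel_thread(): record a new task whose entry
 * point is fn and return its pid, or -1 if the table is full. */
static int toy_kernel_thread(struct toy_system *sys, toy_entry_fn fn,
                             void *arg)
{
    int pid;

    assert(fn != NULL);
    if (sys->count >= TOY_MAX_TASKS)
        return -1;
    pid = sys->count++;
    sys->tasks[pid].pid = pid;
    sys->tasks[pid].fn = fn;
    sys->tasks[pid].arg = arg;
    return pid;
}

/* Placeholder entry points; the real ones do the actual work. */
static int toy_kernel_init(void *unused) { (void)unused; return 0; }
static int toy_kthreadd(void *unused) { (void)unused; return 0; }

/* Mirrors the order of the two kernel_thread() calls in rest_init(). */
static void toy_rest_init(struct toy_system *sys, int *init_pid,
                          int *kthreadd_pid)
{
    *init_pid = toy_kernel_thread(sys, toy_kernel_init, NULL);
    *kthreadd_pid = toy_kernel_thread(sys, toy_kthreadd, NULL);
}
```

After toy_boot() and toy_rest_init(), the table holds three tasks: the boot task with PID 0, the init task with PID 1, and kthreadd with PID 2.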