From: bri...@wintelcom.net (Alfred Perlstein)
Subject: SMP infoletter #1
Date: 1999/10/27
Message-ID: <7v6bqp$1s94$1@FreeBSD.csie.NCTU.edu.tw>
X-Deja-AN: 541114716
X-Trace: FreeBSD.csie.NCTU.edu.tw 941011609 61733 140.113.235.250 (27 Oct 1999 08:06:49 GMT)
Organization: NCTU CSIE FreeBSD Server
NNTP-Posting-Date: 27 Oct 1999 08:06:49 GMT
Newsgroups: mailing.freebsd.smp
X-Complaints-To: usenet@FreeBSD.csie.NCTU.edu.tw

Infoletter #1

This is the start of what I hope will be several informative documents
describing the current and ongoing state of SMP in FreeBSD. The purpose
is to avoid duplicate research into the current state of FreeBSD's SMP
behavior by those who haven't been following FreeBSD-SMP since 'day
one'. It also points out some areas that are still unclear to me.

This document was written on Tue Oct 26 1999 referencing the HEAD
branch of the code; things may have changed significantly since.

I also hope that this series helps to shed some light on the low-level
routines in the kernel such as trap and interrupt handling, ASTs and
scheduling. Where possible, direct pointers are given to source code to
reduce the amount of digging one must do to locate routines of
interest.

It is also important to note that this document is the result of the
author's investigation into the code, and much-appreciated help from
various members of the FreeBSD development team (Poul-Henning Kamp
(phk), Alan Cox (alc), Matt Dillon (dillon)) and Terry Lambert. As I am
not the writer of the code, there may be missing or incorrect
information contained in this document. Please email any corrections or
comments to s...@freebsd.org and please make sure I get a CC.
(alf...@freebsd.org)

------------------------------------------------------------

The Big Giant Lock: (src/sys/i386/i386/mplock.s)

The current state of SMP in FreeBSD is by means of the Big Giant Lock
(BGL).
The BGL is an exclusive counting semaphore: the lock may be recursively
acquired by a single CPU, and from that point on other CPUs will spin
while waiting to acquire it. The i386 implementation is contained in
the file src/sys/i386/i386/mplock.s.

The function 'void MPgetlock(unsigned int *lock)' acquires the BGL. An
important side effect of MPgetlock is that it routes all interrupts to
the processor that has acquired the lock. This is done so that if an
interrupt occurs, the handler doesn't need to spin waiting for the BGL.
The code responsible for routing the interrupts is the GRAB_HWI macro
within the MPgetlock code, which fiddles the local APIC's interrupt
priority level. Other MPlock functions exist in mplock.s to initialize,
test and release the lock.

---

Usage of the BGL: (src/sys/i386/i386/mplock.s)

The BGL is pushed down (acquired) on all entry into the kernel, by
means of syscall, trap or interrupt. The file
src/sys/i386/i386/exception.s contains all the initial entry points for
syscalls, traps and interrupts.

Syscalls and 'altsyscalls' acquire the lock through the macros
SYSCALL_LOCK and ALTSYSCALL_LOCK, which map to the assembler functions
_get_syscall_lock and _get_altsyscall_lock on SMP machines (if SMP is
not defined they are not called). _get_syscall_lock and
_get_altsyscall_lock are also present in src/sys/i386/i386/mplock.s;
they save the contents of the local APIC's interrupt priority and call
MPgetlock.

It would seem that the syscall lock could simply be delayed until entry
to the actual system call (write/read/...); however, several issues
arise:

1) fault on copyin of user's syscall arguments

   This is actually a non-issue: if a fault occurs, the processor will
   spin to acquire the MPlock before potentially recursing into the
   non-re-entrant VM system. Although this leaves the processor in a
   faulted state for quite some time, it is no different from when CPU
   1 has the lock and a process running on CPU 2 page faults.
   Problem #1 takes care of itself because of the recursive MPlock.

2) ktrace hooks (src/sys/kern/kern_ktrace.c)

   The ktrace hooks in the syscalls manipulate kernel resources that
   are not MP safe. ktrace touches many parts of the kernel that need
   work to become MP safe; a temporary solution would be to acquire the
   BGL when entering the ktrace code.

3) STOPEVENT, aka void stopevent(struct proc *, unsigned int,
   unsigned int); (src/sys/kern/sys_process.c)

   stopevent will be called if the process is marked to sleep via
   procfs. Stopping the process requires entry into the scheduler,
   which is not MP safe. Again, a temporary hack would be to
   conditionally acquire the MPlock if the condition exists.

---

SPL issues: (src/sys/i386/isa/ipl_funcs.c)

There exists an inherent race condition with the spl() system in an MP
environment. Consider (the system is at splbio):

    process A                       process B

    int s;                          int s;
    s = splhigh();
    /* spl raised to high; however,
       saved spl 's' has the old
       value of splbio */
                                    s = splhigh();
                                    /* spl still high */
    splx(s);
    /* processor spl now at bio
       even though B still needs
       splhigh */
                                    splx(s);

Process B may be interrupted in a critical section.

Also note that the asymmetric nature of the spl system makes it very
difficult to pinpoint locations in the bottom half of the kernel (the
part that services interrupts) that may collide with the top half (user
process context).

A short-sighted solution would be to enforce spl as an MPlock, an
exclusive counting semaphore; however, since no locking protocol or
ordering of spl pushdown is required, deadlock becomes a major problem.

The only solution that may work with spl is adding the pushdown of the
BGL when first asserting any level of spl and releasing the MPlock when
spl0 is reached.

It may also be interesting to see what a separate lock based only on
spl would accomplish; moving to a model where the spl entry points
become our new BGL might also be something to investigate.
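As a rough sketch of the BGL-on-first-spl idea: acquire the big lock
when the level first rises above zero, release it when spl0 is reached.
The function and variable names below are illustrative, not FreeBSD's;
a real kernel would keep spl_depth in per-CPU storage and call the
actual MPgetlock/MPrellock.

```c
#include <assert.h>

/* Hypothetical sketch: tie BGL ownership to spl nesting depth. */
static int spl_depth;           /* per-CPU in a real kernel */
static int bgl_held;            /* stand-in for the real MPlock state */

static void bgl_acquire(void) { bgl_held = 1; }  /* would be MPgetlock */
static void bgl_release(void) { bgl_held = 0; }  /* would be MPrellock */

/* Any spl raise: the first one pushes down the BGL. */
static int splraise(void)
{
    if (spl_depth++ == 0)
        bgl_acquire();
    return spl_depth - 1;       /* caller's saved level, as with splx() */
}

/* Restore a saved level; reaching spl0 drops the BGL. */
static void spllower(int s)
{
    spl_depth = s;
    if (spl_depth == 0)
        bgl_release();
}
```

This makes every spl-protected region mutually exclusive across CPUs,
which is exactly why it can only be a stopgap: it serializes most of
the kernel.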
Since spl is used only for short-time mutual exclusion, it may actually
work nicely as a coarse-grained locking system for the time being.

---

Simple locks: (src/sys/i386/i386/simplelock.s)

Cursory research into the CVS logs reveals, on the file
kern/vfs_syscalls.c:

    1.28 Thu Jul 13 8:47:42 1995 UTC by davidg
    Diffs to 1.27

    NOTE: libkvm, w, ps, 'top', and any other utility which depends
    on struct proc or any VM system structure will have to be
    rebuilt!!!

    Much needed overhaul of the VM system. Included in this first
    round of changes:

    ...

    4) simple_lock's removed. Discussion with several people reveals
    that the SMP locking primitives used in the VM system aren't
    likely the mechanism that we'll be adopting. Even if it were, the
    locking that was in the code was very inadequate and would have to
    be mostly re-done anyway. The locking in a uni-processor kernel
    was a no-op but went a long way toward making the code difficult
    to read and debug.

However, with the Lite/2 merge they were re-introduced, and the kernel
is littered with them; the ones in place seem somewhat adequate for
short-term exclusion. Essentially, they are spinlocks.

What's interesting is that the simplelocks seem to provide for MP
synchronization with lockmgr locks; however, the code is littered with
calls to unsafe functions such as MALLOC. It looks like someone decided
to do the hard stuff first.

Why are the simplelocks necessary if the kernel is still guarded by the
BGL? (besides use in the lockmgr)

---

Scheduler:

The scheduler in cpu_switch() (src/sys/i386/i386/swtch.s) saves the
current nesting level of the process's MPlock (after masking off the
CPUid bits from it) into the PCB (process control block) (lines
317-324) before attempting to switch to another process, where it
restores the next process's nesting level (lines 453-455).
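A rough C rendering of this machinery may help. Assume (illustratively;
the exact layout and names live in mplock.s and swtch.s, and the real
code does all of this atomically in assembler) that the lock word keeps
the owning CPU id in the high bits and the recursion count in the low
bits; cpu_switch() then saves only the count into the PCB and rebuilds
the word for the new owner on the way back in.

```c
#include <assert.h>

/* Illustrative layout: cpuid in bits 24+, recursion count below. */
#define CPU_SHIFT  24
#define COUNT_MASK ((1u << CPU_SHIFT) - 1)
#define FREE_LOCK  0xffffffffu          /* hypothetical "unowned" value */

struct pcb { unsigned int pcb_mpnest; };    /* illustrative field name */

static unsigned int mp_lock = FREE_LOCK;

/* Non-atomic sketch of MPgetlock's logic (the real code uses a locked
 * cmpxchg and spins, rerouting interrupts to the lock holder). */
static int try_getlock(unsigned int cpuid)
{
    if (mp_lock == FREE_LOCK) {
        mp_lock = (cpuid << CPU_SHIFT) | 1;   /* first acquisition */
        return 1;
    }
    if ((mp_lock >> CPU_SHIFT) == cpuid) {
        mp_lock++;                            /* recursive acquisition */
        return 1;
    }
    return 0;   /* held by another CPU: the real code spins here */
}

/* Switch-out: save only the recursion count, not the CPU id. */
static void save_mplock(struct pcb *p)
{
    p->pcb_mpnest = mp_lock & COUNT_MASK;
}

/* Switch-in: rebuild the lock word for the CPU running the process. */
static void restore_mplock(const struct pcb *p, unsigned int cpuid)
{
    mp_lock = (cpuid << CPU_SHIFT) | p->pcb_mpnest;
}
```

Masking off the CPUid bits on save is what lets a process resume on a
different processor with its nesting level intact.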
---
-Alfred Perlstein - [bri...@rush.net|alf...@freebsd.org]
Wintelcom systems administrator and programmer - http://www.wintelcom.net/
[bri...@wintelcom.net]

To Unsubscribe: send mail to majord...@FreeBSD.org with
"unsubscribe freebsd-smp" in the body of the message
From: lu...@watermarkgroup.com (Luoqi Chen)
Subject: Re: SMP infoletter #1
Date: 1999/10/27
Message-ID: <7v767e$2p0t$1@FreeBSD.csie.NCTU.edu.tw>
X-Deja-AN: 541236506
X-Trace: FreeBSD.csie.NCTU.edu.tw 941038638 91167 140.113.235.250 (27 Oct 1999 15:37:18 GMT)
Organization: NCTU CSIE FreeBSD Server
NNTP-Posting-Date: 27 Oct 1999 15:37:18 GMT
Newsgroups: mailing.freebsd.smp
X-Complaints-To: usenet@FreeBSD.csie.NCTU.edu.tw

I would like to offer some comments here.

>
> Usage of the BGL: (src/sys/i386/i386/mplock.s)
>
> The BGL is pushed down (acquired) on all entry into the kernel, by
             ^^^^^^^^^^^
My understanding is that 'push down' is a completely different term.

> means of syscall, trap or interrupt.
...
> It would seem that the syscall lock could simply be delayed until
> entry to the actual system call (write/read/...) however several
> issues arise:
>
> 1) fault on copyin of user's syscall arguments
...
>
> Problem #1 takes care of itself because of the recursive MPlock.
>
> 2) ktrace hooks
...
>
> 3) STOPEVENT aka void stopevent(struct proc*, unsigned int, unsigned int);
...
You missed one very important part of the code path, which is also the
most difficult one to deal with to make the path MP safe: userret().
It involves scheduling (relatively easier) and signal delivery
(difficult).

>
> ---
>
> SPL issues: (src/sys/i386/isa/ipl_funcs.c)
>
> There exists an inherent race condition with the spl() system in
> an MP environment, consider:
>
> system is at splbio:
>
>     process A                       process B
>
>     int s;                          int s;
>     s = splhigh();
>     /* spl raised to high however,
>        saved spl 's' has old value
>        of splbio */
>                                     s = splhigh();
>                                     /* spl still high */
>     splx(s);
>     /* processor spl now at bio
>        even though B still needs
>        splhigh */
>                                     splx(s);
>
> Process B may be interrupted in a critical section.
>
> Also note that the asymmetric nature of the spl system makes it
> very difficult to pinpoint locations in the bottom half of the
> kernel (the part that services interrupts) that may collide with
> the top half (user process context).
>
> A short-sighted solution would be to enforce spl as an MPlock, an
> exclusive counting semaphore; however, since no locking protocol
> or ordering of spl pushdown is required, deadlock becomes a major
> problem.
>
> The only solution that may work with spl is adding the pushdown
> of the BGL when first asserting any level of spl and releasing
> the MPlock when spl0 is reached.
>
> It may also be interesting to see what a separate lock based only
> on spl would accomplish; moving to a model where the spl entry
> points become our new BGL might also be something to investigate.
>
I actually have a working implementation of this (I'm willing to
provide the patch if anyone is interested in trying it), but I
believe it leads to a dead end. We should use some kind of
interrupt-level-aware mutex instead, something like this:

    s = splimp();
    simple_lock(&mbuf_lock);
    ...
    simple_unlock(&mbuf_lock);
    splx(s);

If everyone agrees on this direction, there is an immediate benefit
we can reap by moving cpl to per-CPU storage and getting rid of
cpl_lock, which might reduce system time by a significant amount
(5~10%, unscientifically measured).

> Since spl is used only for short-time mutual exclusion it may
> actually work nicely as a coarse-grained locking system for the
> time being.
>
The reason I believe this leads us nowhere is that it is a hack, and
it could only marginally improve performance, since most of the
kernel runs under some spl protection; e.g., it's impossible to move
the TCP stack outside the BGL under this scheme.

> Why are the simplelocks necessary if the kernel is still guarded
> by the BGL?
> (besides use in the lockmgr)
>
Under the BGL it's not even necessary in the lockmgr; in fact, the
only useful simplelocks are the fast interrupt lock and the lock on
the I/O APIC register window (fast interrupt handlers are not under
BGL protection; IIRC, the only instance is sio). But the BGL is to
go, and hence simplelock is here to stay.

One interesting thing NetBSD has done is a read/write spinlock; it
could be used to protect lists like allproc. I think it would be
nice to have in our system too.

> ---
>
> Scheduler:
>
> The scheduler in cpu_switch() (src/sys/i386/i386/swtch.s) saves the
> current nesting level of the process's MPlock (after masking off
> the CPUid bits from it) into the PCB (process control block) (lines
> 317-324) before attempting to switch to another process where it
> restores the next process's nesting level (lines 453-455).
>
One thing that is relatively easy to do in this area is to allow a
processor spinning on the BGL to pick up another user process
instead. I'm currently looking into this myself; one thing I would
like to do is to move the nesting-level field from the U area to
struct proc, so that we could easily tell whether a process was
(involuntarily) context switched in user mode and is a candidate to
schedule on a non-lock-holding processor.

-lq
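The read/write spinlock mentioned above could be sketched with C11
atomics roughly as follows. This is a hypothetical illustration: the
names and semantics are mine, not NetBSD's actual primitive. Readers
share the lock; a writer excludes everyone.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical read/write spinlock: state > 0 counts readers,
 * state == 0 means free, state == -1 means a writer holds it. */
typedef struct {
    atomic_int state;
} rwspin_t;

static void rw_init(rwspin_t *l) { atomic_init(&l->state, 0); }

/* Readers may enter unless a writer holds the lock. */
static int rw_tryrlock(rwspin_t *l)
{
    int s = atomic_load(&l->state);
    return s >= 0 && atomic_compare_exchange_strong(&l->state, &s, s + 1);
}

/* A writer needs the lock completely free. */
static int rw_trywlock(rwspin_t *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->state, &expected, -1);
}

static void rw_runlock(rwspin_t *l) { atomic_fetch_sub(&l->state, 1); }
static void rw_wunlock(rwspin_t *l) { atomic_store(&l->state, 0); }
```

Spinning variants would simply loop on the try functions; the reader
count is what lets concurrent scans of a list like allproc proceed in
parallel while still giving a writer exclusive access.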