From: bri...@wintelcom.net (Alfred Perlstein)
Subject: SMP infoletter #1
Date: 1999/10/27
Message-ID: <7v6bqp$1s94$1@FreeBSD.csie.NCTU.edu.tw>
X-Deja-AN: 541114716
X-Trace: FreeBSD.csie.NCTU.edu.tw 941011609 61733 140.113.235.250 (27 Oct 1999 08:06:49 GMT)
Organization: NCTU CSIE FreeBSD Server
NNTP-Posting-Date: 27 Oct 1999 08:06:49 GMT
Newsgroups: mailing.freebsd.smp
X-Complaints-To: usenet@FreeBSD.csie.NCTU.edu.tw

Infoletter #1

This is the start of what I hope will be several informative documents
describing the current and ongoing state of SMP in FreeBSD. The purpose
is to avoid duplicate research into the current state of FreeBSD's SMP
behavior by those who haven't been following FreeBSD-SMP since 'day
one'. It also points out some areas that are still unclear to me.

This document was written on Tue Oct 26 1999 referencing the HEAD
branch of the code; things may have changed significantly since.

I also hope that this series helps to shed some light on the low-level
routines in the kernel such as trap and interrupt handling, ASTs and
scheduling. Where possible, direct pointers are given to source code to
reduce the amount of digging one must do to locate routines of
interest.

It is also important to note that this document is the result of the
author's investigation into the code, and much-appreciated help from
various members of the FreeBSD development team (Poul-Henning Kamp
(phk), Alan Cox (alc), Matt Dillon (dillon)) and Terry Lambert. As I am
not the writer of the code, there may be missing or incorrect
information contained in this document. Please email any corrections or
comments to s...@freebsd.org and please make sure I get a CC.
(alf...@freebsd.org)

------------------------------------------------------------

The Big Giant Lock: (src/sys/i386/i386/mplock.s)

The current state of SMP in FreeBSD is by means of the Big Giant Lock
(BGL).
The BGL is an exclusive counting semaphore: the lock may be recursively
acquired by a single CPU, and from that point on other CPUs will spin
while waiting to acquire it. The i386 implementation is contained in
the file src/sys/i386/i386/mplock.s.

The function 'void MPgetlock(unsigned int *lock)' acquires the BGL. An
important side effect of MPgetlock is that it routes all interrupts to
the processor that has acquired the lock. This is done so that if an
interrupt occurs, the handler doesn't need to spin waiting for the BGL.
The code responsible for routing the interrupts is the GRAB_HWI macro
within the MPgetlock code, which fiddles the local APIC's interrupt
priority level. Other MPlock functions exist in mplock.s to initialize,
test and release the lock.

---

Usage of the BGL: (src/sys/i386/i386/mplock.s)

The BGL is pushed down (acquired) on all entry into the kernel, by
means of syscall, trap or interrupt. The file
src/sys/i386/i386/exception.s contains all the initial entry points for
syscalls, traps and interrupts.

Syscalls and 'altsyscalls' acquire the lock through the macros
SYSCALL_LOCK and ALTSYSCALL_LOCK, which map to the assembler functions
_get_syscall_lock and _get_altsyscall_lock on SMP machines (if SMP is
not defined they are not called). _get_syscall_lock and
_get_altsyscall_lock are also present in src/sys/i386/i386/mplock.s;
they save the contents of the local APIC's interrupt priority and call
MPgetlock.

It would seem that the syscall lock could simply be delayed until entry
to the actual system call (write/read/...); however, several issues
arise:

1) fault on copyin of user's syscall arguments

   This is actually a non-issue: if a fault occurs, the processor will
   spin to acquire the MPlock before potentially recursing into the
   non-re-entrant VM system. Although this leaves the processor in a
   faulted state for quite some time, it is no different from when CPU
   1 has the lock and a process running on CPU 2 page faults.
   Problem #1 takes care of itself because of the recursive MPlock.

2) ktrace hooks (src/sys/kern/kern_ktrace.c)

   The ktrace hooks in the syscalls manipulate kernel resources that
   are not MP safe. ktrace touches many parts of the kernel that need
   work to become MP safe; a temporary solution would be to acquire the
   BGL when entering the ktrace code.

3) STOPEVENT, aka void stopevent(struct proc *, unsigned int,
   unsigned int); (src/sys/kern/sys_process.c)

   stopevent will be called if the process is marked to sleep via
   procfs. Stopping the process requires entry into the scheduler,
   which is not MP safe. Again, a temporary hack would be to
   conditionally acquire the MPlock if the condition exists.

---

SPL issues: (src/sys/i386/isa/ipl_funcs.c)

There exists an inherent race condition with the spl() system in an MP
environment. Consider (the system is at splbio):

    process A                       process B

    int s;                          int s;
    s = splhigh();
    /* spl raised to high; however,
       saved spl 's' has the old
       value of splbio */
                                    s = splhigh();
                                    /* spl still high */
    splx(s);
    /* processor spl now at bio
       even though B still needs
       splhigh */
                                    splx(s);

Process B may be interrupted in a critical section.

Also note that the asymmetric nature of the spl system makes it very
difficult to pinpoint locations in the bottom half of the kernel (the
part that services interrupts) that may collide with the top half (user
process context).

A short-sighted solution would be to enforce spl as an MPlock, an
exclusive counting semaphore; however, since no locking protocol or
ordering of spl pushdown is required, deadlock becomes a major problem.

The only solution that may work with spl is adding the pushdown of the
BGL when first asserting any level of spl and releasing the MPlock when
spl0 is reached.

It may also be interesting to see what a separate lock based only on
spl would accomplish; moving to a model where the spl entry points
become our new BGL might also be something to investigate.
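As a rough sketch of the BGL-on-first-spl idea: acquire the big lock
when the level first rises above zero, release it when spl0 is reached.
The function and variable names below are illustrative, not FreeBSD's;
a real kernel would keep spl_depth in per-CPU storage and call the
actual MPgetlock/MPrellock.

```c
#include <assert.h>

/* Hypothetical sketch: tie BGL ownership to spl nesting depth. */
static int spl_depth;           /* per-CPU in a real kernel */
static int bgl_held;            /* stand-in for the real MPlock state */

static void bgl_acquire(void) { bgl_held = 1; }  /* would be MPgetlock */
static void bgl_release(void) { bgl_held = 0; }  /* would be MPrellock */

/* Any spl raise: the first one pushes down the BGL. */
static int splraise(void)
{
    if (spl_depth++ == 0)
        bgl_acquire();
    return spl_depth - 1;       /* caller's saved level, as with splx() */
}

/* Restore a saved level; reaching spl0 drops the BGL. */
static void spllower(int s)
{
    spl_depth = s;
    if (spl_depth == 0)
        bgl_release();
}
```

This makes every spl-protected region mutually exclusive across CPUs,
which is exactly why it can only be a stopgap: it serializes most of
the kernel.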
Since spl is used only for short-time mutual exclusion, it may actually
work nicely as a coarse-grained locking system for the time being.

---

Simple locks: (src/sys/i386/i386/simplelock.s)

Cursory research into the CVS logs reveals, on the file
kern/vfs_syscalls.c:

    1.28 Thu Jul 13 8:47:42 1995 UTC by davidg
    Diffs to 1.27

    NOTE: libkvm, w, ps, 'top', and any other utility which depends
    on struct proc or any VM system structure will have to be
    rebuilt!!!

    Much needed overhaul of the VM system. Included in this first
    round of changes:

    ...

    4) simple_lock's removed. Discussion with several people reveals
    that the SMP locking primitives used in the VM system aren't
    likely the mechanism that we'll be adopting. Even if it were, the
    locking that was in the code was very inadequate and would have to
    be mostly re-done anyway. The locking in a uni-processor kernel
    was a no-op but went a long way toward making the code difficult
    to read and debug.

However, with the Lite/2 merge they were re-introduced, and the kernel
is littered with them; the ones in place seem somewhat adequate for
short-term exclusion. Essentially, they are spinlocks.

What's interesting is that the simplelocks seem to provide for MP
synchronization with lockmgr locks; however, the code is littered with
calls to unsafe functions such as MALLOC. It looks like someone decided
to do the hard stuff first.

Why are the simplelocks necessary if the kernel is still guarded by the
BGL? (besides use in the lockmgr)

---

Scheduler:

The scheduler in cpu_switch() (src/sys/i386/i386/swtch.s) saves the
current nesting level of the process's MPlock (after masking off the
CPUid bits from it) into the PCB (process control block) (lines
317-324) before attempting to switch to another process, where it
restores the next process's nesting level (lines 453-455).
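A rough C rendering of this machinery may help. Assume (illustratively;
the exact layout and names live in mplock.s and swtch.s, and the real
code does all of this atomically in assembler) that the lock word keeps
the owning CPU id in the high bits and the recursion count in the low
bits; cpu_switch() then saves only the count into the PCB and rebuilds
the word for the new owner on the way back in.

```c
#include <assert.h>

/* Illustrative layout: cpuid in bits 24+, recursion count below. */
#define CPU_SHIFT  24
#define COUNT_MASK ((1u << CPU_SHIFT) - 1)
#define FREE_LOCK  0xffffffffu          /* hypothetical "unowned" value */

struct pcb { unsigned int pcb_mpnest; };    /* illustrative field name */

static unsigned int mp_lock = FREE_LOCK;

/* Non-atomic sketch of MPgetlock's logic (the real code uses a locked
 * cmpxchg and spins, rerouting interrupts to the lock holder). */
static int try_getlock(unsigned int cpuid)
{
    if (mp_lock == FREE_LOCK) {
        mp_lock = (cpuid << CPU_SHIFT) | 1;   /* first acquisition */
        return 1;
    }
    if ((mp_lock >> CPU_SHIFT) == cpuid) {
        mp_lock++;                            /* recursive acquisition */
        return 1;
    }
    return 0;   /* held by another CPU: the real code spins here */
}

/* Switch-out: save only the recursion count, not the CPU id. */
static void save_mplock(struct pcb *p)
{
    p->pcb_mpnest = mp_lock & COUNT_MASK;
}

/* Switch-in: rebuild the lock word for the CPU running the process. */
static void restore_mplock(const struct pcb *p, unsigned int cpuid)
{
    mp_lock = (cpuid << CPU_SHIFT) | p->pcb_mpnest;
}
```

Masking off the CPUid bits on save is what lets a process resume on a
different processor with its nesting level intact.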
---
-Alfred Perlstein - [bri...@rush.net|alf...@freebsd.org]
Wintelcom systems administrator and programmer - http://www.wintelcom.net/
[bri...@wintelcom.net]

To Unsubscribe: send mail to majord...@FreeBSD.org with
"unsubscribe freebsd-smp" in the body of the message
From: lu...@watermarkgroup.com (Luoqi Chen)
Subject: Re: SMP infoletter #1
Date: 1999/10/27
Message-ID: <7v767e$2p0t$1@FreeBSD.csie.NCTU.edu.tw>
X-Deja-AN: 541236506
X-Trace: FreeBSD.csie.NCTU.edu.tw 941038638 91167 140.113.235.250 (27 Oct 1999 15:37:18 GMT)
Organization: NCTU CSIE FreeBSD Server
NNTP-Posting-Date: 27 Oct 1999 15:37:18 GMT
Newsgroups: mailing.freebsd.smp
X-Complaints-To: usenet@FreeBSD.csie.NCTU.edu.tw

I would like to offer some comments here.

>
> Usage of the BGL: (src/sys/i386/i386/mplock.s)
>
> The BGL is pushed down (acquired) on all entry into the kernel, by
             ^^^^^^^^^^^
My understanding is that 'push down' is a completely different term.

> means of syscall, trap or interrupt.
...
> It would seem that the syscall lock could simply be delayed until
> entry to the actual system call (write/read/...) however several
> issues arise:
>
> 1) fault on copyin of user's syscall arguments
...
>
> Problem #1 takes care of itself because of the recursive MPlock.
>
> 2) ktrace hooks
...
>
> 3) STOPEVENT aka void stopevent(struct proc*, unsigned int, unsigned int);
...
You missed one very important part of the code path, which is also the
most difficult one to deal with to make the path MP safe: userret().
It involves scheduling (relatively easier) and signal delivery
(difficult).

>
> ---
>
> SPL issues: (src/sys/i386/isa/ipl_funcs.c)
>
> There exists an inherent race condition with the spl() system in
> an MP environment, consider:
>
> system is at splbio:
>
>     process A                       process B
>
>     int s;                          int s;
>     s = splhigh();
>     /* spl raised to high however,
>        saved spl 's' has old value
>        of splbio */
>                                     s = splhigh();
>                                     /* spl still high */
>     splx(s);
>     /* processor spl now at bio
>        even though B still needs
>        splhigh */
>                                     splx(s);
>
> Process B may be interrupted in a critical section.
>
> Also note that the asymmetric nature of the spl system makes it
> very difficult to pinpoint locations in the bottom half of the
> kernel (the part that services interrupts) that may collide with
> the top half (user process context).
>
> A short-sighted solution would be to enforce spl as an MPlock, an
> exclusive counting semaphore; however, since no locking protocol
> or ordering of spl pushdown is required, deadlock becomes a major
> problem.
>
> The only solution that may work with spl is adding the pushdown
> of the BGL when first asserting any level of spl and releasing
> the MPlock when spl0 is reached.
>
> It may also be interesting to see what a separate lock based only
> on spl would accomplish; moving to a model where the spl entry
> points become our new BGL might also be something to investigate.
>
I actually have a working implementation of this (I'm willing to
provide the patch if anyone is interested in trying it), but I
believe it leads to a dead end. We should use some kind of
interrupt-level-aware mutex instead, something like this:

    s = splimp();
    simple_lock(&mbuf_lock);
    ...
    simple_unlock(&mbuf_lock);
    splx(s);

If everyone agrees on this direction, there is an immediate benefit
we can reap by moving cpl to per-CPU storage and getting rid of
cpl_lock, which might reduce system time by a significant amount
(5~10%, unscientifically measured).

> Since spl is used only for short-time mutual exclusion it may
> actually work nicely as a coarse-grained locking system for the
> time being.
>
The reason I believe this leads us nowhere is that it is a hack, and
it could only marginally improve performance, since most of the
kernel runs under some spl protection; e.g., it's impossible to move
the TCP stack outside the BGL under this scheme.

> Why are the simplelocks necessary if the kernel is still guarded
> by the BGL?
> (besides use in the lockmgr)
>
Under the BGL it's not even necessary in the lockmgr; in fact, the
only useful simplelocks are the fast interrupt lock and the lock on
the I/O APIC register window (fast interrupt handlers are not under
BGL protection; IIRC, the only instance is sio). But the BGL is to
go, and hence simplelock is here to stay.

One interesting thing NetBSD has done is a read/write spinlock; it
could be used to protect lists like allproc. I think it would be
nice to have in our system too.

> ---
>
> Scheduler:
>
> The scheduler in cpu_switch() (src/sys/i386/i386/swtch.s) saves the
> current nesting level of the process's MPlock (after masking off
> the CPUid bits from it) into the PCB (process control block) (lines
> 317-324) before attempting to switch to another process where it
> restores the next process's nesting level (lines 453-455).
>
One thing that is relatively easy to do in this area is to allow a
processor spinning on the BGL to pick up another user process
instead. I'm currently looking into this myself; one thing I would
like to do is to move the nesting-level field from the U area to
struct proc, so that we could easily tell whether a process was
(involuntarily) context switched in user mode and is a candidate to
schedule on a non-lock-holding processor.

-lq
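The read/write spinlock mentioned above could be sketched with C11
atomics roughly as follows. This is a hypothetical illustration: the
names and semantics are mine, not NetBSD's actual primitive. Readers
share the lock; a writer excludes everyone.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical read/write spinlock: state > 0 counts readers,
 * state == 0 means free, state == -1 means a writer holds it. */
typedef struct {
    atomic_int state;
} rwspin_t;

static void rw_init(rwspin_t *l) { atomic_init(&l->state, 0); }

/* Readers may enter unless a writer holds the lock. */
static int rw_tryrlock(rwspin_t *l)
{
    int s = atomic_load(&l->state);
    return s >= 0 && atomic_compare_exchange_strong(&l->state, &s, s + 1);
}

/* A writer needs the lock completely free. */
static int rw_trywlock(rwspin_t *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->state, &expected, -1);
}

static void rw_runlock(rwspin_t *l) { atomic_fetch_sub(&l->state, 1); }
static void rw_wunlock(rwspin_t *l) { atomic_store(&l->state, 0); }
```

Spinning variants would simply loop on the try functions; the reader
count is what lets concurrent scans of a list like allproc proceed in
parallel while still giving a writer exclusive access.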