Berkeley DB Source Code Analysis (5) --- Transactional Lock Module


Locking Subsystem Learning Notes
0. locking API
__db_lput/__db_lget are the transactional lock put/get functions. Callers
often use __TLPUT instead, which calls __db_lput internally. __db_lput
downgrades the lock rather than simply releasing it if the db supports dirty
reads and the lock is a write lock; in all other cases it releases the lock.
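
The downgrade-or-release decision can be sketched as below. This is a toy model and not the real __db_lput: the names toy_lock, lock_mode and toy_lput are hypothetical, and the real lock mode enum has more values than shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified lock modes; hypothetical stand-ins for the real enum. */
typedef enum { MODE_NONE, MODE_READ, MODE_WRITE, MODE_WWRITE } lock_mode;

/* Hypothetical stand-in for a lock held by a txn. */
typedef struct {
    lock_mode mode;
    bool held;
} toy_lock;

/*
 * Model of the decision described above: a write lock on a database
 * that supports dirty reads is downgraded and stays held; any other
 * lock is released.
 */
void toy_lput(toy_lock *lock, bool db_supports_dirty_reads)
{
    assert(lock->held);
    if (db_supports_dirty_reads && lock->mode == MODE_WRITE) {
        lock->mode = MODE_WWRITE;   /* downgrade, keep holding */
    } else {
        lock->mode = MODE_NONE;     /* plain release */
        lock->held = false;
    }
}
```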

__LPUT and __ENV_LPUT simply call __lock_put, thus the lock is simply released.

1. So far we only use intention locks in cds mode. This is because tds mode does not use row level locking; it only locks pages. (What about the queue am?)

2. We need to provide a row level locking API, like the existing __db_lget/__db_lput for page level locking. qam already uses record locking: it calls __db_lget with a DB_LOCK_RECORD flag to lock a record. However, __db_lget can't be used directly, because it locks a db-wide recno, for which (dbfileuid, recno) is sufficient as the lock obj. The new am has no global recno, so we will lock (dbfileuid, pgno, in-page-indx).

qam does not lock pages; it always locks (dbfileuid, recno), so it does not
need intention locking either.

3. __db_ilock (DB_LOCK_ILOCK) is the lock object structure used inside DB to
represent lock info; all locks used internally by db, e.g. those in the lock
list, are managed using this type. We need to add an in-page index to it in
order to lock a row.
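
A row-level lock object along these lines might look like the sketch below. Everything here is hypothetical: toy_row_ilock, its field widths, and TOY_FILE_ID_LEN are assumptions made for illustration, loosely modeled on a page-level lock object of (page number, file id, type) with one extra in-page index field. The sketch compares objects as raw bytes, which is why the init function zeroes the struct first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_FILE_ID_LEN 20  /* assumed length of the file unique id */

/* Hypothetical row-level lock object: page-level fields plus indx. */
typedef struct {
    uint32_t pgno;                      /* page number */
    uint8_t  fileid[TOY_FILE_ID_LEN];   /* file unique id */
    uint32_t type;                      /* kind of object being locked */
    uint16_t indx;                      /* new: slot index within the page */
} toy_row_ilock;

/* Build a row lock object; zero it first so padding bytes compare equal. */
void toy_row_ilock_init(toy_row_ilock *obj, const uint8_t *fileid,
    uint32_t pgno, uint32_t type, uint16_t indx)
{
    assert(obj != NULL && fileid != NULL);
    memset(obj, 0, sizeof(*obj));
    obj->pgno = pgno;
    memcpy(obj->fileid, fileid, TOY_FILE_ID_LEN);
    obj->type = type;
    obj->indx = indx;
}

/* In this sketch two objects name the same row if their bytes match. */
int toy_row_ilock_same(const toy_row_ilock *a, const toy_row_ilock *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}
```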

__db_lock is the struct in the lock region that stores each lock's info
inside the lock subsystem, in order to manage database locks.
DB_LOCK is used in a lock request by an outside caller to get/put a lock. It
contains an offset pointer with which we find the corresponding internal
__db_lock.

the $db/lock/Design is obsolete.

First, there are multiple lock table partitions. Each has a __db_lockpart
structure containing partition-specific info and some of the free lock
objects and free locks, so getting a free lock object/lock can proceed in
parallel across partitions.

The main lock region has a __db_lockregion structure containing the global
info of the lock region. It holds the locker free list (lockers are not
spread across partitions), the locker and lock object hash tables, as
expected, and the lock table partition array. Note that all the partitions
live in the same region file as the main lock region.

There is no dedicated mutex for each bucket in the lockobj hash table;
rather, we use a lock partition's mutex to protect a specific bucket of
lockobjs. So if there are as many partitions as lockobj buckets, the
concurrency of lockobj mgmt is not affected. However, there can be far fewer
partitions than lockobjs and locks: the default is 1 on single cpu systems and
10*cpu-number on multicpu systems. I think manually setting a large lock
partition number can improve concurrency, though the doc says otherwise.
We also use only one mutex to protect all the lockers, so concurrency is
affected, especially when there are a lot of short but frequent txns, e.g.
when mvcc is turned on. However, this must be something we have to do to avoid
other issues.
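
To make the bucket-to-partition relationship concrete, here is a minimal sketch. The modulo mapping is an assumption made for illustration, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed mapping: bucket index modulo the number of partitions. */
uint32_t toy_bucket_partition(uint32_t bucket, uint32_t npartitions)
{
    assert(npartitions > 0);
    return bucket % npartitions;
}

/* Two buckets contend for the same mutex when they share a partition. */
int toy_buckets_contend(uint32_t b1, uint32_t b2, uint32_t npartitions)
{
    return toy_bucket_partition(b1, npartitions) ==
        toy_bucket_partition(b2, npartitions);
}
```

Under this mapping, buckets 3 and 7 share a mutex with 4 partitions (or with 1), but not with 8, which is the intuition behind raising the partition count.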

The stat info for the lock region is divided into 3 parts. First, DB_LOCK_STAT
contains the general stat info and lives in the main lock region
(__db_lockregion). Second, an array of stats for the lock object hash buckets
(DB_LOCK_HSTAT) also lives in the main lock region; there are as many lockobj
stat structures as lockobj hash buckets;
lastly we have a DB_LOCK_PSTAT for each lock table partition (in
The stats are split into multiple parts to improve concurrency. When
reporting stat info, the __lock_stat function merges all the pieces into a
single DB_LOCK_STAT.
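
The merge step can be pictured with the sketch below. The two structs and their fields (nlocks, locksteals) are hypothetical stand-ins chosen for illustration; they are not the real layouts of DB_LOCK_PSTAT or DB_LOCK_STAT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-partition counters. */
typedef struct {
    uint64_t nlocks;
    uint64_t locksteals;
} toy_pstat;

/* Hypothetical merged report. */
typedef struct {
    uint64_t nlocks;
    uint64_t locksteals;
} toy_stat;

/* Sum the per-partition counters into one report. */
toy_stat toy_lock_stat(const toy_pstat *parts, size_t nparts)
{
    toy_stat total = { 0, 0 };

    assert(parts != NULL || nparts == 0);
    for (size_t i = 0; i < nparts; i++) {
        total.nlocks += parts[i].nlocks;
        total.locksteals += parts[i].locksteals;
    }
    return total;
}
```

Each partition updates only its own counters under its own mutex, so the hot path never touches a shared counter; the cost is paid once, at reporting time.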

Both locks and lockobjs have a generation number, which allows reusing them
while avoiding confusion between a previous use and a later reuse of the same
structure. Locker structs can be reused too, but they don't need generation
numbers because each locker has a unique locker id.
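
A generation check of this kind can be sketched as follows. The slot and handle types are hypothetical; the idea shown is only that a handle remembers the generation it was issued under, and a release bumps the generation so that stale handles no longer match.

```c
#include <assert.h>
#include <stdint.h>

/* A reusable slot in shared memory (hypothetical). */
typedef struct { uint32_t gen; int in_use; } toy_slot;

/* What a caller keeps: slot index plus the generation it saw. */
typedef struct { uint32_t index; uint32_t gen; } toy_handle;

toy_handle toy_slot_acquire(toy_slot *slots, uint32_t index)
{
    toy_handle h;

    assert(!slots[index].in_use);
    slots[index].in_use = 1;
    h.index = index;
    h.gen = slots[index].gen;
    return h;
}

/* Releasing bumps the generation, invalidating outstanding handles. */
void toy_slot_release(toy_slot *slots, toy_handle h)
{
    assert(slots[h.index].in_use && slots[h.index].gen == h.gen);
    slots[h.index].in_use = 0;
    slots[h.index].gen++;
}

int toy_handle_valid(const toy_slot *slots, toy_handle h)
{
    return slots[h.index].in_use && slots[h.index].gen == h.gen;
}
```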

5. __lock_region_mutex_count
Total mutexes needed in the lock subsystem: max-number-of-locks +
number-of-partitions + 3. Each lock has a mutex for the thread of control to
block on; each partition has its own mutex; and the lock region, locker mgmt,
and deadlock detector need 1 each.
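
The formula above as a one-line function (the function name is hypothetical; only the arithmetic comes from the notes):

```c
#include <assert.h>
#include <stdint.h>

/* One mutex per lock, one per partition, plus 3 fixed ones
 * (lock region, locker mgmt, deadlock detector). */
uint32_t toy_lock_region_mutex_count(uint32_t max_locks, uint32_t npartitions)
{
    return max_locks + npartitions + 3;
}
```

For example, 1000 locks and 10 partitions need 1000 + 10 + 3 = 1013 mutexes.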

6. __lock_region_size
So far we allocate more hash buckets than there are things to store in the
hash table; this is true for both the locker and lockobj hash tables.
If the hash is good enough (a simple modulo operator % will do), each locker
and lockobj gets its own hash bucket, i.e. we never need to traverse the
bucket list to find a locker/lockobj. Consequently there can potentially be a
huge number of hash buckets, and I think this is one of the reasons we don't
use one mutex for each hash bucket.

7. __lock_env_refresh
It frees all allocated memory if the env is private. From here we can see
which structures we allocated. The only issue I find is that we only free the
free lockers; in-use lockers are not freed. In-use lockers are normally freed
when the locker is closed (txn end; db/dbc handle close), but may not be freed
if that close op never executes.

8. __lock_region_init
Init the shared lock region structure (__db_lockregion), allocate mutexes,
and init all members.
Allocate memory for the locker/lockobj hash tables, the lock partition array, the lockobj hash bucket stat array, and all lockers, locks and lockobjs. Link the locks and lockobjs evenly into each partition's free list, and link the lockers into a single free list.

This function is only called when we are creating rather than joining a lock
region.

9. __lock_open

Open and init the __db_locktab structure, which is the per-process struct for the lock region. Call __lock_region_init to create the lock region if needed.

10. __lock_fix_list -- sort a lock req's objs
This is the only function in use in the file locklist.c. It arranges the
lockobjs in the lockreq parameter of lock_vec so that they are grouped/sorted
by file.

11. __lock_promote
Looks in a lockobj's waiter list WL for any lock L that doesn't conflict with
any lock in the same lockobj's holder list HL, moves each such lock L from WL
to HL, marks L as DB_LSTAT_PENDING, and unlocks the L->mtx_lock mutex so that
the corresponding thread of control can resume running. If the lockobj's WL
then becomes empty and the lockobj is in a DD's dd_objs list, it is removed
from dd_objs since it won't block anyone.

It should be called whenever a lock becomes available, so that some waiters can take the lock and continue (i.e. it acts as a scheduler), e.g. when a lock in the lockobj's holder list is released, downgraded, or given back to the parent txn.
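
The promotion pass can be sketched on a toy lockobj as below. All names are hypothetical and the model is deliberately small: only READ and WRITE modes, fixed-size arrays instead of linked lists, and no mutexes. One assumption of this sketch is that the scan stops at the first waiter that still conflicts, so later waiters do not jump the queue; a holder's lock is also not treated as conflicting with a waiter from the same locker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX 8

typedef enum { M_READ, M_WRITE } toy_mode;
typedef enum { ST_WAITING, ST_PENDING, ST_HELD } toy_status;
typedef struct { int locker; toy_mode mode; toy_status status; } toy_lock;

typedef struct {
    toy_lock holders[TOY_MAX]; size_t nholders;
    toy_lock waiters[TOY_MAX]; size_t nwaiters;
} toy_lockobj;

/* Two-mode conflict rule: only READ/READ is compatible. */
bool toy_conflicts(toy_mode a, toy_mode b)
{
    return a == M_WRITE || b == M_WRITE;
}

/* Move waiters to the holder list in FIFO order; return how many moved. */
size_t toy_promote(toy_lockobj *obj)
{
    size_t promoted = 0;

    while (obj->nwaiters > 0 && obj->nholders < TOY_MAX) {
        toy_lock w = obj->waiters[0];
        bool blocked = false;

        for (size_t i = 0; i < obj->nholders; i++) {
            const toy_lock *h = &obj->holders[i];
            if (h->locker != w.locker && toy_conflicts(h->mode, w.mode)) {
                blocked = true;
                break;
            }
        }
        if (blocked)
            break;                      /* first conflict ends the pass */

        /* Unlink the waiter from the front of the waiter list. */
        for (size_t i = 1; i < obj->nwaiters; i++)
            obj->waiters[i - 1] = obj->waiters[i];
        obj->nwaiters--;

        w.status = ST_PENDING;          /* granted, thread not yet running */
        obj->holders[obj->nholders++] = w;
        promoted++;
    }
    return promoted;
}
```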

12. __lock_inherit_locks
Return a child txn's locks to its parent txn. For each lock L that the child
txn CT holds, if L is also held by the parent txn T, then L->refcount++;
otherwise L is put into T's locker's lock list. At the end, do a scheduling
pass by calling __lock_promote so that any sibling txns waiting for the
returned locks can go on.
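
A toy version of the merge is sketched below. The types are hypothetical, locks are matched by object id only (lock modes are ignored), and the sketch adds the child's reference count to the parent's, which equals a single increment when the child holds one reference. The final __lock_promote pass is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LOCKS 8

typedef struct { int obj; unsigned refcount; } toy_held;
typedef struct { toy_held locks[TOY_MAX_LOCKS]; size_t n; } toy_locker;

/* Hand every lock of the child over to the parent. */
void toy_inherit_locks(toy_locker *parent, toy_locker *child)
{
    for (size_t i = 0; i < child->n; i++) {
        size_t j = 0;

        /* Does the parent already hold a lock on the same object? */
        while (j < parent->n && parent->locks[j].obj != child->locks[i].obj)
            j++;
        if (j < parent->n) {
            parent->locks[j].refcount += child->locks[i].refcount;
        } else {
            assert(parent->n < TOY_MAX_LOCKS);
            parent->locks[parent->n++] = child->locks[i];
        }
    }
    child->n = 0;   /* the child no longer owns any locks */
}
```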

13. __lock_remove_waiter
Remove a lock from a lockobj's waiter list. If the lockobj then has no locks
in its waiter list, remove it from the dd's dd_objs list because it won't
block anyone. Finally, unlock the lock's mutex so that any thread of control
waiting for the mutex can continue. A lock's mutex is used to suspend the
thread of control T while it waits for the lock. The lock->mtx_lock is already
mutex-locked when the lock is created, so when we want T to suspend, T
executes the code to acquire lock->mtx_lock and thus blocks. Whenever we want
a thread of control to proceed because it got the lock, we mutex-unlock
lock->mtx_lock. The woken T does not mutex-unlock this mutex itself, so the
mutex stays locked and next time we can still suspend such a T by having it
acquire lock->mtx_lock.

14. __lock_trade
Transfer a lock from a locker A to another locker B, so that the lockobj that
A locked is now locked by B.

15. __lock_allocobj and __lock_alloclock
__lock_allocobj allocates a lockobj from a partition other than the specified
one P. We do so because P has no free lockobj space, so we look for free
lockobj space in the other partitions and move it to P's free list. Such an
action shows up as a lockobj 'steal' in the stat results.

Similarly we have __lock_alloclock. Free locks also live in partitions, so we
iterate over the partitions to find free space for a lock, and steal it into
the current partition.
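
The steal logic can be sketched with free counters instead of real free lists. The names are hypothetical, and the sketch takes one item directly from the donor partition rather than moving it onto the requesting partition's free list first.

```c
#include <assert.h>

/* Hypothetical partition: a free count plus a steal counter. */
typedef struct { unsigned nfree; unsigned steals; } toy_part;

/*
 * Allocate one item for partition `want`. Returns the index of the
 * partition the item came from, or -1 if every partition is empty.
 */
int toy_alloc(toy_part *parts, int nparts, int want)
{
    assert(want >= 0 && want < nparts);
    if (parts[want].nfree > 0) {
        parts[want].nfree--;
        return want;
    }
    /* Own free list is empty: look through the other partitions. */
    for (int step = 1; step < nparts; step++) {
        int i = (want + step) % nparts;
        if (parts[i].nfree > 0) {
            parts[i].nfree--;
            parts[want].steals++;   /* reported as a 'steal' in the stats */
            return i;
        }
    }
    return -1;
}
```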

16. __lock_getobj
Find a specified lock obj in the lockobj hash table. If the lockobj does not
exist yet and we are to create one, first allocate lockobj space from a
partition P. If P has no free lockobj space, steal from other partitions;
then init the lockobj fields.

Note that the lock subsystem allows external use, so the size
of a lockobj may not be the standard one (sizeof(__db_ilock)). If the external
lock obj is larger than sizeof(__db_ilock), we allocate memory from the lock region and store the (off, size) pair into lockobj->lockobj. We then copy the bytes into either that allocated memory or lockobj->objdata.

We also need to free that region space when releasing a lockobj, if we
find that lockobj->lockobj.size > sizeof(lockobj->objdata).
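
The inline-or-allocated storage pattern looks roughly like the sketch below. It is a toy model: malloc/free stand in for allocation from the lock region, a plain pointer replaces the (off, size) pair, and TOY_INLINE_SIZE and all names are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_INLINE_SIZE 28  /* assumed size of the inline buffer */

typedef struct {
    void   *data;                       /* where the object bytes live */
    size_t  size;                       /* length of the object bytes */
    unsigned char objdata[TOY_INLINE_SIZE];
} toy_lockobj;

/* Store the object bytes inline if they fit, otherwise allocate. */
int toy_lockobj_set(toy_lockobj *obj, const void *bytes, size_t size)
{
    void *dst = obj->objdata;

    if (size > sizeof(obj->objdata)) {
        dst = malloc(size);     /* stands in for a lock region allocation */
        if (dst == NULL)
            return -1;
    }
    memcpy(dst, bytes, size);
    obj->data = dst;
    obj->size = size;
    return 0;
}

/* Release: only oversized objects own separately allocated memory. */
void toy_lockobj_clear(toy_lockobj *obj)
{
    if (obj->size > sizeof(obj->objdata))
        free(obj->data);
    obj->data = NULL;
    obj->size = 0;
}
```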

17. __lock_freelock
Optionally remove a lock L from its locker's lock list, and optionally put it
into its partition's free lock list. The lock's mutex is destroyed.

Question: what does lock promotion mean?

A locker's info is stored in shared mem, and txns/db/dbc handles only store a
locker ID. When the locker is needed inside the locking subsystem, we find it
via __lock_getlocker and use it only inside the locking subsystem.

18. __lock_put_pp, __lock_put, __lock_put_nolock, __lock_put_internal
__lock_put_internal is the function that ultimately does the work; all the rest are wrappers.

__lock_put_internal releases a lock, either honoring or disregarding its reference count. Note that any lock/lockobj release increments its generation, so that it can be reused without issues caused by the reuse.

It first unlinks the lock from its lockobj's waiter/holder list. If it was in
the holder list, we call __lock_promote to schedule so that waiters can
proceed. Then, if the lockobj has no locks in either its waiter or holder
list, free the lockobj. Finally, call __lock_freelock to free the lock, which
removes it from its locker's lock list.

19. __lock_get_pp, __lock_get, __lock_get_api, __lock_get_internal
__lock_get_internal is the function that ultimately does the work. It decides
whether a lock should be granted to the locker. If not, the thread of control
executing this function blocks on the requested lock's mutex. The locker and
lockobj are both supposed to have been created in the lock region already.

Figure out if we can grant this lock or if it should wait.
By default, we can grant the new lock if it does not conflict
with anyone on the holders list OR anyone on the waiters list.
The reason that we don't grant if there's a conflict is that
this can lead to starvation (a writer waiting on a popularly
read item will never be granted).  The downside of this is
that a waiting reader can prevent an upgrade from reader to writer,
which is not uncommon.

There are two exceptions to the no-conflict rule.  First, if
a lock is held by the requesting locker AND the new lock does
not conflict with any other holders, then we grant the lock.
The most common place this happens is when the holder has a
WRITE lock and a READ lock request comes in for the same locker.
If we do not grant the read lock, then we guarantee deadlock.
Second, dirty readers are granted if at all possible while
avoiding starvation, see below.
In case of conflict, we put the new lock on the end of the
waiters list, unless we are upgrading or this is a dirty reader in
which case the locker goes at or near the front of the list.
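
The rules above, minus the dirty-reader exception, can be sketched as a small decision function. This is a simplified model and not the logic of __lock_get_internal itself: only READ and WRITE modes exist, the types and names are hypothetical, and a request from a locker that already holds a lock on the object is treated like an upgrade when it has to wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { M_READ, M_WRITE } toy_mode;
typedef enum { ACT_GRANT, ACT_WAIT_HEAD, ACT_WAIT_TAIL } toy_action;
typedef struct { int locker; toy_mode mode; } toy_lock;

/* Two-mode conflict rule: only READ/READ is compatible. */
bool toy_conflicts(toy_mode a, toy_mode b)
{
    return a == M_WRITE || b == M_WRITE;
}

toy_action toy_decide(const toy_lock *holders, size_t nholders,
    const toy_lock *waiters, size_t nwaiters, int locker, toy_mode mode)
{
    bool is_holder = false, holder_conflict = false, waiter_conflict = false;

    for (size_t i = 0; i < nholders; i++) {
        if (holders[i].locker == locker)
            is_holder = true;           /* our own locks never block us */
        else if (toy_conflicts(holders[i].mode, mode))
            holder_conflict = true;
    }
    for (size_t i = 0; i < nwaiters; i++)
        if (toy_conflicts(waiters[i].mode, mode))
            waiter_conflict = true;

    /* Default rule, plus the exception for lockers that already hold. */
    if (!holder_conflict && (is_holder || !waiter_conflict))
        return ACT_GRANT;
    /* Existing holders wait at the front, everyone else at the end. */
    return is_holder ? ACT_WAIT_HEAD : ACT_WAIT_TAIL;
}
```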

Lock holders on a lockobj have compatible locks, so if one of the locks is
DB_LOCK_READ, the other locks are read locks too. That's why the grant_dirty
variable assignment only tests one lock.

Locks can be downgraded from DB_LOCK_WRITE to DB_LOCK_WWRITE. A page can't be
read by a dirty reader while it is locked with DB_LOCK_WRITE. When we are done
writing the page but the txn has not ended yet, and dirty reads are in use,
the lock is downgraded to DB_LOCK_WWRITE so that dirty readers can read the
page. This prevents reading an invalid page: during the memory manipulations
on the page, it is completely broken/invalid to other txns. For the same
reason, while a page is being dirty-read it should not be written, otherwise
there may be memory errors; thus a dirty read lock conflicts with WRITE locks.
The page should not be read normally either, because it is dirty; the conflict
matrix does not prevent this, it is just that no locker does so.

When a locker gets its requested lock, the lock's status is set to
DB_LSTAT_PENDING, and when the thread of control actually runs, its status is
set to DB_LSTAT_HELD. While it is waiting for a lock, its status is
DB_LSTAT_WAITING.

DB_LOCK_UPGRADE is a flag modifying DB_LOCK_WRITE, meaning the locker wants to
upgrade from READ to WRITE lock; it's not a type of lock.

op flow:

1. decide whether to grant the lock or wait, and where to wait in the waiter list of the lockobj.

2. create lock struct in lock region and wire up with locker/lockobj.

3. if the lock is granted, continue and return; otherwise, run deadlock
detection before waiting for the lock (acquiring the lock's mutex, which
suspends the thread of control since that mutex was acquired when the lock was
created); once we get the lock and continue, handle timeout expiration and
other details.


1. Reduce deadlocks
If a locker already locks a lockobj and wants more locks on it, it should
get them ASAP; otherwise a cyclic wait has a better chance to form.
So if the new lockreq doesn't conflict with any other holders of the lockobj,
it gets the lock; otherwise it waits as the first waiter and will get the lock

Also, if the lockreq wants to upgrade to a write lock, it should have higher
priority to proceed: if it conflicts with a holder, it goes first in the
waiter list; otherwise it gets the lock without needing to be compatible with
all the waiters. This lets a lock upgrade complete asap; otherwise it is more
likely to starve than read locks, since read locks can starve lockers that
want to upgrade locks.

2. Avoid starvation
A lockreq should not conflict with any holders, and it shouldn't conflict with
any waiters either (except in the above 2 situations), so that waiters don't
starve.

3. Promote throughput
Dirty readers (lockers that have DB_READ_UNCOMMITTED) should get locks asap,
which is why they are willing to read dirty data. Thus they are put 1st or 2nd
in the waiter list, after the write lock.

20. __lock_vec_pp, __lock_vec_api, __lock_vec
__lock_vec is the function that ultimately does the work; it calls
__lock_put_internal and __lock_get_internal to do so.

DB_LOCK_WAIT: an internal lock type that locks nothing but waits for an
event to happen. Such a lock is used as a condition var.