
LSM Users Guide

Table of Contents

1. Introduction to LSM

2. Using LSM in Applications

3. Basic Usage

3.1. Opening and Closing Database Connections

3.2. Writing to a Database

3.3. Reading from a Database

3.4. Database Transactions and MVCC

4. Data Durability

5. Compressed and Encrypted Databases

6. Performance Tuning

6.1. Performance Related Configuration Options

6.2. Using Worker Threads or Processes

6.2.1. Architectural Overview

6.2.2. Automatic Work and Checkpoint Scheduling

6.2.3. Explicit Work and Checkpoint Scheduling

6.2.4. Compulsory Work and Checkpoint Scheduling

6.3. Database File Optimization

Overview

This document describes the LSM embedded database library and its use. It is intended to be part user manual and part tutorial, complementing the LSM API reference manual.

The first section of this document contains

a description of the LSM library and its features.

Section 2 describes how to use LSM from

within a C or C++ application (how to compile and link LSM, what to #include

etc.). The third section describes the essential APIs

that applications use to open and close database connections, and to read from

and write to databases.

The three sections described above contain all the information required to

create applications that use LSM. The remaining sections discuss more

specialized topics. Section 4 discusses the

configuration parameter that influences transaction durability (the guarantees

offered with respect to recently committed transactions if a power failure

occurs). Section 5 explains

the interfaces provided by LSM that allow external data compression and/or

encryption functions to be used to create compressed and/or encrypted

databases. Section 6 deals with the topic

of performance tuning.

1. Introduction to LSM

LSM is an embedded database library for key-value data, roughly similar

in scope to

Berkeley DB,

LevelDB or

KyotoCabinet.

Both keys and

values are specified and stored as byte arrays. Duplicate keys are not

supported. Keys are always sorted in memcmp() order. LSM supports the following

operations for the manipulation and query of database data:

  - Writing a new key and value into the database.

  - Deleting an existing key from the database.

  - Deleting a range of keys from the database.

  - Querying the database for a specific key.

  - Iterating through a range of database keys (either forwards or backwards).

Other salient features are:

  - LSM supports a single-writer/multiple-reader MVCC based transactional concurrency model. SQL style nested sub-transactions are supported. Clients may concurrently access a single LSM database from within a single or multiple application processes.

  - An entire LSM database is stored in a single file on disk.

  - Data durability in the face of application or power failure. LSM may optionally use a write-ahead log file when writing to the database to ensure committed transactions are not lost if an application or power failure occurs.

  - LSM may be configured to use external data compression and/or encryption routines to create and access compressed and/or encrypted databases.

Many database systems that support range queries, including SQLite 3, Berkeley DB and Kyoto Cabinet, are

based on one of many variants of the

b-tree data structure.

B-trees are attractive because a b-tree structure minimizes the number of disk

sectors that must be read from disk when searching the database for a specific

key. However, b-tree implementations usually suffer from poor write

localization - updating the contents of a b-tree often involves modifying the

contents of nodes scattered throughout the database file. If the database is

stored on a spinning disk (HDD), then the disk heads must be moved before writing non-contiguous sectors, which is extremely slow. If the database is stored on solid state storage (SSD), a similar phenomenon is encountered due to the large erase-block sizes. In general, writing to a series of contiguous disk sectors is orders of magnitude faster than updating the same number of disk sectors scattered randomly throughout a large file. Additionally, b-tree structures are prone to fragmentation, reducing the speed of range queries.

Todo: Should have references for the claims above.

Also, fix the link in the next paragraph to point to a description

of the log-structured-merge tree within lsm.wiki (or its successor).

LSM uses a different data structure that makes the

following performance tradeoffs relative to a b-tree:

  - A very large percentage of the disk sectors modified when writing to the database are contiguous. Additionally, in many cases the total number of sectors written to disk is reduced. This makes writing to an LSM database much faster than the equivalent b-tree.

  - LSM databases do not suffer from fragmentation to the same degree as b-trees. This means that the performance of large range queries does not degrade as the database is updated as it may with a b-tree.

  - It is accepted that under some circumstances searching an LSM database for a given key will involve examining more disk sectors than it would with a b-tree. In terms of disk sectors accessed when searching a database of size N, both b-trees and LSM provide O(log(N)) efficiency, but the base of the logarithm is generally larger for a b-tree than for LSM.

In other words, writing to an LSM database should be very fast and scanning

through large ranges of keys should also perform well, but searching the

database for specific keys may be slightly slower than when using a b-tree

based system. Additionally, avoiding random writes in favour of largely

contiguous updates (as LSM does) can significantly reduce the wear on SSD or

flash memory devices.

Although it has quite different features to LSM in other respects,

LevelDB makes similar performance tradeoffs.

Benchmark test results for LSM are available here. Todo:

Fix this link to point to a page with performance graphs.

2. Using LSM in Applications

LSM is not currently built or distributed independently. Instead, it

is part of the SQLite4 library. To use LSM in an application, the application

links against libsqlite4 and includes the header file "lsm.h" in any files

that access the LSM API.
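The following minimal program is a sketch only; it assumes the header and library names given above (lsm.h and libsqlite4), and uses the lsm_new() and lsm_close() functions described in section 3.

  /* sketch.c - verify that an application can compile and link against LSM.
  ** Build (assumed link flag): cc sketch.c -lsqlite4 */
  #include "lsm.h"

  int main(void){
    lsm_db *db = 0;
    int rc = lsm_new(0, &db);             /* allocate a database handle */
    if( rc==LSM_OK ) rc = lsm_close(db);  /* free it again */
    return rc;
  }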

Pointer to build instructions for sqlite4

3. Basic Usage

3.1. Opening and Closing Database Connections

Opening a connection to a database is a two-step process. The

lsm_new() function is used to create a new

database handle, and the lsm_open() function

is used to connect an existing database handle to a database on disk. This

is because some database connection properties may only be configured

before the database is opened. In that case, one or more calls to the

lsm_config() method are made between the

calls to lsm_new() and lsm_open().

The functions are defined as follows:

  int lsm_new(lsm_env *env, lsm_db **pDb);
  int lsm_open(lsm_db *db, const char *zFile);

Like most lsm_xxx() functions that return type int (the exception is

lsm_csr_valid()), both of the above

return LSM_OK (0) if successful, or an LSM

error code otherwise. The first argument to lsm_new() may be passed either

a pointer to a database environment object

or NULL. Almost all applications should pass NULL. A database environment

object allows the application to supply custom implementations of the various

operating system calls that LSM uses to read and write files, allocate heap

memory, and coordinate between multiple application threads and processes.

This is normally only required if LSM is being used on a platform that is not

supported by default. Passing NULL instructs the library to use the default

implementations of all these things, which is usually the right thing to do.

The second argument to lsm_new() is an output variable. Assuming the call

is successful, *pDb is set to point to the new database handle before

returning.

The first argument passed to lsm_open() must be an existing database

handle. The second is the name of the database file to connect to. Once

lsm_open() has been successfully called on a database handle, it can not be

called again on the same handle. Attempting to do so is an LSM_MISUSE error.

For example, to create a new handle and connect it to database "test.db"

on disk:

  int rc;
  lsm_db *db;

  /* Allocate a new database handle */
  rc = lsm_new(0, &db);
  if( rc!=LSM_OK ) exit(1);

  /* Connect the database handle to database "test.db" */
  rc = lsm_open(db, "test.db");
  if( rc!=LSM_OK ) exit(1);

A database connection can be closed using the lsm_close() function. Calling

lsm_close() disconnects from the database (assuming lsm_open() has been

successfully called) and deletes the handle itself. Attempting to use a

database handle after it has been passed to lsm_close() results in undefined

behaviour (likely a segfault).

  rc = lsm_close(db);

It is important that lsm_close() is called to close all database handles

created with lsm_new(), particularly if the connection has written to the

database. If an application that has written to a database exits without

closing its database connection, then subsequent clients may have to run

"database recovery" when they open the database, making the lsm_open() call

less responsive. Additionally, not matching each successful lsm_new() call

with a call to lsm_close() is a resource leak.

Counter-intuitively, an lsm_close() call may fail. In this case the database

handle is not closed, so if the application exits it invites the "database

recovery" performance problem mentioned above. The usual reason for an

lsm_close() call failing is that the database handle has been used to create

database cursors that have not been

closed. Unless all database cursors are closed before lsm_close() is called,

it fails with an LSM_BUSY error and the database handle is not closed.
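For example, a minimal shutdown sketch, assuming a single open cursor csr on connection db (the lsm_csr_close() function is described in section 3.3):

  lsm_csr_close(csr);   /* close all cursors first; lsm_csr_close() cannot fail */
  rc = lsm_close(db);   /* now succeeds instead of returning LSM_BUSY */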

3.2. Writing to a Database

Three API functions are used to write to the database:

  - lsm_insert(): insert a new key/value pair into the database, overwriting any existing entry with the same key.

  - lsm_delete(): remove a specific key from the database.

  - lsm_delete_range(): remove an open-ended range (one that does not include its endpoints) of keys from the database.

Each of these functions returns LSM_OK (0) if successful, or an LSM error

code otherwise (some non-zero value).

The following example code inserts a key/value pair into the database. The

key is a 1-byte blob consisting of the value 0x61, and the value is a 3 byte

blob consisting of 0x6F, 0x6E and 0x65, in that order. An application might

interpret these as utf-8 or ASCII codepoints, but LSM treats them as opaque

blobs of data.

  rc = lsm_insert(db, "a", 1, "one", 3);

Remove the entry with the key "a" (single 0x61 octet) from the database:

  rc = lsm_delete(db, "a", 1);

Remove all entries with keys that are greater than "c" but less than "f".

In this context key blobs are compared in the normal way - using memcmp(),

with longer keys being considered larger than their prefixes.

  rc = lsm_delete_range(db, "c", 1, "f", 1);

The example above removes all keys between "c" and "f", but does not remove the endpoints "c" and "f" themselves. Doing so requires three separate calls - one to remove the open-ended range of keys and two to remove the two endpoints, as follows:

  /* Should be checking return codes! */
  lsm_delete(db, "c", 1);
  lsm_delete_range(db, "c", 1, "f", 1);
  lsm_delete(db, "f", 1);

3.3. Reading from a Database

All data read from an LSM database is read via a cursor handle. Cursor

handles are opened using the

lsm_csr_open() API, as follows:

  lsm_cursor *csr;
  rc = lsm_csr_open(db, &csr);

Once an application has finished using a database cursor, it must be closed

using the lsm_csr_close() API. The

lsm_csr_close() function does not return any value. It cannot fail.

  lsm_csr_close(csr);

Database cursors support the following functions for positioning the cursor:

  - lsm_csr_seek() - move the cursor to point to a nominated database key.

  - lsm_csr_first() - move the cursor to point to the first entry in the database (the one with the smallest key).

  - lsm_csr_last() - move the cursor to point to the last entry in the database (the one with the largest key).

  - lsm_csr_next() - move the cursor to point to the next entry in the database.

  - lsm_csr_prev() - move the cursor to point to the previous entry in the database.

Once a cursor has been positioned, it supports the following functions

for retrieving the details of the current entry:

  - lsm_csr_valid() - determine whether or not the cursor currently points to a valid entry.

  - lsm_csr_key() - retrieve the key associated with the database entry the cursor points to.

  - lsm_csr_value() - retrieve the value associated with the database entry the cursor points to.

  - lsm_csr_cmp() - compare a key supplied by the application with the key associated with the entry the cursor points to.

The following example demonstrates using the

lsm_csr_seek() function to search the

database for a specified key,

lsm_csr_valid() to check if the search

was successful, and lsm_csr_value()

to retrieve the value associated with the key within the database.

  rc = lsm_csr_seek(csr, "b", 1, LSM_SEEK_EQ);
  if( lsm_csr_valid(csr) ){
    const void *pVal; int nVal;
    rc = lsm_csr_value(csr, &pVal, &nVal);
    if( rc==LSM_OK ){
      /* pVal now points to a buffer nVal bytes in size containing the
      ** value associated with database key "b". */
    }
  }

The example code below iterates forwards through all entries (in key

order, from smallest to largest) in the database. Function

lsm_csr_first() is used to position

the cursor to point to the first entry in the database, and

lsm_csr_next() is used to advance to

the next entry. After lsm_csr_next() is called to advance past the final

entry in the database, the cursor is left pointing to no entry at all,

lsm_csr_valid() returns 0, and the loop is finished. API function

lsm_csr_key() is used to retrieve the

key associated with each database entry visited.

  for(rc = lsm_csr_first(csr); lsm_csr_valid(csr); rc = lsm_csr_next(csr)){
    const void *pKey; int nKey;
    const void *pVal; int nVal;

    rc = lsm_csr_key(csr, &pKey, &nKey);
    if( rc==LSM_OK ) rc = lsm_csr_value(csr, &pVal, &nVal);
    if( rc!=LSM_OK ) break;

    /* At this point pKey points to the current key (size nKey bytes) and
    ** pVal points to the corresponding value (size nVal bytes). */
  }

The example code above could be modified to iterate backwards through

the entries in the database (again in key order, but this time from largest

to smallest) by replacing the call to lsm_csr_first() with

lsm_csr_last() and the call to lsm_csr_next() with lsm_csr_prev().

The signature of lsm_csr_seek() is:

  int lsm_csr_seek(lsm_cursor *csr, const void *pKey, int nKey, int eSeek);

The second and third arguments passed to lsm_csr_seek() define the key to search the database for (pKey must point to the buffer containing the nKey byte key when this function is called). Assuming no error occurs, if an entry with the requested key is present in the database, the cursor is left pointing to it. Otherwise, if no such entry is present, the final position of the cursor depends on the value passed as the fourth parameter to lsm_csr_seek(). Valid values for the fourth parameter to lsm_csr_seek() are:

LSM_SEEK_EQ

In this case, if the specified key is not present in the database, the cursor is not left pointing to any database entry (i.e. calling lsm_csr_valid() returns 0).

LSM_SEEK_LE

If the specified key is not present in the database and the fourth argument to lsm_csr_seek() is LSM_SEEK_LE (Less than or Equal), the cursor is left pointing to the database entry with the largest key that is less than the specified key. Or, if there are no entries in the database with keys smaller than the specified key, the cursor is left pointing to no entry at all.

LSM_SEEK_GE

If the specified key is not present in the database and the fourth argument to lsm_csr_seek() is LSM_SEEK_GE (Greater than or Equal), the cursor is left pointing to the database entry with the smallest key that is greater than the specified key. Or, if there are no entries in the database with keys larger than the specified key, the cursor is left pointing to no entry at all.

Calls made to lsm_csr_seek() with LSM_SEEK_EQ as the final argument are

slightly more efficient than those made specifying LSM_SEEK_LE or LSM_SEEK_GE.

So to retrieve a specific entry from a database, LSM_SEEK_EQ should be

preferred. The other two values are primarily useful for implementing

range queries. For example, to iterate backwards through all keys from "ggg"

to "cc", inclusive:for(rc = lsm_csr_seek(csr, "ggg", 3, LSM_SEEK_LE); lsm_csr_valid(csr); rc = lsm_csr_prev(csr)){

const void *pKey; int nKey;

const void *pVal; int nVal;

int res;

/* Compare the key that the cursor currently points to with "cc". If

** the cursor key is less than "cc", break out of the loop. */

rc = lsm_csr_cmp(csr, "cc", 2, &res);

if( rc!=LSM_OK || res<0 ) break;

rc = lsm_csr_key(csr, &pKey, &nKey);

if( rc==LSM_OK ) rc = lsm_csr_value(csr, &pVal, &nVal);

if( rc!=LSM_OK ) break;

/* At this point pKey points to the current key (size nKey bytes) and

** pVal points to the corresponding value (size nVal bytes). */

}

In the example code above, the call to lsm_csr_seek() positions the cursor

to point to the entry with key "ggg", if it exists, or to the largest entry

in the database with a key smaller than "ggg" if such a key can be found,

or to EOF otherwise. The lsm_csr_prev() call advances the cursor to the

next entry in the database file (in key order from largest to smallest), and

the lsm_csr_valid() call returns 0 to break out of the loop once the

cursor is advanced past the entry with the smallest key in the database. So

on its own, the "for" statement serves to iterate the cursor in reverse order

through all keys in the database less than or equal to "ggg".

The lsm_csr_cmp() call in the body of the loop is used to enforce the lower bound (keys >= "cc") on the range query by breaking out of the loop if an entry with a key smaller than "cc" is ever visited. lsm_csr_cmp() has the following signature:

  int lsm_csr_cmp(lsm_cursor *csr, const void *pKey, int nKey, int *piRes);

When lsm_csr_cmp() is called, the key specified by the second and third arguments (pKey and nKey) is compared to the database key that the cursor currently points to. Assuming no error occurs, depending on whether the cursor's key is less than, equal to, or greater than the specified key, *piRes is set to a value less than, equal to, or greater than zero before returning. In other words:

  *piRes = (cursor's key) - (specified key)
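As a complementary sketch, the same APIs can be combined with LSM_SEEK_GE to iterate forwards through all keys from "cc" to "ggg", inclusive (assuming the cursor and connection from the examples above):

  for(rc = lsm_csr_seek(csr, "cc", 2, LSM_SEEK_GE); lsm_csr_valid(csr); rc = lsm_csr_next(csr)){
    int res;

    /* Break out of the loop once the cursor key is greater than "ggg". */
    rc = lsm_csr_cmp(csr, "ggg", 3, &res);
    if( rc!=LSM_OK || res>0 ) break;

    /* Process the current entry (via lsm_csr_key()/lsm_csr_value()) here. */
  }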

3.4. Database Transactions and MVCC

LSM supports a single-writer/multiple-reader

MVCC

based transactional concurrency model. This is the same model that SQLite

supports in WAL mode.

A read-transaction must be opened in order to read from the database.

After a read-transaction has been opened, no writes to the database made

by other database connections are visible to the database reader. Instead,

the reader operates on a snapshot of the database as it existed when the

read transaction was first opened. Any number of clients may simultaneously

maintain open read-transactions.

If one is not already open, a read-transaction is opened when a database

cursor is created (the lsm_csr_open() function). It is closed when the number

of open cursors drops to zero.

A write-transaction is required to write to the database. At any point,

at most one database client may hold an open write transaction. If another

client already has an open write transaction, then attempting to open one

is an error (LSM_BUSY). If a read-transaction is already open when the

write-transaction is opened, then the snapshot read by the read-transaction

must correspond to the most recent version of the database. Otherwise,

the attempt to open the write-transaction fails (LSM_BUSY). In other words,

if any other client has written to the database since the current client's

read-transaction was opened, it will not be possible to upgrade to a

write-transaction.

Write-transactions may be opened either implicitly or explicitly. If any

of the following functions are called to write to the database when there

is no write-transaction open, then an implicit write-transaction is opened and

closed (committed) within the call:

  - lsm_insert()
  - lsm_delete()
  - lsm_delete_range()

This means, of course, that all three of the above may return LSM_BUSY, indicating either that another client currently has an open write-transaction, or that there is currently an open read-transaction and some other client has written to the database since it was opened.
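For example, a minimal sketch of checking for this condition after an implicit write (the key and value here are illustrative only):

  rc = lsm_insert(db, "key", 3, "value", 5);
  if( rc==LSM_BUSY ){
    /* Either another client holds the write-transaction, or this
    ** connection's read snapshot is out of date. Close any open
    ** cursors to end the read-transaction, then retry the write. */
  }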

When an explicitly opened transaction is closed, it may either be

committed or rolled back (reverted - so that the state of the database is

unchanged). Within a write-transaction there may also be a hierarchy of

nested sub-transactions that may be rolled back or committed independently.

A write-transaction is a property of a database connection - all writes

made by the connection become part of the current transaction (and possibly

sub-transaction).

The functions used to open, commit and rollback explicit transactions and sub-transactions are, respectively:

  int lsm_begin(lsm_db *, int);
  int lsm_commit(lsm_db *, int);
  int lsm_rollback(lsm_db *, int);

In all cases, the second parameter is either the maximum (lsm_commit(), lsm_rollback()) or minimum (lsm_begin()) number of nested write-transactions that will exist following the call (assuming it succeeds). If the second parameter passed is N:

  - Calling lsm_begin(db, N) opens zero or more nested write-transactions so that the database connection is left with at least N open nested write-transactions. If there are already N or more open nested write-transactions, then lsm_begin(db, N) is a no-op. lsm_begin(db, 0) is always a no-op. Calling lsm_begin(db, 1) when there is no open write-transaction opens a top-level write-transaction.

  - Calling lsm_commit(db, N) commits zero or more nested write-transactions so that the database connection is left with at most N open write-transactions. If the connection has N or fewer open nested write-transactions, then lsm_commit(db, N) is a no-op. Calling lsm_commit(db, 0) commits the outermost transaction (if any).

  - Calling lsm_rollback(db, 0) closes and rolls back the top-level write-transaction. Calling lsm_rollback(db, N) for any value of N greater than zero closes zero or more nested write-transactions so that the database connection is left with at most N open transactions. If, following this, the database connection has exactly N open nested write-transactions, the outermost is rolled back, but not closed. Calling lsm_rollback(db, 1) rolls back (but does not close) the top-level transaction.

Examples follow, with error checking omitted for brevity's sake.

  /* Open a write-transaction. Write some data to the database. Then
  ** commit and close the write transaction. Following this, the database
  ** contains:
  **
  **   "j" -> "ten"
  **   "k" -> "eleven"
  */
  lsm_begin(db, 1);
  lsm_insert(db, "j", 1, "ten", 3);
  lsm_insert(db, "k", 1, "eleven", 6);
  lsm_commit(db, 0);

  /* Open a write-transaction, perform all manner of writes and other
  ** operations (not shown). Then roll the top-level transaction back.
  ** Regardless of the write operations performed, the database remains
  ** unchanged:
  **
  **   "j" -> "ten"
  **   "k" -> "eleven"
  */
  lsm_begin(db, 1);
  /* Do all manner of writes, sub-transactions etc. */
  lsm_rollback(db, 0);

  /* Open a write-transaction. Write some data to the database. Then
  ** rollback the top level transaction but do not close it. Write
  ** different data to the database and commit. Following this block,
  ** the database is:
  **
  **   "j" -> "ten"
  **   "k" -> "eleven"
  **   "m" -> "thirteen"
  */
  lsm_begin(db, 1);
  lsm_insert(db, "l", 1, "twelve", 6);
  lsm_rollback(db, 1);
  lsm_insert(db, "m", 1, "thirteen", 8);
  lsm_commit(db, 0);

  /* Open a write-transaction and 2 nested sub-transactions. Delete a
  ** database key. Then commit and close the outermost sub-transaction.
  ** Open another sub-transaction (so that there are again 2 nested
  ** sub-transactions). Delete a different database key. Then rollback
  ** and close the outermost sub-transaction. Finally, delete yet another
  ** db key and commit the outermost transaction. Leaving just:
  **
  **   "k" -> "eleven"
  */
  lsm_begin(db, 3);
  lsm_delete(db, "j", 1);
  lsm_commit(db, 2);
  lsm_begin(db, 3);
  lsm_delete(db, "k", 1);
  lsm_rollback(db, 2);
  lsm_delete(db, "m", 1);
  lsm_commit(db, 0);

4. Data Durability

The value of the configuration parameter LSM_CONFIG_SAFETY determines

how often data is synced to disk by the LSM library. This is an important

tradeoff - syncing less often can lead to orders of magnitude better

performance, but also exposes the application to the risk of partial or total

data loss in the event of a power failure:

LSM_SAFETY_OFF (0)

Do not sync to disk at all. This is the fastest mode. If a power failure occurs while writing to the database, following recovery the database may be corrupt. All or some data may be recoverable.

LSM_SAFETY_NORMAL (1)

Sync only as much as is necessary to prevent database corruption. This is the default setting. Although slower than LSM_SAFETY_OFF, this mode is still much faster than LSM_SAFETY_FULL. If a power failure occurs while writing to the database, following recovery some recently committed transactions may have been lost. But the database file should not be corrupt, and older data should remain intact.

LSM_SAFETY_FULL (2)

Sync every transaction to disk as part of committing it. This is the slowest mode. If a power failure occurs while writing to the database, all successfully committed transactions should be present. The database file should not be corrupt.

The following example code sets the value of the LSM_CONFIG_SAFETY

parameter for connection db to LSM_SAFETY_FULL:

  int iSafety = LSM_SAFETY_FULL;
  lsm_config(db, LSM_CONFIG_SAFETY, &iSafety);

The current value of the LSM_CONFIG_SAFETY parameter can also be queried

by setting the initial value of the argument to -1 (or any other negative

value). For example:

  int iSafety = -1;
  lsm_config(db, LSM_CONFIG_SAFETY, &iSafety);

  /* At this point, variable iSafety is set to the currently configured value
  ** of the LSM_CONFIG_SAFETY parameter (either 0, 1 or 2). */

The lsm_config() function may also be used to configure other database

connection parameters.

5. Compressed and Encrypted Databases

LSM does not provide built-in methods for creating encrypted or compressed

databases. Instead, it allows the user to provide hooks to call external

functions to compress and/or encrypt data before it is written to the database

file, and to decrypt and/or uncompress data as it is read from the database

file.

A database connection is configured to call compression functions using a

call to lsm_config() with the second argument set to

LSM_CONFIG_SET_COMPRESSION. The third argument should point to an instance

of type lsm_compress, which is defined as follows:

  typedef struct lsm_compress lsm_compress;
  struct lsm_compress {
    u32 iId;
    void *pCtx;
    int (*xBound)(void *pCtx, int nIn);
    int (*xCompress)(void *pCtx, void *pOut, int *pnOut, const void *pIn, int nIn);
    int (*xUncompress)(void *pCtx, void *pOut, int *pnOut, const void *pIn, int nIn);
    void (*xFree)(void *pCtx);
  };

Todo: Explain how the hooks work here (same as zipvfs). Example code? Using zlib? Or something simple like an RLE implementation?
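In the absence of that explanation, the following sketch illustrates one plausible use of the structure. It assumes a zipvfs-style contract: xBound(pCtx, nIn) returns an upper bound on the output size for nIn bytes of input, and xCompress()/xUncompress() write their output to pOut and set *pnOut to the number of bytes produced. The "compression" here is a simple pass-through (memcpy), useful only to show the shape of the hooks; a real application would call into zlib or similar.

  #include <string.h>

  /* Hypothetical pass-through hooks, assuming the contract sketched above. */
  static int xBound(void *pCtx, int nIn){ return nIn; }
  static int xCopy(void *pCtx, void *pOut, int *pnOut, const void *pIn, int nIn){
    memcpy(pOut, pIn, nIn);
    *pnOut = nIn;
    return LSM_OK;
  }

  lsm_compress comp = {
    10000,       /* iId: application scheme ids must be >= 10000 */
    0,           /* pCtx: no context object is needed for this sketch */
    xBound,      /* xBound */
    xCopy,       /* xCompress */
    xCopy,       /* xUncompress */
    0            /* xFree: nothing to free */
  };
  rc = lsm_config(db, LSM_CONFIG_SET_COMPRESSION, &comp);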

The database file header of any LSM database contains a 32-bit unsigned

"compression id" field. If the database is not a compressed database, this

field is set to 1. Otherwise, it is set to an application supplied value

identifying the compression and/or encryption scheme in use. Application

compression scheme ids must be greater than or equal to 10000. Values smaller

than 10000 are reserved for internal use.

The lsm_compression_id() API may be used to read the compression id from

a database connection. Because the compression id is stored in the database

header, it may be read before any required compression or encryption hooks

are configured.

  #define LSM_COMPRESSION_EMPTY 0
  #define LSM_COMPRESSION_NONE  1

  int lsm_compression_id(lsm_db *db, u32 *piId);

When a database is opened for the first time, before it is first written,

the compression id field is set to LSM_COMPRESSION_EMPTY (0). The first time

a transaction is committed, the database compression id is set to a copy of

the lsm_compress.iId field of the compression hooks for the database handle

committing the transaction, or to LSM_COMPRESSION_NONE (1) if no compression

hooks are configured.

Once the compression id is set to something other than

LSM_COMPRESSION_EMPTY, when a database handle opens a read or write

transaction on the database, the compression id is compared against the

lsm_compress.iId field of the configured compression hooks, or against LSM_COMPRESSION_NONE if no compression hooks are configured. If the compression id

does not match, then an LSM_MISMATCH error is returned and the operation

fails (no transaction or database cursor is opened).
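For example, a brief sketch of reading the compression id from an existing database before deciding which hooks (if any) to configure:

  u32 iId;
  rc = lsm_compression_id(db, &iId);
  if( rc==LSM_OK && iId>=10000 ){
    /* The database uses an application-specific compression and/or
    ** encryption scheme. Configure the matching hooks here, before
    ** opening any transaction or cursor, to avoid LSM_MISMATCH. */
  }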

Maybe there should be a way to register a mismatch-handler callback.

Otherwise, applications have to handle LSM_MISMATCH everywhere...

6. Performance Tuning

This section describes the various measures that can be taken in order to

fine-tune LSM in order to improve performance in specific circumstances.

Sub-section 6.1 identifies the

configuration

parameters that can be used to influence database performance.

Sub-section 6.2 discusses methods for shifting the time-consuming processes of

actually writing and syncing the database file to

background threads or processes

in order to make writing to the database more responsive. Finally, sub-section 6.3 introduces "database optimization"

- the process of reorganizing a database file internally so that it is as small

as possible and optimized for search queries.

6.1. Performance Related Configuration Options

The options in this section all take integer values. They may be both

set and queried using the lsm_config()

function. To set an option to a value, lsm_config() is used as follows:

  /* Set the LSM_CONFIG_AUTOFLUSH option to 1MB */
  int iVal = 1 * 1024 * 1024;
  rc = lsm_config(db, LSM_CONFIG_AUTOFLUSH, &iVal);

In order to query the current value of an option, the initial value of the parameter (iVal in the example code above) should be set to a negative value, or any other value that happens to be out of range for the parameter - negative values just happen to be out of range for all integer lsm_config() parameters.

  /* Set iVal to the current value of LSM_CONFIG_AUTOFLUSH */
  int iVal = -1;
  rc = lsm_config(db, LSM_CONFIG_AUTOFLUSH, &iVal);

LSM_CONFIG_MMAP

If LSM is running on a system with a 64-bit address space, this option may be set to either 1 (true) or 0 (false). On a 32-bit platform, it is always set to 0.

If it is set to true, the entire database file is memory mapped. Or, if it is false, data is accessed using ordinary OS file read and write primitives. Memory mapping the database file can significantly improve the performance of read operations, as database pages do not have to be copied from operating system buffers into user space buffers before they can be examined.

This option can only be set before lsm_open() is called on the database connection.

The default value is 1 (true) on a 64-bit platform, and 0 otherwise.

LSM_CONFIG_MULTIPLE_PROCESSES

This option may also be set to either 1 (true) or 0 (false). If it is set to 0, then the library assumes that all database clients are located within the same process (have access to the same memory space). Assuming this means the library can avoid using OS file locking primitives to lock the database file, which speeds up opening and closing read and write transactions.

This option can only be set before lsm_open() is called on the database connection.

The default value is 1 (true).

LSM_CONFIG_USE_LOG

This is another option that may be set to either 1 (true) or 0 (false). If it is set to false, then the library does not write data into the database log file. This makes writing faster, but also means that if an application crash or power failure occurs, it is very likely that any recently committed transactions will be lost.

If this option is set to true, then an application crash cannot cause data loss. Whether or not data loss may occur in the event of a power failure depends on the value of the LSM_CONFIG_SAFETY parameter.

This option can only be set if the connection does not currently have an open write transaction.

The default value is 1 (true).

LSM_CONFIG_AUTOFLUSH

When a client writes to an LSM database, changes are buffered in memory before being written out to the database file in a batch. This option is set to the size of the buffer in bytes. The default value is 1048576.

Increasing this value may improve overall write throughput. Decreasing it reduces memory usage.

LSM_CONFIG_AUTOCHECKPOINT

This option determines how often the database is checkpointed (synced to disk). A checkpoint is performed after each N bytes (approximately) are written to the database file, where N is the value of this option. The default value is 2097152 (2MB).

Increasing this value (say to 4MB or even 8MB) may improve overall write throughput. However, it is important not to checkpoint too infrequently, as:

  - Space in the log file may only be reused following a checkpoint. Once a checkpoint has been completed, all application data written before the checkpoint is safely stored in the database file. This means the contents of the log file are no longer required, and so the next database writer may start writing a new log into the start of the existing file (overwriting the old data). As well as consuming disk space, large log files are undesirable as often the entire log file must be read during database recovery following an application or system failure.

  - Similarly, space in the database file freed by the continuous incremental reorganization of the database file that LSM performs cannot be reused until after a checkpoint has been performed. So increasing the value of this parameter causes the space used by the pre-reorganization versions of data structures within the file to be recycled more slowly, increasing the overall size of database files under some circumstances.

The roles of the LSM_CONFIG_AUTOFLUSH and LSM_CONFIG_AUTOCHECKPOINT options

are discussed in more detail in the following section.

6.2. Using Worker Threads or Processes

Usually, the database file is updated on disk from within calls to

write functions - lsm_insert(), lsm_delete(), lsm_delete_range() and

lsm_commit(). This means that occasionally, a call to one of these four

functions may take significantly longer than usual as it pauses to write or

sync the database file. See

below for details.

The alternative to updating the database file from within calls to write functions is to use one or more background threads or processes to perform the actual work of writing to the database file. To do this:

  - Have all client connections set the LSM_CONFIG_AUTOWORK option to 0. This stops them from writing to the database file.

  - Arrange for a background thread or process to connect to the database and call the lsm_work() function periodically.

Further explanation of, and example code for, the above is available below.

The following sub-sections provide a high-level description of the

LSM database architecture and descriptions of how various parameters

affect the way clients write to the database file. While this may seem

in some ways an inconvenient amount of detail, it may also be helpful

when diagnosing performance problems or analyzing benchmark results.

6.2.1. Architectural Overview

The LSM library implements two separate data structures that are used

together to store user data. When the database is queried, the library

actually runs parallel queries on both of these data stores and merges the

results together to return to the user. The data structures are:

  - The in-memory tree. The in-memory tree is an append-only b-tree variant designed to be stored entirely in main-memory (i.e. not written out to disk). The library strives to limit the total size of the in-memory tree (by default to 1MB in total).

    At any one time, there may actually be two in-memory tree structures present in memory. One immutable tree marked as "old" waiting to be written into the database file (see below) and one "live" tree to which new data may be appended.

  - The log-structured-merge tree structure for which LSM is named. This data structure is used to store data within the database file on disk.

    The log-structured-merge tree is made up of a series of "segments". Each segment is an immutable tree structure stored (more or less) contiguously within the database file. When the database is queried, the library runs parallel queries on each of the segments in the database and merges the results to return to the user.

    The only way to insert new data into the database is to add a new segment. In order to prevent the number of segments from growing too large, two or more existing segments may be merged together into a single larger segment at any point. Deleting existing key-value pairs is accomplished by inserting "delete-keys" into the new segment.

    The log-structured-merge tree structure is described in more detail link to lsm.wiki section here.

When a database client writes a transaction to the database, the new

data is inserted into the "live" in-memory tree structure. At the same time,

the new data is also appended to the log file on disk. The purpose of the log

file is to provide a persistent backup of any data that the system does not

consider to have been safely stored (see below) in the database file. If a

crash or power failure occurs, this data is recovered by reading the log

file.

Once sufficient data has accumulated within the "live" in-memory tree,

it is marked as "old" and a new live tree created. At any point thereafter,

the contents of the old in-memory tree may be used to populate a new segment

within the database file and then discarded. When this happens, the old

in-memory tree is said to have been "flushed to disk". If there is already an

old in-memory tree when the live tree is deemed to have accumulated enough data

to itself become an old tree, the existing old tree must first be flushed to

disk.

The set of segments that make up the log-structured-merge tree at any time

and the order in which they should be queried is termed a "snapshot". The

header of the database file contains a snapshot. A snapshot is also stored in

main memory. When the set of segments that make up the log-structured-merge tree

is modified, either by flushing an old in-memory tree to disk or by merging

two or more existing segments, the in-memory snapshot is updated immediately.

This is the snapshot that database clients use when querying or otherwise

operating on the database.

At any point after the in-memory snapshot has been updated, the in-memory

snapshot may be written into the database file header. This is known as

"checkpointing" the database. Depending on the value of the

LSM_CONFIG_SAFETY parameter, it may

be necessary to ensure that all segments referenced by the snapshot have been

synced to disk (safely stored on the persistent media such that they will not

be lost if a power failure occurs) before doing so. It is not necessary for

every version of the in-memory snapshot to be checkpointed. The in-memory

snapshot may be modified multiple times between checkpoints.

Because a checkpointer process is often required to sync the database file

before updating the database header, "checkpointing" often appears to be the

costliest part of transferring data to the database file, at least in terms of

wall-clock time.

Regular database checkpoints are required to ensure that unused space

within the log file and database file can be reused in a timely fashion.

Specifically:

  - Space within the log file cannot be recycled until the corresponding data has been written into a database segment and a checkpoint performed.

  - When two or more existing segments are merged within the database file, database clients may start using the new, larger, segment immediately. However the space occupied by the original segments may not be reused until after a snapshot that refers to the new segment, and not the old ones, has been checkpointed.

In other words, without checkpoints the system will function, but both the

log and database files will grow indefinitely as the database is modified

(even if the size of the dataset remains constant). Additionally, if a crash

or power failure occurs, the next client to open the database file has to

process all data written to the log file since the most recent checkpoint. If

checkpoints are performed infrequently, this can be a time consuming exercise.

In order to safely write data into the in-memory tree (by calling

lsm_insert, lsm_delete or lsm_delete_range), the database client must hold

the database WRITER lock. At most one client may concurrently hold the WRITER

lock. Holding the WRITER lock is synonymous with having an open write

transaction - the client obtains the WRITER lock when the transaction is

opened and relinquishes it when the transaction is closed.

As well as the WRITER lock, there are two other locks that may be held by

at most one client at any time - the WORKER and CHECKPOINTER locks. The roles

of the three locks are roughly as follows:

WRITER

The WRITER lock is required to modify the contents of the in-memory tree, including marking an in-memory tree as "old" and starting a new live tree. It is also required to write to the log file.

WORKER

The WORKER lock is required to write to segments within the database file, either when merging two or more existing segments within the database, or when flushing an in-memory tree to disk to create a new segment. The WORKER lock is also required to update the database snapshot stored in main memory (updated so that new clients will use the new segments the worker creates).

CHECKPOINTER

The CHECKPOINTER lock is required to update the snapshot stored in the database file header (to checkpoint the database).

The tasks associated with each of the locks above may be performed

concurrently by multiple database connections, located either in the same

application process or different processes.

6.2.2. Automatic Work and Checkpoint Scheduling

By default, database "work" (the flushing and merging of segments, performed

by clients holding the WORKER lock) and checkpointing are scheduled and

performed automatically from within calls to "write" API functions. The

"write" functions are:lsm_insert()

lsm_delete()

lsm_delete_range()

lsm_commit()

It is expected that automatic work and checkpoint scheduling will be

suitable for most applications. The advantage of this model is that it is

simple to use. However, any call to one of the functions listed above may be

co-opted by the system to perform database work or a checkpoint, causing it to

return more slowly than it otherwise would. In some situations, for example

when a writer thread is also responsible for handling user-interface events,

this may be undesirable.

Automatic work and checkpoint scheduling is controlled by four integer

parameters set or queried using the lsm_config() interface. The parameters

are:

LSM_CONFIG_AUTOWORK

This is a boolean parameter (default 1). If set, auto-work mode is enabled for the database connection. Otherwise, it is not.

LSM_CONFIG_AUTOFLUSH

An integer parameter (default 1048576). If this parameter is set to a non-zero value, then after each transaction is committed, the library checks the total size of the live in-memory tree. If it is equal to or greater than the configured value of this parameter in bytes and there is no existing old in-memory tree, then the current live tree is marked as old. Additionally, if the LSM_CONFIG_AUTOWORK parameter is set, the contents of the in-memory tree are immediately flushed to disk.

LSM_CONFIG_AUTOMERGE

This parameter must be set to an integer value between 2 and 8, inclusive. It controls the number of existing segments that auto-work attempts to merge together at a time. The default value is 4.

LSM_CONFIG_AUTOCHECKPOINT

If this parameter is set to an integer value greater than 0, then a checkpoint is automatically attempted after this many bytes are written into the database file. The default value is 2097152.
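The following brief sketch shows all four parameters being set on a connection; the values are illustrative only:

  /* Tune automatic work and checkpoint scheduling on connection db. */
  int iVal;

  iVal = 1;            lsm_config(db, LSM_CONFIG_AUTOWORK, &iVal);
  iVal = 2*1024*1024;  lsm_config(db, LSM_CONFIG_AUTOFLUSH, &iVal);
  iVal = 4;            lsm_config(db, LSM_CONFIG_AUTOMERGE, &iVal);
  iVal = 4*1024*1024;  lsm_config(db, LSM_CONFIG_AUTOCHECKPOINT, &iVal);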

Each segment in the database file is assigned an "age" - an integer zero

or greater indicating how many times the data in the segment has been merged.

A segment created by flushing the in-memory tree to disk is assigned an age

of 1. When two or more segments with age=1 are merged together to create a

larger segment, it is assigned an age of 2. And so on.

If auto-work is enabled, the library periodically checks the state of the

database file to see if there exist N or more segments with the same age

value A, where N is the value assigned to the LSM_CONFIG_AUTOMERGE parameter.

If so, work is done to merge all such segments with age=A into a new, larger

segment assigned age=A+1. At present, "periodically" as used above means

roughly once for every 32KB of data (including overhead) written to the

in-memory tree. The merge operation is not necessarily completed within

a single call to a write API (this would result in blocking the writer thread

for too long in many cases - in large databases segments may grow to be many GB

in size). Currently, the amount of data written by a single auto-work

operation is roughly 32KB multiplied by the number of segments in the database

file. This formula may change - the point is that the library attempts to limit

the amount of data written in order to avoid blocking the writer thread for too

long within a single API call.

Each time a transaction is committed in auto-work mode, the library checks

to see if there exists an "old" in-memory tree (see the LSM_CONFIG_AUTOFLUSH

option above). If so, it attempts to flush it to disk immediately. Unlike

merges of existing segments, the entire in-memory tree must be flushed to

disk before control is returned to the user. It is not possible to

incrementally flush an in-memory tree in the same ways as it is possible to

incrementally merge existing database segments together.

In order to perform auto-work on the database file, either to flush the

contents of an old in-memory tree to disk or to merge existing segments within

the database file together, the client must obtain the WORKER lock. If some

other client is already holding the WORKER lock, this will not be possible.

This is not an error. If this occurs, the writer thread simply returns

immediately, without performing any work on the database file.

Assuming the LSM_CONFIG_AUTOCHECKPOINT parameter is set to a value greater

than zero, after performing database work, the library automatically checks

how many bytes of raw data have been written to the database file since the

last checkpoint (by any client, not just by the current client). If this

value is greater than the value of the LSM_CONFIG_AUTOCHECKPOINT parameter,

a checkpoint is attempted. It is not an error if the attempt fails because the

CHECKPOINTER lock cannot be obtained.

6.2.3. Explicit Work and Checkpoint Scheduling

The alternative to automatic scheduling of work and checkpoint operations

is to explicitly schedule them, possibly in a background thread or dedicated

application process. In order to disable automatic work, a client must set

the LSM_CONFIG_AUTOWORK parameter to zero. This parameter is a property of

a database connection, not of a database itself, so it must be cleared

separately by all processes that may write to the database. Otherwise, they

may attempt automatic database work or checkpoints.

  /* Disable auto-work on connection db */
  int iVal = 0;
  lsm_config(db, LSM_CONFIG_AUTOWORK, &iVal);

The lsm_work() function is used to explicitly perform work on the database:

  int lsm_work(lsm_db *db, int nMerge, int nByte, int *pnWrite);

Parameter nByte is passed a limit on the number of bytes of data that

should be written to the database file before the call returns. It is a

hint only; the library does not honor this limit strictly.

If the database has an old in-memory tree when lsm_work() is called, it is

flushed to disk. If this means that more than nByte bytes of data is written

to the database file, no further work is performed. Otherwise, the number

of bytes written is subtracted from nByte before proceeding.

If parameter nMerge is greater than 1, then the library searches for

nMerge or more segments of the same age within the database file and performs

up to nByte bytes of work to merge them together. If the merge is completed

before the nByte limit is exceeded, the library searches for another set of

nMerge or more segments to work on, and so on. If at any point no such set of

nMerge segments can be found, the call returns without performing any

further work.

Calling lsm_work() with the nMerge argument set to 1 is used to "optimize"

the database (see below). Passing a value of zero or less for the nMerge

parameter is an error.

In any case, before returning the value of *pnWrite is set to the actual

number of bytes written to the database file.

The example code below might be executed in a background thread or process

in order to perform database work and checkpointing. In this case all other

clients should set the LSM_CONFIG_AUTOWORK parameter to zero.

  int rc;
  lsm_db *db;
  int nCkpt = 4*1024*1024;

  /* Open a database connection to database "test.db".
  **
  ** Configure the connection to automatically checkpoint the database after
  ** writing each 4MB of data to it (instead of the default 2MB). As well
  ** as to auto-work, the LSM_CONFIG_AUTOCHECKPOINT parameter applies to data
  ** written by explicit calls to lsm_work().
  */
  lsm_new(0, &db);
  lsm_config(db, LSM_CONFIG_AUTOCHECKPOINT, &nCkpt);
  lsm_open(db, "test.db");

  while( 1 ){
    int nWrite;

    /* Attempt up to 512KB of work. Set nWrite to the number of bytes
    ** actually written to disk. */
    rc = lsm_work(db, 2, 512*1024, &nWrite);
    if( rc!=LSM_OK && rc!=LSM_BUSY ){
      /* Anything other than LSM_OK or LSM_BUSY is a problem. LSM_BUSY
      ** indicates that some other client has taken the WORKER lock. Any
      ** other error indicates something has gone quite wrong. */
      lsm_close(db);
      return rc;
    }

    /* nWrite may be set to zero here in two scenarios. lsm_work()
    ** may have failed to obtain the WORKER lock and returned LSM_BUSY,
    ** indicating that some other connection is working on the database.
    ** Alternatively, it may be that there was no old in-memory tree to
    ** flush and no two segments of the same age within the database file,
    ** meaning the function could find no work to do.
    **
    ** In either case, there is no point in calling lsm_work() again
    ** immediately. Instead, sleep for a second before continuing. By that
    ** time, things may have changed (the other process may have relinquished
    ** the WORKER lock, or an in-memory tree may have been marked as old).
    */
    if( nWrite==0 ) sleep(1);
  }

Checkpoints can also be requested explicitly, using the lsm_checkpoint()

API:

  int lsm_checkpoint(lsm_db *db, int *pnCkpt);

If no work has been performed on the database since the most recent

checkpoint (implying that the snapshot has not changed and there is no need

to write it into the database file), lsm_checkpoint() sets *pnCkpt to zero

and returns immediately. Otherwise, it checkpoints the database and sets

*pnCkpt to the number of bytes written to the database file since the

previous checkpoint.
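For example, a dedicated checkpointer thread or process might run a loop like the following sketch, complementing the worker loop shown above:

  while( 1 ){
    int nByte = 0;

    /* Checkpoint the database if anything has changed since the last
    ** checkpoint. LSM_BUSY here would indicate that some other client
    ** holds the CHECKPOINTER lock. */
    rc = lsm_checkpoint(db, &nByte);
    if( rc!=LSM_OK && rc!=LSM_BUSY ) break;

    /* If there was nothing to checkpoint, wait a while before retrying. */
    if( nByte==0 ) sleep(1);
  }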

The number of bytes written to the database file since the most recent checkpoint can also be queried using the lsm_info() API function, as follows:

  int nCkpt;
  rc = lsm_info(db, LSM_INFO_CHECKPOINT_SIZE, &nCkpt);

The lsm_info() function can similarly be used to query the sizes of the "old" and "live" in-memory trees:

  int nOld, nLive;
  rc = lsm_info(db, LSM_INFO_TREE_SIZE, &nOld, &nLive);

6.2.4. Compulsory Work and Checkpoint Scheduling

Apart from the scenarios described above, there are two scenarios where database work or checkpointing may be performed automatically, regardless of the value of the LSM_CONFIG_AUTOWORK parameter:

  - When closing a database connection, and

  - When the number of segments with a common age in the database file grows unacceptably high.

Whenever an lsm_close() call would mean that the total number of

connections to a database drops to zero, the connection checks if the

in-memory tree is empty. If not, it is flushed to disk. Both the live and

old in-memory trees are flushed to disk in this case. It also checks if the

database file has been modified since the most recent checkpoint was

performed. If so, it also performs a checkpoint and, assuming no error has occurred, deletes the log file.

Additionally, whenever a worker wishes to flush an in-memory tree to a new

age=1 segment, it must first ensure that there are fewer than N existing age=1

segments, where N is the value that the LSM_CONFIG_AUTOMERGE parameter is

set to. If there are already N or more age=1 segments, they must be merged

into an age=2 segment before a new age=1 segment can be created within the

database file. Similar rules apply to segments of other ages - it is not

possible to create a new age=I segment if there are already N segments with

age=I in the database file. This has two implications:

  - The database is prevented from accumulating too many segments, regardless of whether or not auto-work is enabled or how infrequently lsm_work() is called, and

  - If auto-work is disabled and lsm_work() is not called frequently enough, it is possible that flushing an in-memory tree may require writing a tremendous amount of data to disk (possibly even rewriting the entire database file).

Finally, regardless of age, a database is limited to a maximum of 64

segments in total. If an attempt is made to flush an in-memory tree to disk

when the database already contains 64 segments, two or more existing segments

must be merged together before the new segment can be created.

6.3. Database File Optimization

Database optimization transforms the contents of the database file so that the following are true:

  - All database content is stored in a single segment. This makes the database effectively equivalent to an optimally packed b-tree structure for search operations - minimizing the number of disk sectors that need to be visited when searching the database.

  - The database file contains no (or as little as possible) free space. In other words, it is no larger than required to contain the single segment.

Should we add a convenience function lsm_optimize() that does not

return until the database is completely optimized? One that more or less does

the same as the example code below and deals with the AUTOCHECKPOINT issue?

This would help with this user manual if nothing else, as it means a method

for database optimization can be presented without depending on the previous

section.

In order to optimize the database, lsm_work() should be called repeatedly

with the nMerge argument set to 1 until it returns without writing any data

to the database file. For example:

  int nWrite;
  int rc;
  do {
    rc = lsm_work(db, 1, 2*1024*1024, &nWrite);
  }while( rc==LSM_OK && nWrite>0 );

When optimizing the database as above, either the LSM_CONFIG_AUTOCHECKPOINT

parameter should be set to a non-zero value or lsm_checkpoint() should be

called periodically. Otherwise, no checkpoints will be performed, preventing

the library from reusing any space occupied by old segments even after their

content has been merged into the new segment. The result - a database file that

is optimized, except that it is up to twice as large as it otherwise would be.
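For example, a sketch of the same loop with an explicit checkpoint after each unit of work (assuming automatic checkpointing has been disabled):

  int nWrite;
  int rc;
  do {
    rc = lsm_work(db, 1, 2*1024*1024, &nWrite);
    if( rc==LSM_OK ){
      int nCkpt = 0;
      rc = lsm_checkpoint(db, &nCkpt);  /* allow old segments to be reused */
    }
  }while( rc==LSM_OK && nWrite>0 );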
