TokuMX, MongoDB and InnoDB versus the insert benchmark with disks

I used the insert benchmark on servers that use disks in my quest to learn more
about MongoDB internals. The insert benchmark is interesting for a few reasons.
First, while inserting a lot of data isn't something I do all of the time, it is something
for which performance matters some of the time. Second, it subjects secondary
indexes to fragmentation, and excessive fragmentation leads to wasted IO and
wasted disk space. Finally, it allows for useful optimizations including write-optimized
algorithms (fractal trees via TokuMX
[http://www.tokutek.com/2013/07/tokumx-fractal-treer-indexes-what-are-they/], LSM via
RocksDB [http://rocksdb.org/] and WiredTiger [http://wiredtiger.com/]) or the
InnoDB insert buffer [https://www.google.com/search?q=innodb+insert+buffer].
Hopefully I can move on to other workloads after this week.
This test is interesting for another reason that I don't really explore here but will
in a future post. While caching all or most of the database in RAM works great at
eliminating reads, it might not do much for avoiding random writes. So a write-heavy
workload with a cached database can still be limited by random write IO,
and this will be more of an issue as RAM capacity grows on commodity servers
while people try to reuse their favorite update-in-place b-tree for cached
workloads. Some of the impact from that can be seen in the results for
MongoDB when the database is smaller than 72G. I wonder whether InnoDB can
be improved in this case. The traditional solution is to use snapshots (sequential
IO) and a redo log.
The test server has 72G of RAM and at least 8 10K RPM SAS disks with HW
RAID and a battery-backed write cache, so it can do a few thousand random IOPs
given many pending requests if we trade latency for throughput. The insert
benchmark was used with 1 client thread and the test was started with an empty
collection/table. I used the Java client [https://github.com/tmcallaghan/iibench-mongodb]
for MongoDB and TokuMX and the Python client
[http://bazaar.launchpad.net/~mdcallag/mysql-patch/mytools/files/head:/bench/ibench/]
for InnoDB. The MongoDB inserts are done with w:1,j:1 and
journalCommitInterval=2 (or logFlushPeriod=0 with TokuMX). So there is a wait
for fsync, but with all of the tests I have done to this point there is not much
difference between j:0 and j:1 as the journal sync does not have much impact
when inserting 1000 documents per insert request. The InnoDB inserts are done
with innodb_flush_log_at_trx_commit=1 so it also waits for fsync. I also used 8kb
pages for InnoDB and disabled the doublewrite buffer. Compression was not
used for InnoDB. Fsync is fast on the test hardware given the RAID write cache.
The clients run on the same host as the server to reduce network latency. The
oplog/binlog was disabled.
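
For readers who want the shape of the workload, this is a minimal sketch of the
MongoDB insert loop using pymongo. It is not the Java or Python clients linked
above; the field names and the three secondary indexes are stand-ins I chose to
mimic iibench:

# Minimal sketch of the insert benchmark loop, assuming pymongo.
# The field names and the 3 secondary indexes are stand-ins for iibench.
import random
import time
from pymongo import ASCENDING, MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("localhost", 27017)
# w:1, j:1 -- each insert request waits for the journal fsync
coll = client.test.get_collection(
    "iibench", write_concern=WriteConcern(w=1, j=True))
for field in ("price", "customerid", "cashregisterid"):
    coll.create_index([(field, ASCENDING)])

BATCH = 1000  # 1000 documents per insert request, as in the test
for _ in range(1000):  # 1M documents in this sketch; the test ran to 2B
    docs = [{"dateandtime": time.time(),
             "price": random.uniform(0, 500),
             "customerid": random.randrange(100000),
             "cashregisterid": random.randrange(1000)}
            for _ in range(BATCH)]
    coll.insert_many(docs, ordered=False)
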
I usually have feature requests listed in a post but not this time. I think that
MongoDB needs much more in the way of per-collection and per-index statistics.
That shouldn't be a surprise given my work on the same for MySQL. But that will
wait for another post.
The performance summary isn't a surprise. TokuMX does better than InnoDB
because fractal trees greatly reduce the random IOPs demand. InnoDB does
better than MongoDB. There are a few reasons why InnoDB does better than
MongoDB even though they both use an update-in-place b-tree:
1. Databases with MongoDB are larger than with InnoDB so cache hit rates are
lower when the database is larger than RAM. I don't understand all of the
reasons for the size differences. Including attribute names in every
document is not the majority of the problem. I think there is more secondary
index fragmentation with MongoDB. I have results with and without the
powerOf2Sizes option
[http://docs.mongodb.org/manual/reference/command/collMod/] enabled and that
doesn't explain the difference.
2. The InnoDB insert buffer
[http://dev.mysql.com/doc/innodb/1.1/en/innodb-performance-change_buffering.html]
is the primary reason that InnoDB does better. This is true when comparing
InnoDB to many products that use an update-in-place b-tree, not just
MongoDB. Because of the insert buffer, InnoDB is unlikely to stall on disk
reads to leaf pages of secondary indexes during index maintenance. Those
reads can be done in the background using many concurrent IO requests.
MongoDB doesn't have this feature. It blocks on disk reads during secondary
index maintenance and won't benefit from concurrent IO for reads despite
the RAID array used by the server. This note has performance results
[https://www.facebook.com/note.php?note_id=492969385932] for the insert
benchmark and InnoDB when the insert buffer is disabled to show the
benefit from that feature (a sketch of that experiment follows this list). I have
also written about problems, since fixed, in InnoDB that prevented the insert
buffer from being useful because it became full
[http://mysqlha.blogspot.com/2008/12/other-performance-problem.html].
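
To make the insert buffer's benefit concrete, here is a rough sketch of the
experiment referenced above: time the same inserts with the change buffer
disabled and enabled. innodb_change_buffering is the real server variable; the
driver, table shape and row counts are my assumptions, and the difference only
shows once the secondary index leaf pages no longer fit in the buffer pool:

# Compare insert rates with the InnoDB change (insert) buffer off and on.
# Assumes mysql-connector-python and an iibench-like table with 3
# secondary indexes; both are illustrative, not the benchmark's code.
import time
import mysql.connector

conn = mysql.connector.connect(user="root", database="test")
cur = conn.cursor()

def insert_rate(n):
    start = time.time()
    for i in range(n):
        cur.execute(
            "INSERT INTO iibench (price, customerid, cashregisterid) "
            "VALUES (%s, %s, %s)", (i % 500, i % 100000, i % 1000))
        conn.commit()
    return n / (time.time() - start)

cur.execute("SET GLOBAL innodb_change_buffering = 'none'")
print("change buffer off:", insert_rate(100000), "rows/s")
cur.execute("SET GLOBAL innodb_change_buffering = 'all'")
print("change buffer on: ", insert_rate(100000), "rows/s")
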
For the test the client inserts up to 2B rows. But I wasn't willing to wait for
MongoDB and stopped it after less than 600M rows. InnoDB was stopped after
1.8B rows. The columns used for the result tables are listed below. There are a lot
more details on these columns in a previous post
[http://smalldatum.blogspot.com/2014/03/redo-logs-in-mongodb-and-innodb.html]. Each
of the sections that follow describes the performance to insert the next 100M
documents/rows.

sizeGB - the database size in GB
bpd - bytes per document/row computed from sizeGB / #docs (or #rows)
MB/s - the average rate for bytes written per second computed from iostat.
This includes IO for the database files and the journal/redo logs
GBw - the total number of GB written to the database including journal/redo
logs
secs - the number of seconds to insert data
irate - the rate of documents or rows inserted per second
notes - more details on the configuration

Results

From 0 to 100M rows

This has results from inserting 100M documents/rows to an empty
collection/table. Things that interest me that I have previously reported include 1)
MongoDB databases are much larger and 2) MongoDB does much more disk IO
for the same workload, and the increase in bytes written isn't explained by the
database being larger. One of the reasons for the high bytes-written rate is that
the test takes longer to complete with MongoDB and a hard checkpoint is done
every syncdelay seconds. InnoDB is better at delaying writeback for dirty pages.
The interesting result that I have seen in a few cases with both MongoDB 2.4.9
and 2.6.0 is that results are worse with powerOf2Sizes enabled. I have not taken
the time to debug this problem; that is on my TODO list. At first I thought I had a
few bad servers (flaky HW, etc) but I haven't seen the opposite for this workload
(powerOf2Sizes enabled getting better insert rates). The problem appears to
be intermittent. Note that 2.6 has a fix for JIRA 12216
[https://jira.mongodb.org/browse/SERVER-12216] so that allocation of new files is
not blocked while msync is in progress, so 2.6 should be somewhat faster than 2.4.
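
For reference, powerOf2Sizes is toggled per collection with the collMod
command. A minimal example via pymongo; the collection name is mine:

# Toggle the powerOf2Sizes record allocation strategy (MongoDB 2.4/2.6).
from pymongo import MongoClient

db = MongoClient("localhost", 27017).test
db.command("collMod", "iibench", usePowerOf2Sizes=True)   # enable
db.command("collMod", "iibench", usePowerOf2Sizes=False)  # disable
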
config   sizeGB  bpd  MB/s   GBw   secs  irate  notes
innodb       16  171  28.9   124   4290  23308
tokumx      9.2   98  11.1    79   7127  14030
mongo24      43  461  46.4  1539  33230   3009  powerOf2Sizes=0
mongo24      44  472  30.0  1545  51634   1937  powerOf2Sizes=1
mongo26      42  450  47.9  1446  30199   3311  powerOf2Sizes=1

From 100M to 200M rows

TokuMX and fractal trees are starting to show a benefit relative to InnoDB.

config   sizeGB  bpd  MB/s   GBw   secs  irate  notes
innodb       31  166  24.3   238   9781  10224
tokumx       17   91  12.3    90   7328  13646
mongo24      72  386  37.4  1768  47329   2113  powerOf2Sizes=0
mongo24      79  424  24.6  1731  70325   1422  powerOf2Sizes=1
mongo26      76  408  39.3  1611  40992   2439  powerOf2Sizes=1

From 200M to 300M rows

More of the same as TokuMX gets better relative to others.

config   sizeGB  bpd  MB/s   GBw   secs  irate  notes
innodb       45  161  21.7   350  16136   6198
tokumx       25   89  12.0    84   7071  14142
mongo24      98  350  30.7  2008  65514   1526  powerOf2Sizes=0
mongo24     106  379  19.9  1917  96351   1038  powerOf2Sizes=1
mongo26     108  386  24.9  1933  77677   1287  powerOf2Sizes=1

From 300M to 400M rows

TokuMX begins to get slower. MongoDB gets a lot slower as the database is
much larger than RAM. Problems unrelated to MongoDB cost me two of the
long-running test servers (for 2.4.9 and 2.6.0 with powerOf2Sizes=1).

config   sizeGB  bpd  MB/s   GBw    secs  irate  notes
innodb       61  163  21.1   376   17825   5610
tokumx       31   83  12.1    86    7172  13941
mongo24     130  348  14.7  2313  157395    635  powerOf2Sizes=0

From 400M to 500M rows

MongoDB is getting significantly slower as the database is larger than RAM.
More on this in the next section.

config   sizeGB  bpd  MB/s   GBw    secs  irate  notes
innodb       75  161  19.8   462   23337   4285
tokumx       39   83  11.1    84    7584  13186
mongo24     160  344   4.7  2105  441534    202  powerOf2Sizes=0

From 500M to 600M rows

I wasn't willing to wait for MongoDB to make it to 600M and stopped the test
when it reached ~540M inserts. The insert rate continues to drop dramatically.
InnoDB does better because of the insert buffer. I assume that for MongoDB the
rate would drop to ~50/second were I willing to wait: that is the point at which
every inserted document needs a disk read for each of its 3 secondary indexes
and the disk array can do ~150 disk reads/second when requests are submitted
serially. InnoDB was slightly faster compared to the previous 100M inserts, but it
will get slower in the long run.
I looked at iostat output and the MongoDB host was doing ~260 disk
reads/second and ~375 disk writes/second at test end. For both reads and
writes the average request size was ~8kb. The write stats include writes to the
journal and database files. From PMP stack traces [http://poormansprofiler.org/] I
see a single thread busy walking b-tree indexes most of the time. Note that the
write rate for MongoDB has fallen in line with the reduction in the insert rate.
Database pages aren't getting dirty as fast as they used to because MongoDB
is stalled on secondary index leaf node reads.
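
PMP is just aggregated gdb thread stacks. This is a small Python take on the
same idea, a sketch rather than the poormansprofiler.org script itself; the frame
parsing is simplistic:

# Poor man's profiler: grab all thread stacks with gdb, collapse each
# to its function names, and count duplicates. Run as: pmp.py <pid>
import collections
import subprocess
import sys

pid = sys.argv[1]
out = subprocess.run(
    ["gdb", "-batch", "-ex", "thread apply all bt", "-p", pid],
    capture_output=True, text=True).stdout

stacks = collections.Counter()
frames = []
for line in out.splitlines():
    if line.startswith("Thread "):
        if frames:
            stacks[",".join(frames)] += 1
        frames = []
    elif line.startswith("#"):
        parts = line.split()
        # frame lines look like "#0  0x00007f... in pread64 () from ..."
        if len(parts) > 3 and parts[1].startswith("0x"):
            frames.append(parts[3])
        elif len(parts) > 1:
            frames.append(parts[1])
if frames:
    stacks[",".join(frames)] += 1

for stack, count in stacks.most_common(10):
    print(count, stack)
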
config   sizeGB  bpd  MB/s   GBw   secs  irate  notes
... at 600M docs/rows
innodb       89  159  20.1   392  19465   5137
tokumx       46   82  11.6    90   7741  12917
... at 540M documents
mongo24     168  340   2.9  1235      X    123  powerOf2Sizes=0

From 900M to 1B rows

Alas, InnoDB has begun to degrade faster. Even the insert buffer eventually is no
match for a write-optimized algorithm.

config   sizeGB  bpd  MB/s   GBw   secs  irate  notes
innodb      148  158  15.4  1515  98413   1016
tokumx       74   79  10.9    92   8436  11853

From 1.4B to 1.5B rows

More of the same.

config   sizeGB  bpd  MB/s   GBw    secs  irate  notes
innodb      221  158  12.5  1745  140274    713
tokumx      104   74  11.0    96    8722  11464

From 1.5B to 2B rows

TokuMX is all alone.

config   sizeGB  bpd  MB/s  GBw  secs  irate  notes
tokumx      142   76  12.6   99  7868  12709

Posted by Mark Callaghan
Labels: mongodb, mysql

Redo logs in MongoDB and InnoDB

Both MongoDB and InnoDB support ACID. For MongoDB this is limited to single
document changes [http://docs.mongodb.org/manual/faq/fundamentals/] while
InnoDB extends that to multi-statement and possibly long-lived transactions. My
goal in this post is to explain how the MongoDB journal is implemented and used
to support ACID. Hopefully this will help to understand performance. I include
comparisons to InnoDB.

What is ACID?

There are a few interesting constraints on the support for ACID with MongoDB. It
uses a per-database reader-writer lock
[http://docs.mongodb.org/manual/faq/concurrency/]. When a write is in progress all
other uses of that database (writes & reads) are blocked. Reads can be done
concurrent with other reads but block writes. The manual states that the lock is
writer greedy [http://docs.mongodb.org/manual/faq/concurrency/] so that a pending
writer gets the lock before a pending reader. I am curious whether this also means
that a pending writer prevents additional readers when the lock is currently in
read mode; that question is on my TODO list. The reader-writer lock code is
kind of complex. By 'in progress' I mean updating in-memory structures from
mmap'd files like the document in the heap storage and any b-tree indexes. For
a multi-document write the lock can be yielded (released) in between documents,
so the write might be done from multiple points in time and readers can see the
write in progress. There are no dirty reads. The $isolated option
[http://docs.mongodb.org/manual/reference/operator/update/isolated/] can be used to
avoid yielding the per-db lock for multi-document writes. Even with that option, an
error half-way into a multi-document operation results in a half-done change as
there is no undo. MyISAM users are familiar with this problem. The cursor
isolation provided by MongoDB isn't that different from READ COMMITTED on a
system that doesn't do MVCC snapshots (see older versions of BerkeleyDB and
maybe IBM DB2 today).
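
For illustration, a multi-document update with the $isolated option, shown here
via pymongo against a made-up collection; the caveat above (no lock yield
between documents, but also no undo on error) still applies:

# $isolated prevents yielding the per-db write lock between documents
# in a multi-document update (MongoDB 2.x era). The collection and
# fields are hypothetical.
from pymongo import MongoClient

coll = MongoClient("localhost", 27017).test.accounts
coll.update_many({"status": "active", "$isolated": 1},
                 {"$inc": {"balance": -10}})
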
It is bad for an in-progress write to stall on a disk read (page fault from an
mmap'd file) while holding the per-database lock. MongoDB has support to yield
the per-db lock [http://docs.mongodb.org/manual/faq/concurrency/] on a page fault
(real or predicted) but I have questions about this. Is the yield only done for the
document pages in the heap storage or extended to index pages? Is anything
done to guarantee that most or all pages (document in heap store, all index
pages to be read) are in memory before applying any writes? Note that if the
document is updated in memory and then a page fault occurs during index
maintenance then I doubt the lock can be yielded. This is another question on
my TODO list. I am not the only person with
[http://stackoverflow.com/questions/22256776/does-mongodb-yield-lock-on-index-pagefault]
that [https://groups.google.com/forum/#!topic/mongodb-user/UEhi7Y74DvE]
question. MongoDB has something like an LRU to predict whether there will be a
page fault on a read, and understanding the performance overhead from that is
also on my TODO list. I have seen a lot of CPU overhead from that code on
some benchmarks.
MongoDB doesn't have row locks. The change to a document is visible as soon
as the per-db lock is released. Not only are some writes from a multi-document
change visible before all documents have been modified, but all changes are
visible before the change is durable
[http://smalldatum.blogspot.com/2014/03/when-does-mongodb-make-transaction.html]
via the journal. This behavior is different from what you can get from other DBMS
products and users should be aware [https://jira.mongodb.org/browse/DOCS-2908]
of that.

Redo logs

InnoDB has a redo log and uses the system tablespace for undo. The changes
written to support undo are made durable via the redo log just like changes to
database tables. The undo information enables consistent reads for long-running
transactions. The InnoDB redo log uses buffered IO by default and is
configured via the innodb_flush_method option
[https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_method].
Redo log pages are 512 bytes and that might need to change when the
conversion from 512 to 4096 byte disk sectors is complete. Each log page has a
checksum.
The innodb_flush_log_at_trx_commit option
[https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit]
determines whether the log is forced on commit or once per second. There are a
fixed number of redo log files and they are preallocated. With buffered IO and
512-byte aligned writes, the first write on a 4kb boundary can incur a disk read to
get the page into the OS cache before applying the change. This is a waste of
disk IO and the workaround seems to be
[http://dom.as/2010/11/18/logs-memory-pressure/] some form of padding to 4kb for
some writes. But note that the padding will be reused/reclaimed on the next log
write. An alternative is to use direct IO, but there might be several calls to write
or pwrite before the log must be forced and making each write synchronous will
delay log writing. With buffered IO the filesystem can coalesce the data from
multiple writes as these are adjacent in the log file.
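
A toy model of the padding arithmetic described above; the numbers are
illustrative, not measured:

# Bytes written to the redo log for one forced commit, after padding.
# InnoDB pads to 512-byte log blocks; the pad-to-4kb workaround trades
# extra log bytes for avoiding a read of a fresh filesystem page.
INNODB_BLOCK = 512
FS_PAGE = 4096

def log_bytes_written(record_bytes, block=INNODB_BLOCK):
    blocks = -(-record_bytes // block)  # ceiling division
    return blocks * block

print(log_bytes_written(300))           # 512: one padded InnoDB block
print(log_bytes_written(300, FS_PAGE))  # 4096: the pad-to-4kb workaround
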
MongoDB doesn't do undo. It does have a redo log called the journal
[http://docs.mongodb.org/manual/core/journaling/]. This uses direct IO on Linux. The
log page size is 8kb and each page is protected by a checksum. The in-memory
log buffer is compressed via Snappy before the log write is done and the
compressed result is padded to 8kb. The space taken by padding isn't
reused/reclaimed for the next write, so a sequence of small inserts with j:1 each
writes at least 8kb to the journal. Writes to the journal are done by a background
thread (see durThread in dur.cpp). Note that the background thread iterates over
a list of redo log entries that must be written to the journal, copies them to a
string buffer, then uses Snappy to compress that data, then pads the compressed
output to the next multiple of 8kb, then writes the padded output to the journal
file. The dur section in serverStatus output
[http://docs.mongodb.org/manual/reference/server-status/]
has counters for the amount of data written to the journal, including
the padding (journaledMB). The size of the data prior to padding is the
journaledMB counter divided by the compression counter. Note that these
counters are computed over the last few seconds.
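
The journal write path described above reduces to a little arithmetic. A sketch,
assuming the python-snappy package and made-up buffer contents:

# What durThread does to the log buffer, reduced to arithmetic:
# compress with Snappy, then pad the result to the next 8kb multiple.
import os
import snappy

JOURNAL_PAGE = 8192

def journaled_bytes(log_entries):
    raw = b"".join(log_entries)
    compressed = snappy.compress(raw)
    # padding is not reclaimed, so a tiny j:1 insert still writes 8kb
    return -(-len(compressed) // JOURNAL_PAGE) * JOURNAL_PAGE

print(journaled_bytes([b"small insert"]))     # 8192: one padded page
print(journaled_bytes([os.urandom(100000)]))  # ~13 pages: incompressible
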
MongoDB optionally recycles journal files, and whether that happens is
determined by journal preallocation (preallocj). With preallocj, 3 journal files are
created at process start and this can delay startup for the time required to create
three 1GB journal files (see preallocateFiles in dur_journal.cpp). This writes data
to the files so real IO is done, including an fsync. In steady state, after process
start, old journal files are recycled with preallocj (see removeOldJournalFile in
dur_journal.cpp). Without preallocj the journal files are not preallocated at
process start and old journal files are not reused. There is an undocumented
option, --nopreallocj, that can disable preallocj. There is no option to force
preallocj: that is determined by a short performance test done at process start
(see preallocateIsFaster in dur_journal.cpp). One way to determine whether
preallocj is in use is to check the journal directory for the preallocated files.
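
That last check is easy to script. A sketch; the dbpath and the prealloc.* naming
are assumptions based on the versions I looked at:

# Check whether journal preallocation is in use by looking for the
# preallocated files in the journal directory. Adjust dbpath as needed.
import glob

prealloc = glob.glob("/data/db/journal/prealloc.*")
if prealloc:
    print("preallocj in use:", prealloc)
else:
    print("preallocj not in use")
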
Preallocation for both database and journal files doesn't mean that files are
written an extra time -- once during preallocation and at least once during regular
use. I was happy to learn this. Database file preallocation uses posix_fallocate
rather than write/pwrite on Linux (see the run and ensureLength methods in the
FileAllocator class). Journal file preallocation uses write append, but that should
only be done at process start and then the files are recycled (see preallocateFile
and removeOldJournalFile in dur_journal.cpp).
Using strace is a great way to understand complex DBMS software. This shows
the sequence of 8kb writes to the journal during the insert benchmark with a
client that uses j:1 and 1 document per insert:
strace -f -p $( pidof mongod ) -ewrite
write(5, "g\n\0\0\264\346\0\0\0\0\0\0\216\f/S\374=8\22\224'L\376\377\377\377iiben"..., 8192) = 8192
write(5, "\t\v\0\0\264\346\0\0\0\0\0\0\216\f/S\374=8\22\344)L\376\377\377\377iiben"..., 8192) = 8192
write(5, "m\6\0\0\264\346\0\0\0\0\0\0\216\f/S\374=8\22\264\26L\376\377\377\377iiben"..., 8192) = 8192
write(5, "R\4\0\0\264\346\0\0\0\0\0\0\216\f/S\374=8\22\224\16L\376\377\377\377iiben"..., 8192) = 8192

Group commit

Proper group commit is now supported for InnoDB but I will skip the details. It is
done directly by a thread handling the COMMIT operation for a user's
connection and there is no wait unless another thread is already forcing the log.
My team did the first implementation of group commit but MariaDB and MySQL
did something better. We were thrilled to remove that change from our big patch
for MySQL.
MongoDB has group commit. The journal is forced to disk every
journalCommitInterval [http://docs.mongodb.org/manual/tutorial/manage-journaling/]
milliseconds. When a thread is blocked waiting for the journal to be forced, the
interval is reduced to 1/3 of that value. The minimum value for
journalCommitInterval is 2, so the maximum wait in that case should be 1 (2/3
rounded up). This means that MongoDB will do at most 1000 log forces per
second. Some hardware can do 5000+ fast fsyncs courtesy of a battery-backed
write cache in HW RAID or flash, so there are some workloads that will want
MongoDB to force the log faster than it can today. Group commit is done by a
background thread (see durThread, durThreadGroupCommit, and
_groupCommit in dur.cpp). Forcing the journal at journalCommitInterval/3
milliseconds is also done when there is too much data ready to be written to it.
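
The pattern is easy to see in a sketch. This is my reconstruction of the
timer-plus-waiters idea, not MongoDB's code; names and structure are mine:

# Group commit sketch: a background thread forces the journal on a
# timer and shortens the interval to interval/3 while committers are
# blocked waiting on it.
import threading

class GroupCommit:
    def __init__(self, sync_fn, interval_ms=100):
        self.sync_fn = sync_fn             # forces the journal (fsync)
        self.interval = interval_ms / 1000.0
        self.cond = threading.Condition()
        self.waiters = 0
        self.synced = 0                    # count of completed forces
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            with self.cond:
                # wait a full interval, or interval/3 if anyone is blocked
                wait = self.interval / 3 if self.waiters else self.interval
                self.cond.wait(timeout=wait)
            self.sync_fn()                 # force the journal, lock released
            with self.cond:
                self.synced += 1
                self.cond.notify_all()     # release all waiting committers

    def commit_durably(self):
        """Block until the journal has been forced (like j:1)."""
        with self.cond:
            target = self.synced + 1
            self.waiters += 1
            self.cond.notify_all()         # wake the syncer early
            while self.synced < target:
                self.cond.wait()
            self.waiters -= 1
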

Performance

I used the insert benchmark to understand redo log performance. The test used
1 client thread to insert 10M documents/rows into an empty collection/table with 1
document/row per insert. The test was repeated in several configurations to
understand what limited performance. I did this to collect data for several
questions: how fast can a single-threaded workload sync the log, and how much
data is written to the log per small transaction? For the InnoDB tests I used
MySQL 5.6.12, disabled the doublewrite buffer and used an 8k page. The
MongoDB tests used 2.6.0 rc0. The TokuMX tests used 1.4.0. The following
configurations were tested:

inno-sync - fsync per insert with innodb_flush_log_at_trx_commit=1
inno-lazy - ~1 fsync per second with innodb_flush_log_at_trx_commit=2
toku-sync - fsync per insert with logFlushPeriod=0
mongo-sync - fsync per insert, journalCommitInterval=2, inserts used j:1
mongo-lazy - a few fsyncs/second, journalCommitInterval=300, inserts used w:1, j:0
mongo-nojournal - journal disabled, inserts used w:1

The following metrics are reported for this test.
bpd - bytes per document (or row). This is the size of the database at test
end divided by the number of documents (or rows) in the database. As I
previously reported
[http://smalldatum.blogspot.com/2014/03/insert-benchmark-for-innodb-mongodb-and.html],
MongoDB uses much more space than InnoDB whether or not powerOf2Sizes
[http://docs.mongodb.org/manual/reference/command/collMod/] is enabled. They
are both update-in-place b-trees so I don't understand why MongoDB does
so much worse when subject to a workload that causes fragmentation.
Storing attribute names in every document doesn't explain the difference.
But in this case the results overstate the MongoDB overhead because of
database file preallocation.
MB/s - the average disk write rate during the test in megabytes per second
GBw - the total number of bytes written to disk during the test, in GB. This
includes writes to the database files and (when enabled) the redo logs. The
difference between inno-sync and inno-lazy is the overhead of a 4kb redo
log write per insert. The same is not true between mongo-sync and
mongo-lazy. My educated guess to explain why MongoDB and InnoDB are
different is that for mongo-sync the test takes much longer to finish than
mongo-lazy, so there are many more hard checkpoints (write all dirty pages
every 60 seconds). InnoDB is much better at keeping pages dirty in the buffer
pool without writeback. In all cases MongoDB is writing much more data to
disk: in the lazy mode it writes ~15X more and in the sync mode it writes ~6X
more. I don't know if MongoDB does hard checkpoints (force all dirty pages
to disk every syncdelay seconds) when the journal is disabled. Perhaps I
was too lazy to read more code.
secs - the number of seconds to insert 10M documents/rows.
bwpi - bytes written per inserted document/row. This is GBw divided by the
number of documents/rows inserted. The per row overhead for inno-sync is
4kb because a redo log force is a 4kb write. The per document overhead for
mongo-sync is 8kb because a redo log force is an 8kb write. So most of the
difference in the write rates between MongoDB and InnoDB is not from the
redo log force overhead.
irate - the rate of documents/rows inserted per second. MongoDB does
fewer than 1000 per second as expected given the use of
journalCommitInterval. This makes for a simple implementation of group
commit but is not good for some workloads (single-threaded with j:1).
logb - the total bytes written to the redo log as reported by the DBMS.
Only MongoDB reports this accurately when sync-on-commit is used
because it pads the result to a multiple of the filesystem page size. For
MongoDB the data comes from the dur section of the serverStatus output,
but I changed MongoDB to not reset the counters as upstream code resets
them every few seconds. InnoDB pads to a multiple of 512 bytes and I used
the os_log_written counter to measure it. AFAIK TokuMX doesn't pad and
the counter is LOGGER_BYTES_WRITTEN. So neither TokuMX nor InnoDB
accounts for the fact that the write is done using the filesystem page size
(a multiple of 4kb).
logbpi - log bytes per insert. This is logb divided by the number of
documents/rows inserted. There are two numbers for MongoDB. The first is
the size after padding and compression; it is a bit larger than 8kb, which
isn't a surprise as the minimum value is 8kb given this workload. The second
number is the size prior to compression and padding. This value can be
compared to InnoDB and TokuMX and I am surprised that it is so much
larger for MongoDB. I assume MongoDB doesn't log page images. This is
something for my TODO list. (A small sketch of these derived metrics
follows this list.)
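
To make the derived columns concrete, here they are as formulas in a small
sketch; the sample inputs are round placeholder numbers, not values from the
table below:

# The derived metrics defined above, as code. Inputs are raw
# measurements; the sample values are placeholders.
def derived(size_gb, docs, gb_written, secs, log_bytes):
    gib = 2**30
    return {
        "bpd":    size_gb * gib / docs,     # bytes per document/row
        "bwpi":   gb_written * gib / docs,  # bytes written per insert
        "irate":  docs / secs,              # inserts per second
        "logbpi": log_bytes / docs,         # log bytes per insert
    }

print(derived(size_gb=150, docs=10_000_000, gb_written=50,
              secs=3000, log_bytes=8 * 2**30))
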

Test Results

iibench 1 doc/insert, fsync, 10M rows

config           bpd  MB/s    GBw   secs   bwpi  irate   logb  logbpi
inno-sync        146  18.9   54.3   3071   5690   3257   7.5G  785
inno-lazy        146   2.8    5.8   2251    613   4442   7.5G  785
toku-sync        125  31.0   86.8   2794   9104   3579   2.3G  251
mongo-sync       492  23.1  312.0  13535  32712    739  83.3G  8733/4772
mongo-lazy       429  40.5   79.8   1969   8365   5078  21.9G  2294/4498
mongo-nojournal  440  34.1   42.0   1226   4401   8154     NA  NA

From all of this I have a few feature requests:
1. Don't try to compress the journal buffer when it is already less than 8kb.
That makes commit processing slower and doesn't reduce the amount of
data written to the journal as it will be padded to 8kb.
2. Provide an option to disable journal compression. For some configurations
of the insert benchmark I get 10% more inserts/second with compression
disabled. Compression is currently done by the background thread before
writing to the log. This adds latency for many workloads. When compression
is enabled it is possible to be more clever and begin compressing the
journal buffer early. Compression requires 3 copies of data -- once to the