The problem of MySQL Replication unable to catch up is quite
common in MySQL world and in fact I already wrote about it. There are many aspects of managing
mysql replication lag such as using proper hardware and configuring
it properly. In this post I will just look at couple of query
design mistakes which result in low hanging fruit troubleshooting
MySQL Replication Lag
First fact you absolutely need to remember is MySQL Replication
is single threaded, which means if you have any long running write
query it clogs replication stream and small and fast updates which
go after it in MySQL binary log can't proceed. It is either more
than than just about queries - if you're using explicit
transactions all updates from the transactions are buffered
together and when dumped to binary log as one big chunk which can't
be interleaved by any other query execution. So if you have
transaction containing millions of simple updates instead of one
large update to help MySQL replication lag it is not going to
work.
This brings us to rule number one - if you care
about replication latency you must not have any long running
updates. Queries or transactions containing multiple update queries
which add up to long time. I would keep the maximum query length at
about 1/5th of the maximum replication lag you're ready to
tolerate. So if you want your replica to be no more than 1 minute
behind keep the longest update query to 10 sec or so. This is of
course rule of thumb depending on differences in master/slave
configuration, their load and concurrency you may need to keep the
ratio higher or allow a bit longer queries.
What should you do if you need to update a lot of rows ? Use
Query Chopping - this can be running update/delete with LIMIT in
the loop, controlling maximum amount of values per batch in
multiple row insert statement or Fetching data you're planning to
update/delete and having multiple queries to delete it (see example
below)
This brings us to yet another rule for smart replication - do
not make Slave to do more work than it needs to do. It is crippled
by having to do all of this in single thread already - do not make
it even harder. If there is considerable effort needed to select
rows for modification - spread it out and have separate select and
update queries. In such case slave will only need to run
UPDATE
Example:
SQL:
UPDATE posts
SET spam=1 WHERE body
LIKE
"%cheap rolex%";
This query will perform full table scan in MySQL 5.0 (even if
there are no spam posts) which will load slave significantly. You
can replace it with:
SQL:
SELECT id
FROM posts
WHERE body LIKE
"%cheap rolex%";
UPDATE posts
SET
spam=1 WHERE id
IN
(list of
ids)
If there could be many ids matched on the first place you should
also use query chopping and run update in chunks if application
allows it.
In MySQL 5.1 with row level replication you will not have
selection process running on SLAVE but it will not do the chopping
for you.
In general this trick does not only work well for full table
scan updates but in general for cases when there are much more rows
examined than modified.
The next common mistake is using INSERT ...
SELECT - which is in similar to what I just described but can be
much worse as SELECT may end up being extremely complicated query.
It is best to avoid INSERT ... SELECT going through replication in
5.0 for many reasons (locking, long query time, waste of execution
on slave). Piping data through application is the best solution in
many cases and is quite easy - it is trivial to write the function
which will take SELECT query and the table to which store its
result set and use in your application in all cases when you need
this functionality.
Finally you should not overload your
replication - Quite typically I see replication lagging
when batch jobs are running. These can load master significantly
during their run time and make it impossible for slave to run the
same load through single thread. The solution in many cases is to
simply space it out and slow down your batch job (such as adding
sleep calls) to ensure there is enough breathing room for
replication thread.
You can also have controlled execution of batch job - this is
when they will check slave lag every so often and pause if it
becomes too large. This is a bit more complicated approach but it
saves you from running around and adjusting your sleep behavior to
keep the progress fast enough and at the same time keep replication
from lagging.
In many bad replication lags I've seen simply following these
simple rules would avoid a lot of problems and often save massive
hardware purchases or development efforts based on assumption MySQL
replication can't possibly keep up any more.