I am consuming a high rate data stream and doing the following steps to store data in a MySQL database. For each new arriving item.
(1) Parse incoming item.
(2) Execute several "INSERT ... ON DUPLICATE KEY UPDATE"
I have used INSERT ... ON DUPLICATE KEY UPDATE to eliminate one additional round-trip to the database.
While trying to improve the overall performance, I have considered doing bulk updates in the following way:
(1) Parse incoming item.
(2) Generate SQL statement with "INSERT ... ON DUPLICATE KEY UPDATE" and append to a file.
Periodically flush the SQL statements in the file to the database.
Two questions:
(1) will this have a positive impact in the database load?
(2) how should I flush the statements to the database so that indices are only reconstructed after the complete flush? (using transactions?)
UPDATE: I am using Perl DBI + MySQL MyISAM.
Thanks in advance for any comments.
解决方案
You don't say what kind of database access environment (PERL DBI? JDBC? ODBC?) you're running in, or what kind of table storage engine (MyISAM? InnoDB?) you're using.
First of all, you're right to pick INSERT ... ON DUPLICATE KEY UPDATE. Good move, unless you can guarantee unique keys.
Secondly, if your database access environment allows it, you should use prepared statements. You definitely won't get good performance if you write a bunch of statements into a file, and then make a database client read the file once again. Do the INSERT operations directly from the software package that consumes the incoming data stream.
Thirdly, pick the right kind of table storage engine. MyISAM inserts are going to be faster than InnoDB, so if you're logging data and retrieving it later that will be a win. But InnoDB has better transactional integrity. If you're really handling tonnage of data, and you don't need to read it very often, consider the ARCHIVE storage engine.
Finally, consider doing a START TRANSACTION at the beginning of a batch of INSERT ... commands, then doing a COMMIT and another START TRANSACTION after a fixed number of rows, like 100 or so. If you're using InnoDB, this will speed things up a lot. If you're using MyISAM or ARCHIVE, it won't matter.
Your big wins will come from the prepared statement stuff and the best choice of storage engine.