How to exploit MySQL index optimizations

最新推荐文章于 2024-09-12 01:15:37 发布

oneyearlater

最新推荐文章于 2024-09-12 01:15:37 发布

阅读量530

点赞数

文章标签： mysql optimization table query concurrency structure

from http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/

I’ve written a lot recently about MySQL index and table structure, primary keys, surrogate keys, and related optimizations. In this article I’ll explain how MySQL’s index structures enable an extremely important query optimization, and how that differs between storage engines. I’ll also show you how to know and predict when the optimization is triggered, how to design tables and queries so it’ll be used, and how to avoid defeating it with poor practices. Plus, I’ll peek a bit into InnoDB internals to show you what’s going on behind the scenes.

A review of MySQL’s primary and secondary indexes

You need to understand how MySQL’s indexes work, and how InnoDB’s are different from other storage engines, such as MyISAM, because if you don’t, you can’t design tables effectively.

The InnoDB storage engine creates a clustered index for every table. If the table has a primary key, that is the clustered index. If not, InnoDB internally assigns a six-byte unique ID to every row and uses that as the clustered index. (Moral of the story: pick a primary key of your own — don’t let it generate a useless one for you).

All indexes are B-trees. In InnoDB, the primary key’s leaf nodes are the data. Secondary indexes have a pointer to the data at their leaf nodes. A picture is worth a thousand words, so here’s a diagram of the table structure I’ll use later on in this article (click through to see it full size):

MyISAM has no clustered index, so the data isn’t physically ordered by any index (it’s in insertion order), but in InnoDB, the rows are physically ordered by the primary key. That means there can be page splits as rows are inserted between other rows — if there are too many rows to fit on a page, the page has to be split. MyISAM doesn’t have that problem, because rows don’t get stuffed between other rows (they are added at the end), so a secondary index’s leaf nodes always point directly to the row in the table. In fact, there’s no functional difference between primary and secondary keys in MyISAM. A MyISAM primary key is simply a unique index named “PRIMARY.”

Here’s a picture of the equivalent table structure, using the MyISAM engine. Notice how different it is from InnoDB! This is the same table, it’s just a different storage engine.

Why doesn’t InnoDB just “point” to the rows like MyISAM? If InnoDB used that strategy, it would have to rewrite all the secondary indexes at every page split, when the rows get moved to a different location on disk. To avoid that cost, InnoDB uses the values from the primary key as its secondary index’s leaf nodes. That makes the secondary indexes independent of the physical order of the primary key, but the “pointer” isn’t a pointer directly to the row as in MyISAM. It also means secondary index lookups are more expensive than primary key lookups, because any secondary index lookup only results in a tuple that can be used to navigate the primary key — double work. MyISAM doesn’t have that issue. Of course, it doesn’t have rows in index order, either; and the primary key might be “deeper.” It’s a trade-off.

Secondary index optimizations

So there’s a cost to secondary indexes in InnoDB. There’s an optimization too. Once a query navigates to the leaf node of a secondary index, it knows two things: the values it used to navigate the index, and the primary key values of that row in the table.

For example, suppose I have a table structured like this:

create table apples(
   variety varchar(10) primary key,
   note varchar(50),
   price int,
   key(price)
) engine=InnoDB;

insert into apples values
   ('gala', 'hello', 5),
   ('fuji', 'hello', 6),
   ('limbertwig', 'hello', 8),
   ('red delicious', 'hello', 3),
   ('pippin', 'hello', 8),
   ('granny smith', 'hello', 11),
   ('roma', 'hello', 6);

Note only the ‘gala’ row has a price of 5. Now suppose I issue the following query:

select variety from apples where price=5;

The query takes the value 5 and navigates the price index. When it gets to the leaf node, it finds the value ‘gala’, which it can use to navigate the primary key. But why does it need to do that? It already has the value it was looking for!

In fact, if the query only refers to values in the secondary and clustered index, it doesn’t need to leave the secondary index. If you like fancy lingo, the index “covers” the query, so it is a “covering index” or “index cover.”

This is a fantastic optimization. It means each secondary index is like another table, clustered index-first. In this example, the secondary index is like a table containing just price and variety, clustered in that order (refer again to the diagrams above).

In MyISAM, the “don’t leave the index” optimization can be used too, but only if the query refers only to values in the index itself, because MyISAM indexes don’t have any PK values at their leaf nodes. A MyISAM index can’t be used to find any additional data without following the pointer to the row itself. Again, it’s a trade-off.

How to know when the optimization is used

Theoretically, the optimization can be used anytime a query only uses values from the clustered index and a secondary index in InnoDB, or only uses values from the index itself in MyISAM. That doesn’t mean the query will use that index, though. For a variety of reasons, the query might use some other index. To find out for sure, EXPLAIN the query. If the Extra column includes the text “Using index,” the optimization is being used.

How to design indexes for this optimization

Once you understand how indexes work, you can make deliberate decisions about indexes. Here is a methodical approach to designing indexes.

Begin with a table and the data it needs, but without any indexes except those designed to constrain the data to valid values (primary and unique indexes). Next, consider the queries that are issued against the table. Is it queried ad-hoc, or do certain types of queries happen repeatedly? This is very important to know.

Before you start, consider the size of the table and how much it is used. You should put your optimization effort where it is most needed. If one 5-minute query runs once a day and you know it should be possible to optimize it to 5 seconds, that’s 4 minutes and 55 seconds saved. If another query issued every minute takes 5 seconds and you know it should be possible to run it in a few milliseconds, that’s about 7,000 seconds saved. You should optimize the second query first. You should also consider carefully-designed archiving jobs to get those tables as small as possible. Smaller tables are a huge optimization.

Now, back to the index design discussion. If the table is queried ad-hoc all the time, you need to create generally useful indexes. Most of the time you should examine the data to figure out what they should be. Pretend you’re optimizing the apples table above. This table probably does not need an index on the note column. Look at its contents — every row just says “hello.” Indexing that would be a total waste. Plus, it just seems reasonable that you want to look at the note, but not filter by it. On the other hand, it’s very reasonable that you’d want to find apples by price. The price index is probably a good choice.

On the other hand, if you know there’s a certain query that happens all the time and needs to be very fast, you should consider specially optimized indexes. Suppose these two queries each run 50 times a second:

select variety from apples where price = ?;
select note from apples where price = ?;

These queries deserve a close look. The optimization strategy will depend on the table size and the storage engine.

If you are using the InnoDB engine, the first query is already optimized as we’ve seen above. It will use the price index and not even look at the table itself. If you’re using the MyISAM engine, you need to consider how large the table is, and therefore how large an index on (price, variety) would be. If the table is very large, for example, if there are a bunch of large VARCHAR columns in it, that index might be significantly faster than all the bookmark lookups required to find the variety column for each row found in an index that only contains the price column.

The second query is trickier to optimize, because it really depends on how large the table is. If the table is very large, and has lots of other columns as I mentioned in the previous paragraph, it might make sense to create an index on (price, note). This is where careful testing is needed. I will explain how to do that testing in an upcoming article. It is non-trivial in MySQL, unfortunately.

The general strategy is as follows:

For InnoDB, put the columns in the WHERE clause first in the index, then add the columns named in the SELECT clause at the end, unless they are included in the primary key.
For MyISAM, put the columns in the WHERE clause first in the index, then add the columns named in the SELECT clause at the end.

How to write queries that don’t suck

I’ve noticed many people have a tendency to write SELECT * FROM... queries. If you don’t need all the columns, don’t select all the columns, because it can make the difference between a fast and a slow query. If you only select the columns you need, your query might be able to use one of the optimizations I’ve just explained. If you select every column and the query uses a secondary index, there’s no way to do that, and the query will have to wander around indexes finding the rows it needs, then do other operations to get the actual values from the rows.

Of course, if you only need a few columns, it can also be a lot less data not to select all the columns you don’t need. Getting that data off the disk and sending it to whatever asked for it is significant overhead. Don’t do it unless you need to.

Other InnoDB index design considerations

Since InnoDB secondary indexes already contain all columns from the primary key, there’s no need to add them to the secondary index unless the index needs them at the front of the index. In particular, adding an index on (price, variety) to the apples table above is completely redundant. And in tables where the primary key is several columns and it’s desirable to have the table “clustered two ways” by using the indexes as I’ve explained, not all of the columns need to be added to additional indexes. Indexes need to be designed very carefully to avoid causing a bunch of extra overhead. Every index adds a cost to the table, and it’s really important to avoid indexes that add cost but no benefit.

Suppose you added an index on (price, variety) to the apples table anyway. You might think the variety column can just be optimized out of the internal nodes, since the values are already at the leaf nodes. It can’t, because the primary key values are only at the leaf nodes, not in the internal nodes, and they can’t be optimized out of the internal nodes because they’re needed for navigating the index. Again, adding that column to the end of the index will just make the index larger, but result in the query knowing nothing it didn’t already know — and that’s useless.

I want to point out that it’s not always possible to design indexes so this optimization can be used! It is not necessarily a good design goal to make sure every query can be satisfied without leaving the indexes. In fact, it’s unrealistic. But in special cases, it may be possible and worth doing.

Another InnoDB optimization

Here’s another neat optimization: a tiny index might be used unexpectedly. For example,

create table something (
   id bigint not null auto_increment primary key,
   is_something tinyint not null,
   othercol_1 bigint not null,
   othercol_2 bigint not null,
   othercol_3 bigint not null,
   index(is_something)
);

is_something is a 1/0 indicator of whether something is true about the row. Normally I’d say an index on that is a waste of disk and CPU, because it’s not selective enough for the query optimizer to use it, assuming there’s an equal distribution of ones and zeroes. But the fact that it’s a very small value is important for some queries. For example, select sum(id) from something will scan the is_something index because it’s the smallest available. Its internal nodes only have one-byte tinyint values, and the leaf nodes have a tinyint and an 8-byte bigint. That’s much smaller than the clustered index, which has 8-byte values in the internal nodes, and 33 bytes at each leaf.

Proof of InnoDB’s automatic clustered index

I said every InnoDB table gets a 6-byte internal clustered index if it has no primary key. Here’s a neat way to see that in action. I created a table like so:

create table test(a int, b int, c int) engine=InnoDB;

insert into test values(1, 1, 1), (2, 2, 2);

I started a transaction and got an exclusive lock on it, then started another transaction on a different connection and tried to update that table:

-- connection 1:
set transaction isolation level serializable;
start transaction;
select * from test;

-- connection 2:
set transaction isolation level serializable;
start transaction;
update test set a = 5;

The query blocked and waited for a lock to be granted. Then I issued SHOW ENGINE INNODB STATUS on another connection. The transaction information shows the lock on the internally generated index:

---TRANSACTION 0 81411, ACTIVE 1410 sec, process no 8799, OS thread id 1141414240 starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1216
MySQL thread id 4, query id 194 localhost xaprb Updating
update test set a = 5
------- TRX HAS BEEN WAITING 9 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 131074 n bits 72 index `GEN_CLUST_INDEX` of table `test/test` trx id 0 81411 lock_mode X waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 6; compact format; info bits 0
 0: len 6; hex 000000018a02; asc       ;; 1: len 6; hex 000000013e0a; asc     > ;; 2: len 7; hex 80000000320110; asc     2  ;; 3: len 4; hex 80000001; asc     ;; 4: len 4; hex 80000001; asc     ;; 5: len 4; hex 80000001; asc     ;;

Notice the lock on the index called GEN_CLUST_INDEX. Notice also the number of fields (n_fields) in the lock struct: two more than the number of columns in the table. The first field in the index is the internally generated unique value, and it is 6 bytes as I said above.

If there is a primary key on a, it's a different story:

---TRANSACTION 0 81456, ACTIVE 17 sec, process no 8799, OS thread id 1141680480 starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1216
MySQL thread id 9, query id 277 localhost xaprb Updating
update test set a = 5 where a = 1
------- TRX HAS BEEN WAITING 6 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 131076 n bits 72 index `PRIMARY` of table `test/test` trx id 0 81456 lock_mode X locks rec but not gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 0
 0: len 4; hex 80000001; asc     ;; 1: len 6; hex 000000013e27; asc     >';; 2: len 7; hex 80000000320110; asc     2  ;; 3: len 4; hex 80000001; asc     ;; 4: len 4; hex 80000001; asc     ;;

Now the lock is on the index called PRIMARY, there are only 5 fields in the lock structure, and the first one is 4 bytes instead of 6. Fields with the value 2 have the hex value 80000001. When the primary key is a column, that field comes first in the lock structure.

These examples prove that InnoDB adds a "hidden column" to your tables when you don't create a primary key. Maybe I'm saying this too often, but you should always create a carefully designed primary key, because if you don't, you're throwing away one of the best things InnoDB gives you: a clustered index. Read my past articles for more on how to design an effective primary key.

Summary

The more you know about how indexes work, the more you can optimize your databases. Sometimes these optimizations don't help much, but sometimes they're huge. In this article I explained how InnoDB's primary and secondary indexes are different from other storage engines. Now that you understand the differences, you can understand the optimizations and trade-offs each storage engine has, and how to take advantage of the optimizations and avoid the drawbacks if possible. I showed you several side effects of the index design, such as a query scanning a secondary index instead of the table, and went into a bit of InnoDB internals to see how tables without primary keys work.

If this article helped you, you should consider subscribing via feeds or e-mail, because it's the best way to get my upcoming articles. I publish two or three times a week.

Further Reading:

Written by Xaprb

July 4th, 2006 at 9:41 pm

Posted in Uncategorized

51 Responses to 'How to exploit MySQL index optimizations'

Subscribe to comments with RSS

I have a table wherein I need only two columns one for `userid` and another for `productid`. I do not need any additional field. Also I need to add a UNIQUE index to `userid,`productid` combination. The table is InnoDB.

Should I add an additional column `id` and assign it as AUTO_INCREMENT PRIMARY KEY or leave it to the innodb engine to create its own id? Which will be a better option. I really have no use for any additional column, and I cannot create a primary index on any one of `userid` or `productid` since they’ll be repeating if considered individually.

Can I assign the combination of `userid`,`productid` as a PRIMARY KEY instead of UNIQUE?

Zeest

19 Jul 06 at 9:18 am
Zeest, it’s hard for me to say exactly without knowing more about your data — it sounds like a unique index on (userid, productid) is an unconventional design, but that’s just my gut feeling. In any case, if you really only have those two columns and they are the minimal candidate key, which it sounds like is the case, then that should be the primary key. You can create that like so:
```
alter table ... add primary key (userid, productid);
```
Or,
```
create table...
  userid ...
  productid ...
  primary key (userid, productid)
...
```
Xaprb

19 Jul 06 at 9:41 pm
Thanks that was really helpful.

Another question (I’m not sure if this fits here, please delete it if not): I’ve seen sometimes that multiple indexes are used for the same column, one a multiple column index another an individual, or something similar. Is it wise to have more than one index for a column. Could you, if possible, give an example of such a usage.

Zeest

20 Jul 06 at 9:49 am
That’s a great question. It depends on whether the index has a column leftmost or not. If a table with columns (a, b, c) has an index on columns (a, b) it doesn’t need an index on column (a) alone, because it is already the leftmost column in an existing index. MySQL can use the leftmost prefix of any index efficiently.

However, if you do a query with just column b in the WHERE clause, it cannot use the (a, b) index. In this case you may need to add an index on just (b).

It’s also good to look at SHOW CREATE TABLE output and check all the indexes. It’s actually possible to create multiple identical indexes on the same column, so checking this can be important.

I always recommend reading the MySQL manual as well. It has a great deal of information. My articles mostly bring out little details you can already find in the manual.

Xaprb

20 Jul 06 at 10:49 am
I want to know is it a good idea to index a column which will have numeric values ranging from 1 to 25.

Does it make any difference if it is 1 to 50.

Sukhwinder Singh

24 Aug 06 at 8:09 pm
It depends on the distribution of values and number of rows in the table. You should calculate the selectivity you expect for the index and compare the gain for that against the cost of the index. Jay Pipes has a good article on index selectivity.

Xaprb

24 Aug 06 at 8:18 pm
[...] Indexes: How to exploit MySQL index optimizations – Xaprb I recently had the same problem with a csv import. I used MySQLFront to get in ad found that the import had added a space in front of every string cell. [...]

Mysql Basics - Indexes, optomization, flushing, structure

25 Aug 06 at 5:53 pm
In your example of index (a,b) and possibly need another index on (b). Would it be better to have one index on (a) and one on (b) – I am thinking of MyISAM here.

I have a table with 3,000,000 rows where col a might be unique for 10-20 rows and b unique for 1-10,000. I have 50 inserts / updates per second and 400 reads per second. I am trying to work out whether the overhead of indexing (a,b) AND (a) is better than indexing on (a) and (b).

Talking of overhead, seems hard to find out what the overhead of indexes really are, must be significant otherwise you could just create indexes on everything…

cheers for the article

Dave

18 Oct 06 at 11:42 am
You can find the disk space used by indexes with SHOW TABLE STATUS. That’s not really the ‘overhead’ because maintenance is a big factor, but it gets you an idea of how big an index is.

MySQL can’t really use two indexes at once. There are certain types of indexes that can work this way — bitmap indexes for one, but they aren’t in MySQL. Peter at mysqlperformanceblog.com has written some good articles on how multi-column indexes are used or not used. There is room for improvement.

It really depends on your queries, but it is often better to put the most selective column leftmost in the index. Think of this rule of thumb to remember this: if you have a column where every value is unique, and a column where there is only one value for the whole table, and you want to index them (you woudn’t, but this is for demonstration), would you gain anything by putting the single-valued column leftmost? No. So always put the most selective column leftmost — in the general case. It depends on how the columns are used in your queries.

Xaprb

18 Oct 06 at 7:54 pm
Thanks for a great article. Question: How can I return the row id from an index file. For example, I have a primary key called PersalNo on a table. If I wanted to get the row number where PersalNo 12345 is in a table via API, how can I do that? Like is there a way of running a select query from the index file of a table. Is there code that I can use to read the MYI file.

Anele Mbanga

16 Dec 06 at 6:11 am
The row ID is hidden from SQL, unlike in Oracle. There’s no way to get it, as far as I know.

Xaprb

18 Dec 06 at 3:40 pm
Just in case I don’t know all the fields of the query, I need them all but one (for example, when one is what I already know), is there a way to tell SQL not to send me it?

For example, in place of sending me the name, userid, picture, address and timestamp fields (what would be sent if I had used *) send me all, but the userid, what I already know.

Henrique Vicente

9 Feb 07 at 10:45 pm
No, there’s no way to say “all but X columns.” The only wildcard is * and there’s no negation of that (at least not as far as I know).

Xaprb

10 Feb 07 at 11:27 am
We have a database table of 30 million names and other fields that will be used about 200 times a day and has poor performance. I say, it will, because we imported the data from several Access databases and have not yet implemented the MySQL version on production. I’ve added an index on the name column, but the queries still take around 1.5 minutes or more to complete for common names. For example a search for ‘Smith, J%’ can take a really long time, because there are 44,000 possible Smith, J% records. Unfortunately, the names must be searched as “lastname, first initial”. I’ve used EXPLAIN and it reveals that it is using the index I created on the name column. I also used the PROCEDURE ANALYSE() on the table and found the name column can be converted from a Varchar(30) to a Char(30). Other than that, I am not sure how to better optimize the data so that these queries will not take so long. The storage engine is InnoDB, but it probably doesn’t need to be since updates only occur once a month. Should I switch to MyISAM? Should I change the name column to a Char(30)? Should I split the name column into last and first name columns and re-index with a multi-column index on last name and first name? Any help is appreciated.

Ryan

17 Feb 07 at 12:01 pm
Hi Ryan, and thanks for writing in. My first thought is yes, you should probably normalize the name column into first and last names. Doing so will shorten the index on the lastname column, so more of it will fit in memory at once. This alone might help your slow queries. You will probably not benefit from splitting the column and then making a combined index on it if you’re doing range queries, but if you’re looking only for one entry, it may help.

Changing to a fixed-length field is generally not beneficial for something like this, especially in InnoDB. Are there other variable-length columns in your table? If so, your change will have no effect — read the MySQL manual section on silent column specification changes.

You may see some improvement with another storage engine, but be aware that there are trade-offs. MyISAM will be more space-efficient and offers other kinds of indexes should you need them (FULLTEXT in particular), and has very good performance, but InnoDB is much more durable and resistant to corruption.

One other suggestion I have is to see how selective your index is. Obviously it’s not very selective for Smith and other common names, but if you are always querying with a column prefix like ‘something%’, you may benefit from indexing only a prefix. Run some substring queries to calculate the selectivity of (for example) a 5-character prefix of the column. I would bet that even if you don’t normalize the column into first and last names, you will be able to get most of the selectivity of a full-column index with a fraction of the characters. If 5 or 8, or even 15 characters gets you 95% of the selectivity, that’s probably a significant win.

Xaprb

18 Feb 07 at 3:32 pm
I have a census of persons (about 7 million) table A, and in table B I have a list of persons (who have been excluded from A) who have a particular disease. (Those in A do not have the condition)

I want to find a selected number of controls (records in A) that closely match each person/record in B. However, I cannot seem to coax MySQL to use indexes with constructs such as

A.age between B.Age-5 and B.AGE+5.

I have the data in a MyISam engine, and age in both A and B are indexed.

My objective at this stage is to identify the subset of A that “match” to B, other algorithms help me choose the “best” control amongst that subset.

Suggestions?

Rob

25 Feb 07 at 4:44 pm
MySQL tends to be very poor about loose index scans or repeated range scans on a non-prefix column in an index. It may also be evaluating the statistics and deciding it’s cheaper to do a full scan. Even though you think it’s not, you might be wrong (you might be right too though).

Peter Zaitsev has written some good articles on optimizing loose index scans (there are others besides the one I just linked — look around the site a bit). Some of those techniques may help.

Xaprb

26 Feb 07 at 1:29 pm
Would you please tell me what INNODB and MyISAM engines are good at? And, in what situation should we use INNODB or MyISAM?

Wei

3 Apr 07 at 9:11 pm
Wei, that’s a common question. It’s a bit off-topic for this article, so I don’t want to go into it too much. People commonly use MyISAM for read-only or low-concurrency and InnoDB for high-concurrency, transactional data. There is great advice in the MySQL Manual on this topic. You might also look into getting a copy of Pro MySQL. I think it has a great discussion of the storage engines.

Xaprb

4 Apr 07 at 9:36 am
Thank you for the article. I have a MyISAM table with columns (a, b, c, d), where d is a forign key which refers to another table. It is obvious for me to create an index for d as I am using it several times for comparison in my queries. Meanwhile, I notice that columns (d, b) in combination is unique. So at first I created a “unique” index i.e UNIQUE KEY `unique` (`d`,`b`).
My question is:
Does the fact that MySQL can search on a prefix of an index be appropriate for “unique” index? (I mean the fact “if we have index for (x, y), we dont need an index on(x)”).
or
Do I need to create seperate index for column d?

Apart from this, In one of your previous comments you mentioned that People commonly use MyISAM for read-only or low-concurrency. By the term “concurrency”, did you mean concurrent reading or concurrent writing?. I am asking because the reason I choose MyISAM is that my table is read-only, but I am afraid that the reading may be concurrent.
(Iâ€™m not sure if the second part of my question is appropriate here, please delete it if not)
Thanks in advance.

Ela

2 May 07 at 11:24 am
Hi Ela,

If you have queries that refer to column d without also referring to column c, the index in (c,d) will not be useful. Use EXPLAIN to see when an index is a candidate.

MyISAM has great read concurrency, and in some circumstances can even have good INSERT concurrency, but for the most part if you’re doing a lot of writes the reads and writes will conflict.

Xaprb

2 May 07 at 5:42 pm
Say I have a ‘user’ table with ‘uid int primary key’, and I want to write query the uid. MySQL will allow me to write the query as “select * from user where uid = 1″ or like “select * from user where uid = ’1′”. Normally, a database would give an error if you used single quotes on an int, but mysql allows you do do both.

My question is, does using single quotes to search an int field make any difference to the speed of the query in MySQL?

Craig Kohtz

30 May 07 at 3:40 pm
I always have to think about exactly how this works, but yes, there are cases where indexes will be defeated. The reason is MySQL sees there might be several strings that could convert to the same integer, if for example there’s a space-padding that would be thrown away on conversion to integer.

The easy test is to simply run EXPLAIN.

Xaprb

30 May 07 at 7:24 pm
[...] researching the optimum way to design large tables for InnoDB, I found an article entitled How to Exploit MySQL Index Optimizations that provides some good information about the differences between MyISAM and InnoDB index [...]

Marcus Schwartz » Blog Archive » Optimizing MySQL Indexes

24 Oct 07 at 5:20 am
[...] MySQLä¸InnoDB ç´¢å¼•ä¼˜åŒ– [...]

å¯»æ¢¦å®¶å› » Blog Archive » MySQLæ€§èƒ½è°ƒä¼˜ç›¸å…³

4 Dec 07 at 2:28 am
Have you experimented at all with solid state drives to see how they perform with the different engines? I feel like solid state changes the game completely, concurrency is no longer anywhere near as much a problem at the disk level. I know it is off-topic, but really it’s all about performance!

dumbfounder

15 Dec 07 at 6:37 pm
No, I haven’t played with SSD at all. They do have a completely different performance profile. Their main advantage, from what I’ve read, is very fast random reads and writes. Sequential reads and writes don’t seem to do that well. It’s young technology though.

Xaprb

16 Dec 07 at 11:32 pm
There’s a lot of good information in this discussion. One area that I’ve not found much good advice on is indexing bit fields. I know MySQL only officially added them recently, but prior to that an unsigned numberic field worked OK. So, what is the best way to index these so you can use bitwise comparisons in your queries?

Kipp

12 Feb 08 at 8:26 pm
MySQL doesn’t have an index type that’ll support such comparisons, and I really don’t like its built-in BIT data type either. I’d use an integer.

Xaprb

12 Feb 08 at 9:27 pm
thanks.. very ncie article

justin

18 Mar 08 at 10:02 pm
Great article. I’ve got 15GB product tables that i’m trying to get optimized and this helps a lot.

drew

25 Apr 08 at 3:54 pm
very good explaination..thanks for the great article..

varun

17 Jun 08 at 2:14 am
Very nice article!

It would be nice to know more about ORDER BY index use, I’m having lots of problems with those… no matter which indexes I create the explain shows “using filesort” :(
btw here is the query:
EXPLAIN SELECT profile_id FROM profile_global_tags AS pgt WHERE ((pgt.global_tag_type=”employer” && pgt.global_tag_id IN (750255,991733,2181124,742274,889796,198047,896035,710892,764629,795746,3043532,1250050,2869214,369104,2528996,1866080,975015,535303,1666170,1666169,114814,1200717,756304,830154,3043538,3125029,3239698,1539572,2450512,2788184,889795,420389,1824454,2146155,1670613,384985,2134882,2465310,319977,3348763,3019010,979572,233330,1653540,341189,247119,2670802,1208688,2899954,1826522,1994971,3354839,2713518,375313,23277,11410)) || (pgt.global_tag_type=”school” && pgt.global_tag_id IN (20907,43090,39603,25783,39640,419561,41449,433074,392717,407533,111947,112862,154387,239891,418555,292933,381917,417209,407532,410035,387586,334701,384237,423499,387527,312048,453454,346002)) || (pgt.global_tag_type=”tag” && pgt.global_tag_id IN (48523))) ORDER BY pgt.links DESC LIMIT 0,20;

and explain:
—- ————- ——- ——- ———————– ———————– ——— —— ——– ——————————————
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
—- ————- ——- ——- ———————– ———————– ——— —— ——– ——————————————
| 1 | SIMPLE | pgt | range | tag_type_link_profile | tag_type_link_profile | 6 | NULL | 357433 | Using where; Using index; Using filesort |
—- ————- ——- ——- ———————– ———————– ——— —— ——– ——————————————

Pavel

14 Aug 08 at 3:44 pm
I’d like to use your images of the MyISAM and InnoDB indexes for a presentation and article I’m doing for my company. This would be posted publicly, and is to be used as an introduction to MySQL indexing for our junior developers. May I have permission to use those if I link back to this post from my article and presentation?

I could easily do my own, but yours are so darned good and I was going to link back to this article anyway.

Ryan

11 Oct 08 at 8:55 am
Ryan, go ahead. Enjoy.

Xaprb

11 Oct 08 at 10:54 am
[...] How to exploit MySQL index optimizations [...]

Southwestern Virginia Developers Forum » MySQL Indexes

11 Oct 08 at 1:15 pm
Ryan, I’d post a comment on your entry, but it wants me to log in. See http://www.xaprb.com/blog/2008/10/11/three-steps-to-stopping-blog-comment-spam/ for my rant about this ;-)

Couple of notes: I have a presentation about EXPLAIN, which I think is much better than the MySQL manual: http://www.percona.com/presentations.html

Also, there’s a lot of detail in our book, http://www.amazon.com/dp/0596101716?tag=xaprb-20

I’d post both of these on your blog, but I don’t want to sign up for an account.

Xaprb

11 Oct 08 at 1:23 pm
Awesome! Thanks for the info. Sorry about making you log in. I don’t actually control the blog admin.

Ryan

11 Oct 08 at 1:25 pm
So I have a huge table (3M Records) that people often search for partial fname and partial lname. What would be the best practice for InnoDB as it seems slower than it was on MyISAM with an index on lname,fname. I tried doing a single index on each…but that was even slower.

Mark

9 Nov 09 at 3:41 pm
3M Records is not huge, our DB has 240Million records in 1 table, but that’s besides the point.

We are using sphinx… outperforms mysql by a huge number!

Pavel

9 Nov 09 at 4:04 pm
is there any utility, means, reasons to use multiple single indexed on the same colum ?

by example, this table :

DROP TABLE IF EXISTS `SPACES`;
CREATE TABLE `SPACES` (
`SPACEID` bigint(20) NOT NULL default ’0′,
`SPACENAME` varchar(255) default NULL,
`SPACEKEY` varchar(255) NOT NULL default ”,
`SPACEDESCID` bigint(20) default NULL,
`HOMEPAGE` bigint(20) default NULL,
`CREATOR` varchar(255) default NULL,
`CREATIONDATE` datetime default NULL,
`LASTMODIFIER` varchar(255) default NULL,
`LASTMODDATE` datetime default NULL,
PRIMARY KEY (`SPACEID`),
UNIQUE KEY `SPACEKEY` (`SPACEKEY`),
KEY `HOMEPAGE` (`HOMEPAGE`),
KEY `SPACEDESCID` (`SPACEDESCID`),
KEY `HOMEPAGE_2` (`HOMEPAGE`),
KEY `SPACEDESCID_2` (`SPACEDESCID`),
KEY `HOMEPAGE_3` (`HOMEPAGE`),
KEY `SPACEDESCID_3` (`SPACEDESCID`),
KEY `HOMEPAGE_4` (`HOMEPAGE`),
KEY `SPACEDESCID_4` (`SPACEDESCID`),
KEY `HOMEPAGE_5` (`HOMEPAGE`),
KEY `SPACEDESCID_5` (`SPACEDESCID`),
KEY `HOMEPAGE_6` (`HOMEPAGE`),
KEY `SPACEDESCID_6` (`SPACEDESCID`)
) TYPE=MyISAM;

what is the mean to use all the SPACEDESCID_x index ?
is it an error or not ?

Michael REMY

11 Jan 10 at 11:46 am
That’s an error. Take a look at mk-duplicate-key-checker.

Xaprb

11 Jan 10 at 12:48 pm
[...] http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/ [...]

How to Use Indexes In MySQL - MILOL Blog

30 May 10 at 10:38 am
“MyISAM has no clustered index, so the data isn’t physically ordered by any index (it’s in insertion order), but in InnoDB, the rows are physically ordered by the primary key. That means there can be page splits as rows are inserted between other rows — if there are too many rows to fit on a page, the page has to be split. “

my question is:
the page size is 16k,
if a leaf page is full,i still insert a data row with same key,it will allot a new page?
can you draw a diagram for me?

arlen

2 Jul 10 at 3:42 am
Useful article! Just a few questions:

1. Is there any limitation on number of indexes per table? any performance issue?

2. Assuming I have the ‘a’ field indexed in myTable, have I really optimized the table for the following common query?

select * from myTable where b=’something’ order by a

Thanks

Milad

27 Jul 10 at 4:07 pm
excellent post and very nice explanation !

well done!
thanks

Artur Ejsmont

15 Nov 10 at 1:23 pm
Nice Help.. Thanks

Munnabhai [Dhanashree Inc]

18 Nov 10 at 8:58 am
> Since InnoDB secondary indexes already contain all columns from the primary key,
> there’s no need to add them to the secondary index unless the index needs them at the front of the index.
> In particular, adding an index on (price, variety) to the apples table above is completely redundant.

I don’t think so. (I know, this article have been written long time ago.)

I created table like below.
—————————————————————————————-
CREATE TABLE tab_lsu (
fdpk1 char(10) NOT NULL,
fdpk2 char(10) NOT NULL,
fd3 char(10) DEFAULT NULL,
fd4 char(10) DEFAULT NULL,
PRIMARY KEY (fdpk1,fdpk2),
KEY ix_fd3_fdpk1_fdpk2 (fd3,fdpk1,fdpk2),
KEY ix_fdpk1_fd3 (fdpk1,fd3),
KEY ix_fd3 (fd3)
) ENGINE=InnoDB DEFAULT ROW_FORMAT=COMPACT — // (and also test REDUNDANT ROW_FORMAT)
—————————————————————————————-

InnoDB’s table monitor print out like this
—————————————————————————————-
INDEX: name ix_fd3_fdpk1_fdpk2, id 0 978, fields 3/3, uniq 3, type 0
root page 4, appr.key vals 16030, leaf pages 39, size pages 97
FIELDS:
fd3
fdpk1
fdpk2

INDEX: name ix_fdpk1_fd3, id 0 979, fields 2/3, uniq 3, type 0
root page 5, appr.key vals 15991, leaf pages 39, size pages 97
FIELDS:
fdpk1
fd3
fdpk2

INDEX: name ix_fd3, id 0 980, fields 1/3, uniq 3, type 0
root page 6, appr.key vals 15991, leaf pages 39, size pages 97
FIELDS:
fd3
fdpk1
fdpk2
—————————————————————————————-

And I dumped these 3 index’s pages. (all 3 index have only one index page)
(But index’s page contents never changed when index pages is greater than 2)

—————————————————————————————-
ix_fd3_fdpk1_fdpk2

[root@toto:~/temp/innodb-recovery-0.3/result/pages-1295763380/0-978]$ hexdump -v -C 4-00000004.page
00000000 0d 19 f0 3c 00 00 00 04 ff ff ff ff ff ff ff ff |..<….|
00000010 00 00 00 03 aa b4 02 15 45 bf 00 00 00 00 00 00 |….E…|
00000020 00 00 00 00 01 ef 00 0a 07 05 80 29 00 00 00 00 |……….)….|
00000030 06 e3 00 02 00 26 00 27 00 00 00 00 00 00 00 00 |….&.'……..|
00000040 00 01 00 00 00 00 00 00 03 d2 00 00 01 ef 00 00 |…………..|
00000050 00 02 02 72 00 00 01 ef 00 00 00 02 01 b2 01 00 |…r……..

00000060 02 00 1e 69 6e 66 69 6d 75 6d 00 08 00 0b 00 00 |…infimum……|
00000070 73 75 70 72 65 6d 75 6d 0a 0a 0a 00 10 00 11 00 |supremum……..|
00000080 2b 46 44 33 23 23 23 23 23 23 23 50 4b 31 23 23 |+FD3#######PK1##|
00000090 23 23 23 23 23 50 4b 32 23 23 23 23 23 23 23 00 |#####PK2#######.|
000000a0 00 00 0a 0a 0a 0a 00 00 00 19 00 2b 46 44 33 30 |………..+FD30|
000000b0 30 30 30 32 30 35 50 4b 31 30 30 30 30 32 30 35 |000205PK10000205|
000000c0 50 4b 32 30 30 30 30 32 30 35 00 00 00 0b 0a 0a |PK20000205……|
000000d0 0a 00 00 00 21 00 2b 46 44 33 30 30 30 30 36 31 |….!.+FD3000061|
000000e0 36 50 4b 31 30 30 30 30 36 31 36 50 4b 32 30 30 |6PK10000616PK200|
000000f0 30 30 36 31 36 00 00 00 11 0a 0a 0a 00 04 00 29 |00616……….)|
00000100 00 2b 46 44 33 30 30 30 31 30 32 37 50 4b 31 30 |.+FD30001027PK10|
00000110 30 30 31 30 32 37 50 4b 32 30 30 30 31 30 32 37 |001027PK20001027|
00000120 00 00 00 15 0a 0a 0a 00 00 00 31 00 2b 46 44 33 |……….1.+FD3|
00000130 30 30 30 31 34 33 38 50 4b 31 30 30 30 31 34 33 |0001438PK1000143|
00000140 38 50 4b 32 30 30 30 31 34 33 38 00 00 00 1a 0a |8PK20001438…..|
—————————————————————————————-

—————————————————————————————-
ix_fdpk1_fd3

[root@toto:~/temp/innodb-recovery-0.3/result/pages-1295763380/0-979]$ hexdump -v -C 5-00000005.page
00000000 0b f0 fa 66 00 00 00 05 ff ff ff ff ff ff ff ff |….|
00000010 00 00 00 03 aa b4 03 01 45 bf 00 00 00 00 00 00 |….E…|
00000020 00 00 00 00 01 ef 00 0a 07 05 80 29 00 00 00 00 |……….)….|
00000030 06 e3 00 02 00 26 00 27 00 00 00 00 00 00 00 00 |….&.'……..|
00000040 00 01 00 00 00 00 00 00 03 d3 00 00 01 ef 00 00 |…………..|
00000050 00 02 03 f2 00 00 01 ef 00 00 00 02 03 32 01 00 |………..2..|
00000060 02 00 1e 69 6e 66 69 6d 75 6d 00 08 00 0b 00 00 |…infimum……|
00000070 73 75 70 72 65 6d 75 6d 0a 0a 0a 00 10 00 11 00 |supremum……..|
00000080 2b 50 4b 31 23 23 23 23 23 23 23 46 44 33 23 23 |+PK1#######FD3##|
00000090 23 23 23 23 23 50 4b 32 23 23 23 23 23 23 23 00 |#####PK2#######.|
000000a0 00 00 0c 0a 0a 0a 00 00 00 19 00 2b 50 4b 31 30 |………..+PK10|
000000b0 30 30 30 32 30 35 46 44 33 30 30 30 30 32 30 35 |000205FD30000205|
000000c0 50 4b 32 30 30 30 30 32 30 35 00 00 00 0d 0a 0a |PK20000205……|
000000d0 0a 00 00 00 21 00 2b 50 4b 31 30 30 30 30 36 31 |….!.+PK1000061|
000000e0 36 46 44 33 30 30 30 30 36 31 36 50 4b 32 30 30 |6FD30000616PK200|
000000f0 30 30 36 31 36 00 00 00 12 0a 0a 0a 00 04 00 29 |00616……….)|
00000100 00 2b 50 4b 31 30 30 30 31 30 32 37 46 44 33 30 |.+PK10001027FD30|
00000110 30 30 31 30 32 37 50 4b 32 30 30 30 31 30 32 37 |001027PK20001027|
00000120 00 00 00 16 0a 0a 0a 00 00 00 31 00 2b 50 4b 31 |……….1.+PK1|
00000130 30 30 30 31 34 33 38 46 44 33 30 30 30 31 34 33 |0001438FD3000143|
00000140 38 50 4b 32 30 30 30 31 34 33 38 00 00 00 1b 0a |8PK20001438…..|
—————————————————————————————-

—————————————————————————————-
ix_fd3
[root@toto:~/temp/innodb-recovery-0.3/result/pages-1295763380/0-980]$ hexdump -v -C 6-00000006.page
00000000 1d 9d 1b 81 00 00 00 06 ff ff ff ff ff ff ff ff |……..|
00000010 00 00 00 03 aa b4 03 ed 45 bf 00 00 00 00 00 00 |….E…|
00000020 00 00 00 00 01 ef 00 0a 07 05 80 29 00 00 00 00 |……….)….|
00000030 06 e3 00 02 00 26 00 27 00 00 00 00 00 00 00 00 |….&.'……..|
00000040 00 01 00 00 00 00 00 00 03 d4 00 00 01 ef 00 00 |…………..|
00000050 00 02 05 72 00 00 01 ef 00 00 00 02 04 b2 01 00 |…r……..
00000060 02 00 1e 69 6e 66 69 6d 75 6d 00 08 00 0b 00 00 |…infimum……|
00000070 73 75 70 72 65 6d 75 6d 0a 0a 0a 00 10 00 11 00 |supremum……..|
00000080 2b 46 44 33 23 23 23 23 23 23 23 50 4b 31 23 23 |+FD3#######PK1##|
00000090 23 23 23 23 23 50 4b 32 23 23 23 23 23 23 23 00 |#####PK2#######.|
000000a0 00 00 0e 0a 0a 0a 00 00 00 19 00 2b 46 44 33 30 |………..+FD30|
000000b0 30 30 30 32 30 35 50 4b 31 30 30 30 30 32 30 35 |000205PK10000205|
000000c0 50 4b 32 30 30 30 30 32 30 35 00 00 00 0f 0a 0a |PK20000205……|
000000d0 0a 00 00 00 21 00 2b 46 44 33 30 30 30 30 36 31 |….!.+FD3000061|
000000e0 36 50 4b 31 30 30 30 30 36 31 36 50 4b 32 30 30 |6PK10000616PK200|
000000f0 30 30 36 31 36 00 00 00 13 0a 0a 0a 00 04 00 29 |00616……….)|
00000100 00 2b 46 44 33 30 30 30 31 30 32 37 50 4b 31 30 |.+FD30001027PK10|
00000110 30 30 31 30 32 37 50 4b 32 30 30 30 31 30 32 37 |001027PK20001027|
00000120 00 00 00 17 0a 0a 0a 00 00 00 31 00 2b 46 44 33 |……….1.+FD3|
00000130 30 30 30 31 34 33 38 50 4b 31 30 30 30 31 34 33 |0001438PK1000143|
00000140 38 50 4b 32 30 30 30 31 34 33 38 00 00 00 1c 0a |8PK20001438…..|
00000150 0a 0a 00 00 00 39 00 2b 46 44 33 30 30 30 31 38 |…..9.+FD300018|
00000160 34 39 50 4b 31 30 30 30 31 38 34 39 50 4b 32 30 |49PK10001849PK20|
00000170 30 30 31 38 34 39 00 00 00 21 0a 0a 0a 00 00 00 |001849…!……|
—————————————————————————————-

And all dumped result are same whether row_format is REDUNDANT or COMPACT ( have not test baracuda file format yet.)

So, I think,
"INDEX (not_pk_column + pk_column1 + pk_column2)" and "INDEX (not_pk_column)" index have same internal structure.
the only different thing is second index have only 1 user defined index column, but first case have 3 user defined index column.

And "INDEX (pk_column1 + not_pk_column + pk_column2)" and "INDEX (pk_column1 + not_pk_column)" are same.

Finally (If this is right),
Index columns are same whether I defined or not trailing primary key columns, I think first case is better than second case.
In Second case, other people cannot notice that the creator really expected appended primary key for sort or group by or else.
For example,
If table's index is "INDEX (fd1)", you can't tell the creator of index want "INDEX (fd1 + fd_pk)" or "INDEX(fd1)".
But, in this case the index is created with "INDEX (fd1 + fd_pk)", then we can tell this index's Primary key column is also used
for searching or grouping or sorting…

What do you think about this. ?

toto

23 Jan 11 at 8:59 pm
Well, First of all I have to say thanks for such a nice article.

But I have small questions to clarify here with you. I hope this thread is still alive, and author is there to listen (hopefully).

When you say -
“For MyISAM, put the columns in the WHERE clause first in the index, then add the columns named in the SELECT clause at the end.”

say, I have a table(tableA) with 50 columns and couple of them have index’s (single/individually). But selection is based on something like this
1. select tableA.col1, tableA.col2,tableA.col3, tableA.col4, tableA.col5,tableA,col6, tableA,col7, tableA.col8, tableA.col9, tableA.col10 from tableA where col15= AND col16=

In this scenario, and after your suggestion, I understand I have to create index on (col15, col16, col1,col2,col3…col10) Correct?

But this seems to be huge index? My doubt, how true is your statement for huge tables (with more columns, specially?)

Could you please clarify my doubt please?

Cheers in advance,

Shashidhar

Shashidhar

1 Aug 11 at 3:08 am
Shashidhar, this is an example of when the query might touch so many columns that you need to use a trick such as the type that Peter Zaitsev calls a “delayed join”, which you can see here: http://www.mysqlperformanceblog.com/2007/04/06/using-delayed-join-to-optimize-count-and-limit-queries/

You might also be interested in this book: http://www.xaprb.com/blog/2010/09/19/a-review-of-relational-database-design-and-the-optimizers-by-lahdenmaki-and-leach/

Xaprb

1 Aug 11 at 4:31 pm
I can’t recall the data model, but I have encountered a situation where MySQL would not use an index prefix and a separate index on just the first column produced a vastly different (and superior) query plan.

As I said, I don’t really recall anything about the data, but I’m curious to know if anyone else has experienced this in practice.

Michael Mior

7 Nov 11 at 10:32 am