What is the more efficient (in terms of query performance) database table design - long or wide?
I.e., this
id size price
1 S 12.4
1 M 23.1
1 L 33.3
2 S 3.3
2 M 5.3
2 L 11.0
versus this
id S M L
1 12.4 23.1 33.3
2 3.3 5.3 11.0
Generally (I reckon) it comes down to the comparison of performance between GROUP BY and selecting the columns directly:
SELECT AVG(price) FROM table GROUP BY size
or
SELECT AVG(S), AVG(M), AVG(L) FROM table
Second one is a bit longer to write (in terms of many columns), but what about the performance of the two? If possible, what are the general advantages/disadvantages of each of these tables formats?
解决方案
First of all, these are two different data models suitable for different purposes.
That being said, I'd expect1 the second model will be faster for aggregation, simply because the data is packed more compactly, therefore needing less I/O:
The GROUP BY in the first model can be satisfied by a full scan on the index {size, price}. The alternative to index is too slow when the data is too large to fit in RAM.
The query in the second model can be satisfied by a full table scan. No index needed2.
Since the first approach requires table + index and the second one just the table, the cache utilization is better in the second case. Even if we disregard caching and compare the index (without table) in the first model with the table in the second model, I suspect the index will be larger than the table, simply because it physically records the size and has unused "holes" typical for B-Trees (though the same is true for the table if it is clustered).
And finally, the second model does not have the index maintenance overhead, which could impact the INSERT/UPDATE/DELETE performance.
Other than that, you can consider caching the SUM and COUNT in a separate table containing just one row. Update both the SUM and COUNT via triggers whenever a row is inserted, updated or deleted in the main table. You can then easily get the current AVG, simply by dividing SUM and COUNT.
1 But you should really measure on representative amounts of data to be sure.
2 Since there is no WHERE clause in your query, all rows will be scanned. Indexes are only useful for getting a relatively small subset of table's rows (and sometimes for index-only scans). As a rough rule of thumb, if more than 10% of rows in the table are needed, indexes won't help and the DBMS will often opt for a full table scan even when indexes are available.