The Cost of GUIDs as Primary Keys
In this article, Jimmy Nilsson presents the pros and cons of using globally unique identifiers (GUIDs) as the datatype for primary keys in SQL Server 2000. In doing so, he shows you test results that hint of performance characteristics and introduces you to a special type of GUID that he invented, called COMBs, that solves what otherwise might give you a big throughput problem.
In this article, I will be discussing the pros and cons of using globally unique identifiers (GUIDs) as the datatype for primary keys in SQL Server 2000. You will see concrete test results that give a hint of the performance characteristics. Along the way, I will also discuss a special type of GUID that I "invented," called COMBs. Before we start discussing GUIDs, though, I'd like to say a few words about natural and surrogate keys.
Natural or Surrogate Keys
When you do the physical design of a relational database, it's very important to decide upon which style to use for the primary keys. Some people prefer to use natural keys—that is, keys that are found in the domain that the database models. Others prefer to use surrogate keys, which are constructed keys with no other purpose than to be just keys (and which are not found in the domain). An example of a natural key is a Social Security number. A value incrementing by 1 for each row is a typical example of a surrogate key.
Using natural keys is the traditional approach, in line with Codd's original relational model. When you use them, you have only natural data that means something to users. This is good if users will ask ad hoc queries directly to the database in raw SQL. You can also often reduce the numbers of joins when using natural keys because you don't have to go to a lookup table to convert an ID to a description. One more advantage is that you get the minimum number of constraints because you don't have to protect the uniqueness of the natural keys separately. You already did this when you used them as primary keys.
Surrogate keys can be seen as a newer approach. This approach does not conflict with the relational model, but, in a way, it is a step closer to a more object-based approach in which each object has an ID and the structure of all IDs is of the same type. When you use surrogate keys, you often get smaller foreign keys, which reduces the size of the database. There is no risk of users changing the values of the primary keys, and the programming can be more consistent because all keys are of the same format.
With cascading updates/deletes in SQL Server 2000, the problem of users changing the values of primary keys is not so great anymore because you don't have to program the UPDATE of dependent rows manually.
That was a brief description of the different kinds of keys. Now let's assume that we choose to use surrogate keys when we design a new database.
A New Database
We are about to build a new database for order processing. A greatly simplified version of the database will look as shown in Figure 1.
Figure 1 Database schema for a simplified database for order processing.
As you can see in Figure 1, ordermaster.id and orderdetail.id are primary keys. There are also several foreign keys: namely, ordermaster.customer_id, orderdetail.ordermaster_id, and orderdetail.product_id.
In this simplified model, I skipped adding the product and the customer tables, so ordermaster.customer_id and orderdetail.product_id aren't implemented as physical foreign keys.
We should take into consideration all of these primary and foreign keys when deciding on the datatype to use.
Which Datatype Should Be Used for Surrogate Keys?
I'm going to focus on two different datatypes for surrogate keys, INTEGER and GUID. Let's start with INTEGER.
In the specific example shown in Figure 1, the users need a column in the ordermaster table to identify each order uniquely. The most correct solution to this is add an orderno column, which is given a value separately from the ID column. In the case of INTEGER+IDENTITY for the primary key, it would be fairly common to skip the orderno column, but I prefer not to let the user know the real key value.
To avoid affecting the test too much, I decided to skip the user-known column here. Otherwise, I would have had to maintain the orderno value manually for when the IDENTITY has already been used for the ID. When GUID is used for the ID, I could have let IDENTITY handle the orderno column. I did not do that, though, to avoid testing too many apples and oranges (or apples and pears, as we say in Sweden ).
INTEGER as the Datatype
Even though we have decided to use surrogate keys, we are not yet finished—we still have to choose a specific datatype. The most typical one to use is INTEGER. An INTEGER is 4 bytes. You can declare INTEGER columns as IDENTITY and thereby let the database provide new row values automatically. Listing 1 illustrates an example of a new row being INSERTed into the ordermaster table and the new ID being fetched for later use when INSERTing into the orderdetail table.
Listing 1: Stored Procedure for Inserting a Row into ordermaster, INT Version
CREATE PROCEDURE a_OrderINT_InsertMaster
(@customerId INT, @orderId INT OUTPUT)
DECLARE @anError INT
INSERT INTO orderINTmaster
(customer_id, orderdate, status)
(@customerId, GETDATE(), 1)
SET @anError = @@ERROR
SET @orderId = @@IDENTITY
Before we continue, I'd like to say a couple of things about the code in Listing 1 and the other code in this article:
- The error handling is not of production quality, but I decided to do it this way to avoid distracting you from the topic of the article and to get as clear test results as possible. You will find more information about appropriate error handling in my book, .NET Enterprise Design with Visual Basic .NET and SQL Server 2000.
- I am very fond of using user-defined data types, but I skipped them in the code snippet here, to make the code as obvious and clear as possible.
- In real code, you obviously must use a lot of comments! For example, I always start all stored procedures with a comment block, in which I state the purpose, tell who and when, and also document.
- You can download all the code for this article at http://www.jnsk.se/informit/pkguid/.
You can let the application server decide what the next value should be, but then you would have to be careful when INTEGER is used as the datatype for the primary keys that you don't get too much chatter between the application server and the database server. Other risks are creating a hot spot, which reduces concurrency, and creating a solution that makes it impossible to clone the application server.
A rather long and interesting discussion addresses different solutions for "manually" creating surrogate INTEGERs, but that is beyond the scope of this article.
UNIQUEIDENTIFIER as the Datatype
A GUID is called UNIQUEIDENTIFIER in SQL Server. It has been supported as a native datatype in SQL Server since version 7. A GUID is 16 bytes long (four times the size of an INTEGER).
In Listing 2, you can see an example of a new row being INSERTed into the ordermaster table when GUIDs are used for the keys. In this case, the ID is generated before the INSERT statement, with the help of NEWID().
For reasons of simplicity, there are two sets of tables and stored procedures: one set with "INT" in the name and one set with "GUID" in the name.
Listing 2: Stored Procedure for Inserting a Row into ordermaster, GUID Version
CREATE PROCEDURE a_OrderGUID_InsertMaster
, @orderId UNIQUEIDENTIFIER OUTPUT)
DECLARE @anError INT
SET @orderId = NEWID()
INSERT INTO orderGUIDmaster
, orderdate, status)
, GETDATE(), 1)
SET @anError = @@ERROR
Pros and Cons of GUIDs as Primary Keys
Let's say that we plan to use GUIDs for the primary keys. What are the advantages and disadvantages when compared to INTEGERs? We should always start with the positive, so here goes:
- We get a more object ID–like datatype because each primary key typically will be unique for the complete database. This gives good effects, for example, when designing a custom solution for pessimistic locking (but that's another long story).
- @@IDENTITY has been a source of problems when INSERT triggers have INSERTed rows into other tables with IDENTITYs. (You can use SCOPE_IDENTITY() in SQL Server 2000 to avoid this problem.)
- You can move the creation processing of the values from the data tier to the business or consumer tiers. For example, the consumer can create complete structures of orders and details that can be sent all the way through the business tier to the data tier in one go, which can reduce the number of round trips. A more basic way of saying this is that you can get the ID value before you use it, not just afterward, as is the case with IDENTITYs.
In the case of .NET and ADO.NET, it's a big advantage to know the ID value at the time rows are added to the DataSet. Otherwise, for example, autogenerated INSERTs can't be used for master-detail relationships within the DataSet.
- The span of values for GUIDs is enough for most scenarios. INTEGERs are not big enough for some scenarios. (By the way, with INTEGERs, you don't have to start the seed on 1. You can just as easily start it on the smallest negative INTEGER and thereby double the span of values.)
- You can prepare for merge replication. When merge replication is to be used, each table to be replicated must have a GUID column (called ROWGUID in SQL Server). You also don't have to give each server in a merge replication scenario a unique set of values to use for the keys. The risk of duplicate GUIDs is so small that this shouldn't be a problem.
- IDENTITY can't be used for tables in distributed partitioned views.
- A GUID will not mean anything to the users. An INTEGER can be read, understood, and remembered. It's common to see INTEGER keys in the UI, and that might cause problems later.
- You know what a nightmare it can be when doing a manual merge between two tables with INTEGERs as primary keys. Not only do you have to create a new sequence for the union of both tables, but you also must change all the foreign keys for all the dependent tables. If GUIDs are used, this sort of problem does not arise.
The algorithm for generating GUIDs doesn't use the MAC address of the network cards in recent Windows versions anymore. Instead, it just creates a random value. In theory, this presents a risk of getting duplicate GUIDs, but, in practice, it shouldn't be a problem.
The reason for excluding the use of the MAC address is that it not only couples users to GUIDs, but some network cards don't use unique MAC addresses.
I will discuss this further when we reach the section about the COMBs. Stay tuned.
I'm not a salesman, so here are a few drawbacks:
- There is a huge overhead in size. As I said, a GUID is four times larger than an INT. This is very important when it comes to indexes.
- Portability can be a problem because not every database has GUID as a native datatype.
- You can't use GUIDs with aggregates such as MIN. This is unfortunate when it comes to using WHILE plus an aggregate function, such as MIN or MAX for looping instead of a CURSOR. (Well, a workaround is to use a CONVERT or a CAST to CHAR(36) or BINARY(16) of the GUID before it's used in the aggregate, but watch out for performance problems with large tables.)
- It's detrimental to developer-friendliness because it's really hard to edit GUIDs manually, to remember them when you browse the database, and so on. You have to get used to navigating the database by using SQL scripts, but this is not usually that much of a problem.
So, the overhead related to size is the main problem. Most developers seem to have an idea of how big the overhead is, but few have tested it. It's time that this was changed, so let's take a look at a few performance tests and get an idea of how large the overhead is.
Now I'll discuss the test cases and their results, as well as the test environments and other factors.
The Test Cases
The test cases are as follows:
A. Inserting an order with 10 detail rows (repeated 500 times, measured when having 1,000,000 orders)
B. Fetching 100 specific orders out of 1,000,000, one by one
C. The same as B, but this time with a JOIN with the orderdetail table
The metric is the time that the test case takes to execute.
You can easily compare the size overhead of the database, if you want to. Just run some INSERTs for the INT case in one new database and run the same number of INSERTs for the GUID case in another new database. Then compare the file sizes afterward.
Because the test database is so simplified, the size metric is a bit inaccurate because the number of foreign keys compared to ordinary columns is pretty big. Keep this in mind if you evaluate the size metric. This also applies to the time difference, but there are definitely situations in which the database consists of almost only foreign keys.
All tests will be executed by a single user. The tests will be executed directly at the database server—I'm testing only at the data tier in this article. For example, the database will create the GUIDs with NEWID(). You could quite easily expand the test to run from several users or from an application server instead. A very relevant test would be to let the application server create the GUIDs instead, but that is beyond the scope of this article.
Now, I'm sure you are curious about the results. I promise to show them, but first I'd like to say a few things about test results in general and the environment that I used.
It's extremely important to be skeptical when reading test results. To come to the right conclusion, you have to translate the tests to your environment and to your situation, and then re-execute the tests.
The equipment that I used for the tests is not real production equipment; I used only a development machine (1.7GHz, 512MB RAM). The test machine was a Windows XP machine with SP2 for SQL Server 2000.
In a new production environment, you will typically have RAID 1 (mirroring) for the log and RAID 10 (mirroring and striping) with at least four disks for the data. You will also have more CPU cache than 256KB, probably more than one CPU, and also at least 1GB of memory.
With that out of the way, let's take a look at the test results.
Test A: INSERT of Orders
In Table 1, you can find the results of the first test, in which I INSERT orders and details for those orders. The time values are recalculated so that the first result has a value of 1 (a baseline value) and all other time values shows a percentage compared to the first test result value.
As you can see in Table 1, the overhead for the GUID solution is actually huge. What is especially worrisome is the sign of bad scalability that I saw when I compared results from executions when I had different numbers of orders.
Table 1 Result of Test A: INSERT of 500 Orders
Time (with 1,000,000 Orders)
This big difference occurred with "only" 1,000,000 orders (and 10,000,000 detail rows). The difference was very small with a small amount of orders, so this is probably a really bad problem with 100,000,000 orders!
Something is wrong—very wrong!
Test A: INSERT of Orders, Revised
I did quick tests before comparing GUIDs and IDENTITY+INT without seeing the huge overhead that I saw when I scaled up the test a bit. (I also did the tests on NT4, and the problem is probably smaller there, as you soon will see.) This certainly worried me a lot! I tried different tricks, different indexing strategies, different fill factors, a clustered IDENTITY+INT, and so on—everything without luck. My guess was that the reason for the problem was the lack of order for the GUIDs generated at Windows 2000 (and beyond). In Listing 3, you can see a series of GUIDs generated with a couple of calls to NEWID() on SQL Server 7 on NT4. Compare that series with the one in Listing 4, which is generated with a couple of calls to NEWID() on SQL Server 2000 on Windows 2000.
Listing 3: Some GUIDs Generated with SQL Server 7 on NT4
B3BFC6B1 -05A 2-11D6-9FBA -00C 04FF317DF
B3BFC6B2 -05A 2-11D6-9FBA -00C 04FF317DF
B3BFC6B3 -05A 2-11D6-9FBA -00C 04FF317DF
B3BFC6B4 -05A 2-11D6-9FBA -00C 04FF317DF
B3BFC6B5 -05A 2-11D6-9FBA -00C 04FF317DF
Listing 4: Some GUIDs Generated with SQL Server 2000 on Windows 2000
C87FC 84A -EE47-47EE -842C -29E 969AC 5131
2A 734AE4-E0EF-4D77 -9F 84 -51A 8365AC 5A 0
70E2E8DE-500E-4630-B3CB-166131D 35C 21
15ED 815C -921C -4011-8667-7158982951EA
56B3BF2D-BDB0-4AFE-A26B-C 8F 59C 4B5103
As you saw in Listing 3, only the eighth half byte is changed between calls. On the other hand, in Listing 4, only the 13th half byte is constant between calls. In Windows 2000, the MAC address isn't used any longer for when GUIDs are generated. Instead, the GUID is only a 16-byte random number. Well, that isn't totally true. The 13th half byte is constant, so only 15.5 bytes are random. Adam Nathan at Microsoft explained to me that the 13th half byte is the value that will point out the source of the GUID, and 4 means Microsoft. Thanks, Adam!
The guess that I had about the problem with the INSERT overhead for GUIDs was that the lack of order in the Windows 2000–generated GUIDs was giving SQL Server a hard time administering indexes under massive INSERT periods. Therefore, I tried to create an order for the GUIDs instead. I tried to CAST the current DATETIME to a BINARY(8) and put that first in the GUID. Unfortunately, that had no effect. When I investigated it further, I found that when I had a BINARY(16) value and CASTed it to a UNIQUEIDENTIFIER, some bytes were scrambled. What to do? You guessed it. I tried to compensate for the scrambling to see if that had any positive effect on throughput, but no effect occurred. Then I found out that it wasn't the first (high) byte that was important for the new ordering, but the last (low) bytes. I also learned that I didn't have to use BINARY(8) for the current DATETIME. BINARY(6) is enough for the next 77 years, so I decided to occupy only the last (low) 6 bytes with the current DATETIME. In Listing 5, you can see the result of PRINT CAST(GETDATE() AS BINARY(8)). The first (high) bytes are 0.
Listing 5: GETDATE() as BINARY(8)
The algorithm that I used to generate my new GUIDs is shown in Listing 6; GUIDs of that type (generated in SQL Server 2000 on Windows 2000) are shown in Listing 7. I call this type of GUIDs COMBs (for COMBined, abbreviated).
Listing 6: An Algorithm to Generate Ordered GUIDs (COMBs)
DECLARE @aGuid UNIQUEIDENTIFIER
SET @aGuid = CAST(CAST(NEWID() AS BINARY(10))
+ CAST(GETDATE() AS BINARY(6)) AS UNIQUEIDENTIFIER)
Listing 7: A Set of GUIDs (COMBs) Generated with the New Algorithm
CC 22A 56D-0CD5 -43C 5-990E-919001862D20
D5149998-1718 -468C -B1AD-919001862D20
CBD0182D -4A 0E -40AC -9A 4C -919001862D20
As you saw in Listing 7, those GUIDs (COMBs) have the same last 6 bytes. Those represent the DATETIME of when those five GUIDs were generated. A DATETIME has the "precision" of 1/300th of a second, and it's no problem for SQL Server to generate five GUIDs in 3ms with this algorithm, even though it is slower than using only NEWID(). As you can see, ordering the GUIDs isn't perfect because I will have several GUIDs being generated with the same DATETIME—but that doesn't really matter here.
Finally, I found something that helped with my INSERT overhead. When I used this new solution, I got the throughput table shown in Table 2.
Table 2 Result of Test A: INSERT of 500 Orders, Revised
Time (with 1,000,000 Orders)
This was great news, and the overhead of the GUIDs was actually less than expected. Perhaps you wonder whether my algorithm to generate GUIDs in itself isn't very slow. Often when you do conversions, you see CPU cycles fly away. On the other hand, compared to writing to a database, those conversions are pretty cheap.
You also might think that this algorithm must be risky because there is no guarantee of unique GUIDs. That's correct, but the built-in algorithm makes no guarantee, either. It's just a random number. (Not even the NT4 algorithm was guaranteed, as I said before.) Okay, a 15.5-byte random number has low predictability of being duplicated. (Most of us think that the chance of winning millions in the lottery is almost zero. Still, the chance is a number that is less than 4 bytes!)
With my algorithm, the random number's size decreases from 15.5 bytes to 9.5 bytes because the last 6 bytes are occupied with the current DATETIME. Of course, the risk of generating duplicated random numbers with my algorithm is larger than with NEWID(), but it's not a problem if the numbers don't occur within the same 1/300 of a second. In my opinion, the risk of duplicates is small enough that it's acceptable. And when you have UNIQUE CONSTRAINTs, it's okay even if you happen to see two duplicated GUIDs being generated once a decennium or so.
To judge the likelihood of generating two duplicate COMBs, I ran a quick test. I used 6 static bytes for the low bytes and then created a million COMBs without duplicates. It won't be possible for you to INSERT close to that many rows in 1/300 of a second.
I also discussed the risk with a friend who is a mathematician. He pointed out that we don't know the algorithm being used for the random number. Is it "perfect"? He also pointed out that it's very dangerous to make assumptions and conclusions about likelihood and random numbers. Even so, he said that there seemed to be a very low risk of creating duplicate COMBs.
To add some more fuel to the fire, I'd like to point out that the differences in real-life apps are probably bigger than the one I showed in Table 1 when ordinary GUIDs are used. This is because I skipped test results when CHECKPOINTing was going on—CHECKPOINTing was much more meaningful to the GUID test. When I used my custom (COMB) algorithm, CHECKPOINTing was much less frequent and was used for short intervals. That is obvious when I compare the time for INSERTing 500,000 orders when there are 500,000 orders present at the start. You can see the throughput shown in Table 3.
Table 3 Result of Test A: INSERT of 500,000 Orders
Time (with 500,000 Orders)
When did you last get a wage increase of 30 times your salary? Think about it, and you'll get a feeling for what a huge difference this is!
Finally, the memory consumption of the tests was very different. The IDENTITY+INT and my new GUID (COMB) algorithm grabbed perhaps 1MB of RAM for each 500 orders INSERTed. The ordinary GUID algorithm grabbed all memory given to SQL Server (350MB in my tests) when 500 or 1,000 orders were INSERTed!
One property of ordinary GUIDs isn't maintained with COMBs: The order of creation is not known with GUIDs. With COMBs, it is known. For some applications, this can be a problem that keeps COMBs from being used.
Is it common to see this type of raw INSERTs only? No, not that common. Most applications also need some reading of data. Now on to the second test.
Test B: Fetch of Single Order
In the second test, I fetch the customer ID from some of the orders, one by one. To get a correct set of IDs to work with, I first ran a script that loops through the table to grab 100 random IDs and PRINTs a complete stored procedure call to the stored procedure a_Order_Fetch, which is shown in GUID version in Listing 8.
Listing 8: GUID Version of a_Order_Fetch
CREATE PROCEDURE a_OrderGUID_Fetch
, @customerId UNIQUEIDENTIFIER OUTPUT)
DECLARE @anError INT
SET @aCustomerId =
FROM orderGUIDmaster om
WHERE om.id = @orderId)
SET @anError = @@ERROR
Then I copied the 1,000 PRINTed rows and pasted them to another script that EXECUTEs the calls. To skew the test results as little as possible here, I first executed a DBCC DROPCLEANBUFFERS in the beginning of the script that calls the stored procedure.
Let's take a look at the results for the second test. As usual, the first value is recalculated to 1, to be used as a baseline.
Table 4 Result of Test B: SELECT of Orders
Time for Fetching 100 Orders
This is fine, isn't it? But what about the dreaded 16-byte JOIN?
Test C: A Small JOIN
The most common argument for not using GUIDs for primary keys is that it's so expensive for JOINs. Perhaps you have heard this: "Use a 16-byte column to join? Are you crazy?"
I reused Test B but changed the stored procedures slightly so that the orderdetail table was also used in a JOIN, as shown in Listing 9.
Listing 9: COMB Version of a_Order_FetchWithDetail
CREATE PROCEDURE a_OrderCOMB_Fetch
(@orderId UNIQUEIDENTIFIER, @noOfRows INT OUTPUT)
DECLARE @anError INT
SET @noOfRows =
FROM orderCOMBmaster om
INNER JOIN orderCOMBdetail od
ON om.id = od.ordermaster_id
WHERE om.id = @orderId)
SET @anError = @@ERROR
The results of the third test case are shown in Table 5.
Table 5 Result of Test C: SELECT of orders and JOIN with Details
Time for Fetching 100 Orders
Worth mentioning is that the foreign keys are clustered both in the INTEGER case and in the COMB case. As a matter of fact, it was faster not using clustered foreign keys for this specific test, but I thought that it was most natural (and probably more effective in other tests) to use clustered foreign keys.
A Few Tweaks
The test set is pretty untuned, but I'd like to mention a few small tweaks that proved to work well:
- Even though you saw that the INSERT throughput for the ordinary GUID case wasn't good, it would have been even worse if I had used clustered primary keys in that case.
- Each INSERT of an order is wrapped in an explicit transaction. That is, I wrote 1 row to ordermaster and 10 rows to orderdetail in each transaction. Writing many orders in each transaction would have given a shorter processing time, but in an Online Transaction Processing (OLTP) application, it's often natural to have one transaction per order.
- I ran a DBCC DBREINDEX ('tablename', '', 90) before each performance test. This is very important for the GUID case but not that important for the INTEGER and COMB cases. I also did an UPDATE STATISTICS before I executed the second test, and I recompiled the stored procedures.
- The orderdetail.ordermaster_id is indexed. Otherwise, Test C will probably run forever.
I did a few other things between tests to get as fair tests as possible. I ran a backup of the database after the massive INSERTs to all the tables before I collected my metrics. That way, I emptied the transaction log. I also stopped SQL Server between each test case. And, of course, I ran each test case several times and skipped extreme values before finding typical values.
Of course, there are other solutions than INTs and GUIDs. Another typical solution is to go for the BIGINT datatype (64 bits), which is new for SQL Server 2000. You can use IDENTITY for BIGINTs, too.
Here is my conclusion regarding the overhead of using ordinary GUIDs for the primary keys:
- Indexes need more frequent rebuilding with GUIDs. That also applies when updating statistics.
- You must be prepared to see inconsistent INSERT throughput at different times.
- The scalability for INSERTs isn't good because the INSERT overhead increases rapidly.
With my COMB algorithm, the three problems mentioned disappear. The problem with the large overhead in size is there both with my COMBs and with ordinary GUIDs. (After all, COMBs have the format of GUIDs!)
But GUIDs have a lot of advantages, too. (These are sometimes so big that you really don't have a choice!) In my opinion, using GUIDs as primary keys is often a very good idea (especially if you go for the COMBs instead). Is the overhead or the advantages more important when it comes to your applications?
Special thanks to Vladimir Suponitsky and Tibor Karaszi!