A relational database is composed of entities (aka tables), relationships between the entities, and attributes that describe the entities. By creating related tables, you minimize data redundancy, and therefore, the disk space utilized. For example, suppose you have two entities, Employees and Departments. If you separate employee information into one table and departmental information into another, you minimize disk utilization by including in the Employees table only an identifier of the employee's department rather than duplicating the departmental information for every employee. Although this separation minimizes disk space, it creates a problem. When you need to display employee information along with the appropriate department, you need a way to bring the two tables together. The type of query that brings the two tables together is called a join. A join is one important way of accessing relational databases; by using a join, you can unite entities into one resultset that fully describes the entities, relationships, and attributes that form your data's framework.
Although joins are simple, many people have difficulty determining when and how to use them. So, I decided that joins were a good topic to start this column. To cover all aspects of joins, I plan a series of articles to cover entity and attribute basics, definitions of the join types, and complex join concepts. I'll also examine how you can create indexes that improve join performance.
Back to Basics
A relational database consists of tables, with each table focusing on one entity. An entity is a classification given to a group of similar objects. Thus, an entity can be a person, place, or thing such as a customer or a product. Each entity has attributes that are typically specific only to that entity. An attribute—or, with respect to a table, a column—describes a non-decomposable characteristic of an entity. For example, in a products table, you might want to describe each product your company sells. Products generally have an ID (or other unique identifier), a name, a description, a price, and so on. Therefore, the Product entity table in your database would have simple attributes such as ProductID, ProductName, and Price.
When you consider what entities to store, think about how the data will be used and searched. If an attribute, or a component of an attribute, needs to be referenced in a WHERE clause or in an ORDER BY clause, you probably want to define the component as a single column. For example, a Name column is acceptable, but the data could be more easily accessed if you broke it into FirstName, MiddleInitial, and LastName columns. The three-column approach best serves the following search conditions and ORDER BY clause:
WHERE LastName = 'Smith' AND FirstName =
'Robert'
ORDER BY LastName, FirstName, MiddleInitial
However, you can take this approach too far. For example, you could break AddressLine1 into StreetNumber, StreetName, and so forth. You might need to break addresses down if you perform analysis based on demographics and need to search within an area's defined quadrants; however, AddressLine1 is usually sufficient.
When you describe an entity during table creation, you must also decide whether each attribute requires a defined value and, if so, what the domain of legal values is that the attribute will support. If a definite value isn't necessary, the column definition can allow NULL values. You implement these decisions by defining the attribute's data type and nullability (whether that column accepts NULL values). Nullability is an important consideration, so don't leave the determination of nullability to the current session settings or database settings. Always describe nullability in the table definition after giving the subject serious thought.
After you define the entities and corresponding attributes in your database, you populate the table with data; then you'll probably need to display that data in reports. Entities are a database's core components; attributes and relationships give an entity detail and depth and describe its associations with other entities. If a relational database has minimized redundancy by creating many entity tables, you use joins to create a picture of the data and its relationships.
Let's review a sample database called TSQLTutorJoins (see "Obtaining the Code," page 3, for download instructions for the creation file). This database has four entities: Customer, Product, Category, and Order. To minimize the join results, each table has only a few attributes and only a few rows. These limitations let you focus solely on the join results and not get lost in large resultsets and complex attributes.
Also, I defined primary key and foreign key constraints on each table to enforce the relationships between the tables. This enforcement, declarative referential integrity (DRI), significantly helps maintain and enforce accuracy in the relationships between entities when the database processes additional INSERT, UPDATE, and DELETE statements.
DRI is useful when you're first using a database because the DRI adds descriptive graphical information to the entity relationship diagram (ERD) you create through your database-diagramming tool. You can use a database-diagramming tool to generate a graphical representation of all your tables and their relationships. The DRI definitions expose the relationships. After running the TSQLTutorJoins.sql script to create the TSQLTutorJoins database, you can use Enterprise Manager to create a database diagram. Expand your database list, right-click the Diagrams option, then choose New Database Diagram. Add the Category, Customer, Product, and Order tables from the Available Tables list to the Tables to add to Diagram list. Click Next to diagram the database; your completed diagram should look like the diagram that Figure 1 shows.
In Figure 1, you can see that Customer is related to Order, Order is related to Product, and Category is related to Product. Let's examine these relationships more closely. The line that connects two tables tells an important story. Between the Customer and Order tables, the line with a key at one end and an infinity symbol at the other end denotes a one-to-many (1:M) relationship (i.e., for one Customer, you can have many Orders). The connecting line also says that for any given Order, you can have only one Customer. However, for the purpose of describing tables and their relationships, you can say that Order and Customer have a 1:M relationship. In this diagram, you can also see that a Product can be on many Orders, but an Order can contain only one Product. Again, this relationship between Order and Product is 1:M.
What's the relationship between Customer and Product? In this database, a Customer can purchase many Products, and many Customers can purchase the same Product. When the diagram shows the line connecting two tables intersecting at one table with keys touching the main tables (Customer and Product) and infinity symbols touching the table in common (Order), the relationship between the related "key" tables is many-to-many (M:N). Therefore, Customer to Product is an M:N relationship (which database designers frequently describe through an intermediate table, as I've done here).
You can use T-SQL syntax—specifically, sp_help and sp_helpconstraint—to view the tables, their relationships, and their details. However, a database diagram is an easy way of seeing the big picture quickly. However you discover the relationships, you can use this information to determine the most common means of joining the tables—primary keys and foreign keys. Most joins connect the primary key of one table to the foreign key of another.
What Can Joins Do for You?
The four tables in the TSQLTutorJoins database contain six customers, three product categories, seven products, and eight orders. By looking at the tables, you can see who the customers are, what categories of products the company sells, which products it sells, and when each order was taken. You can also see that not every customer has made a purchase and not every product has been sold. Furthermore, some customers have made more than one purchase, and customers can make over-the-counter orders without supplying a CustomerID.
Let's start with some easy questions about the data: Which customers have made a purchase? Which customers haven't made a purchase? You don't need to know how to write a join to answer these questions; you can use a subquery instead. For example, you can answer the question, Who made a purchase? with the following query:
SELECT *
FROM dbo.Customer
WHERE Customer.CustomerID IN (SELECT
[Order].CustomerID FROM dbo.[Order])
This query returns the results that Table 1 shows.
TABLE 1: Results of Query to Find Which Customers Purchased Products | |||
CustomerID | FirstName | MiddleInitial | LastName |
1 | Arthur | A | Curry |
2 | Bruce | B | Wayne |
5 | Eddie | R | Nashton |
6 | Diana | W | Prince |
(4 row(s) affected |
You can answer the question, Who hasn't made a purchase? with the following query:
SELECT *
FROM dbo.Customer
WHERE Customer.CustomerID NOT IN (SELECT
[Order].CustomerID FROM dbo.[Order])
which returns no rows. But you know that two customers, CustomerID 3 (Selina) and CustomerID 4 (Oswald), haven't made any purchases. Why doesn't the subquery answer the question correctly? The reason goes back to nullability. Because your database allows NULL values for CustomerID, the set has undefined rows—rows in which no certain value is known. In this subquery, you're asking for the customers who are NOT IN a specific list, but that list includes unknown values. Therefore, you can't know with certainty that any rows are NOT IN the undefined set. If you use the default behavior of Query Analyzer—ANSI_NULLS behavior ON—the subquery won't work here unless you add a search condition to it. I recommend working within the bounds of the Query Analyzer session-setting defaults because some changes to those settings can have a negative impact on other operations. (For more information about session settings and their impact, see SQL Server Books Online—BOL—for "Set Options That Affect Results.") I describe the behavior I found with the defaults in Query Analyzer.
Note that you could create an entry in Customer for the "over-the-counter" customer to solve this problem, but you won't always have control over the database. Understanding how NULL values work in all situations is crucial if you plan to support NULL values. The following query (with the search condition added)
SELECT *
FROM dbo.Customer
WHERE Customer.CustomerID NOT IN
(SELECT [Order].CustomerID FROM
dbo.[Order]
WHERE [Order].CustomerID IS NOT NULL )
returns the expected results, which Table 2 shows.
TABLE 2: Results of Query to Find Which Customers Purchased No Products | |||
CustomerID | FirstName | MiddleInitial | LastName |
3 | Selina | C | Kyle |
4 | Oswald | P | Cobblepot |
(2 row(s) affected) |
Now you have the answer to the original questions of who has and hasn't bought a product, but what if you need details? How many orders did Bruce place, what did he order, and when did he place the orders? With subqueries, you can see an order's existence but not its details. Joins help you to see all the details. Let's look at the various types of joins.
An inner join is the intersection between two tables. For inner and outer joins, Venn diagrams show, in a simple way, the relationships between the joined tables and the rows that the join query should return. For example, let's review the Customer and Order tables. You can graphically represent the question, Which customers have made a purchase? by showing Customer as the left circle and Order as the right circle, as Figure 2 shows. In the circles, CustomerIDs represent those customers who have placed an order. Note that two rows in the Customer circle (CustomerIDs 3 and 4) and two rows in the Order circle (representing over-the-counter sales) are outside the intersection.
To perform a join between these tables, you first need to define the criteria to base the intersection on. If you want to display the orders for a given customer, the customer's CustomerID must equal the order's CustomerID. This correspondence acts as the join definition. Typically, joins are based on primary keytoforeign key relationships, so looking for these relationships first can help you write your joins.
You can write a join in either of two ways: You can use the current ANSI standard syntax (ANSI-92), which SQL Server 2000 supports, in which you place the join conditions in the FROM clause. Or you can use the older ANSI standard, in which you place the join conditions in the WHERE clause. You'll encounter both styles, so let's look at the semantics of both. Especially in dealing with outer joins, the two styles differ in the resultset produced as well as in syntax.
Both syntaxes start with the SELECT list: What information do you need about each order and the customer who placed the order? For this example, let's ask for the customer's FirstName and the order's ProductID:
SELECT Customer.FirstName,
[Order].ProductID
--NOTE: Order must be in square brackets
--because it's also a keyword.
To truly understand joins, I recommend that you anticipate the results. Ask yourself, How many rows should the query return? If you have six customers and eight orders, the maximum number of rows that an inner join between these tables could return is eight—the number of rows in the larger of the two tables. Anticipating the number of rows helps you know when you've made a mistake. Because this estimation works only for small tables, I created the TSQLTutorJoins database with a limited number of rows. If every row in the Order table has a match, you'll see each order along with the CustomerID for the customer who placed the order. Figure 2 shows that four CustomerIDs—1, 2, 5, and 6—have placed orders; however, it doesn't show how many orders each one placed. So you know that your query should return between four and eight rows. With two rows in the Order table showing NULL for the CustomerID, you know that an inner join of these tables will return exactly six rows. Estimating the results is an important first step toward fully understanding the join relationships between tables. If you understand the resultset first, each join type will come into focus.
Next, let's put together the JOIN clause. Using the newer ANSI standard—ANSI-92—you begin with the keyword FROM and add the name of one of the tables, fully qualifying the owner:
FROM dbo.Customer
Next, add the join type (INNER JOIN) and the remaining table name:
INNER JOIN dbo.[Order]
Note that the table order in the query is unimportant. You could also reverse the table order to write this query.
Then, you need an ON clause that defines the relationship and contains the conditions on which the join is based. In this case, the customer's CustomerID must match the order's CustomerID:
ON Customer.CustomerID = [Order].CustomerID
Using the recommended ANSI syntax, the complete JOIN statement is
SELECT Customer.Firstname AS
CustomerFirstName, [Order].ProductID
FROM dbo.Customer
INNER JOIN dbo.[Order] ON Customer
.CustomerID = [Order].CustomerID
The old ANSI-standard syntax can be useful in correlated subqueries and other T-SQL syntax, although that syntax contains less detail about the type of join in the FROM clause. In addition, with the older syntax, you place the join conditions in the WHERE clause instead of the FROM clause. So, using the old ANSI standard syntax, the complete JOIN statement is
SELECT Customer.Firstname AS
CustomerFirstName, [Order].ProductID
FROM dbo.Customer, dbo.[Order]
WHERE Customer.CustomerID = [Order]
.CustomerID
Both these queries return exactly the same six-row resultset, but the order of the results can vary. Remember that as of SQL Server 7.0, ordering is guaranteed only if you request ordering through an ORDER BY clause.
So which syntax should you use? Using the newer ANSI syntax for joins is better, for several reasons. You can demonstrate one important reason by creating a typo or an omission in your JOIN statement in Query Analyzer. For example, remove the join conditions from the join queries, then retry the queries:
SELECT Customer.Firstname AS
CustomerFirstName, [Order].ProductID
FROM dbo.Customer
INNER JOIN dbo.[Order]
and
SELECT Customer.Firstname AS
CustomerFirstName, [Order].ProductID
FROM dbo.Customer, dbo.[Order]
The first query generates a syntax error, but the second generates 48 rows of data. This product of the eight orders and six customers is called a cross-join, or Cartesian product. Using the old standard, you can easily make this mistake and generate an incorrect resultset. With a two-table join, you can easily recognize and fix this type of error. But with a larger, more complicated join and additional search conditions, errors can be difficult to detect.
The recommended ANSI-92 syntax requires that you supply the necessary join conditions and returns an accurate resultset. Furthermore, the ANSI-92 syntax makes your JOIN statements easier to read by separating the join syntax from the search arguments.