Writing a Portable Data Access Layer
Summary: Find out how to write smart applications that work transparently with different data sources, from Microsoft Access to SQL Server to Oracle RDBMS. (15 printed pages)
During the last six years of doing consulting work, I've heard this question about data access and manipulation many times, and it's become a real obsession: "How can I write my application so that it works with database servers x, y, and z with few or no changes?" Knowing that the data access layer is still the most critical part of a modern application, and usually the #1 enemy for inexperienced developers, my first reaction has always been: you can't!
People's scared faces and the question, "But what about the Universal Data Access mantra that Microsoft proposed with ADO?" prompt me to provide a more detailed explanation of the problem, and a suggested solution.
The problem is that everything goes well while your application is a small prototype, or while you have few concurrent users and simple data access logic—even if you choose the easiest path: RAD tools, like the Data Environment in Microsoft® Visual Basic® 6.0, or "all-in-one" solutions like the ActiveX® Data Control and other third-party components, which usually hide the complexity of the interaction between your application and a specific data source. But when the number of users grows, and concurrency becomes more of an issue, a lot of performance problems can appear due to the underlying use of dynamic record sets, server-side cursors, and unnecessary locking policies. The design and code changes you'll have to make to the system to reach your users' goals will then cost you a lot more because you didn't take this problem into consideration from the beginning.
Using a Universal Data Access Approach
Microsoft launched the Universal Data Access campaign when ADO entered maturity in MDAC (Microsoft Data Access Components version 2.1). The idea was to show developers that with a simple object model (Connection, Command, and Recordset), they could write an application that could connect with a wide set of different data sources, in both relational and non-relational form. What the documentation—and the majority of the articles and samples at that time—usually forgot to say was that even using the same data access technology, the programmability and the characteristics of the various data sources were very different from one another.
The net result was that in applications which needed data from several sources, it was easiest to use the "common denominator" of functionalities provided by all data sources, thereby missing the benefits of using data source-specific options that could provide an optimized way to access and manipulate information inside the various RDBMSs.
What always made me skeptical about this approach was that, after a more detailed analysis of the problem with my customers, we usually agreed the portion of the application that interacted with the data source was a very small one, when compared to the rest of the presentation and business logic. By doing a good job with a modular design, it was possible to isolate the RDBMS-specific code in some easily interchangeable modules, and thereby avoid the "one-size-fits-all" approach to our data access. Instead, we could use very specific data access code (using stored procedures, command batches, and other features, depending on the data source), without touching the majority of the other application code. This always serves as a reminder that correct design is the key to writing portable and efficient code.
ADO.NET brings some important changes into the data access coding arena, like the concept of specialized .NET data providers. Using a specific provider, you can get an optimized way to reach your data sources, bypassing the very rich—but sometimes unnecessary—series of software interfaces and services that the OLE DB and ODBC layers interposed between your data access code and the database server. Still, every data source has different characteristics and features, with different SQL dialects, and to write efficient applications you must still use these specific characteristics instead of a "common denominator." From the point of view of portability, managed and unmanaged data access technologies are still very similar.
Outside of "Leverage the unique characteristics of the data source," the other rules necessary to write a good data access layer are usually the same with every data source:
· Use a connection pooling mechanism, where possible.
· Take care with the limited resources of a database server.
· Pay attention to the network round-trips.
· Promote the reuse of execution plans and avoid recompilations, where applicable.
· Use an adequate locking model to manage concurrency.
In my personal experience using the modular design approach, the amount of code in a complete application which is dedicated to working with a particular data source is not more than 10% of the total. Obviously, this is more complex than just changing the connection string in a configuration file, but I think you'll find that it is a tolerable compromise in return for the performance benefits.
Working with Base Interfaces
Our goal here is to use abstraction, and encapsulate the code specific to a particular data source in a layer of classes that let the rest of the application be independent, or decoupled, from the database server in the backend.
The object-oriented characteristics of the .NET Framework will help us during this process, giving us the opportunity to choose which level of abstraction we want to use. One option is to use the base interfaces that every .NET Data Provider has to implement (IDbConnection, IDbCommand, IDataReader, etc). Another is to create a set of classes—the data access layer—that manage all the data access logic for the application (using the CRUD paradigm, for example). We will examine these two possibilities, starting from a sample order-entry application, based on the Northwind database, to insert and retrieve information from different data sources.
Data provider base interfaces identify the classic behaviors that an application usually requires to interact with a data source:
· Define a connection string.
· Open and close a physical connection to the data source.
· Define a command and related parameters.
· Execute the different kinds of commands you can create:
· Return a set of data.
· Return a scalar value.
· Execute an action on data without returning anything.
· Provide forward-only, read-only access to the returned data set.
· Define a set of operations to keep a data set in sync with the content of the data source (a data adapter).
That being said, however, if we encapsulate the various operations needed to retrieve, insert, update and delete information in different data sources (using different data providers) in our data access layer, and only expose members of the base interfaces, we can reach a first level of abstraction—at least from a data provider point of view. Let's take a look at some code illustrating this idea:
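The original listing is not reproduced here, so what follows is a hypothetical sketch of such a factory class. The `DataFactory` name, the `DataProvider` enumeration, and the method signatures are my own illustration: the key point is that every return type is a base interface, never a provider-specific class.

```csharp
using System;
using System.Data;
using System.Data.OleDb;
using System.Data.SqlClient;

// Identifies which .NET Data Provider to use.
public enum DataProvider
{
    SqlServer,
    OleDb
}

public sealed class DataFactory
{
    private DataFactory() { }

    // Returns an IDbConnection, so callers never see SqlConnection
    // or OleDbConnection directly.
    public static IDbConnection CreateConnection(
        string connectionString, DataProvider provider)
    {
        switch (provider)
        {
            case DataProvider.SqlServer:
                return new SqlConnection(connectionString);
            case DataProvider.OleDb:
                return new OleDbConnection(connectionString);
            default:
                throw new ArgumentException("Unsupported provider.");
        }
    }

    // Creates a command through the connection itself, again
    // exposing only the base interface.
    public static IDbCommand CreateCommand(
        string commandText, IDbConnection connection)
    {
        IDbCommand command = connection.CreateCommand();
        command.CommandText = commandText;
        return command;
    }
}
```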
The point of this class is to hide, from the upper levels of the application, the details regarding the creation of instances of a particular type coming from a specific data provider. The application can now interact with a data source using only the generic behaviors exposed through the base interfaces.
Let's look at how to use this class from the rest of the application:
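Again, the original listing is missing; here is a plausible reconstruction of the `CustomerData` class described below. The configuration key names (`connectionString`, `provider`) and the query are assumptions for illustration.

```csharp
using System;
using System.Configuration;
using System.Data;

public class CustomerData
{
    public DataTable GetCustomers()
    {
        // Read the provider type and connection string from the
        // application configuration file.
        string connString =
            ConfigurationSettings.AppSettings["connectionString"];
        DataProvider provider = (DataProvider)Enum.Parse(
            typeof(DataProvider),
            ConfigurationSettings.AppSettings["provider"]);

        using (IDbConnection conn =
            DataFactory.CreateConnection(connString, provider))
        {
            IDbCommand cmd = DataFactory.CreateCommand(
                "SELECT CustomerID, CompanyName FROM Customers", conn);
            conn.Open();

            // Copy the forward-only stream into a disconnected table.
            DataTable customers = new DataTable("Customers");
            customers.Columns.Add("CustomerID", typeof(string));
            customers.Columns.Add("CompanyName", typeof(string));
            using (IDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    customers.Rows.Add(
                        new object[] { reader[0], reader[1] });
            }
            return customers;
        }
    }
}
```

Note that nothing in this method mentions SqlClient or OLE DB: switching data sources is a matter of changing two configuration entries.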
In the GetCustomers() method of our CustomerData class we can see how, by reading information from a configuration file, it's possible to use the DataFactory class to create an XxxConnection instance with a particular connection string, and write the rest of the code with no particular dependency on the underlying data source.
An example of a business layer class that interacts with our data layer could look like this:
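The business-layer listing is also missing from the text; the sketch below (class and method names are my own) shows the essential idea: the business object consumes the data class and knows nothing about the data source.

```csharp
using System.Data;

// A minimal business-layer class: it delegates all data access to
// CustomerData and simply passes the results upward.
public class CustomerManager
{
    public DataTable ListCustomers()
    {
        CustomerData data = new CustomerData();
        return data.GetCustomers();
    }
}
```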
So, what's wrong with this approach? The problem here is there's just one important detail that ties the code to a particular data source: the SQL syntax of the command string! In fact, writing your app this way, the only thing you can do to make it portable is to adopt a base SQL syntax that can be interpreted by any of your data sources, thereby losing any chance to benefit from the specific functionality of a particular data source. This could be a small problem if your application has to do only very simple and standard operations over the data, and if you don't want to use advanced functionality (XML support, for example) in a particular data source. Usually, though, this approach will result in poor performance, since you cannot use the optimized features of each data source.
Writing a Specialized Data Access Layer
Consequently, the use of base interfaces only is not enough to provide an acceptable level of abstraction from the different data sources. In this situation, a good solution could be to "raise the bar" of this abstraction, creating a set of classes (e.g. Customer, Order, etc.) to encapsulate the use of a specific data provider, and exchanging information with the other levels of the application through data structures not related to a particular data source; a typed DataSet, an object collection, etc.
This layer of specialized classes can be created inside a particular assembly, one for every supported data source, and can be loaded on demand from the application, following instructions in a configuration file. In this way, if you want to add a brand new data source to your application, the only thing you have to do is implement a new set of classes, respecting the "contract" defined in the common set of interfaces.
Let's see a real example: If we wanted to support both Microsoft® SQL Server™ and Microsoft® Access as data sources, we would create two different projects in Microsoft® Visual Studio® .NET, one for each data source.
The one for SQL Server would look like this:
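The original SQL Server listing is not shown, so here is a hedged reconstruction. The namespace, the `IDbCustomers` interface, and the stored-procedure name `GetCustomers` are assumptions; the point is that this class is free to use SqlClient-specific features such as stored procedures.

```csharp
using System.Data;
using System.Data.SqlClient;

namespace DAL.SqlServer
{
    public class CustomersData : IDbCustomers
    {
        private string connectionString;

        public CustomersData(string connectionString)
        {
            this.connectionString = connectionString;
        }

        public DataTable GetCustomers()
        {
            using (SqlConnection conn =
                new SqlConnection(connectionString))
            using (SqlDataAdapter adapter =
                new SqlDataAdapter("GetCustomers", conn))
            {
                // SQL Server-specific optimization: a stored
                // procedure instead of an ad hoc SQL string.
                adapter.SelectCommand.CommandType =
                    CommandType.StoredProcedure;
                DataTable customers = new DataTable("Customers");
                adapter.Fill(customers);
                return customers;
            }
        }
    }
}
```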
The code for data retrieval from Microsoft® Access would look like this:
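As with the SQL Server version, this listing is reconstructed rather than original. The Access implementation honors the same hypothetical `IDbCustomers` contract, but goes through the OLE DB provider and uses inline SQL, since Access has no equivalent of SQL Server stored procedures.

```csharp
using System.Data;
using System.Data.OleDb;

namespace DAL.Access
{
    public class CustomersData : IDbCustomers
    {
        private string connectionString;

        public CustomersData(string connectionString)
        {
            this.connectionString = connectionString;
        }

        public DataTable GetCustomers()
        {
            using (OleDbConnection conn =
                new OleDbConnection(connectionString))
            using (OleDbDataAdapter adapter = new OleDbDataAdapter(
                "SELECT CustomerID, CompanyName FROM Customers", conn))
            {
                DataTable customers = new DataTable("Customers");
                adapter.Fill(customers);
                return customers;
            }
        }
    }
}
```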
The CustomersData classes implement the IDbCustomers interface. When we need to support a new data source, we only have to create a new class that implements this interface.
An interface of this type can look like this:
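The interface listing is missing from the text; here is a minimal sketch. Only `GetCustomers()` is used in the examples above, so the remaining CRUD members are indicated as a hypothetical extension.

```csharp
using System.Data;

// The "contract" every provider-specific assembly must honor:
// one CustomersData class implementing these members.
public interface IDbCustomers
{
    DataTable GetCustomers();

    // A complete data access layer would extend the contract with
    // the rest of the CRUD operations, for example:
    // void InsertCustomer(DataRow customer);
    // void UpdateCustomer(DataRow customer);
    // void DeleteCustomer(string customerID);
}
```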
We can create private or shared assemblies to encapsulate these data access classes; in the first case, the assembly loader will search for the one we specify in the configuration file inside the AppBase folder, or in a child directory using the classic probing rules. If we have to share these classes with other applications, we can put these assemblies in the global assembly cache.
Using Data Access Classes from the Other Layers
These two almost identical CustomersData classes are contained in two different assemblies that the rest of the application will use. Through the following configuration file, we can now specify which assembly to load and which data source to target.
An example of a possible configuration file would be something like this:
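The configuration file itself is not reproduced in the text; the fragment below is a plausible reconstruction. The key names (`connectionString`, `dalAssembly`) and the assembly name are assumptions chosen to match the earlier examples.

```xml
<!-- Hypothetical application configuration file -->
<configuration>
  <appSettings>
    <!-- Canonical connection string for the chosen data source -->
    <add key="connectionString"
         value="server=(local);database=Northwind;Integrated Security=SSPI" />
    <!-- Fully qualified name of the assembly to load at run time -->
    <add key="dalAssembly"
         value="DAL.SqlServer, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null" />
  </appSettings>
</configuration>
```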
We have to specify two pieces of information inside this file. The first is a canonical connection string, so we have the opportunity to change, for example, the name of the server, or some other parameter for the connection. The second is the fully qualified name of the assembly that the upper layer of the application will load dynamically to find the class to use with a particular data source.
Let's look at this portion of code too:
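The loading code is also missing from the text, so this is a reconstruction under the same assumptions as the configuration fragment above (key names, and the convention that the class is named `CustomersData` inside the assembly's root namespace).

```csharp
using System;
using System.Configuration;
using System.Reflection;

public class DataLayerFactory
{
    public static IDbCustomers CreateCustomersData()
    {
        string assemblyName =
            ConfigurationSettings.AppSettings["dalAssembly"];
        string connString =
            ConfigurationSettings.AppSettings["connectionString"];

        // Load the configured assembly through the normal
        // probing rules (AppBase, child directories, or the GAC).
        Assembly assembly = Assembly.Load(assemblyName);

        // Assumed convention: "<RootNamespace>.CustomersData".
        string typeName =
            assemblyName.Split(',')[0] + ".CustomersData";
        Type type = assembly.GetType(typeName);

        // Create the instance and hand it back through the
        // shared interface only.
        return (IDbCustomers)Activator.CreateInstance(
            type, new object[] { connString });
    }
}
```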
As you can see, the assembly is loaded using the name read from the configuration file, and an instance of the CustomersData class is created and used through the shared interface.
Some Possible Improvements
To see an illustration of the approach I'm suggesting, see the .NET Pet Shop v3.0 sample application. I'd recommend downloading the sample and taking an in-depth look at it—not just for portability issues, but also for other interesting areas like caching and performance optimization.
An important area on which to focus your attention during the design of the data access layer for a portable application is how to pass information back and forth with the other layers. In my example, I simply use a generic DataTable instance; in a production scenario you might want to consider a different solution, based on what kind of data you have to represent (whether you have to deal with hierarchical data, and so on). I don't want to reinvent the wheel here, and my suggestion is to take a look at the Designing Data Tier Components and Passing Data Through Tiers guide, which describes very well the different scenarios and the benefits of the recommended solutions.
As I said in the introduction, the particular features that your targeted data sources expose—as well as the overall data access—should be considered during the design phase. This should cover such things as stored procedures, XML serialization, and so forth. Regarding Microsoft® SQL Server™ 2000, you can find a discussion of how to optimally use these features in the .NET Data Access Architecture Guide, which I strongly suggest you read.
I always receive a lot of requests about the Data Access Application Block and how it is related to the arguments I'm describing in this article. These .NET classes act as a layer of abstraction over the SQL Server .NET Data Provider, and let you write more elegant code to interact with the database server. This is an idea of what you can do:
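The article's own snippet is not reproduced, so here is a representative sketch using the Data Access Application Block's `SqlHelper` class. The query is my own example; `SqlHelper.ExecuteDataset` is part of the block's public API.

```csharp
using System.Data;
using Microsoft.ApplicationBlocks.Data;

public class CustomerData
{
    public DataSet GetCustomers(string connectionString)
    {
        // One call replaces the usual connection/command/adapter
        // boilerplate against SQL Server.
        return SqlHelper.ExecuteDataset(
            connectionString,
            CommandType.Text,
            "SELECT CustomerID, CompanyName FROM Customers");
    }
}
```

Keep in mind that the block targets the SQL Server .NET Data Provider only, which is exactly why it does not, by itself, solve the portability problem discussed in this article.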
There's also an extrapolation of this approach available in the open source Data Access Block 3.0 (Abstract Factory Implementation) sample that you can find on GotDotNet. This release implements the same abstract factory pattern, and lets you use different data sources based on the available .NET Data Providers.
You should now be able to build business logic classes that don't require modification based on the choice of a particular data source, yet allow you to exploit the unique features of the given data source to obtain a more optimized result. This comes with a cost; we have to implement multiple sets of classes to encapsulate the low-level operations for a particular data source, together with all the programmable objects that we build for every specific data source (stored procedures, functions, etc.). If you want performance and portability, however, this is the price you have to pay. Based on my practical experiences, it's worth it!