Database Abstraction vs. Data Access

Alejandro Gervasio has an excellent article on using polymorphism to create a database abstraction layer.

Basically if you are wondering why or how PEAR DB, MDB, ADOdb, or PDO use the Factory pattern to provide database abstraction this walks you through it.

Maybe five years ago, this was a mind-blowing idea. But perhaps we should call into question whether this is an unnecessary abstraction.

[Database abstraction vs. Data Access after the jump]

Database abstraction

The only good purpose of database abstraction is to provide a uniform interface for database access. Instead of having to remember that the server, username, password ordering is different between mysql_connect(), or oci_connect(), you can just use new PDO() for all of them. Similar things hold for prepared queries, binds, and transactions.

I mention this because that convenience has followed my predilection for database abstraction from PEAR DB in 2000, to ADODB in 2004, to the in-vogue PDO today. I still like it and sometimes I wish they’d go even farther (like correctly abstracting the date function and date type information for me).

The seedy underside of database abstraction

Many people use a database abstraction layer for database portability and independence.

How far can you go with portability? Honestly unless you go to really heinous extremes, your code isn’t really going to be easily portable after a month using only database abstraction. Any of you who have had to do this sort of migration know what I mean. Let’s look at the SQL LIMIT clause in the example given:

SELECT          * FROM table FETCH FIRST 10 rows ONLY /* DB2 */
SELECT FIRST 10 * FROM table                          /* Informix */
SELECT TOP 10   * FROM table                          /* SQL Server */
SELECT          * FROM table LIMIT 10                 /* MySQL and PostgrSQL */
SELECT          * FROM table WHERE ROWNUM <= 10      /* Oracle */

(The astute reader will notice that my Oracle query is shorter than the one in the example I quote. That is because I take a shortcut as will be explained in a future article).

And beyond that, let’s say you want database independence and not just portability. You would not only have to be forced to using strange abstractions like ADODB->SelectLimit(), abstracted transactions, etc., you’ll have to avoid using RecordCount(), subqueries, stored procedures, or just about any interesting feature of the database.

You may have not noticed this because PHP developers have been advocating database independence via its hit man, database abstraction, for as far back as I can remember. And the M in LAMP has been MySQL 3, one of the simplest database known to man, and—just as fortunately—well tuned for web development in general (fast SELECTS, LIMIT clause) and PHP development in particular (lightweight connections).

But I wonder how much of this is becoming self-selection: How many developers out become PHP developers because they secretly despise the database development the way a Java programmer despises procedure-oreinted code? How many developers choose MySQL simply because they develop in a database independent manner? How many developers advocate database independence because all they know is MySQL 3? (Or, put another way, how many times have you heard a PHP developer lament “database X doesn’t have the LIMIT clause” when, in fact, Database X does, it’s not just called LIMIT and that developer just doesn’t know it.)

Don’t get me wrong: I use database abstraction when I have the opportunity. And, MySQL is my favorite database. As Adam likes to joke, “MySQL and PHP put the mmp in LAMP.”

But even I have to wonder: Do I only use database abstraction because, like a friend said to me in graduate school, “Terry, you’re so lazy you’d walk a mile to save a yard?” Do I use MySQL because it is like a really good whore: easy, fast, and cheap?

Database dependence

Over the next year I’m going to try to convince you why database dependence can be a Good Thing™. I don’t know exactly what I’m going to write about, but since Chris Jones has been bugging me to write more about OPAL (Oracle PHP Apache Linux), I suppose I’ll do it from approaching things using Oracle.

(Maybe I’ll get better deals from Oracle if I keep at it Or OTN will publish some of these articles and I can buy yet another digital SLR camera.)

This doesn’t mean that if you use MySQL there is nothing to learn from these posts. MySQL has come a long way and there some cool new database dependent tricks that are popping up in MySQL development. The nasty side of talking in a database-dependent way is that a trick I talk about for Oracle won’t necessarily have a corollary in MySQL or DB2. But I hope reading these (hopefully not completely boring) articles will get you thinking about examining your favorite database for some database dependent gems of your own.

Heck, I’ll be the first to admit that while MySQL is my bitch, the company I work for uses Oracle, bought at paid for. And, I don’t like buying commercial software if I use it in a manner that is nothing better than the free and open source counterpart. Nature abhors a vacuum, and so do I, especially when someone is paying for it.

Data Access and Security

I’m going to advocate Data Access instead of (or on top of) Database Abstraction. This does two things: it increases security and preserves portability. (The consequences will be extra abstraction and a lot of extra code.)

You may remember that an interview question of mine is “How do you create a single security audit point for SQL injection attacks?” An answer is Data Access Objects.

Data Access is a very easy concept. Basically it is simply: “Restrict access to the database to an interface (or set of interfaces) to that layer.” Let’s say you are building an eCard site on top of a contact management site, you might create a data access object that looks like the following:

class plx_dao
{
    protected $_dbh;
    function __construct($dsn, $user='', $password='', $options=array())
    {
       $this->_dbh = new PDO($dsn, $user, $password, $options)
    }
    protected function _getAssoc($sql, $bindings)
    {
       // TODO: no error handling
       $stmt = $this->_dbh->prepare($sql);
       foreach ($bindings as $idx=>$value) {
       for ($i=0,$max=count($bindings); $i<$max; ++$i) {
           $stmt->bindParam($i,$bindings[$i]);
       }
       $stmt->execute();
       return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }
}
class plx_dao_ecard extends plx_dao
{
    const select_sql = 'SELECT * FROM ecards ORDER BY popularity,DESC LIMIT ?,?';
    public function getEcards($start,$max_num=20) {
        return $this->_getAssoc(self::select_sql, array((int) $start-1, (int) $max_num);
    }
}

Now an audit is very easy to perform. You can audit all the SQL for a given application by looking at its data access object’s constants. You can look in the methods themselves to make sure you are correctly casting all the bind parameters. You can even filter or escape the outgoing data from the methods (though you should probably leave escaping for your template layer).

Without a Data Access Object, you have to look at your entire codebase for the database calls.

Data Access is not Data Object

You may think this look a lot like Data Objects but actually they’re polar opposites. The idea of a Data Object approach—an ORM similar to ActiveRecord1 which Ruby on Rails uses—is the database rows should be made (almost) homomorphic with the object layer. This is an entirely database independent abstraction.

The nice thing about Data Objects is the interfaces to them are well-defined. These actions have a one-to-one correspondence on the database on the back end. This fits right into the convention over configuration model of Rails development.

Data Access Objects are configuration over convention. In practical terms, a big consequence of Data Access Object is that every interface with the database has to be specified. Nothing is automatic like in a Data Object.

On the other hand, there is nothing particular about a Data Access Object from creating an interface that does a multi-table join—something very difficult to do in the case of a Data Object. And, much more importantly, it doesn’t perform like dogshit and you are free to take advantage of database dependent optimizations.

Data Access as an assist on database depedence

You can imagine that instead of returning the rows as is with

return $stmt->fetchAll(PDO::FETCH_ASSOC);

I could be fancier and manipulated the data some more in the client

$results = $stmt->fetchAll(PDO::FETCH_ASSOC);
foreach ($results as $idx=>$result) {
    $tmp_fname = tempnam('/tmp','DBcrap_');
    file_put_contents($tmp_fname,$result['data']);
    list($results[$idx]['width'], $results[$idx]['height'], $image_type) = getimagesize($tmp_fname);
    $results[$idx]['thumbnail_url'] = sprintf('%s.%s',$result['picture_id'], image_type_to_extension($image_type));
    unset($results[$idx]['data']);
    unlink($tmp_fname);
}
return $results;

If this logic were moved to a stored procedure, you wouldn’t have to send an entire BLOB down just to generate the width, height and mimetype. But here is the thing, if you created the stored procedure, you only need change the code inside plx_dao_ecard::getEcards() to support, without a single change in any of the client classes. A Database Developer can see nice targets for stored procedure calls simply by looking at the functional spec of the data access objects themselves, instead of the PHP developers entire codebase, or blindly finding relationships from their query logs.

Another example might be when you are paging output. Certain abstractions like SQLRelay, have the capability to suspend sessions to access cursors across independent PHP instances. To do this, simply store the cursor into an internal property of a particular DAO. On __sleep() suspend the connection and have the cursor serialized with the session. On __wakeup() you can reconnect to the same database session and get the next page without having to run a new query.

Data Access and portability

While you are certainly and obviously not database independent, Data Access does assist in portability.

First, you have the flexibility of porting one group of tables at a time (the different DAOs will, if designed right, have few or no connectivity to other tables).

Second, you’d only have to port stored procedures and database dependent features one at a time. You would not have to audit the entire site to find all the database dependent calls because they are in one place.

Most importantly, if you no longer have access to database dependent features during a port, those features can be emulated back in the data access layer. Later your Database Developer can look for different optimization approaches.

I hope this explains why I think Data Access Objects are a necessary abstraction on large scale PHP sites because they compliment Database Abstraction without creating database independence, provide audit points for security, give the database developer adequate targets for optimization, and assist in portability.

1 This URL on Ruby on Rails site shows in a nutshell why Ruby on Rails is bad. Notice they talk about their design decision and how great it is but not a single line devoted to the consequences? Consequences for a convention? That’s sacrilege!

17 thoughts on “Database Abstraction vs. Data Access

  1. The reader (and the writer, hey Terry, haven’t you been to my PDO talk!? 🙂 should note that PDO is data access abstraction, not database abstraction.

    One of the main reasons for this is because database abstraction is basically a compromise between performance and functionality, and the tipping point of that compromise is going to be different for different folks. Rather than arbitrarily set that point based on the core developers preference, we avoided it and left it to people writing apps (and libraries) to extend and make their own choices to fit their usage.

  2. But perhaps we should call into question whether this is an unnecessary abstraction.

    I’ve seen so much bull**** scripts based on php, missing a basic set of abstraction and i wonder why, especially the developers that grew with php3,4 to 5, still code like we have 1999.

    You can do so much nice things with php, you just need to do it 😉

    btw: Everyone with a piece of brain and the chance to do so has a set of unit-tests and can easily port from one database to another w.o. worries 😉

  3. For portability, userland level data access abstraction is still a must have, as you just can’t depend on someones PHP configured with mysql, mysqli, PDO or whatever other flavour you fancy.

    This isn’t going to go away with PHP though, but as PDO becomes more prevalent, it will become less important.

  4. Wez,

    You are right. Unlike say MDB and ADODB, PDO only abstract the minimal stuff like connections, prepared queries, commits, rollbacks. A Database Abstraction layer does this plus abstracting the SQL calls themselves.

    I didn’t make this distinction because Database Access Abstraction and Data Access Objects are different concepts that are easily confused.

    (Database Abstraction is a necessary superset of Database Access Abstraction, and the line between them is very blurry: Does an abstraction support or emulate prepared queries? How does it choose to work with named bind parameters in a prepared queries? How much data translation does it do on native types and return types? Heck, even in the case of PDO’s lightweight interface, there is an option to transform the case of the associative array.)

    If you look, you can see I’m still advocating Database Access Abstraction in the first paragraph, both for ease of programming and to assist in portability. I just traditionally have used the Database Abstraction classes to do it.

    (In fact, if you look at my Data Access Object example, you can see I use PDO as my Database Access Abstraction.)

    In the later paragraphs, I’m speaking out against using Database Access Abstraction or Database Abstraction to create database independence. Database Independence is a concept I see in almost any PHP article on database abstraction and is one that I abhor. (I hate to single one out, but, for example, take the diagrams in this one.)

    (Umm, yes, I’ve been to your talk a couple times. 🙂 )

    I hope that clarifies things.

  5. Sebs,

    I agree completely for small scale sites: Data Access is unnecessary. Heck, Database Abstraction is also unnecessary unless the product is marketed as being deployed interchangeably on many database platforms. Also, in many cases, an enterprise product will gladly eat the performance penalty of database independence.

    As for “half a brain” comment, you are assuming there exist unit tests with adequate coverage. What PHP projects have you worked on where that is the case?

    In addition, I often find that database is the last area people do unit tests because Mocking is a hard to do—people will use a unit test harness to do integration testing at this level. You will note, I hope, that Data Access Objects make much easier targets that Database Abstraction for mocking and thus testing client classes.

    🙂

  6. Mark,

    I think you mean using a PHP database abstraction as a database access abstraction. I agree. While I used PDO in the example, the real world application the example was based on uses ADODB.

    I hope I show, though, that database abstraction is not sufficient in of itself for database portability: you either trade off performance (by making the database generic) or you lose portability (where are the database dependent calls going to leak into)?

  7. Hey Terry,

    Of course you knew I would eventually comment. As all good programmers are lazy and I strive to be one .. I will just cut and paste from some slides of a talk (http://www.backendmedia.com/MDB2/database_abstraction.pdf) I gave in Cancun. As you can see there are tons more reasons for using abstraction than simply being able to sell a shrink wrapped product to a bigger market.

    Motivation (1/5)
    • Support for multiple RDBMS
    – Less effort to support multiple RDBMS
    – Increase market opportunities
    – Focus development time on the bottlenecks
    Motivation (2/5)
    • Forward compatibility
    – new RDBMS versions
    • MySQL 3.x => 4.0.x => 4.1.x => 5.0.x etc
    – new PHP (extension) versions
    • ext/mysqli
    • ext/pdo
    • ext/ibm_db2
    Motivation (3/5)
    • Vendor Lock in
    – Too much code to rewrite
    • Train costs
    – New API to learn
    – New Concepts
    – New Syntax
    Motivation (4/5)
    • Pushing your own preference to the client
    – Show your client that your choice works
    – Better leverage open source database price
    point advantage
    Motivation (5/5)
    • Higher level API
    – Convenience methods
    – Debugging

  8. Lukas,

    I expected you would. 🙂

    Maybe in future talks you could devote more slides to addressing:

    Database Abstraction vs. Database Access Abstraction
    Database Independence vs. Database Dependence
    Data Objects vs. Data Access Objects
    Independence vs. Portability

    The purpose of the article is to put some of those distinctions “out there” because the current trend in the PHP world is to say one is left column is strictly good, and the right column strictly bad.

    Take Wez’s presentation. I don’t know how many times he’d distinguished between Database Abstraction and Database Access Abstraction, but how many people who have seen it understand it? How many people understand the consequences of doing so? How many people know how to preserve portability if you use that approach? I’ve asked a number of people who have parrotted the “PDO is better because it is a Data Access Abstraction not a Database Abstraciton” line that when I asked these sort of questions I become convinced there is no “there” there.

    IMO the pluses of Database Abstraction are mostly along the lines the fact that it gives you Database Access Abstraction. That covers most of your bullet points.

    Vendor Lock In and Pushing preferences are all along the lines of Database Independence which, while not able to be done by a strict Database Access Abstraction like PDO, comes at a huge performance cost.

    The higher level API stuff (I’ve used the query caching featrues of ADODB for instance), are where you can talk the talk, though try/catch and trigger_error natively aren’t half bad. It is in these areas (and performance) where one particular database abstraction or database access abstraction distinguishes itself from another.

  9. I am currently writting a talk on “Deployment using Database Layers & Database Schema” and as soon as it’s finished I’ll keep you posted.

    However I must also mention that:

    Personally I believe that database abstraction layers can be used for good in order to make sure you can port your applications easily, however one must take in consideration that this is not because you are using a DBAL that all your deployments will be effective. I mean you can write many queries optimized for mysql for instance if you product is mainly for a mysql crowd. However, this can become problematic when changing RDBMS. Yes your main queries (if you make them simple) are going to be all working and everything is going to be working well, however as soon as you need per-rdbms-performances-optimization then one will of course have to adjust it’s queries in order to fit and use the specific queries for each RDBMS (as MSSQL has different index types than mysql, mysql will have sql92 non compliant optimizations ( I think of COUNT(idField) that can be changed to SQL_CALC_FOUND_ROWS (which is proven to be faster), Oracle has also other ways of optimizing it’s queries.

    So my main point is coming..

    Having a DBAL can be good for some reasons, however, one thing that I personally find attractive, is that lightweight-DAO built on top of a good DBAL that will have optimized queries per-rdbms will be very effective for deployment, and will give you the ability to code new queries and just execute them and later on debug them. The DBAL will give you (most of the times) good ways/tools to debug the faultive queries, however, switching between data sources will be made smoothly and will be much less pain for of course vendors but can also benefit a lot to open source softwares to cover more and more grounds. This said, having a good DAO built will also allow you to have DBAs working per-rdbms which is going to end up in a much more solid, scalable, flexible software (free or not).

  10. @David:

    I agree. And it sounds like an interesting talk

    You might want to go into detail which of the DbAL’s are best for DAO and why. But the main point is the concept of a DAO is foreign to most PHP developers out there, so a basic understand of how a DAO and DbAL/DbAA compliment each other for portability and performance is necessary to convey (with examples).

    Too many talks in the PHP world have hidden frames like:
    1) A DbAL magically gives you portability
    2) Database independence is more important than portability
    3) THere is no need for database-specific performance optimization.

    Showing specific examples how a DAO/DbAL is constructed and how it allows for query optimization or stored procedure optimizations without sacrificing portability would be a great talk.

    🙂

  11. The only thing I could point to right now is Hibernate versions prior to 3 (As three and more are overly bloated)

    This is java, but imho, a lot of PHP Developers should learn first, real Object Oriented Structures and secondly do not take anything for granted 🙂 It’s not because you have a very good DBAL that your life is going to be easier 🙂

    $0.02

  12. “Showing specific examples how a DAO/DbAL is constructed and how it allows for query optimization or stored procedure optimizations without sacrificing portability would be a great talk.”

    Fun that you are specifying this because this is also something I have written the structure of my talk, but without content just yet. Perhaps that’ll be something quite new but very useful to developers (PHP et al)

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.