serialization without pity

You may have guessed my PHP development philosophy from something I wrote recently, but an interesting question at work yesterday showed that I need to put it in words.

If there is something difficult to do in PHP, there is probably an extension somewhere that allows PHP to push it to another layer. If that is something that can’t be pushed to another layer, then PHP probably has a built-in function or best practice to handle that case. Find that extension, function, or best practice and never choose one where another is better.

At work, the problem was, “Well if we weren’t so big, I’d have done all this on the database, which probably means that PHP shouldn’t be solving this problem.” In other words, if the scale were small, the database, not PHP, would have been the obvious place to solve the problem we were having. But the scale we operate at is too large for a database, so it becomes a problem. A solution to the problem on this scale written in PHP would not be a good architecture decision. (I’d pay for it down the road.)

That explains the first part of my philosophy, but what about the second?

[autoload, quirks, and session serialization after the jump]

Introducing __autoload and unserialize_callback_func

Frank Kleine writes a PHP 5 framework called Stubbles. I have a long-standing view about frameworks that hasn’t changed one bit. But instead of arguing about the Sisyphean task Frank is engaging in, I’ll show what my approach is to one small component (while being a bit jealous that he can actually develop in PHP 5).

In this case, Frank’s framework stores objects into the session. Now PHP stores this by serializing the object into a string which is stored from request to request. The problem occurs when PHP deserializes the object and it doesn’t have the class definition already loaded. In these cases PHP creates an unusable object and throws a warning. But he doesn’t want to pre-include every class that could possibly be serialized with the session on session_start() so what to do?
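To see the failure for yourself, here is a minimal sketch. The class name Unloaded_Widget is made up and deliberately never defined, so unserialize() has nothing to build the object from:

```php
<?php
// A serialized instance of a hypothetical class that is NOT defined in this script.
$payload = 'O:15:"Unloaded_Widget":1:{s:2:"id";i:7;}';

$object = unserialize($payload);

// PHP cannot find the class, so it hands back a placeholder object instead.
echo get_class($object), "\n";
```

The placeholder is an instance of __PHP_Incomplete_Class, which keeps the data but none of the behavior.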

Well the way to do this is built-in to PHP if you know where to look. The solution is __autoload() and unserialize_callback_func. For example, in the case of PEARified naming conventions, here is the code that does the magic:

function __autoload($class_name)
{
    // no need to waste time with include_once here.
    require(str_replace('_',DIRECTORY_SEPARATOR,$class_name).'.php');
}
ini_set('unserialize_callback_func','__autoload');

Now if you don’t know about __autoload() or unserialize_callback_func, this problem would seem pretty difficult. Since PHP has both, the problem is easy. But how would you know that it does? Since this solution can’t be in an external library, that’s a dead giveaway that it’s probably in the language, and a little looking would have caused you to stumble upon it. Make sense?
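A note for anyone reading this on a newer PHP: __autoload() was deprecated in PHP 7.2 and removed in PHP 8.0, and spl_autoload_register() is its replacement. unserialize() consults registered autoloaders by itself. Here is a rough sketch of the same PEARified idea under that assumption; the Cart_Item class and the temp directory are invented just so the example runs on its own:

```php
<?php
// Hypothetical setup: write a PEAR-style class file (Cart/Item.php) into a temp directory.
$root = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'autoload_demo_' . uniqid();
mkdir($root . DIRECTORY_SEPARATOR . 'Cart', 0777, true);
file_put_contents(
    $root . DIRECTORY_SEPARATOR . 'Cart' . DIRECTORY_SEPARATOR . 'Item.php',
    '<?php class Cart_Item { public $sku; }'
);

// Map Cart_Item to Cart/Item.php under $root; load it only if the file exists.
spl_autoload_register(function ($class_name) use ($root) {
    $file = $root . DIRECTORY_SEPARATOR
          . str_replace('_', DIRECTORY_SEPARATOR, $class_name) . '.php';
    if (is_file($file)) {
        require $file;
    }
});

// The class is not loaded yet; unserialize() triggers the autoloader.
$item = unserialize('O:9:"Cart_Item":1:{s:3:"sku";s:4:"A-42";}');
echo get_class($item), ' ', $item->sku, "\n";
```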

About the code: you can add some error checking or creating inactive stub classes with eval() if the class load fails, etc. (If you code in PHP 4, the class names are serialized in lower case. So if you wonder why all the classes I write are lower case, now you know!)
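The error checking and eval() stub idea could look something like the sketch below. The function name load_or_stub and the class Legacy_Thing are hypothetical, and the file lookup is relative to the current directory only to keep the example short:

```php
<?php
function load_or_stub($class_name)
{
    // Only accept plain identifiers so nothing unexpected reaches eval().
    if (!preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $class_name)) {
        return;
    }
    $file = str_replace('_', DIRECTORY_SEPARATOR, $class_name) . '.php';
    if (is_file($file)) {
        require $file;
        return;
    }
    // The class file is missing: define an inert stand-in so the
    // session can still be read.
    eval('class ' . $class_name . ' { public $is_stub = true; }');
}
ini_set('unserialize_callback_func', 'load_or_stub');

// Legacy_Thing has no class file here, so the callback falls back to the stub.
$object = unserialize('O:12:"Legacy_Thing":0:{}');
echo get_class($object), "\n";
```

Whether a silent stub is better than a loud failure depends on the application; this only shows that the hook is there.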

…and I did it my way.

Ahh! but then Frank points out a religious difference. He hates PEAR naming conventions. Personally, I love them because PHP has no namespacing and we make do with underscores, but whatever. At my company our legacy code has this exact same problem.

You can read his solution. Here is his approach:

Frank makes his classes that need to be serialized implement an interface (technically, it is subclassed from his base object). This object has a method (not __sleep) that serializes itself into a special object containing this data along with a string containing the class path. This object is included before session_start and reads the full path name to include the class definition just in time.

This approach violates my central tenet of looking for solutions in practices or functions. He closes himself off to built-in functions of __sleep() and __wakeup() because he is no longer serializing the class itself, he’s serializing a stub object. So what is the consequence?
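For reference, this is roughly what those two built-in hooks give you when you serialize the class itself. Report_Cache and its properties are invented for the illustration:

```php
<?php
class Report_Cache
{
    public $query;
    public $rows = array();
    public $connection; // stands in for something that cannot survive serialization

    function __construct($query)
    {
        $this->query = $query;
        $this->connection = 'open';
    }

    // Return the names of the properties that should be serialized.
    function __sleep()
    {
        return array('query', 'rows');
    }

    // Runs after unserialize() has restored the properties.
    function __wakeup()
    {
        $this->connection = 'reopened';
    }
}

$copy = unserialize(serialize(new Report_Cache('SELECT 1')));
echo $copy->query, ' / ', $copy->connection, "\n";
```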

This creates a parallel architecture for serialization and deserialization. It is a classic framework approach and why I hate them. Once you choose a framework, decisions like this force you to live inside the framework. You can’t use two frameworks, because in this case they would have conflicting ways of deserializing themselves. If you wanted to use a library from PEAR, you’d be forced to put an adapter pattern in front of it just to get the fucker to serialize.

Abstraction upon abstraction and the architecture spins out of control. If you’re willing to color inside the lines of the framework, that’s not a problem.

But I’m a contrarian and I never do. I already know PHP, why learn another language (templating) or architecture (framework) on top of that? Is this C++ with its STL, MFC, and Boost, or is this PHP, the language so simple that it doesn’t even support namespacing?

There is a reason you don’t hardcode

Frank correctly points out we don’t have the class path and can’t extrapolate the class path like PEAR.

Now since I have the PHP underscore religion, I’d say, “Well, edge cases like this are exactly why PEAR (and PHP itself) chose that naming convention.” But whatever. We’re not here to debate syntactical religion.

More to the point, not storing class paths with the serialized object is a Good Thing™. Since the session is most likely stored across servers (via the memcache best practice), storing class paths with the sessions means the directory architecture has to be shared across servers.

So what?

How about I propose an edge case: let’s say you have a pool of a thousand web servers and an error occurs on the live site. The error is a run-time error involving the interaction of the 20 independent server groups that make up your web application, and it can’t be reproduced in QA or any test environment. You don’t know where it is and business needs prevent taking down the server and handling this offline.

No problem, take one server on the pool and start hacking away, only 1 in 1000 requests get affected, and only if you screw up. If you’ve embedded hard paths into your code, it won’t be easy to guarantee that your parallel test installation on the live site will run your code. It might include code for another path.

Okay, maybe with some FollowSymLinks action in Apache you can work around this. But isn’t the fact that you’ve had to deal with this in a clever manner an indication that there will be other systemic stumbling points? It works for this case, but what about the next?

When PEAR is not a pomaceous fruit

So what would you do if you only have the class name and they’re not PEARified?

One approach is to store all the classes in the same directory or couple of directories in the include_path. That looks ugly, but I’ve seen a number of applications which do perfectly fine this way.
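A sketch of that first approach, assuming two flat directories (the names classes and vendor, and the Mailer class, are made up and created on the fly so the example is self-contained). It uses spl_autoload_register() so it also runs on PHP versions without __autoload():

```php
<?php
// Hypothetical layout: two flat directories of class files.
$base = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'flat_demo_' . uniqid();
$dirs = array(
    $base . DIRECTORY_SEPARATOR . 'classes',
    $base . DIRECTORY_SEPARATOR . 'vendor',
);
foreach ($dirs as $dir) {
    mkdir($dir, 0777, true);
}
file_put_contents(
    $dirs[1] . DIRECTORY_SEPARATOR . 'Mailer.php',
    '<?php class Mailer { public $to; }'
);

// Put both directories on the include_path; PHP searches them in order.
set_include_path(implode(PATH_SEPARATOR, $dirs) . PATH_SEPARATOR . get_include_path());

spl_autoload_register(function ($class_name) {
    $file = $class_name . '.php';
    // stream_resolve_include_path() returns false when no directory has the file.
    if (stream_resolve_include_path($file) !== false) {
        require $file;
    }
});

$mailer = unserialize('O:6:"Mailer":1:{s:2:"to";s:5:"frank";}');
echo get_class($mailer), "\n";
```

No path is stored anywhere except the include_path, so a parallel install only has to change that one setting.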

The other approach is to load a look up table. Here’s how it’s done:

function __autoload($class_name)
{
    static $map_table;
    if (empty($map_table)) {
        $map_table = include('class_map_table.php');
    }
    $class_name = strtolower($class_name); //make PHP 5 b.c. with PHP 4
    if (array_key_exists($class_name,$map_table)) {
        require($map_table[$class_name]);
        return;
    }
    trigger_error(sprintf('Cannot find class: %s',$class_name));
}

And class_map_table.php is a line of code that does Free Energy: it just returns the array hash look up table. If you never need this class it never loads the look up table into memory and it is easy to override.
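If the include-returns-a-value trick is new to you, here is a small sketch of generating such a table file and reading it back. The class names and relative paths in the map are hypothetical:

```php
<?php
// Hypothetical map: lower-cased class name => file that defines it
// (relative paths, resolved through the include_path).
$map = array(
    'mailer'   => 'lib/mail/Mailer.php',
    'cartitem' => 'shop/CartItem.php',
);

// Write the table out as a PHP file whose only statement is "return array(...);".
$table_file = tempnam(sys_get_temp_dir(), 'map');
file_put_contents($table_file, '<?php return ' . var_export($map, true) . ';');

// include evaluates to whatever the included file returns.
$map_table = include $table_file;
var_dump($map_table === $map);

unlink($table_file);
```

A build step could regenerate the file whenever classes move, which keeps the autoloader itself free of any path knowledge.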

“That doesn’t look very appealing.”

I’m really itching to say that it isn’t appealing because you’re not using PEARified conventions.

But really, I think the proper answer is to ask the question: “Why isn’t it appealing?”

Most likely the thought is that the former solution won’t work on a large site and the latter solution loads a “large array” on “every request” (actually, it doesn’t, but let’s say you’re using sessions on every request).

Is it really that large? My feeling is it isn’t and it’ll be code cached anyway.

Let’s say your memory footprint has to be small. What to do now? I don’t know, but personally I’d use the first part of my maxim: I’d push the problem to a different layer. In this case: APC. Here is some code that might work.

function __autoload($class_name)
{
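    // apc_fetch() returns false on a miss, and require(false) is a fatal
    // error that @ cannot suppress, so check before requiring.
    $path = apc_fetch($class_name);
    if ($path !== false) {
        require($path);
    }
    if (class_exists($class_name, false)) { return; }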
    trigger_error(sprintf('Class %s not stored in apc',$class_name));
}

Ahh, but how to get the look up table stored in APC? Well that’s your problem. One approach might be to overload the __sleep function of a given class that might be serialized:

function __sleep() {
    $class_name = get_class($this);
    apc_store($class_name,__FILE__);
    // __sleep() must return property names, not name => value pairs.
    return array_keys(get_object_vars($this));
}

__autoload() is teh suck

I’ll leave that for another article. Instead, I’ll note that the larger problem is the framework is serializing an object with a session. serialize() and unserialize() are slow and objects have a tendency to be abusive.

You might want to look at why you’re serializing an object with the session at all. If you want to scale, a lot of thought should be put in before serializing anything into the session.
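As a rough illustration of what is at stake, compare serializing a whole object with serializing only the key you need to look it up again. User_Profile and its fields are invented, and the byte counts will vary with your data:

```php
<?php
class User_Profile
{
    public $id;
    public $name;
    public $preferences = array();
    public $history = array();

    function __construct($id, $name)
    {
        $this->id = $id;
        $this->name = $name;
    }
}

$user = new User_Profile(42, 'terry');
$user->history = range(1, 500); // stands in for data that piles up over time

// What the session would carry in each case.
$whole   = serialize($user);
$minimal = serialize(array('user_id' => $user->id));

printf("whole object: %d bytes, id only: %d bytes\n", strlen($whole), strlen($minimal));
```

The smaller payload also needs no class definition at session_start(), which sidesteps the autoload question entirely.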

13 thoughts on “serialization without pity”

  1. Session management? Object serialization? I feel like I’m reading a Java blog…

    So the serialized objects get stringified and stuffed into a db or the filesystem? Is this replicated (memcache?)

    Kind of reminds me of the old ASP way of stuffing all that serialized goodness into a hidden field on the client.

  2. Michael

    The session extension in PHP was introduced in 2000 with PHP 4 to handle what before was done by a number of PHP userspace libraries (like PHPlib). It is just a library that handles creating and using a session token from a cookie (or GET string), using said token to deserialize some variables (stored in the $_SESSION superglobal), and garbage collection.

    If the functions are not overridden, the sessions are stored in the local filesystem usually in the /tmp directory. This is good enough for 95% of all cases out there. It can easily be changed to use a shared memory cache without writing a line of code.

    The old best practice for scalability was to put it in a database like Berkeley DB. There was a project that bound this to a callable volatile shared memory interface, but the introduction from the Perl community of memcache eliminated that. BTW, memcache is distributed, but it isn’t replicated in this case.

    There is an alternative best practice for high volume sites where the data is serialized right into the cookie on the client. This is fine as long as the programmer is aware of the size limitations, security, and bandwidth consequences of doing so.

    A lot of people use the session extension but don’t understand the consequences of doing so. They also serialize() and unserialize() in PHP but are unaware of how it works. In this article, I tried to explain some interesting aspects of the language and approaches.

    Any time you work with a framework for web development it will start to sound a little like a J2EE blog. That says more about frameworks for web development than Java or PHP. As a language PHP is a “great artist” (it steals from wherever and quite liberally).

  3. Terry,

    kickass article. There is an interesting corollary to designing using PEARified class paths that *can* be __autoloaded but don’t *need* to that we have worked out for PEAR2 that is much more APC/opcode cache-friendly, I’ll be blogging about that soon(ish) as well.

  4. “You might want to look at why you’re serializing an object with the session at all. If you want to scale, a lot of thought should be put in before serializing anything into the session”

    Doh. That’s what I keep suggesting to a lot of colleagues and friends.
    It really is hard to convince them, especially if they’re coming from Java-land, where not-an-object is a synonym of “evil”. And sessions are the basic building blocks of web applications.

    As for myself, I stick to
    – if an array can be used instead of an object, use an array. I love php arrays
    – never stick anything into sessions that you do not need to
    – do not pile up layers upon layers of classes that end up basically replicating the same functionality offered by the php api itself, only wrapped in objects
