At Tagged, I’ve been working on a caching project for the last month. Basically, it lets us cache the dynamic parts of pages throughout the site while keeping the caches free of stale data. Since we call our APIs both externally and internally (over 20 calls go into generating a user’s profile page alone), the caching is turned on at the API level.
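The general shape of it looks something like the sketch below (a simplified illustration, not our actual code; the function and cache names are made up):

```python
# Minimal sketch of API-level caching with explicit invalidation.
# Names (cache_store, cached_call, invalidate, fetch_profile) are hypothetical.

cache_store = {}  # stand-in for a shared cache such as memcached or APC

def cached_call(api_name, args, fetch):
    """Return the cached result of an API call, computing it only on a miss."""
    key = (api_name, args)
    if key not in cache_store:
        cache_store[key] = fetch(*args)  # hit the backend only when the cache is cold
    return cache_store[key]

def invalidate(api_name, args):
    """Drop a cached entry when the underlying data changes, so pages never render stale data."""
    cache_store.pop((api_name, args), None)

def fetch_profile(user_id):
    """Stand-in for one of the 20+ backend calls behind a profile page."""
    return {"user_id": user_id, "about_me": "..."}

profile = cached_call("profile.get", (12345,), fetch_profile)  # first call hits the backend
profile = cached_call("profile.get", (12345,), fetch_profile)  # second call comes from cache
invalidate("profile.get", (12345,))  # e.g. after the user edits their profile
```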
The initial hit to the system was severe.
Basically, I was causing the profile page to take 30 extra milliseconds to render. The profile page accounts for 17% of all page views on a site where we do about 220 million “views” a day. That 30 milliseconds, believe it or not, was dropping profile page views by 5%. It took about 20 hours of back and forth before I finally resolved to rewrite the whole thing so that it could be rolled out gradually without any performance hit.
That means I cost the company one and a half million ad impressions. 🙁
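For what it’s worth, the back-of-the-envelope math behind that figure, using the numbers above:

```python
# Rough back-of-the-envelope math using the figures above.
daily_views = 220_000_000                     # total page views per day
profile_share = 0.17                          # profile page's share of all views
profile_views = daily_views * profile_share   # ~37.4 million profile views/day

drop = 0.05                                   # the 5% drop caused by the extra 30 ms
lost_views = profile_views * drop             # ~1.9 million profile views/day
print(f"~{lost_views / 1e6:.1f} million profile views lost per day")
```

That works out to roughly 1.9 million profile views a day, the same ballpark as the impression figure above.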
The piece I cached on the second day is actually measurable on the live site with tools I wrote: I’m saving between 17 and 900 milliseconds, depending on the state of the backend and the load on the server.
Since I can’t measure how many actual page views I’m adding back into the system, I was curious about how much extra server capacity these changes are getting me.
In other words, how much is a millisecond?
If I save a millisecond on the profile page, that means I save about 37,400,000 CPU-process-milliseconds over the course of the day, or about half a CPU-process-day. During peak hours we run about 50 processes per machine.
This means a millisecond saved is worth at least a hundredth of a machine at capacity.
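Spelling out that arithmetic (same figures as above):

```python
# What one saved millisecond on the profile page is worth, per the figures above.
profile_views_per_day = 220_000_000 * 0.17          # ~37.4 million profile views/day
ms_saved = profile_views_per_day * 1                # 1 ms saved per view -> ~37.4M process-ms/day

process_days = ms_saved / (1000 * 60 * 60 * 24)     # ~0.43 CPU-process-days
processes_per_machine = 50                          # at peak
machines = process_days / processes_per_machine     # ~0.009 of a machine

print(f"1 ms saved is worth ~{machines:.3f} machines of peak capacity per day")
```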
I estimate the new caching system, activated slowly, will be like adding five machines to the server pool a day—adding capacity about 5x faster than our growth rate.
I have enough caching projects to keep doing this for the rest of the month. 🙂
Suddenly I feel the need to go in the other room and install apc…
@Giorgio Sironi: What’s scary is that this API is called 1,100 times a second, and we can actually see a performance boost of 30 ms, even though not all the calls in this grouping are cached yet. 🙂
The real capacity savings come from the ancillary costs per server, like HVAC, data center space, power consumption, and (often most importantly) operational cost (personnel). We usually estimate $2800/server, so 5 machines would be just $14K, but the true cost of those machines is much greater.
@Michael: Good point; I’d add bandwidth costs to that. I didn’t frame it that way because our traffic is growing 1-5% a week, so this pushes out those purchases and costs rather than producing outright savings.
A big difference in cost also comes from the revenue generated or lost by a slightly faster or slower response time. As you can see, response time and traffic seem to be strongly correlated. Just caching two boxes on the profile page has given us a daily gain of 5% in traffic to that page (though we can’t be sure it wasn’t something else in the release).
I did some testing on production today with some user profiles. It seems I’m saving an average of 313 ms per profile page view. Not bad.
BTW, the API system currently does between 4,000 and 7,000 queries per second (there are more actual transactions, since the API system can be called internally and calls can be multiplexed). The profile page is viewed between 400 and 600 times per second (there are more actual profile views because, in miniview mode, a user can browse multiple profiles via Ajax).
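Putting those figures together with the earlier peak-capacity numbers, very roughly (taking 500 profile views per second as a midpoint):

```python
# Rough combination of the figures above: 313 ms saved per profile view,
# ~400-600 profile views/second, ~50 processes per machine at peak.
ms_saved_per_view = 313
views_per_second = 500                    # midpoint of 400-600
processes_per_machine = 50

# Process-seconds of work avoided per wall-clock second, i.e. busy processes freed up.
processes_freed = ms_saved_per_view / 1000 * views_per_second   # ~156 processes
machines_freed = processes_freed / processes_per_machine        # ~3 machines at peak

print(f"~{processes_freed:.0f} processes, or ~{machines_freed:.1f} machines, freed at peak")
```

In other words, the profile-page savings alone are worth on the order of a few machines of peak capacity.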