(This was given mostly by David Recordon, followed by Mike Schroepfer, and one other)
Facebook is big.
They stated they are the #2 site on the internet with 350 billion monthly page views.
This works out to about 350 million active users, 4 trillion feed actions, and 1 million application developers. Or, put another way, they are the largest photo-sharing site, with 35 billion photos (each stored at four different resolutions), serving an average of 1.2 million photos per second to their user base.
Let’s look at some optimizations they’ve done to the engineering architecture to reach these heights.
In the case of photos, a problem they’re very proud to showcase, the scale alone makes it hard. First, the photos are multi-homed (meaning copies have to live in both the West Coast and East Coast datacenters), and I’ve already outlined the traffic for those photos above.
Given those numbers, the sheer count and variety of images means you can’t depend on a CDN alone to solve the problem.
So, Facebook invented a new way of retrieving photos they call Haystack. To see why, consider this table:

| Method | I/Os per photo |
|--------|----------------|
For reference, almost all websites you know just keep their images on regular disks. If they hit scaling issues, most of them use NFS to scale those disks out. If they run into issues with that, as Tagged did, you buy a costly NetApp appliance and then spend your time optimizing the file structure to get it running at top speed. In Facebook’s case, you write your own. Since the number of photos, not their size, is the bottleneck here, that second column is effectively the inverse of the upper bound on speed per unit of hardware: fewer I/Os per photo means more photos served per disk. Haystack has the added benefit of being its own webserver, so no separate hardware needs to be allocated just to serve the images.
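The core idea of Haystack, as described publicly, is to hold the entire photo index (photo id to offset and size within a handful of huge append-only store files) in RAM, so that serving any photo costs a single disk read. Here is a minimal sketch of that lookup path; the class, field names, and file layout are my own illustration, not Facebook’s code:

```php
<?php
// A minimal sketch of the Haystack idea: keep the photo index entirely in RAM
// so a read costs exactly one disk seek into a large append-only store file.
// Class, fields, and layout here are illustrative, not Facebook's actual format.
class HaystackStore
{
    private $handle;          // handle to the big append-only store file
    private $index = array(); // photo id => array(offset, size), held in memory

    public function __construct($storePath, array $index)
    {
        $this->handle = fopen($storePath, 'rb');
        $this->index  = $index;  // loaded once at startup from a small index file
    }

    public function read($photoId)
    {
        if (!isset($this->index[$photoId])) {
            return null;                     // unknown photo: zero disk I/O
        }
        list($offset, $size) = $this->index[$photoId];
        fseek($this->handle, $offset);       // one seek ...
        return fread($this->handle, $size);  // ... one read, and we have the image bytes
    }
}
```

Compare that with a conventional POSIX filesystem, where just resolving the filename to an inode can cost several metadata reads before you ever touch the image data.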
“The multi-dimensional social graph.”
Facebook is famous for the News Feed: a scroll of data showing what you and your friends are up to. When you think about what this takes from an engineering perspective, you realize the difficulty comes from how interconnected the data is. The estimate they gave: if 1% of their users are active on the site at a given moment, then 90% of the user dataset needs to be available, because users are interconnected (think of displaying thumbnails and links to friends as users browse profiles).
They call that last bit the “multi-dimensional social graph,” which is Facebook buzzword coinage meaning: in social networks, people interact with other people’s data (not just with their own data, or directly with each other).
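To make that fan-out concrete, here is a rough sketch of what one profile-page render can look like against memcache, where a single viewer’s request turns into reads of many other users’ records. The key scheme is made up, and get_friend_ids() and load_profile_from_db() are hypothetical helpers:

```php
<?php
// Illustrative only: one user's page view touches many *other* users' data
// (names, thumbnails), which is why so much of the dataset has to stay hot.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$friendIds = get_friend_ids($userId);   // hypothetical helper

// Build one cache key per friend and fetch them all in a single round trip.
$keys = array();
foreach ($friendIds as $id) {
    $keys[] = "user:$id:profile";
}
$profiles = $cache->getMulti($keys);
if ($profiles === false) {
    $profiles = array();
}

foreach ($friendIds as $id) {
    if (!isset($profiles["user:$id:profile"])) {
        // every miss here is a database hit for someone else's row
        $profiles["user:$id:profile"] = load_profile_from_db($id);  // hypothetical helper
    }
}
```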
This “multi-dimensional” problem exists for all social networks; it’s the nature of the business. At Tagged, I got around it with a simplified privacy model, by minimizing and caching this sort of computation, and through other shortcuts.
“Nowadays, disk is the new tape.”
I’ve said this many times over the years: without memcache, we’d be dead. The analogy I like to use is to think of a modern website as one really big personal computer. In this architecture, PHP is your processor, gluing functionality together. Now look at your memory system: disk is the database, the in-processor L1 cache is PHP process RAM, and the L2 cache is the APC user cache. In this model, your RAM is memcache.
Facebook uses a slightly different analogy: “Disk is the new tape.”
What this means is the same thing: your database is disk-bound and slow. You want to use it, as much as possible, only for archival purposes. For active use, you need something else. For Facebook, the problem is especially difficult: social networks need a huge chunk of “RAM” active, as mentioned above, and cache misses mean going to the database (disk in my analogy, tape in theirs), which is slow.
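To put the analogy in code, here is a sketch of what a tiered read looks like under that model: check the per-webserver APC “L2” first, then the shared memcache “RAM”, and only fall back to the database “tape” on a full miss. load_from_db() is a hypothetical stand-in for the real query, and the TTLs are arbitrary:

```php
<?php
// A sketch of the memory-hierarchy analogy: APC (per-webserver "L2"),
// memcache ("main RAM"), database ("disk, i.e. the new tape").
// load_from_db() is a hypothetical stand-in for the real query.
function cached_fetch(Memcached $mc, $key, $ttl = 300)
{
    $value = apc_fetch($key, $hit);      // "L2": local to this web server
    if ($hit) {
        return $value;
    }

    $value = $mc->get($key);             // "RAM": the shared memcache tier
    if ($value !== false) {
        apc_store($key, $value, $ttl);
        return $value;
    }

    $value = load_from_db($key);         // "tape": only on a full miss
    $mc->set($key, $value, $ttl);
    apc_store($key, $value, $ttl);
    return $value;
}
```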
To give you an idea of how big the store is: to get a 98% hit rate on their memcache, they need 700 machines, which works out to about 40 terabytes of RAM, probably the largest single memcache store around. For reference, a large social network like Tagged uses only half a terabyte of memcache, and many other startups can’t even afford that, since the cloud hosting they use bills by the megabyte and offers no guaranteed minimum latency between slices.
This means that their memcache has to perform more than 100 million operations per second.
There are a number of ways they’ve done this, many outlined by Facebook on their developers blog. Here is a list:
- 64-bit port of memcache
- a more efficient serialization routine. (PHP’s native serializer is a text format that is slow and produces an inefficient size for network traffic. Facebook’s version is bundled with the open-source Thrift; look for fb_thrift_unserialize. [Ed: a friend noted that fb_thrift_unserialize is Facebook’s internal name for the serializer. A similar serializer also forms the basis of the Thrift protocol library.] See the size sketch after this list.)
- a multithreaded version of memcache. (Before this was added to the memcache core, most sites would run as many instances as they had CPU cores, each with a slice of the RAM.)
- improvements to memcache protocol
- new network drivers for the machines. (These are custom network drivers written for each network card and machine combination.)
- network stack optimizations. (One famous one: the ethernet driver in the high-end Intel servers had a bug where only one core could do the networking.)
- UDP support in memcache. (TCP/IP limits the number of simultaneous connections to around 250,000. If you have multiple webservers accessing multiple memcaches with multiple processes keeping connections alive, you can hit that number. This is often implemented incorrectly elsewhere as “fire-and-forget,” since the people writing memcache clients don’t understand why it was added. Tagged took shortcuts that handle 90% of cases, but Facebook’s is a full two-way UDP communication layer for all dataset sizes, where packets may need to be reordered on both sides.)
- a memcache proxy layer
- many other system optimizations.
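The serialization point is easy to demonstrate for yourself (this is the sketch referenced in that list item): PHP’s native serializer emits verbose text, while even a naive binary packing of the same fields is a fraction of the size on the wire. This is not Facebook’s Thrift-based serializer, just an illustration of the gap:

```php
<?php
// Rough illustration of why a text serializer is costly on the wire.
// This is not Facebook's fb_thrift serializer, just a size comparison.
$row = array('uid' => 123456789, 'friend_count' => 1500, 'is_online' => true);

$text = serialize($row);   // PHP's native text format, e.g.
// a:3:{s:3:"uid";i:123456789;s:12:"friend_count";i:1500;s:9:"is_online";b:1;}

// A hand-rolled binary packing of the same three fields (two uint32s + one byte)
$binary = pack('NNC', $row['uid'], $row['friend_count'], (int) $row['is_online']);

printf("text: %d bytes, binary: %d bytes\n", strlen($text), strlen($binary));
```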
Here’s a table:
| Optimization | Requests per second |
|--------------|---------------------|
| batching packet handling | 250,000 |
(For a better discussion of this and other caching optimizations, please see Lucas Nealan’s 2009 OsCon talk.)
Facebook chose MySQL because it is simple, fast, and reliable. It is their “tape backup,” which means they run MySQL across 6,000 machines without data loss. For you database nerds, they are using MySQL 5.0.84 on a Percona build with some custom patches. [Ed: Per correction below, this is incorrect. You can check this website to see Facebook’s patches to MySQL.] The filesystem is XFS or ext3, depending on the machine (they’re migrating to XFS).
As a company, they really haven’t focused much on database optimizations until recently. But here is a list of what they’ve learnt:
- Logical migration of data is very difficult. (This is a known weakness of MySQL, actually.)
- Create a large number of logical dbs and load-balance them over a varying number of physical nodes. (This is db-speak for the fact that they are scaled both horizontally and vertically; see the sketch after this list.)
- JOINs in production are logically difficult because data is distributed randomly. (This means they are horizontally partitioning the data.)
- Data-driven schemas make for happy programmers and difficult operations.
- Don’t ever store non-static data in a central db. (In other words, if any data is updated and not put in a partitioned data store, it’s asking to break as you grow.)
- Use services or memcache for global queries. (Don’t do any COUNT(*) or other queries that have to go across every node in a horizontal partition; they won’t finish in time on a busy site.)
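Here is the sketch referenced in the list above, showing one common way to realize “many logical dbs over fewer physical nodes”: user ids map to a fixed number of logical databases, and a separate map decides which physical host currently holds each logical db, so rebalancing moves whole logical dbs instead of re-hashing keys. The constants and hostnames are made up, and this is a generic sharding pattern rather than Facebook’s actual scheme:

```php
<?php
// Sketch of "many logical databases, load-balanced over physical nodes".
// The constants, hostnames, and mapping policy are illustrative only.
define('NUM_LOGICAL_DBS', 4096);   // fixed forever, so a user's logical db never changes

// Build the logical-db => physical-host map. Rebalancing means editing this
// map (moving whole logical dbs between hosts), not re-hashing user ids.
$hosts = array('db001.example.com', 'db002.example.com', 'db003.example.com');
$logicalToPhysical = array();
for ($db = 0; $db < NUM_LOGICAL_DBS; $db++) {
    $logicalToPhysical[$db] = $hosts[$db % count($hosts)];
}

function logical_db_for_user($userId)
{
    return $userId % NUM_LOGICAL_DBS;
}

function physical_host_for_user($userId, array $map)
{
    return $map[logical_db_for_user($userId)];
}

// route a user id to its current physical host
echo physical_host_for_user(123456789, $logicalToPhysical), "\n";
```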
That’s not much we didn’t know already. I did ask, since they are partitioned up the wazoo and doing no JOINs, why they bother using a relational database at all. The answer I got was that they already had the tools for cluster management, backups, replication, data consistency, etc. built around MySQL.
I think the real reason is legacy (Facebook was built in a dorm room): it got too big, too late, for people to start considering anything other than MySQL. Scaling may have been done on the database, but optimizations were done elsewhere to reduce database dependence. For better or for worse, the whole history basically reeks of the PHP development style, where the database is a commodity.
You can see how the technologies mentioned above are put together when scaling across disparate data centers.
Facebook started with a single datacenter in Santa Clara. When its power footprint filled up, they added a datacenter in San Francisco. Because these two places are physically close, latency is low, and a memcache proxy can be used to make sure dirty caches are updated simultaneously in both locations.
The issue arose when a second, physically distant datacenter was built, known as ECDC (East Coast Data Center). In that case, the latency over the network is high enough that a local memcached must be installed, and it can be corrupted by races if it is cleared via the proxy. The trick Facebook came up with is adding a new SQL statement to their spec, MEMCACHE DIRTY, which takes a list of keys that are dirty. When an application developer updates the database, they include a dirty request along with all the keys that depend on that row. This request gets replicated, using native MySQL replication across dark fibre, to the other datacenter. The dirty event is read by MySQL on the other side and passed to a local memcache proxy, which then clears those keys if they exist on the local memcached there; the memcacheds in the originating datacenter have already been updated by their own local proxies.
What this means is that Facebook cannot implement a write-through cache, but instead leverages the database (and cluster controls) as the arbiter of data consistency. When an object is updated, it is simply cleared from cache. The next request for that object asks the local database for the new data, and the structure is rebuilt in memcache. Thus they trade off some speed for consistency (the C in ACID).
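The application-side shape of that write path looks roughly like the sketch below: write the row, then delete (never overwrite) every cache key derived from it, and let the next reader rebuild the entry from its local database. update_user_row() is a hypothetical stand-in, the key names are made up, and the MEMCACHE DIRTY replication hook itself lives in MySQL and the proxies, so it isn’t shown:

```php
<?php
// Sketch of the delete-on-write pattern described above: the database stays
// the arbiter of truth, and the cache is cleared rather than written through.
// update_user_row() is a hypothetical stand-in for the real SQL update.
function update_profile(Memcached $mc, $userId, array $fields)
{
    update_user_row($userId, $fields);        // 1. write the row to MySQL

    // 2. clear every cache key derived from that row; do NOT set new values,
    //    or a reader racing a lagging replica could re-cache stale data
    $mc->delete("user:$userId:profile");
    $mc->delete("user:$userId:feed");
}
// The next read misses, queries the (now-consistent) local database, and
// rebuilds the cache entry, as described in the paragraph above.
```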
Given this, I, personally, am very impressed with their stated 98% cache hit rate.
In any case, the example of this planning that they like to mention was the “Facebook Username Land Rush” that occurred last year. Here is an engineering discussion you might be interested in.
Open source at Facebook
Facebook is in the process of formalizing their open-source initiatives. Here is a list of projects they work on (or have created):
- PHP. Especially APC.
- Thrift: allows remote procedure calls in a cross-language manner. This started as a Facebook project but is now an Apache one.
- Cassandra: a distributed data store. (Think of it as Napster-style peer-to-peer sharing, but applied to a data store instead of files.)
- scribe: a distributed logging service.
- xhprof: a PHP profiling tool that can be used in production. It’s also the source of most of the benchmarks they quote.
- Tornado. Has something to do with Python so I’m not interested. 😀
- Hive. Allows you to query a Hadoop cluster like you would a database.
- HipHop. What this article is about. 🙂