The best advertising database in the world

It’s making the news now, and it really should have come as no surprise if you have been following along. It’s a reductio ad absurdum of the Administration’s “creative” interpretation of the Constitution.

If you parse the up-to-date denials and spin, you will see they’re not denying it. The only claim is that the profiles are anonymous. Their interpretation of the law seems to be that as long as it’s not tied to your Social Security number, no warrant is needed and the 4th Amendment is not violated. (There is this separation in their minds between “unreasonable search and seizure” and “probable cause” along the lines of “these are terrorists, this is war, it is not unreasonable to search these records without warrants in times of war.”)

I’m not interested in talking about the legality of that. Instead I want to think about two things: 1) Why the outrage now? 2) Even if you take the most restrictive definition of what that database does, how useful is it?

Why the outrage?

The interesting thing is why this has jumped from the front page of USA Today to the top of Google in just a few hours. After all, they’ve been doing this with airplane records, international phone calls, and nearly everything else.

I’d have to say it is not because it is “domestic” vs. “international”; it is because the database is 200 million Americans strong. That’s 2/3 of the United States. It is reasonable to assume you are on the list, that you have been tracked.

(Personally, I think the database is probably much larger and can cover pretty much every American with a phone. You only need AT&T, Verizon, or BellSouth (the sources of the leak) as one endpoint in the conversation before the profile can be built based on the caller ID on the other side. But this detail seems to have escaped the news reporters.)

Just like nobody cares about how many Iraqis have died in the war. Just like McNamara has to put what we did to Japan in terms of equivalent cities in the United States. Just like Bush’s approval is strongly tied to the price at the gas pump instead of the hundreds of billions spent, tens of thousands injured, and thousands of Americans who have died in this far-off adventure. No longer in America does anyone care, unless it is about them. The 80’s greed morphed into the 90’s rationalization of it to create a 00’s apathy: me = capitalist = capitalism = free markets = efficiency = beats the Commies = good.

You tell them, “The NSA has a database of every phone call you made, every e-mail you sent, and every website you visit.” Then people get worked up.

What a bunch of selfish fuckers we are.

Reconfiguring our DLLs…

The interesting thing is what you could do with this information, even if you accept the most restrictive version of what the government is saying and take the most conservative estimate of what they do with it.

Let’s assume all you do is track end to end phone numbers (not internet packet records) and calling records (times and amounts). Let’s assume it’s only for the 200 million people who are customers of Verizon, AT&T (SBC), and BellSouth.

This is what is claimed by Bush defenders who dismiss any outrage over this as coming from “privacy advocates and Democrats.” The money quote comes from the business magazine Forbes, which spins it: “The paper said the NSA wasn’t wiretapping the calls and listening to the content, but was compiling extensive lists of who called who, and when they called them.”

Let’s ignore the obvious business intelligence aspect of “Gee, it would be nice to know what other companies are vying for this business with that company. I wonder if having access to the phone call records of that business would help?” Let’s look at it purely like a business, shall we?

Even if it was anonymous and all you had were the call records, you could build a detailed and useful profile of where that phone number stands politically, who is connected to whom, and which political arguments would be most effective. Given the NSA’s computing power and talent, that shouldn’t be too much of a stretch. We’re talking about almost half the country’s computing power and the top graduates of America’s best math, science, and engineering schools here. (The former head of the NSA was from my school, Caltech, and they recruit heavily from it.)

Database smackdown…

Let us look at how internet business databases are structured and compare that to the NSA one. I will consider an advertising and search giant (Google), the “yet another social networking services” (MySpace, Facebook, LinkedIn, Friendster), and a contact management service (Plaxo), and see how they compare.

Google is building exactly such a profile to combat click fraud, but it’s not a very reliable one. While it covers your searching and browsing habits, it only covers them when you search on Google or the web site you visit uses AdSense, and only if you don’t delete your cookies and you eventually log into a Google service like GMail or GTalk. Otherwise the profile can’t be attached to you. That’s pretty limited given that nobody uses Google services beyond search.

Google makes so much money that all the latest tech news cycle could talk about was Microsoft vs. Google. (It is important to remember that Microsoft is a monopolist convicted of leveraging its monopoly, so that’s a pretty high aspiration.) Unlike Microsoft, they make all their money off only a single product: advertising. And click fraud is the chink in Google’s advertising armor. This is why Google rationalizes their “Don’t be evil” core value with a more relevant one: “organize the world’s information.”

Friendster, MySpace, LinkedIn, and FaceBook have connection profiles uniquely tied to the user. But they’re all in tiny verticals. For instance, LinkedIn does well among technology professionals and consultants, but not far beyond that. FaceBook expands instantly into any college in the country, but the security model doesn’t work as well for high schools and it doesn’t scale for connected companies that range from individuals to corporations. (The security model is tied to the domain of your e-mail address. They “scale out” by adding machines based on that domain. Colleges are relatively isolated communities so this works well, but it doesn’t “scale up.”)

And yet this is the best example of what you can do with the database. I might mention that MySpace was bought by NewsCorp for over half a billion dollars, that Friendster makes a lot of money, and LinkedIn and FaceBook are profitable. All based on advertising into those connected networks.

Plaxo has the connection profile, but it’s only for 10 million users. Plus if you read their privacy policy you can deduce that the data is so segmented that the connection profile is not stored uniquely like it would be for Google GMail, the social networking sites’ databases, or in this NSA database. In other words, they can’t do much with it unless you expressly allow them to because Plaxo is destined to grow only to the extent that they can be trusted to keep you in control of your data (the one thing the others don’t offer).

This means Plaxo has a nice niche (they serve you, not the advertisers) which allows them to be a Web 2.0 company (work with or as a plug-in to other web services and sites, instead of achieving all their value by locking out competitors from their network). But it’s not a database in the same sense as the NSA database.

(Disclaimer: I work for Plaxo. All opinions are my own… yadda yadda.)

Note that all these networks are relatively small, incomplete, and do not handle changes well.

…and the winnah is…

I’m calling it. The winner is the NSA.

That’s an amazing database. I’m really jealous. It’s like Google meets MySpace meets Plaxo only for every American.

We can reasonably assume that Sprint and the others have been roped into something similar. Even if they haven’t, you can still build the profile as long as one endpoint of the call belongs to a participating carrier. So the size of this database is not 200 million users, but everyone with a phone or internet connection.
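To make the one-endpoint argument concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `CallRecord` shape, the `visible_numbers` helper, the phone numbers, and the carrier assignments are all made up, and nothing here reflects how any real carrier or agency stores its records.

```python
from typing import NamedTuple


class CallRecord(NamedTuple):
    caller: str
    callee: str


def visible_numbers(records, carrier_of, cooperating):
    """Return every number that appears in a record a cooperating carrier can see."""
    seen = set()
    for rec in records:
        endpoints = (rec.caller, rec.callee)
        # One cooperating endpoint is enough to expose both sides of the call.
        if any(carrier_of.get(number) in cooperating for number in endpoints):
            seen.update(endpoints)
    return seen


# Hypothetical numbers and carrier assignments, for illustration only.
carrier_of = {
    "555-0001": "AT&T",
    "555-0002": "Sprint",
    "555-0003": "Sprint",
    "555-0004": "Verizon",
    "555-0005": "Sprint",
}
records = [
    CallRecord("555-0001", "555-0002"),  # AT&T customer calls a Sprint customer
    CallRecord("555-0003", "555-0004"),  # Sprint customer calls a Verizon customer
    CallRecord("555-0005", "555-0002"),  # Sprint to Sprint: nobody cooperating sees it
]
cooperating = {"AT&T", "Verizon", "BellSouth"}

exposed = visible_numbers(records, carrier_of, cooperating)
print(sorted(exposed))
```

In this toy data, two of the three Sprint numbers end up in the set anyway. The third stays out only because its single call never touches a cooperating carrier.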

The connection profile is really complete because they can use the phone number, or an IP address mapped to a phone number (SBC DSL), as the UID to build the profile. So anyone you have called or e-mailed, and any site you’ve browsed to, becomes an input to your profile.

It is consistent because the company does the work for the NSA. Heck you can change phone providers but your phone number stays the same. Sure there is some double counting (say you have two phones), but statistics solves that one and your mobile phone network, your home phone network, and your work phone network are not walled off from each other—your contacts are spread non-exclusively among these phones.
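The post doesn’t say which statistics would sort out the double counting, so the following is only one plausible sketch: compare the contact sets of two numbers with Jaccard similarity and flag pairs that overlap heavily. The `jaccard` and `likely_same_owner` helpers, the 0.5 threshold, and the sample contacts are all assumptions picked for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of contacts two numbers have in common (0.0 means none, 1.0 means identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def likely_same_owner(contacts, threshold=0.5):
    """Return pairs of numbers whose contact sets overlap at least `threshold`."""
    pairs = []
    for left, right in combinations(sorted(contacts), 2):
        if jaccard(contacts[left], contacts[right]) >= threshold:
            pairs.append((left, right))
    return pairs


# Hypothetical contact sets keyed by phone line.
contacts = {
    "home": {"mom", "boss", "pizza", "friend"},
    "mobile": {"mom", "boss", "friend", "gym"},
    "stranger": {"dentist", "pizza"},
}

merged = likely_same_owner(contacts)
print(merged)
```

The home and mobile lines share three of five distinct contacts, so they clear the threshold. The stranger’s line shares only the pizza place with one of them and does not.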

Their database contains the largest number of users, the most complete set of connections, and is consistent.

We haven’t even examined the database in terms of say, which pr0n sites you have visited, or if you ever made a disparaging comment on George Bush. We don’t really need to. Your network connections have revealed the profile better than this. That’s how MySpace and the others make money. We add to it your actual habits (a la Google) and the NSA’s “journey to the Dark Side will be complete.”

After all, do you really think the NSA cares about COPA? :-)

Business homomorphism

One of the interesting things about business is how the practices and solutions of one business can be mapped onto another. Mathematicians call this “homomorphism.”

Say you have a problem like “I want to show related products.” You build a database based on the “SKU” numbers of the products, “who bought what,” and “how many.” You then write a statistical algorithm to determine relevance. That’s how Amazon shows you recommended products.

Now say “SKU” is “websites,” “who bought what” is “who links to whom,” and “how many” is “how many links.” You have the basics of how Google delivers relevant search results.

Take that same thing, and then say “SKU” is strings in your e-mail headers or body, “who bought what” is “marking e-mails with that string as spam” and “how many” is each piece of mail with that string. You have a junk mail filter.

We can say that the current business solutions to “determine related products”, “find relevant websites” and “filter junk mail” are “homomorphic.”
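A rough Python sketch of that mapping follows. The `related` function and its co-occurrence scoring are assumptions chosen for brevity, not the algorithm Amazon or Google actually uses, and the sample data is invented. The point is only that the same function runs unchanged on both kinds of records: each one is an (actor, item, count) triple.

```python
from collections import defaultdict


def related(events, target, top=3):
    """Rank items by how strongly they co-occur with `target` across actors."""
    by_actor = defaultdict(dict)
    for actor, item, count in events:
        by_actor[actor][item] = by_actor[actor].get(item, 0) + count

    scores = defaultdict(int)
    for items in by_actor.values():
        if target not in items:
            continue
        for other, count in items.items():
            if other != target:
                # Weight by how much this actor engaged with both items.
                scores[other] += items[target] * count
    # Highest score first; ties broken alphabetically so the result is stable.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top]


# "SKU", "who bought what", "how many" (hypothetical data)
purchases = [
    ("alice", "book", 1), ("alice", "lamp", 2),
    ("bob", "book", 1), ("bob", "pen", 1),
    ("carol", "lamp", 1),
]
# "website", "who links to whom", "how many links" (hypothetical data)
links = [
    ("blogA", "siteX", 3), ("blogA", "siteY", 1),
    ("blogB", "siteX", 1), ("blogB", "siteZ", 2),
]

related_products = related(purchases, "book")
related_sites = related(links, "siteX")
print(related_products, related_sites)
```

A junk mail filter would want a different scoring rule, but it fits the same shape: strings in place of SKUs, and spam markings in place of purchases.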

“Gotta do evil…”

Now take the same thing. Your “product” is a product like the latest Harry Potter movie. Your goal is to sell your product to people who will actually buy it (a certain demographic of children and fantasy lovers) without wasting resources on those who won’t. You nail down the demographics of a few people in the network (both Harry Potter lovers and haters) and then you use this database to figure out which people might be interested in the Harry Potter movie, even if you don’t know for sure that they are. (The network tells you they are.)

This is how FaceBook, Friendster, and MySpace make money.
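One simple way to picture “nail down a few people and let the network tell you the rest” is score propagation over a graph. This sketch is an assumption, not a description of what any of these sites run: the `propagate` helper, the averaging rule, the names, and the graph are all invented for illustration. Known lovers are pinned at +1.0, known haters at -1.0, and everyone else repeatedly takes the average of their neighbors.

```python
def propagate(graph, seeds, rounds=10):
    """Spread seed scores through the network by repeated neighbor averaging."""
    scores = {node: seeds.get(node, 0.0) for node in graph}
    for _ in range(rounds):
        updated = {}
        for node, neighbors in graph.items():
            if node in seeds:
                updated[node] = seeds[node]  # known people stay pinned
            elif neighbors:
                updated[node] = sum(scores[n] for n in neighbors) / len(neighbors)
            else:
                updated[node] = 0.0  # nobody to learn from
        scores = updated
    return scores


# Hypothetical network: who is connected to whom.
graph = {
    "ann": ["bea", "cal"],
    "bea": ["ann", "cal"],
    "cal": ["ann", "bea", "dov"],
    "dov": ["cal", "eve"],
    "eve": ["dov"],
}
# The few people whose tastes we actually know.
seeds = {"ann": 1.0, "eve": -1.0}

scores = propagate(graph, seeds)
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: {scores[name]:+.2f}")
```

Nobody asked bea, cal, or dov anything. Their scores come entirely from where they sit between a known lover and a known hater, which is the whole trick.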

Now you do the same thing where your “product” is “click frauders” and your goal is to nail them. So you use your network, nail down a few points, and then identify potential statistical aberrations: “That person is most likely a child, and all of a sudden they’re clicking on Google AdWords for asbestos lawyers.”

That is why Google is building this database.

These databases are “homomorphic” with one another.

Even the most conservative estimate would have to admit NSA’s database is the best advertising database in the world. Besides mundane stuff like what Google and YASNSs do with it, what can you do with it?

Let’s build some more homomorphisms:

Let’s start innocently. Say your goal is to find out who the terrorists are. Your “product” is “people who might be influenced to become terrorists or aid terrorists,” and your goal is to use this database to figure out how they might “sell this product.” You nail down the known terrorists, use this network to see how that influence might “sell,” and you have a map of “the terrorist network.”

Say your “product” is “Tell people we plan on building a wall and deporting all ‘the Mexicans.’” Your goal is to get them to “buy your product” (“go vote for you to make up for the reality that the majority doesn’t like you”). And you have the NSA’s “advertising database”…

You want to test a talking point or a new frame? Why bother polling? You can inject the right frames to the right people to influence them without much fallout from those people to “the terrorists” (liberals).

I am reminded of a funny Skeletor mashup that Caitlin pointed me to. Skeletor says: “Gotta do evil… Gotta do evil… Hmm, have I killed any puppies today?”

Shorter Terry Chay

You don’t need to know anything about the person in the traditional sense. Who you are connected to (your network) reveals your personal preferences much better than anything else. Businesses today make a lot of money advertising onto database networks like this, but business databases are small, incomplete, and unreliable. There are a lot of (political) issues out there that, like a business problem, are mappable to “advertising a product for purchase.”

4 thoughts on “The best advertising database in the world”

  1. There was some discussion on the Balloon Juice article I linked about whether this was “the largest database.” Here is my response:

    I think it may be the biggest database, not in terms of number of records, but in terms of the likelihood that a person’s social network can be mined from it. Given that it is the combined call records of nearly every phone in the country, it is highly unlikely that there is a person working in America who doesn’t make some statistical contribution to that database. I don’t think even Walmart’s vaunted data warehouse can claim that.

    Plus, you can’t mine the connection information (who knows who) from B&N or Walmart’s database. That’s where the rubber meets the road with this database.

    This is why, even though Total Information Awareness is actually an interesting academic idea, it was destroyed by both sides of the aisle and ended it for John Poindexter, something even Iran-Contra wasn’t able to do.

    Take a look at what happened with TIA and you can see why this issue has political legs.

  2. The phone call records could be useful for blackmail. On the bright side, I guess they won’t need any more deals like Iran-Contra to raise revenue.
