I'm not actually trying to imply that this is what PRISM does (no one has made that claim). I'm just saying that, at government scale, storing every voice call ever made, forever, is not even particularly expensive.
So let's add bandwidth: the most expensive estimate I've seen is $0.019/GB <http://blogs.howstuffworks.com/2011/04/07/what-does-a-gigaby.... Let's assume the original audio is captured using G.711 (64 kbps). So that's 438 * 10^9 minutes * 60 seconds/minute * 64000 bits/second / (8 bits/byte * 1024^3 bytes/GB) * $0.019/GB = $3.72mln.
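For anyone who wants to check the arithmetic, here's the same calculation as a few lines of Python. The inputs are just the assumptions stated above (438 * 10^9 call minutes, G.711 at 64 kbps, $0.019/GB), not measured values:

    # Back-of-envelope check of the bandwidth cost above.
    call_minutes = 438e9
    bitrate_bps = 64_000                      # G.711
    total_bytes = call_minutes * 60 * bitrate_bps / 8
    gigabytes = total_bytes / 1024**3
    print(f"{gigabytes:.2e} GB -> ${gigabytes * 0.019 / 1e6:.2f}mln")
    # 1.96e+08 GB -> $3.72mln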
Facility: The NSA's Utah facility is projected to cost $1.5–$2bln <http://en.wikipedia.org/wiki/Utah_Data_Center> and will contain 100,000 square feet of data center space <http://nsa.gov1.info/utah-data-center/>. A 42U rack takes about 7 square feet; assume 25% floor occupancy and 1 PB per rack. That's $2bln/facility / 100,000 ft^2/facility * (7/0.25) ft^2/PB * 40 PB = $22.4mln.
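Same sanity check in Python; the 1 PB/rack and 25% occupancy figures are the assumptions from the paragraph above, not facility specs:

    # Rough check of the facility amortization.
    facility_cost = 2e9
    floor_ft2 = 100_000
    cost_per_ft2 = facility_cost / floor_ft2      # $20,000 per ft^2
    ft2_per_pb = 7 / 0.25                         # 28 ft^2 of floor per PB (1 PB per 42U rack)
    print(f"${cost_per_ft2 * ft2_per_pb * 40 / 1e6:.1f}mln for 40 PB")
    # $22.4mln for 40 PB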
I don't have a good estimate of the personnel involved, but I doubt it'd be out of the ballpark of the other numbers here. You could have every rack maintained and operated by its own PhD-level researcher for less than $10mln/year including all overhead and benefits (40 racks at roughly $250k/year fully loaded).
Commercial speech compression algorithms are hamstrung by the need to add only milliseconds of delay: they can only compress over a 'window' of tens of milliseconds. You could almost certainly do a much better job compressing speech in batches of minutes or tens of minutes: there is far more redundancy to remove. So if the spooks wanted to store massive amounts of speech data, they may have invested in such algorithms.
Storing voice (audio) data is not what the article describes. I'd imagine you transcribe the audio to text and search that. Storing text is incredibly easy. Besides, you can throw away 99.9% of the data almost immediately.
I'm actually curious how much text data this would be per day; number of call minutes * average number of words per minute. I'd be surprised if that wouldn't fit in a reasonable cluster.
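A quick sketch of that estimate, assuming the 438 * 10^9 call minutes upthread is an annual figure, speech runs about 150 words/minute, and a word of plain text plus whitespace averages about 6 bytes (all assumptions, not figures from the article):

    # Hypothetical sizing of daily transcript volume.
    minutes_per_day = 438e9 / 365
    bytes_per_day = minutes_per_day * 150 * 6
    print(f"{bytes_per_day / 1e12:.1f} TB of raw transcript text per day")
    # 1.1 TB of raw transcript text per day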
You underestimate the CPU power needed to do this. The Netherlands has a population of 16 million; by comparison, Google Voice has about 1.4 million users. That is an order of magnitude difference. On top of that, Google only transcribes voicemail, not all calls. What is the ratio of call minutes to voicemail minutes?
Transcribing all voice calls to text in the Netherlands could easily be two orders of magnitude more computationally demanding than what Google Voice does.
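To make that concrete, a rough illustration; the calls-to-voicemail minutes ratio here is a guess purely for illustration:

    # Rough illustration of the scaling argument.
    nl_population = 16e6
    google_voice_users = 1.4e6
    calls_to_voicemail_ratio = 20                 # assumed
    scale = (nl_population / google_voice_users) * calls_to_voicemail_ratio
    print(f"~{scale:.0f}x the transcription workload of Google Voice voicemail")
    # ~229x, i.e. roughly two orders of magnitude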
I'm sorry, but do we really think that machine transcription of millions of cell phone conversations is worth anything? How can anyone believe that after using Google Voice?
So you use a hybrid approach. The text transcription can be fed into programs that look for specific phrases, build up social networks, etc. Then, for anyone you decide you actually want to monitor, you keep the audio as well as the machine transcription.
The machine transcription remains incredibly valuable for broad surveillance even though it is highly imperfect.
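A minimal sketch of what that hybrid pipeline might look like, assuming a simple phrase match over imperfect transcripts plus a who-called-whom graph; the phrase list, phone numbers, and function names are all invented for illustration:

    from collections import defaultdict

    WATCH_PHRASES = {"wire transfer", "meet at the border"}   # hypothetical

    call_graph = defaultdict(set)       # caller -> set of callees
    flagged_calls = []

    def ingest(call_id, caller, callee, transcript):
        """Record the social edge and flag the call if any watch phrase appears."""
        call_graph[caller].add(callee)
        text = transcript.lower()
        if any(phrase in text for phrase in WATCH_PHRASES):
            flagged_calls.append(call_id)   # only these calls get human review / audio retention

    ingest("c1", "+311234", "+315678", "...about the wire transfer tomorrow...")
    print(flagged_calls)                # ['c1']
    print(dict(call_graph))             # {'+311234': {'+315678'}}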
A lot of people underestimate how much storage it would take to hold all voice data.