
I agree this is a bullshit story. They only store conversations when they have an eavesdrop approval.

A lot of people underestimate the amount of storage it would take to store all voice data.




So let's estimate:

http://www.telegeography.com/press/press-releases/2012/01/09... says there were 438 billion international (because that's all the NSA collects, right?) calling minutes in 2011 (in the world... not just the Netherlands).

Aberdeen will sell you 1 PB of storage for $495k: http://www.aberdeeninc.com/abcatg/petarack.htm

A narrowband speech codec will encode calls in excellent quality (for the PSTN) at 12 kbps.

So that's 438 * 10^9 minutes * 60 seconds/minute * 12000 bits/second / (8 bits/byte * 10^15 bytes/petabyte) (using lying harddrive manufacturer's definitions of a petabyte) = 39.42 PB.

Or less than $20mln/year. Which of course is the quoted budget of PRISM.
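The arithmetic above can be checked with a short Python sketch. The inputs are the figures quoted in this comment (438 billion minutes, 12 kbps, $495k per decimal petabyte); the function and variable names are made up for illustration.

```python
# Figures quoted in the comment above; all names are illustrative only.
CALL_MINUTES = 438e9      # international calling minutes in 2011
CODEC_BPS = 12_000        # narrowband codec bitrate, bits/second
BYTES_PER_PB = 10**15     # decimal ("hard drive manufacturer") petabyte
USD_PER_PB = 495_000      # quoted price for 1 PB of storage


def storage_petabytes(minutes: float, bits_per_second: float) -> float:
    """Petabytes needed to store `minutes` of audio at the given bitrate."""
    total_bits = minutes * 60 * bits_per_second
    return total_bits / 8 / BYTES_PER_PB


storage_pb = storage_petabytes(CALL_MINUTES, CODEC_BPS)
storage_cost_usd = storage_pb * USD_PER_PB

print(f"{storage_pb:.2f} PB, about ${storage_cost_usd / 1e6:.1f}mln")
```

This should come out at roughly 39.42 PB and a bit under $20mln, matching the numbers above.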


You're not counting the bandwidth, CPU, facility, and personnel costs required to pull this off. Raw storage is a minor part of the cost.


I'm not actually trying to imply that this is what PRISM does (no one has made that claim). I'm just saying that on a government scale, the cost of storing all voice calls ever made forever is not even very expensive.

So let's add bandwidth: the most expensive estimate I've seen is $0.019/GB <http://blogs.howstuffworks.com/2011/04/07/what-does-a-gigaby.... Let's assume the original audio is captured using G.711 (64 kbps). So that's 438 * 10^9 minutes * 60 seconds/minute * 64000 bits/second / (8 bits/byte * 1024^3 bytes/GB) * $0.019/GB = $3.72mln.

Let's add CPU: A medium-sized, high-CPU AWS instance is $0.0024/minute <http://aws.amazon.com/ec2/pricing/>. A moderate laptop-class processor can encode and decode 150 channels/core in real time <http://www.ietf.org/mail-archive/web/rtcweb/current/msg05236.... So that's 438 * 10^9 call minutes * $0.0024/CPU minute / 150 call minutes/CPU minute = $7.01mln.
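The bandwidth and CPU figures can be reproduced with a small Python sketch. It uses only the prices and rates quoted above; the variable names are illustrative, and it treats one instance minute as one CPU minute, as the estimate above does.

```python
CALL_MINUTES = 438e9                # international calling minutes in 2011

# Bandwidth: audio captured as G.711 (64 kbps), priced per binary gigabyte.
G711_BPS = 64_000
USD_PER_GB = 0.019
BYTES_PER_GB = 1024**3

captured_gb = CALL_MINUTES * 60 * G711_BPS / 8 / BYTES_PER_GB
bandwidth_usd = captured_gb * USD_PER_GB

# CPU: one CPU minute handles 150 call minutes (150 channels in real time).
USD_PER_CPU_MINUTE = 0.0024
CALL_MINUTES_PER_CPU_MINUTE = 150

cpu_usd = CALL_MINUTES / CALL_MINUTES_PER_CPU_MINUTE * USD_PER_CPU_MINUTE

print(f"bandwidth: ${bandwidth_usd / 1e6:.2f}mln")
print(f"cpu:       ${cpu_usd / 1e6:.2f}mln")
```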

Facility: The NSA's Utah facility is projected to cost $1.5...$2bln <http://en.wikipedia.org/wiki/Utah_Data_Center> and will contain a 100,000 square foot data center <http://nsa.gov1.info/utah-data-center/>. A 42U rack is about 7 square feet. Let's assume a floor occupancy of 25% and one rack per PB. That's $2bln/facility / 100000 ft^2/facility * (7/0.25) ft^2/PB * 40 PB = $22.4mln.
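As a sanity check, here is the facility estimate as a Python sketch. It uses the 100,000 square foot figure for the data center, which is what reproduces the $22.4mln result, and it assumes one rack per PB (implied by the ft^2/PB unit above). The variable names are illustrative.

```python
FACILITY_USD = 2e9           # upper end of the projected facility cost
DATA_CENTER_SQFT = 100_000   # data center floor area
RACK_SQFT = 7                # footprint of one 42U rack
FLOOR_OCCUPANCY = 0.25       # fraction of floor actually covered by racks
PB_PER_RACK = 1              # assumption: one rack holds 1 PB
STORAGE_PB = 40              # storage estimate, rounded up from 39.42 PB

usd_per_sqft = FACILITY_USD / DATA_CENTER_SQFT
sqft_per_pb = RACK_SQFT / FLOOR_OCCUPANCY / PB_PER_RACK
facility_usd = usd_per_sqft * sqft_per_pb * STORAGE_PB

print(f"facility share: ${facility_usd / 1e6:.1f}mln")
```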

I don't have a good estimate of the personnel involved, but I doubt it'd require anything out of the ballpark of the other numbers here. You could have every rack maintained and operated by its own PhD-level researcher at less than $10mln/year including all overhead and benefits.

A single JSF F-35A has a $207.6mln procurement cost (excluding R&D costs, maintenance costs, and operating costs) <http://en.wikipedia.org/wiki/F-35_Lightning_II#Program_cost_....


Commercial speech compression algorithms are hamstrung by the need to add only milliseconds of delay: they can only compress over a 'window' of tens of milliseconds. You can almost certainly do a much better job of compressing speech in batches of minutes or tens of minutes, because there is much more redundancy to remove. So if the spooks wanted to store massive amounts of speech data, they may have invested in such algorithms.


Storing voice (audio) data is not what the article says. I'd imagine you transcribe the audio to text and search in that. Storing text is incredibly easy. Besides, you can throw away 99.9% of the data almost immediately.

I'm actually curious how much text data this would be per day; number of call minutes * average number of words per minute. I'd be surprised if that wouldn't fit in a reasonable cluster.
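A rough way to put a number on this, as a Python sketch. The speaking rate (150 words per minute) and the size of a word (6 bytes, uncompressed) are assumptions picked for illustration, not figures from this thread. The call volume is the 438 billion worldwide international minutes per year quoted elsewhere in the thread, not a Netherlands-specific number.

```python
def transcript_bytes_per_day(call_minutes_per_year: float,
                             words_per_minute: float,
                             bytes_per_word: float) -> float:
    """Uncompressed transcript size per day: minutes * words/min * bytes/word."""
    call_minutes_per_day = call_minutes_per_year / 365
    return call_minutes_per_day * words_per_minute * bytes_per_word


# Assumed inputs: 150 words/minute and 6 bytes/word are guesses.
daily_bytes = transcript_bytes_per_day(438e9, 150, 6)

print(f"{daily_bytes / 1e12:.2f} TB/day")
```

Under those assumptions the result is on the order of a terabyte of text per day, before any compression or filtering.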


You underestimate the CPU power needed to do this. The Netherlands has a population of 16 million; by comparison, Google Voice has about 1.4 million users. That is an order of magnitude difference. On top of this, they only transcribe voicemail, not all calls. What is the ratio of calls to voicemail?

Transcribing all voice calls in the Netherlands to text could easily be two orders of magnitude more computationally demanding than what Google Voice does.


I'm sorry, but do we really think that machine transcription of millions of cell phone conversations is worth anything? How can anyone believe that after using Google Voice?


So you use a hybrid approach. The text transcription can be fed into programs that look for specific phrases, build up social networks, etc. Then, for anyone you decide you actually want to monitor, you keep the audio as well as the machine transcription.

The machine transcription remains incredibly valuable for broad surveillance even though it is highly imperfect.


True, actually. Ironically, a call itself is much more expensive than storing it for 20 years would be.

They do have a lot of eavesdrop approvals though, or so I heard from a colleague. (But that still doesn't mean they capture all the calls.)



