I used to work at Tumblr, the entirety of their user content is stored in a single multi-petabyte AWS S3 bucket, in a single AWS account, no backup, no MFA delete, no object versioning. It is all one fat finger away from oblivion.
What the hell. It is so easy to configure multi-region Glacier backups, MFA delete, etc. for a single S3 bucket. It took me a couple of hours to set up versioning and backups, and a few days to set up MFA for admin actions. Why would they not set this stuff up?
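For anyone wondering what "versioning plus MFA delete" looks like in practice, here is a minimal sketch in Python. It only builds the arguments that boto3's `put_bucket_versioning` call accepts; the bucket name, MFA device ARN, and code below are placeholders I made up, not anything from Tumblr's setup.

```python
def versioning_request(bucket, mfa_serial, mfa_code):
    """Build keyword arguments for boto3's S3 put_bucket_versioning call."""
    return {
        "Bucket": bucket,
        # The MFA value is the device serial number (or ARN), a space,
        # and the current code shown on the device.
        "MFA": f"{mfa_serial} {mfa_code}",
        "VersioningConfiguration": {
            "Status": "Enabled",     # keep prior versions on overwrite or delete
            "MFADelete": "Enabled",  # require MFA to permanently delete a version
        },
    }


# All three values below are hypothetical placeholders.
params = versioning_request(
    "example-bucket",
    "arn:aws:iam::123456789012:mfa/example-device",
    "123456",
)
print(params)

# With boto3 installed and credentials configured, this would be sent as:
#   boto3.client("s3").put_bucket_versioning(**params)
```

The cross-region and Glacier parts would be separate replication and lifecycle settings, which this sketch does not cover.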
The key words you probably need to look at are "multi-petabyte". Not saying they shouldn't be doing something but it all costs - and at multi-petabytes, it cooooosts
1 Petabyte (and they have multiple)
S3 - $30,000 a month, $360,000 a year
S3 - reduced redundancy - $24,000 a month, $288,000 a year
S3 - infrequent access - $13,100 a month, $157,000 a year
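The arithmetic behind those figures is easy to check. Here is a small Python sketch that takes the per-petabyte monthly prices quoted above at face value (they are the commenter's numbers, not current AWS pricing) and scales them by year and by petabyte count.

```python
# Monthly cost per petabyte in USD, as quoted in the comment above.
MONTHLY_PER_PB = {
    "S3": 30_000,
    "S3 - reduced redundancy": 24_000,
    "S3 - infrequent access": 13_100,
}


def annual_cost(monthly_per_pb, petabytes=1):
    """Yearly cost for the given number of petabytes."""
    return monthly_per_pb * 12 * petabytes


for tier, monthly in MONTHLY_PER_PB.items():
    print(f"{tier}: ${monthly:,} a month, ${annual_cost(monthly):,} a year per PB")
```

Infrequent access comes out to $157,200 a year, which the list above rounds to $157,000. Every additional petabyte multiplies each line again.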
Add in transit and CDN and Tumblr’s AWS bill was seven figures a month. A bunch of us wanted to build something like Facebook’s Haystack and do away with S3 altogether, but the idea kept getting killed because of concerns over all the places the S3 URLs were hard coded and also breaking 3rd party links to content in the bucket (for years you could link to the bucket directly - still can for content more than a couple years old)
Well, the business was acquired for $500,000,000 and a single employee probably costs what backing up two petabytes of data for a year (on Glacier) does.
They could also always use tapes, for something as critical as the data that is the blood of your business.
Imagine if Facebook lost everyone's contact lists, how bad would that be for their business? Backups are cheap insurance.
Backups are still a hard sell for management, though. No matter how many companies die a quick and painful death when they lose too much business critical data, the bossmen just can't wrap their heads around spending $100k for what they perceive as no benefit.
Same problems with buying things like antivirus software or even IT management utilities; when they're doing their job, there's no perceivable difference. It's only when shit goes sideways that the value is demonstrated.
Hell you could take this a step further for IT as a whole; if IT is doing their job well, they're invisible. Then they can the entire department, outsource to offsite support, and the business starts hemorrhaging employees and revenue because nobody can get anything done.
>No matter how many companies die a quick and painful death when they lose too much business critical data, the bossmen just can't wrap their heads around spending $100k for what they perceive as no benefit.
Yeah, but what exactly IS the benefit? The business doesn't die if something really bad happens? Is that really important though?
Consider the two alternatives:
1) The business spends $x00k/year on backups. IF something happens, they're saved, and business continues as normal. However, this money comes out of their bottom line, making them less profitable.
2) The business doesn't bother with backups, and has more profit. The management can get bigger bonuses. But IF something bad happens, the company goes under, but then what happens to the managers who made these decisions? They just go on to another job at another company, right?
They can get more money in the short term by pushing you harder, and there's zero cost to them to go yell at you. If they could get a bigger bonus by ignoring outages, they'd do that, but instead, they can get a bigger bonus by pushing you to reduce outages without any additional resources.
Seems like they do just fine with big golden parachutes. Why tie their compensation to the company's performance when they can just have a big payout whenever they leave under any circumstances?
I worked at a place that lost their entire CVS repository. The only reason they were able to restore it at all was because I made daily backups of the code myself. Sure, a lot of context data was still lost, but at least there was some history preserved.
I wouldn't be surprised if this was actually the rationale for not having backups.
Tumblr is apparently fragile and tech-debt laden on the engineering side, stagnant on users, and unprofitable. At a certain point, it's a coherent decision to just say "a few days of downtime would seal our fate, the business can only be saved if everything goes right", and not spend any money on mitigation.
Devil's advocate: it depends on how many petabytes you have. This cloud of uncertainty over your uploads could be seen as the hidden cost of using a free platform.
Building such a storage behemoth is not the challenging part. Filling it with data, backing it up, and keeping the RAID rebuild time under load on such monster drives below the average drive failure time is the challenging part.
At that scale it makes sense to start thinking about alternatives to RAID e.g. an object storage with erasure coding should work well for a code base already using the S3 API. In theory even minio should be enough, but I never had enough spare hardware to perform a load test of that scale.
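To put a rough number on why erasure coding is attractive at this scale, here is a minimal Python sketch comparing raw storage per usable byte. The 8+4 layout is a hypothetical example I picked, not a recommendation or anything a specific product defaults to.

```python
def raw_per_usable(data_shards, parity_shards):
    """Raw bytes stored per usable byte for a k+m erasure code."""
    return (data_shards + parity_shards) / data_shards


# Three full copies: survives the loss of any 2 copies.
replication_overhead = 3.0

# Hypothetical 8 data + 4 parity layout. With an MDS code such as
# Reed-Solomon, this survives the loss of any 4 of the 12 shards.
ec_overhead = raw_per_usable(8, 4)

print(f"3x replication: {replication_overhead:.1f} raw bytes per usable byte")
print(f"8+4 erasure coding: {ec_overhead:.1f} raw bytes per usable byte")
```

Under those assumptions, a petabyte of usable data needs 1.5 PB of raw disk instead of 3 PB, and a failed drive is rebuilt from shards spread across many drives rather than from one RAID group.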
It will probably cost more to connect all these drives to some sort of a server. Though 125 is within the realm of what a simple USB should be able to handle (127 devices per controller).
And how many days of downtime are you willing to tolerate while you are restoring that petabyte of data from your contraption? Let's say you have a 10Gbps internet connection (not cheap) all the way through to the Amazon data center, the data transfer will only take about 12 days per petabyte then.
Getting petabytes of storage isn't the problem, transferring the data back and forth is.
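The transfer-time estimate can be sanity-checked with a few lines of Python. This sketch assumes a decimal petabyte (10^15 bytes); the 0.77 efficiency factor is my own assumption, chosen only to show what sustained throughput the "about 12 days" figure would correspond to.

```python
def transfer_days(size_bytes, link_bits_per_sec, efficiency=1.0):
    """Days needed to move size_bytes over a link at the given efficiency."""
    seconds = size_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 86_400


PETABYTE = 10**15        # decimal petabyte, in bytes
TEN_GBPS = 10 * 10**9    # 10 Gbps, in bits per second

ideal = transfer_days(PETABYTE, TEN_GBPS)
realistic = transfer_days(PETABYTE, TEN_GBPS, efficiency=0.77)  # assumed efficiency

print(f"at full line rate: {ideal:.1f} days per PB")   # 9.3
print(f"at 77% efficiency: {realistic:.1f} days per PB")  # 12.0
```

So even a perfectly saturated link needs more than nine days per petabyte, and the 12-day figure is what you get once protocol overhead and less-than-perfect utilization are allowed for.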
This is all true, but it sort of presupposes competence.
Taking a full month to recover a downed social media platform isn't really acceptable, but it's still better than being literally unable to recover it at all. Spending a small fortune to ship hardware to an AWS datacenter and convincing/paying them to load it directly would probably also be worthwhile, when we're talking about simply losing a $500M company. If the claim here about "no backup" is true, it's so profoundly stupid that everything I know about best practices sort of goes out the window. Approaches that any sensible person would consider unacceptably slow and unreliable are still a step up from a completely blank playbook.
(I guess the theory might be that Tumblr is such a trashfire it can't be restored, or would lose so much value in days/weeks of downtime that there's no point in even planning for that. Again, I don't really know how you run cost-benefit analyses when it's not entirely clear the project has benefits.)
And where does Amazon offer colo services? What they offer is Direct Connect at certain (non-Amazon) data centers. That costs about 20k per year for a 10Gb port, ON TOP of the colocation and cross connect fees you are paying at the data center where you want to establish the connection. If you want to bring the restore time down to 12 hours, you need 24 connections (and you need at least as many servers, no single server can handle 240Gb of traffic), so we are now at about 480k+X (large X!) per year per petabyte just for the connections you need in case you have a catastrophic failure (establishing such a connection takes days or even weeks, even if ports are available immediately, so you can't establish the connections "on demand").
That's not even talking about availability, as you are now getting into the realm where it starts to get questionable whether even Amazon has enough backhaul capacity available at those locations so that you can actually max out 50+ 10Gb connections simultaneously.
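The connection count follows directly from the single-link estimate. A minimal Python sketch, using only the figures from this comment (12 days per petabyte on one 10Gb link, about 20k per port per year) and assuming throughput scales linearly with the number of links:

```python
import math


def ports_needed(single_link_days, target_hours):
    """Parallel links required to hit the target restore time."""
    return math.ceil(single_link_days * 24 / target_hours)


PORT_FEE_PER_YEAR = 20_000  # per 10Gb port, as quoted above

ports = ports_needed(single_link_days=12, target_hours=12)
port_fees = ports * PORT_FEE_PER_YEAR

print(f"{ports} ports, ${port_fees:,} per year per petabyte in port fees alone")
```

That is the 480k before the "large X" of colocation, cross connects, and servers, and it doubles for every additional petabyte you want restored in the same window.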
> I used to work at Tumblr, the entirety of their user content is stored in a single multi-petabyte AWS S3 bucket, in a single AWS account, no backup, no MFA delete, no object versioning. It is all one fat finger away from oblivion.
Remember when Microsoft lost all of the data for their Sidekick users? Basically they were upgrading their SAN and things went badly.
Picasso (supposedly) drew on a napkin, and Banksy draws on derelict walls or sticks his work through a shredder. The medium doesn’t need to be lasting.
Edit: The potentially short-lived medium was chosen by the above artists. Tumblr users may not be too happy if work is lost.
Banksy's walls are sold though; and he is still kind of the exception because of his art format. Not everything needs to be lasting but 100% temporary art is not common.
Oh, it's more common than you think, only it being highly valued is rare. That doodle you drew while having a conversation on the phone? That's throwaway art, even if you don't consider yourself an artist.
How many of them do you think would be willing to pay some small monthly fee? I'm guessing most of them think their work is worth at least $5/month, right? Maybe Tumblr should become a paid service and ditch the advertising model. That way they could be more relaxed about what types of content they are willing to host.
That's basically what happened with S3 a couple years back. Mistyped command caused an outage for large parts of the internet in the US. Now, I dunno if they could make a big enough mistake that would bring down the whole company, but certainly it's been proven that a single mistake can affect major portions of the internet.
> experienced code reviewers verifying change sets using sophisticated deployment infrastructure targeting physical hardware spread out across one or more data centers in each availability zone
but the availability numbers speak for themselves :/
There's an awful lot of less-critical stuff that users have tracked down themselves. A few random highlights:
- The mobile and desktop sites are completely separate products with vastly different behavior. Some privacy features (relevant to both) can only be accessed on one, some on the other. Tags are rendered in all-lowercase on mobile, but as written on desktop. Block quotes on desktop render as enlarged-font cursive on mobile, for some awful reason.
- Tumblr support(s/ed) font coloring, with no documentation of that fact. You enable it by using the HTML editor and picking among color tags with Friends-themed names like "Monica Pizazz Orange". Oh, and the preview feature won't honor the tags, but actually posting will.
- NSFW content is flagged even in drafts, but if that content is reviewed and approved, it's automatically posted publicly, not returned to drafts where it started.
- Tumblr's desktop sign up page use(s/d) semi-random images from the site as backgrounds. Yes, they did serve cartoon porn to people trying to make accounts.
- Certain posts were impossible to view. Tumblr accounts can have their own themed pages, or simply be popup sidebars over the main news feed. Tumblr "read more" content hiders took users from the news feed to the poster's account - if that account was in popup format, a readmore opened from the wrong location would simply force a circular redirect.
- All Tumblr links are actually pushed through a site-specific forwarding system to track users. As a result, Twitter and many other sites are inaccessible because they view all link clicks as bot traffic from a "single source".
Your info is somewhat out of date. Aside from #1, these are all bugs that have been fixed. #2 was only before the feature was officially launched. #3 was fixed within a few days. #4 wasn't a bug; serving artistic nudity was intentional and part of Tumblr's brand (just like an art museum would). #5 was a bug for a while and it sucked. I've never heard of #6 being an issue. It's true that they use a link tracking system, but I've never heard of it causing "bot traffic" issues; respectfully, that sounds like bullshit. While I hate it for privacy reasons, lots of sites use link tracking, like Google and Facebook.
I agree that most of these bugs are old; I figured the question included historical stuff, and I have a better knowledge of Tumblr's old bugs than of its new ones.
It looks like I was simply wrong on #2, thank you; I remembered it as something that had been around for ages but was noticed, then publicized. If it was found before a planned announcement, that's different.
#3 was fixed within a few days, but frankly I think "posting people's drafts with no warning" is a "damage done" thing, the same as an email client sending drafts to all listed recipients. There are reasons, like the "private post" option, that you would draft something and never openly publish it, and even beyond that it's a reason to draft anything you might not want to publish as-is offline instead of in the site's draft feature.
#6 is complained about by plenty of other people, and happens to me perhaps 90% of the time. I realize I missed one thing: it's mobile-only. Opening a Twitter link on mobile produces a "you're rate-limited" blocking page which sticks around even if you try again later, but choosing "open in Chrome" to escape the Tumblr app immediately solves the problem. I haven't seen comparable behavior in any other app where I've followed Twitter links. Mobile-specific implies it's not purely the link tracking, granted, but it's very much a real Tumblr-specific issue.
My experience with Tumblr was generally that a large part of the content, especially larger media content like videos, failed to load most of the time. Makes me wonder if that's related ...
there was a S3 sync client that some people used that did:
aws s3 sync --delete ./ s3://your-bucket/
The delete flag was added by just a very innocuous checkbox in the UI. The result is that it removes anything not in the source directory. Kaboom. Everything's gone. The point is you have no idea what stuff is going to do even if you think it's obvious.
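To make the behavior concrete, here is a simplified Python model of what a sync with the delete option plans to do. It only compares key names (the real tool also looks at size and modification time to decide what to upload), and the file names are made up.

```python
def plan_sync_delete(local_keys, bucket_keys):
    """Simplified plan for syncing a local directory to a bucket with deletion."""
    local, remote = set(local_keys), set(bucket_keys)
    return {
        "upload": sorted(local - remote),  # in the source, missing from the bucket
        "delete": sorted(remote - local),  # in the bucket, missing from the source
    }


bucket = ["photos/a.jpg", "photos/b.jpg", "notes.txt"]  # hypothetical contents

# Run from a directory that holds only one of the files:
print(plan_sync_delete(["notes.txt"], bucket))

# Run from an empty (or simply wrong) directory:
print(plan_sync_delete([], bucket))
```

In the second case the delete list is the entire bucket, which is the "kaboom" scenario: the source directory is treated as the truth, whatever it happens to contain.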
Have you tried this? It takes forever to clean out a bucket. At the scale we're talking about, doing this on a single thread from the CLI tool means you could go home and come back the next day and cancel it then, and you still wouldn't have made a particularly big dent in the bucket. It's really a pain in the neck to delete a whole bucket full of data when you actually want to. It's "easy" to start off a recursive delete, sure, but I think you're overestimating the "kaboom" factor.
> You can delete an empty bucket, and when you're using the AWS Management Console, you can delete a bucket that contains objects. If you delete a bucket that contains objects, all the objects in the bucket are permanently deleted.
if you'd used the s3 management console, you'd know that it uses the same API as everything else, and so has to do the same list objects by page / delete a page dance just like everybody else... the only bulk optimization i can recall is the server side transfers for sync...
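The "list a page, delete a page" dance can be sketched without touching AWS at all. This Python simulation uses an in-memory list in place of a bucket and assumes a page size of 1000, which is the most keys a single S3 list or multi-object delete request handles; the object counts are hypothetical.

```python
def empty_bucket(keys, page_size=1000):
    """Simulate emptying a bucket page by page; returns the number of requests."""
    remaining = list(keys)
    requests = 0
    while remaining:
        page = remaining[:page_size]  # stands in for one list request
        requests += 1
        del remaining[:len(page)]     # stands in for one multi-object delete
        requests += 1
    return requests


def requests_for(object_count, page_size=1000):
    """Same count computed directly: one list plus one delete per page."""
    pages = -(-object_count // page_size)  # ceiling division
    return pages * 2


print(empty_bucket(f"key-{i}" for i in range(2500)))  # 3 pages, 6 requests
print(f"{requests_for(10**9):,} requests for a billion objects")
```

Done one request at a time, that is two million sequential round trips for a billion objects, which is why a single-threaded recursive delete against a bucket this size runs for a very long time before it makes a dent.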
Tumblr rejected all things Yahoo, except the money, so the answer to just about anything Yahoo asked was either “no”, “get stuffed”, or silence and a note to David that he needed to escalate to Marissa.
On the other side, the Yahoo services were so heavily integrated that it was hard to carve out any piece of them, and the few times we tried it was a slow and painful process because Yahoo’s piece was glitchy and unreliable outside of its home turf and the Tumblr engineers were defensive and argumentative about everything and not willing to help.
That's exactly how I imagined Tumblr's design and development, based on my multiple unsuccessful attempts, over the years, to find any useful navigation between blogs, or the function of reading comments.
> When was this? Being owned by Yahoo, I am surprised they don't use NetApp.
Dell used to offer an online backup service. It wasn't even running on Dell equipment!
Basically they acquired a company that offered the service, and while it would be "nice" if a Dell company ran on Dell gear, a lot of the time it's simply impractical/expensive to overhaul things.
i do this too with my data on a smaller scale, but i'm surprised tumblr does this because even with only a few million files s3 buckets that big are awkward to work with
This is why I wrote Timeliner: https://github.com/mholt/timeliner - a tool to download all my content from essential cloud services (like Google Photos, etc) -- I don't like to trust the cloud as a master copy of my data.
I've been amassing a collection of scripts to basically do the same thing over the last couple of months, just running them all once a month with cron.
None of the services I've been backing up (Goodreads, Trakt, DeviantArt, Tumblr) are currently covered by Timeliner, but the extra twist of assembling all your data into a single timeline sounds kinda cool, so maybe it's worth contributing a few data sources.
I use(d) Perkeep (Camlistore) for this but the current version is broken wrt Pleroma (which might be Pleroma's fault) and Pinboard. I'll definitely give Timeliner (and possibly a Pinboard datasource) a try to fill in the gaps.
I might not trust them 100%, but I trust them around 1000x more than I trust myself. I have had several data loss incidents and it was invariably my own fault. I simply cannot be trusted with my own data.
If one doesn't trust Google/Dropbox et al. to handle the data, then just use two of them, or use a personal solution and one of them. Just don't use only your home-rolled backup system because you don't trust any of the online storage providers.
> This also goes for Google Drive, Dropbox, and many other websites (if not all)
Of course, but that is not a bad thing per se. Digital data archiving itself is unknown territory. What should happen to the data you put online is uncertain, even in the short term.
The analogy of a "cloud" is revealing. A cloud is fleeting by nature.
Nothing lasts forever online, no matter what they say.
And that is not mentioning compatibility issues or reading old files made with outdated apps.
Some files that are just above 10 years old can be hard to retrieve today.
In the Google case, I don't believe the files were ever deleted. While the article says they were unable to download it again, my understanding is that Google never deletes or prevents access to your own files on Drive (in cases like this). So maybe something else was going on here I'm unfamiliar with.
Google was preventing the sharing of the files, but that should have been it.
(I'm a googler, but don't have any direct info on this specific incident)
At one point Yahoo was invincible..
I really need a way to find an alternative to gmail for email where I am in control and can replicate it when I like.
You can be in control of your actual data pretty easily with gmail. One of the better options I’ve found for maintaining a local archive is the Got Your Back[0] script. It maintains an SQLite DB with all of the message tag associations, and stores all your emails in standard .msg format text files.
If you want additional peace of mind, you can spend a few pennies and use restic[1] to do regular incremental backups of the archive to Backblaze B2 to add yet another storage location and version index.
you can always set up an email account on another service and have the mail accessed through google's interfaces. i wouldn't recommend setting up your own email server unless you really like securing servers though
The only reason I still use gmail is because of the search engine... I'm about to implement my own using existing tech so that I can move away from Google... why does search suck so much in open source programs.. on a side note, Myspace is still active?
It's a little different for Dropbox, MEGA and similar services because they sync your machines. So long as you have at least one machine with those tools installed and running, you have a copy (unless someone finds an exploit in their servers and deletes all their customer data ... so yea .. still keep backups).
If you hope it remains private or it gets forgotten, you should assume that it may go public / remain online forever. And archives exist for many pieces of public content.
If you hope it stays forever, you should assume it may vanish in a few seconds.
I've seen occasional attachments to years-old gmail messages get irrevocably lost. Well, irrevocably given the likelihood of ever getting human attention on it.
This also goes for Google Drive, Dropbox, and many other websites (if not all)
Examples:
https://medium.com/@jancurn/how-bug-in-dropbox-permanently-d...
https://motherboard.vice.com/en_us/article/9kgwnp/porn-on-go...
https://www.zdnet.com/article/dropbox-under-fire-for-dmca-ta...