Spotify’s Love/Hate Relationship with DNS (spotify.com)
135 points by yarapavan on April 1, 2017 | hide | past | favorite | 34 comments



What Spotify calls a "stealth primary" has typically been referred to as a "hidden master" for 15 years or so. Googling that term will turn up more relevant results, for anyone looking to do something similar.


To be clear, a strict implementation of the DNS means that this host appears in the MNAME field of your SOA record. Not so stealthy or hidden, despite the moniker. You might ask "who on earth looks at the MNAME field of SOA records?", to which I answer: "well, I do".

However, there is no requirement that the MNAME host be willing to answer queries. There is an expectation that the host in the MNAME field accepts dynamic DNS updates if one is using RFC 2136-style dynamic DNS, although I didn't get the sense Spotify were doing so.
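
For the curious, `dig +short SOA example.com` will print the zone's SOA record, and the MNAME is simply its first field. A canned answer (contents entirely made up) shows where the "hidden" primary leaks out:

```python
# Parse a hypothetical SOA answer; MNAME is the very first field,
# followed by RNAME, serial, and the four timer values.
soa = ("ns-stealth.example.com. hostmaster.example.com. "
       "2017040101 7200 900 1209600 86400")
mname = soa.split()[0]
print(mname)  # -> ns-stealth.example.com.
```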


When I was setting mine up, the term I saw was "unpublished primary". I did it because there were few free primary DNS services back then, but several free secondary services. So I just "forgot" to publish my primary...


> We run our own DNS infrastructure on-premise which might seem a bit unusual lately.

I don't think running your own DNS is too uncommon, especially if you have a lot of on-premise hardware that changes somewhat frequently. However, if you do this, don't run BIND. We found PowerDNS to be much better in terms of features, user-friendliness, and documentation. Having backends that aren't zonefiles is a huge win. I've heard good things about Unbound, but haven't used it in a big environment yet (>1000 machines).
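
To give a flavour of the non-zonefile backends: pointing PowerDNS at an SQL database instead of zonefiles is only a few lines of pdns.conf (host and credentials here are made up):

```
# /etc/powerdns/pdns.conf (sketch; values hypothetical)
launch=gmysql
gmysql-host=127.0.0.1
gmysql-dbname=dns
gmysql-user=pdns
gmysql-password=secret
```

Records then live in database tables, so changes take effect without any zone compile step.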


I'm a "BIND lover" and have been since the 90s and still use it for public-facing authoritative name servers. In their case, though, it definitely sounds like they should consider PowerDNS. It allows for various backends, including SQL and custom ones, which might fit in well with the "data store" they mentioned. Instead of all the cronjobs and pushing and pulling, they might be able to point the authoritative nameservers directly at their "data store" and cut out a lot of that "plumbing" (it's impossible to know without more details, of course).

Also, unless there's a huge amount of DNS data changing every 15 mins, they might gain some speed-ups from sending dynamic DNS updates to the authoritative nameservers and/or using IXFRs instead of AXFRs.
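
An RFC 2136 dynamic update is just a short script fed to BIND's nsupdate; something along these lines (server, zone, and record are all hypothetical):

```
server 192.0.2.53
zone example.com.
update delete web01.example.com. A
update add web01.example.com. 300 A 192.0.2.10
send
```

`nsupdate -k <keyfile> updates.txt` would then push that to the primary, and secondaries can pick up just the delta via IXFR rather than re-transferring the whole zone.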

(n.b.: unbound only handles recursive DNS, not authoritative.)


> (n.b.: unbound only handles recursive DNS, not authoritative.)

Yeah, but as you probably also know the PowerDNS recurser is separate, so there's no reason PowerDNS + Unbound couldn't also be a great combination. Heck, I might even choose that combo so resolvers only have unbound installed and can never act as authoritative servers.
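
A sketch of that split, with addresses and zone names made up: Unbound on the resolvers, with a stub-zone sending internal lookups to the PowerDNS authoritatives and recursing for everything else.

```
# /etc/unbound/unbound.conf (sketch; addresses hypothetical)
server:
    interface: 0.0.0.0
    access-control: 10.0.0.0/8 allow
stub-zone:
    name: "internal.example.com."
    stub-addr: 10.1.2.3   # a PowerDNS authoritative server
```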


> It simply pulls from our DNS data repository, then compiles all the zone data via named. With every compile time – which takes about 4 minutes...

4 minutes seems like an awful long time for what I'm assuming is a fairly simple transformation. Any insight as to why? Is named just slow?


It's not intrinsically slow, no. I've built & run BIND-based infrastructure spanning >20 sites, >50 servers and >10,000 zones, and changes were compiled and propagated in seconds. Only a total rebuild & recompile & reload of all zones & services required time on the order of minutes.


I got side-tracked by the talk of version control. Do they use a single repo for everything? It seems like it from the way they talk.

In that case, I am surprised git is a good fit. SVN might have been better, though some commercial solutions like ClearCase or Perforce would actually be right for that sort of workload.


Spotify mostly uses independent repos for all code, but some configurations (e.g., DNS) are contained within a single repo.

Some big companies do successfully use Git for monorepos though (e.g., Microsoft – search for "GVFS").

Note: I haven't been there for two years so of course some things may have changed.


To me it seems like they're using a single repository for all their DNS infrastructure, not a monorepo for all of Spotify.


I assume you've never used CC, but it's a gigantic pile of crap. It is completely worthless, sold on golf courses to clueless higher-ups, because most developers hate it.

But in general I've never seen any commercial source control system beat an open source one.


Perforce is mostly a better Subversion. The two have an extremely similar model, but Perforce handles merges much better and gives more flexibility in slicing and dicing your local view of the repo (extremely useful for huge monorepos).

Disclaimer: it's been several years since I've used Subversion, this may have changed.


Not sure, but FYI a single git repo is very doable for an org like Spotify, if you want to make it work. You have a few choices.

(1) Keep binary/data files elsewhere.

(2) Keep large files in Large File Store/Annex.

(3) Use Git Virtual File System by MS.


No matter which of those paths you take, you also need to have an automatic merge/rebase bot, because in a sufficiently large organization with a single repository, by the time you finish pulling to ensure that your change fast-forwards, someone else will have already pushed another change.


Few repos have a commit every couple seconds.

They exist, but you have to be monumentally huge to have that.


Perhaps not every few seconds, but it will likely take more than a few seconds to "git pull --rebase", review the result, re-check the build and smoke tests, and then push. (Or, alternatively, to merge into master, build and test the result, and push.) And "every minute or two" is not at all unlikely for a company-wide shared repository during at least part of the day. Much easier to let a bot do either a cherry-pick or merge, including a build and smoke test.


    git pull [--rebase] && git push
takes a few seconds. Doing it server-side (say, through any off-the-shelf code review system) is even faster.

If you're insisting on running tests on every single change before fast-forwarding into trunk... yes, that will get prohibitive very fast. A bot would hardly help, though. If you have 4 commits a minute applied and verified serially, you need to build and test in 15 seconds.


Even if you don't run tests, at a bare minimum you always need to build-test. master should always build. With a warm ccache, that shouldn't take excessively long, but long enough that you still want to do it server-side.

Assuming you've locally build-tested each commit, the bot only needs to build-test the combination. And some bots even pull in a set of merges at a time, and keep them if they all pass, only falling back to serial testing if they fail.
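
The batching strategy can be sketched in a few lines (function names here are illustrative, not any particular bot's API): build-test a whole batch of pending merges at once; if the combined build passes, land them all, otherwise fall back to growing a known-good set one change at a time.

```python
# Optimistic batch testing with serial fallback. `build_passes` is a
# stand-in for a real build-and-test step run against a set of changes.
def land(pending, build_passes):
    if build_passes(pending):        # optimistic: one combined build
        return list(pending)
    good = []
    for change in pending:           # fall back to serial testing
        if build_passes(good + [change]):
            good.append(change)
    return good
```

With a `build_passes` that rejects any batch containing a broken change, a batch of three where the middle one is broken lands the other two in just four builds.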


git pull doesn't take just a few seconds on huge repos. Pulling down a ~day's worth of changes takes several minutes in ours.


DNS for service discovery is very old-school stuff that has been known to be unreliable and requires an unnecessary amount of work. You are better off using an actual dedicated HTTP-based service for doing service discovery. Keeping things largely TCP/HTTP based makes everyone's life simpler.


They didn't even look at their firewall tables before deploying a completely different OS to production? Wow.


This:

> Upon the migration of the final nameserver – you guessed it – DNS died everywhere. The culprit turned out to be a difference in firewall configuration between the two OSes: the default generated ruleset on Trusty did not allow for port 53 on the public interface.

Deploy a new service on a Linux box in the last decade? Poke it through whatever the distro uses to manage iptables.

It's like a webdev saying "it turns out DROP deletes tables".


So DNS as a way to orchestrate sharding? Is this common, or am I right to find that odd?


What makes you think this is about sharding?


The article talks about their use of dns service records to encode shard locations. I know other companies that do similar things. DNS has to be up, so rather than adding another dependency, encoding it into that infrastructure is a valid approach.
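
That is, something along these lines in the zone (names, weights, and ports invented for illustration; SRV fields are priority, weight, port, target):

```
; hypothetical SRV records mapping shards to hosts
_shard0._tcp.tracks.example.com. 300 IN SRV 10 50 8080 node1.example.com.
_shard1._tcp.tracks.example.com. 300 IN SRV 10 50 8080 node2.example.com.
```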


The part about looking up track information.


Spotify should consider implementing Consul :P


What's with the weird font rendering? It looks blurry.


Comparing Firefox with Chrome, it kind of seems like Firefox is doing more anti-aliasing.

If I put the two windows side by side, the difference is marginal. If I view them fully maximised, the difference is more pronounced (at least on Windows).

edit: downvotes? really? Do I need to post screenshots or something?


They also set the font-weight on every single line of text, separately, via the HTML style attribute, to 400 (normal is 300).

That’s very weird.


> That’s very weird.

They are probably using a WYSIWYG editor that auto-generates the markup, so when you press return it creates a new paragraph. I agree the output is weird, but I can see how this would happen.


Maybe instead of worrying about internally built DNS servers they could allow me to undo an artist ban from a daily mix. Seriously frustrating.


Maybe their engineering blog is about engineering things and customer support should go through https://twitter.com/spotifycares and https://support.spotify.com instead of an HN thread.

