I wonder what those clients are; none of these warnings ever gives examples in my experience, so I can't tell whether this is old, out-of-date information being repeated or an active modern problem.
I know there was a DNS resolver that did worse than impose a minimum like that: it treated short TTLs (<300s) as errors and "corrected" them by using its default (often 24 hours). I actually encountered it and its effect in the wild, but that was back in the very late 90s or very early 00s. It was a DNS cache daemon on an old Unix variant (I forget which one; my university had a few floating around at the time), so old that if someone is still using it they have far greater issues than my low TTL value!
My current attitude is that if I need short TTLs then I'm going to use them (though 300s is more than short enough for anything I envisage doing ATM, and 300s seems to be reliable). If someone's DNS cache is broken, that is their problem, much like I don't pay any attention to issues Internet Explorer users might have looking at some HTML+CSS+JS I've inexpertly strung together.
HughesNet used to run a caching DNS resolver on their customer equipment circa 2016 (not sure if they still do) which overwrote TTLs to 12 hours.
Java/the JVM historically ignored DNS TTLs: it did a single lookup and cached the result for the life of the process unless configured otherwise.
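For anyone hitting this, the JVM's caching is controlled by security properties rather than by DNS TTLs. A sketch of the usual fix (these property names are real JDK knobs, but the defaults vary by JDK version, so check your own java.security file):

```
# In $JAVA_HOME/conf/security/java.security (or set programmatically
# via java.security.Security.setProperty() before the first lookup)

# Cache successful lookups for 60 seconds instead of the default
# (historically "forever" when a SecurityManager was installed):
networkaddress.cache.ttl=60

# Cache failed (negative) lookups for 10 seconds:
networkaddress.cache.negative.ttl=10
```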
PgBouncer ignores DNS TTLs and uses its own configuration instead.
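For reference, the relevant knobs live in pgbouncer.ini; a sketch (values are in seconds, and the exact defaults depend on your PgBouncer version):

```ini
[pgbouncer]
dns_max_ttl = 15          ; re-resolve backend hostnames at least this often
dns_nxdomain_ttl = 15     ; how long to cache negative (NXDOMAIN) results
```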
I've seen lots of software that creates a "connection" object and re-uses it, with TCP reconnect handling. In practice, that means DNS resolution happens when the connection object is created, but not necessarily when the TCP connection reconnects after a timeout or failure. I've also seen issues with long-lived connections (where the connection greatly outlives the TTL). Long-lived connections aren't necessarily a DNS TTL problem, but they do mean DNS load balancing/redirection doesn't always work how you expect.
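The usual fix is to re-resolve on every (re)connect attempt instead of caching an address inside the long-lived connection object. A minimal sketch in Python (connect_with_fresh_dns is a made-up helper name, not any particular library's API):

```python
import socket

def connect_with_fresh_dns(host, port, timeout=5.0):
    # Re-resolve the name on every (re)connect attempt, so a DNS-based
    # failover or load-balancing change takes effect after a disconnect
    # instead of the stale address living inside the connection object.
    last_err = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            # sockaddr[:2] is (host, port) for both IPv4 and IPv6.
            return socket.create_connection(sockaddr[:2], timeout=timeout)
        except OSError as e:
            last_err = e  # try the next resolved address
    raise last_err or OSError(f"could not connect to {host}:{port}")
```

The point is simply that the resolver call sits inside the reconnect path, not in the object's constructor.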
That said, the solution is usually to just fix the issue in the software and not ignore DNS TTLs.
As counter examples, AWS ALBs and Kubernetes service discovery both rely on short TTLs with fairly good success.
Some heavily loaded ISPs have modified daemons that set a lower and upper TTL threshold. Anyone running Unbound DNS can also do this. Some versions and distributions of Java do all manner of odd things with DNS, including ignoring TTLs, though I do not have a current table of those misbehaving. The same goes for some IoT devices, but I have no idea what resolver libraries they are using.
Yep. Just checked the docs for that and “cache-min-ttl” is a thing. For those unfamiliar, unbound is a pretty common resolver: it is the default for most BSDs and things based on them (such as pfSense) and used by the popular PiHole and its forks. How common it is to use this setting I can't comment on.
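A minimal sketch of what that looks like in unbound.conf (cache-min-ttl defaults to 0, i.e. off, so clamping only happens if the operator opts in):

```
server:
    cache-min-ttl: 300     # never honour a TTL shorter than 5 minutes
    cache-max-ttl: 86400   # and never cache an answer longer than a day
```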
> Some versions and distributions of Java do all manner of odd things with DNS, including ignoring TTLs
This at least is less of an issue if all you are futzing with the DNS for is services intended to be consumed by a web browser: if common browsers and DNS resolvers behave OK then you are good, if someone consuming your stuff through their own code hits a problem they can work around it themselves :)
It goes without saying that only a fool would play tricks like this with email, as that is already a mountain of things that can be easily upset by anything off the beaten track.
The reason I mention Java is that when updating DNS in a business, there may be business-to-business flows that are critical. This is becoming more common than ever with cloud-to-cloud-to-cloud dependencies, and mishaps in cloud services can affect a service used by a large number of people. That is why I always tried to teach people to design apps so that the TTL could in theory be 1 week and it would not matter: e.g. an anycast IP on the apex domain used for authentication, then direct browsers, API clients and cloud daemons to a dedicated service domain name that is also anycast and uses health checks behind the scenes to take nodes out of the picture, rather than relying on DNS.
Getting DNS right is non-trivial, and UDP packets will be dropped at times. If you have to look up the name every two seconds, you run a high risk of failed lookups even if your infra team is competent. (10 minutes should be OK.)
Beware that cache times below 300 seconds will be treated as 300 by some clients, and NS cache times below 2 seconds can cause lookup failures.