> What the administrator that sets up that record wants is "send to whichever one of these seems healthy"
In the rrDNS, remove the A records of the hosts that fail their health checks, or whose load is too high.
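Roughly, the health checker is just a loop like the sketch below (TypeScript on Node, assuming a global fetch). The pool addresses, the /healthz path, the load threshold, and updateARecords() standing in for your DNS provider's API are all illustrative assumptions, not anything specific:

```ts
// All candidate hosts for the round-robin record (illustrative addresses).
const pool = ["192.0.2.10", "192.0.2.11", "192.0.2.12"];

// Hypothetical stand-in for your DNS provider's record-update API.
async function updateARecords(name: string, ips: string[], ttlSeconds: number): Promise<void> {
  console.log(`would publish ${name} (TTL ${ttlSeconds}s) ->`, ips);
}

async function isHealthy(ip: string): Promise<boolean> {
  try {
    // Cheap health endpoint on the host itself, with a short timeout.
    const res = await fetch(`http://${ip}/healthz`, { signal: AbortSignal.timeout(2000) });
    if (!res.ok) return false;
    // Let the host report its own load, and drop it when the load is too high.
    const { load } = (await res.json()) as { load: number };
    return load < 0.8;
  } catch {
    return false; // connection errors and timeouts count as unhealthy
  }
}

async function reconcile(): Promise<void> {
  const results = await Promise.all(pool.map(isHealthy));
  const healthy = pool.filter((_, i) => results[i]);
  // Never publish an empty set: every host failing at once usually means
  // the checker is broken, not that the whole fleet is down.
  if (healthy.length > 0) {
    await updateARecords("www.yoursite.com", healthy, 60);
  }
}

setInterval(reconcile, 15_000); // re-check a few times per TTL window
```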
> Maybe you want to try them all in parallel and pick the first to respond (at the TCP connection level).
That's something geoIP at your DNS can do; certainly not as good as doing it in the client, but it should be decent enough.
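For comparison, here is roughly what the client-side version from the quote would look like if you did control the client: race a plain TCP connect to every candidate and keep the first one that answers. A Node sketch; the address list and timeout are assumptions:

```ts
import net from "node:net";

// Try a TCP connect to one host; resolve with the open socket or reject.
function tryConnect(host: string, port: number, timeoutMs = 3000): Promise<net.Socket> {
  return new Promise((resolve, reject) => {
    const socket = net.connect({ host, port });
    const timer = setTimeout(() => {
      socket.destroy();
      reject(new Error(`${host}: connect timeout`));
    }, timeoutMs);
    socket.once("error", (err) => { clearTimeout(timer); socket.destroy(); reject(err); });
    socket.once("connect", () => { clearTimeout(timer); resolve(socket); });
  });
}

// Race every candidate in parallel and keep whichever connects first;
// slower attempts are destroyed once there is a winner.
async function connectFastest(hosts: string[], port: number): Promise<net.Socket> {
  const attempts = hosts.map((h) => tryConnect(h, port));
  const winner = await Promise.any(attempts); // rejects only if every host fails
  for (const attempt of attempts) {
    attempt.then((s) => { if (s !== winner) s.destroy(); }).catch(() => {});
  }
  return winner;
}

// connectFastest(["192.0.2.10", "198.51.100.7"], 443).then((s) => { /* use s */ });
```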
> Your systems had better be very reliable if DNS issues can eat a year's worth of error budget
Or, if you aren't Google or Cloudflare, use a 30 to 60 second TTL in the rrDNS, with health checks that selectively remove the IPs that fail, on pools splitting your servers by region with geoIP. This way, if 1/10 of your east coast servers fail, nobody from APAC is impacted, only 1/10th of your US east users are, and only for the length of the TTL (I'm setting aside ISPs that cache for too long, but you already mitigate a lot of the problem there).
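The selection logic behind such a setup is small. The sketch below shows regional pools, per-IP health state maintained by an external checker, and a short TTL; the regions, addresses, and health map are illustrative, and wiring this into an actual authoritative DNS server is left out:

```ts
type Region = "us-east" | "eu-west" | "apac";

// Illustrative regional pools; in practice this comes from your inventory.
const pools: Record<Region, string[]> = {
  "us-east": ["192.0.2.10", "192.0.2.11"],
  "eu-west": ["198.51.100.20", "198.51.100.21"],
  "apac":    ["203.0.113.30", "203.0.113.31"],
};

// Updated by the health-check loop; absent entries default to healthy.
const healthy = new Map<string, boolean>();

const TTL_SECONDS = 60;

function answersFor(clientRegion: Region): { ips: string[]; ttl: number } {
  const pool = pools[clientRegion];
  const alive = pool.filter((ip) => healthy.get(ip) !== false);
  // A failure in one region never touches the others; within the region,
  // clients that cached a dead IP are stuck for at most TTL_SECONDS.
  return { ips: alive.length > 0 ? alive : pool, ttl: TTL_SECONDS };
}
```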
I can see how it would be easier to handle that in the browser, but you may already be able to do it today with some JS that estimates the latency and stores the result in a cookie: the cookie causes a reload to www.eastcoast.yoursite.com if the user sticks to www.yoursite.com, and if after returning home they still land on www.apac.yoursite.com while a new measurement says "not optimal", it updates the cookie and redirects again.
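A minimal browser-side sketch of that cookie scheme, using the regional hostnames from above; the /ping path, the cookie name, and the "re-measure on every visit" policy are assumptions rather than anything your site necessarily exposes:

```ts
// Regional hostnames from the comment; probe target and cookie policy are assumed.
const REGIONS = ["www.eastcoast.yoursite.com", "www.apac.yoursite.com"];

async function probe(host: string): Promise<number> {
  const start = performance.now();
  try {
    // A tiny, uncacheable resource on each regional host (CORS must allow it).
    await fetch(`https://${host}/ping?nocache=${Date.now()}`, { mode: "cors" });
    return performance.now() - start;
  } catch {
    return Infinity; // an unreachable region loses the comparison
  }
}

async function pickRegion(): Promise<void> {
  const latencies = await Promise.all(REGIONS.map(probe));
  const min = Math.min(...latencies);
  if (!Number.isFinite(min)) return; // nothing reachable, stay where we are
  const best = REGIONS[latencies.indexOf(min)];
  // Remember the choice so later page loads can redirect immediately.
  document.cookie = `preferredRegion=${best}; max-age=86400; path=/; domain=.yoursite.com`;
  // If the current host is not the best one (user stuck on www, or back home
  // but still landing on the APAC host), send them to the better region.
  if (location.hostname !== best) {
    location.assign(`https://${best}${location.pathname}${location.search}`);
  }
}

pickRegion();
```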
I am kind of OK with this solution, and it's in fact my plan for rolling out HTTP/3 for my personal sites. I wrote https://github.com/jrockway/nodedns to update a DNS record to contain the IP addresses of all schedulable nodes in my cluster. I can then serve HTTP/3 on a well-known port, and it is probable that many requests will reach me successfully. (I had to do this because my cloud provider's load balancer doesn't support UDP, and I don't have access to "floating IPs"; basically my node IPs change whenever the cluster topology needs to change.)
I don't really like it because it still means a minute of downtime when the topology does change. I would prefer telling the browser what strategy to use to try a new node, rather than relying on heuristics and defaults.
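For anyone curious, the core of that idea fits in a few lines. This is a sketch of the concept, not the actual nodedns implementation: it shells out to kubectl rather than using a client library, the domain is illustrative, and publishARecords() is a hypothetical stand-in for a provider-specific DNS update:

```ts
import { execFileSync } from "node:child_process";

interface Node {
  spec?: { unschedulable?: boolean };
  status: { addresses: { type: string; address: string }[] };
}

// Collect the external IPs of every node that is currently schedulable.
function schedulableNodeIPs(): string[] {
  const out = execFileSync("kubectl", ["get", "nodes", "-o", "json"], { encoding: "utf8" });
  const nodes: Node[] = JSON.parse(out).items;
  return nodes
    .filter((n) => !n.spec?.unschedulable)
    .flatMap((n) =>
      n.status.addresses
        .filter((a) => a.type === "ExternalIP")
        .map((a) => a.address),
    );
}

// Hypothetical provider call; run this on a timer or on node events so the
// record tracks topology changes (with the TTL-long gap mentioned above).
async function publishARecords(name: string, ips: string[]): Promise<void> {
  console.log(`would set ${name} A records to`, ips);
}

publishARecords("www.mysite.example", schedulableNodeIPs());
```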