Actually, there is good reason to stay within a 30ms ping (99th percentile). Half of that is 15ms one-way, and adding the 5ms algorithmic delay of Opus (in its CELT-only restricted low-delay mode) gives 20ms, which is the lower end of the uncanny valley for real-time interactive audio. That's certainly the case for musicians playing in a band, and I'll presume it becomes relevant for verbal communication at the same psychoacoustic threshold.

Any latency you add by actually executing the encoder/decoder pair counts against the one-way path, and since ping is a round-trip figure, you have to subtract double that latency from your ping allowance.
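To make the budget arithmetic concrete, here is a minimal sketch in Python. The function name and its parameters are hypothetical, the 20ms one-way budget and 5ms Opus algorithmic delay are the figures from above, and the 1ms codec execution time in the second call is purely an illustrative assumption.

```python
def ping_allowance_ms(one_way_budget_ms=20.0,
                      algorithmic_delay_ms=5.0,
                      codec_exec_ms=0.0):
    """Return the largest round-trip ping that still fits the one-way budget.

    one_way_budget_ms:    mouth-to-ear latency target (20ms, from the text above)
    algorithmic_delay_ms: delay inherent to the codec (5ms for Opus in its
                          restricted low-delay mode, from the text above)
    codec_exec_ms:        time spent running encoder + decoder (assumed value)
    """
    # Whatever is left of the one-way budget is available for the network.
    one_way_network_ms = one_way_budget_ms - algorithmic_delay_ms - codec_exec_ms
    if one_way_network_ms < 0:
        raise ValueError("codec delays alone already exceed the one-way budget")
    # Ping is measured round-trip, so the allowance is twice the one-way share.
    return 2.0 * one_way_network_ms


if __name__ == "__main__":
    print(ping_allowance_ms())                   # no execution cost
    print(ping_allowance_ms(codec_exec_ms=1.0))  # assumed 1ms encode+decode cost
```

With no execution cost this gives the 30ms figure, and every millisecond spent in the encoder/decoder pair takes 2ms off the allowed ping.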

If you want correctly lip-synced audio in a video call, I know of only one setup that offers similarly low video latency: a rolling-shutter camera, a line-by-line display (a CRT should do well), and up to a few lines of algorithmic delay for e.g. running non-buffering JPEG (an 8x8 DCT plus an online entropy coder, with no pre-analysis for optimal Huffman tables or the like) to save something like 80-90% of the bandwidth. Analog TV camera and screen hardware should also work, but it's really inefficient and not easy to emulate with digital hardware.