In general, all Zen generations share two characteristics: cores are grouped into 4-core clusters called CCXs, and two of those are grouped into a CCD. Chips (Zen 1 and 1+) and chiplets (Zen 2) have only ever carried one CCD per chip(-let), and sockets have been populated with 1, 2, or 4 chip(-lets).
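If you want to see that grouping on a live Linux box, here's a minimal sketch that just reads sysfs; it assumes the usual /sys/devices/system/cpu layout and that cache index3 is the L3 (which it is on these parts):

    /* Print which logical CPUs share an L3 slice; each distinct group is one CCX. */
    #include <stdio.h>

    int main(void) {
        char path[128], buf[256];
        for (int cpu = 0; cpu < 1024; cpu++) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list", cpu);
            FILE *f = fopen(path, "r");
            if (!f)
                break;                  /* no more CPUs, or no L3 info exposed */
            if (fgets(buf, sizeof buf, f))
                printf("cpu%-4d shares L3 with: %s", cpu, buf);  /* buf keeps its newline */
            fclose(f);
        }
        return 0;
    }

Every set of cores that prints the same list is one CCX sharing one L3 slice.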
In Zen 1 and 1+, each chip had a micro IO die, which contained the L3, making a quasi-NUMA system. Example: in a dual-processor Epyc of that generation, a fetch/write request would be answered by whichever of the 8 memory controllers was closest to the data: either somebody already had it in L3, or somebody owned that memory channel.
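You can see those quasi-NUMA domains directly with libnuma (numactl --hardware prints the same thing with no code at all). Rough sketch, Linux only, link with -lnuma:

    /* Print the NUMA node count and the kernel's node-to-node distance table.
       On a dual-socket Epyc of that generation this reports 8 nodes, one per
       die/memory controller, with bigger distances for die-to-die and
       socket-to-socket hops. */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            puts("no NUMA support here");
            return 1;
        }
        int nodes = numa_num_configured_nodes();
        printf("NUMA nodes: %d\n", nodes);
        for (int from = 0; from < nodes; from++) {
            for (int to = 0; to < nodes; to++)
                printf("%4d", numa_distance(from, to));  /* 10 = local; bigger = farther */
            printf("\n");
        }
        return 0;
    }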
L3 latency on such systems should be quoted as an average, or as a best case and a worst case. Stating only the worst case ignores cache optimizations: prefetchers can grab lines from non-local L3, and fetches from L3 don't compete with the finite RAM bandwidth but add to it, so having multiple L3 caches answer your core can mean a 2-4x performance increase. Intel has the same class of issue: RAM on another socket also carries a latency penalty, which is the nature of all NUMA systems, no matter who manufactures them.
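If you'd rather have actual numbers than a quoted worst case, a dependent pointer chase gives you average load-to-use latency; run it with the thread and the memory pinned to different nodes (numactl --cpunodebind / --membind) and you can map out the best/average/worst spread yourself. Rough sketch, not a rigorous benchmark, and the buffer size is just a guess at "roughly L3-sized":

    /* Average load-to-use latency over a dependent pointer chase. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ENTRIES (1u << 20)   /* 8 MiB of pointers */
    #define STEPS   (1u << 25)

    int main(void) {
        size_t *chain = malloc(ENTRIES * sizeof *chain);
        if (!chain)
            return 1;
        /* Sattolo's shuffle: one big cycle, so the chase visits every entry
           and the prefetcher can't guess the next line. */
        for (size_t i = 0; i < ENTRIES; i++)
            chain[i] = i;
        for (size_t i = ENTRIES - 1; i > 0; i--) {
            size_t j = rand() % i;
            size_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
        }
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t idx = 0;
        for (size_t s = 0; s < STEPS; s++)
            idx = chain[idx];          /* every load depends on the previous one */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("avg load-to-use latency: %.1f ns (final idx %zu)\n", ns / STEPS, idx);
        free(chain);
        return 0;
    }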
Where Zen 1 and 1+-based systems performed badly was when the prefetcher (or a NUMA-aware program) did not get pages into L2 or local L3 fast enough to hide the latency. Epyc had the problem of too many IO dies talking to each other; Ryzen had the opposite problem, a single IO die that was not enough to keep the system performing smoothly.
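That's also exactly what a NUMA-aware program does by hand: keep the pages on the same node as the thread touching them, so only local-node latency has to be hidden. Minimal libnuma sketch (node 0 is just an example; link with -lnuma):

    /* Run on node 0's cores and back the buffer with node 0's memory. */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0)
            return 1;
        int node = 0;                                /* example node */
        numa_run_on_node(node);                      /* keep this thread on that node's cores */
        size_t bytes = 64UL << 20;                   /* 64 MiB */
        char *buf = numa_alloc_onnode(bytes, node);  /* memory from the same node */
        if (!buf)
            return 1;
        memset(buf, 0, bytes);                       /* touch the pages so they fault in locally */
        printf("64 MiB allocated and touched on node %d\n", node);
        numa_free(buf, bytes);
        return 0;
    }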
Zen 2 (the generation I personally adopted, and a wonderful architecture) switched to a chiplet design: it still has two 4-core CCXs per CCD (and thus per chiplet), but the IO die now lives in its own chiplet, giving one monolithic L3 per socket. The IO die is scaled to the needs of the system instead of growing statically with additional CCDs. Ryzen is now ridiculously fast: it meets or beats Coffee Lake Refresh performance (single- and multi-threaded) for the same price, while drawing less power and putting out less heat. Epyc now scales to ridiculously huge sizes without losing performance in non-optimal cases or getting into weird NUMA latency games: everyone's early tests of Epyc 2 systems on intentionally NUMA-hostile workloads show a very favorable worst case, meeting or beating Intel's current gargantuan Xeons in workloads sensitive to memory latency.
So, your statement of "a tiny NUMA system within the package" is correct for the older Zens, but not for Zen 2 (which is, thankfully, vastly improved on that front).
Yeah. I bet part of why there's so much L3 per core group is that it's really expensive to go further away.
Seems like there're at least two approaches for future gens: widen the scope across which you can share L3 without a slow trip across the I/O die, or speed up the hop through the I/O die. Unsure what's actually a cost-effective change vs. just a pipe dream, though.