Spatial composition is easy if you stop bothering with pure text-to-image. SD has several tricks and UIs for placing objects precisely; they're all janky, but they do work, and at that point it's practically photobashing. Attribute separation is also easy with tricks like token bucketing, so your Indian guy will look Indian and your East Asian guy will look East Asian. All of that becomes easy once you abandon ambiguous natural language and use higher-order guidance.

What's really required is semantic composition: making subjects interact meaningfully and predictably, or combining them together. The overall stitched picture also has to be coherent, so you don't end up with several different perspective planes.