I work in the industry and nobody has done that for over ten years. There was a brief phase after Bengio et al. published "Greedy Layer-Wise Training of Deep Networks" in 2007 (building on Hinton's 2006 work on deep belief nets), and people did it for a few years at most. By the time LSTMs took off in the 2010s it was no longer done, and it isn't done with transformers either. Would you care to share how you reached your conclusion? It matches none of my experience over the last 15 years, and we train large-scale LLMs at my company. There's just not much point to it when gradients don't vanish.
Not easy to give a concise answer here, but let me try:
The vanishing-gradient problem mainly occurs in networks with recurrent connections or in very deep architectures. In recurrent architectures it was largely solved by LSTMs, whose gates control how the signal flows through time. In very deep networks, e.g. ResNet, it was solved by residual connections, i.e. skip connections that bypass layers. There were also other advances, such as replacing sigmoid activations with the simpler ReLU.
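To make the skip-connection point concrete, here is a toy sketch (not how any real network is built): a chain of scalar sigmoid layers, where I track the gradient of the output with respect to the input by the chain rule. The function name `input_gradient`, the weight `w=1.0`, the input `x=0.5` and the depth of 50 are all assumptions I picked for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(depth, w=1.0, x=0.5, skip=False):
    """Return d(output)/d(input) for a chain of scalar sigmoid layers."""
    h, grad = x, 1.0
    for _ in range(depth):
        s = sigmoid(w * h)
        local = w * s * (1.0 - s)   # derivative of sigmoid(w*h) w.r.t. h, at most 0.25*w
        if skip:
            h = h + s               # residual layer: h_next = h + f(h)
            grad *= 1.0 + local     # the identity path contributes the 1
        else:
            h = s                   # plain layer: h_next = f(h)
            grad *= local
    return grad

plain = input_gradient(depth=50)
residual = input_gradient(depth=50, skip=True)
print(f"plain:    {plain:.3e}")
print(f"residual: {residual:.3e}")
```

Without the skip path every layer multiplies the gradient by a factor of at most 0.25, so it shrinks geometrically with depth. With the skip path every factor is at least 1, so the gradient cannot vanish (it can still grow, which is a separate issue).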
Transformers, the main architecture behind modern LLMs, are highly parallel and have no recurrence: every layer still has access to all input tokens, whereas an RNN processes one token at a time. To deal with the potential problems caused by depth, they also use skip connections.
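A rough way to see the difference between recurrence and direct access to all tokens, again as a toy sketch under my own assumptions: a scalar tanh RNN with a hypothetical recurrent weight `w=0.5`, compared with a single attention-style weighted sum where the weights are held fixed and uniform (real attention weights depend on the input, which I ignore here).

```python
import math

def rnn_grad_wrt_first_token(tokens, w=0.5):
    """d h_T / d x_1 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = 0.0, 0.0
    for t, x in enumerate(tokens):
        h = math.tanh(w * h + x)
        local = 1.0 - h * h        # derivative of tanh at this step
        if t == 0:
            grad = local           # x_1 enters directly at the first step
        else:
            grad *= w * local      # later steps only see x_1 through h_{t-1}
    return grad

def attention_grad_wrt_first_token(tokens):
    """d out / d x_1 for out = sum_i a_i * x_i with fixed uniform weights a_i."""
    return 1.0 / len(tokens)

tokens = [0.1] * 100               # made-up input sequence
rnn_grad = rnn_grad_wrt_first_token(tokens)
attn_grad = attention_grad_wrt_first_token(tokens)
print(f"RNN:       {rnn_grad:.3e}")
print(f"attention: {attn_grad:.3e}")
```

In the RNN the first token reaches the last state only through 99 multiplications, each by a factor of at most 0.5. In the attention-style sum the first token is one step away from the output no matter how long the sequence is.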