Well explained! All of the later contributions to the sum are effectively ignored, or at least severely truncated, in 32-bit because the "buckets" are big.
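As a concrete sketch of the bucket effect (NumPy here, with made-up numbers rather than anything from the original case):

```python
import numpy as np

# Near 2**24 the spacing between adjacent float32 values ("bucket size") is 2.0,
# so a contribution of 1.0 is rounded away completely.
acc = np.float32(2**24)              # 16777216.0
print(acc + np.float32(1.0) == acc)  # True: the 1.0 vanishes

# The same loss happens gradually in a naive float32 running sum: as the
# accumulator grows, each new term loses more and more of its low bits.
x = np.full(1_000_000, 0.1, dtype=np.float32)  # made-up data: a million 0.1's
total = np.float32(0.0)
for v in x:
    total += v
print(total)                    # drifts visibly away from the expected ~100000
print(x.sum(dtype=np.float64))  # 64-bit accumulator lands on ~100000 as expected
```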
It was precisely this problem. The individual had done all data preparation/normalization in 32-bit because the model training used 32-bit on the GPU. It's a very reasonable mistake if one hasn't been exposed to floating point woes. I was pleased to see that they ultimately caught it after noticing that two libraries disagreed about the mean.
Computing a 64-bit mean was enough. Compensated (i.e. Kahan) summation would have worked too.
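For anyone curious, here is a minimal sketch of both fixes on made-up float32 data (the names and numbers are illustrative, not from the case above):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1_000_000).astype(np.float32)   # made-up float32 data

# Fix 1: keep the data in 32-bit but accumulate the sum in 64-bit.
mean64 = x.sum(dtype=np.float64) / x.size

# Fix 2: compensated (Kahan) summation with a float32 accumulator. A small
# correction term carries the low-order bits each addition would otherwise drop.
def kahan_mean(values):
    total = np.float32(0.0)
    comp = np.float32(0.0)        # running compensation for lost low bits
    for v in values:
        y = v - comp              # re-inject the bits lost on the previous step
        t = total + y
        comp = (t - total) - y    # what this addition just rounded away
        total = t
    return total / np.float32(len(values))

print(mean64)         # ~0.5, essentially the exact mean of the float32 data
print(kahan_mean(x))  # agrees to float32 precision despite the 32-bit accumulator
```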