Perhaps the main advantage is that it can filter out non-adaptive, non-functional (noisy) mutations by simply averaging similar genomes? In that case, the rows of the learned M x N kernels should be nearly identical, and one could instead have averaged M data rows at a time and fed the result to the 1D CNN.
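To make the equivalence concrete, here is a hedged NumPy sketch (toy shapes, hypothetical data): when every row of an M x k kernel is identical, summing the per-row 1D correlations of an M x N block equals a single 1D correlation against the row average, up to the constant factor M.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, k = 4, 32, 5                   # toy sizes, not from the paper

X = rng.normal(size=(M, N))          # M aligned data rows ("genomes")
r = rng.normal(size=k)               # shared 1D kernel row
K = np.tile(r, (M, 1))               # M x k kernel with identical rows

# 2D valid correlation, collapsed over the row axis:
out2d = sum(np.correlate(X[i], K[i], mode="valid") for i in range(M))

# Equivalent by linearity: average the rows first, then one 1D
# correlation, scaled by M:
out1d = M * np.correlate(X.mean(axis=0), r, mode="valid")

print(np.allclose(out2d, out1d))  # True
```

So if the learned rows really do converge to near-identical filters, the 2D stage adds nothing beyond a (learned) row average.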
What other phylogenetic information could possibly be inferred?
Edit: It could also be thought of as data augmentation, since it effectively creates novel inputs each time. IIRC there was also a technique for hardening against adversarial examples that simply fed the network averaged data points along with the original data.
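The averaged-datapoints idea resembles mixup-style augmentation: train on convex combinations of input pairs (and of their labels) in addition to the originals. A minimal sketch, assuming a toy batch and hypothetical shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup_batch(X, y, alpha=0.2, rng=rng):
    """Return averaged inputs/labels from randomly paired batch rows."""
    lam = rng.beta(alpha, alpha, size=len(X))    # per-sample mixing weight
    j = rng.permutation(len(X))                  # random partner for each row
    X_mix = lam[:, None] * X + (1 - lam[:, None]) * X[j]
    y_mix = lam * y + (1 - lam) * y[j]           # soft labels
    return X_mix, y_mix

X = rng.normal(size=(8, 16))                 # toy batch: 8 sequences, length 16
y = rng.integers(0, 2, size=8).astype(float) # binary labels
X_mix, y_mix = mixup_batch(X, y)             # feed alongside (X, y)
```

Feeding `(X_mix, y_mix)` alongside the clean batch smooths the decision boundary between classes, which is the usual rationale for why averaged inputs help with robustness.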