I was thinking that too, but this description makes it sound more like a type of windowing than like the position encoding in a transformer (which is fixed):
"the information [...] gets passed between different neural populations in a predictable way, which serves to time-stamp each sound with its relative order."