can you elaborate on the Chinchilla scaling law / dataset problem a bit? (perhaps by editing your previous comment?)
what datasets are available to the community, how big are they, do they need to be updated from time to time, where are they stored, what are the usual cost ranges involved, ...? :o