I wish they had included the datasets they used for the evaluations. As far as I can tell, appendix II includes some sample questions, answers, and golden chunks, but they don't provide the full datasets or explicit information on exactly what the datasets are.
Does anyone know if the datasets they used for the evaluation are publicly available or if they give more information on the datasets than what's in appendix II?
There are standard publicly available benchmarks for this type of evaluation, like MTEB (https://github.com/embeddings-benchmark/mteb). I wonder how this technique does on the MTEB tasks.