I was picturing the deduplication as a map-reduce process over a huge (petabytes) dataset, where every worker is blind to the embeddings other workers have already generated. In that case, shoving the embeddings from each worker into a shared vector DB and having it (maybe incrementally) cluster the vectors as it receives them would act as the "reduce" step.
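
Roughly the shape I have in mind, as a toy sketch: the names (`IncrementalDeduper`, `embed`) are made up, the "shared store" is just an in-memory numpy array standing in for a real vector DB, and the clustering is the simplest possible thing (cosine similarity against kept items with a threshold), not whatever a production service would do.

```python
import numpy as np

class IncrementalDeduper:
    """In-memory stand-in for the shared vector DB doing the 'reduce' step."""

    def __init__(self, dim: int, threshold: float = 0.95):
        self.threshold = threshold
        self.centroids = np.empty((0, dim), dtype=np.float32)  # kept embeddings
        self.kept_ids: list[str] = []

    def add(self, doc_id: str, embedding: np.ndarray) -> bool:
        """Return True if the doc is new, False if it's a near-duplicate."""
        vec = (embedding / np.linalg.norm(embedding)).astype(np.float32)
        if len(self.centroids):
            sims = self.centroids @ vec        # cosine similarity vs. everything kept so far
            if sims.max() >= self.threshold:   # too close to something already seen
                return False
        self.centroids = np.vstack([self.centroids, vec])
        self.kept_ids.append(doc_id)
        return True

# "Map" side: each worker embeds its own shard independently.
# embed() is a placeholder for whatever model the workers actually run.
def embed(text: str, dim: int = 8) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim).astype(np.float32)

deduper = IncrementalDeduper(dim=8)
for doc_id, text in [("a", "hello world"), ("b", "hello world"), ("c", "something else")]:
    if deduper.add(doc_id, embed(text)):
        print("keep", doc_id)   # "b" gets dropped as a duplicate of "a"
```

In a real setup the workers would just stream their vectors to the DB and this incremental comparison/clustering would happen on that side, so no worker ever needs a global view of the corpus.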