
I was picturing doing the deduplication as a map-reduce process over a huge (petabytes) dataset, where every worker is blind to what embeddings other workers have already generated. In such a case, shoving the embeddings generated by each worker into a shared vector DB, and having it (maybe incrementally) cluster the vectors as it receives them, would act as the "reduce" step.
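
Roughly what I mean, as a minimal self-contained sketch (numpy stands in for the vector DB, and a simple cosine-similarity-threshold rule stands in for whatever incremental clustering it would actually do; all names here are made up for illustration):

    import numpy as np

    SIM_THRESHOLD = 0.95  # cosine similarity above which two items count as near-duplicates

    class IncrementalClusterer:
        """Stand-in for the shared vector DB's "reduce" role: as embeddings
        arrive from blind map workers, each one either joins an existing
        cluster (near-duplicate) or becomes a new cluster centroid."""

        def __init__(self, dim):
            self.centroids = np.empty((0, dim))  # one normalized row per cluster
            self.counts = []                     # number of members per cluster

        def add(self, vec):
            vec = vec / np.linalg.norm(vec)
            if self.counts:
                sims = self.centroids @ vec
                best = int(np.argmax(sims))
                if sims[best] >= SIM_THRESHOLD:
                    # Near-duplicate: fold it into the existing cluster's centroid.
                    n = self.counts[best]
                    merged = (self.centroids[best] * n + vec) / (n + 1)
                    self.centroids[best] = merged / np.linalg.norm(merged)
                    self.counts[best] += 1
                    return best, True  # (cluster id, was_duplicate)
            self.centroids = np.vstack([self.centroids, vec])
            self.counts.append(1)
            return len(self.counts) - 1, False

    # Simulated "map" output: workers emit embeddings independently,
    # some of which are noisy copies of the same underlying item.
    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        base = rng.normal(size=(3, 128))
        stream = [base[i % 3] + rng.normal(scale=0.01, size=128) for i in range(10)]

        clusterer = IncrementalClusterer(dim=128)
        for vec in stream:
            cid, dup = clusterer.add(vec)
            print(f"cluster={cid} duplicate={dup}")

In a real deployment the clusterer above would be the vector DB itself (receiving inserts concurrently from all the map workers), and the dedup decision would come from its ANN index rather than a brute-force dot product, but the shape of the "reduce" is the same.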




