Hacker News

If I take 50,000 historical articles and 5,000 new articles, apply SBERT, and then run k-means with N=20, I get great results: articles about Ukraine, sports, chemistry, and nerdcore from Lobsters end up in distinct clusters.
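A minimal sketch of that pipeline: embed each article with SBERT, then cluster the embeddings with k-means. The model checkpoint, corpus, and cluster count here are illustrative assumptions, not the commenter's actual setup; the fallback branch lets the sketch run even without `sentence-transformers` installed.

```python
import numpy as np
from sklearn.cluster import KMeans

try:
    # Real SBERT embeddings via the sentence-transformers package.
    from sentence_transformers import SentenceTransformer
    _model = SentenceTransformer("all-MiniLM-L6-v2")  # a common SBERT checkpoint

    def embed(texts):
        return _model.encode(texts)
except ImportError:
    # Fallback so the sketch runs without the package: random vectors
    # stand in for real SBERT embeddings (same 384-dim shape).
    _rng = np.random.default_rng(0)

    def embed(texts):
        return _rng.normal(size=(len(texts), 384))

# Placeholder corpus; in practice this would be the 55,000 article texts.
articles = [f"article text {i}" for i in range(100)]

X = embed(articles)
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
labels = km.labels_  # one topic-cluster id per article
```

With real embeddings, articles on the same topic tend to land in the same `labels` bucket, which is what makes the Ukraine/sports/chemistry split fall out.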

I’ve used DBSCAN for finding duplicate content; this has been less successful. With the parameters I’m using, false positives are rare, but there aren’t many true positives either. I’m sure I could do better if I tuned it, but I’m not sure there is an operating point I’d really like.
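For reference, duplicate detection with DBSCAN typically means clustering the same embeddings with a very tight distance threshold: points within `eps` of each other form a cluster (a duplicate group), and everything labeled `-1` is treated as unique. The `eps`/`min_samples` values below are illustrative, not the commenter's tuned parameters; the synthetic vectors just simulate a few near-duplicates.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
base = rng.normal(size=(5, 16))          # five distinct "articles"
# Append three near-copies of article 0 to simulate duplicate content.
near_dupes = base[0] + 0.001 * rng.normal(size=(3, 16))
X = np.vstack([base, near_dupes])

# Tight eps in cosine distance: only near-identical vectors cluster together.
db = DBSCAN(eps=0.05, min_samples=2, metric="cosine").fit(X)
dupes = {i for i, lbl in enumerate(db.labels_) if lbl != -1}
```

The tradeoff the comment describes shows up directly in `eps`: shrinking it suppresses false positives but also misses paraphrased duplicates, and there may be no single value that gets both right.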



