I was retrieving text from news sites, so URLs were not that relevant.
Some news services will re-issue a story with more information, keeping the same title and description. A full text check is necessary. I computed a secure hash of the text and compared that.
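A minimal sketch of that kind of full-text check, assuming SHA-256 as the secure hash and a whitespace-collapsing step before hashing (both are assumptions; the comment above does not say which hash or normalization was used). The function name `text_fingerprint` and the sample strings are hypothetical.

```python
import hashlib
import re


def text_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the article text.

    Whitespace is collapsed first (an assumption, not from the thread) so
    that pure layout differences do not count as a content change.
    """
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical example: same title and description, but the re-issued
# story carries an extra sentence in the body.
original = "Storm hits coast.\nTwo towns lose power."
reissued = "Storm hits coast.\nTwo towns lose power. Officials now report flooding."

is_same_story_text = text_fingerprint(original) == text_fingerprint(reissued)
print(is_same_story_text)
```

Any edit to the body, however small, produces a different digest, so this detects re-issued stories but cannot tell a one-word correction from a full rewrite.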
There are two extremes I know of here: the same or a similar title with changing content (we hit gold in SEO, let's keep updating this "10 best foos for baring" page), and a changing title with the same content (anyone doing serious A/B testing).
Both true. The internet has way more covers than actual books you could say :). Content is very much repackaged over and over again.
However, I found that URLs don't change as much as titles and slightly edited texts do. It will happen, of course, but to go beyond that you would need a similarity hash of the actual content of the page, and even that reaches its limits pretty quickly:
Sometimes a changed title plus a few edits can change the whole narrative of a near-identical article. It looks like even AI can't currently solve that. And interestingly, I've seen that happen even at pretty large newspapers: "whatever clicks"...
Since zebra is meant to enable debate/exchange of perspectives on specific content, I think only URLs give some degree of certainty about pointing to the same content.
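To illustrate that limit, here is a rough sketch that compares two texts by Jaccard similarity over word shingles. This is not a similarity hash itself, but it is the kind of overlap score that schemes such as MinHash approximate. The function names, the shingle size of 2, and the two sample sentences are all hypothetical.

```python
import re


def shingles(text: str, k: int = 2) -> set:
    """Return the set of k-word shingles (overlapping word windows)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 2) -> float:
    """Shared shingles divided by all distinct shingles across both texts."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    if not union:
        # Both texts are too short to produce any shingle.
        return 1.0
    return len(sa & sb) / len(union)


# Hypothetical pair: one word changed, opposite narrative.
before = "the minister denied the report on tuesday and promised a full review"
after = "the minister confirmed the report on tuesday and promised a full review"

similarity = jaccard_similarity(before, after)
print(round(similarity, 2))
```

The two sentences share 9 of 13 distinct shingles, so they score about 0.69 despite saying opposite things, and in a full-length article a single swapped word would move the score far less. A threshold on this number cannot separate a harmless edit from one that flips the story.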