I remember seeing some machine learning algorithm posted here a while back that did an amazing job fingerprinting writing samples. It could use that fingerprint to match up accounts from multiple different sites. Some people here had it correctly find their Reddit profile based off nothing but their public HN data.
If you are writing extensively both anonymously and non-anonymously, you should probably assume that someone motivated enough could match the two together either presently or in the near future as such technology becomes more widespread.
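For anyone wondering what "fingerprinting" a writing sample can look like in its simplest form, here is a minimal sketch. It is not the tool mentioned above, whose method I don't know; it assumes a much cruder approach that compares character trigram counts using cosine similarity. The function names (`char_ngrams`, `cosine_similarity`, `style_similarity`) are made up for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def style_similarity(text_a, text_b, n=3):
    """Hypothetical 'fingerprint match' score in [0, 1] for two writing samples."""
    return cosine_similarity(char_ngrams(text_a, n), char_ngrams(text_b, n))
```

With raw counts like this, any two long English samples tend to score fairly high because common trigrams dominate, so a real system would presumably need some weighting or a learned model to tell authors apart.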
My name here is my name on Reddit is my name on GitHub is my name in real life. I’ve never believed it was possible to be totally anonymous on someone else’s server, so it’s good to have the reminder that I’m absolutely not every time I post anything.
This evidence is inadmissible in the court of public opinion. You can't go on Twitter, say that the writing styles of this person and that person are 98% similar per the state-of-the-art SGERT model, and convince the public with that. It's also trivial for a doxing target to dismiss this evidence as another flaky piece of software that's confusing writing styles. This kind of software is useful, but for more specialized purposes: forensics, intelligence.
You don't need any evidence at all to convince the online public of anything it wants to believe, especially if it gives it a target for vitriol. I have to agree with my sibling commenter that believing otherwise is naive given the wealth of evidence to the contrary.
I'd really hate to ruin your naïveté...but sufficiently riled-up mobs have gone after individuals for much, much flimsier evidence than "because this AI says so".
After they're identified, keen humans will go looking for stronger clues. Maybe the different accounts tell the same anecdotes, show the same set of opinions or knowledge, or post at similar but never coincident times, or there's whatever other human-readable evidence.
Yes. The time of day the user posts can reveal their time zone. People often leave comments that reveal their age or gender. If the user mentions a business or product name, it might only be available in certain parts of the world. Many people reuse account names on other services, and their content there may have more clues... etc.
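The time zone point is easy to sketch. The snippet below assumes a simple heuristic that is my own illustration, not a known tool: find the quietest 8-hour stretch in a user's posting hours (in UTC), assume that is when they sleep, and assume sleep starts around 23:00 local time. The function names and the `assumed_sleep_start` parameter are hypothetical.

```python
from collections import Counter


def quietest_window_start(utc_hours, window=8):
    """Return the UTC hour that starts the `window`-hour stretch with the fewest posts."""
    counts = Counter(h % 24 for h in utc_hours)

    def posts_in_window(start):
        # The window wraps around midnight, hence the modulo.
        return sum(counts[(start + k) % 24] for k in range(window))

    return min(range(24), key=posts_in_window)


def guess_utc_offset(utc_hours, assumed_sleep_start=23, window=8):
    """Guess a UTC offset in hours, assuming the quiet window is sleep
    beginning at `assumed_sleep_start` local time (an assumption, not a fact)."""
    start = quietest_window_start(utc_hours, window)
    offset = (assumed_sleep_start - start) % 24
    # Normalize to the range (-12, 12].
    return offset - 24 if offset > 12 else offset
```

For example, an account that posts once in each UTC hour from 15 through 6 and never from 7 through 14 gets a guess of -8, which is consistent with US Pacific time. It is a rough guess at best: night owls, shift workers, and scheduled posts would all throw it off, and with no data it returns a meaningless default.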
A friend of mine asked GPT-3 to mimic a text written by "nindalf". It was so good that I thought it had plagiarised something I had actually written on HN. But it was only mimicking the style of my comments.
Yeah, I tried that and it completely struck out. Gave a "similarity score" of .992 or .993 for ten other accounts that weren't mine. Detected a big fat zero of my old accounts (I rotate them regularly).
The author almost immediately took the site down and packaged it as part of some social media analysis tool. I can't seem to find the actual post at the moment.