Human-Assisted Archival of Yahoo Groups (github.com/davidferguson)
384 points by jstanley on Dec 8, 2019 | 62 comments



Wonder if we could make a deal with all those casino SEO spammers from Indonesia. (Or whoever sells their service to them.)

If they use their cache of Yahoo addys to exfiltrate Yahoo Groups content, we'll give them a free month of GitLab user spam usage, no questions asked.


I agree there must be something that could be done to utilize the click farms elsewhere.


Leaderboard: https://df58.host.cs.st-andrews.ac.uk/yahoogroups/leaderboar...

Unfortunately doesn't seem to be updated very frequently.


And my new yahoo account can't join any groups :/


I had this. Once I added a backup email address, it worked. I found this by trial and error. It seems that Yahoo doesn't consider its email addresses to be valid.


Did you see the sibling comment to yours, posted 7 hours earlier? Is that not a different issue?


What error do you get? If it's:

> Your email address is not linked to a Yahoo ID. To join this group, you need to link your email address to a Yahoo account.

Then you need to go into "Account security" and link an email address with the account. I don't know why the instructions don't tell you this.


Sorry, got sidetracked, wanted to edit in the issue link. It's this issue: https://github.com/davidferguson/yahoogroups-joiner/issues/1...

Thanks for the suggestion though, I saw that thread about linking a fallback email address as well but that seems to be a different thing.


Is there a possibility of asking ReCaptcha to disable their protection for Yahoo Groups? It's a reach I guess.


That's Google, and that would be suicide to the brand.


Exactly. Especially since Google is trying to transform reCAPTCHA into an enterprise offering: https://cloud.google.com/recaptcha-enterprise/


Thinking like a different kind of hacker. I like it.


reCAPTCHA is the crappiest product Google makes, in my opinion. If they're this bad at detecting a human, they shouldn't even be trying. I can solve 10 puzzles correctly in a row and it still thinks I'm not a human.


Hal is that you?

Thanks for helping us test our next CAPTCHA on HN, you’re getting closer. https://xkcd.com/810/


As of now(), it seems the reCAPTCHA is over the limit

https://github.com/davidferguson/yahoogroups-joiner/issues/1...


This seems to be the end of the road, unless Google increases the reCAPTCHA quota of Yahoo Groups.

Let Google and Verizon Media know with a tweet:

  https://twitter.com/intent/tweet?url=https%3A%2F%2Fgithub.com%2Fdavidferguson%2Fyahoogroups-joiner%2Fissues%2F14&text=@Google%20and%20@verizonmedia%2C%20increase%20the%20reCAPTCHA%20quota%20of%20Yahoo%20Groups%2C%20archivists%20are%20blocked%20from%20preserving%20history.&hashtags=letusarchive%2Crecaptcha
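If you want to see what that link pre-fills before clicking it, here is a minimal sketch that decodes its query string with Python's standard library. The helper name `decode_intent` is just for illustration; the URL is copied from the line above.

```python
from urllib.parse import urlsplit, parse_qs

# The intent link from the comment above, split across lines for readability.
INTENT_URL = (
    "https://twitter.com/intent/tweet"
    "?url=https%3A%2F%2Fgithub.com%2Fdavidferguson%2Fyahoogroups-joiner%2Fissues%2F14"
    "&text=@Google%20and%20@verizonmedia%2C%20increase%20the%20reCAPTCHA%20quota"
    "%20of%20Yahoo%20Groups%2C%20archivists%20are%20blocked%20from%20preserving"
    "%20history."
    "&hashtags=letusarchive%2Crecaptcha"
)


def decode_intent(url: str) -> dict:
    """Return the percent-decoded query parameters of a tweet intent link."""
    query = urlsplit(url).query
    # parse_qs returns a list per key; each key appears once here, so take the first.
    return {name: values[0] for name, values in parse_qs(query).items()}


fields = decode_intent(INTENT_URL)
for name, value in fields.items():
    print(f"{name}: {value}")
```

The three fields are the linked issue, the tweet text, and a comma-separated hashtag list.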


It works fine now, I have been joining groups for some hours.


That's great! Thanks to whoever fixed this - at e.g. Google or Verizon.


Seems like something Google could fix quickly.


yes, although better to not tell anyone about it:

https://news.ycombinator.com/item?id=21739662


Disabling a captcha and removing a quota for a captcha are two separate things.


You can join a Yahoo group by sending a blank email to "<groupname>-subscribe@yahoogroups.com". There is no CAPTCHA, only an automated email confirmation which you reply to. Is there a reason why that wouldn't work for this project?
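For anyone who wants to try it, here is a rough sketch of building that blank subscribe email in Python. It only constructs the message and does not send anything. The group name `examplegroup`, the sender address, and the helper name `build_subscribe_message` are placeholders I made up; only the `-subscribe@yahoogroups.com` pattern comes from the above. You would still have to reply to the confirmation email yourself.

```python
from email.message import EmailMessage

YAHOO_GROUPS_DOMAIN = "yahoogroups.com"


def build_subscribe_message(group_name: str, sender: str) -> EmailMessage:
    """Build a blank email addressed to <groupname>-subscribe@yahoogroups.com."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = f"{group_name}-subscribe@{YAHOO_GROUPS_DOMAIN}"
    # No subject and no body are set: the subscribe address expects a blank email.
    return msg


# Hypothetical group and sender, for illustration only.
msg = build_subscribe_message("examplegroup", "me@example.com")
print(msg["To"])
```

Actually sending it would go through whatever SMTP server your mail provider gives you, which is left out here on purpose.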


If this works, it would be much easier to automate the work of the army of manual archivistas who are currently using the Chrome extension and blocked by reCAPTCHA.


You can only join a very small number of groups this way before it just stops working - we did try this out early in the archive process


Where is this stated?


Could Mechanical Turk be used to do this? I'd be happy to make a donation to help offset the cost.


This is the third time I have seen this suggested. Maybe get in touch and suggest it. I am also on a tight schedule these days, and it would be easier to help with money.


How fitting that Yahoo archival needs to be organized by people.


How so? Are you referring to some kind of history of Yahoo?


Yahoo started by building an index of the internet but instead of using an algorithm, they primarily relied on people crawling the web and categorizing websites and curating things. It was a people powered directory.


Jumped in when we were joining groups starting with 'S', and now we're already at 'E'. Really satisfying.

Given that my account got a connection attempt from Sweden, I guess that's where the archivist lives. Hopefully he will have a nice morning tomorrow thanks to the community.


I started joining groups a few hours ago and now have 43. The first group I joined started with U. Then there were some with T, some with S, and I'm at R now.

I'd rather we archive large or relevant groups first instead of going alphabetically and having to join groups with just 1 post.


It's going alphabetically through the groups people have nominated, and once that's done, it'll go by group member count.


Apparently now that they have gotten through the Fandom groups, and are working through the groups that were requested to be archived, they are ranked in order of number of members. So they are working on progressively smaller groups.


Are you going in reverse alphabetical order, or what?


I read on the other HN thread that this is the case.


Some things I've noticed about Google reCAPTCHA already: it can't tell the difference between a 3x3 matrix of a house and a bus, it doesn't know the difference between scooters and motorcycles, and it has a margin of error on fire hydrants. Usually the last image that it asks you to identify a fire hydrant in and fades out will not bring up another fire hydrant, so you can click verify early.


Recaptcha doesn't know the difference between anything. It's just trying to gather consistently tagged training data of regions which can in turn be used to train something that can identify objects.

Crowdsourced labeling doesn't make sense if you already have something that can tell you what the label should be.


I believe I read something on HN months ago that claimed Google would force failures on otherwise successful captchas. Little bit of gas-lighting if true.


Anecdotal I know, but I have definitely experienced this.


I wonder if groups with actual members will be OK. I'm a member of one and have several years of message digests in my email. I just downloaded all of them. For groups without members, or with no-one bothering to read them anyway, maybe it doesn't actually matter?


If they are public, yes, it should be OK. Regarding groups that are old or inactive, a couple that I'm concerned with had important early discussions that people on other derivative groups refer to regularly.


I haven't seen anything about this saga in mainstream news; I can't see that it's going to be won without getting papers on-side.


There are three that we've noticed so far:

https://bbs.boingboing.net/t/as-the-end-nears-for-yahoo-grou...

https://www.zdnet.com/article/verizon-kills-email-accounts-o...

https://www.theinquirer.net/inquirer/news/3084557/verizon-bl...

But yes, you are right. We need to keep pushing it into the public sphere. Twitter, Reddit, I hate to mention FB here, but ...


I don't really consider those 'mainstream news', I've only heard of zdnet - I thought boingboing was a public WiFi hotspot provider.

I primarily mean newspapers. The Times (of London & New York), The Guardian, but Bloomberg too.

The story is probably a wider point about the fragility of online information, of which this is merely one significant event, but without that happening I just don't see the give-a-shit count increasing.


How can I recognise services that intend to live forever? Meetup used to have reports and mailing lists, but it went to a sinking company and is now a pointless SPA which doesn't let Firefox log in. I know a group that acquired several competing products, and they missed a crucial innovation which the indies had for a while.


Nobody and nothing lives forever.

For organisations, the best you might be able to do is some kind of co-operative: it's much less likely to sell out (although not impossible), you generally get a vote in how it's run, and since they're forced to be self-funding you're not dependent on VC funding whims. With sufficient runway transparency you can always know how far they are away from shutdown and how much funding they need.

Twenty years ago (!) I helped set up a hosting co-op for university societies: https://www.srcf.net/

One of our specific aims was preserving continuity. Most societies are run by undergraduates who do it for a year or two and leave after 3 years, so making it as easy as possible to handle handover was a key feature. It's done pretty well for something that pre-dates Facebook, Github, Myspace, and even Yahoo Groups itself.


I admire the neat design and a backend of (quoting your site) "the server". Have you written an article about this?


Oh, it's been years since I was involved with any of the actual running of it. I don't even have a shell account any more. In the early days it was "the server", a spare PC that was donated. These days it looks like they have a donated cluster: https://www.srcf.net/faq/about#system

The "backend" will be Apache. On day 1 we used SSI (server-side includes) for "theming" pages, which were all in handwritten HTML. I suspect it's still like that given the five blank lines before DOCTYPE. It looks like some bootstrap CSS has been sprinkled on it since then. There's no front-end Javascript because there doesn't need to be.

> Until 2006 we had just one server in use, kern (a dual Athlon 1.6GHz PC with 2GB of RAM and 400GB of disk). Before that we used to run on an ancient Intel Pentium running at 166MHz with 128MB of RAM. How times have changed :)

Indeed. That ancient system was perfectly adequate for serving web pages to a few thousand people for light use. At the time I was carrying around the amazing new thing that was a computer you could fit in your pocket and play music illegally downloaded from the internet on. It was a Toshiba Libretto 30 with 8MB (eight megabytes) of RAM and a PCMCIA sound card.

Our systems approach to the SRCF was very much "what is the simplest thing that could possibly work". Apache+CGI+PHP with UNIX user accounts will get you a long way if you let it.

The real achievement is political and personal. I'm amazed that they've always managed to find good enough volunteer staff for the whole thing for twenty years.


You can’t, this is why we used to design services based on open protocols with data portability. No service lives forever.


How many more groups are left to join? There's no easy way, as far as I can tell, to get a sense of completion. The extension goes in reverse alphabetical order, but also loops around again.


Is there any market to run a self sustained website like Yahoo Groups? By self sustained, I mean that there would be income that would pay the hosting costs.


Facebook and Reddit are pretty good at running groups.


Is there any way to make this extension work on Firefox?


Three Ways You Can Help:

Help by Joining Yahoo Groups so the Archive Team can Download them (easy! - this is the link in the OP here): https://github.com/davidferguson/yahoogroups-joiner

(That's the most needed right now so the scripts can get access to the groups.)

Help by Downloading Yahoo Groups with the Archive Team's Script (not hard!): https://www.archiveteam.org/index.php?title=ArchiveTeam_Warr...

Get the word out/Call for Action (put pressure on Verizon!): https://modsandmembersblog.wordpress.com/taking-action/

Don't miss the sidebar with these links:

   https://modsandmembersblog.wordpress.com/media-contacts/

   https://modsandmembersblog.wordpress.com/contacting-verizon-directly/

   https://modsandmembersblog.wordpress.com/contacting-verizon-yahoo-stockholders/

Also, you can add these emails to the media contacts:

   "Reporter Katyanna Quach" <kquach@theregister.co.uk>,
   "Managing editor Gavin Clarke" <gavin.clarke@theregister.co.uk>,
   "Corey Wilson & Rachel Janc; Senior Director, Communications" <press@Wired.Com>,
   "Pitches" <submit@wired.com>,
   "Rich Woods" <rich.woods@neowin.net>,
   "Paul Thurrott" <paul@thurrott.com>,
   "Brad Sams" <brad@petri.com>,
   "Kate Rayford, Media Inquiries" <katie.rayford@slate.com>,
   "Bryan Lowder (LGBTQ issues/culture)" <bryan.lowder@slate.com>,
   "Torie Bosch (emerging technology effects on public policy and society)" <torie.bosch@slate.com>,
   "Jonathan Fischer (big tech, cities, media/internet culture)" <jonathan.fischer@slate.com>,
   "Susan Matthews, Health & Science" <susan.matthews@slate.com>,
   "Erika Allen, Executive Managing Editor" <erika.allen@vice.com>,
   "Katie Drummond, SVP, Global Content" <katie.drummond@vice.com>,
   "Press, US" <press@vice.com>,
   "Press, Canada" <presscanada@vice.com>,
   "Press, UK" <ukpressoffice@vice.com>,
   "Pitches, Culture" <culture.pitches@vice.com>,
   "Pitches, Tech" <tech.pitches@vice.com>,
   "Issues" <issues.pitches@vice.com>


Can you only join one group per account?


You can join as many groups as you want. For example: https://df58.host.cs.st-andrews.ac.uk/yahoogroups/leaderboar...


I just joined 137. I hope they get crawled!


An injunction would be nice, though I'm not sure if a court would see anyone as having sufficient standing to stop the gears entirely.


There's absolutely no legal basis for such a thing, though. Legally it's Yahoo's and they can shut it down and delete it tomorrow. It's only morally that they have an obligation.


On It!!!


Wow! Kudos to the Archive Team on not giving up on this mission.



