Worth emphasizing that this is described as basically just adding new ways to consume Wikipedia at scale, not taking anything away (not that that's impossible in the future, but unlikely IMO). In theory, systems that use a lot of Wikipedia data (search engines, "AI" assistant/question answering systems, language models) can keep doing what they're doing with Wikipedia dumps, but new APIs could make these services better (e.g., faster reaction time to current events).
A big upside of testing this out is the opportunity to keep better track of what services are Wikipedia-dependent. While search results often send people to Wikipedia (see e.g. https://diff.wikimedia.org/2021/09/23/searching-for-wikipedi...), voice assistants or Q&A systems might be (i.e. likely are) surfacing Wikipedia content in more subtle fashion. Same with large language models. Under status quo, it will be increasingly hard to track these relationships, so exploring alternatives is well worth it.
So much of the recent progress in natural language understanding in commercial products like Siri and Alexa is due to the richness and structure of Wikipedia and Wikidata. Big tech should be paying for this stuff…
I don’t see anything about anyone paying the content creators. What fairness is there in having one platform, which currently freeloads off of work that another platform freeloads off of its users (while begging them for donations, mind you), suddenly pay that other freeloading platform?
I can’t speak for others, naturally, but every contribution I have made to Wikipedia has been made on the basis of contributing to our common knowledge base, not to provide a free service for a profit-driven corporation.
> I don’t see anything about anyone paying the content creators.
The knowledge contributed to Wikimedia is understood to be public knowledge. Any contributions are considered voluntary and available to the general public. This includes profit-driven corporations who want to create a service wrapper around Wikimedia content or serve the content in a different presentation (license terms permitting).
> ...not to try to provide a free service for a profit driven corporation.
I personally feel like this is a fair step. Since contributions are made for public consumption, it makes sense that for-profit corporations whose products depend on Wikimedia APIs would pay for access to those APIs. They're not paying for the data; that's free to the public. They're paying to ensure their presentation has a guaranteed level of availability and support[0]. The end result is the same: knowledge contributors aren't paid for additions or edits. The context behind the payments is different.
IMO, the best mental model is that this gives a more formal way for Google or Apple to donate to Wikipedia. Wikipedia's mission of free knowledge is unchanged, but hopefully tech companies can provide financial support in a structured, reliable fashion. Ideally, it's even positive sum (i.e. the Wikipedia-dependent services can perform better, so that the efforts of Wikipedia volunteers impact more people).
This is still being run by the non-profit, so I suspect that these can go into ongoing Wikimedia operating costs. They're incredibly efficient for the scale they operate at and I'd say this might improve things even further. The less reliance on giant donation banners, the better.
Wikipedia doesn’t rely on giant donation banners - the Wikimedia Foundation had $180M available at the end of 2020, and an annual hosting cost of $2M/year. The vast majority of their expenses is the $54M/year of salaries, which I doubt are focused on Wikipedia itself.
If you want a charitable cause to donate to, find one that needs the money.
Ah yes, all those servers can just putter off by themselves without any attention at all.
For a site called Hacker News, it's a wonder that people think sysadmins are a superfluous job, totally unnecessary at scale.*
* And yes, I'm aware that the WMF employs a lot of people, most aren't sysadmins, and some are more critical than others. I just think it's ridiculous that people assume servers run themselves and that hosting is the only core cost.
Psh, how hard could it be. They only do 21 billion page views a month, 39 million edits, 264k new user registrations, and 23 GiB of data changes. A free account at Wix.com ought to do it for hosting costs.
jk, it's fascinating how critical people get about how others spend their money the moment those others ask for donations. When was the last time HN took a close look at how Jonathan Ive spends his proceeds?
If you're going to assert that the expenses aren't focused on Wikimedia's services, you should have proof that backs that up. The foundation is quite transparent about their operations, and as a former employee, I can attest that what they provide publicly matches my experience internally.
You could complain about how they're using some of that money to support the projects (the community teams historically have produced relatively little of value), but it's expensive to pay for a highly experienced engineering department, legal department, etc. For a site with the level of reach they have, they do a very good job of doing it with a relatively low cost.
Ever since the very beginning of Wikipedia, they've been insistent that the only licenses that count as "open"/"free" are ones that allow commercial use. There's a ton of CC-NC content out there that would have been great on Wikipedia but couldn't be used. It's essentially an MIT/BSD vs GPL debate, but for text and media.
It should have been extremely obvious under this paradigm that content would be reused by commercial entities! As a contributor, I don't have a problem with it. The openness of the wiki has been way way way more valuable to me than anything I've contributed back.
It depends on whether they run this at break-even to support bulk APIs and improved workflows, or to profit on the content. This content was donated to Wikimedia with the intent of being reused. This isn't AWS and Elasticsearch (though I'd also argue that you shouldn't open source something you don't want people using); Wikimedia is a middleman, not the author.
Is your implication that there is something wrong about AWS running OSI-licensed open source software as a service?
In general your comment doesn't make much sense to me. First of all it assumes a highly biased position in the AWS vs Elasticsearch war, but furthermore it's not even clear that the two situations aren't comparable. In both cases something that was free is having something built on top of it.
BTW the whole idea of Wikimedia Enterprise is providing higher levels of service availability. It's not really about the "content" itself. It's analogous to when you have a user of some service of yours that is flooding the service with requests and causing performance issues. A common pattern is to convert them into a paying customer and give them some guarantees about availability, while using that revenue to make sure there's enough hardware / engineering time to keep things running smoothly.
Finally, the "middleman vs author" distinction you introduce is entirely irrelevant. To use the Elastic example, it doesn't matter who wrote the software, what matters is it was written under a permissive license, and therefore the author of a given code doesn't have any "ownership". Similarly, someone who contributes to Wikipedia doesn't own that content either. In both cases the actor is contributing to the commons and has no expectation to put any restrictions on their contribution.
Rant: Wikipedia would get my donations easily if they simply included a couple of line items on what they are actually doing to improve Wikipedia! How is Wikipedia’s focus not on building semantic context into their pages? Or adding topical prerequisites? Or any number of meaningful features? Or is their goal to just exist in the present form?
I would prefer they not "improve", especially so in order to sell donations.
Wikipedia has largely stayed the same since its inception and that is part of its power. So many things have "improved" themselves into irrelevance, or wasted enormous amounts of resources changing things that don't need change.
There is a deep pit of improving things to satisfy the loudest people who "want" and not realizing until far too late that you're optimizing to a loud majority while your overall appeal shrinks to nothing. (Think of it like a grocery store dropping the least popular half of its products every few months... eventually they'd be left with light beer and ketchup confused about why nobody comes by any more)
Wikipedia, please continue improving. You first came out in 2001, 6 years before the first iPhone and computing has changed since then. I don't want a grocery store that only has items from 2001 and has never freshened up their products since then. (I wouldn't mind 2001 prices though.) Please keep trying new things with my donation dollars. No one can predict the future perfectly, and especially with how technology continually changes the landscape of the Internet.
And of course, they also publish annual reports, although those tend to be a bit more PR-ish.
P.S. I'm not sure what you mean by semantic context. Do you mean structured data/semantic web type stuff, some sort of feature that tries to add context to the articles you are reading, or something else?
Appreciate the links. They should include more of those facts, and that vision, in their fundraising copy.
All of those are great options for context, but I specifically would like to see logic/math/physics/code with functional context for symbology, usage, and derivation.
You can find out what at least their programmers are doing by hopping on the public Phabricator instance. E.g. tasks closed as resolved less than a day ago: https://phabricator.wikimedia.org/maniphest/query/8esyKcP6SJ... I can see quite a few tasks for Abstract Wikipedia and Wikidata, which are both about adding semantic context, each in their own way.
Is keeping the lights on not enough? It's about the content.
Maybe implementing an Encarta Mind Maze game would do it. Maybe the Encarta 95 splash screen with the Nelson Mandela speech? Hmm, maybe I just want to be 10 again.
They specifically state in the email “Imagine if everyone gave? We could transform the way knowledge is shared online.” But they give zero indication of that vision.
Have you spent any time at all looking into this? Their yearly report lists what they've been spending the money on, including large projects.
The project with the highest level of growth of all the Wikimedia projects is Wikidata, which is focused on bringing semantic content into the other projects (a pretty decent number of articles generate their content panel from wikidata). Check out the list of visualizers that use wikidata: https://www.wikidata.org/wiki/Wikidata:Tools/Visualize_data/.... I'm quite a fan of reasonator, myself: https://reasonator.toolforge.org/?q=Q42
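For anyone wondering what that semantic content looks like under the hood, here's a rough Python sketch (not official client code; it just hits the public Special:EntityData endpoint, and the JSON paths reflect the response shape as I understand it) that pulls the structured record behind Q42:

    import requests

    # Q42 is Douglas Adams; Special:EntityData serves the raw structured record.
    URL = "https://www.wikidata.org/wiki/Special:EntityData/Q42.json"
    entity = requests.get(URL, timeout=30).json()["entities"]["Q42"]

    # Labels and descriptions are keyed by language code.
    print(entity["labels"]["en"]["value"])        # the English label
    print(entity["descriptions"]["en"]["value"])  # the short English description

    # Claims map property IDs to statements; P569 is "date of birth" on Wikidata.
    birth = entity["claims"]["P569"][0]["mainsnak"]["datavalue"]["value"]["time"]
    print(birth)  # a timestamp-style string

The point being: every fact in that record is addressable by an ID, which is what lets infoboxes (and tools like Reasonator) be generated rather than hand-written.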
I would donate more often if I wasn't harassed about it later on. I understand they want to email people who are likely to donate again, but between all extra messages and still getting huge banner popups on the website, it put me off to donating again.
See, I genuinely wouldn’t mind a reminder email if that email gave me any information (or links to it), but instead it gives me no context about their costs or how donations will be used. How far do Google’s millions of dollars go? Etc.
Well, the community would be happy to take requests, but all those things require work and there's not enough resources (both community and foundation) to go around these days.
Like topical prerequisites: I'm sure if you wanted to come up with a bunch yourself and put them up, people would like it, but I - for one - would get bored after doing the fifth page or so.
That is great to hear! Please promote the idea that they need to give their donors (i.e. people who are categorically knowledge seekers) tangible details and data in their fundraising copy!
This is an interesting (and ethical) way for Wikimedia to finance their operations. I have to say I find it funny that back in the days teachers told us not to use Wikipedia and now we’ll have large corporations paying to train their algos using their data. And then those algos will decide whether you get a loan or not, your insurance premiums, and much more.
Back in grad school I had a well-respected tenured prof point out that Wikipedia was great for looking up immutable and uncontroversial technical things (e.g., a list of refractive indices) and could even be good as a starting point for learning about a topic to research it. You just can't stop there.
Wikipedia doesn't need this to finance their operations. They have so much money from donations they don't know what to do with it. If they had used it wisely they could have built a large enough endowment to be self-sustaining by now, but unfortunately they didn't.
You think loan decisions and insurance rates will be made based on a computer program trained on a Wikipedia page? That's... not how any of that works.
English Wikipedia is 20 gigs, which takes a few minutes to download on a fiber connection. How is this product compelling without Wikipedia actively throttling downloads and access for unpaid users?
>In most cases, commercial entities that reuse Wikimedia content at a high volume have product, service, and system requirements that go beyond what Wikimedia freely provides through publicly available APIs and data dumps.
Slippery slope only applies if there's actually a slippery slope to slide down. Nothing about this change indicates that Wikipedia will start charging the general public for access.
I don't know if this initiative includes Wikimedia Commons but I'm sure they've got substantially more data than that, and higher typical bandwidth usage.
This MediaWiki special page is also available for other wikis; the English Wikipedia itself hosts 152 GB of files that were uploaded there instead of to Commons for one reason or another (e.g. non-free material under fair use).
One use-case could be if you want faster updates for articles on current events. The full database dump for the English Wikipedia runs only twice a month. I don't think that will go away, but you can now pay to get streaming real-time updates.
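For a sense of what real-time updates already look like on the free side, Wikimedia runs a public EventStreams feed of recent changes (server-sent events). Here is a minimal Python sketch of tailing it; the stream URL is the public one, but the fields I filter on are from the recentchange schema as I recall it, so treat the details as illustrative:

    import json
    import requests

    # Public server-sent-events feed of edits across Wikimedia projects.
    STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

    with requests.get(STREAM, stream=True,
                      headers={"Accept": "text/event-stream"}) as resp:
        for raw in resp.iter_lines():
            line = raw.decode("utf-8", errors="replace")
            if not line.startswith("data: "):
                continue  # skip SSE keep-alives, event/id lines, comments
            change = json.loads(line[len("data: "):])
            # Only show edits to the English Wikipedia.
            if change.get("wiki") == "enwiki" and change.get("type") == "edit":
                print(change["timestamp"], change["title"])

Presumably the Enterprise tier layers SLAs, bulk formats, and support on top of that kind of firehose rather than replacing the free access.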
They probably want data of site visitors - who looks at what, visits which link, edits what etc. That's not contained in the pure content.
If someone shops around for some product, then visits a Wikipedia article related to it and then buys something, you can maybe infer which qualities of the article or what information made them choose. That again, you could try and sell to advertisers as a service.
Disclaimer: I work for the Wikimedia foundation but this is just my personal opinion.
I'm pretty sure that Wikimedia would never sell such a service.
This is about access to content and network resources. Definitely not about access to user's browsing histories.
In my experience, user privacy is taken very seriously by Wikimedia's staff and leadership. The privacy policies/practices[1] are top-notch and not just a gesture or PR/appearance management.
So as a script kiddie / tween, I used keyword-related Wikipedia content to get my domain parking network to pass Google's "must have content" requirement (that is, until Google changed their TOS at midnight one day specifically to ban users like myself). What is an example of an actually legitimate use of something like this, and in particular, a use case that wouldn't be properly served by simply downloading the current data archive on a ~daily or ~hourly basis as people tend to, or just by straight-up crawling as I did back in the day?
As someone else mentioned in a comment, Wikipedia averages about 2 edits per second and almost 600 new articles daily.
A legitimate use of something like this is having Google search results lookup Wikipedia pages more often and in a timely manner.
As the article states, it also provides SLAs for the APIs/service. Say, for example, Big Movie Star X dies, but the free API for downloading content is down for the day for maintenance; now Google's flagship search product is serving stale/outdated information. This product gives businesses a means of reliably depending on an up-to-date source of data.
I must admit to a mixed feeling about it. On one side, a lot of large companies already are using Wikipedia data at large scale, already in a disorganized manner and not giving back anything (well, sometimes giving back donations but not in any relation to their use). Creating a venue for proper large-scale use that has a channel to give back money looks good.
On the other hand, a lot of Wikipedia is created by many, many hours of work from many, many volunteers, who aren't (usually) paid a dime for it. Capitalizing on that, and having clients that pay $big_money for the result, creates misaligned incentives, and however noble and pure everybody's motives are now, incentives always matter and always create pressure. What if some company says "we'll pay millions of dollars, but you've got to remove some of the worst examples of content we don't like - it's not really much, compared to the whole mass of information, nobody would even notice it!"? Right now I don't think the community and Wikimedia would agree, but what would happen once they're used to the cash inflow that this program creates and have grown to depend on it?
This is quite confusing. According to Wikipedia itself:
> Most text in Wikipedia, excluding quotations, has been released under the Creative Commons Attribution-Sharealike 3.0 Unported License (CC-BY-SA) and the GNU Free Documentation License (GFDL) (unversioned, with no invariant sections, front-cover texts, or back-cover texts) and can therefore be reused only if you release any derived work under the Creative Commons Attribution/Share-Alike License or the GFDL. This requires that, among other things, you attribute the authors and allow others to freely copy your work ...
> Some text has been imported only under CC-BY-SA and CC-BY-SA-compatible licenses and cannot be reused under GFDL; such text will be identified either on the page footer, in the page history or the discussion page of the article that utilizes the text. All text published before June 15th, 2009 on Wikipedia was released under the GFDL, and you may also use the page history to retrieve content published before that date to ensure GFDL compatibility.
> If you are unwilling or unable to use the Creative Commons Attribution/Share-Alike License or the GNU Free Documentation License for your work, use of Wikipedia content is unauthorized.
So what exactly is Wikipedia offering here? They can't change the copyright. They still have to be attributed. And any derivative using such content will still have to be released under the original viral license. When anyone can download the whole of Wikipedia, given adequate bandwidth and storage space, what advantages does this service really offer?
I'm curious what the strategy is to get companies like Google to pay for it. Is it "all carrots" or are there any "sticks"? Like maybe rate limiting certain consumers?
They already don't want you to just scrape stuff off Wikipedia because mediawiki apparently doesn't do proper caching and does a stupidly expensive re-rendering of every page that is fetched.
So the option you have is using stale data dumps. They're reasonably workable; I even host low-markup versions of every Wikipedia article myself in my home lab, and it's something like 100 GB worth of compressed data, images excluded. Uncompressed I don't know where it ends up, but below 1 TB at any rate. Manageable but very unwieldy. File systems like ext4 struggle with that number of files in a directory, so you need to be a bit crafty, but it's not hard to get around (a sketch of one workaround follows after this comment). If all you want is an infobox, you could probably squeeze it down to a couple of hundred gigabytes uncompressed. Definitely not something you need a data center for.
The catch, as mentioned, is that the data is a stale snapshot. The only way you are getting current data is on the main site, which may be a problem if you have ambitions of staying up to date. I personally don't mind. I figure they used to sell printed encyclopedias and that worked pretty well, so having a 1-year-old copy of Wikipedia isn't that bad.
It's also a sort of weird domain. I can see how it might not be easy to host in the cloud in a way that's both affordable and performant. Since so many insist on cloud-based apps nowadays, there's probably a market for Wikipedia APIs with a good SLA.
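To expand on the "too many files in one directory" point above: the usual workaround is to shard by a hash prefix so no single directory gets more than a hundred or so entries. A throwaway Python sketch (the paths and two-level fan-out are arbitrary choices of mine, not how any official dump tooling does it):

    import hashlib
    from pathlib import Path

    ROOT = Path("articles")  # wherever the local mirror lives

    def shard_path(title: str) -> Path:
        """Map an article title to ROOT/ab/cd/<title>.txt so no single
        directory accumulates millions of entries."""
        digest = hashlib.sha1(title.encode("utf-8")).hexdigest()
        safe = title.replace("/", "_")
        return ROOT / digest[:2] / digest[2:4] / f"{safe}.txt"

    def save_article(title: str, text: str) -> None:
        path = shard_path(title)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")

    # 256 * 256 = 65,536 buckets keeps each directory to roughly a hundred
    # files even with ~6 million English Wikipedia articles.
    save_article("Duck", "Ducks are waterfowl...")
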
> They already don't want you to just scrape stuff off Wikipedia because mediawiki apparently doesn't do proper caching and does a stupidly expensive re-rendering of every page that is fetched.
It does cache pages very aggressively. And it even has solutions for one of the hard problems in computer science: cache invalidation.
Pages are cached indefinitely until something pokes them.
Hi, I have been involved with the Wikimedia infrastructure since roughly 2004.
The 50k daily hits figure on https://meta.wikimedia.org/wiki/Live_mirrors is merely there to illustrate that a web proxy can dramatically reduce the number of hits to the origin site. They could have picked 42 or 7 billion; it is just an example.
We already had a few million pages rather than 50k of them, and they were not static, since pages get edited and go live as soon as someone saves their edit. Obviously, we already had millions of daily users, which definitely requires a ton of caching all across the stack. Also, good luck serving flat files that keep being written to; that does not really work with hard drives and static file storage.
In September 2021 we served 21 billion pages and 73 billion media files (https://stats.wikimedia.org/#/all-projects). Almost all of them were served primarily from caches.
That being said, the wiki pages you mentioned are a decade old. We definitely had trouble with people having the smart idea of scraping the whole website, often either requesting uncacheable content or hitting rarely served articles or history pages that were not cached and thus triggered a full rendering of the page. Even with the servers we had, a single user could cause major havoc on the infrastructure and keep a good part of it busy just to serve that single person. Hence the recommendation to use the database dumps, which are still used: https://dumps.wikimedia.org/
The NSA mirrors all of Wikipedia and Stack Overflow once every 24 hours from the public Internet to the classified Internet, so developers of classified applications can have references without needing to pivot to a separate unclassified workstation that needs to be at least six feet away.
I never found it to be a big deal. Once a day updates were perfectly fine for reference usage.
> because mediawiki apparently doesn't do proper caching and does a stupidly expensive re-rendering of every page
Don’t worry, they are hard at work making workarounds for all their idiotic design decisions that seem to make absolutely no sense... and those workarounds will be provided quite cheaply!!
Reminds me of the saying: “It is difficult to get a man to understand something when his salary depends upon his not understanding it.”
We can be quite sure that Wikipedia's engineering will never improve if their income becomes dependent on providing workarounds for deficiencies in their solutions.
I mean, they should be. Google uses Wikipedia content extensively on search results.
Google USED TO also have lots of prominent links that would drive traffic (and hence, donations) to Wikipedia. Over time, it has removed a lot of those links and replaced them with ones that lead to more Google hosted content.
There's a distinct drop in Google-referred traffic to Wikipedia over time that's in stark contrast to the increased number of eyeballs on Wikipedia sourced content on Google's pages.
It's not clear to me whether Google's sponsorship reflects the value they've gained. And, yes, I get that they are under no legal obligation to do anything at all, since the content license allows for all of it. But as with rich snippets, you can't just "take", or eventually, there won't be anything to "take".
Wikimedia should also let users opt into (small, tasteful) ads instead of fundraising appeals - a way to contribute indirectly, rather than through an explicit donation.
There are deliberate distortions in Wikipedia especially in regards to the national surveillance state. Companies should be aware that they might be paying for misinformation and disinformation.
I am currently researching a long-term editor/abuser who has a rather colorful history.
He runs a couple of attack websites targeting universities in both the US and Switzerland. These are current revisions of a couple of past websites that still appear as sources in numerous articles on Wikipedia. He supposedly volunteered at one of the universities before he was dropped because he lied about his credentials.
He ran in the 2018 Italian elections. His political party was suggested by some Italian and American sources as a fraud. He had tried for several years to inject articles and tidbits about the party into Wikipedia.
The sources which negatively reported on his political party have often been attacked by him. He managed to get one online newspaper's article removed from en-wiki and it-wiki. He appeared to do this for SEO impact, to push up sources that speak well of him and hide those that do not.
He recently tried to incorporate a dating website promoting himself and his search for a wife. He tried to present the site as a "mirror" of Wikipedia.
He used to run a service to circumvent anti-plagiarism tools like Turnitin. He did this while volunteering at the university mentioned above. There are several places where his service and his LinkedIn stories come up in the history of Wikipedia articles.
He wrote a book about Second Life and a couple of faux academic articles. He tried several times to integrate his book as a source on it-wiki. The faux academic articles were uploaded to Commons, and they were even cited in a few legit academic publications.
He operated a school in Second Life. He created articles about the school on Wikipedia, and one of the faux academic articles was about it. These were eventually discovered and removed.
He also created several articles about himself on en-wiki, it-wiki, simple-wiki, and ja-wiki. These have all been removed because several of his accounts had been blocked for sockpuppetry.
Is this something you're doing specific to wikipedia abuses?
If so, there was also a bit about them in the latest Radio War Nerd podcast episode about the continued propaganda edits related to the US Civil War and various confederate army officers.
I suspect if you got into white supremacist tropes that constantly (try to) edit wikipedia you could get an article into book-length...
Wikimedia is simply productizing commercial access to make it easier for enterprises to support it in a commercially justifiable way (vs pure charity) + setting up processes to improve enterprise-only features.