Worth emphasizing that this is described as basically just adding new ways to consume Wikipedia at scale, not taking anything away (not that that's impossible in the future, but unlikely IMO). In theory, systems that use a lot of Wikipedia data (search engines, "AI" assistant/question answering systems, language models) can keep doing what they're doing with Wikipedia dumps, but new APIs could make these services better (e.g., faster reaction time to current events).
A big upside of testing this out is the opportunity to keep better track of what services are Wikipedia-dependent. While search results often send people to Wikipedia (see e.g. https://diff.wikimedia.org/2021/09/23/searching-for-wikipedi...), voice assistants or Q&A systems might be (i.e. likely are) surfacing Wikipedia content in more subtle fashion. Same with large language models. Under status quo, it will be increasingly hard to track these relationships, so exploring alternatives is well worth it.
So much of the recent progress in natural language understanding in commercial products like Siri and Alexa is due to the richness and structure of Wikipedia and Wikidata. Big tech should be paying for this stuff…
I don’t see anything about anyone paying the content creators. What fairness is there in having one platform, which currently freeloads off of work that another platform freeloads off of its users (while begging them for donations, mind you), suddenly pay that other freeloading platform?
I can’t speak for others, naturally, but every contribution I have made to Wikipedia has been made on the basis of contributing to our common knowledge base, not to provide a free service for a profit-driven corporation.
> I don’t see anything about anyone paying the content creators.
The knowledge contributed to Wikimedia is understood to be public knowledge. Any contributions are considered voluntary and available to the general public. This includes profit-driven corporations who want to create a service wrapper around Wikimedia content or serve the content in a different presentation (license terms permitting).
> ...not to try to provide a free service for a profit driven corporation.
I personally feel like this is a fair step. Since contributions are made for public consumption, it makes sense that for-profit corporations whose products depend on Wikimedia APIs would pay for access to those APIs. They're not paying for the data; that's free to the public. They're paying to ensure their presentation has a guaranteed level of availability and support[0]. The end result is the same: knowledge contributors aren't paid for additions or edits. The context behind the payments is different.
IMO, the best mental model is that this gives a more formal way for Google or Apple to donate to Wikipedia. Wikipedia's mission of free knowledge is unchanged, but hopefully tech companies can provide financial support in a structured, reliable fashion. Ideally, it's even positive sum (i.e. the Wikipedia-dependent services can perform better, so that the efforts of Wikipedia volunteers impact more people).
This is still being run by the non-profit, so I suspect that these can go into ongoing Wikimedia operating costs. They're incredibly efficient for the scale they operate at and I'd say this might improve things even further. The less reliance on giant donation banners, the better.
Wikipedia doesn’t rely on giant donation banners - the Wikimedia Foundation had $180M available at the end of 2020, and an annual hosting cost of $2M/year. The vast majority of their expenses is the $54M/year of salaries, which I doubt are focused on Wikipedia itself.
If you want a charitable cause to donate to, find one that needs the money.
Ah yes, all those servers can just putter off by themselves without any attention at all.
For a site called Hacker News, it's a wonder that people think sysadmins are a superfluous job, totally unnecessary at scale.*
* And yes, I'm aware that the WMF employs a lot of people, most aren't sysadmins, and some are more critical than others. I just think it's ridiculous that people assume servers run themselves and that hosting is the only core cost.
Psh, how hard could it be. They only do 21 billion page views a month, 39 million edits, 264k new user registrations, and 23 GiB of data changes. A free account at Wix.com ought to do it for hosting costs.
jk, it's fascinating how critical people get about how others spend their money the moment those others ask for donations. When was the last time HN took a close look at how Jonathan Ive spends his proceeds?
If you're going to assert that the expenses aren't focused on Wikimedia's services, you should have proof that backs that up. The foundation is quite transparent about their operations, and as a former employee, I can attest that what they provide publicly matches my experience internally.
You could complain about how they're using some of that money to support the projects (the community teams historically have produced relatively little of value), but it's expensive to pay for a highly experienced engineering department, legal department, etc. For a site with the level of reach they have, they do a very good job of doing it with a relatively low cost.
Ever since the very beginning of Wikipedia, they've been insistent that the only licenses that count as "open"/"free" are ones that allow commercial use. There's a ton of CC-NC content out there that would have been great on Wikipedia but couldn't be used. It's essentially an MIT/BSD vs GPL debate, but for text and media.
It should have been extremely obvious under this paradigm that content would be reused by commercial entities! As a contributor, I don't have a problem with it. The openness of the wiki has been way way way more valuable to me than anything I've contributed back.
It depends on whether they run this at break-even to support bulk APIs and improved workflows, or to profit on the content. This content was donated to Wikimedia with the intent of being reused. This isn't AWS and Elasticsearch (though I'd also argue that you shouldn't open source something you don't want people using); Wikimedia is a middleman, not the author.
Is your implication that there is something wrong about AWS running OSI-licensed open source software as a service?
In general your comment doesn't make much sense to me. First of all it assumes a highly biased position in the AWS vs Elasticsearch war, but furthermore it's not even clear that the two situations aren't comparable. In both cases something that was free is having something built on top of it.
BTW the whole idea of Wikimedia Enterprise is providing higher levels of service availability. It's not really about the "content" itself. It's analogous to when you have a user of some service of yours that is flooding the service with requests and causing performance issues. A common pattern is to convert them into a paying customer and give them some guarantees about availability, while using that revenue to make sure there's enough hardware / engineering time to keep things running smoothly.
Finally, the "middleman vs author" distinction you introduce is entirely irrelevant. To use the Elastic example, it doesn't matter who wrote the software, what matters is it was written under a permissive license, and therefore the author of a given code doesn't have any "ownership". Similarly, someone who contributes to Wikipedia doesn't own that content either. In both cases the actor is contributing to the commons and has no expectation to put any restrictions on their contribution.
Rant: Wikipedia would get my donations easily if they simply included a couple of line items on what they are actually doing to improve Wikipedia! How is Wikipedia’s focus not on building semantic context into their pages? Or adding topical prerequisites? Or any number of meaningful features? Or is their goal to just exist in the present form?
I would prefer they not "improve", especially so in order to sell donations.
Wikipedia has largely stayed the same since its inception and that is part of its power. So many things have "improved" themselves into irrelevance, or wasted enormous amounts of resources changing things that don't need change.
There is a deep pit of improving things to satisfy the loudest people who "want" and not realizing until far too late that you're optimizing to a loud majority while your overall appeal shrinks to nothing. (Think of it like a grocery store dropping the least popular half of its products every few months... eventually they'd be left with light beer and ketchup confused about why nobody comes by any more)
Wikipedia, please continue improving. You first came out in 2001, 6 years before the first iPhone and computing has changed since then. I don't want a grocery store that only has items from 2001 and has never freshened up their products since then. (I wouldn't mind 2001 prices though.) Please keep trying new things with my donation dollars. No one can predict the future perfectly, and especially with how technology continually changes the landscape of the Internet.
And of course, they also publish annual reports, although those tend to be a bit more PR-ish.
P.S. I'm not sure what you mean by semantic context. Do you mean structured data/semantic web type stuff, some sort of feature that tries to add context to the articles you are reading, or something else?
Appreciate the links. They should include more of those facts, and that vision, in their fundraising copy.
All of those are great options for context, but I specifically would like to see logic/math/physics/code with functional context for symbology, usage, and derivation.
You can find out what at least their programmers are doing by hopping on the public Phabricator instance. E.g. tasks closed as resolved less than a day ago: https://phabricator.wikimedia.org/maniphest/query/8esyKcP6SJ... I can see quite a few tasks for Abstract Wikipedia and Wikidata, which are both about adding semantic context, each in their own way.
Is keeping the lights on not enough? It's about the content.
Maybe implementing an Encarta Mind Maze game would do it. Maybe the Encarta 95 splash screen with the Nelson Mandela speech? Hmm, maybe I just want to be 10 again.
They specifically state in the email “Imagine if everyone gave? We could transform the way knowledge is shared online.” But they give zero indication of that vision.
Have you spent any time at all looking into this? Their yearly report lists what they've been spending the money on, including large projects.
The project with the highest level of growth of all the Wikimedia projects is Wikidata, which is focused on bringing semantic content into the other projects (a pretty decent number of articles generate their content panel from wikidata). Check out the list of visualizers that use wikidata: https://www.wikidata.org/wiki/Wikidata:Tools/Visualize_data/.... I'm quite a fan of reasonator, myself: https://reasonator.toolforge.org/?q=Q42
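For anyone wondering what that semantic content looks like under the hood, here's a rough Python sketch (not official client code; it just hits the public Special:EntityData endpoint, and the JSON paths reflect the response shape as I understand it) that pulls the structured record behind Q42:

    import requests

    # Q42 is Douglas Adams; Special:EntityData serves the raw structured record.
    URL = "https://www.wikidata.org/wiki/Special:EntityData/Q42.json"
    entity = requests.get(URL, timeout=30).json()["entities"]["Q42"]

    # Labels and descriptions are keyed by language code.
    print(entity["labels"]["en"]["value"])        # the English label
    print(entity["descriptions"]["en"]["value"])  # the short English description

    # Claims map property IDs to statements; P569 is "date of birth" on Wikidata.
    birth = entity["claims"]["P569"][0]["mainsnak"]["datavalue"]["value"]["time"]
    print(birth)  # a timestamp-style string

The point being: every fact in that record is addressable by an ID, which is what lets infoboxes (and tools like Reasonator) be generated rather than hand-written.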
I would donate more often if I wasn't harassed about it later on. I understand they want to email people who are likely to donate again, but between all extra messages and still getting huge banner popups on the website, it put me off to donating again.
See, I genuinely wouldn’t mind a reminder email if that email gave me any information (or links to it), but instead it gives me no context about their costs or how donations will be used. How far do Google’s millions of dollars go? Etc.
Well, the community would be happy to take requests, but all those things require work and there's not enough resources (both community and foundation) to go around these days.
Like topical prerequisites: I'm sure if you wanted to come up with a bunch yourself and put them up, people would like it, but I - for one - would get bored after doing the fifth page or so.
That is great to hear! Please promote the idea that they need to give their donors (i.e. people who are categorically knowledge seekers) tangible details and data in their fundraising copy!
This is an interesting (and ethical) way for Wikimedia to finance their operations. I have to say I find it funny that back in the days teachers told us not to use Wikipedia and now we’ll have large corporations paying to train their algos using their data. And then those algos will decide whether you get a loan or not, your insurance premiums, and much more.
Back in grad school I had a well-respected tenured prof point out that Wikipedia was great for looking up immutable and uncontroversial technical things (e.g., a list of refractive indices) and could even be good as a starting point for learning about a topic to research it. You just can't stop there.
Wikipedia doesn't need this to finance their operations. They have so much money from donations they don't know what to do with it. If they had used it wisely they could have built a large enough endowment to be self-sustaining by now, but unfortunately they didn't.
You think loan decisions and insurance rates will be made based on a computer program trained on a Wikipedia page? That's... not how any of that works.
English Wikipedia is 20 gigs, which takes a few minutes to download on a fiber connection. How is this product compelling without Wikipedia actively throttling downloads and access for unpaid users?
>In most cases, commercial entities that reuse Wikimedia content at a high volume have product, service, and system requirements that go beyond what Wikimedia freely provides through publicly available APIs and data dumps.
Slippery slope only applies if there's actually a slippery slope to slide down. Nothing about this change indicates that Wikipedia will start charging the general public for access.
I don't know if this initiative includes Wikimedia Commons but I'm sure they've got substantially more data than that, and higher typical bandwidth usage.
This MediaWiki special page is also available for other wikis; the English Wikipedia itself hosts 152 GB of files that were uploaded there instead of to Commons for one reason or another (e.g. non-free material under fair use).
One use-case could be if you want faster updates for articles on current events. The full database dump for the English Wikipedia runs only twice a month. I don't think that will go away, but you can now pay to get streaming real-time updates.
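For a sense of what real-time updates already look like on the free side, Wikimedia runs a public EventStreams feed of recent changes (server-sent events). Here is a minimal Python sketch of tailing it; the stream URL is the public one, but the fields I filter on are from the recentchange schema as I recall it, so treat the details as illustrative:

    import json
    import requests

    # Public server-sent-events feed of edits across Wikimedia projects.
    STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

    with requests.get(STREAM, stream=True,
                      headers={"Accept": "text/event-stream"}) as resp:
        for raw in resp.iter_lines():
            line = raw.decode("utf-8", errors="replace")
            if not line.startswith("data: "):
                continue  # skip SSE keep-alives, event/id lines, comments
            change = json.loads(line[len("data: "):])
            # Only show edits to the English Wikipedia.
            if change.get("wiki") == "enwiki" and change.get("type") == "edit":
                print(change["timestamp"], change["title"])

Presumably the Enterprise tier layers SLAs, bulk formats, and support on top of that kind of firehose rather than replacing the free access.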
They probably want data of site visitors - who looks at what, visits which link, edits what etc. That's not contained in the pure content.
If someone shops around for some product, then visits a Wikipedia article related to it and then buys something, you can maybe infer which qualities of the article or what information made them choose. That again, you could try and sell to advertisers as a service.
Disclaimer: I work for the Wikimedia foundation but this is just my personal opinion.
I'm pretty sure that Wikimedia would never sell such a service.
This is about access to content and network resources. Definitely not about access to user's browsing histories.
In my experience, user privacy is taken very seriously by Wikimedia's staff and leadership. The privacy policies/practices[1] are top-notch and not just a gesture or PR/appearance management.
So as a script kiddie / tween, I used keyword-related Wikipedia content to get my domain parking network to pass Google's "must have content" requirement (that is, until Google changed their TOS at midnight one day specifically to ban users like myself). What is an example of an actually legitimate use of something like this, and in particular, a use case that wouldn't be properly served by simply downloading the current data archive on a ~daily or ~hourly basis as people tend to, or just by straight-up crawling as I did back in the day?
As someone else mentioned in a comment, Wikipedia averages about 2 edits per second and almost 600 new articles daily.
A legitimate use of something like this is having Google search results lookup Wikipedia pages more often and in a timely manner.
As the article states, it also provides SLAs for the APIs/service. Say, for example, Big Movie Star X dies, but the free API for downloading content is down for the day for maintenance; now Google's flagship search product is serving stale/outdated information. This product gives businesses a means of reliably depending on an up-to-date source of data.
I must admit to a mixed feeling about it. On one side, a lot of large companies already are using Wikipedia data at large scale, already in a disorganized manner and not giving back anything (well, sometimes giving back donations but not in any relation to their use). Creating a venue for proper large-scale use that has a channel to give back money looks good.
On the other hand, a lot of Wikipedia is created by many, many hours of work from many, many volunteers, who aren't (usually) paid a dime for it. Capitalizing on that, and having clients that pay $big_money for the result, creates misaligned incentives, and however noble and pure everybody's motives are now, incentives always matter and always create pressure. What if some company says "we'll pay millions of dollars, but you've got to remove some of the worst examples of content we don't like - it's not really much, compared to the whole mass of information, nobody would even notice it!"? Right now I don't think the community and Wikimedia would agree, but what would happen once they're used to the cash inflow that this program creates and have grown to depend on it?
This is quite confusing. According to Wikipedia itself:
> Most text in Wikipedia, excluding quotations, has been released under the Creative Commons Attribution-Sharealike 3.0 Unported License (CC-BY-SA) and the GNU Free Documentation License (GFDL) (unversioned, with no invariant sections, front-cover texts, or back-cover texts) and can therefore be reused only if you release any derived work under the Creative Commons Attribution/Share-Alike License or the GFDL. This requires that, among other things, you attribute the authors and allow others to freely copy your work ...
> Some text has been imported only under CC-BY-SA and CC-BY-SA-compatible licenses and cannot be reused under GFDL; such text will be identified either on the page footer, in the page history or the discussion page of the article that utilizes the text. All text published before June 15th, 2009 on Wikipedia was released under the GFDL, and you may also use the page history to retrieve content published before that date to ensure GFDL compatibility.
> If you are unwilling or unable to use the Creative Commons Attribution/Share-Alike License or the GNU Free Documentation License for your work, use of Wikipedia content is unauthorized.
So what exactly is Wikipedia offering here? They can't change the copyright. They still have to be attributed. And any derivative using such content will still have to be released under the original viral license. When anyone can download the whole of Wikipedia, given adequate bandwidth and storage space, what advantages does this service really offer?
I'm curious what the strategy is to get companies like Google to pay for it. Is it "all carrots" or are there any "sticks"? Like maybe rate limiting certain consumers?
They already don't want you to just scrape stuff off Wikipedia because mediawiki apparently doesn't do proper caching and does a stupidly expensive re-rendering of every page that is fetched.
So the option you have is using stale data dumps. They're reasonably workable; I even host low-markup versions of every Wikipedia article myself in my home lab, and it's something like 100 GB worth of compressed data, images excluded. Uncompressed I don't know where it ends up, but below 1 TB at any rate. Manageable but very unwieldy. File systems like ext4 struggle with that number of files in a directory, so you need to be a bit crafty, but it's not hard to get around (a sketch of one workaround follows after this comment). If all you want is an infobox, you could probably squeeze it down to a couple of hundred gigabytes uncompressed. Definitely not something you need a data center for.
The catch, as mentioned, is that the data is a stale snapshot. The only way you are getting current data is on the main site, which may be a problem if you have ambitions of staying up to date. I personally don't mind. I figure they used to sell printed encyclopedias and that worked pretty well, so having a 1-year-old copy of Wikipedia isn't that bad.
It's also a sort of weird domain. I can see how it might not be easy to host in the cloud in a way that's both affordable and performant. Since so many insist on cloud-based apps nowadays, there's probably a market for Wikipedia APIs with a good SLA.
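To expand on the "too many files in one directory" point above: the usual workaround is to shard by a hash prefix so no single directory gets more than a hundred or so entries. A throwaway Python sketch (the paths and two-level fan-out are arbitrary choices of mine, not how any official dump tooling does it):

    import hashlib
    from pathlib import Path

    ROOT = Path("articles")  # wherever the local mirror lives

    def shard_path(title: str) -> Path:
        """Map an article title to ROOT/ab/cd/<title>.txt so no single
        directory accumulates millions of entries."""
        digest = hashlib.sha1(title.encode("utf-8")).hexdigest()
        safe = title.replace("/", "_")
        return ROOT / digest[:2] / digest[2:4] / f"{safe}.txt"

    def save_article(title: str, text: str) -> None:
        path = shard_path(title)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")

    # 256 * 256 = 65,536 buckets keeps each directory to roughly a hundred
    # files even with ~6 million English Wikipedia articles.
    save_article("Duck", "Ducks are waterfowl...")
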
> They already don't want you to just scrape stuff off Wikipedia because mediawiki apparently doesn't do proper caching and does a stupidly expensive re-rendering of every page that is fetched.
It does cache pages very aggressively. And it even has solutions for one of the hard problems in computer science: cache invalidation.
Pages are cached indefinitely until something pokes them.
Hi, I have been involved with the Wikimedia infrastructure since roughly 2004.
The 50k daily hits figure on https://meta.wikimedia.org/wiki/Live_mirrors is merely there to illustrate that a web proxy can dramatically reduce the number of hits to the origin site. They could have picked 42 or 7 billion; it is just an example.
We already had a few million pages rather than 50k of them, and they were not static, since pages get edited and go live as soon as someone saves their edit. Obviously, we already had millions of daily users, which definitely requires a ton of caching all across the stack. Also, good luck serving flat files that keep being written to; that does not really work with hard drives and static file storage.
In September 2021 we served 21 billion pages and 73 billion media files (https://stats.wikimedia.org/#/all-projects). Almost all of them were served primarily from caches.
That being said, the wiki pages you mentioned are a decade old. We definitely had trouble with people having the smart idea of scraping the whole website, often either requesting uncacheable content or hitting rarely served articles or history pages that were not cached and thus triggered a full rendering of the page. Even with the servers we had, a single user could cause major havoc on the infrastructure and keep a good part of it busy just to serve that single person. Hence the recommendation to use the database dumps, which are still used: https://dumps.wikimedia.org/
The NSA mirrors all of Wikipedia and Stack Overflow once every 24 hours from the public Internet to the classified Internet, so developers of classified applications can have references without needing to pivot to a separate unclassified workstation that needs to be at least six feet away.
I never found it to be a big deal. Once a day updates were perfectly fine for reference usage.
> because mediawiki apparently doesn't do proper caching and does a stupidly expensive re-rendering of every page
Don’t worry, they are hard at work making workarounds for all their idiotic design decisions that seem to make absolutely no sense... and those workarounds will be provided quite cheaply!!
Reminds me of the saying: “It is difficult to get a man to understand something when his salary depends upon his not understanding it.”
We can be quite sure that Wikipedia's engineering will never improve if their income becomes dependent on providing workarounds for deficiencies in their solutions.
I mean, they should be. Google uses Wikipedia content extensively on search results.
Google USED TO also have lots of prominent links that would drive traffic (and hence, donations) to Wikipedia. Over time, it has removed a lot of those links and replaced them with ones that lead to more Google hosted content.
There's a distinct drop in Google-referred traffic to Wikipedia over time that's in stark contrast to the increased number of eyeballs on Wikipedia sourced content on Google's pages.
It's not clear to me whether Google's sponsorship reflects the value they've gained. And, yes, I get that they are under no legal obligation to do anything at all, since the content license allows for all of it. But as with rich snippets, you can't just "take", or eventually, there won't be anything to "take".
Wikimedia should also let users opt into (small, tasteful) ads instead of fundraising appeals - a way to contribute indirectly, rather than through an explicit donation.
There are deliberate distortions in Wikipedia especially in regards to the national surveillance state. Companies should be aware that they might be paying for misinformation and disinformation.
I am currently researching a long-term editor/abuser who has a rather colorful history.
He runs a couple of attack websites targeting universities in both the US and Switzerland. These are current revisions of a couple of past websites that still appear as sources in numerous articles on Wikipedia. He supposedly volunteered at one of the universities before he was dropped because he lied about his credentials.
He ran in the 2018 Italian elections. His political party was suggested by some Italian and American sources as a fraud. He had tried for several years to inject articles and tidbits about the party into Wikipedia.
The sources which negatively reported on his political party have often been attacked by him. He managed to get one online newspaper's article removed from en-wiki and it-wiki. He appeared to do this for SEO impact, to push up sources that speak well of him and hide those that do not.
He recently tried to incorporate a dating website promoting himself and his search for a wife. He tried to present the site as a "mirror" of Wikipedia.
He used to run a service to circumvent anti-plagiarism tools like Turnitin. He did this while volunteering at the university mentioned above. There are several places where his service and his LinkedIn stories come up in the history of Wikipedia articles.
He wrote a book about Second Life and a couple of faux academic articles. He tried several times to integrate his book as a source on it-wiki. The faux academic articles were uploaded to Commons, and they were even cited in a few legit academic publications.
He operated a school in Second Life. He created articles about the school on Wikipedia, and one of the faux academic articles was about it. These were eventually discovered and removed.
He also created several articles about himself on en-wiki, it-wiki, simple-wiki, and ja-wiki. These have all been removed because several of his accounts had been blocked for sockpuppetry.
Is this something you're doing specific to wikipedia abuses?
If so, there was also a bit about them in the latest Radio War Nerd podcast episode about the continued propaganda edits related to the US Civil War and various confederate army officers.
I suspect if you got into white supremacist tropes that constantly (try to) edit wikipedia you could get an article into book-length...
Wikimedia is simply productizing commercial access to make it easier for enterprises to support it in a commercially justifiable way (vs pure charity) + setting up processes to improve enterprise-only features.