
That gives you Wikitext encapsulated in XML. How do you get at the content of the Wikitext?

I work on a Wikitext parser [1]. So do many other people, in different ways. Wikitext syntax is horrible and it mixes content and presentation indiscriminately (for example, it contains most of HTML as a sub-syntax).

The problem is basically unsolvable, as the result of parsing a Wiki page is defined only by a complete implementation of MediaWiki (with all its recursively-evaluated template pages, PHP code, and Lua code), but if you run that whole stack what you get in the end is HTML -- just the presentation, not the content you presumably wanted.

So people solve various pieces of the problem instead, creating approximate parsers that oversimplify various situations to meet their needs.
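To give a flavor of what "oversimplify" means in practice, here is a toy sketch (illustrative only, and nothing like how wikiparsec actually works): a few regexes that handle the most common markup and silently mangle everything else, such as nested templates and tables.

    import re

    def rough_plaintext(wikitext):
        """Very lossy wikitext-to-text conversion: covers the common cases,
        silently mangles the rest (nested templates, tables, refs, ...)."""
        text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                  # drop non-nested templates
        text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # [[target|label]] -> label
        text = re.sub(r"'{2,}", "", text)                               # strip ''italic'' / '''bold''' markers
        text = re.sub(r"<[^>]+>", "", text)                             # strip leftover HTML-ish tags
        return text.strip()

    print(rough_plaintext("'''Foo''' is a [[bar (music)|bar]].{{citation needed}}"))
    # -> "Foo is a bar."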

One of these solutions is DBPedia [2], but if you use DBPedia you have to watch out for the parts that are false or misleading due to parse errors.

[1] https://github.com/LuminosoInsight/wikiparsec

[2] http://wiki.dbpedia.org/




"That gives you Wikitext encapsulated in XML."

avar: "The goal of Wikipedia should be to spread the content as far & wide as possible, the way OpenStreetMap operates is a better model."

I am confused.

Doesn't OSM data come encapsulated in XML or some binary format?

As for dispersion of content, I could have sworn I have seen Wikipedia content on non-Wikipedia websites. Is there some restriction that prohibits this?

I have seen Wikipedia data offered in DNS TXT records as well.


For each article there is some metadata, but the entire text of an article is just a blob inside one XML element.
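To make that concrete, here is a minimal sketch of pulling those blobs out of a pages-articles dump (the file name is illustrative; streaming matters because the dumps are tens of gigabytes):

    import xml.etree.ElementTree as ET

    def iter_articles(dump_path):
        """Stream (title, wikitext) pairs out of a pages-articles XML dump.
        Each article's wikitext is one opaque blob in a <text> element;
        everything past this point is a parsing problem."""
        title, text = None, None
        for _, elem in ET.iterparse(dump_path, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]   # strip the export-format namespace
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                elem.clear()                    # keep memory bounded on a huge dump

    for title, wikitext in iter_articles("enwiki-latest-pages-articles.xml"):
        print(title, len(wikitext))
        break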

For anyone who has not worked with the Wikipedia data dumps extensively before, trust us that it is not easily machine-readable and that even solutions like DBPedia / Wikidata are not yet suitable for many purposes.


As someone who frequently contributes to many knowledge projects, including Wikipedia and Wikidata, I'm curious about what you mean when you say Wikidata is not yet suitable for any purposes. Am I wasting my time contributing to it? I thought that it was helping a lot of machines understand data. Can you please explain further?


Please reread, for many purposes! I love Wikipedia.

The Wiki markup is extremely complicated and, being user-created, it is also inconsistent and error-prone. I believe the MediaWiki parser itself is something like a single 5000 line PHP function! None of the alternate parsers I've tried are perfect. There is a ton of information encoded in the semi-structured markup, but it's still not easy to turn that into actual structured data. That's where the problem lies.
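To make that concrete, here is a toy example (mine, not anything from MediaWiki) of what happens when you treat an infobox as simple key=value markup:

    import re

    sample = ("{{Infobox settlement\n"
              "| name       = Springfield\n"
              "| population = {{formatnum:30720}}\n"
              "| leader     = [[Mayor]] Joe Quimby\n"
              "}}")

    def naive_infobox_fields(wikitext):
        """Grab the infobox by matching up to the first '}}' and split the body
        on '|' and '='. Looks reasonable, but the nested {{formatnum:...}}
        template ends the match at its own '}}', so fields get dropped or mangled."""
        match = re.search(r"\{\{Infobox[^}]*\}\}", wikitext, re.S)
        if not match:
            return {}
        fields = {}
        for part in match.group(0).strip("{}").split("|")[1:]:
            if "=" in part:
                key, value = part.split("=", 1)
                fields[key.strip()] = value.strip()
        return fields

    print(naive_infobox_fields(sample))
    # {'name': 'Springfield', 'population': '{{formatnum:30720'}  -- 'leader' is gone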


> believe the MediaWiki parser itself is something like a single 5000 line PHP function!

It's not. I'm on mobile so it's not easy to link, but the PHP version of the parser is nothing like a single function. There is also a nodejs version of the parser under active development, with the goal of replacing the PHP parser.


Thanks, I had heard that somewhere but stand corrected.


"... into actual structured data."

Would there be some particular structure that everyone would agree on?

Alternatively, what is the desired structure you want?

Because the current format is so messy, I just focus on what I believe is most important: titles and external links. IMO, the most interesting content in an article is often lifted from content found via the external links. I also would like to capture the talk pages. Maybe just the contributing usernames and IP addresses.
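For the external-links piece, even something as blunt as this sketch (illustrative, not robust) over the raw wikitext gets most of them:

    import re

    def external_links(wikitext):
        """Pull external link URLs (bare or bracketed, e.g. [https://... label])
        out of raw wikitext. Good enough for harvesting, not a faithful parse."""
        urls = re.findall(r"https?://[^\s\]|}<]+", wikitext)
        return sorted({u.rstrip(".,;:") for u in urls})

    print(external_links("See [https://example.org the site] and https://example.com/page."))
    # ['https://example.com/page', 'https://example.org']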

Opinions or explanations that have no supporting reference are inexpensive. One can always find these for free on the web. No problem recruiting "contributors" for that sort of "content".

Back to the question: I am curious what structure you envision would be best for Wikipedia data. Assume hypothetically that a "perfect" parser has been written for you to do the transformation.


The structure I need for my particular project (ConceptNet) is the following (there's a rough sketch in code after the list):

* The definitions from each Wiktionary entry.

* The links between those definitions, whether they are explicit templated links in a Translation or Etymology section, or vaguer links such as words in double-brackets in the definition. (These links carry a lot of information, and they're why I started my own parser instead of using DBnary.)

* The relations being conveyed by those links. (Synonyms? Hypernyms? Part of the definition of the word?)

* The links should clarify the language of the word they are linking to. (This takes some heuristics and some guessing so far, because Wiktionary pages define every word in any language that uses the same string of letters, and often the link target doesn't specify the language.)

* The languages involved should be identified by BCP 47 language codes, not by their names, because names are ambiguous. (Every Wiktionary but the English one is good at this.)
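Roughly, the per-link record I have in mind looks something like this (field names and values are illustrative only, not ConceptNet's actual schema):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class DefinitionLink:
        """One link between Wiktionary definitions, as described in the list above.
        Illustrative field names, not ConceptNet's real format."""
        source_word: str
        source_lang: str            # BCP 47 code, e.g. "fr", not "French"
        target_word: str
        target_lang: Optional[str]  # often has to be guessed from context
        relation: str               # e.g. "Synonym", "Hypernym", "TranslationOf"
        sense: Optional[str]        # which definition of the source word, if known

    example = DefinitionLink(
        source_word="chat", source_lang="fr",
        target_word="cat", target_lang="en",
        relation="TranslationOf", sense="domestic animal",
    )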

There are probably analogous relations to be extracted from Wikipedia, but it seems like an even bigger task whose fruit is higher-hanging.

Don't get me wrong: Wiktionary is an amazing, world-changing source of multilingual knowledge. Wiktionary plus Games With A Purpose are most of the reason why ConceptNet works so well and is mopping the floor with word2vec. And that's why I'm so desperate to get at what the knowledge is.


I don't think you are using this in the way it was meant to be used. Wikipedia is a user-edited, human-centered project. Humans are error-prone, and that's something that you are going to have to live with if you want to re-purpose the data.

The burden of repurposing falls on you, and Wikipedia makes the exact same data they have at their disposal available to you. To expect it in a more structured format that is usable by you and your project, but that goes beyond what Wikipedia needs in order to function, is asking a bit much, I think.

They make the dumps available, they make the parser they use available, what more could you reasonably ask for that does not exceed the intended use case for Wikipedia?

AFAICS, any work they do to make your life easier that increases the burden on Wikipedia contributors would be too much.

But since you are already so far along with this and you have your parser, what you could do is release your own intermediate-format dumps, which would make the lives of other researchers easier.


Yeah, I understand that. I'm re-purposing the data and it's my job to decide how that works.

But this could be easier. What I hate about Wikimedia's format is templates. They are not very human-editable (try editing a template sometime; unless you're an absolute pro, you will break thousands of articles and be kindly asked never to do that again) and not very computer-parseable. They're just the first thing someone thought of that worked with MediaWiki's existing feature set and put the right thing on the screen.

Replacing templates with something better -- which would require a project-wide engineering effort -- could make things more accessible to everyone.

FWIW, I do make the intermediate results of my parser downloadable, although to consider them "released" would require documenting them. For example: [1]

[1] https://s3.amazonaws.com/conceptnet/precomputed-data/2016/wi...


Agreed, editing anything more complex than simple text, i.e. a table or a note, is a chore. And I'm an advanced user!


The GP said Wikidata isn't suitable for many purposes, which is different from any.

It's a nice agreed-upon vocabulary for linked data. But you still need the data that the vocabulary refers to. The information you can get without ever leaving the Wikidata representation is still too sparse.


He's saying that Wikipedia doesn't give you clean, usable data; it gives you data with weird markup everywhere.


Thanks for working on that! Didn't know it was so bad. The following is a possibly stupid idea, but I'd like to hear your thoughts:

What if you just rendered the content into HTML, "screen scraped" the text, and then converted it into a more useful format (Markdown, JSON, etc.)? Is that plausible?


That would allow a basic UI change on Wikipedia to break your code. Sometimes it is necessary, but not usually the best option in my experience, and it's pretty annoying to do.


Wikipedia used to do HTML dumps but stopped a long time ago, unfortunately.


You can get what amounts to an HTML dump (which is then indexed and compressed into a single huge archive) from Kiwix, although they only do them about twice a year.


You should have a look at [1], which outputs an HTML rendering of pages with a lot of metadata.

[1] https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_pag...
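For instance, something like this fetches the Parsoid-rendered HTML for one page through that API's /page/html/{title} route (a sketch; the exact endpoint and recommended headers may have changed, so check the docs linked above):

    import requests

    def fetch_rendered_html(title, lang="en"):
        """Fetch Parsoid-rendered HTML for one article from the REST API.
        The markup carries data-mw / typeof attributes, so templates and links
        are easier to pick out than from a plain screen-scrape."""
        url = f"https://{lang}.wikipedia.org/api/rest_v1/page/html/{title}"
        resp = requests.get(url, headers={"User-Agent": "example-script/0.1 (contact@example.org)"})
        resp.raise_for_status()
        return resp.text

    html = fetch_rendered_html("Alan_Turing")
    print(html[:200])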


You could download the search indexes, also on the dumps site, which have the text content among other things.



