
That gives you Wikitext encapsulated in XML. How do you get at the content of the Wikitext?

I work on a Wikitext parser [1]. So do many other people, in different ways. Wikitext syntax is horrible and it mixes content and presentation indiscriminately (for example, it contains most of HTML as a sub-syntax).

The problem is basically unsolvable, as the result of parsing a Wiki page is defined only by a complete implementation of MediaWiki (with all its recursively-evaluated template pages, PHP code, and Lua code), but if you run that whole stack what you get in the end is HTML -- just the presentation, not the content you presumably wanted.

So people solve various pieces of the problem instead, creating approximate parsers that oversimplify various situations to meet their needs.
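To give a flavor of what "oversimplify" means in practice, here is a toy sketch (illustrative only, and nothing like how wikiparsec actually works): a few regexes that handle the most common markup and silently mangle everything else, such as nested templates and tables.

    import re

    def rough_plaintext(wikitext):
        """Very lossy wikitext-to-text conversion: covers the common cases,
        silently mangles the rest (nested templates, tables, refs, ...)."""
        text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                  # drop non-nested templates
        text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # [[target|label]] -> label
        text = re.sub(r"'{2,}", "", text)                               # strip ''italic'' / '''bold''' markers
        text = re.sub(r"<[^>]+>", "", text)                             # strip leftover HTML-ish tags
        return text.strip()

    print(rough_plaintext("'''Foo''' is a [[bar (music)|bar]].{{citation needed}}"))
    # -> "Foo is a bar."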

One of these solutions is DBPedia [2], but if you use DBPedia you have to watch out for the parts that are false or misleading due to parse errors.

[1] https://github.com/LuminosoInsight/wikiparsec

[2] http://wiki.dbpedia.org/




"That gives you Wikitext encapsulated in XML."

avar: "The goal of Wikipedia should be to spread the content as far & wide as possible, the way OpenStreetMap operates is a better model."

I am confused.

Doesn't OSM data come encapsulated in XML or some binary format?

As for dispersion of content, I could have sworn I have seen Wikipedia content on non-Wikipedia websites. Is there some restriction that prohibits this?

I have seen Wikipedia data offered in DNS TXT records as well.


For each article there is some metadata, but the entire text of an article is just a blob inside one XML element.
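To make that concrete, here is a minimal sketch of pulling those blobs out of a pages-articles dump (the file name is illustrative; streaming matters because the dumps are tens of gigabytes):

    import xml.etree.ElementTree as ET

    def iter_articles(dump_path):
        """Stream (title, wikitext) pairs out of a pages-articles XML dump.
        Each article's wikitext is one opaque blob in a <text> element;
        everything past this point is a parsing problem."""
        title, text = None, None
        for _, elem in ET.iterparse(dump_path, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]   # strip the export-format namespace
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                elem.clear()                    # keep memory bounded on a huge dump

    for title, wikitext in iter_articles("enwiki-latest-pages-articles.xml"):
        print(title, len(wikitext))
        break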

For anyone who has not worked with the Wikipedia data dumps extensively before, trust us that it is not easily machine-readable and that even solutions like DBPedia / Wikidata are not yet suitable for many purposes.


As someone who frequently contributes to many knowledge projects, including Wikipedia and Wikidata, I'm curious about what you mean when you say Wikidata is not yet suitable for any purposes. Am I wasting my time contributing to it? I thought that it was helping a lot of machines understand data. Can you please explain further?


Please reread, for many purposes! I love Wikipedia.

The Wiki markup is extremely complicated and, being user-created, it is also inconsistent and error-prone. I believe the MediaWiki parser itself is something like a single 5000 line PHP function! None of the alternate parsers I've tried are perfect. There is a ton of information encoded in the semi-structured markup, but it's still not easy to turn that into actual structured data. That's where the problem lies.
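To make that concrete, here is a toy example (mine, not anything from MediaWiki) of what happens when you treat an infobox as simple key=value markup:

    import re

    sample = ("{{Infobox settlement\n"
              "| name       = Springfield\n"
              "| population = {{formatnum:30720}}\n"
              "| leader     = [[Mayor]] Joe Quimby\n"
              "}}")

    def naive_infobox_fields(wikitext):
        """Grab the infobox by matching up to the first '}}' and split the body
        on '|' and '='. Looks reasonable, but the nested {{formatnum:...}}
        template ends the match at its own '}}', so fields get dropped or mangled."""
        match = re.search(r"\{\{Infobox[^}]*\}\}", wikitext, re.S)
        if not match:
            return {}
        fields = {}
        for part in match.group(0).strip("{}").split("|")[1:]:
            if "=" in part:
                key, value = part.split("=", 1)
                fields[key.strip()] = value.strip()
        return fields

    print(naive_infobox_fields(sample))
    # {'name': 'Springfield', 'population': '{{formatnum:30720'}  -- 'leader' is gone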


> believe the MediaWiki parser itself is something like a single 5000 line PHP function!

It's not. I'm on mobile so it's not easy to link, but the PHP version of the parser is nothing like a single function. There is also a nodejs version of the parser under active development, with the goal of replacing the PHP parser.


Thanks, I had heard that somewhere but stand corrected.


"... into actual structured data."

Would there be some particular structure that everyone would agree on?

Alternatively, what is the desired structure you want?

Because the current format is so messy, I just focus on what I believe is most important: titles and external links. IMO, the most interesting content in an article is often lifted from content found via the external links. I also would like to capture the talk pages. Maybe just the contributing usernames and IP addresses.
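For the external-links piece, even something as blunt as this sketch (illustrative, not robust) over the raw wikitext gets most of them:

    import re

    def external_links(wikitext):
        """Pull external link URLs (bare or bracketed, e.g. [https://... label])
        out of raw wikitext. Good enough for harvesting, not a faithful parse."""
        urls = re.findall(r"https?://[^\s\]|}<]+", wikitext)
        return sorted({u.rstrip(".,;:") for u in urls})

    print(external_links("See [https://example.org the site] and https://example.com/page."))
    # ['https://example.com/page', 'https://example.org']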

Opinions or explanations that have no supporting reference are inexpensive. One can always find these for free on the web. No problem recruiting "contributors" for that sort of "content".

Back to the question: I am curious what structure you envision would be best for Wikipedia data. Assume hypothetically that a "perfect" parser has been written for you to do the transformation.


The structure I need for my particular project (ConceptNet) is the following (there's a rough sketch in code after the list):

* The definitions from each Wiktionary entry.

* The links between those definitions, whether they are explicit templated links in a Translation or Etymology section, or vaguer links such as words in double-brackets in the definition. (These links carry a lot of information, and they're why I started my own parser instead of using DBnary.)

* The relations being conveyed by those links. (Synonyms? Hypernyms? Part of the definition of the word?)

* The links should clarify the language of the word they are linking to. (This takes some heuristics and some guessing so far, because Wiktionary pages define every word in any language that uses the same string of letters, and often the link target doesn't specify the language.)

* The languages involved should be identified by BCP 47 language codes, not by their names, because names are ambiguous. (Every Wiktionary but the English one is good at this.)
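Roughly, the per-link record I have in mind looks something like this (field names and values are illustrative only, not ConceptNet's actual schema):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class DefinitionLink:
        """One link between Wiktionary definitions, as described in the list above.
        Illustrative field names, not ConceptNet's real format."""
        source_word: str
        source_lang: str            # BCP 47 code, e.g. "fr", not "French"
        target_word: str
        target_lang: Optional[str]  # often has to be guessed from context
        relation: str               # e.g. "Synonym", "Hypernym", "TranslationOf"
        sense: Optional[str]        # which definition of the source word, if known

    example = DefinitionLink(
        source_word="chat", source_lang="fr",
        target_word="cat", target_lang="en",
        relation="TranslationOf", sense="domestic animal",
    )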

There are probably analogous relations to be extracted from Wikipedia, but it seems like an even bigger task whose fruit is higher-hanging.

Don't get me wrong: Wiktionary is an amazing, world-changing source of multilingual knowledge. Wiktionary plus Games With A Purpose are most of the reason why ConceptNet works so well and is mopping the floor with word2vec. And that's why I'm so desperate to get at what the knowledge is.


I don't think you are using this in the way it was meant to be used. Wikipedia is a user-edited, human-centered project. Humans are error-prone, and that's something that you are going to have to live with if you want to re-purpose the data.

The burden of repurposing falls on you, and Wikipedia makes the exact same data they have at their disposal available to you. To expect it in a more structured format that is usable by you and your project, but that goes beyond what Wikipedia needs in order to function, is asking a bit much, I think.

They make the dumps available, they make the parser they use available, what more could you reasonably ask for that does not exceed the intended use case for Wikipedia?

AFAICS, any work they do to make your life easier that increases the burden on Wikipedia contributors would be too much.

But since you are already so far along with this and you have your parser, what you could do is release your own intermediate-format dumps, which would make the lives of other researchers easier.


Yeah, I understand that. I'm re-purposing the data and it's my job to decide how that works.

But this could be easier. What I hate about Wikimedia's format is templates. They are not very human-editable (try editing a template sometime; unless you're an absolute pro, you will break thousands of articles and be kindly asked never to do that again) and not very computer-parseable. They're just the first thing someone thought of that worked with MediaWiki's existing feature set and put the right thing on the screen.

Replacing templates with something better -- which would require a project-wide engineering effort -- could make things more accessible to everyone.

FWIW, I do make the intermediate results of my parser downloadable, although to consider them "released" would require documenting them. For example: [1]

[1] https://s3.amazonaws.com/conceptnet/precomputed-data/2016/wi...


Agreed, editing anything more complex than simple text, i.e. a table or a note, is a chore. And I'm an advanced user!


The GP said Wikidata isn't suitable for many purposes, which is different from any.

It's a nice agreed-upon vocabulary for linked data. But you still need the data that the vocabulary refers to. The information you can get without ever leaving the Wikidata representation is still too sparse.


He's saying that Wikipedia doesn't give you clean, usable data; it gives you data with weird markup everywhere.


Thanks for working on that! Didn't know it was so bad. The following is a possibly stupid idea, but I'd like to hear your thoughts:

What if you just rendered the content into HTML, "screen scraped" the text, and then converted it into a more useful format (Markdown, JSON, etc.)? Is that plausible?


That would allow a basic UI change on Wikipedia to break your code. Sometimes it is necessary, but not usually the best option in my experience, and it's pretty annoying to do.


Wikipedia used to do HTML dumps but stopped a long time ago, unfortunately.


You can get what amounts to an HTML dump (which is then indexed and compressed into a single huge archive) from Kiwix, although they only do them about twice a year.


You should have a look at [1], which outputs an HTML rendering of pages with a lot of metadata.

[1] https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_pag...
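For instance, something like this fetches the Parsoid-rendered HTML for one page through that API's /page/html/{title} route (a sketch; the exact endpoint and recommended headers may have changed, so check the docs linked above):

    import requests

    def fetch_rendered_html(title, lang="en"):
        """Fetch Parsoid-rendered HTML for one article from the REST API.
        The markup carries data-mw / typeof attributes, so templates and links
        are easier to pick out than from a plain screen-scrape."""
        url = f"https://{lang}.wikipedia.org/api/rest_v1/page/html/{title}"
        resp = requests.get(url, headers={"User-Agent": "example-script/0.1 (contact@example.org)"})
        resp.raise_for_status()
        return resp.text

    html = fetch_rendered_html("Alan_Turing")
    print(html[:200])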


You could download the search indexes, also on the dumps site, which have the text content among other things.



