
The structure I need for my particular project (ConceptNet) is:

* The definitions from each Wiktionary entry.

* The links between those definitions, whether they are explicit templated links in a Translation or Etymology section, or vaguer links such as words in double-brackets in the definition. (These links carry a lot of information, and they're why I started my own parser instead of using DBnary.)

* The relations being conveyed by those links. (Synonyms? Hypernyms? Part of the definition of the word?)

* The links should identify the language of the word they link to. (This takes some heuristics and some guessing so far, because a single Wiktionary page defines every word, in every language, that is spelled with the same string of letters, and the link target often doesn't specify the language. There's a rough sketch of this after the list.)

* The languages involved should be identified by BCP 47 language codes, not by their names, because names are ambiguous. (Every Wiktionary but the English one is good at this.)
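
To give a sense of what this extraction looks like, here's a rough sketch in Python (the regex and the tiny language table are illustrations, not my actual parser; real Wiktionary markup needs a full table of language-section names and a lot more care):

    import re

    # A made-up mini-table; a real parser needs the full mapping from
    # Wiktionary language-section names to BCP 47 codes.
    LANGUAGE_CODES = {
        "English": "en",
        "French": "fr",
        "German": "de",
        "Finnish": "fi",
    }

    # Matches [[target]], [[target|display]], and [[target#Language|display]].
    LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#([^\]|]+))?(?:\|[^\]]*)?\]\]")

    def extract_links(definition, default_lang="en"):
        """Yield (target, bcp47_code) pairs from one definition line.

        A link like [[chat#French]] names its language explicitly; a
        bare [[chat]] doesn't, so we have to guess -- here, crudely, by
        assuming the language of the section we're parsing.
        """
        for match in LINK_RE.finditer(definition):
            target, section = match.group(1), match.group(2)
            yield target.strip(), LANGUAGE_CODES.get(section, default_lang)

    print(list(extract_links(
        "A small [[domesticated]] [[feline]] [[animal#English|animal]].")))
    # -> [('domesticated', 'en'), ('feline', 'en'), ('animal', 'en')]

The fallback in the last step is where the heuristics live; a real parser would also look at the language section the definition sits under and at the templates around it.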

There are probably analogous relations to be extracted from Wikipedia, but it seems like an even bigger task whose fruit is higher-hanging.

Don't get me wrong: Wiktionary is an amazing, world-changing source of multilingual knowledge. Wiktionary plus Games With A Purpose are most of the reason ConceptNet works so well and is mopping the floor with word2vec. And that's why I'm so desperate to get at the knowledge that's in there.




I don't think you're using this in the way it was meant to be used. Wikipedia is a user-edited, human-centered project. Humans are error-prone, and that's something you're going to have to live with if you want to repurpose the data.

The burden of repurposing falls on you, and Wikipedia makes available to you the exact same data it has at its disposal. To expect that data in a more structured format, one that is usable by your project but goes beyond what Wikipedia needs in order to function, is asking a bit much, I think.

They make the dumps available, and they make the parser they use available. What more could you reasonably ask for that doesn't exceed the intended use case for Wikipedia?

As far as I can see, any work they do to make your life easier that also increases the burden on Wikipedia contributors would be too much to ask.

But since you're already so far along with this and you have your parser, what you could do is release your own intermediate-format dumps, which would make the lives of other researchers easier.


Yeah, I understand that. I'm repurposing the data, and it's my job to decide how that works.

But this could be easier. What I hate about Wikimedia's format is templates. They're not very human-editable (try editing a template sometime; unless you're an absolute pro, you'll break thousands of articles and be kindly asked never to do that again) and not very computer-parseable. They're just the first thing someone thought of that worked with MediaWiki's existing feature set and put the right thing on the screen.
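
To make "not very computer-parseable" concrete: a translation entry on the English Wiktionary looks like {{t+|fr|chat|m}}. Here's a naive parse of that shape (a sketch, not real template handling):

    def parse_template(text):
        """Naively split one flat template like {{t+|fr|chat|m}}
        into (name, positional_args, named_args)."""
        inner = text.strip()[2:-2]        # drop the {{ }} braces
        name, *params = inner.split("|")
        args, kwargs = [], {}
        for param in params:
            if "=" in param:
                key, _, value = param.partition("=")
                kwargs[key.strip()] = value.strip()
            else:
                args.append(param)
        return name, args, kwargs

    print(parse_template("{{t+|fr|chat|m}}"))
    # -> ('t+', ['fr', 'chat', 'm'], {})

This works until it doesn't: templates nest inside each other's parameters, pipes can appear inside nested links, and what a template actually renders as is decided by code on the server, not by anything in the markup itself.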

Replacing templates with something better -- which would require a project-wide engineering effort -- could make things more accessible to everyone.

FWIW, I do make the intermediate results of my parser downloadable, although considering them "released" would require documenting them. For example: [1]

[1] https://s3.amazonaws.com/conceptnet/precomputed-data/2016/wi...



