I have helped with the writing and editing of a number of dictionaries over the ...

tasogare · on Sept 13, 2021

Yes, dictionary content is the revenue source of dictionary vendors, so of course they don't want anyone to use it without permission. On the other hand, there are more and more open-data projects (I started one myself), often based on printed dictionaries that have fallen into the public domain.

tkgally · on Sept 13, 2021

I started one myself, too, more than twenty years ago, in a naive burst of enthusiasm about the potential of online collaboration. It was an attempt to create a new comprehensive Japanese-English dictionary from scratch, not based on existing dictionaries [1]. Other volunteers were equally enthusiastic, but the immensity of the task before us, and the competing time pressures of paying work, caused us to gradually stop working on it.

[1] https://t-nex.jp/dictionaries/jekai/index.html

tasogare · on Sept 13, 2021

Did you start it before Breen's JEDict existed? It's great for English, but other Japanese-* pairs are poorly endowed. There is also focused work on neologisms that could be done with a few people and could really improve existing free solutions.

Personally, I'm taking the route of digitizing an existing dictionary, albeit manually, because that can't be automated in this case.

tkgally · on Sept 13, 2021

JEDict already existed, but it was just a glossary, i.e., Japanese words with English equivalents. But there aren’t many one-to-one correspondences in meaning between Japanese and English words, and such glossaries, while useful, can also be frustrating and misleading to users. My idea for jeKai was to create a dictionary with explanatory definitions, like those that appear in monolingual dictionaries. That turned out to be much more work than we were ready to do, though.

A paper I wrote nine years ago on related issues is here, in case you are interested:

“Kokugo Dictionaries as Tools for Learners: Problems and Potential”

https://researchmap.jp/multidatabases/multidatabase_contents...

tasogare · on Sept 13, 2021

Well, a source-language word paired with one or more translations is the minimal structure for a dictionary, and some printed dictionaries are indeed like this.

I browsed some entries of your dictionary, and indeed there are (sometimes quite elaborate) explanations. I can easily see why it ran out of steam, especially since even basic compiling is a daunting task. Also, contributors are few. On the Jibiki.fr (Japanese-French) project, most of the corrections are made by two members. They're starting from an existing dictionary, and it still took years just to check the headwords.

Thanks very much for the paper. I'll read it. I actually already had one of your papers on my machine (Asialex 2011 proceedings) but haven't read it yet.

tkgally · on Sept 13, 2021

Thank you. I tried to e-mail you at the address on your profile page to suggest that we continue this discussion privately, but Gmail replied with “Your message wasn't delivered to [that address] because the address couldn't be found, or is unable to receive mail.” If you would like to chat about these issues further, please e-mail me at the address on my paper or at my personal website.

simondotau · on Sept 13, 2021

What’s the state of Wiktionary like in your opinion?

tkgally · on Sept 13, 2021

I hadn’t used Wiktionary for a few years, so I just spent some time looking through it. It was pretty good a few years ago, and now it looks even better. I’m sure many people find it very useful. The amount of information on each page, though, might make it a bit intimidating to some users.

It also seems to have some unevenness in coverage. For example, the entry for the word “anecdata” (a word discussed recently at [1]), has five illustrative quotations, which are quite handy [2]. The entry for the more mundane “anecdote,” however, has none [3]. Such unevenness might be inevitable in volunteer dictionary projects, as volunteers like to work on the more interesting words.

[1] https://news.ycombinator.com/item?id=28375767

[2] https://en.wiktionary.org/wiki/anecdata

[3] https://en.wiktionary.org/wiki/anecdote

peterburkimsher · on Sept 13, 2021

I use Wiktionary pretty often, and it has come in particularly useful this past week!

We're translating some strings on our software user interface, and checking the abbreviations and acronyms used. Sometimes there are amusing or [nsfw] connotations in other languages! Thank you Wiktionary for warning us about abbreviating "low pressure" as "LP" in Taiwan.

https://en.wiktionary.org/wiki/LP#Noun_2

tasogare · on Sept 13, 2021

I don't use it often enough, either as a user or in my projects, to have a definite opinion. There are some pairs of words (about 5K) in Sino-Vietnamese that came with their chu nom writing, which was very helpful to one of my projects. Otherwise I think it lacks structure and can't be harvested automatically easily (I don't think Wikidata integrates it all, and that website is a non-starter for me). Also, every language is structured differently, so Wiktionary can hardly be commented on as a whole.

Someone · on Sept 13, 2021

https://en.wiktionary.org/wiki/Help:FAQ:

Q: Is it possible to download Wiktionary?

A: Yes. https://dumps.wikimedia.org/enwiktionary/ should have the latest copy of the main namespace. The cleanest navigation page is https://dumps.wikimedia.org/. Just download a -articles.xml.bz2 file and some software to read it (for nix, for Windows).

Q: Can I use data from Wiktionary in my program?

A: As long as you meet the conditions of the GNU Free Documentation License or Creative Commons Attribution/Share-Alike License, certainly.

Latest dump for English is from September 1. I wouldn’t know whether it has all the data or how easy it is to parse it.
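
On the "how easy is it to parse" question, here is a rough Python sketch of streaming through a `-articles.xml.bz2` dump without unpacking it first. It assumes the usual MediaWiki export layout (`<page>` elements containing `<title>`, `<ns>`, and a `<text>` inside `<revision>`), and the function names `iter_pages` and `iter_dump` are made up for illustration. It only gets you the raw wikitext of each entry; making sense of that wikitext is a separate problem.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for main-namespace pages from an open XML stream.

    Assumption: the stream follows the MediaWiki export layout, where
    main-namespace entries carry <ns>0</ns>.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = ns = text = None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = child.text
            elif name == "text":
                text = child.text
        if ns == "0" and title is not None:
            yield title, text or ""
        # Drop the page's children so memory use stays manageable.
        elem.clear()


def iter_dump(path):
    """Stream pages straight from a bz2-compressed dump file."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)


# Hypothetical usage, once a dump has been downloaded locally:
# for title, wikitext in iter_dump("enwiktionary-latest-pages-articles.xml.bz2"):
#     ...
```

Streaming with `iterparse` rather than loading the whole tree is a deliberate choice here, since the uncompressed dump is far too large to hold in memory comfortably.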

thombles · on Sept 13, 2021

> Otherwise I think it lacks structure and can't be harvested automatically easily

Indeed, it depends on the language and your goals - I had a very high success rate plucking out Russian grammatical tables from English Wiktionary with a few hours of scripting the data cleaning (https://github.com/thombles/declensions). I have a theory that you could get better results using an offline archive of the page sources but haven't tried this yet.
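
To make the "page sources" idea a bit more concrete, here is a minimal Python sketch of what working from raw wikitext might look like: cut out one language's section, then list the template calls in it. This is not how the declensions project above works; the helper names are hypothetical, the template name in the usage comment is only an illustrative placeholder, and the regex deliberately ignores nested templates, which real entries do contain.

```python
import re

# Level-2 headers such as "==Russian==" delimit language sections.
LEVEL2 = re.compile(r"^==([^=\n]+)==[ \t]*$", re.MULTILINE)

# Matches flat template calls like {{name|arg1|arg2}}; nested ones are skipped.
TEMPLATE = re.compile(r"\{\{([^{}|]+)((?:\|[^{}|]*)*)\}\}")


def language_section(wikitext, language):
    """Return the body of the '==language==' section, or None if absent."""
    headers = list(LEVEL2.finditer(wikitext))
    for i, match in enumerate(headers):
        if match.group(1).strip() == language:
            end = headers[i + 1].start() if i + 1 < len(headers) else len(wikitext)
            return wikitext[match.end():end].strip()
    return None


def templates(section, prefix):
    """List (name, [args]) for flat templates whose name starts with prefix."""
    found = []
    for match in TEMPLATE.finditer(section):
        name = match.group(1).strip()
        if name.startswith(prefix):
            args = match.group(2).split("|")[1:] if match.group(2) else []
            found.append((name, args))
    return found


# Hypothetical usage on one page's wikitext:
# section = language_section(wikitext, "Russian")
# if section is not None:
#     print(templates(section, "ru-"))
```

The catch is that the template arguments are inputs to the inflection logic, not the finished tables, so this route would still need the template expansion reproduced or run somewhere.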