
An official word list is only useful if your NLP task is limited to text that itself sticks to that restricted vocabulary. Twitter and SMS data sets are interesting because they are closer to casual speech than to formal writing.
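
A rough way to see the gap is to measure how many tokens in a text are missing from the official list. This is only a sketch: the `OFFICIAL` set and both sample sentences are made up for illustration, and the tokenizer is a naive regex rather than anything a real pipeline would use.

```python
import re

def oov_rate(text, lexicon):
    """Return the fraction of tokens in `text` that are not in `lexicon`."""
    # Naive tokenizer (assumption): lowercase runs of letters, digits, apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    missing = [t for t in tokens if t not in lexicon]
    return len(missing) / len(tokens)

# Toy stand-in for an official word list (hypothetical, not a real dictionary).
OFFICIAL = {"i", "will", "see", "you", "tomorrow", "are", "late", "again"}

formal = "I will see you tomorrow"
casual = "omg u r late again lol"

print(oov_rate(formal, OFFICIAL))
print(oov_rate(casual, OFFICIAL))
```

With these toy inputs the formal sentence is fully covered, while most of the casual one falls outside the list, which is the kind of mismatch you would expect on Twitter or SMS data.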

The Académie française publishes an official dictionary and guidance on usage, but speakers hardly restrict themselves to its contents.

Filipino mostly refers to the Manila dialect of Tagalog, whereas Tagalog is a language with many dialects spoken in the Philippines. There are many other languages in the Philippines, but as far as I know they aren't referred to as Filipino.

For a lot of NLP problems you will probably have to make your own data set. It can be a lot of work.



