Facebook’s Libri-Light taps this potential. They have a script for easily fetching an entire language from LibriVox, and I have a pipeline on top of that for fetching the source books and aligning transcripts, whose output I’ve successfully trained on.
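To make the transcript alignment step concrete, here is a minimal sketch, not the actual pipeline. It assumes you already have a rough word-level hypothesis for an audio chunk (e.g. from a baseline ASR model) and the book text as a word list; `locate_span` is a hypothetical helper name, and the matching uses only Python's standard `difflib`.

```python
from difflib import SequenceMatcher

def locate_span(book_words, hyp_words):
    """Find the slice of book_words that best lines up with a noisy hypothesis.

    Returns (start, end, coverage), where book_words[start:end] is the
    candidate transcript and coverage is the fraction of hypothesis words
    that matched exactly. Returns None if nothing matched.
    """
    # autojunk=False so frequent words are not discarded on long books.
    matcher = SequenceMatcher(None, book_words, hyp_words, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return None
    start = blocks[0].a                      # first matched book word
    end = blocks[-1].a + blocks[-1].size     # one past the last matched book word
    matched = sum(b.size for b in blocks)
    return start, end, matched / len(hyp_words)

book = ("it is a truth universally acknowledged that a single man in "
        "possession of a good fortune must be in want of a wife").split()
# Hypothetical ASR output with one misrecognised word ("possesion").
hyp = "a single man in possesion of a good fortune".split()

span = locate_span(book, hyp)
if span is not None:
    start, end, coverage = span
    print(" ".join(book[start:end]), round(coverage, 2))
```

A stray match far from the true location can stretch the span, so in practice you would probably search only a window around the expected position and drop chunks with low coverage.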

I think the next step here needs to be auto-finding source books for LibriVox recordings that aren’t in e.g. Gutenberg. I only auto-found books for 75% of English, and for only 20% of German. I don’t actually know where to look beyond Gutenberg for books that don’t have easily linked text. Common Crawl? The LibriVox forums?
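For anyone wondering what "auto-finding" a source book can look like, here is a rough sketch under stated assumptions, not the matcher behind the numbers above. It fuzzy-matches a recording's title and author against a text catalog you have already loaded as dicts; the field names, the 0.7/0.3 weights, the 0.8 threshold, and the stripping of parentheticals such as "(version 3)" are all illustrative choices.

```python
import re
from difflib import SequenceMatcher

def normalize(text, sort_tokens=False):
    """Lowercase, drop parentheticals and punctuation, collapse whitespace."""
    text = re.sub(r"\([^)]*\)", " ", text.lower())   # e.g. "(version 3)"
    tokens = re.sub(r"[^\w\s]", " ", text).split()
    if sort_tokens:
        tokens = sorted(tokens)  # "Austen, Jane" and "Jane Austen" compare equal
    return " ".join(tokens)

def similarity(a, b, sort_tokens=False):
    a, b = normalize(a, sort_tokens), normalize(b, sort_tokens)
    return SequenceMatcher(None, a, b).ratio()

def best_match(recording, catalog, threshold=0.8):
    """Return the best-scoring catalog entry, or None if below threshold."""
    best, best_score = None, 0.0
    for entry in catalog:
        score = (0.7 * similarity(recording["title"], entry["title"])
                 + 0.3 * similarity(recording["author"], entry["author"],
                                    sort_tokens=True))
        if score > best_score:
            best, best_score = entry, score
    return best if best_score >= threshold else None

# Hypothetical catalog rows; a real catalog would be loaded from a file.
catalog = [
    {"id": 1, "title": "Pride and Prejudice", "author": "Austen, Jane"},
    {"id": 2, "title": "Moby Dick; Or, The Whale", "author": "Melville, Herman"},
]
recording = {"title": "Pride and Prejudice (version 3)", "author": "Jane Austen"}
print(best_match(recording, catalog))
```

Recordings that come back as `None` are the ones that would need a second source beyond the catalog.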

(Or you can take Facebook’s semi-supervised approach and generate a transcript from thin air instead, but imo that will work better for English than for languages that don’t have good baseline models yet.)



