Facebook’s Libri-Light taps this potential. They have a script for easily fetching an entire language from LibriVox, and I have a pipeline on top of that for fetching the book texts and aligning them to the audio, and I’ve successfully trained on its output.
I think the next step here needs to be auto-finding source texts for LibriVox books that aren’t in e.g. Project Gutenberg. I only auto-found texts for 75% of the English books, and for only 20% of the German ones. I don’t actually know where to look beyond Gutenberg for books that don’t have an easily linked text. Common Crawl? The LibriVox forums?
(Or you can take Facebook’s semi-supervised approach and generate a transcript from thin air instead, but imo that will work better for English than for languages that don’t have good baseline models yet.)
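For the text-finding step, one cheap thing to try against any new catalog is fuzzy title matching. This is only a rough sketch, not my actual pipeline: `normalize` and `match_titles` are made-up names, the 0.85 cutoff is a guess, and it assumes you already have plain lists of audiobook titles and candidate text titles from wherever you are searching.

```python
import difflib
import re


def normalize(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(title.split())


def match_titles(audiobook_titles, text_titles, cutoff=0.85):
    """Map each audiobook title to its closest text title, or None.

    cutoff is a similarity ratio in [0, 1]; 0.85 is an untuned assumption.
    """
    # Keep the first original title seen for each normalized form.
    normalized = {}
    for text_title in text_titles:
        normalized.setdefault(normalize(text_title), text_title)
    keys = list(normalized)

    matches = {}
    for title in audiobook_titles:
        close = difflib.get_close_matches(
            normalize(title), keys, n=1, cutoff=cutoff
        )
        matches[title] = normalized[close[0]] if close else None
    return matches
```

Titles alone will produce false matches (different editions, translations, same-named works), so in practice you would probably also want to compare authors and language before trusting a hit.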