This is awesome. I've wanted to pre-train a language model in Brazilian Portuguese for a long time, and I just got a 24GB 3090, so this is perfect timing. Thank you for sharing your notebook.
Nice! You'd get a long way just by swapping out the Wikipedia dataset being pulled in. You might still need to account for some special characters, though; I don't know how important those are in Brazilian Portuguese.
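
If it helps, here is a rough way to see which special characters actually show up in your data before deciding how to handle them. It is only a sketch using the Python standard library: `non_ascii_chars` and the sample sentence are made up for illustration and are not part of the notebook, and you would run it over a slice of whatever Portuguese corpus you end up loading.

```python
import unicodedata
from collections import Counter


def non_ascii_chars(text: str) -> Counter:
    """Count characters outside the ASCII range (hypothetical helper)."""
    # Normalize to NFC so an accented letter is one code point,
    # not a base letter followed by a combining mark.
    normalized = unicodedata.normalize("NFC", text)
    return Counter(ch for ch in normalized if ord(ch) > 127)


# Placeholder text; swap in a sample from your own dataset.
sample = "Não há ação sem coração."
counts = non_ascii_chars(sample)

# Most frequent non-ASCII characters first.
for ch, n in counts.most_common():
    print(repr(ch), unicodedata.name(ch, "UNKNOWN"), n)
```

If the list is short (mostly accented vowels and ç), you can compare it against whatever characters or tokens your preprocessing keeps and see whether anything is being dropped.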