Duarte O.Carmo's articles

#59 Bagaço: A pretraining dataset for European Portuguese



Let's say your goal is to train a Large Language Model only on European Portuguese. Where do you start? What datasets are out there? What websites are being scraped for the large black box? Bagaço - named after the popular Portuguese moonshine - is a small step in that direction.

In June …


By Duarte O.Carmo