Duarte O.Carmo's articles

#66 The largest open pretraining dataset for European Portuguese



Educational score over time vs. document count for Bagaço v2

A couple of months ago I released Bagaço - a pretraining dataset for European Portuguese. The idea was simple: take the FineWeb 2 dataset, limit it to web pages that look like they came from Portugal, and classify them into categories …
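The "pages that look like they came from Portugal" step can be sketched roughly as below. This is a minimal illustration and not the actual Bagaço pipeline: the `.pt` top-level-domain heuristic, the function names `looks_portuguese` and `filter_portuguese`, and the sample records are all assumptions made for the example. It also assumes each document is a dict with a `url` field. The classification step is not sketched because the text does not specify the categories.

```python
from urllib.parse import urlparse


def looks_portuguese(url: str) -> bool:
    """Heuristic (an assumption, not the author's rule): host is under the .pt TLD."""
    host = urlparse(url).hostname or ""
    return host.endswith(".pt")


def filter_portuguese(docs):
    """Yield only the documents whose URL passes the heuristic above."""
    for doc in docs:
        if looks_portuguese(doc.get("url", "")):
            yield doc


# Hypothetical sample records, shaped like web-crawl rows with a URL and text.
sample = [
    {"url": "https://www.exemplo.pt/noticia", "text": "..."},
    {"url": "https://www.exemplo.com.br/pagina", "text": "..."},
    {"url": "https://example.com/pt/artigo", "text": "..."},
]

kept = [doc["url"] for doc in filter_portuguese(sample)]
print(kept)
```

A TLD check like this is cheap, but it misses Portuguese sites hosted on `.com` or `.org` domains, so a real filter would likely need additional signals.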


By Duarte O.Carmo