


A couple of months ago I released Bagaço, a pretraining dataset for European Portuguese. The idea was simple: take the FineWeb 2 dataset, limit it to web pages that look like they came from Portugal, and classify them into categories …
By Duarte O. Carmo