Learning GenAI via SOTA Papers

EP045: BLOOM, the Open-Source Rival to GPT-3

This paper introduces BLOOM, a 176-billion-parameter open-access multilingual language model developed by the BigScience Workshop, an international collaboration of hundreds of researchers. The project was motivated by the need to democratize powerful large language model (LLM) technology, which has predominantly been controlled by well-resourced corporate entities, and to conduct research with a strong emphasis on ethics, inclusivity, and data governance.

Key highlights of the paper include:

  • Training Dataset: BLOOM was trained on the ROOTS corpus, a carefully curated 1.61-terabyte dataset comprising 46 natural languages and 13 programming languages. The curation process prioritized human involvement, local language expertise, and respect for data rights.
  • Architecture and Engineering: The model is a causal, decoder-only Transformer utilizing ALiBi positional embeddings and a 250,000-token byte-level BPE tokenizer. It was trained over 3.5 months on the French Jean Zay supercomputer using 3D parallelism (data, tensor, and pipeline parallelism) via the Megatron-DeepSpeed framework.
  • Performance: Through extensive evaluation in zero-shot and one-shot settings, BLOOM demonstrated competitive capabilities across various tasks, including multilingual machine translation, abstractive summarization, and code generation. The authors also released BLOOMZ, a variant that underwent multitask prompted finetuning on the xP3 prompt collection, which significantly improved its zero-shot task generalization.
  • Environmental Impact: The researchers conducted a Life Cycle Assessment to estimate BLOOM's carbon footprint. Training BLOOM produced roughly 25 tons of CO2eq—significantly less than similar models like GPT-3 or OPT—largely due to the low-carbon nuclear energy grid powering the Jean Zay supercomputer.
  • Open Access and Licensing: To ensure responsible usage, the model, its code, and its various checkpoints were publicly released under a specialized Responsible AI License (RAIL). This licensing grants open access to the research community while legally restricting specific harmful or malicious use cases.
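The ALiBi scheme mentioned above replaces learned positional embeddings with a per-head penalty on attention scores that grows linearly with query-key distance. Below is a minimal NumPy sketch of the idea (the head-slope formula follows the original ALiBi recipe for power-of-two head counts; this is an illustration, not BLOOM's actual training code):

```python
import numpy as np

def alibi_slopes(num_heads):
    # Geometric sequence 2^(-8/n), 2^(-16/n), ... for n heads
    # (simplified to the power-of-two head-count case).
    start = 2.0 ** (-8.0 / num_heads)
    return np.array([start ** (h + 1) for h in range(num_heads)])

def alibi_bias(seq_len, num_heads):
    """Additive attention bias of shape (heads, queries, keys)."""
    pos = np.arange(seq_len)
    dist = pos[None, :] - pos[:, None]          # key_pos - query_pos
    slopes = alibi_slopes(num_heads)
    bias = slopes[:, None, None] * dist[None, :, :]
    # Causal mask: future keys (dist > 0) are excluded with -inf.
    return np.where(dist[None, :, :] <= 0, bias, -np.inf)

# Each head penalizes distant keys at a different rate; the bias is
# simply added to q·k scores before the softmax.
bias = alibi_bias(seq_len=4, num_heads=8)
```

Because the bias depends only on relative distance, models trained this way can extrapolate to sequences longer than those seen during training, one of the practical reasons the BLOOM team chose it.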

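The zero-shot and one-shot settings used in the evaluation differ only in how the prompt is assembled: zero-shot gives the model just an instruction and the test input, while one-shot prepends a single solved example. A hypothetical helper illustrating the pattern (the actual templates in the paper vary by task):

```python
def make_prompt(instruction, test_input, examples=()):
    """Build a zero-shot (no examples) or few-shot prompt string.

    `examples` is a sequence of (input, target) pairs shown to the
    model before the unsolved test input.
    """
    parts = []
    for src, tgt in examples:               # solved demonstrations
        parts.append(f"{instruction}\n{src}\n{tgt}")
    parts.append(f"{instruction}\n{test_input}\n")  # left for the model
    return "\n\n".join(parts)

zero_shot = make_prompt("Translate to French:", "cheese")
one_shot = make_prompt(
    "Translate to French:", "cheese",
    examples=[("sea otter", "loutre de mer")],
)
```

The one-shot demonstration gives the model the output format for free, which is why one-shot scores are often noticeably higher than zero-shot for base (non-instruction-tuned) models like BLOOM.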
Overall, the paper not only presents a state-of-the-art multilingual language model but also provides a comprehensive blueprint of the massive, coordinated open-science effort required to responsibly design, train, and evaluate it.
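The headline emissions figure can be sanity-checked with simple arithmetic from the paper's approximate numbers (dynamic energy consumption of roughly 433 MWh and France's grid carbon intensity of about 57 gCO2eq/kWh):

```python
# Back-of-envelope check of BLOOM's reported training emissions.
# Both inputs are approximate figures from the paper's assessment.
energy_kwh = 433_000                # ~433 MWh of dynamic power draw
grid_intensity_g_per_kwh = 57       # France's largely nuclear grid

emissions_tonnes = energy_kwh * grid_intensity_g_per_kwh / 1e6
# → roughly 25 tonnes CO2eq, an order of magnitude below estimates
#   for models trained on fossil-heavier grids.
```

The low grid intensity, rather than unusually efficient hardware, is what drives most of the gap versus GPT-3 and OPT.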


Learning GenAI via SOTA Papers, by Yun Wu