AI: post transformers

Marin: Open LLM Optimization & Diagnostics



Marin is an open lab dedicated to the transparent research and development of foundation models (FMs). Its core mission is to identify **how to build the best model with a fixed resource budget**, covering both compute and data. The lab practices complete transparency from day one, organizing its entire research and development process through **GitHub**: every research effort, from wish-list items to completed runs, is tracked by a dedicated **GitHub issue** that serves as a mini-preregistration, with experiments declared in code and open to community review. This methodology, borrowed from open-source software practice, promotes reproducibility and public disclosure of all results, including mistakes and negative outcomes. Marin has already produced models such as the **Marin 8B Base**, which outperforms Llama 3.1 8B Base on a majority of standard evaluations.
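To make the "experiment declared in code" idea concrete, here is a minimal, hypothetical sketch. The class, field names, and values are assumptions for illustration only, not Marin's actual framework or API:

```python
# Hypothetical sketch of an experiment declared in code and tied to a GitHub issue.
# None of these names are Marin's real API; they only illustrate the workflow.
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """A mini-preregistration: what will be run and why, reviewable before launch."""
    name: str
    github_issue: str          # URL of the issue tracking this effort
    hypothesis: str            # what the run is expected to show
    config: dict = field(default_factory=dict)  # model/data/optimizer settings


exp = Experiment(
    name="example-pretraining-run",
    github_issue="https://github.com/marin-community/marin/issues/<issue-number>",
    hypothesis="State the expected outcome before launching the run.",
    config={"model_size": "8b", "lr_schedule": "wsd"},  # illustrative values
)
print(exp)
```

Because the declaration lives in version control next to the issue, reviewers can comment on the plan before compute is spent, and the eventual results (positive or negative) stay linked to the original intent.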


We also review the academic papers Marin builds upon; they detail the **foundational LLM research** that is essential to Marin's work and map out the competitive ecosystem. The sources cover crucial topics directly informing Marin's efforts, such as data curation strategies (DCLM-BASELINE), modern open-source model architectures (OLMo 2), and theoretical explanations of optimization dynamics such as the **Warmup-Stable-Decay (WSD) learning rate schedule**, framed through the **river valley loss landscape** metaphor. This foundational research relates directly to the internal project referenced here, GitHub **Issue #826**, which proposes adding **Levanter's visualization capabilities** to the codebase. The initiative aims to diagnose training stability issues, specifically investigating unexpected **jumps in log probabilities** to determine whether they stem from genuinely hard domains or from simple formatting errors.
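As a rough illustration of the kind of diagnostic Issue #826 is after (this is not Levanter's actual visualization code; the function name, threshold, and token list below are assumptions), one can scan per-token log probabilities for sudden drops and check whether the offending tokens look like formatting rather than content:

```python
import numpy as np

# Hypothetical diagnostic, not Levanter's API: flag sharp per-token log-probability
# drops and guess whether they come from formatting tokens or genuine content.
FORMATTING_TOKENS = {"\n", "\n\n", "\t", "|", "---", "###"}  # assumed examples


def find_logprob_jumps(tokens, logprobs, threshold=5.0):
    """Return positions where the log prob drops sharply versus the previous token."""
    logprobs = np.asarray(logprobs)
    deltas = np.diff(logprobs)                     # change from token i to token i+1
    jump_positions = np.where(deltas < -threshold)[0] + 1
    report = []
    for pos in jump_positions:
        tok = tokens[pos]
        is_formatting = tok.strip() in FORMATTING_TOKENS or not tok.strip()
        report.append({
            "position": int(pos),
            "token": repr(tok),
            "logprob": float(logprobs[pos]),
            "suspected_cause": "formatting" if is_formatting else "content",
        })
    return report


# Toy usage: a sharp drop on a newline token is flagged as a formatting artifact.
tokens = ["The", " capital", " of", " France", "\n\n", " is", " Paris"]
logprobs = [-1.2, -0.8, -0.5, -0.4, -9.7, -0.6, -0.3]
for item in find_logprob_jumps(tokens, logprobs):
    print(item)
```

Separating the two causes matters: formatting artifacts call for a data-cleaning fix, while genuinely hard domains may call for changes to the data mix or training setup.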


Sources:


1. arXiv ID 2406.11794: Introduces the DataComp for Language Models (DCLM) benchmark, built on a 240T-token Common Crawl corpus, and the superior DCLM-BASELINE dataset obtained from it through model-based filtering.

◦ URL: https://arxiv.org/pdf/2406.11794

2. arXiv ID 2501.00656: Presents the OLMo 2 family (7B, 13B, 32B), a second generation of fully open LLMs (including weights, data, and code). Key features include improved stability and the use of the specialized Dolmino Mix 1124 data during mid-training.

◦ URL: https://arxiv.org/pdf/2501.00656

3. Introduces Marin, an open lab dedicated to transparent foundation model research, tracking all experiments via GitHub issues and offering the Datashop platform for expert data curation.

◦ URL: https://marin.community/

4. Feb 22, 2025: GitHub Issue #826 in marin-community/marin proposing to add Levanter visualization capabilities to analyze jumps in log probabilities (checking for domain or formatting issues).

◦ URL: https://github.com/marin-community/marin/issues/826

5. arXiv ID 2410.05192: Analyzes the Warmup-Stable-Decay (WSD) learning rate schedule through the river valley loss landscape perspective and proposes WSD-S for efficient continual LLM pretraining (0.1B to 1.2B models); a minimal sketch of the schedule's shape follows this list.

◦ URL: https://arxiv.org/pdf/2410.05192
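As referenced in item 5, here is a minimal sketch of the WSD shape the paper analyzes: linear warmup, a long stable plateau at the peak rate, then a decay phase. The step counts and rates below are illustrative assumptions, not values from the paper.

```python
def wsd_learning_rate(step, *, peak_lr=3e-4, min_lr=3e-5,
                      warmup_steps=1_000, stable_steps=80_000, decay_steps=10_000):
    """Warmup-Stable-Decay: linear warmup, constant plateau, then linear decay.

    All hyperparameters here are illustrative placeholders.
    """
    if step < warmup_steps:                          # warmup: 0 -> peak_lr
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:           # stable: hold at peak_lr
        return peak_lr
    decay_progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * decay_progress  # decay: peak_lr -> min_lr


# The rate stays constant for most of training and only drops near the end,
# which is why WSD-style runs can branch off short decay phases to obtain
# evaluable checkpoints while the main run continues in the stable phase.
for s in (0, 500, 1_000, 50_000, 85_000, 91_000):
    print(s, round(wsd_learning_rate(s), 6))
```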


AI: post transformers, by mcgrof