AI: post transformers

By mcgrof

The transformer architecture revolutionized the world of neural networks. It was a springboard for what we know today as modern artificial intelligence. This podcast focuses on modern state of the art...

· Technology

FAQs about AI: post transformers:

How many episodes does AI: post transformers have?

The podcast currently has 340 episodes available.

AI: post transformers episodes:

August 07, 2025 LSTM: the forget gate
This 2000 paper introduces a novel solution to a weakness found in Long Short-Term Memory (LSTM) networks, specifically when processing continuous data streams without predefined segmentation. The core problem addressed is the unbounded growth of internal cell states within standard LSTM networks, which can lead to performance degradation. The authors propose and implement "forget gates", an adaptive mechanism that allows LSTM cells to learn when to reset their internal memory at appropriate times, thus managing resources effectively. Through experiments with complex, continual versions of benchmark problems, the paper demonstrates that LSTMs equipped with these forget gates successfully overcome limitations faced by standard LSTMs and other recurrent neural networks. Ultimately, the work highlights the importance of adaptive forgetting for neural networks dealing with ongoing, unsegmented input.
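As a rough illustration of the forget gate described above, here is a minimal sketch of one step of a scalar LSTM cell. It uses the now-common textbook formulation rather than the paper's exact equations, and the function name `lstm_cell_step` and the weight layout are assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_cell_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell with a forget gate.

    `w` maps a gate name to (input weight, recurrent weight, bias).
    The layout and names are illustrative, not taken from the paper.
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))   # how much of the old cell state to keep
    i = sigmoid(pre("input"))    # how much of the new candidate to write
    o = sigmoid(pre("output"))   # how much of the cell state to expose
    g = math.tanh(pre("cand"))   # candidate value

    # Without a forget gate this would be c_prev + i * g, which can grow
    # without bound on a continual input stream.
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c, f
```

With a strongly negative forget bias, the gate closes and the old cell state is discarded, which is the "reset" behaviour the episode describes.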
14min
August 07, 2025 GPT4 Technical Report
This 2023 paper, the GPT-4 Technical Report from OpenAI, introduces GPT-4, a multimodal AI model that accepts image and text inputs and produces text outputs, demonstrating human-level performance on various professional and academic benchmarks, such as the bar exam. The report highlights the predictable scaling of the model's performance and its improved factual accuracy and adherence to desired behaviors, achieved through post-training alignment processes like Reinforcement Learning from Human Feedback (RLHF). It also addresses the model's limitations, including "hallucinations" and potential for misuse, detailing the safety evaluations and mitigation strategies implemented to reduce harmful outputs.
32min
August 07, 2025 GPT3
This 2020 paper outlines the development and evaluation of GPT-3, a large language model, exploring its performance across various natural language processing tasks under zero-shot, one-shot, and few-shot learning conditions, which involve providing minimal to no task-specific examples during inference. It details the model's architecture, training methodology, including its use of a massive dataset, and analyzes its limitations and broader impacts, such as the potential for misuse and the presence of biases related to gender, race, and religion inherited from its training data. The document also discusses the challenges of data contamination and the computational resources required for training such a large model.
28min
August 07, 2025 GPT2
This 2019 paper, "Language Models are Unsupervised Multitask Learners," introduces GPT-2, a large language model designed for zero-shot learning, meaning it can perform tasks without explicit, task-specific training. The research highlights the model's ability to learn various natural language processing (NLP) tasks, such as question answering, summarization, and translation, by being trained on a diverse and extensive dataset called WebText, composed of millions of high-quality webpages. The paper demonstrates that increasing the model's capacity significantly improves performance across these tasks, often achieving state-of-the-art results in a zero-shot setting. While showing promising results, the authors acknowledge that GPT-2's practical applications are still developing, particularly in areas like summarization and translation, where performance remains rudimentary compared to human benchmarks.
15min
August 07, 2025 GELU
This paper, first posted in 2016 and revised in 2023, introduces Gaussian Error Linear Units (GELUs), a novel activation function for neural networks that outperforms traditional activations like Rectified Linear Units (ReLUs) and Exponential Linear Units (ELUs) across various tasks. GELUs weight each input by its value, scaling it by the standard Gaussian cumulative distribution function, which gives a probabilistic interpretation unlike the sign-based gating of ReLUs. Empirical evaluations demonstrate consistent performance improvements in computer vision, natural language processing, and speech recognition tasks. The paper also discusses the historical context and challenges of credit assignment for a related activation function, the Sigmoid Linear Unit (SiLU), which was independently rediscovered and mislabeled as "swish" by other research groups. Ultimately, GELUs have gained prominence as a default activation in advanced Transformer models, indicating their significant impact on deep learning.
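To make the weighting concrete, here is a small sketch of the exact GELU, x · Φ(x), next to ReLU. It uses only the Python standard library; the function names are chosen for this example.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    """ReLU for comparison: gates purely on the sign of x."""
    return max(0.0, x)
```

Unlike ReLU, GELU lets small negative inputs through with a reduced weight instead of zeroing them, and it approaches the identity for large positive inputs.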
18min
August 07, 2025 Dropout
This 2014 journal article introduces "Dropout", a novel technique designed to combat overfitting in deep neural networks, which are powerful but prone to memorizing training data. The core concept involves randomly deactivating a subset of neurons and their connections during the training phase, which prevents hidden units from overly relying on each other. This process effectively trains an exponential number of "thinned" networks, improving the model's robustness and generalization to new data. The authors demonstrate that dropout significantly enhances performance across diverse applications, including image recognition, speech processing, and document classification, often achieving state-of-the-art results by producing more meaningful and sparse features. The paper also compares dropout to other regularization methods, explores its impact on network behavior, and discusses its extension to Restricted Boltzmann Machines, highlighting its general applicability as a method for model averaging.

Source:
https://arxiv.org/pdf/1207.0580
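
The random deactivation described above can be sketched in a few lines. Note one assumption: this is the "inverted" variant, which rescales the surviving activations during training, whereas the paper's formulation scales the weights at test time instead. The function name and signature are made up for this example.

```python
import random


def dropout(values, drop_prob, training=True, rng=None):
    """Inverted dropout on a list of activations (illustrative sketch).

    Each unit is zeroed with probability `drop_prob`; survivors are
    divided by the keep probability so the expected value is unchanged.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # At inference time every unit is kept and nothing is rescaled.
        return list(values)
    rng = rng or random.Random()
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]
```

Each call with `training=True` samples a different mask, which is what the description means by training many "thinned" networks that share weights.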
28min
August 07, 2025 ResNets - residual block
What ResNet introduced is adding the input of a block directly to its output, like this:

Output = 𝐹(𝑥) + 𝑥

This academic paper introduces Deep Residual Learning, a novel framework designed to facilitate the training of exceptionally deep neural networks for image recognition. The core innovation lies in reformulating layers to learn residual functions, meaning they learn the difference from the input rather than an entirely new function. This approach effectively addresses the degradation problem, where increasing network depth paradoxically leads to higher training error, allowing for the creation of networks up to 152 layers deep, significantly outperforming shallower models. The authors demonstrate the efficacy of their Residual Networks (ResNets) across various image recognition tasks, securing first place in multiple ILSVRC and COCO 2015 competitions for classification, detection, and localization, proving the generalizability and power of their method.
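
A toy version of the shortcut connection, as a sketch only: the paper's blocks use convolutions and batch normalization, while this example assumes two small dense layers without biases so it can run with plain Python lists. All names here are illustrative.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def residual_block(x, W1, W2):
    """y = relu(F(x) + x), with F(x) = W2 @ relu(W1 @ x)."""
    fx = matvec(W2, relu(matvec(W1, x)))        # the residual function F(x)
    return relu([f + a for f, a in zip(fx, x)])  # add the input back in
```

If the weights are all zero, F(x) is zero and the block passes a non-negative input straight through, which is why stacking more such blocks should not make the training error worse.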
13min
August 07, 2025 BERT
Review of the 2018 paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Google, which builds on the transformer architecture.

This paper introduces BERT (Bidirectional Encoder Representations from Transformers), a novel language representation model designed for pre-training deep bidirectional representations from unlabeled text. Unlike prior models that process text unidirectionally, BERT conditions on both left and right context in all layers, enabling it to achieve state-of-the-art results across eleven natural language processing (NLP) tasks, including question answering and language inference. The model utilizes two primary pre-training tasks: Masked LM for bidirectional learning and Next Sentence Prediction to understand sentence relationships. The authors demonstrate that this bidirectional approach, coupled with fine-tuning the pre-trained model for specific tasks, significantly outperforms previous methods, even with minimal task-specific architectural modifications.
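
The Masked LM task can be sketched as a token-corruption step. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random token, and the unchanged token follow the BERT paper; the function name, the string-token representation, and the use of `None` for unselected positions are assumptions for this example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for a masked-LM training example.

    labels[i] holds the original token where a prediction is required
    and None elsewhere.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels
```

Because the model must predict the original token from the words on both sides of the gap, it cannot rely on left-to-right context alone.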
18min
August 07, 2025 BART
Review of the 2019 paper with the punning title "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension" by the folks at Facebook.

This research introduces BART, a novel denoising autoencoder designed for pre-training sequence-to-sequence models, which proves effective for various natural language processing tasks, including generation, translation, and comprehension. BART distinguishes itself by corrupting text with arbitrary noising functions and learning to reconstruct the original input, combining elements of existing models like BERT and GPT. The study evaluates different noising strategies, finding that random sentence shuffling and a text-infilling scheme yield the best performance. Results indicate BART performs comparably to state-of-the-art models on classification tasks while achieving new benchmarks in text generation for abstractive dialogue, question answering, and summarization. Furthermore, the paper demonstrates BART's utility in enhancing machine translation decoders, offering a flexible and robust pre-training framework.
16min
August 07, 2025 Attention is all you need
Review of the seminal 2017 paper Attention is all you need.

These papers introduce and analyze the Transformer architecture, a dominant model in natural language processing that relies entirely on multi-head attention instead of recurrent or convolutional networks. The first paper, "Attention Is All You Need," introduces the Transformer, showcasing its superior performance and training efficiency in machine translation and constituency parsing. The two follow-up papers, "Analyzing Multi-Head Self-Attention" and "Are Sixteen Heads Really Better than One?", investigate the importance and interpretability of individual attention heads within the Transformer. Both studies surprisingly conclude that a significant portion of attention heads can be removed without substantially impacting performance, revealing that many heads are redundant and that specialized heads perform the most critical functions, particularly in encoder-decoder attention.
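
For reference, here is a minimal sketch of the scaled dot-product attention that each head computes, softmax(QKᵀ / √d_k) V. It covers a single head with no learned projections or masking, and it uses plain Python lists; multi-head attention runs several of these in parallel on projected inputs. The function names are chosen for this example.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

A query that matches one key much better than the others returns nearly that key's value; a query that matches all keys equally returns the plain average of the values.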
20min
