
This research explores how the architecture of pre-trained language models influences their base capabilities, focusing specifically on the FFN-Wider Transformer architecture. The study identifies a key factor in model performance: the contribution ratio of the Multi-Head Attention (MHA) layer, which acts as a combination function reflecting the model's ability to combine linguistic features. The authors demonstrate that FFN-Wider Transformers reduce the contribution ratio of this combination function, leading to a decline in base capabilities. To address this, they propose a Combination Enhanced Architecture (CEA) that redistributes the wider FFN layer, enhancing the combination function and ultimately improving base capabilities. The effectiveness of CEA is further validated by its successful application to Mixture of Experts (MoE) Transformers, highlighting its potential for improving a broader range of architectures.
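To make the idea of a "contribution ratio" concrete, here is a minimal, illustrative sketch in PyTorch. It is not the paper's exact formulation: the toy layer, the FFN width multipliers, and the norm-based ratio (the share of the layer's residual updates contributed by the MHA sub-layer versus the FFN sub-layer) are all assumptions made for illustration; the names ToyTransformerLayer and mha_contribution_ratio are hypothetical.

```python
# Illustrative sketch: compare how much of a Transformer layer's residual
# update comes from the MHA (combination) sub-layer versus the FFN sub-layer.
# This norm-based proxy and the layer design are assumptions, not the paper's
# exact "contribution ratio" definition.
import torch
import torch.nn as nn


class ToyTransformerLayer(nn.Module):
    def __init__(self, d_model=64, n_heads=4, ffn_mult=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_mult * d_model),
            nn.GELU(),
            nn.Linear(ffn_mult * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Pre-norm residual sub-layers; also return the two residual updates.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        ffn_out = self.ffn(self.ln2(x))
        x = x + ffn_out
        return x, attn_out, ffn_out


def mha_contribution_ratio(layer, x):
    # Proxy metric: fraction of the total residual-update norm that the MHA
    # sub-layer contributes, relative to MHA plus FFN.
    _, attn_out, ffn_out = layer(x)
    a, f = attn_out.norm(), ffn_out.norm()
    return (a / (a + f)).item()


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(2, 16, 64)  # (batch, seq_len, d_model)
    for mult in (4, 16):  # a standard FFN width vs. a "wider" FFN
        ratio = mha_contribution_ratio(ToyTransformerLayer(ffn_mult=mult), x)
        print(f"FFN multiplier {mult:2d}: MHA contribution ratio ~ {ratio:.2f}")
```

In this framing, the paper's argument corresponds to the MHA share shrinking as the FFN is made wider, and CEA can be read as a way of reallocating that extra FFN width so the combination function regains influence.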