Papers Read on AI

Self-attention Does Not Need O(n²) Memory



We provide a practical implementation for accelerators that requires O(√n) memory, is numerically stable, and is within a few percent of the runtime of the standard implementation of attention. We also demonstrate how to differentiate the function while remaining memory-efficient.
2021: Markus N. Rabe, Charles Staats
https://arxiv.org/pdf/2112.05682v2.pdf
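The abstract refers to computing attention over the keys and values in chunks, with the softmax accumulated incrementally and rescaled by a running maximum for numerical stability. Below is a minimal JAX sketch of that idea, not the authors' implementation; the function name, default chunk size, and the assumption that the key/value length divides evenly by the chunk size are illustrative.

# A minimal sketch (not the authors' code) of chunked, numerically stable attention.
import jax
import jax.numpy as jnp

def chunked_attention(query, key, value, key_chunk_size=128):
    # query: [q_len, d]; key, value: [kv_len, d].
    # Keys/values are processed in chunks, so only a [q_len, key_chunk_size]
    # score block is held in memory instead of the full [q_len, kv_len] matrix.
    # A chunk size on the order of sqrt(kv_len) gives the O(sqrt(n)) memory
    # the abstract mentions. Assumes kv_len is divisible by key_chunk_size.
    q_len, d = query.shape
    scale = 1.0 / jnp.sqrt(d)

    def scan_chunk(carry, chunk):
        acc, row_sum, row_max = carry            # running numerator, denominator, max
        k_chunk, v_chunk = chunk
        scores = (query @ k_chunk.T) * scale     # [q_len, key_chunk_size]
        chunk_max = scores.max(axis=-1, keepdims=True)
        new_max = jnp.maximum(row_max, chunk_max)
        # Rescale previously accumulated sums to the new running max so the
        # exponentials stay numerically stable.
        correction = jnp.exp(row_max - new_max)
        p = jnp.exp(scores - new_max)
        acc = acc * correction + p @ v_chunk
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        return (acc, row_sum, new_max), None

    init = (jnp.zeros((q_len, d)),
            jnp.zeros((q_len, 1)),
            jnp.full((q_len, 1), -jnp.inf))
    k_chunks = key.reshape(-1, key_chunk_size, d)
    v_chunks = value.reshape(-1, key_chunk_size, d)
    (acc, row_sum, _), _ = jax.lax.scan(scan_chunk, init, (k_chunks, v_chunks))
    return acc / row_sum

# Example: the chunked result should match standard attention up to precision.
q = jax.random.normal(jax.random.PRNGKey(0), (256, 64))
k = jax.random.normal(jax.random.PRNGKey(1), (1024, 64))
v = jax.random.normal(jax.random.PRNGKey(2), (1024, 64))
reference = jax.nn.softmax((q @ k.T) / jnp.sqrt(64.0), axis=-1) @ v
print(jnp.max(jnp.abs(chunked_attention(q, k, v) - reference)))

For the memory-efficient backward pass the abstract mentions, the paper relies on gradient checkpointing (e.g. jax.checkpoint) so that chunk-level score matrices are recomputed during backpropagation rather than stored; that detail is omitted from this sketch.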

Papers Read on AI, by Rob

3.7 (3 ratings)