In this issue:
Sometimes it takes more than one shot
Infini-attention leaving no context behind
Playing Mergers & Acquisitions with LLMs
Voiceflow is a low-code platform for building AI agents that go beyond basic chatbots.
Give it a try yourself by signing up for a free Voiceflow account.
1. Many-Shot In-Context Learning
Watching: MSICL (paper)
What problem does it solve? In-context learning (ICL) has been a major focus of research in the Large Language Model (LLM) space. The ability to learn from just a few examples without any weight updates is quite remarkable and has a lot of potential for practical applications. However, ICL performance has been limited by the small number of examples that can fit into the context window. With the recent expansion of context windows, we can now investigate ICL with hundreds or even thousands of examples - the many-shot regime.
How does it solve the problem? The researchers explore two new settings to address the limitation of available human-generated examples in many-shot ICL: Reinforced ICL and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples, while Unsupervised ICL removes rationales from the prompt altogether and only prompts the model with domain-specific questions. Both approaches prove to be effective in the many-shot regime, particularly for complex reasoning tasks. This opens up new possibilities for leveraging the power of many-shot ICL even when human-generated examples are scarce.
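To make the two settings concrete, here is a minimal sketch of how one might construct the prompts. The `generate` and `is_correct` functions are placeholders (assumptions, not from the paper): plug in whatever LLM call and answer checker you use.

```python
# Sketch of Reinforced ICL vs. Unsupervised ICL prompt construction.
# `generate` and `is_correct` are hypothetical placeholders.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM call here")

def is_correct(rationale: str, gold_answer: str) -> bool:
    # Naive check: does the sampled rationale end with the gold answer?
    return rationale.strip().endswith(gold_answer)

def build_reinforced_icl_prompt(questions, gold_answers, test_question, n_samples=4):
    """Reinforced ICL: replace human-written rationales with model-generated
    chain-of-thought rationales, keeping only those that reach the right answer."""
    shots = []
    for question, gold in zip(questions, gold_answers):
        for _ in range(n_samples):
            rationale = generate(f"Q: {question}\nThink step by step, then answer.\nA:")
            if is_correct(rationale, gold):
                shots.append(f"Q: {question}\nA: {rationale}")
                break  # keep one verified rationale per problem
    return "\n\n".join(shots + [f"Q: {test_question}\nA:"])

def build_unsupervised_icl_prompt(questions, test_question):
    """Unsupervised ICL: no rationales or answers at all, only
    domain-specific questions followed by the test question."""
    shots = [f"Q: {question}" for question in questions]
    return "\n\n".join(shots + [f"Q: {test_question}\nA:"])
```

With long context windows, `questions` can contain hundreds of items - the filtering step in Reinforced ICL is what keeps the many model-generated rationales from polluting the prompt with wrong reasoning.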
What's next? The findings have significant implications for the future of ICL and LLMs: many-shot learning can override pretraining biases and learn high-dimensional functions with numerical inputs, capabilities that few-shot prompting has struggled with. However, the authors also find that next-token prediction loss is an unreliable indicator of downstream ICL performance, which highlights the need for further research in this area.
2. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Watching: Infini-attention (paper)
What problem does it solve? While the attention mechanism in Transformers has proven to be powerful, it still struggles with very long input sequences. The computational and memory requirements of attention grow quadratically with the input length, making it infeasible to process extremely long sequences efficiently. This limitation hinders the application of Large Language Models (LLMs) to tasks that require processing and understanding of long-form content, such as books or extensive documents.
How does it solve the problem? Infini-attention addresses the scalability issue by incorporating a compressive memory into the standard attention mechanism. It combines masked local attention and long-term linear attention within a single Transformer block. The masked local attention captures short-term dependencies, while the long-term linear attention allows for efficient processing of longer-range dependencies. By compressing the memory, Infini-attention reduces the computational and memory overhead, enabling LLMs to handle infinitely long input sequences with bounded resources.
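Below is a toy, single-head reading of that mechanism (my interpretation of the paper, not the authors' code; the delta-rule memory variant is omitted and `beta` stands in for the learned gate):

```python
# Toy single-head Infini-attention segment: masked local attention plus
# linear-attention retrieval from a compressive memory of past segments.
import torch
import torch.nn.functional as F

def elu_plus_one(x):
    return F.elu(x) + 1.0  # non-negative feature map used for linear attention

def infini_attention_segment(q, k, v, memory, z, beta):
    """q, k, v: (seg_len, d); memory: (d, d); z: (d,); beta: scalar gate.
    Returns the segment output plus the updated memory state."""
    seg_len, d = q.shape

    # 1) Masked local attention within the segment (short-term dependencies)
    scores = (q @ k.T) / d**0.5
    mask = torch.triu(torch.ones(seg_len, seg_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    a_local = F.softmax(scores, dim=-1) @ v

    # 2) Linear-attention retrieval from the compressive memory (long-term)
    sigma_q = elu_plus_one(q)
    a_mem = (sigma_q @ memory) / (sigma_q @ z).clamp(min=1e-6).unsqueeze(-1)

    # 3) Fold this segment's keys/values into the bounded memory state
    sigma_k = elu_plus_one(k)
    memory = memory + sigma_k.T @ v
    z = z + sigma_k.sum(dim=0)

    # 4) Gate between long-term retrieval and local attention
    gate = torch.sigmoid(beta)
    return gate * a_mem + (1 - gate) * a_local, memory, z
```

The key point is that `memory` and `z` have fixed size regardless of how many segments have been processed, which is what bounds memory and compute for arbitrarily long inputs.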
What's next? The introduction of Infini-attention opens up new possibilities for applying LLMs to tasks involving extremely long sequences. The researchers demonstrate its effectiveness on long-context language modeling benchmarks, passkey retrieval from a 1M-token context, and book summarization with 500K-token inputs. As the demand for processing and understanding long-form content continues to grow, techniques like Infini-attention will become increasingly important.
3. Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs
What problem does it solve? The context window of Large Language Models (LLMs) is one of the main bottlenecks for a lot of applications. Most current models are limited to a few thousand tokens, which might sound like a lot but can be exhausted quite quickly when you're trying to have a conversation about a specific topic or work with larger documents. There have been different approaches to increasing the context window, with varying degrees of success.
How does it solve the problem? HOMER takes a hierarchical approach to processing longer sequences of text. The input gets divided into smaller chunks that fit into the native context window of the model. Each chunk then gets processed individually before getting merged back together. The merging happens across the different layers of the Transformer, with a dedicated token reduction step to keep everything memory-efficient. Probably the most interesting part is that HOMER is completely training-free and can therefore be applied to any pre-trained model.
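The overall control flow looks roughly like the schematic below. This is only a sketch of the idea as summarized above: `encode_layer` and `token_scores` are hypothetical placeholders for a real Transformer layer forward pass and for the importance scores HOMER derives to decide which tokens to drop.

```python
# Schematic of hierarchical chunk merging with token reduction.
# `encode_layer` and `token_scores` are placeholders, not HOMER's actual code.

def encode_layer(tokens: list, layer: int) -> list:
    raise NotImplementedError("one Transformer layer forward pass goes here")

def token_scores(tokens: list) -> list:
    raise NotImplementedError("per-token importance, e.g. from attention weights")

def reduce_tokens(tokens, keep_ratio=0.5):
    """Drop the lowest-scoring tokens so merged chunks stay memory-efficient."""
    scores = token_scores(tokens)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = sorted(sorted(range(len(tokens)), key=lambda i: -scores[i])[:k])
    return [tokens[i] for i in keep]

def hierarchical_merge_forward(chunks, n_layers, merge_layers):
    """Process chunks independently; at designated layers, merge adjacent
    chunks (with token reduction) until a single sequence remains."""
    states = list(chunks)  # one list of token states per chunk
    for layer in range(n_layers):
        states = [encode_layer(tokens, layer) for tokens in states]
        if layer in merge_layers and len(states) > 1:
            merged = []
            for i in range(0, len(states), 2):
                pair = states[i] + (states[i + 1] if i + 1 < len(states) else [])
                merged.append(reduce_tokens(pair))
            states = merged
    return states[0]
```

Because the merging and reduction operate on hidden states of an existing model, no fine-tuning is required - which is what makes the approach drop-in for pre-trained LLMs.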
What's next? Techniques for increasing the context window of LLMs are becoming more important as we're trying to push these models to more complex applications. I wouldn't be surprised if we saw more of these hierarchical approaches in the future, as they seem to offer a good balance between performance and efficiency. It will be interesting to see how HOMER compares to other methods in terms of both performance and ease of use.
Papers of the Week:
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
CoTAR: Chain-of-Thought Attribution Reasoning with Multi-level Granularity
The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey
On the Empirical Complexity of Reasoning and Planning in LLMs
Pack of LLMs: Model Fusion at Test-Time via Perplexity Optimization