Foreword:
Welcome everyone!
As we’re entering the 24th week of LLM Watch, it’s finally time to announce the next phase of my Substack. If you’re reading this on LinkedIn, you might want to consider subscribing on Substack for additional content. As for this newsletter: nothing will change. It will still be published weekly on both platforms.
The goal I had in mind when I started LLM Watch was to regularly assess all LLM papers coming out in order to reduce the gigantic information overload we’ve been experiencing ever since 2023. But this newsletter was only supposed to be the first step in doing so - and now that it’s standing on solid ground, I want to expand my efforts.
Over the coming weeks, I will introduce the next steps. This will include turning on paid subscriptions (as of today). While there will also be something additional for my free readers, the focus of this phase will be on paid content. I’m planning to invest a large portion of my time in creating these resources, and they will consist of information that isn’t readily available elsewhere on the internet.
Roughly speaking, these are supposed to be actionable pieces of information: something you can base decisions on without having to research everything yourself. You can think of them as exclusive research reports delivered directly to your inbox. My current plan is to have two parts, one focused on research and one focused on application, published either monthly for both or on a bi-weekly rotation.
The first report will be released in roughly two weeks, and it will be free because I want to give everyone who’s interested in subscribing the chance to see exactly what they’ll be signing up for. If, for whatever reason, you wish to subscribe or upgrade your subscription already, then I couldn’t be happier - and I certainly won’t hold you back.
Thank you all for the support so far,
Pascal
In this issue:
Raising hardware-awareness
Sharing (weights) is caring
Less SQL is more
Want to market your brand? I’ve personally been using passionfroot since its launch and have found several partners on their platform. They make it easy for companies to find creators that fit their brand, and their streamlined collaboration process makes working together more efficient and more enjoyable for both sides.
1. Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
Watching: Sequoia (paper)
What problem does it solve? With the growing adoption of LLMs across various sectors, fast and efficient model inference has become a critical need. Current methods for accelerating inference - especially speculative decoding, which drafts multiple candidate continuations to speed up generation - fail to make effective use of larger speculation budgets, and their performance varies across different hardware and hyperparameter settings. The challenge lies in enhancing the speed of these models without compromising output quality too significantly.
How does it solve the problem? Sequoia addresses the need for efficient speculative decoding with three key innovations. First, it uses a dynamic programming approach to optimally organize speculated tokens into a tree, making better use of the speculation budget. Second, Sequoia employs a new method for sampling and verifying speculated tokens that remains robust across different decoding parameters. Third, the algorithm includes a hardware-aware optimization that adjusts the size and depth of the speculated token tree to the specific hardware it’s running on, making it highly adaptable and maximizing performance.
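To make the hardware-aware part more concrete, here is a minimal sketch of the underlying idea: time how long the target model takes to verify token trees of different sizes on the actual hardware, combine that with an estimate of how many tokens each configuration is expected to yield, and keep the configuration with the best expected tokens per second. The function names, the toy acceptance model, and the assumption of a Hugging Face-style causal LM on a CUDA device are my own illustration, not Sequoia’s implementation.

```python
import time
import torch

@torch.inference_mode()
def measure_verify_latency(model, input_ids, tree_size, n_trials=5):
    """Time one forward pass that verifies `tree_size` speculated tokens at once."""
    dummy = torch.randint(0, model.config.vocab_size, (1, tree_size), device=input_ids.device)
    batch = torch.cat([input_ids, dummy], dim=1)
    model(batch)                                  # warm-up pass
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_trials):
        model(batch)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_trials

def expected_accepted_tokens(tree_size, depth, per_token_accept=0.7):
    """Toy acceptance model: longer branches accept more tokens, plus one bonus token."""
    return min(depth, tree_size) * per_token_accept + 1

def pick_tree_config(model, input_ids, candidates):
    """Choose the (tree_size, depth) pair with the best expected tokens per second."""
    best_config, best_rate = None, 0.0
    for tree_size, depth in candidates:
        latency = measure_verify_latency(model, input_ids, tree_size)
        rate = expected_accepted_tokens(tree_size, depth) / latency
        if rate > best_rate:
            best_config, best_rate = (tree_size, depth), rate
    return best_config

# e.g. pick_tree_config(model, prompt_ids, candidates=[(16, 4), (32, 6), (64, 8)])
```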
What's next? The three-fold improvement Sequoia offers - scalability, robustness, and hardware-awareness - sets the stage for its wide adoption in practical LLM applications. Considering the impressive speed gains reported across various LLM scales and hardware platforms, further development and refinement of Sequoia could lead to it becoming a standard component in Large Language Model inference engines.
2. Head-wise Shareable Attention for Large Language Models
Watching: DirectShare (paper)
What problem does it solve? The scalability of Large Language Models (LLMs) is a pressing issue, as their extensive parameter count limits their deployment, especially on devices with constrained resources. This challenge has driven research towards weight sharing, a technique that can mitigate memory demands by reusing model parameters. Existing weight sharing methods are more suited to smaller models and typically employ a blunt approach like layer-wise sharing, which compromises model flexibility. To address this limitation, the research introduces refined weight-sharing techniques that operate at the granularity of attention heads within LLMs, thus offering a potential solution for deploying these models on edge devices without excessively sacrificing performance.
How does it solve the problem? The paper proposes two novel memory-efficient weight sharing methods that operate at a finer granularity than existing approaches, specifically sharing across attention heads - a new idea in the realm of LLMs. The first, DirectShare, repurposes pre-trained weights directly without additional training, making it quick and simple to apply. The second, PostShare, adds a post-training phase that enforces similarity constraints on weight matrices before they are shared. Selecting individual weight matrices for sharing in this way allows the trade-off between model size and performance to be balanced more finely than with previous coarse-grained methods.
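To give a feel for what "head-wise" sharing means in practice, the sketch below slices an attention projection matrix into per-head blocks, finds the most similar head pairs by cosine similarity, and lets one head of each pair reuse the other's weights. The similarity criterion, the greedy selection rule, and the share ratio are simplified stand-ins of my own, not DirectShare's actual matching procedure.

```python
import torch
import torch.nn.functional as F

def per_head_slices(proj_weight, n_heads):
    """Split a (hidden, hidden) projection weight into per-head (head_dim, hidden) blocks."""
    head_dim = proj_weight.shape[0] // n_heads
    return proj_weight.view(n_heads, head_dim, proj_weight.shape[1])

def most_similar_pairs(heads, share_ratio=0.25):
    """Greedily pick the most similar, non-overlapping head pairs (keep, replace)."""
    flat = F.normalize(heads.flatten(1), dim=1)   # one row per head
    sim = flat @ flat.T                           # pairwise cosine similarity
    sim.fill_diagonal_(-1.0)                      # ignore self-similarity
    n_share = int(share_ratio * heads.shape[0])
    pairs, used = [], set()
    for idx in torch.argsort(sim.flatten(), descending=True):
        if len(pairs) >= n_share:
            break
        i, j = divmod(idx.item(), sim.shape[0])
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs

@torch.no_grad()
def share_heads(proj_weight, n_heads, share_ratio=0.25):
    """Overwrite selected heads with their most similar partner's weights (in place)."""
    heads = per_head_slices(proj_weight, n_heads)  # views into proj_weight's storage
    for keep, replace in most_similar_pairs(heads, share_ratio):
        # In a real deployment the two heads would be tied so only one copy is stored;
        # here the values are simply copied to illustrate the matching step.
        heads[replace].copy_(heads[keep])
    return proj_weight
```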
What’s next? The study has opened the door to further refinements in parameter sharing strategies, which seem promising in maintaining a high level of performance while significantly reducing model size. The next steps would likely revolve around validating these techniques across a broader range of tasks and LLM architectures. Moreover, exploring how these head-wise sharing methods can be combined with other model compression or optimization techniques might lead to even more efficient deployment strategies for LLMs.
3. OpenTab: Advancing Large Language Models as Open-domain Table Reasoners
Watching: OpenTab (paper)
What problem does it solve? While LLMs are adept at a range of language tasks, they struggle when tasked with handling information beyond their training data, particularly with structured table data. Such data often comes in diverse forms and can be quite extensive, making standard text-oriented retrieval methods inadequate. The challenge lies in not only fetching relevant tables but also parsing and reasoning with the data they contain to offer precise responses, a task that is essential for a variety of applications, from financial analysis to inventory management.
How does it solve the problem? OpenTab directly addresses these issues by incorporating a specialized table retriever that fetches pertinent tables from a given corpus. It then generates SQL programs that efficiently parse the retrieved tables, and the resulting data is passed through grounded inference to produce accurate responses anchored in the structured data. This approach marries the text-processing power of LLMs with the precision of SQL querying, providing a way to interpret and answer queries over structured datasets that feels more natural for end users.
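The pipeline is easier to picture in code. Below is a deliberately small sketch of the retrieve-then-SQL-then-answer flow: score candidate tables by keyword overlap with the question, ask an LLM to write a SQLite query over the chosen table's schema, execute it, and have the LLM answer using only the returned rows. The `generate` callable and the keyword retriever are placeholders of my own; OpenTab's actual retriever, prompting, and verification steps are more involved.

```python
import sqlite3

def retrieve_table(question, tables):
    """Pick the table whose name and column names overlap most with the question."""
    words = set(question.lower().split())
    def score(table):
        tokens = (table["name"] + " " + " ".join(table["columns"])).lower().split()
        return sum(tok in words for tok in tokens)
    return max(tables, key=score)

def answer(question, tables, conn, generate):
    """Retrieve a table, generate SQL over its schema, execute it, and ground the answer."""
    table = retrieve_table(question, tables)
    schema = f"{table['name']}({', '.join(table['columns'])})"
    sql = generate(
        f"Write a single SQLite query over {schema} that answers: {question}\nSQL:"
    )
    rows = conn.execute(sql).fetchall()   # run the generated program on the database
    return generate(                      # grounded inference: answer only from the rows
        f"Question: {question}\nQuery result rows: {rows}\nAnswer using only these rows:"
    )

# Usage sketch (assuming an sqlite3 connection and any LLM completion function):
# conn = sqlite3.connect("warehouse.db")
# tables = [{"name": "inventory", "columns": ["item", "quantity", "warehouse"]}]
# answer("How many items are in warehouse A?", tables, conn, generate=my_llm_call)
```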
What’s next? The significant performance leap suggested by OpenTab indicates that the industry could see broader applications of LLMs in scenarios involving structured data. Looking forward, we might see more sophisticated versions of OpenTab, possibly integrating natural language feedback loops to refine search criteria or to tailor SQL queries in real-time.
Papers of the Week:
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
LLMs Meet Long Video: Advancing Long Video Comprehension with An Interactive Visual Adapter in LLMs
Towards Understanding Counseling Conversations: Domain Knowledge and Large Language Models
OpenTab: Advancing Large Language Models as Open-domain Table Reasoners
Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey