In this issue:
Learn to adapt your RAG, or get left behind
When 1+1 is more than 2
MAGIS wants to take Devin’s job
1. Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity
Watching: Adaptive-RAG (paper)
What problem does it solve? Retrieval-Augmented Language Models (RALMs) have shown great promise in enhancing the accuracy of Large Language Models (LLMs) on tasks like Question-Answering (QA) by incorporating external knowledge. However, existing approaches often struggle to efficiently handle queries of varying complexity. Simple queries are processed with unnecessary computational overhead, while complex multi-step queries are not adequately addressed. This leads to suboptimal performance and resource utilization, as real-world user requests span a range of complexity levels.
How does it solve the problem? The proposed adaptive QA framework dynamically selects the most suitable strategy for retrieval-augmented LLMs based on the complexity of the incoming query. It employs a smaller LM-based classifier trained to predict the complexity level of queries using automatically collected labels derived from the actual predicted outcomes of models and inherent inductive biases in datasets. By seamlessly adapting between iterative and single-step retrieval-augmented LLMs, as well as no-retrieval methods, the framework efficiently handles queries of varying complexity. This approach strikes a balance between computational efficiency and accuracy, ensuring that the most appropriate strategy is applied to each query.
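To make the routing idea concrete, here is a minimal sketch of Adaptive-RAG-style query routing. All names (classify_complexity, retrieve, llm_generate) are hypothetical stand-ins rather than the paper's code, and the real classifier is a small trained LM, not the length heuristic used here for illustration:

```python
# Sketch of Adaptive-RAG-style routing between no-retrieval, single-step,
# and iterative retrieval strategies. Stand-in functions only.

MAX_STEPS = 3

def llm_generate(query: str, context: list[str] | None = None) -> str:
    # Stand-in for an LLM call; a real system would call a model API here.
    return f"answer({query!r}, docs={len(context or [])})"

def retrieve(query: str) -> list[str]:
    # Stand-in for a retriever (BM25, dense retrieval, etc.).
    return [f"retrieved document for: {query}"]

def classify_complexity(query: str) -> str:
    # Stand-in for the small LM classifier. The paper trains it on labels
    # derived from which strategy actually answered each query correctly,
    # plus dataset-level biases (single-hop vs. multi-hop benchmarks).
    words = len(query.split())
    if words < 8:
        return "A"   # simple: answer directly, no retrieval
    if words < 20:
        return "B"   # moderate: one retrieval step
    return "C"       # complex: iterative retrieval and reasoning

def answer(query: str) -> str:
    label = classify_complexity(query)
    if label == "A":
        return llm_generate(query)
    if label == "B":
        return llm_generate(query, context=retrieve(query))
    # "C": interleave retrieval with intermediate reasoning steps
    context, draft = [], query
    for _ in range(MAX_STEPS):
        context += retrieve(draft)
        draft = llm_generate(query, context=context)
    return draft

print(answer("Who directed the film that won Best Picture the year Titanic was released?"))
```

The key design choice is that the expensive iterative loop only runs for queries the classifier flags as complex, so simple lookups stay cheap.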
What's next? The adaptive QA framework demonstrates the potential for intelligent resource allocation in RALMs based on query complexity. Future research could explore more sophisticated methods for query complexity estimation, such as incorporating user feedback or leveraging unsupervised learning techniques. Additionally, the framework could be extended to other NLP tasks beyond QA, such as text summarization or dialogue systems, where adapting to varying input complexity could yield significant improvements in efficiency and performance. As RALMs continue to evolve, developing adaptive strategies that optimize resource utilization while maintaining high accuracy will be crucial for their practical deployment in real-world applications.
2. BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models
Watching: BLADE (paper)
What problem does it solve? While Large Language Models (LLMs) have demonstrated remarkable versatility and performance across a wide range of tasks, they often lack the specialized knowledge required for domain-specific applications, such as those in the legal or medical fields. Adapting these general-purpose LLMs to vertical domains has proven to be challenging, with existing approaches being either cost-prohibitive or unreliable in practical settings.
How does it solve the problem? BLADE (Black-box LArge language models with small Domain-spEcific models) addresses this issue by combining the strengths of a black-box LLM and a small domain-specific LM. The small LM is pre-trained on domain-specific data to capture specialized knowledge and insights, while the general LLM contributes robust language comprehension and reasoning capabilities. The integration of these two models is achieved through a three-step process: pre-training the small LM, fine-tuning it using knowledge instruction data, and jointly optimizing both models using Bayesian optimization.
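Below is a minimal sketch of BLADE's inference-time division of labor, with hypothetical stand-in functions (small_domain_lm, blackbox_llm). The paper's pre-training, knowledge instruction tuning, and Bayesian joint optimization steps are not shown; this only illustrates how specialized knowledge from the small model can be fed to the general LLM:

```python
# Sketch: a small domain LM supplies specialized knowledge, a black-box
# general LLM does the final reasoning. Stand-in functions only.

def small_domain_lm(query: str) -> str:
    # Stand-in for the small, domain-pretrained model: it emits
    # domain-specific background knowledge rather than a final answer.
    return f"Relevant legal background for: {query}"

def blackbox_llm(prompt: str) -> str:
    # Stand-in for the general-purpose black-box LLM accessed via API.
    return f"LLM answer based on a prompt of {len(prompt)} characters"

def blade_answer(query: str) -> str:
    knowledge = small_domain_lm(query)          # specialized domain knowledge
    prompt = (
        "Use the following domain knowledge to answer the question.\n"
        f"Knowledge: {knowledge}\n"
        f"Question: {query}\n"
        "Answer:"
    )
    return blackbox_llm(prompt)                 # general comprehension and reasoning

print(blade_answer("Is a verbal contract enforceable for a real-estate sale?"))
```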
What's next? The promising results of BLADE on public legal and medical benchmarks suggest that this framework could be a cost-effective and efficient solution for adapting general LLMs to various vertical domains. As more specialized applications of LLMs emerge, it will be interesting to see how BLADE and similar approaches evolve to address the unique challenges and requirements of different industries. Furthermore, the integration of domain-specific knowledge into LLMs could lead to the development of more accurate and reliable AI systems for critical applications, such as legal advice or medical diagnosis.
3. MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution
Watching: MAGIS (paper)
What problem does it solve? Resolving GitHub issues is a complex task that requires understanding the context of the repository, the existing codebase, and the specific requirements of the issue. Large Language Models (LLMs) have shown impressive capabilities in code generation and understanding, but they often struggle with making appropriate code changes at the repository level. This is because resolving issues involves not only generating new code but also maintaining the existing functionalities and ensuring compatibility with the rest of the codebase.
How does it solve the problem? To address the challenges of resolving GitHub issues using LLMs, the authors propose MAGIS, a Multi-Agent framework that leverages the collaboration of four specialized agents: Manager, Repository Custodian, Developer, and Quality Assurance Engineer. The Manager agent breaks down the issue into subtasks and assigns them to the appropriate agents. The Repository Custodian agent maintains an understanding of the repository's structure and existing functionalities. The Developer agent generates code changes based on the subtasks, while the Quality Assurance Engineer agent verifies the correctness and compatibility of the generated code. By decomposing the issue resolution process and leveraging the strengths of each agent, MAGIS significantly improves the performance of LLMs in resolving GitHub issues.
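For intuition, here is a minimal sketch of a MAGIS-style agent pipeline. The agent roles follow the paper, but every function body is a hypothetical stand-in; the real framework drives each role with an LLM and operates on an actual Git repository:

```python
# Sketch of a Manager / Repository Custodian / Developer / QA pipeline
# for issue resolution. Stand-in logic only.

from dataclasses import dataclass, field

@dataclass
class Issue:
    title: str
    body: str

@dataclass
class Workspace:
    relevant_files: list[str] = field(default_factory=list)
    diffs: list[str] = field(default_factory=list)

def repository_custodian(issue: Issue) -> list[str]:
    # Locates files likely affected by the issue (stand-in heuristic).
    return ["src/auth/login.py"] if "login" in issue.title.lower() else ["src/app.py"]

def manager(issue: Issue, files: list[str]) -> list[str]:
    # Decomposes the issue into file-level subtasks for the Developer.
    return [f"Update {f} to address: {issue.title}" for f in files]

def developer(task: str) -> str:
    # Produces a candidate code change for one subtask (stand-in diff).
    return f"--- a/file\n+++ b/file\n# change for task: {task}"

def qa_engineer(diff: str) -> bool:
    # Reviews the change for correctness and compatibility (stand-in check).
    return "change for task" in diff

def resolve_issue(issue: Issue) -> Workspace:
    ws = Workspace(relevant_files=repository_custodian(issue))
    for task in manager(issue, ws.relevant_files):
        diff = developer(task)
        if qa_engineer(diff):      # only reviewed, accepted changes land
            ws.diffs.append(diff)
    return ws

result = resolve_issue(Issue("Fix login crash on empty password", "Steps to reproduce..."))
print(result.diffs)
```

The point of the decomposition is that no single LLM call has to hold the whole repository in context: the Custodian narrows the scope, the Manager plans, and the QA gate keeps incompatible changes out.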
What's next? The success of MAGIS in resolving GitHub issues opens up new possibilities for applying LLMs in software evolution tasks. Future research could explore the integration of MAGIS with other software development tools and processes, such as continuous integration and deployment pipelines. Additionally, the multi-agent approach used in MAGIS could be adapted to other domains where LLMs struggle with complex, multi-step tasks that require collaboration and specialized knowledge.
Editor’s note: It’s not clear yet which of the two approaches is more promising, Devin or MAGIS. Unlike Devin, MAGIS achieves its impressive results with access only to the shell, and it does so at roughly 3x the speed. Devin, on the other hand, is clearly the more polished product with a wider range of features. Most importantly, the two systems have been evaluated on different subsets of SWE-bench, so their results are not directly comparable.
Papers of the Week:
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
SoftTiger: A Clinical Foundation Model for Healthcare Workflows
OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
Dual Instruction Tuning with Large Language Models for Mathematical Reasoning