SFR-RAG: How Open AI Can Beat OpenAI
Advancing Contextual Understanding in Large Language Models
Retrieval-Augmented Generation (RAG) has emerged as a critical paradigm for enhancing the capabilities of large language models (LLMs). The recently introduced SFR-RAG model, developed by researchers at Salesforce AI Research, represents a promising direction we’ve seen evolving this year: small(er) models closing the performance gap between proprietary and open AI.
This article explores the key features of SFR-RAG, its performance on various benchmarks, and a step-by-step explanation of how it improves RAG performance.
Key Features of SFR-RAG
SFR-RAG is a 9-billion parameter language model specifically designed to excel in RAG applications. The model's primary goal is to faithfully and comprehensively understand provided context and user questions, avoid hallucination, handle challenging scenarios, perform complex reasoning, and produce reliable citations. Let's break down the key aspects of SFR-RAG and how it achieves these objectives.
Novel Chat Template
Traditional LLMs typically use three roles in their conversational structure: System, User, and Assistant. SFR-RAG expands on this by adding two new roles: Thought and Observation.
This comes with the following benefits:
a) Role Clarification: By introducing separate roles for Thought and Observation, SFR-RAG creates a clearer structure for different types of information. This helps the model distinguish between its internal reasoning process (Thought) and external information (Observation).
b) Easier Masking During Training: The new template allows for more precise control over which parts of the conversation should be included in the training loss. Specifically, System, User, and Observation turns can be masked out, while Thought and Assistant turns are included in the fine-tuning process.
c) Enhanced Security: The separation of roles facilitates better instruction hierarchy enforcement. This makes the model more resistant to potential jailbreaks or malicious instructions injected through User or Observation turns.
d) Improved Developer Control: The new template streamlines the process of building reliable and secure RAG applications. Developers can more easily control which parts of the internal processing to display or hide from end-users.
e) Consistent Function Calling: By designating a specific role (Thought) for internal reasoning and tool use syntax, SFR-RAG avoids the need to parse custom keywords from the Assistant output, leading to more reliable function calling.
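To make the five-role structure concrete, here is a minimal sketch of what a single conversation under this template might look like. The exact special tokens and formatting SFR-RAG uses are not reproduced here; the role names and their purposes follow the description above, and the content strings are invented for illustration.

```python
# Hypothetical sketch of SFR-RAG's five-role conversation structure.
# Role names follow the article; the actual serialization format and
# special tokens used by the model are not shown here.
conversation = [
    {"role": "system",
     "content": "Answer using only the provided context. Cite sources."},
    {"role": "user",
     "content": "Who founded Salesforce?"},
    {"role": "thought",      # internal reasoning / tool-use syntax
     "content": "I should look for the founders in the retrieved passages."},
    {"role": "observation",  # external information returned to the model
     "content": "[Doc 1] Salesforce was founded in 1999 by Marc Benioff."},
    {"role": "assistant",    # the answer shown to the end user
     "content": "Salesforce was founded by Marc Benioff [Doc 1]."},
]

roles = [turn["role"] for turn in conversation]
```

A developer building on such a template could, for example, surface only `assistant` turns to end users while logging `thought` and `observation` turns for debugging.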
Comprehensive Fine-tuning Process
The model underwent an extensive fine-tuning process designed to enhance its contextual understanding and generation abilities. This process focused on several key capabilities:
Extracting Relevant Information: SFR-RAG is trained to efficiently extract pertinent information from long contexts. This is crucial for RAG applications where the model needs to sift through large amounts of retrieved data.
Recognizing Information Gaps: The model is trained to identify when relevant information is lacking in the provided context. This helps prevent hallucination by encouraging the model to abstain from generating responses when it lacks sufficient information.
Handling Conflicting Information: SFR-RAG is equipped to recognize and deal with potentially conflicting information in contextual passages. This is essential for real-world applications where retrieved information may be inconsistent or contradictory.
Resilience to Distractions: The fine-tuning process includes exposure to distracting, counter-intuitive, or out-of-distribution content. This helps the model maintain focus on relevant information even in the presence of noise.
Diverse Instruction Following: By using extensive instruction-following data that mimics real-world retrieval question answering applications, SFR-RAG is trained to handle a wide variety of tasks and query types.
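The masking behavior described earlier (System, User, and Observation turns excluded from the training loss; Thought and Assistant turns included) can be sketched as a per-token loss mask. This is an illustrative approximation, not SFR-RAG's actual training code, and tokenization is faked with whitespace splitting.

```python
# Sketch: build a per-token 0/1 loss mask so that only Thought and
# Assistant turns contribute to the fine-tuning loss, as described above.
# Whitespace "tokenization" is purely for illustration.
TRAINABLE_ROLES = {"thought", "assistant"}

def build_loss_mask(turns):
    """Return one flag per token: 1 = include in loss, 0 = masked out."""
    mask = []
    for role, text in turns:
        flag = 1 if role in TRAINABLE_ROLES else 0
        mask.extend([flag] * len(text.split()))
    return mask

turns = [
    ("system", "Use only the given context."),
    ("user", "When was Salesforce founded?"),
    ("thought", "The context should state the founding year."),
    ("observation", "Salesforce was founded in 1999."),
    ("assistant", "Salesforce was founded in 1999."),
]
mask = build_loss_mask(turns)
```

In a real training pipeline the mask would be applied to the cross-entropy loss over model tokenizer outputs, but the masking logic per role is the same idea.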
Model Performance
To evaluate SFR-RAG's performance, the researchers introduced ContextualBench, a comprehensive evaluation suite comprising seven popular contextual question-answering tasks. This standardized benchmark allows for consistent comparison across different models and studies.
ContextualBench ensures that all models are evaluated under the same instructions and with consistent specification of contextual contents. This creates a level playing field for comparing different models. The benchmark also offers various scoring methods (Exact Match, Easy Match, and F1 score) to account for variations in answer generation styles across different models. It provides multiple setups common in RAG scenarios, including options for retrieving top-k chunks using consistent embedding models or feeding entire available contextual documents directly to the LLM.
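Two of the scoring methods mentioned, Exact Match and token-level F1, can be sketched as follows. Normalization details (casing, punctuation stripping) vary between benchmarks; the simple lowercase/strip variant here is an assumption for illustration.

```python
# Sketch of Exact Match and token-level F1 scoring, two common metrics
# for contextual QA evaluation. Normalization here is simplified.
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())  # shared tokens
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

em = exact_match("Marc Benioff", "marc benioff")        # True
f1 = token_f1("founded by Marc Benioff", "Marc Benioff")
```

F1 rewards partially correct answers that Exact Match would score as zero, which is why benchmarks often report both.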
SFR-RAG-9B achieved state-of-the-art performance in 3 out of 7 benchmarks: TruthfulQA, 2WikiHopQA, and HotpotQA.
It outperformed or was competitive with GPT-4o on all tasks in ContextualBench.
The model showed particularly strong performance on 2WikiHopQA, with nearly a 25% increase compared to GPT-4o.
These results demonstrate SFR-RAG's exceptional ability to understand and utilize contextual information across a diverse range of tasks.
Resilience to Challenging Contexts
One of the most impressive aspects of SFR-RAG is its ability to handle challenging contextual scenarios. The researchers evaluated this using the FaithEval suite, which tests models on three specific situations:
Handling Unanswerable Questions: In the "Unknown" scenario, relevant facts are removed from the context, making the original question unanswerable. SFR-RAG showed a strong ability to recognize when it lacks sufficient information to answer a question, rather than hallucinating a response.
Identifying Conflicting Information: The "Conflict" scenario provides multiple context documents with contradicting information. SFR-RAG demonstrated a superior ability to recognize and handle such conflicts compared to other models.
Adapting to Counterfactual Information: In the "Counterfactual" scenario, commonsense facts are altered by introducing falsely fabricated context. SFR-RAG showed remarkable adaptability, remaining faithful to the provided context even when it contradicted common knowledge.
The model scored particularly well in the Counterfactual setting, where larger models like GPT-4o struggled due to their stronger resistance to factual changes. Even in challenging scenarios where the context may be incomplete, contradictory, or counter to common knowledge, SFR-RAG produced responses faithful to the provided context.
Maintaining General Capabilities
Despite its focus on RAG and contextual applications, SFR-RAG maintains strong performance on general instruction-following tasks and world knowledge benchmarks.
The researchers ensured that SFR-RAG's specialized training for RAG applications did not come at the cost of general language understanding and generation abilities. The model was evaluated on traditional few-shot prompting benchmarks to measure its parametric knowledge and general instruction-following abilities.
On the MMLU benchmark, SFR-RAG-9B achieved a score of 70.15, outperforming larger models like Command-R (35B) and remaining competitive with other state-of-the-art models in its size range. Overall, it maintains strong general language understanding and reasoning capabilities while excelling in RAG-specific tasks.
How SFR-RAG Improves RAG
Now that we've explored the key features and capabilities of SFR-RAG, let's summarize how it improves RAG performance step by step:
Step 1: Enhanced Context Understanding
The novel chat template with Thought and Observation roles allows for clearer separation of contextual information and internal reasoning.
Extensive fine-tuning on diverse contextual tasks improves the model's ability to extract relevant information from long contexts.
Step 2: Reduced Hallucination
Training to recognize information gaps helps the model abstain from generating responses when it lacks sufficient context.
The ability to identify conflicting information in contexts prevents the model from confidently stating incorrect facts.
Step 3: Improved Multi-hop Reasoning
The Thought role in the chat template facilitates more structured internal reasoning, allowing for complex multi-step deductions.
Strong performance on benchmarks like HotpotQA and 2WikiHopQA demonstrates enhanced multi-hop reasoning capabilities.
Step 4: Better Handling of Challenging Contexts
Resilience to counterfactual, conflicting, and incomplete information, as shown in the FaithEval results, makes SFR-RAG more robust in real-world RAG scenarios.
The model's ability to adapt to unexpected or counter-intuitive information in contexts improves its versatility.
Step 5: More Reliable Citations
The clear separation of roles in the chat template, particularly the Observation role, facilitates more accurate tracking of information sources.
This improved source tracking leads to more reliable and accurate citations in generated responses.
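As a hypothetical illustration of why a dedicated Observation role helps with citations: when retrieved passages arrive under known document IDs, an application can verify that every citation marker in a generated answer points at a real source. The `[Doc N]` citation format and the documents below are invented for this sketch.

```python
# Hypothetical sketch: Observation turns carry passages keyed by doc ID,
# so the application can check that cited sources actually exist.
import re

observations = {
    "Doc 1": "Salesforce was founded in 1999 by Marc Benioff.",
    "Doc 2": "SFR-RAG is a 9-billion parameter language model.",
}

answer = "Salesforce was founded in 1999 [Doc 1]."

cited = re.findall(r"\[(Doc \d+)\]", answer)           # extract citation markers
valid = all(doc_id in observations for doc_id in cited)  # all cites resolvable?
```

A check like this can run as a post-processing guard in a RAG pipeline, flagging answers that cite non-existent or uncited sources.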
Step 6: Efficient Parameter Usage
SFR-RAG-9B achieves state-of-the-art performance on several benchmarks with significantly fewer parameters than competing models.
This efficiency allows for faster inference and lower computational requirements in RAG applications.
Step 7: Balanced Capabilities
Maintaining strong performance on general language tasks ensures that SFR-RAG can handle a wide range of queries in RAG systems, not just narrowly specialized tasks.
Step 8: Improved Function Calling
The integration of function calling capabilities allows SFR-RAG to more effectively interact with external tools and APIs in RAG systems.
This enables more dynamic and sophisticated retrieval strategies, potentially improving the quality of retrieved context.
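The Thought-driven tool-use loop described above can be sketched as follows. The JSON tool-call format, the tool name, and the stand-in search function are all assumptions; the point is that tool-call syntax lives in a Thought turn and results come back as an Observation turn, so nothing has to be parsed out of the Assistant's user-facing output.

```python
# Sketch of a Thought -> tool call -> Observation loop. The tool-call
# JSON schema and the "search" tool are invented for illustration.
import json

def search_tool(query: str) -> str:
    # Stand-in for a real retriever or external API call.
    return f"Top result for '{query}': Salesforce was founded in 1999."

def run_tool_call(thought_turn: str) -> str:
    """Parse a tool call emitted in a Thought turn and execute it."""
    call = json.loads(thought_turn)  # e.g. {"tool": "search", "args": {...}}
    if call["tool"] == "search":
        return search_tool(call["args"]["query"])
    raise ValueError(f"unknown tool: {call['tool']}")

thought = '{"tool": "search", "args": {"query": "Salesforce founding year"}}'
observation = run_tool_call(thought)  # fed back as an Observation turn
```

In a multi-hop setting the model could repeat this loop, issuing a new Thought based on each Observation until it has enough information to produce an Assistant answer.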
Step 9: Standardized Evaluation
The introduction of ContextualBench provides a more comprehensive and consistent way to evaluate RAG models, facilitating ongoing improvements in the field.
Conclusion
SFR-RAG represents a welcomed advancement in the field of Retrieval Augmented Generation (RAG). By introducing a novel chat template, employing comprehensive fine-tuning strategies, and demonstrating exceptional performance across a wide range of benchmarks, SFR-RAG sets a new standard for contextual understanding and faithful generation in language models.
The model's ability to handle challenging contexts, perform complex reasoning, and maintain general language capabilities while excelling in RAG-specific tasks makes it a versatile and potentially powerful tool. As RAG systems continue to grow in importance for providing up-to-date and factual AI responses, it’s encouraging that capable models like this one remain relatively small and affordable.
The team plans to fully open-source the model in the near future (“later”, they say). Before that, they will make it available via an API. Can’t wait to try this one out - how about you?
👍 If you enjoyed this article, give it a like and share it with your peers.
And in case you want to continue reading, here are my previous research summaries on Google’s alternative to RAG and Microsoft’s GraphRAG: