🤖 AI Is Shaking Up the Life Sciences
Google's bet on generalist AI: how all-rounders are beating specialists
Welcome to this week's LLM Watch! This time around, we're diving into three AI highlights that showcase how machine learning is tackling complex scientific challenges and redefining intelligent systems.
First, ATOMICA tackles fragmented molecular modeling by learning a universal language for interactions between proteins, DNA, and drugs. This unified view promises deeper biological insights and faster discovery.
Next, TxGemma brings efficient, explainable AI to drug development. Moving beyond black boxes, it offers powerful prediction and conversational reasoning, making advanced AI more accessible and trustworthy for scientists.
Finally, KnowSelf equips AI agents with situational self-awareness. By learning when to reflect or seek knowledge, these agents achieve better results with significantly fewer resources, paving the way for more efficient and adaptive AI.
These advancements highlight a clear trend: AI is becoming more integrated, efficient, explainable, and context-aware. Whether unifying our understanding of molecular biology, streamlining the path to new medicines, or building more resourceful AI agents, the potential impact is immense.
Don’t forget to subscribe so you never miss an update.
1. ATOMICA: Learning Universal Representations of Intermolecular Interactions
Watching: ATOMICA (paper/code)
What problem does it solve? Current machine learning approaches in molecular modeling have a fundamental limitation: they either treat molecules in isolation or specialize in specific interaction types (like protein-ligand binding). This siloed approach prevents knowledge transfer across different biomolecular classes, despite the fact that interactions between proteins, nucleic acids, small molecules, and ions all follow similar physicochemical principles like hydrogen bonding and van der Waals forces. While structure prediction tools like AlphaFold have made tremendous progress, they don't explicitly learn representations of interactions that generalize across the full diversity of molecular types. This fragmentation limits our ability to model the molecular world systematically.
How does it solve the problem? The researchers developed ATOMICA, a geometric deep learning model that learns universal representations of intermolecular interfaces across diverse molecular types. What sets ATOMICA apart is its pretraining on over 2 million interaction complexes spanning proteins, nucleic acids, small molecules, and metal ions. The model represents molecular interactions through a hierarchical graph that captures both atomic-level details and higher-order chemical structure. Using SE(3)-equivariant tensor field networks for message passing and a self-supervised denoising and masking objective, ATOMICA generates embeddings at multiple scales (atoms, chemical blocks, and interfaces) that effectively transfer across different molecular modalities.
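To make that pretraining recipe concrete, here is a deliberately simplified PyTorch sketch of the masking-and-denoising objective over an interaction graph. One important hedge: this toy encoder is distance-based rather than SE(3)-equivariant, it ignores the block/interface hierarchy, and every name, size, and ratio below is an illustrative assumption rather than ATOMICA's actual implementation.

```python
import torch
import torch.nn as nn

class ToyInterfaceEncoder(nn.Module):
    """Simplified message passing over an interaction-interface graph.
    ATOMICA uses SE(3)-equivariant tensor field networks and a hierarchical
    atom/block/interface graph; this invariant, distance-based toy version
    only illustrates the self-supervised setup."""

    def __init__(self, num_atom_types: int = 20, dim: int = 64, layers: int = 3):
        super().__init__()
        self.embed = nn.Embedding(num_atom_types + 1, dim)  # last id acts as [MASK]
        self.msg = nn.ModuleList(
            [nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.SiLU()) for _ in range(layers)]
        )
        self.type_head = nn.Linear(dim, num_atom_types)  # masked identity recovery
        self.noise_head = nn.Linear(dim, 3)              # coordinate denoising

    def forward(self, atom_types, coords, edges):
        h = self.embed(atom_types)                        # (N, dim) atom embeddings
        src, dst = edges                                  # (E,), (E,) directed edges
        for layer in self.msg:
            dist = (coords[src] - coords[dst]).norm(dim=-1, keepdim=True)
            messages = layer(torch.cat([h[src], h[dst], dist], dim=-1))
            h = h + torch.zeros_like(h).index_add_(0, dst, messages)  # aggregate
        return self.type_head(h), self.noise_head(h)

# One pretraining step: mask some atom identities, perturb coordinates, and
# train the model to recover both (sizes and ratios here are arbitrary).
N, E = 32, 128
types = torch.randint(0, 20, (N,))
coords = torch.randn(N, 3)
edges = torch.randint(0, N, (2, E))

mask = torch.rand(N) < 0.15
mask[0] = True                                  # ensure at least one masked atom
noisy_types = types.clone()
noisy_types[mask] = 20                          # the [MASK] type id
noise = 0.1 * torch.randn_like(coords)

model = ToyInterfaceEncoder()
type_logits, pred_noise = model(noisy_types, coords + noise, edges)
loss = nn.functional.cross_entropy(type_logits[mask], types[mask]) \
       + nn.functional.mse_loss(pred_noise, noise)
loss.backward()
```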
What are the key findings? ATOMICA successfully captures critical residues at interaction interfaces and demonstrates significant improvements in cross-modality generalization. For example, training across molecular modalities improved masked block identity recovery on test protein-DNA interactions by 190%. The model's latent space organizes molecules according to chemical similarity and exhibits compositional properties similar to word embeddings in NLP (like vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen")). The researchers constructed five modality-specific protein networks (ATOMICANETs) that identified disease pathways across 27 conditions and predicted disease-associated proteins in autoimmune neuropathies and lymphoma. Additionally, ATOMICA annotated 2,646 previously uncharacterized binding sites in the "dark proteome," including putative zinc finger motifs and cytochrome subunits.
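As a toy illustration of that word-embedding-style compositionality, here is the analogy arithmetic in a few lines of NumPy; the 2-D vectors are made up for the demo and have nothing to do with ATOMICA's actual high-dimensional interface embeddings.

```python
import numpy as np

emb = {
    "King":  np.array([0.9, 0.8]),
    "Man":   np.array([0.7, 0.1]),
    "Woman": np.array([0.1, 0.2]),
    "Queen": np.array([0.3, 0.9]),
}

query = emb["King"] - emb["Man"] + emb["Woman"]          # analogy arithmetic
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
nearest = max((w for w in emb if w != "King"), key=lambda w: cos(query, emb[w]))
print(nearest)  # -> "Queen" with these toy vectors
```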
Why does it matter? ATOMICA represents a significant step toward unifying molecular interaction modeling across different biomolecules. This universal approach enables better prediction and understanding of interactions that are critical for biological processes and disease mechanisms. By demonstrating that cross-modality learning improves representation quality, particularly for modalities with limited data (like protein-nucleic acid interactions), ATOMICA addresses a fundamental challenge in computational biology. The successful annotation of the dark proteome illuminates functions in previously uncharacterized protein families, while the disease pathway analysis through ATOMICANETs offers new insights into disease mechanisms and potential therapeutic targets. This technology opens new avenues for drug discovery, protein function prediction, and understanding disease at the molecular level.
2. TxGemma: Efficient and Agentic LLMs for Therapeutics
Watching: TxGemma (paper)
What problem does it solve? Therapeutic development is a high-risk, costly endeavor with notorious failure rates. While specialized computational models exist for predicting individual drug properties or interactions, they operate in isolation and often function as "black boxes" without explaining their reasoning. The industry lacks integrated models that can handle the diverse needs throughout the drug development pipeline - from target identification to clinical trials - while providing explanations that scientists can understand. Previous approaches like Tx-LLM showed promise but lacked conversational capabilities, limiting their usefulness to scientists who need models that can engage in nuanced scientific discussions about complex therapeutic challenges.
How does it solve the problem? TxGemma is a suite of efficient, generalist LLMs (2B, 9B, and 27B parameters) fine-tuned from Gemma-2 on comprehensive therapeutic datasets from the Therapeutic Data Commons. They created two model variants: TxGemma-Predict for high-performance property prediction across numerous therapeutic tasks, and TxGemma-Chat which balances predictive power with conversational abilities by mixing therapeutic data with general instruction tuning. Going beyond standalone models, they introduced Agentic-Tx, a system powered by Gemini 2.5 that orchestrates complex workflows by integrating TxGemma models with 18 specialized tools to retrieve external knowledge, perform chemical structure calculations, and manage multi-step reasoning processes.
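For hands-on use, the released checkpoints can be queried like any causal LM. Here is a minimal sketch using Hugging Face transformers; the model id and the TDC-style prompt below are assumptions based on the public release, so check the model cards for the exact prompt format.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/txgemma-2b-predict"  # assumed Hugging Face id; verify on the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Hypothetical binary property-prediction prompt in the TDC question style.
prompt = (
    "Instructions: Answer the following question about drug properties.\n"
    "Question: Given a drug SMILES string, predict whether it is approved "
    "for human use.\n"
    "Drug SMILES: CC(=O)OC1=CC=CC=C1C(=O)O\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```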
What are the key findings? TxGemma achieved superior or comparable performance to state-of-the-art generalist models on 64 out of 66 therapeutic tasks (superior on 45) and outperformed specialized models on 26 tasks despite its more general training. It demonstrated exceptional data efficiency, requiring substantially less training data when fine-tuned for clinical trial adverse event prediction compared to base models. TxGemma-Chat successfully bridged the gap between precise property prediction and explanation, providing mechanistic reasoning for predictions based on molecular structure - a crucial step beyond "black box" predictions. The Agentic-Tx system showed remarkable improvements on therapeutic reasoning benchmarks, with performance gains of 52.3% on Humanity's Last Exam (Chemistry & Biology) and 26.7% on GPQA (Chemistry) over previous leading models.
Why does it matter? This represents a paradigm shift in therapeutic AI, showing that efficient generalist models can be competitive with specialized ones across diverse tasks while providing much-needed explainability. The conversational capabilities of TxGemma-Chat enable scientists to receive both predictions and reasoning in natural language, making the technology more accessible and trustworthy for drug developers. By releasing these models openly, researchers can now adapt and fine-tune them on proprietary datasets, potentially accelerating therapeutic development in data-limited domains. The Agentic-Tx system demonstrates how LLMs can orchestrate complex therapeutic workflows, combining prediction with knowledge retrieval and reasoning to support scientists throughout the drug development pipeline. This integrated approach could significantly reduce the time and cost of bringing new therapeutics to market.
3. Agentic Knowledgeable Self-awareness
Watching: KnowSelf (paper/code)
What problem does it solve? Current Large Language Model (LLM) agent planning approaches use what the authors call a "flood irrigation" methodology, indiscriminately injecting gold trajectories, external feedback, and domain knowledge into agent models. This overlooks the fundamental human cognitive ability of situational self-awareness - dynamically assessing when to rely on one's own abilities, when to self-reflect, or when external knowledge is needed. Traditional agents lack this metacognitive capability, leading to inefficient planning where they either use no additional resources or apply excessive reflection and knowledge incorporation regardless of necessity, resulting in higher computational costs and brittle behaviors.
How does it solve the problem? The researchers propose KnowSelf, a data-centric approach enabling agents to autonomously regulate knowledge utilization. Their method identifies three thinking scenarios: "fast thinking" (direct action), "slow thinking" (requiring reflection), and "knowledgeable thinking" (needing external knowledge). First, they collect self-explored trajectories and mark them with special tokens according to a situation judgment criterion. Then, they implement a two-stage training process: supervised fine-tuning to teach basic self-awareness patterns, followed by an RPO (preference optimization) loss to further enhance these capabilities. During inference, the agent signals its needs by generating specific tokens, selectively applying reflection or knowledge only when necessary.
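Conceptually, the inference loop boils down to routing on the agent's own output. Below is a minimal Python sketch of that routing; the token names, the generate/retrieve callables, and the toy usage are hypothetical stand-ins, and the paper's actual special tokens and judgment criterion differ in detail.

```python
from typing import Callable, List

REFLECT_TOKEN = "[Reflection]"    # hypothetical marker for "slow thinking"
KNOWLEDGE_TOKEN = "[Knowledge]"   # hypothetical marker for "knowledgeable thinking"

def knowself_step(
    generate: Callable[[List[str]], str],   # the agent model producing the next action
    retrieve: Callable[[List[str]], str],   # external knowledge source
    history: List[str],
) -> str:
    """One planning step: apply reflection or retrieved knowledge only when the
    agent's own draft asks for it; otherwise act directly ("fast thinking")."""
    draft = generate(history)
    if draft.startswith(REFLECT_TOKEN):
        # Slow thinking: feed the draft back so the agent can revise itself.
        return generate(history + [draft])
    if draft.startswith(KNOWLEDGE_TOKEN):
        # Knowledgeable thinking: retrieve task knowledge, then re-plan with it.
        return generate(history + [retrieve(history)])
    return draft  # Fast thinking: the draft action is used as-is.

# Toy usage with stub callables, just to show the control flow.
print(knowself_step(lambda h: "go to desk 1", lambda h: "", ["task: find a mug"]))
```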
What are the key findings? Experiments show KnowSelf outperforms various strong baselines on two simulated agent planning datasets (ALFWorld and WebShop) while using significantly less external knowledge. For instance, KnowSelf achieved superior results on ALFWorld with only 15.01% knowledge utilization on Llama-8B, compared to baseline methods requiring 100% knowledge integration. The researchers found that training with selective knowledge awareness even outperforms approaches that apply knowledge at every step, particularly for smaller models like Gemma-2B, where excessive knowledge can be counterproductive. Analysis revealed that this knowledgeable self-awareness capability emerges in the final few layers of the Transformer architecture, resembling an internal decision-making process.
Why does it matter? By enabling situational self-awareness, KnowSelf drastically reduces both training and inference costs while improving task performance - essentially getting better results with fewer resources. The approach also enhances agent generalization ability, as demonstrated by superior performance on unseen tasks compared to traditional pattern-matching methods. As AI systems grow more complex, this metacognitive capability to distinguish between situations requiring different levels of reasoning support becomes increasingly valuable, potentially extending beyond planning tasks to broader AI applications where resource optimization and adaptive reasoning are critical.
Papers of the Week:
Prompt brittleness undermines the reliability of zero-shot text classification, even with careful prompt engineering and next-token probability scoring. Placeholding Parallel Prediction (P3) improves robustness by simulating comprehensive sampling of generation paths, reducing the standard deviation of results across prompts. P3 even maintains performance without a prompt, reducing the need for prompt engineering.
LightPROF: A Lightweight Reasoning Framework for Large Language Model on Knowledge Graph
LLMs excel at zero-shot reasoning but struggle with knowledge update delays and high resource consumption. LightPROF, a parameter-efficient framework, enhances LLM reasoning over Knowledge Graphs (KGs) through a Retrieve-Embed-Reason process: stable retrieval plus a Knowledge Adapter that maps KG structural information into the LLM's token embedding space. On KGQA benchmarks, LightPROF reduces reasoning time while working with smaller LLMs.
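The Knowledge Adapter step is easy to picture: a small trainable projector turns retrieved KG-triple embeddings into soft-prompt vectors that a frozen LLM can consume. A minimal sketch under assumed dimensions (not the paper's exact design):

```python
import torch
import torch.nn as nn

kg_dim, llm_dim = 200, 2048          # e.g. KG-embedding size -> LLM hidden size (assumed)

adapter = nn.Sequential(              # the only trained component; the LLM stays frozen
    nn.Linear(kg_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

triple_embs = torch.randn(5, kg_dim)          # 5 retrieved triples from the KG
soft_prompt = adapter(triple_embs)            # (5, llm_dim) pseudo-token embeddings
question_embs = torch.randn(12, llm_dim)      # embedded question tokens (from the frozen LLM)
inputs_embeds = torch.cat([soft_prompt, question_embs], dim=0).unsqueeze(0)
# inputs_embeds can now be fed to a frozen decoder via model(inputs_embeds=...).
```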
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
DeepResearcher trains LLM-based agents for deep, open-domain research in real-world environments with web search, overcoming limitations of prompt engineering and Retrieval-Augmented Generation. A multi-agent architecture with browsing agents enables scaled reinforcement learning and emergent cognitive behaviors, including self-reflection and the ability to cross-validate information from multiple sources.
Single-Pass Document Scanning for Question Answering
Chunk-based methods lack global context in long documents. Single-pass document scanning answers questions over massive texts in linear time while preserving coherence. It outperforms chunk-based methods on 41 QA benchmarks and rivals large language models at lower cost; code, datasets, and models are available in the MambaRetriever repository.
ThoughtProbe: Classifier-Guided Thought Space Exploration Leveraging LLM Intrinsic Reasoning
A linear classifier detects LLMs' intrinsic reasoning signals in activation space and uses them to guide a tree-expansion search over candidate thoughts. With branch-aggregation selection identifying the optimal answer, the framework improves performance on arithmetic reasoning benchmarks, demonstrating better understanding and utilization of LLMs' intrinsic reasoning.
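A minimal sketch of the probing idea, with synthetic vectors standing in for real LLM hidden states and an arbitrary label rule so the demo is self-contained:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim = 256

# Training: activations from thoughts labeled as leading to correct (1) or
# incorrect (0) answers. Here both are synthetic placeholders.
X_train = rng.normal(size=(1000, hidden_dim))
y_train = (X_train[:, 0] > 0).astype(int)     # toy label rule for the demo only
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Search: score the activations of several candidate branches, expand the best.
candidate_acts = rng.normal(size=(4, hidden_dim))
scores = probe.predict_proba(candidate_acts)[:, 1]
best_branch = int(scores.argmax())
print(f"expand branch {best_branch} (score {scores[best_branch]:.2f})")
```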
FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks
FeedbackEval benchmarks large language models (GPT-4o, GLM-4, Qwen2.5) on feedback-driven code repair. Structured test feedback improves repair success in both single-iteration and iterative settings, with prompt structure also influencing outcomes, while chain-of-thought prompting offers limited benefit for automated bug resolution and software maintenance.
Unlike long-context benchmarks built on unstructured text, NIAT reveals that LLMs rely on superficial correlations when extracting a target cell from long tables. A data synthesis method improves genuine comprehension of long structured tables, outperforming long-table agent methods in long-context applications.
Plan-and-Refine: Diverse and Comprehensive Retrieval-Augmented Generation
Plan-and-Refine (P&R) addresses the limited diversity of retrieval-augmented LLM outputs through global exploration followed by local exploitation, guided by a reward model. ICAT evaluation on non-factoid question answering (ANTIQUE) and TREC datasets shows P&R significantly improves answer factuality and comprehensiveness in information seeking, a result confirmed by a user study.
Retrieval-Augmented Generation (RAG) mitigates LLM limitations such as hallucinations but introduces new vulnerabilities. PR-attack uses bilevel optimization to inject poisoned texts into the knowledge base and a backdoor trigger into the prompt, producing targeted responses while evading anomaly detection systems and achieving a high attack success rate with only a limited number of poisoned texts.
ConceptFormer: Towards Efficient Use of Knowledge-Graph Embeddings in Large Language Models
ConceptFormer augments Large Language Models (LLMs) with Wikidata knowledge via concept vectors, boosting GPT-2 0.1B's factual recall on Wikipedia and synthetic sentences. This lookup-table approach outperforms textified knowledge graphs (KGs) for Retrieval-Augmented Generation (RAG) without modifying the LLM, improving efficiency.