Researchers from Google Cloud and the Korea Advanced Institute of Science and Technology (KAIST) have introduced DS-STAR, an advanced AI system designed to automate complex data science workflows.
This framework converts raw, heterogeneous data—including CSV files, nested JSON, Markdown documents, and unstructured text—into actionable Python code, all without human intervention.
By leveraging iterative planning and self-verification, DS-STAR represents a shift from conventional Text-to-SQL tools to a more flexible Text-to-Python methodology.
How Does DS-STAR Work?

DS STAR operates via a structured two-stage process that closely mirrors the workflow of experienced data scientists. This architecture addresses a common enterprise challenge: data is often fragmented across multiple formats and sources, making automated analysis difficult without prior integration.
Stage One: Universal Data File Analysis
The process begins with the Analyzer agent (denoted A_analyzer in the research paper). This component scans every file within a specified data lake, generating and executing Python scripts to extract essential metadata. For structured data like CSVs, it identifies column names and data types. For JSON files, it maps hierarchical relationships.
In Markdown documents, it locates embedded tables, and for unstructured text, it produces semantic summaries. The output is a detailed “data map” that offers full context to downstream agents without requiring them to load entire datasets into memory. This step is especially valuable in large, domain-specific archives where data provenance is critical.
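To make this concrete, the sketch below shows what a minimal data-map pass over a folder of mixed files might look like in Python. The function names, metadata fields, and file-type handling are illustrative assumptions; DS-STAR's Analyzer generates and executes scripts of this kind dynamically with an LLM rather than relying on fixed rules.

```python
# Illustrative sketch only: a hand-written stand-in for the kind of metadata
# extraction the Analyzer agent performs by generating and running scripts.
import json
from pathlib import Path

import pandas as pd


def describe_file(path: Path) -> dict:
    """Return lightweight metadata for a single file in the data lake."""
    if path.suffix == ".csv":
        sample = pd.read_csv(path, nrows=100)  # sample rows, never the full file
        return {"type": "csv", "columns": {c: str(t) for c, t in sample.dtypes.items()}}
    if path.suffix == ".json":
        data = json.loads(path.read_text())
        keys = list(data) if isinstance(data, dict) else type(data).__name__
        return {"type": "json", "top_level": keys}
    if path.suffix == ".md":
        rows = [ln for ln in path.read_text().splitlines() if ln.lstrip().startswith("|")]
        return {"type": "markdown", "table_rows_found": len(rows)}
    # Unstructured text: keep a short excerpt as a stand-in for a semantic summary.
    return {"type": "text", "excerpt": path.read_text(errors="ignore")[:200]}


def build_data_map(root: str) -> dict:
    """Scan every file under `root` and collect metadata for downstream agents."""
    return {str(p): describe_file(p) for p in Path(root).rglob("*") if p.is_file()}
```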
Stage Two: Iterative Planning and Verification Loop
After the data landscape is mapped, DS-STAR initiates its core innovation: a cyclic process involving four specialized agents. The Planner first drafts an executable analysis strategy, often starting with basic operations such as loading a CSV or filtering records. The Coder then translates this plan into Python code, which executes immediately to produce results.
What sets DS-STAR apart is the Verifier agent. Using a large language model (LLM), it assesses whether the executed code meaningfully progresses toward answering the original business question. This goes beyond syntax checks—it evaluates the reasoning behind the cumulative plan, code, and outputs, delivering a binary verdict: sufficient or insufficient. If the result is insufficient, the Router agent determines the best corrective action: modify the current step, add a new operation, or redesign the entire approach.
This loop repeats until the Verifier approves the solution or the system reaches a predefined limit of 20 iterations. This sequential refinement mirrors how data scientists work in tools like Jupyter notebooks—testing hypotheses, inspecting outputs, and incrementally building accurate solutions.
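The overall control flow of that loop can be sketched as below. The plan_step, write_code, run_python, verify, and route callables are hypothetical stand-ins for the LLM-backed Planner, Coder, execution sandbox, Verifier, and Router; only the plan-code-verify-route structure and the 20-iteration budget are taken from the description above.

```python
# Sketch of the plan-code-verify-route loop. All callables are hypothetical
# placeholders for LLM-backed agents, not a published DS-STAR API.
from typing import Callable, Optional


def solve(
    question: str,
    data_map: dict,
    plan_step: Callable[[str, dict, list, list], str],  # Planner: draft the next step
    write_code: Callable[[str, dict], str],             # Coder: turn the step into Python
    run_python: Callable[[str], str],                   # execute the generated code
    verify: Callable[[str, list, list], str],           # Verifier: "sufficient" or "insufficient"
    route: Callable[[str, list, list], str],            # Router: "modify", "add", or "restart"
    max_iterations: int = 20,                           # refinement budget from the paper
) -> Optional[str]:
    plan: list = []
    outputs: list = []
    for _ in range(max_iterations):
        step = plan_step(question, data_map, plan, outputs)
        code = write_code(step, data_map)
        result = run_python(code)
        plan.append(step)
        outputs.append(result)
        if verify(question, plan, outputs) == "sufficient":
            return result  # Verifier approves the cumulative solution
        action = route(question, plan, outputs)
        if action == "modify":      # redo the most recent step
            plan.pop()
            outputs.pop()
        elif action == "restart":   # redesign the whole approach
            plan.clear()
            outputs.clear()
        # "add": keep the history and let the Planner append a new step next round
    return None  # budget exhausted without an approved answer
```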
Other Features of DS-STAR
Beyond its primary workflow, DS-STAR integrates two modules that improve robustness in real-world settings.
Debugger: Self-Repairing Code Execution
In production, data formats and schemas frequently change. The debugger module (denoted A_debugger in the research paper) automatically detects execution failures, such as missing columns or type mismatches, and applies fixes using the schema descriptions gathered during the analysis phase.
Early trials indicate this self-healing capability reduces the need for manual intervention by 60–70%.
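A simplified self-repair wrapper in this spirit might look like the sketch below. Here propose_fix is a hypothetical placeholder for the LLM call that rewrites a failing script from the error message and the schema notes produced during analysis; the function signature and retry budget are assumptions, not the module's actual interface.

```python
# Sketch of a self-healing execution wrapper; `propose_fix` is a hypothetical
# stand-in for the LLM-based repair step, not part of any published API.
import subprocess
import sys
from typing import Callable


def run_with_repair(
    script: str,
    schema_notes: str,
    propose_fix: Callable[[str, str, str], str],
    max_attempts: int = 3,
) -> str:
    """Execute a generated script, feeding failures back for an automatic rewrite."""
    for _ in range(max_attempts):
        proc = subprocess.run(
            [sys.executable, "-c", script], capture_output=True, text=True
        )
        if proc.returncode == 0:
            return proc.stdout
        # A missing column (KeyError) or a dtype mismatch would surface here;
        # the traceback plus the schema notes guide the rewrite.
        script = propose_fix(script, proc.stderr, schema_notes)
    raise RuntimeError("script still failing after automatic repair attempts")
```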
Retriever: Smart File Selection
To handle large-scale data lakes efficiently, the Retriever module uses semantic search to identify the top 100 most relevant files for a given analytical task.
By prioritizing key data sources before the Analyzer begins processing, DS-STAR maintains performance and scalability—even in terabyte-scale enterprise environments.
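Conceptually, this pre-selection amounts to ranking file descriptions by semantic similarity to the question, as in the sketch below. The embed callable stands in for whatever text-embedding model is used, and the cosine-similarity ranking and parameter names are illustrative assumptions rather than the system's actual implementation.

```python
# Sketch of a Retriever-style pre-selection step over the data-map metadata.
# `embed` is a placeholder for any text-embedding model; details are assumed.
from typing import Callable

import numpy as np


def select_relevant_files(
    question: str,
    data_map: dict,
    embed: Callable[[str], np.ndarray],
    top_k: int = 100,  # the "top 100 most relevant files" mentioned above
) -> list:
    """Rank files by cosine similarity between the question and each file's metadata."""
    q = embed(question)
    scores = {}
    for path, meta in data_map.items():
        v = embed(str(meta))
        scores[path] = float(
            np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```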
DS-STAR Performance
Empirical benchmarks confirm that DS-STAR outperforms existing autonomous agents such as AutoGen, ReAct, and DA-Agent. Using Gemini 2.5 Pro as its base model and allowing up to 20 refinement cycles, DS-STAR achieved the following results on industry-standard benchmarks:
- DABStep: Accuracy on hard tasks increased from 12.70% (baseline LLM) to 45.24%, while easy tasks reached 87.50% accuracy. This absolute improvement of more than 32 percentage points placed DS-STAR at the top of the public leaderboard as of late 2025.
- KramaBench: The system achieved a normalized score of 44.69, up from 39.79 for the best alternative, highlighting its strength in synthesizing information from domain-specific archives.
- DA-Code: Hard-task accuracy rose from 32.0% to 37.1%, confirming consistent performance across varied analytical challenges.
Notably, easy tasks were completed in just 3.0 iterations on average, with more than half resolved in a single round. Hard tasks required approximately 5.6 iterations, showing that the system knows when to stop refining—avoiding endless loops common in less disciplined AI code generators.
Final Words on Google DS-STAR
DS-STAR signals more than a step forward in automation—it reframes data science as an iterative, verifiable process. Its multi-agent, self-correcting architecture offers a template for other open-ended domains where ground truth is ambiguous, from scientific research to strategic planning.
By integrating planning, coding, verification, and debugging into a unified framework, DS-STAR reduces dependency on human analysts while improving reliability and scalability. As enterprises continue to grapple with sprawling and messy data environments, systems like DS-STAR may soon become essential tools in the data scientist’s toolkit.
Read More: Google Releases Magika 1.0