Training highly effective AI models depends on one resource that’s quietly running out: specialised knowledge. While the web offered a seemingly infinite supply of text and images to train today’s generalist models, the next wave of AI breakthroughs — in cybersecurity, legal reasoning, healthcare, and other niche domains — requires data that simply doesn’t exist in sufficient quantity, or cannot be accessed due to privacy concerns.
A team of researchers from Google and EPFL introduces Simula, a reasoning-driven framework for synthetic data generation and evaluation that prioritises transparency, fine-grained control, and scalability. Unlike typical approaches, Simula doesn’t rely on seed data from the target distribution, hand-crafted prompts, or evolutionary algorithms — it constructs each dataset from first principles, treating data generation as a problem of mechanism design.
Why Synthetic Data Generation Is Harder Than It Looks
If you’ve worked with fine-tuning pipelines or domain-specific model training, you’ve probably run into the ‘not enough data’ wall. Manually collecting and annotating specialised datasets is expensive, time-consuming, and error-prone. But the obvious workaround — simply prompting a large language model (LLM) to generate training data — runs into its own set of problems.
Most current synthetic data strategies optimise for only a subset of what the researchers define as the three axes of ‘good’ data: quality, diversity, and complexity. Quality refers to whether a data point meets specific semantic and syntactic requirements. Diversity covers both global coverage (do you have examples from across your entire concept space?) and local variation (do you have multiple distinct takes on each concept?). Complexity captures how complicated, unusual, or elaborate a given example is. Simultaneously controlling all three, at scale, with explainability, is the unsolved problem that Simula directly targets.
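The three axes can be pictured as independent, filterable attributes attached to each generated example. A minimal Python sketch of that idea (the field names and filter are illustrative, not Simula's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ExampleProfile:
    passes_quality: bool    # meets semantic and syntactic requirements
    taxonomy_path: tuple    # global-diversity position, e.g. ("malware", "ransomware")
    local_variant_id: int   # which distinct take on that concept this is
    complexity_score: float # calibrated complexity rating (e.g. an Elo score)

def select(examples, min_complexity):
    """Filter a pool along two axes at once: a quality gate plus a
    complexity floor, leaving diversity to the sampling stage."""
    return [e for e in examples
            if e.passes_quality and e.complexity_score >= min_complexity]
```

Because the axes are kept separate, a pipeline can tighten one (say, raise the complexity floor) without touching the others.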
How Simula Works: Taxonomies, Meta-Prompts, and Dual Critics
Simula breaks the generation process into four distinct, controllable steps (global diversification, local diversification, complexification, and critiquing), each targeting a specific data property.
The first step addresses global diversity using hierarchical taxonomies. Given a dataset description — say, ‘a dataset of cybersecurity threat intelligence questions’ — a multi-modal model (called M3) is prompted to identify the primary factors of variation for that domain (e.g., attack type, threat actor, vulnerability class). Each factor is then expanded breadth-first into a hierarchical taxonomy tree. To reduce the risk of missing essential subcategories, the system uses a Best-of-N proposal strategy combined with a critic refinement step, where the model proposes N candidate child nodes and then critiques them for completeness, soundness, and specificity. The resulting taxonomies function as structured sampling scaffolds — ensuring that when you draw 512,000 training examples, they genuinely cover the long tail of the domain rather than clustering around frequent modes.
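The taxonomy-building loop can be sketched as breadth-first expansion with stubbed model calls. In this hedged illustration, `propose_children` and `critique` stand in for the Best-of-N proposal and critic prompts, which would be teacher-model calls in the real pipeline:

```python
import random

def propose_children(node, n):
    """Stand-in for the Best-of-N proposal call: returns n candidate
    subcategories. A real implementation would prompt the teacher model."""
    return [f"{node}/sub{i}" for i in range(n)]

def critique(candidates):
    """Stand-in for the critic call, which judges candidates for
    completeness, soundness, and specificity. Here we only deduplicate."""
    return sorted(set(candidates))

def build_taxonomy(root, depth, n=4):
    """Breadth-first expansion: propose N children per frontier node,
    refine with the critic, and descend to the requested depth."""
    tree = {root: []}
    frontier = [root]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            children = critique(propose_children(node, n))
            tree[node] = children
            for c in children:
                tree[c] = []
            next_frontier.extend(children)
        frontier = next_frontier
    return tree

def sample_leaf(tree, root):
    """Use the finished taxonomy as a sampling scaffold: a uniform
    root-to-leaf walk spreads draws across the long tail instead of
    clustering on frequent modes."""
    node = root
    while tree[node]:
        node = random.choice(tree[node])
    return node
```

Sampling leaves uniformly through the tree, rather than sampling concepts directly, is what gives the scaffold its coverage guarantee: every branch gets proportional probability mass regardless of how common its concepts are on the web.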
What the Experiments Show
The research team tested Simula using Gemini 2.5 Flash (non-thinking) as the teacher model and Gemma 3 4B as the student model, running 10 iterations of LoRA fine-tuning with different seeds per configuration and reporting mean accuracy with 95% confidence intervals. They generated datasets of up to 512K data points across five domains: CTI-MCQ, a multiple-choice question dataset for assessing understanding of cyber threat intelligence (CTI) standards, threats, and mitigation; CTI-RCM, an open-ended generation task requiring the model to produce a Common Weakness Enumeration (CWE) class from a Common Vulnerabilities and Exposures (CVE) description; LEXam, covering Swiss, EU, and international law examinations in English and German; GSM8k (grade-school math); and Global MMLU (Math, Computer Science, and Physics in English, Korean, and Nepali).
Across all datasets and data sizes, the full Simula system — combining global diversification, local diversification, complexification, and critiquing — consistently outperformed simpler baseline configurations. Notably, combining both Global and Local diversification was crucial; either in isolation produced suboptimal outcomes depending on dataset and scale.
The complexity results were particularly instructive. On GSM8k, the High Complexity split yielded a 10% accuracy gain over the Low Complexity split at 64K data items. But on LEXam, where the teacher model achieved only 57% accuracy, higher complexity data actually hurt performance — demonstrating that complex data is only useful when the teacher model is strong enough to generate reliable labels for it. The critic rejection rate for LEXam reached 61%, compared to just 2% for CTI-MCQ, 9% for CTI-RCM, and 9% for GSM8k, directly reflecting the teacher model’s weakness in that domain.
A separate and practically important finding is what the research team call the Student-Teacher Gap effect on scaling laws. For CTI-RCM, student model performance saturated at around 128K data points, after bridging roughly 83% of the gap between the student’s starting accuracy (40%) and the teacher model’s performance (70%). GSM8k, by contrast, showed no such saturation because the student model’s peak performance (75%) remained sufficiently far from the teacher’s (88%).
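The gap-bridging arithmetic from the figures above can be checked directly (a small sanity-check sketch, not code from the paper):

```python
def gap_bridged(start, final, teacher):
    """Fraction of the student-teacher accuracy gap closed by fine-tuning."""
    return (final - start) / (teacher - start)

# CTI-RCM: student starts at 40%, teacher sits at 70%, and roughly 83%
# of that gap is bridged at the ~128K saturation point.
final_cti = 0.40 + 0.83 * (0.70 - 0.40)   # ~0.649, i.e. ~65% accuracy
assert abs(gap_bridged(0.40, final_cti, 0.70) - 0.83) < 1e-9

# GSM8k: the student peaks at 75% against an 88% teacher, leaving a
# 13-point gap, consistent with the absence of saturation.
remaining_gap = 0.88 - 0.75
```

The practical implication: once a student has closed most of the gap to its teacher, adding more synthetic data from that same teacher yields diminishing returns.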
Intrinsic Evaluation Gets a Rethink
Beyond generation, the research team introduces two new evaluation approaches. Taxonomic Coverage measures what fraction of taxonomy nodes at each level are represented in a dataset — a structured alternative to coarse embedding-based cosine distance metrics that fail to provide actionable insights. Calibrated Complexity Scoring assigns Elo scores to individual data points by running batch-wise pairwise comparisons, a technique the research team call ‘calibrated attribute scoring,’ which proved to align well with human-annotated complexity labels on the MATH dataset.
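A minimal sketch of Elo-style pairwise scoring, assuming a `judge` callable that would be an LLM comparison prompt in the real pipeline (the function names and K-factor are illustrative, not the paper's exact procedure):

```python
import random

def elo_update(r_a, r_b, a_wins, k=32):
    """Standard Elo update after one pairwise comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return r_a + delta, r_b - delta

def score_batch(items, judge, rounds=3, seed=0):
    """Assign Elo complexity scores via batch-wise pairwise comparisons.
    `judge(a, b)` returns True if `a` is the more complex item; in a
    real pipeline this would be a model call."""
    rng = random.Random(seed)
    ratings = {i: 1000.0 for i in range(len(items))}
    for _ in range(rounds):
        order = list(ratings)
        rng.shuffle(order)
        for a, b in zip(order[::2], order[1::2]):
            ratings[a], ratings[b] = elo_update(
                ratings[a], ratings[b], judge(items[a], items[b]))
    return ratings
```

Because every comparison is relative, the resulting scores are calibrated within the batch: an item's rating reflects how often it beats its peers, not an absolute judgment that can drift between prompts.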
One finding stands out: on a taxonomic coverage basis, real-world reference datasets almost always cover less of the target domain than Simula-generated variants, even when embedding-based diversity metrics tell the opposite story. This underscores the limitation of relying on cosine distance alone as a proxy for dataset quality.
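Taxonomic coverage as described reduces to a per-level set intersection. A minimal sketch, assuming the taxonomy is available as node sets per level and each data point is labelled with the taxonomy nodes it touches:

```python
def taxonomic_coverage(tree_levels, dataset_labels):
    """Fraction of taxonomy nodes at each level represented in a dataset.
    `tree_levels` maps level -> set of node names; `dataset_labels` is the
    set of nodes hit by at least one data point."""
    return {level: len(nodes & dataset_labels) / len(nodes)
            for level, nodes in tree_levels.items()}
```

Unlike a mean cosine distance, the output pinpoints exactly which levels (and, with a small extension, which nodes) a dataset fails to cover, which is what makes the metric actionable.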
Key Takeaways
- Simula’s reasoning-first, seedless framework controls quality, diversity, and complexity as independent axes — enabling fine-grained synthetic dataset design without relying on manual prompts, evolutionary algorithms, or seed data from the target distribution.
- Combining Global and Local diversification is essential: either part in isolation produces suboptimal outcomes, but together they consistently improve downstream model performance across all tested datasets and data sizes.
- Data complexity helps model performance in most domains, but can hurt when the teacher model is weak — on LEXam, where Gemini 2.5 Flash (non-thinking) achieved only 57% accuracy, the Low Complexity split outperformed the High Complexity split.
- Real-world reference datasets almost always cover less of the target domain than Simula-generated variants on a taxonomic coverage basis, even when standard embedding-based cosine distance metrics suggest otherwise.
- Data scaling laws are driven by data properties, not size alone — the full Simula system reached higher downstream performance with fewer samples compared to baseline approaches, making it cheaper across the total data lifecycle despite requiring up to 5x more inference calls per data point.
