    Google Introduces Simula: A Reasoning-First Framework for Generating Controllable, Scalable Synthetic Datasets Across Specialized AI Domains


    Training highly effective AI models depends on one resource that’s quietly running out: specialised knowledge. While the web offered a seemingly infinite supply of text and images to train today’s generalist models, the next wave of AI breakthroughs — in cybersecurity, legal reasoning, healthcare, and other niche domains — requires data that simply doesn’t exist in sufficient quantity, or cannot be accessed due to privacy concerns.

    A team of researchers from Google and EPFL introduce Simula, a reasoning-driven framework for synthetic data generation and evaluation that prioritises transparency, fine-grained control, and scalability. Unlike typical approaches, Simula doesn’t rely on seed data from the target distribution, hand-crafted prompts, or evolutionary algorithms — it constructs each dataset from first principles, treating data generation as a problem of mechanism design.

    Why Synthetic Data Generation Is Harder Than It Looks

    If you’ve worked with fine-tuning pipelines or domain-specific model training, you’ve probably run into the ‘not enough data’ wall. Manually collecting and annotating specialised datasets is expensive, time-consuming, and error-prone. But the obvious workaround — simply prompting a large language model (LLM) to generate training data — runs into its own set of problems.

    Most current synthetic data strategies optimise for only a subset of what the researchers define as the three axes of ‘good’ data: quality, diversity, and complexity. Quality refers to whether a data point meets specific semantic and syntactic requirements. Diversity covers both global coverage (do you have examples from across your entire concept space?) and local variation (do you have multiple distinct takes on each concept?). Complexity captures how complicated, unusual, or elaborate a given example is. Simultaneously controlling all three, at scale, with explainability, is the unsolved problem that Simula directly targets.
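The three axes are easiest to see as independent gates rather than one blended score. Below is a minimal sketch of that idea; the `ScoredExample` record, the threshold values, and the `keep` helper are illustrative assumptions, not part of Simula's API.

```python
from dataclasses import dataclass

@dataclass
class ScoredExample:
    """One synthetic data point scored along the article's three axes."""
    text: str
    quality: float     # does it meet semantic/syntactic requirements? (0-1)
    diversity: float   # distance from its nearest neighbour in the set (0-1)
    complexity: float  # how elaborate or unusual the example is (0-1)

def keep(ex: ScoredExample, q_min=0.8, d_min=0.2, c_range=(0.0, 1.0)) -> bool:
    """Accept a data point only if all three axes are controlled jointly,
    rather than optimising any single axis in isolation."""
    lo, hi = c_range
    return (ex.quality >= q_min
            and ex.diversity >= d_min
            and lo <= ex.complexity <= hi)

batch = [
    ScoredExample("q1", quality=0.95, diversity=0.4, complexity=0.6),
    ScoredExample("q2", quality=0.60, diversity=0.9, complexity=0.3),  # low quality
]
kept = [ex for ex in batch if keep(ex)]
```

The point of the joint gate is that a data point with excellent diversity but poor quality (like `q2` above) is still rejected, which is exactly the failure mode of pipelines that optimise only one axis.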

    How Simula Works: Taxonomies, Meta-Prompts, and Dual Critics

    Simula breaks down the generation process into four distinct, controllable steps, each targeting a specific data property.

The first step addresses global diversity using hierarchical taxonomies. Given a dataset description — say, ‘a dataset of cybersecurity threat intelligence questions’ — a multi-modal model (called M3) is prompted to identify the prime factors of variation for that domain (e.g., attack type, threat actor, vulnerability class). Each factor is then expanded breadth-first into a hierarchical taxonomy tree. To reduce the risk of missing essential subcategories, the system uses a Best-of-N proposal strategy combined with a critic refinement step, where the model proposes N candidate child nodes and then critiques them for completeness, soundness, and specificity. The resulting taxonomies function as structured sampling scaffolds — ensuring that when you draw 512,000 training examples, they genuinely cover the long tail of the domain rather than clustering around frequent modes. The remaining three steps (local diversification, complexification, and critiquing) follow the same pattern for the other axes: local variation, complexity, and quality.
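The breadth-first expansion with Best-of-N proposals and critic selection can be sketched as follows. The functions `propose_children` and `critique` stand in for calls to the multi-modal model M3; their names, signatures, and the stubbed candidates are assumptions for this sketch, not the paper's API.

```python
# Sketch of Simula-style taxonomy construction (step 1: global diversity).

def propose_children(node: str, n: int) -> list[list[str]]:
    """Best-of-N: return up to N candidate sets of child nodes for `node`.
    Stubbed here with fixed candidates for the cybersecurity example."""
    table = {
        "attack type": [["phishing", "ransomware", "ddos"],
                        ["phishing", "ransomware"]],
    }
    return table.get(node, [[]])[:n]

def critique(node: str, candidates: list[list[str]]) -> list[str]:
    """Critic refinement: pick the candidate set judged most complete,
    sound, and specific (crudely approximated here as the largest set)."""
    return max(candidates, key=len)

def build_taxonomy(factors: list[str], depth: int = 1, n: int = 2) -> dict:
    """Breadth-first expansion of each factor of variation into a tree."""
    tree = {f: {} for f in factors}
    frontier = [(f, tree[f]) for f in factors]
    for _ in range(depth):
        next_frontier = []
        for name, children in frontier:
            best = critique(name, propose_children(name, n))
            for child in best:
                children[child] = {}
                next_frontier.append((child, children[child]))
        frontier = next_frontier
    return tree

tax = build_taxonomy(["attack type", "threat actor"])
```

Sampling uniformly over leaf nodes of such a tree (rather than free-form prompting) is what gives the long-tail coverage the article describes.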

    What the Experiments Show

    The research team tested Simula using Gemini 2.5 Flash (non-thinking) as the teacher model and Gemma 3 4B as the student model, running 10 iterations of LoRA fine-tuning with different seeds per configuration and reporting mean accuracy with 95% confidence intervals. They generated datasets of up to 512K data points across five domains: CTI-MCQ, a multiple-choice question dataset for assessing understanding of CTI standards, threats, and mitigation; CTI-RCM, an open-ended generation task requiring the model to produce a Common Weakness Enumeration (CWE) class from a Common Vulnerabilities and Exposures (CVE) description; LEXam, covering Swiss, EU, and international law examinations in English and German; GSM8k (grade-school math); and Global MMLU (Math, Computer Science, and Physics in English, Korean, and Nepali).
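Reporting mean accuracy with a 95% confidence interval over 10 seeds is a standard computation; the sketch below shows it with made-up accuracy values (the article does not publish per-seed numbers).

```python
import statistics

def mean_ci95(accs: list[float]) -> tuple[float, float]:
    """Mean accuracy and 95% CI half-width across fine-tuning seeds.
    Uses the t critical value for 9 degrees of freedom (~2.262)."""
    m = statistics.mean(accs)
    se = statistics.stdev(accs) / len(accs) ** 0.5
    return m, 2.262 * se

# Hypothetical accuracies from 10 LoRA runs with different seeds
accs = [0.71, 0.69, 0.72, 0.70, 0.68, 0.73, 0.70, 0.71, 0.69, 0.72]
m, hw = mean_ci95(accs)
print(f"accuracy = {m:.3f} ± {hw:.3f}")
```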

    Across all datasets and data sizes, the full Simula system — combining global diversification, local diversification, complexification, and critiquing — consistently outperformed simpler baseline configurations. Notably, combining both Global and Local diversification was crucial; either in isolation produced suboptimal outcomes depending on dataset and scale.

    The complexity results were particularly instructive. On GSM8k, the High Complexity split yielded a 10% accuracy gain over the Low Complexity split at 64K data items. But on LEXam, where the teacher model achieved only 57% accuracy, higher complexity data actually hurt performance — demonstrating that complex data is only useful when the teacher model is strong enough to generate reliable labels for it. The critic rejection rate for LEXam reached 61%, compared to just 2% for CTI-MCQ, 9% for CTI-RCM, and 9% for GSM8k, directly reflecting the teacher model’s weakness in that domain.

    A separate and practically important finding is what the research team call the Student-Teacher Gap effect on scaling laws. For CTI-RCM, student model performance saturated at around 128K data points, after bridging roughly 83% of the gap between the student’s starting accuracy (40%) and the teacher model’s performance (70%). GSM8k, by contrast, showed no such saturation because the student model’s peak performance (75%) remained sufficiently far from the teacher’s (88%).
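The gap arithmetic is easy to check directly. Using the CTI-RCM figures from the paragraph above (student start 40%, teacher 70%, roughly 83% of the gap bridged), the implied saturation accuracy is about 64.9%:

```python
def gap_bridged(student_start: float, student_now: float, teacher: float) -> float:
    """Fraction of the student-teacher gap closed by fine-tuning."""
    return (student_now - student_start) / (teacher - student_start)

# CTI-RCM figures from the article
saturated = 0.40 + 0.83 * (0.70 - 0.40)   # implied saturation accuracy, ~0.649
frac = gap_bridged(0.40, saturated, 0.70)  # recovers 0.83
```

The GSM8k case is the contrast: with the student at 75% and the teacher at 88%, `gap_bridged` stays well below 1.0, so more data keeps helping.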

    Intrinsic Evaluation Gets a Rethink

    Beyond generation, the research team introduces two new evaluation approaches. Taxonomic Coverage measures what fraction of taxonomy nodes at each level are represented in a dataset — a structured alternative to coarse embedding-based cosine distance metrics that fail to provide actionable insights. Calibrated Complexity Scoring assigns Elo scores to individual data points by running batch-wise pairwise comparisons, a technique the research team call ‘calibrated attribute scoring,’ which proved to align well with human-annotated complexity labels on the MATH dataset.
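Both metrics are simple to express. Taxonomic coverage is a set-intersection fraction, and calibrated complexity scoring is standard Elo over pairwise judgements. In the sketch below, the `judge(a, b)` callable stands in for the model's 'which item is more complex?' comparison; its name, the K-factor, and the toy length-based judge are assumptions, not the paper's setup.

```python
import itertools

def taxonomic_coverage(dataset_nodes: set, level_nodes: set) -> float:
    """Fraction of taxonomy nodes at one level represented in a dataset."""
    return len(dataset_nodes & level_nodes) / len(level_nodes)

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32):
    """Standard Elo update after one pairwise comparison."""
    expect_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * ((1.0 if a_wins else 0.0) - expect_a)
    return r_a + delta, r_b - delta

def calibrate_complexity(items, judge, k: float = 32) -> dict:
    """Batch-wise pairwise comparisons -> Elo complexity scores per item."""
    ratings = {x: 1000.0 for x in items}
    for a, b in itertools.combinations(items, 2):
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], judge(a, b))
    return ratings

# Toy judge: the longer question string counts as more complex
ratings = calibrate_complexity(
    ["2+2?", "Prove Fermat's little theorem."],
    judge=lambda a, b: len(a) > len(b),
)
```

Because Elo scores come from relative comparisons rather than absolute rubric prompts, they stay calibrated across batches, which is presumably why they track human complexity labels on MATH better than direct scoring would.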

    One finding stands out: on a taxonomic coverage basis, real-world reference datasets almost always cover less of the target domain than Simula-generated variants, even when embedding-based diversity metrics tell the opposite story. This underscores the limitation of relying on cosine distance alone as a proxy for dataset quality.

    Key Takeaways

    • Simula’s reasoning-first, seedless framework controls quality, diversity, and complexity as independent axes — enabling fine-grained synthetic dataset design without relying on manual prompts, evolutionary algorithms, or seed data from the target distribution.

    • Combining Global and Local diversification is essential: either part in isolation produces suboptimal outcomes, but together they consistently improve downstream model performance across all tested datasets and data sizes.

    • Data complexity helps model performance in most domains, but can hurt when the teacher model is weak — on LEXam, where Gemini 2.5 Flash (non-thinking) achieved only 57% accuracy, the Low Complexity split outperformed the High Complexity split.

    • Real-world reference datasets almost always cover less of the target domain than Simula-generated variants on a taxonomic coverage basis, even when standard embedding-based cosine distance metrics suggest otherwise.

    • Data scaling laws are driven by data properties, not size alone — the full Simula system reached higher downstream performance with fewer samples compared to baseline approaches, making it cheaper across the total data lifecycle despite requiring up to 5x more inference calls per data point.

ztoog.com

© 2026 Ztoog.