    How to build a better AI benchmark


    The limits of conventional testing

    If AI companies have been slow to respond to the growing failure of benchmarks, it’s partly because the test-scoring approach has been so effective for so long.

    One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to today’s benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.

    Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural nets would be the secret to unlocking image recognition, but after it scored well, no one dared dispute it. (One of AlexNet’s developers, Ilya Sutskever, would go on to cofound OpenAI.)

    A large part of what made this challenge so effective was that there was little practical difference between ImageNet’s object classification challenge and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.

    But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures, which in turn makes it hard to use the findings responsibly.

    Where things break down

    Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel says. “It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder.”

    Like the University of Michigan’s Jacobs, Reuel thinks “the main issue with benchmarks is validity, even more than the practical implementation,” noting: “That’s where a lot of things break down.” For a task as complicated as coding, for instance, it’s nearly impossible to incorporate every possible scenario into your problem set. As a result, it’s hard to gauge whether a model is scoring better because it’s more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.

    For developers, the hope is that success on lots of specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There’s just many more knobs you can turn,” says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. “When it comes to agents, they have sort of given up on the best practices for evaluation.”
