Ztoog

    How to build a better AI benchmark

    The limits of conventional testing

    If AI companies have been slow to respond to the growing failure of benchmarks, it’s partly because the test-scoring approach has been so effective for so long.

    One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to modern benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.

    Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then-unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural nets would be the secret to unlocking image recognition, but after it scored well, no one dared dispute it. (One of AlexNet’s developers, Ilya Sutskever, would go on to cofound OpenAI.)

    A large part of what made this challenge so effective was that there was little practical difference between ImageNet’s object classification challenge and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.

    But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures, which in turn makes it hard to use the findings responsibly.

    Where things break down

    Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel says. “It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder.”

    Like the University of Michigan’s Jacobs, Reuel thinks “the main issue with benchmarks is validity, even more than the practical implementation,” noting: “That’s where a lot of things break down.” For a task as complicated as coding, for instance, it’s nearly impossible to incorporate every possible scenario into your problem set. As a result, it’s hard to gauge whether a model is scoring better because it’s more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.

    For developers, the hope is that success on many specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There’s just many more knobs you can turn,” says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. “When it comes to agents, they have sort of given up on the best practices for evaluation.”

    © 2025 Ztoog.
