Ztoog
    AI

    Can we fix AI’s evaluation crisis?


As a tech reporter, I typically get asked questions like "Is DeepSeek actually better than ChatGPT?" or "Is the Anthropic model any good?" If I don't feel like turning it into an hour-long seminar, I'll usually give the diplomatic reply: "They're both solid in different ways."

Most people asking aren't defining "good" in any precise way, and that's fair. It's human to want to make sense of something new and seemingly powerful. But that simple question (Is this model good?) is really just the everyday version of a much more complicated technical problem.

So far, the way we've tried to answer that question is through benchmarks. These give models a fixed set of questions to answer and grade them on how many they get right. But much like standardized exams such as the SAT (an admissions test used by many US colleges), these benchmarks don't always reflect deeper abilities. Lately it feels as if a new AI model drops every week, and each time a company launches one, it comes with fresh scores showing it beating its predecessors. On paper, everything looks to be getting better all the time.

In practice, it's not so simple. Just as grinding for the SAT might boost your score without improving your critical thinking, models can be trained to optimize for benchmark results without actually getting smarter, as Russell Brandon explained in his piece for us. As OpenAI and Tesla AI veteran Andrej Karpathy recently put it, we're living through an evaluation crisis: our scoreboard for AI no longer reflects what we really want to measure.

Benchmarks have grown stale for a few key reasons. First, the industry has learned to "teach to the test," training AI models to score well rather than genuinely improve. Second, widespread data contamination means models may have already seen the benchmark questions, or even the answers, somewhere in their training data. And finally, many benchmarks are simply maxed out. On popular tests like SuperGLUE, models have already reached or surpassed 90% accuracy, making further gains feel more like statistical noise than meaningful improvement. At that point, the scores stop telling us anything useful. That's especially true in high-skill domains like coding, reasoning, and complex STEM problem-solving.

However, a growing number of teams around the world are trying to tackle the AI evaluation crisis.

One result is a new benchmark called LiveCodeBench Pro. It draws problems from international algorithmic olympiads, competitions for elite high school and college programmers where participants solve challenging problems without external tools. The top AI models currently manage only about 53% at first pass on medium-difficulty problems and 0% on the hardest ones. These are tasks where human experts routinely excel.

Zihan Zheng, a junior at NYU and a North America finalist in competitive coding, led the project to develop LiveCodeBench Pro with a team of olympiad medalists. They've published both the benchmark and a detailed study showing that top-tier models like GPT o4-mini-high and Google's Gemini 2.5 perform at a level comparable to the top 10% of human competitors. Across the board, Zheng observed a pattern: AI excels at planning and executing tasks, but it struggles with nuanced algorithmic reasoning. "It shows that AI is still far from matching the best human coders," he says.

LiveCodeBench Pro might define a new upper bar. But what about the floor? Earlier this month, a group of researchers from several universities argued that LLM agents should be evaluated primarily on the basis of their riskiness, not just how well they perform. In real-world, application-driven environments, especially with AI agents, unreliability, hallucinations, and brittleness are ruinous. One wrong move could spell disaster when money or safety is on the line.

There are other new attempts to tackle the problem. Some benchmarks, like ARC-AGI, now keep part of their data set private to prevent AI models from being optimized excessively for the test, a problem known as "overfitting." Meta's Yann LeCun has created LiveBench, a dynamic benchmark where questions evolve every six months. The goal is to evaluate models not just on knowledge but on adaptability.

Xbench, a Chinese benchmark project developed by HongShan Capital Group (formerly Sequoia China), is another one of these efforts. I just wrote about it in a story. Xbench was initially built in 2022, right after ChatGPT's launch, as an internal tool to evaluate models for investment research. Over time, the team expanded the system and brought in external collaborators. It just made parts of its question set publicly available last week.

Xbench is notable for its dual-track design, which tries to bridge the gap between lab-based tests and real-world utility. The first track evaluates technical reasoning skills by testing a model's STEM knowledge and its ability to carry out Chinese-language research. The second track aims to assess practical usefulness, such as how well a model performs on tasks in fields like recruitment and marketing. For instance, one task asks an agent to identify five qualified battery engineer candidates; another has it match brands with relevant influencers from a pool of more than 800 creators.

The team behind Xbench has big ambitions. They plan to expand its testing capabilities into sectors like finance, law, and design, and they plan to update the test set quarterly to avoid stagnation.

This is something I often wonder about, because a model's hardcore reasoning ability doesn't necessarily translate into a fun, informative, and creative experience. Most queries from everyday users are probably not rocket science. There isn't much research yet on how to effectively evaluate a model's creativity, but I'd love to know which model would be the best for creative writing or art projects.

Human preference testing has also emerged as an alternative to benchmarks. One increasingly popular platform is LMarena, which lets users submit questions and compare responses from different models side by side, then pick the one they like best. Still, this method has its flaws. Users often reward the answer that sounds more flattering or agreeable, even when it's wrong. That can incentivize "sweet-talking" models and skew results in favor of pandering.
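Arena-style platforms typically turn those pairwise votes into a leaderboard with an Elo-style rating. A minimal sketch of that idea (my own illustration of the classic Elo update, not LMarena's actual pipeline, which fits a Bradley-Terry model):

```python
def update_elo(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """One Elo update from a single pairwise preference vote.
    r_a, r_b: current ratings of models A and B.
    a_wins: True if the user preferred model A's answer.
    Returns the new (r_a, r_b) pair; total rating is conserved."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models at equal ratings: the winner of one vote gains k/2 points.
a, b = update_elo(1000.0, 1000.0, a_wins=True)
print(a, b)  # 1016.0 984.0
```

Notice what this scheme does and doesn't capture: it aggregates which answer users preferred, not whether the answer was correct, which is exactly why flattering responses can climb the leaderboard.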

AI researchers are beginning to realize, and to admit, that the status quo of AI testing can't continue. At the recent CVPR conference, NYU professor Saining Xie drew on historian James Carse's Finite and Infinite Games to critique the hypercompetitive culture of AI research. An infinite game, he noted, is open-ended; the goal is to keep playing. But in AI, a dominant player often drops a big result, triggering a wave of follow-up papers chasing the same narrow topic. This race-to-publish culture puts enormous pressure on researchers and rewards speed over depth, short-term wins over long-term insight. "If academia chooses to play a finite game," he warned, "it will lose everything."

I found his framing powerful, and maybe it applies to benchmarks, too. So, do we have a truly comprehensive scoreboard for how good a model is? Not really. Many dimensions (social, emotional, interdisciplinary) still evade evaluation. But the wave of new benchmarks hints at a shift. As the field evolves, a bit of skepticism is probably healthy.

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

Correction: A previous version of this article mistakenly named 4o-mini, instead of ChatGPT o4-mini-high, as a top-performing model on LiveCodeBench Pro.
