    Can large language models identify and correct their mistakes? – Google Research Blog


    Posted by Gladys Tyen, Intern, Google Research

LLMs are increasingly used for reasoning tasks, such as multi-turn QA, task completion, code generation, or mathematics. Yet, much like people, they don't always solve problems correctly on the first try, especially on tasks for which they were not trained. Therefore, for such systems to be most useful, they should be able to 1) identify where their reasoning went wrong and 2) backtrack to find another solution.

This has led to a surge in methods related to self-correction, where an LLM is used to identify problems in its own output, and then produce improved results based on the feedback. Self-correction is usually thought of as a single process, but we decided to break it down into two components: mistake finding and output correction.

In “LLMs cannot find reasoning errors, but can correct them!”, we test state-of-the-art LLMs on mistake finding and output correction separately. We present BIG-Bench Mistake, an evaluation benchmark dataset for mistake identification, which we use to address the following questions:

1. Can LLMs find logical mistakes in Chain-of-Thought (CoT) style reasoning?
2. Can mistake finding be used as a proxy for correctness?
3. Knowing where the mistake is, can LLMs then be prompted to backtrack and arrive at the correct answer?
4. Can mistake finding as a skill generalize to tasks the LLMs have never seen?

    About our dataset

Mistake finding is an underexplored problem in natural language processing, with a particular lack of evaluation tasks in this area. To best assess the ability of LLMs to find mistakes, evaluation tasks should exhibit mistakes that are non-ambiguous. To our knowledge, most current mistake-finding datasets do not go beyond the realm of mathematics for this reason.

To assess the ability of LLMs to reason about mistakes outside of the math domain, we produce a new dataset for use by the research community, called BIG-Bench Mistake. This dataset consists of Chain-of-Thought traces generated using PaLM 2 on five tasks in BIG-Bench. Each trace is annotated with the location of the first logical mistake.

To maximize the number of mistakes in our dataset, we sample 255 traces where the answer is incorrect (so we know there is definitely a mistake), and 45 traces where the answer is correct (so there may or may not be a mistake). We then ask human labelers to go through each trace and identify the first mistake step. Each trace has been annotated by at least three labelers, whose answers had inter-rater reliability levels of >0.98 (using Krippendorff’s α). The labeling was done for all tasks except the Dyck Languages task, which involves predicting the sequence of closing parentheses for a given input sequence; this task we labeled algorithmically.
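As an illustration of what one annotated example carries, a record could be modeled as follows. The field names here are our own assumptions for exposition, not the released dataset's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MistakeTrace:
    """One annotated CoT trace (hypothetical schema, not the released format)."""
    task: str                      # one of the 5 BIG-Bench tasks
    steps: list[str]               # the CoT reasoning steps
    answer_correct: bool           # whether the final answer is right
    mistake_index: Optional[int]   # 0-based index of the first logical
                                   # mistake, or None if the trace is clean

# A correct final answer does not imply a clean trace (and vice versa),
# which is why mistake_index is annotated independently of answer_correct.
trace = MistakeTrace(
    task="word_sorting",
    steps=["The words are: apple, pear, fig",
           "Alphabetical order: apple, fig, pear"],
    answer_correct=True,
    mistake_index=None,
)
```

Keeping the mistake location separate from answer correctness is what lets the benchmark score mistake finding on its own, independent of whether the trace happens to end in the right answer.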

The logical mistakes in this dataset are simple and unambiguous, providing a good benchmark for testing an LLM’s ability to find its own mistakes before using it on harder, more ambiguous tasks.

    Core questions on mistake identification

1. Can LLMs find logical mistakes in Chain-of-Thought style reasoning?

First, we want to find out whether LLMs can identify mistakes independently of their ability to correct them. We attempt several prompting methods to test GPT-series models for their ability to locate mistakes (prompts here) under the assumption that they are generally representative of modern LLM performance.

Generally, we found that these state-of-the-art models perform poorly, with the best model achieving 52.9% accuracy overall. Hence, there is a need to improve LLMs’ ability in this area of reasoning.

In our experiments, we try three different prompting methods: direct (trace), direct (step), and CoT (step). In direct (trace), we provide the LLM with the trace and ask for the location of the mistake step, or “no mistake”. In direct (step), we prompt the LLM to ask itself this question for each step it takes. In CoT (step), we prompt the LLM to give its reasoning for whether each step is a mistake or not.
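The three methods can be sketched as prompt builders. The wording below is illustrative only; the exact prompts used in the experiments are linked from the paper:

```python
def direct_trace_prompt(steps: list[str]) -> str:
    """direct (trace): show the whole trace once, ask for the mistake location."""
    trace = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    return (f"{trace}\n"
            "Which step, if any, contains the first logical mistake? "
            "Answer with a step number or 'no mistake'.")

def direct_step_prompts(steps: list[str]) -> list[str]:
    """direct (step): one yes/no question per step."""
    return [f"Step {i + 1}: {s}\nIs this step a mistake? Answer yes or no."
            for i, s in enumerate(steps)]

def cot_step_prompts(steps: list[str]) -> list[str]:
    """CoT (step): ask for reasoning before the per-step verdict."""
    return [f"Step {i + 1}: {s}\n"
            "Think through whether this step is a mistake, "
            "then answer yes or no."
            for i, s in enumerate(steps)]
```

The main design difference is granularity: direct (trace) costs one model call per trace, while the two per-step variants cost one call per reasoning step.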

A diagram showing the three prompting methods: direct (trace), direct (step), and CoT (step).

Our finding is in line with and builds upon prior results, but goes further in showing that LLMs struggle with even simple and unambiguous mistakes (for comparison, our human raters without prior expertise solve the problem with a high degree of agreement). We hypothesize that this is a big reason why LLMs are unable to self-correct reasoning errors. See the paper for the full results.

2. Can mistake finding be used as a proxy for correctness of the answer?

When people are confronted with a problem where we are unsure of the answer, we can work through our solutions step by step. If no error is found, we can make the assumption that we did the right thing.

While we hypothesized that this would work similarly for LLMs, we discovered that this is a poor strategy. On our dataset of 85% incorrect traces and 15% correct traces, using this method is not much better than the naïve strategy of always labeling traces as incorrect, which gives a weighted average F1 of 78.
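That baseline number can be reproduced with a few lines of arithmetic: always predicting “incorrect” on an 85/15 split gives perfect recall but only 85% precision on the majority class, and an F1 of zero on the minority class.

```python
# Naive baseline: label every trace "incorrect" on a dataset that is
# 85% incorrect / 15% correct.
p_incorrect = 0.85            # precision on the "incorrect" class
r_incorrect = 1.0             # recall: every incorrect trace is caught
f1_incorrect = 2 * p_incorrect * r_incorrect / (p_incorrect + r_incorrect)
f1_correct = 0.0              # the "correct" class is never predicted

# Weight each per-class F1 by that class's share of the dataset.
weighted_f1 = 0.85 * f1_incorrect + 0.15 * f1_correct
print(round(weighted_f1 * 100))  # → 78
```

This is why a mistake-finding classifier has to beat 78 weighted F1 on this split before it tells us anything a constant predictor would not.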

A diagram showing how well mistake finding with LLMs can be used as a proxy for correctness of the answer on each dataset.

3. Can LLMs backtrack knowing where the mistake is?

Since we have shown that LLMs perform poorly at finding reasoning errors in CoT traces, we want to know whether LLMs can correct errors at all, even when they know where the error is.

Note that knowing the mistake location is different from knowing the right answer: CoT traces can contain logical mistakes even if the final answer is correct, or vice versa. In most real-world situations, we won’t know what the right answer is, but we might be able to identify logical errors in intermediate steps.

We propose the following backtracking method:

1. Generate CoT traces as usual, at temperature = 0. (Temperature is a parameter that controls the randomness of generated responses, with higher values producing more diverse and creative outputs, usually at the expense of quality.)
2. Identify the location of the first logical mistake (for example with a classifier; here we simply use labels from our dataset).
3. Re-generate the mistake step at temperature = 1, producing a set of eight outputs. Since the original output is known to lead to incorrect results, the goal is to find an alternative generation at this step that is significantly different from the original.
4. From these eight outputs, select one that is different from the original mistake step. (We simply use exact matching here, but in the future this could be something more sophisticated.)
5. Using the new step, generate the rest of the trace as normal at temperature = 0.
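The five steps above can be sketched as a single loop. Here `generate(prefix, temperature)` is a placeholder for any decoding call, and the `"<END>"` sentinel for trace termination is our own assumption, not part of the method as published:

```python
from typing import Callable

def backtrack(steps: list[str],
              mistake_idx: int,
              generate: Callable[[list[str], float], str],
              n_samples: int = 8) -> list[str]:
    """Sketch of the backtracking method. `generate(prefix, temperature)`
    returns the next CoT step, or "<END>" when the trace is finished
    (a hypothetical interface, assumed for this sketch)."""
    prefix = steps[:mistake_idx]

    # Step 3: sample 8 alternatives for the mistake step at temperature 1.
    candidates = [generate(prefix, 1.0) for _ in range(n_samples)]

    # Step 4: keep a candidate that differs (by exact match) from the original.
    alternatives = [c for c in candidates if c != steps[mistake_idx]]
    if not alternatives:
        return steps  # no sufficiently different regeneration; keep the trace

    # Step 5: continue greedily at temperature 0 from the new step onward.
    new_trace = prefix + [alternatives[0]]
    while (step := generate(new_trace, 0.0)) != "<END>":
        new_trace.append(step)
    return new_trace
```

Note that only the suffix from the mistake onward is re-generated; the steps before the mistake are reused verbatim, which is what keeps the method cheap.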

It’s a very simple method that does not require any additional prompt crafting and avoids having to re-generate the entire trace. We test it using the mistake location data from BIG-Bench Mistake, and we find that it can correct CoT errors.

Recent work showed that self-correction methods, like Reflexion and RCI, cause deterioration in accuracy scores because more correct answers become incorrect than vice versa. Our method, on the other hand, produces more gains (by correcting wrong answers) than losses (by changing right answers to wrong answers).

We also compare our method with a random baseline, where we randomly assume a step to be a mistake. Our results show that this random baseline does produce some gains, but not as many as backtracking with the correct mistake location, and with more losses.

A diagram showing the gains and losses in accuracy for our method as well as a random baseline on each dataset.

4. Can mistake finding generalize to tasks the LLMs have never seen?

To answer this question, we fine-tuned a small model on four of the BIG-Bench tasks and tested it on the fifth, held-out task. We do this for every task, producing five fine-tuned models in total. We then compare the results with simply zero-shot prompting PaLM 2-L-Unicorn, a much larger model.

Bar chart showing the accuracy improvement of the fine-tuned small model compared to zero-shot prompting with PaLM 2-L-Unicorn.

Our results show that the much smaller fine-tuned reward model generally performs better than zero-shot prompting a large model, even though the reward model has never seen data from the task in the test set. The only exception is logical deduction, where it performs on par with zero-shot prompting.

This is a very promising result, as we can potentially just use a small fine-tuned reward model to perform backtracking and improve accuracy on any task, even if we don’t have the data for it. This smaller reward model is completely independent of the generator LLM, and can be updated and further fine-tuned for individual use cases.

An illustration showing how our backtracking method works.

    Conclusion

In this work, we created an evaluation benchmark dataset that the broader academic community can use to evaluate future LLMs. We further showed that LLMs currently struggle to find logical errors. However, when the mistake location is known, we show the effectiveness of backtracking as a method that can provide gains on tasks. Finally, a smaller reward model can be trained on general mistake-finding tasks and used to improve out-of-domain mistake finding, showing that mistake finding can generalize.

    Acknowledgements

Thank you to Peter Chen, Tony Mak, Hassan Mansoor and Victor Cărbune for contributing ideas and helping with the experiments and data collection. We would also like to thank Sian Gooding and Vicky Zayats for their comments and suggestions on the paper.
