Close Menu
Ztoog
    What's Hot
    AI

    Revolutionizing Cancer Detection: the University of Surrey Unleashes Game-Changing Sketch-Based Object Detection Tool in Machine Learning

    AI

    New method accelerates data retrieval in huge databases | Ztoog

    Gadgets

    Google’s New Feature Ensures Your Pixel Phone Hasn’t Been Hacked. Here’s How It Works | WIRED

    Important Pages:
    • About Us
    • Contact us
    • Privacy Policy
    • Terms & Conditions
    Facebook X (Twitter) Instagram Pinterest
    Facebook X (Twitter) Instagram Pinterest
    Ztoog
    • Home
    • The Future

      How to Get Bot Lobbies in Fortnite? (2025 Guide)

      Can work-life balance tracking improve well-being?

      Any wall can be turned into a camera to see around corners

      JD Vance and President Trump’s Sons Hype Bitcoin at Las Vegas Conference

      AI may already be shrinking entry-level jobs in tech, new research suggests

    • Technology

      What does a millennial midlife crisis look like?

      Elon Musk tries to stick to spaceships

      A Replit employee details a critical security flaw in web apps created using AI-powered app builder Lovable that exposes API keys and personal info of app users (Reed Albergotti/Semafor)

      Gemini in Google Drive can now help you skip watching that painfully long Zoom meeting

      Apple iPhone exports from China to the US fall 76% as India output surges

    • Gadgets

      Watch Apple’s WWDC 2025 keynote right here

      Future-proof your career by mastering AI skills for just $20

      8 Best Vegan Meal Delivery Services and Kits (2025), Tested and Reviewed

      Google Home is getting deeper Gemini integration and a new widget

      Google Announces AI Ultra Subscription Plan With Premium Features

    • Mobile

      YouTube is testing a leaderboard to show off top live stream fans

      Deals: the Galaxy S25 series comes with a free tablet, Google Pixels heavily discounted

      Microsoft is done being subtle – this new tool screams “upgrade now”

      Wallpaper Wednesday: Android wallpapers 2025-05-28

      Google can make smart glasses accessible with Warby Parker, Gentle Monster deals

    • Science

      Some parts of Trump’s proposed budget for NASA are literally draconian

      June skygazing: A strawberry moon, the summer solstice… and Asteroid Day!

      Analysts Say Trump Trade Wars Would Harm the Entire US Energy Sector, From Oil to Solar

      Do we have free will? Quantum experiments may soon reveal the answer

      Was Planet Nine exiled from the solar system as a baby?

    • AI

      Fueling seamless AI at scale

      Rationale engineering generates a compact new tool for gene therapy | Ztoog

      The AI Hype Index: College students are hooked on ChatGPT

      Learning how to predict rare kinds of failures | Ztoog

      Anthropic’s new hybrid AI model can work on tasks autonomously for hours at a time

    • Crypto

      Bitcoin Maxi Isn’t Buying Hype Around New Crypto Holding Firms

      GameStop bought $500 million of bitcoin

      CoinW Teams Up with Superteam Europe to Conclude Solana Hackathon and Accelerate Web3 Innovation in Europe

      Ethereum Net Flows Turn Negative As Bulls Push For $3,500

      Bitcoin’s Power Compared To Nuclear Reactor By Brazilian Business Leader

    Ztoog
    Home » This AI Research Uncovers the Mechanics of Dishonesty in Large Language Models: A Deep Dive into Prompt Engineering and Neural Network Analysis
    AI

    This AI Research Uncovers the Mechanics of Dishonesty in Large Language Models: A Deep Dive into Prompt Engineering and Neural Network Analysis

    Facebook Twitter Pinterest WhatsApp
    This AI Research Uncovers the Mechanics of Dishonesty in Large Language Models: A Deep Dive into Prompt Engineering and Neural Network Analysis
    Share
    Facebook Twitter LinkedIn Pinterest WhatsApp

    Understanding massive language fashions (LLMs) and selling their sincere conduct has turn into more and more essential as these fashions have demonstrated rising capabilities and began broadly adopted by society. Researchers contend that new dangers, comparable to scalable disinformation, manipulation, fraud, election tampering, or the speculative threat of loss of management, come up from the potential for fashions to be misleading (which they outline as “the systematic inducement of false beliefs in the pursuit of some outcome other than the truth”). Research signifies that even whereas the fashions’ activations have the needed data, they could want greater than misalignment to supply the proper consequence. 

    Previous research have distinguished between truthfulness and honesty, saying that the former refrains from making false claims, whereas the latter refrains from making claims it doesn’t “believe.” This distinction helps to make sense of it. Therefore, a mannequin could generate deceptive assertions owing to misalignment in the kind of dishonesty slightly than an absence of ability. Since then, a number of research have tried to deal with LLM honesty by delving into a mannequin’s inside state to search out truthful representations. Proposals for latest black field methods have additionally been made to determine and provoke large language mannequin mendacity. Notably, earlier work demonstrates that enhancing the extraction of inside mannequin representations could also be achieved by forcing fashions to think about a notion actively. 

    Furthermore, fashions embody a “critical” middleman layer in context-following environments, past which representations of true or incorrect responses in context-following are inclined to diverge a phenomenon referred to as “overthinking.” Motivated by earlier research, the researchers broadened the focus from incorrectly labeled in-context studying to deliberate dishonesty, in which they gave the mannequin specific directions to lie. Using probing and mechanical interpretability methodologies, the analysis crew from Cornell University, the University of Pennsylvania, and the University of Maryland hopes to determine and comprehend which layers and consideration heads in the mannequin are accountable for dishonesty in this context. 

    The following are their contributions: 

    1. The analysis crew reveals that, as decided by significantly below-chance accuracy on true/false questions, LLaMA-2-70b-chat may be educated to lie. According to the research crew, this may be fairly delicate and needs to be fastidiously and rapidly engineered. 

    2. Using activation patching and probing, the analysis crew finds unbiased proof for 5 mannequin layers crucial to dishonest conduct. 

    3. Only 46 consideration heads, or 0.9% of all heads in the community, have been successfully subjected to causal interventions by the research crew, which pressured misleading fashions to reply honestly. These therapies are resilient over a number of dataset splits and prompts. 

    In a nutshell the analysis crew appears to be like at a simple case of mendacity, the place they supply LLM directions on whether or not to inform the fact or not. Their findings show that vast fashions can show dishonest behaviour, producing proper solutions when requested to be sincere and inaccurate responses if pushed to lie. These findings construct on earlier analysis that means activation probing can generalize out-of-distribution when prompted. However, the analysis crew does uncover that this may increasingly necessitate prolonged immediate engineering as a result of issues like the mannequin’s tendency to output the “False” token sooner in the sequence than the “True” token. 

    By utilizing prefix injection, the analysis crew can persistently induce mendacity. Subsequently, the crew compares the activations of the dishonest and sincere fashions, localizing the layers and consideration heads concerned in mendacity. By using linear probes to research this mendacity conduct, the analysis crew discovers that early-to-middle layers see comparable mannequin representations for sincere and liar prompts earlier than diverging drastically to turn into anti-parallel. This would possibly present that prior layers ought to have a context-invariant illustration of fact, as desired by a physique of literature. Activation patching is one other device the analysis crew makes use of to know extra about the workings of particular layers and heads. The researchers found that localized interventions might utterly handle the mismatch between the honest-prompted and liar fashions in both course. 

    Significantly, these interventions on a mere 46 consideration heads show a strong diploma of cross-dataset and cross-prompt resilience. The analysis crew focuses on mendacity by using an accessible dataset and particularly telling the mannequin to lie, in distinction to earlier work that has largely examined the accuracy and integrity of fashions which are sincere by default. Thanks to this context, researchers have discovered an incredible deal about the subtleties of encouraging dishonest conduct and the strategies by which huge fashions have interaction in dishonest conduct. To assure the moral and protected utility of LLMs in the actual world, the analysis crew hopes that extra work in this context will result in new approaches to stopping LLM mendacity.


    Check out the Paper. All credit score for this analysis goes to the researchers of this challenge. Also, don’t overlook to affix our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, the place we share the newest AI analysis information, cool AI initiatives, and extra.

    If you want our work, you’ll love our publication..


    Aneesh Tickoo is a consulting intern at MarktechPost. He is at present pursuing his undergraduate diploma in Data Science and Artificial Intelligence from the Indian Institute of Technology(IIT), Bhilai. He spends most of his time engaged on initiatives aimed toward harnessing the energy of machine studying. His analysis curiosity is picture processing and is captivated with constructing options round it. He loves to attach with individuals and collaborate on attention-grabbing initiatives.


    ✅ [Featured AI Model] Check out LLMWare and It’s RAG- specialised 7B Parameter LLMs

    Share. Facebook Twitter Pinterest LinkedIn WhatsApp

    Related Posts

    AI

    Fueling seamless AI at scale

    AI

    Rationale engineering generates a compact new tool for gene therapy | Ztoog

    AI

    The AI Hype Index: College students are hooked on ChatGPT

    AI

    Learning how to predict rare kinds of failures | Ztoog

    AI

    Anthropic’s new hybrid AI model can work on tasks autonomously for hours at a time

    AI

    AI learns how vision and sound are connected, without human intervention | Ztoog

    AI

    How AI is introducing errors into courtrooms

    AI

    With AI, researchers predict the location of virtually any protein within a human cell | Ztoog

    Leave A Reply Cancel Reply

    Follow Us
    • Facebook
    • Twitter
    • Pinterest
    • Instagram
    Top Posts
    Gadgets

    Employee finds SSD stolen last year from corporate data center for sale on eBay

    Sketchy offers on eBay and different on-line marketplaces occur on a regular basis. Encountering counterfeit,…

    Gadgets

    The best portable fire pits of 2023

    We might earn income from the merchandise out there on this web page and take…

    AI

    Best 10+ AI Tools For Designers (2023)

    Kive is an AI-powered software supposed to make it easy for designers to arrange their…

    Gadgets

    Leica camera has built-in defense against misleading AI, costs $9,125

    Enlarge / A photograph shot with the M11-P. On Thursday, Leica Camera launched the primary…

    Science

    DART Showed How to Smash an Asteroid. So Where Did the Space Shrapnel Go?

    Nearly one yr in the past, NASA flung the DART spacecraft into the asteroid Dimorphos…

    Our Picks
    Crypto

    Carv raises $10M Series A to help gamers monetize their data

    Science

    Stench leads officials to 189 rotting corpses at taxidermist’s funeral home

    Gadgets

    This Nespresso machine is a perfect last-minute gift. Get it 34% off at Amazon

    Categories
    • AI (1,494)
    • Crypto (1,754)
    • Gadgets (1,806)
    • Mobile (1,852)
    • Science (1,868)
    • Technology (1,804)
    • The Future (1,650)
    Most Popular
    Gadgets

    Meta releases open source AI audio tools, AudioCraft

    Crypto

    What is Cryptocurrency and How Does it Work?

    Mobile

    These are my favorite Samsung Galaxy One UI 6 features

    Ztoog
    Facebook X (Twitter) Instagram Pinterest
    • Home
    • About Us
    • Contact us
    • Privacy Policy
    • Terms & Conditions
    © 2025 Ztoog.

    Type above and press Enter to search. Press Esc to cancel.