Close Menu
Ztoog
    What's Hot
    Gadgets

    15 Best Early Black Friday Deals (2023): iPads, Scooters, Wireless Earbuds

    Science

    How hummingbirds switch gears at breakneck speeds

    The Future

    Google and HP collaborate on ‘Made in India’ Chromebooks

    Important Pages:
    • About Us
    • Contact us
    • Privacy Policy
    • Terms & Conditions
    Facebook X (Twitter) Instagram Pinterest
    Facebook X (Twitter) Instagram Pinterest
    Ztoog
    • Home
    • The Future

      Can work-life balance tracking improve well-being?

      Any wall can be turned into a camera to see around corners

      JD Vance and President Trump’s Sons Hype Bitcoin at Las Vegas Conference

      AI may already be shrinking entry-level jobs in tech, new research suggests

      Today’s NYT Strands Hints, Answer and Help for May 26 #449

    • Technology

      Elon Musk tries to stick to spaceships

      A Replit employee details a critical security flaw in web apps created using AI-powered app builder Lovable that exposes API keys and personal info of app users (Reed Albergotti/Semafor)

      Gemini in Google Drive can now help you skip watching that painfully long Zoom meeting

      Apple iPhone exports from China to the US fall 76% as India output surges

      Today’s NYT Wordle Hints, Answer and Help for May 26, #1437

    • Gadgets

      Future-proof your career by mastering AI skills for just $20

      8 Best Vegan Meal Delivery Services and Kits (2025), Tested and Reviewed

      Google Home is getting deeper Gemini integration and a new widget

      Google Announces AI Ultra Subscription Plan With Premium Features

      Google shows off Android XR-based glasses, announces Warby Parker team-up

    • Mobile

      Deals: the Galaxy S25 series comes with a free tablet, Google Pixels heavily discounted

      Microsoft is done being subtle – this new tool screams “upgrade now”

      Wallpaper Wednesday: Android wallpapers 2025-05-28

      Google can make smart glasses accessible with Warby Parker, Gentle Monster deals

      vivo T4 Ultra specs leak

    • Science

      June skygazing: A strawberry moon, the summer solstice… and Asteroid Day!

      Analysts Say Trump Trade Wars Would Harm the Entire US Energy Sector, From Oil to Solar

      Do we have free will? Quantum experiments may soon reveal the answer

      Was Planet Nine exiled from the solar system as a baby?

      How farmers can help rescue water-loving birds

    • AI

      Rationale engineering generates a compact new tool for gene therapy | Ztoog

      The AI Hype Index: College students are hooked on ChatGPT

      Learning how to predict rare kinds of failures | Ztoog

      Anthropic’s new hybrid AI model can work on tasks autonomously for hours at a time

      AI learns how vision and sound are connected, without human intervention | Ztoog

    • Crypto

      Bitcoin Maxi Isn’t Buying Hype Around New Crypto Holding Firms

      GameStop bought $500 million of bitcoin

      CoinW Teams Up with Superteam Europe to Conclude Solana Hackathon and Accelerate Web3 Innovation in Europe

      Ethereum Net Flows Turn Negative As Bulls Push For $3,500

      Bitcoin’s Power Compared To Nuclear Reactor By Brazilian Business Leader

    Ztoog
    Home » This AI Research Uncovers the Mechanics of Dishonesty in Large Language Models: A Deep Dive into Prompt Engineering and Neural Network Analysis
    AI

    This AI Research Uncovers the Mechanics of Dishonesty in Large Language Models: A Deep Dive into Prompt Engineering and Neural Network Analysis

    Facebook Twitter Pinterest WhatsApp
    This AI Research Uncovers the Mechanics of Dishonesty in Large Language Models: A Deep Dive into Prompt Engineering and Neural Network Analysis
    Share
    Facebook Twitter LinkedIn Pinterest WhatsApp

    Understanding massive language fashions (LLMs) and selling their sincere conduct has turn into more and more essential as these fashions have demonstrated rising capabilities and began broadly adopted by society. Researchers contend that new dangers, comparable to scalable disinformation, manipulation, fraud, election tampering, or the speculative threat of loss of management, come up from the potential for fashions to be misleading (which they outline as “the systematic inducement of false beliefs in the pursuit of some outcome other than the truth”). Research signifies that even whereas the fashions’ activations have the needed data, they could want greater than misalignment to supply the proper consequence. 

    Previous research have distinguished between truthfulness and honesty, saying that the former refrains from making false claims, whereas the latter refrains from making claims it doesn’t “believe.” This distinction helps to make sense of it. Therefore, a mannequin could generate deceptive assertions owing to misalignment in the kind of dishonesty slightly than an absence of ability. Since then, a number of research have tried to deal with LLM honesty by delving into a mannequin’s inside state to search out truthful representations. Proposals for latest black field methods have additionally been made to determine and provoke large language mannequin mendacity. Notably, earlier work demonstrates that enhancing the extraction of inside mannequin representations could also be achieved by forcing fashions to think about a notion actively. 

    Furthermore, fashions embody a “critical” middleman layer in context-following environments, past which representations of true or incorrect responses in context-following are inclined to diverge a phenomenon referred to as “overthinking.” Motivated by earlier research, the researchers broadened the focus from incorrectly labeled in-context studying to deliberate dishonesty, in which they gave the mannequin specific directions to lie. Using probing and mechanical interpretability methodologies, the analysis crew from Cornell University, the University of Pennsylvania, and the University of Maryland hopes to determine and comprehend which layers and consideration heads in the mannequin are accountable for dishonesty in this context. 

    The following are their contributions: 

    1. The analysis crew reveals that, as decided by significantly below-chance accuracy on true/false questions, LLaMA-2-70b-chat may be educated to lie. According to the research crew, this may be fairly delicate and needs to be fastidiously and rapidly engineered. 

    2. Using activation patching and probing, the analysis crew finds unbiased proof for 5 mannequin layers crucial to dishonest conduct. 

    3. Only 46 consideration heads, or 0.9% of all heads in the community, have been successfully subjected to causal interventions by the research crew, which pressured misleading fashions to reply honestly. These therapies are resilient over a number of dataset splits and prompts. 

    In a nutshell the analysis crew appears to be like at a simple case of mendacity, the place they supply LLM directions on whether or not to inform the fact or not. Their findings show that vast fashions can show dishonest behaviour, producing proper solutions when requested to be sincere and inaccurate responses if pushed to lie. These findings construct on earlier analysis that means activation probing can generalize out-of-distribution when prompted. However, the analysis crew does uncover that this may increasingly necessitate prolonged immediate engineering as a result of issues like the mannequin’s tendency to output the “False” token sooner in the sequence than the “True” token. 

    By utilizing prefix injection, the analysis crew can persistently induce mendacity. Subsequently, the crew compares the activations of the dishonest and sincere fashions, localizing the layers and consideration heads concerned in mendacity. By using linear probes to research this mendacity conduct, the analysis crew discovers that early-to-middle layers see comparable mannequin representations for sincere and liar prompts earlier than diverging drastically to turn into anti-parallel. This would possibly present that prior layers ought to have a context-invariant illustration of fact, as desired by a physique of literature. Activation patching is one other device the analysis crew makes use of to know extra about the workings of particular layers and heads. The researchers found that localized interventions might utterly handle the mismatch between the honest-prompted and liar fashions in both course. 

    Significantly, these interventions on a mere 46 consideration heads show a strong diploma of cross-dataset and cross-prompt resilience. The analysis crew focuses on mendacity by using an accessible dataset and particularly telling the mannequin to lie, in distinction to earlier work that has largely examined the accuracy and integrity of fashions which are sincere by default. Thanks to this context, researchers have discovered an incredible deal about the subtleties of encouraging dishonest conduct and the strategies by which huge fashions have interaction in dishonest conduct. To assure the moral and protected utility of LLMs in the actual world, the analysis crew hopes that extra work in this context will result in new approaches to stopping LLM mendacity.


    Check out the Paper. All credit score for this analysis goes to the researchers of this challenge. Also, don’t overlook to affix our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, the place we share the newest AI analysis information, cool AI initiatives, and extra.

    If you want our work, you’ll love our publication..


    Aneesh Tickoo is a consulting intern at MarktechPost. He is at present pursuing his undergraduate diploma in Data Science and Artificial Intelligence from the Indian Institute of Technology(IIT), Bhilai. He spends most of his time engaged on initiatives aimed toward harnessing the energy of machine studying. His analysis curiosity is picture processing and is captivated with constructing options round it. He loves to attach with individuals and collaborate on attention-grabbing initiatives.


    ✅ [Featured AI Model] Check out LLMWare and It’s RAG- specialised 7B Parameter LLMs

    Share. Facebook Twitter Pinterest LinkedIn WhatsApp

    Related Posts

    AI

    Rationale engineering generates a compact new tool for gene therapy | Ztoog

    AI

    The AI Hype Index: College students are hooked on ChatGPT

    AI

    Learning how to predict rare kinds of failures | Ztoog

    AI

    Anthropic’s new hybrid AI model can work on tasks autonomously for hours at a time

    AI

    AI learns how vision and sound are connected, without human intervention | Ztoog

    AI

    How AI is introducing errors into courtrooms

    AI

    With AI, researchers predict the location of virtually any protein within a human cell | Ztoog

    AI

    Google DeepMind’s new AI agent cracks real-world problems better than humans can

    Leave A Reply Cancel Reply

    Follow Us
    • Facebook
    • Twitter
    • Pinterest
    • Instagram
    Top Posts
    Crypto

    Bitcoin ETFs, Carta’s latest mess and let’s go to the moon

    Listen right here or wherever you get your podcasts. Hello, and welcome again to Equity, the podcast…

    AI

    Precision home robots learn with real-to-sim-to-real | Ztoog

    At the highest of many automation want lists is a very time-consuming job: chores. The moonshot…

    AI

    You need to talk to your kid about AI. Here are 6 things you should say.

    “These tools are not representative of everybody—what they tell us is based on what they’ve…

    Science

    Gene Therapy in the Womb Is Inching Closer to Reality

    In a future when gene remedy can tweak an individual’s genome exactly sufficient to treatment them…

    Science

    California Nixes a Bill to Decriminalize Plant-Based Psychedelics

    Over the weekend, California Governor Gavin Newsom vetoed Senate Bill 58 (SB 58), nixing the…

    Our Picks
    Crypto

    El Salvador Crypto Advisor Predicts Explosive Rise To $220,000

    AI

    How Do Schrodinger Bridges Beat Diffusion Models On Text-To-Speech (TTS) Synthesis?

    Science

    Why Antidepressants Take So Long to Work

    Categories
    • AI (1,493)
    • Crypto (1,754)
    • Gadgets (1,805)
    • Mobile (1,851)
    • Science (1,867)
    • Technology (1,803)
    • The Future (1,649)
    Most Popular
    Mobile

    iPhone 16 Pro series to offer up to 2TB storage and larger batteries

    Mobile

    Update to iOS 17 will add useful and stress-saving feature to Apple Maps not found on Google Maps

    Mobile

    So many Quest 3 leaks, so little time. Plus, we’re taking a look at more Quest game releases including a new NFL VR game, and Apple’s commitment to the Vision Pro’s long-term success.

    Ztoog
    Facebook X (Twitter) Instagram Pinterest
    • Home
    • About Us
    • Contact us
    • Privacy Policy
    • Terms & Conditions
    © 2025 Ztoog.

    Type above and press Enter to search. Press Esc to cancel.