    Can Benign Data Undermine AI Safety? This Paper from Princeton University Explores the Paradox of Machine Learning Fine-Tuning


Safety tuning is vital for ensuring that advanced Large Language Models (LLMs) are aligned with human values and safe to deploy. Current LLMs, including those tuned for safety and alignment, remain susceptible to jailbreaking, and existing guardrails have proven fragile. Even customizing models by fine-tuning on benign data, free of harmful content, can degrade the safety of previously aligned models.

Researchers from Princeton Language and Intelligence (PLI), Princeton University, present a thorough analysis of why benign fine-tuning inadvertently leads to jailbreaking. They characterize fine-tuning data through two lenses: representation space and gradient space. They also propose a bidirectional anchoring method that prioritizes data points close to harmful examples and far from benign ones. Their approach effectively identifies subsets of benign data that are more likely to degrade the model's safety after fine-tuning.
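The bidirectional anchoring idea can be sketched as a simple scoring rule: rank each candidate benign example by how close its embedding sits to a set of harmful anchor examples, minus how close it sits to benign anchors. This is an illustrative sketch, not the authors' code; the 2-D embeddings and function names below are made up for the example.

```python
import numpy as np

def bidirectional_anchor_scores(candidates, harmful_anchors, benign_anchors):
    """Score candidate benign examples: high when the embedding is close
    to a harmful anchor and far from every benign anchor."""
    def max_cos_sim(x, anchors):
        x = x / np.linalg.norm(x, axis=-1, keepdims=True)
        a = anchors / np.linalg.norm(anchors, axis=-1, keepdims=True)
        return (x @ a.T).max(axis=1)  # similarity to the nearest anchor
    return max_cos_sim(candidates, harmful_anchors) - max_cos_sim(candidates, benign_anchors)

# Toy 2-D embeddings: candidate 0 sits near the harmful anchor,
# candidate 1 near the benign anchor.
candidates = np.array([[1.0, 0.1], [0.1, 1.0]])
harmful = np.array([[1.0, 0.0]])
benign = np.array([[0.0, 1.0]])

scores = bidirectional_anchor_scores(candidates, harmful, benign)
ranking = np.argsort(-scores)  # most safety-degrading candidates first
```

Selecting the top-ranked candidates under such a score is what yields the "implicitly harmful" benign subset the paper studies.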

They consider fine-tuning a safety-aligned language model on a dataset of instruction-completion pairs containing no explicitly harmful information. The researchers propose two model-aware approaches to identify data that can lead to model jailbreaking: representation matching and gradient matching. For representation matching, they hypothesize that examples positioned near harmful examples follow optimization pathways similar to those of actual harmful examples, making them more likely to degrade safety guardrails during fine-tuning even though they contain no explicitly harmful content. For gradient matching, they explicitly consider the directions in which samples update the model: the intuition is that samples more likely to decrease the loss on harmful examples are more likely to lead to jailbreaking.
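The gradient-matching intuition above can be sketched as a cosine-similarity score between each candidate's parameter gradient and the mean gradient over a set of harmful examples: a positive score suggests that a training step on the candidate also lowers the loss on the harmful set. Again, this is a toy sketch with made-up 2-D "gradients", not the paper's implementation.

```python
import numpy as np

def gradient_match_scores(sample_grads, harmful_grads):
    """Cosine similarity between each candidate's gradient and the mean
    gradient over harmful examples."""
    g_h = harmful_grads.mean(axis=0)
    g_h = g_h / np.linalg.norm(g_h)
    s = sample_grads / np.linalg.norm(sample_grads, axis=1, keepdims=True)
    return s @ g_h

# Toy gradients: sample 0 points the same way as the harmful-set
# gradient, sample 1 points the opposite way.
harmful = np.array([[1.0, 1.0], [0.8, 1.2]])
samples = np.array([[0.9, 1.1], [-1.0, -1.0]])

scores = gradient_match_scores(samples, harmful)
```

In a real setting the vectors would be per-example gradients of the fine-tuning loss with respect to (a projection of) the model parameters, which makes the score a first-order estimate of how a step on that sample changes the harmful-example loss.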

Comparing fine-tuning data selected by their approaches against random selection, they demonstrate that representation matching and gradient matching effectively identify the implicitly harmful subsets of benign data. With safety anchors incorporated, the attack success rate (ASR) for top-selected examples rises significantly, from 46.6% to 66.5% on ALPACA and from 4.9% to 53.3% on DOLLY. Conversely, selecting the lowest-ranked examples reduces ASR to 3.8% on ALPACA. Using LLAMA-2-7B-CHAT as the base model for selection, they then fine-tuned LLAMA-2-13B-CHAT with the same hyperparameters on the same data subsets chosen by either the representation- or gradient-based method. Running the same evaluation suite on the fine-tuned 13B models confirmed that the selection transfers to the larger model, increasing its harmfulness after fine-tuning.

In this work, the researchers study how benign fine-tuning breaks model safety and alignment from a data-centric perspective. They introduce representation- and gradient-based methods that effectively select a subset of benign data that jailbreaks models after fine-tuning. GPT-3.5 ASR increases from less than 20% to more than 70% after fine-tuning on their selected dataset, exceeding the ASR obtained after fine-tuning on an explicitly harmful dataset of the same size. This work offers an initial step toward understanding which benign data are more likely to degrade safety after fine-tuning.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.

If you like our work, you will love our newsletter.

Don't forget to join our 39k+ ML SubReddit.


Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.


