    A faster, better way to prevent an AI chatbot from giving toxic responses | Ztoog


A user could ask ChatGPT to write a computer program or summarize an article, and the AI chatbot would likely be able to generate useful code or write a cogent synopsis. However, someone could also ask for instructions to build a bomb, and the chatbot might be able to provide those, too.

To prevent this and other safety issues, companies that build large language models typically safeguard them using a process called red-teaming. Teams of human testers write prompts aimed at triggering unsafe or toxic text from the model being tested. These prompts are then used to teach the chatbot to avoid such responses.

But this only works effectively if engineers know which toxic prompts to use. If human testers miss some prompts, which is likely given the number of possibilities, a chatbot regarded as safe might still be capable of generating unsafe answers.

Researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab used machine learning to improve red-teaming. They developed a technique to train a red-team large language model to automatically generate diverse prompts that trigger a wider range of undesirable responses from the chatbot being tested.

They do this by teaching the red-team model to be curious when it writes prompts, and to focus on novel prompts that evoke toxic responses from the target model.

The technique outperformed human testers and other machine-learning approaches by generating more distinct prompts that elicited increasingly toxic responses. Not only does their method significantly improve the coverage of inputs being tested compared with other automated methods, it can also draw out toxic responses from a chatbot that had safeguards built into it by human experts.

“Right now, every large language model has to undergo a very lengthy period of red-teaming to ensure its safety. That is not going to be sustainable if we want to update these models in rapidly changing environments. Our method provides a faster and more effective way to do this quality assurance,” says Zhang-Wei Hong, an electrical engineering and computer science (EECS) graduate student in the Improbable AI lab and lead author of a paper on this red-teaming approach.

Hong’s co-authors include EECS graduate students Idan Shenfield, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of the Improbable AI Lab and an assistant professor in CSAIL. The research will be presented at the International Conference on Learning Representations.

Automated red-teaming

Large language models, like those that power AI chatbots, are often trained by showing them enormous amounts of text from billions of public websites. So, not only can they learn to generate toxic words or describe illegal activities, the models could also leak personal information they may have picked up.

The tedious and costly nature of human red-teaming, which is often ineffective at generating a wide enough variety of prompts to fully safeguard a model, has encouraged researchers to automate the process using machine learning.

Such techniques often train a red-team model using reinforcement learning. This trial-and-error process rewards the red-team model for generating prompts that trigger toxic responses from the chatbot being tested.

But due to the way reinforcement learning works, the red-team model will often keep generating a few similar prompts that are highly toxic, in order to maximize its reward.

For their reinforcement learning approach, the MIT researchers applied a technique called curiosity-driven exploration. The red-team model is incentivized to be curious about the consequences of each prompt it generates, so it will try prompts with different words, sentence patterns, or meanings.

    “If the red-team model has already seen a specific prompt, then reproducing it will not generate any curiosity in the red-team model, so it will be pushed to create new prompts,” Hong says.

During its training process, the red-team model generates a prompt and interacts with the chatbot. The chatbot responds, and a safety classifier rates the toxicity of its response, rewarding the red-team model based on that rating. A minimal sketch of one step of that loop appears below.
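The sketch below illustrates the plain reinforcement-learning baseline described above, where the classifier's toxicity rating is used directly as the reward. The three callables (prompt generator, chatbot, toxicity rater) are hypothetical stand-ins, not interfaces from the paper.

```python
from typing import Callable, Tuple

def red_team_step(
    generate_prompt: Callable[[], str],     # red-team LLM samples a candidate prompt
    chatbot_respond: Callable[[str], str],  # target chatbot answers the prompt
    rate_toxicity: Callable[[str], float],  # safety classifier scores the reply, e.g. in [0, 1]
) -> Tuple[str, str, float]:
    """One step of baseline RL red-teaming: the toxicity rating is the reward."""
    prompt = generate_prompt()
    response = chatbot_respond(prompt)
    reward = rate_toxicity(response)        # higher toxicity -> higher reward
    # A policy-gradient update on the red-team model would consume (prompt, reward) here.
    return prompt, response, reward
```

Because the reward depends only on toxicity, a policy trained this way tends to collapse onto a handful of reliably toxic prompts, which is the failure mode the curiosity terms below are meant to counteract.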

    Rewarding curiosity

The red-team model’s objective is to maximize its reward by eliciting an even more toxic response with a novel prompt. The researchers enable curiosity in the red-team model by modifying the reward signal in the reinforcement learning setup.

First, in addition to maximizing toxicity, they include an entropy bonus that encourages the red-team model to be more random as it explores different prompts. Second, to make the agent curious, they include two novelty rewards. One rewards the model based on the similarity of words in its prompts, and the other rewards the model based on semantic similarity. (Less similarity yields a higher reward.)

To prevent the red-team model from generating random, nonsensical text, which could trick the classifier into awarding a high toxicity score, the researchers also added a naturalistic language bonus to the training objective; a rough sketch of how these terms could be combined follows.
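The function below is an illustrative reconstruction of such a combined reward, assuming the novelty terms grow as lexical and semantic similarity shrink. The weighting coefficients, similarity measures, and naturalness score are assumptions for the sake of the sketch, not values or formulas from the paper.

```python
def curiosity_reward(
    toxicity: float,             # safety classifier's rating of the chatbot's response
    entropy_bonus: float,        # encourages randomness as the policy explores
    word_similarity: float,      # lexical overlap with previously generated prompts, in [0, 1]
    semantic_similarity: float,  # embedding similarity to previous prompts, in [0, 1]
    naturalness: float,          # language-model score penalizing nonsensical text
    w_entropy: float = 0.1,      # illustrative weights, not the paper's values
    w_novelty: float = 0.5,
    w_natural: float = 0.1,
) -> float:
    """Combine toxicity with curiosity terms: novelty rewards rise as similarity falls."""
    word_novelty = 1.0 - word_similarity
    semantic_novelty = 1.0 - semantic_similarity
    return (
        toxicity
        + w_entropy * entropy_bonus
        + w_novelty * (word_novelty + semantic_novelty)
        + w_natural * naturalness
    )
```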

With these additions in place, the researchers compared the toxicity and diversity of responses their red-team model generated against those produced by other automated techniques. Their model outperformed the baselines on both metrics.

They also used their red-team model to test a chatbot that had been fine-tuned with human feedback so it would not give toxic replies. Their curiosity-driven approach was able to quickly produce 196 prompts that elicited toxic responses from this “safe” chatbot.

    “We are seeing a surge of models, which is only expected to rise. Imagine thousands of models or even more and companies/labs pushing model updates frequently. These models are going to be an integral part of our lives and it’s important that they are verified before released for public consumption. Manual verification of models is simply not scalable, and our work is an attempt to reduce the human effort to ensure a safer and trustworthy AI future,” says Agrawal.  

In the future, the researchers want to enable the red-team model to generate prompts about a wider variety of topics. They also want to explore using a large language model as the toxicity classifier. In this way, a user could train the toxicity classifier with a company policy document, for example, so a red-team model could test a chatbot for company policy violations.

    “If you are releasing a new AI model and are concerned about whether it will behave as expected, consider using curiosity-driven red-teaming,” says Agrawal.

This research is funded, in part, by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, the U.S. Army Research Office, the U.S. Defense Advanced Research Projects Agency Machine Common Sense Program, the U.S. Office of Naval Research, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.
