Close Menu
Ztoog
    What's Hot
    Crypto

    Tether Expands Dollar, Gold Stablecoins to Boost Telegram Payments

    Crypto

    Here’s How Much Michael Saylor’s Personal Bitcoin Stash Is Now Worth

    AI

    Google DeepMind and Anthropic Researchers Introduce Equal-Info Windows: A Groundbreaking AI Method for Efficient LLM Training on Compressed Text

    Important Pages:
    • About Us
    • Contact us
    • Privacy Policy
    • Terms & Conditions
    Facebook X (Twitter) Instagram Pinterest
    Facebook X (Twitter) Instagram Pinterest
    Ztoog
    • Home
    • The Future

      OPPO launches A5 Pro 5G: Premium features at a budget price

      How I Turn Unstructured PDFs into Revenue-Ready Spreadsheets

      Is it the best tool for 2025?

      The clocks that helped define time from London’s Royal Observatory

      Summer Movies Are Here, and So Are the New Popcorn Buckets

    • Technology

      What It Is and Why It Matters—Part 1 – O’Reilly

      Ensure Hard Work Is Recognized With These 3 Steps

      Cicada map 2025: Where will Brood XIV cicadas emerge this spring?

      Is Duolingo the face of an AI jobs crisis?

      The US DOD transfers its AI-based Open Price Exploration for National Security program to nonprofit Critical Minerals Forum to boost Western supply deals (Ernest Scheyder/Reuters)

    • Gadgets

      Maono Caster G1 Neo & PD200X Review: Budget Streaming Gear for Aspiring Creators

      Apple plans to split iPhone 18 launch into two phases in 2026

      Upgrade your desk to Starfleet status with this $95 USB-C hub

      37 Best Graduation Gift Ideas (2025): For College Grads

      Backblaze responds to claims of “sham accounting,” customer backups at risk

    • Mobile

      Samsung Galaxy S25 Edge promo materials leak

      What are people doing with those free T-Mobile lines? Way more than you’d expect

      Samsung doesn’t want budget Galaxy phones to use exclusive AI features

      COROS’s charging adapter is a neat solution to the smartwatch charging cable problem

      Fortnite said to return to the US iOS App Store next week following court verdict

    • Science

      Nothing is stronger than quantum connections – and now we know why

      Failed Soviet probe will soon crash to Earth – and we don’t know where

      Trump administration cuts off all future federal funding to Harvard

      Does kissing spread gluten? New research offers a clue.

      Why Balcony Solar Panels Haven’t Taken Off in the US

    • AI

      Hybrid AI model crafts smooth, high-quality videos in seconds | Ztoog

      How to build a better AI benchmark

      Q&A: A roadmap for revolutionizing health care through data-driven innovation | Ztoog

      This data set helps researchers spot harmful stereotypes in LLMs

      Making AI models more trustworthy for high-stakes settings | Ztoog

    • Crypto

      Ethereum Breaks Key Resistance In One Massive Move – Higher High Confirms Momentum

      ‘The Big Short’ Coming For Bitcoin? Why BTC Will Clear $110,000

      Bitcoin Holds Above $95K Despite Weak Blockchain Activity — Analytics Firm Explains Why

      eToro eyes US IPO launch as early as next week amid easing concerns over Trump’s tariffs

      Cardano ‘Looks Dope,’ Analyst Predicts Big Move Soon

    Ztoog
    Home » A New AI Research from KAIST Introduces FLASK: A Fine-Grained Evaluation Framework for Language Models Based on Skill Sets
    AI

    A New AI Research from KAIST Introduces FLASK: A Fine-Grained Evaluation Framework for Language Models Based on Skill Sets

    Facebook Twitter Pinterest WhatsApp
    A New AI Research from KAIST Introduces FLASK: A Fine-Grained Evaluation Framework for Language Models Based on Skill Sets
    Share
    Facebook Twitter LinkedIn Pinterest WhatsApp

    Incredibly, LLMs have confirmed to match with human values, offering useful, sincere, and innocent responses. In specific, this functionality has been drastically enhanced by strategies that fine-tune a pretrained LLM on varied duties or person preferences, corresponding to instruction tuning and reinforcement studying from human suggestions (RLHF). Recent analysis means that by evaluating fashions solely primarily based on binary human/machine selection, open-sourced fashions skilled by way of dataset distillation from proprietary fashions can shut the efficiency hole with the proprietary LLMs.

    Researchers in pure language processing (NLP) have proposed a brand new analysis protocol known as FLASK (Fine-grained Language Model Evaluation primarily based on Alignment Skill Sets) to handle the shortcomings of present analysis settings. This protocol refines the standard coarse-grained scoring course of right into a extra fine-grained scoring setup, permitting instance-wise task-agnostic talent analysis relying on the given instruction.

    For a radical analysis of language mannequin efficiency, researchers outline 4 main skills which are additional damaged down into 12 fine-grained abilities:

    🚀 Build high-quality coaching datasets with Kili Technology and resolve NLP machine studying challenges to develop highly effective ML functions
    1. Reasoning that’s logical (within the sense of being right, sturdy, and efficient)
    2. Facts and customary sense are examples of background information.
    3. Problem-Solving (Grasping, Insight, Completion, and Metacognition)
    4. Consistency with User Preferences (Brevity, Readability, and Safety).

    Researchers additionally annotate the occasion with details about the domains during which it happens, the extent of problem, and the associated set of abilities (a talent set). Then, both human evaluators or cutting-edge LLMs1 provides every occasion’s given abilities a rating between 1 and 5. By permitting for an in depth research of the mannequin’s efficiency primarily based on the talent set, goal area, and problem, FLASK gives a complete image of LLM efficiency. They use FLASK for each model-based and human-based analysis to judge and distinction LLMs from completely different open-source and proprietary sources, every of which has its mannequin measurement and technique of fine-tuning.

    The researchers current a number of findings:

    • They discover that even probably the most superior open-source LLMs are underperforming proprietary LLMs by about 25% and 10% in Logical Thinking and Background Knowledge skills, respectively.
    • They additionally discover that for studying varied abilities, different-sized fashions are wanted. Skills like Conciseness and Insightfulness, for occasion, attain a ceiling after a sure measurement, though bigger fashions profit extra from coaching in Logical Correctness.
    • They display that even cutting-edge proprietary LLMs endure efficiency drops of as much as 50% on the FLASK-HARD set, a subset of the FLASK evaluation set from which solely laborious examples are picked.

    Both researchers and practitioners can profit from FLASK’s thorough evaluation of LLMs. FLASK facilitates exact understanding of the present state of a mannequin, offering specific steps for enhancing mannequin alignment. For occasion, in accordance with FLASK’s findings, firms creating non-public LLMs ought to develop fashions that rating nicely on the FLASK-HARD set. At the identical time, the open-source neighborhood ought to work on creating fundamental fashions with excessive Logical Thinking and Background Knowledge skills. FLASK helps practitioners advocate fashions most suited to their wants by offering a fine-grained comparability of LLMs.

    Researchers have recognized the next 4 core skills, damaged down into a complete of twelve abilities, as being essential for profitable adherence to person directions:

    1. Stability in Reasoning

    Does the mannequin assure that the steps within the instruction’s logic chain are constant and freed from contradictions? This entails eager about particular circumstances and missing counterexamples when fixing coding and math difficulties.

    2. Validity of Reasoning

    Is the response’s ultimate reply logically correct and proper when utilized to a command with a hard and fast consequence?

    3. Efficient Use of Reason

    Is there an efficient use of reasoning within the reply? The cause behind the response must be simple and time-efficient, with no pointless steps. The advisable answer ought to take into account the time complexity of the work if it entails coding.

    4. Typical Realization

    When given directions that decision for a simulation of the expected consequence or that decision for widespread sense or spatial reasoning, how nicely does the mannequin perceive these notions from the actual world?

    5. Veracity

    When factual information retrieval was required, did the mannequin extract the mandatory context data with out introducing any errors? Is there documentation or a quotation of the place one obtained that data to assist the declare?

    6. Reflective considering

    Did the mannequin’s response mirror an understanding of its efficacy? Did the mannequin state its constraints when it lacked data or competence to supply a reliable response, corresponding to when given complicated or unsure directions?

    7. Perceptiveness

    Does the response supply something new or completely different, corresponding to a unique take on one thing or a contemporary manner of one thing?

    Eighth, Fullness

    Does the reply adequately clarify the issue? The breadth of subjects addressed and the amount of element provided inside every subject point out the response’s comprehensiveness and completeness.

    9. Understanding

    Does the response meet the wants of the instruction by supplying obligatory particulars, particularly when these particulars are quite a few and sophisticated? This entails responding to each the said and unspoken targets of directions.

    10. Brevity

    Does the response present the related data with out rambling on?

    11. Ease of Reading

    How well-organized and coherent is the reply? Does the reply display excellent group?

    12. No Harm

    Does the mannequin’s reply lack prejudice primarily based on sexual orientation, race, or faith? Does it take into account the person’s security, avoiding offering responses that would trigger hurt or put the person at risk?

    In conclusion, researchers who research LLMs advocate that the open-source neighborhood enhance base fashions with enhanced logic and information. In distinction, builders of proprietary LLMs work to spice up their fashions’ efficiency on the FLASK-HARD set, a very tough subset of FLASK. FLASK will assist them enhance their fundamental fashions and higher perceive different LLMs to make use of of their work. Furthermore, there could also be eventualities when 12 granular skills are inadequate, corresponding to when FLASK is utilized in a domain-specific surroundings. In addition, current discoveries of LLM skills counsel that future fashions with stronger skills and abilities would require reclassifying the basic capabilities and abilities.


    Check out the Paper and Demo. All Credit For This Research Goes To the Researchers on This Project. Also, don’t overlook to hitch our 26k+ ML SubReddit, Discord Channel, and Email Newsletter, the place we share the most recent AI analysis information, cool AI initiatives, and extra.


    Dhanshree Shenwai is a Computer Science Engineer and has an excellent expertise in FinTech corporations protecting Financial, Cards & Payments and Banking area with eager curiosity in functions of AI. She is keen about exploring new applied sciences and developments in at the moment’s evolving world making everybody’s life straightforward.


    🔥 Gain a aggressive
    edge with information: Actionable market intelligence for world manufacturers, retailers, analysts, and buyers. (Sponsored)

    Share. Facebook Twitter Pinterest LinkedIn WhatsApp

    Related Posts

    AI

    Hybrid AI model crafts smooth, high-quality videos in seconds | Ztoog

    AI

    How to build a better AI benchmark

    AI

    Q&A: A roadmap for revolutionizing health care through data-driven innovation | Ztoog

    AI

    This data set helps researchers spot harmful stereotypes in LLMs

    AI

    Making AI models more trustworthy for high-stakes settings | Ztoog

    AI

    The AI Hype Index: AI agent cyberattacks, racing robots, and musical models

    AI

    Novel method detects microbial contamination in cell cultures | Ztoog

    AI

    Seeing AI as a collaborator, not a creator

    Leave A Reply Cancel Reply

    Follow Us
    • Facebook
    • Twitter
    • Pinterest
    • Instagram
    Top Posts
    Crypto

    Who Accepts Bitcoin as Payment?

    Over the course of its existence, Bitcoin has steadily risen to change into the unmatched…

    The Future

    Chinese scientists develop drone with ‘human brain’ capable of engaging in ‘group chats’

    A workforce of Chinese scientists has achieved a big breakthrough in drone expertise, making a…

    Technology

    Radar Trends to Watch: November 2023 – O’Reilly

    Our Security part has grown nearly as giant as AI (and longer than Programming)—and that’s…

    Mobile

    Samsung Galaxy Buds FE spotted in regulatory listing

    Lily Katz / Android AuthorityTL;DR A regulatory listing of a brand new Galaxy Buds has…

    Mobile

    Why the Pixel 8’s Face Unlock upgrade is a big deal

    When Google unveiled the Pixel 8 collection final week, they centered on all the new…

    Our Picks
    Mobile

    Top 10 trending phones of week 23

    Science

    Hassell and ESA unveil their concept for a permanent base on the moon

    Gadgets

    98Q80C: Samsung Unveils Affordable 98-Inch QLED TV

    Categories
    • AI (1,483)
    • Crypto (1,745)
    • Gadgets (1,796)
    • Mobile (1,839)
    • Science (1,854)
    • Technology (1,790)
    • The Future (1,636)
    Most Popular
    The Future

    Comun channels local banking approach to serve Latino immigrants

    Science

    Jupiter jet stream captured by James Webb Space Telescope

    AI

    How Adobe’s bet on non-exploitative AI is paying off

    Ztoog
    Facebook X (Twitter) Instagram Pinterest
    • Home
    • About Us
    • Contact us
    • Privacy Policy
    • Terms & Conditions
    © 2025 Ztoog.

    Type above and press Enter to search. Press Esc to cancel.