    AI

    Looking for a specific action in a video? This AI-based method can find it for you | Ztoog


The internet is awash in instructional videos that can teach curious viewers everything from cooking the perfect pancake to performing a life-saving Heimlich maneuver.

But pinpointing when and where a particular action happens in a long video can be tedious. To streamline the process, scientists are trying to teach computers to perform this task. Ideally, a user could simply describe the action they’re looking for, and an AI model would skip to its location in the video.
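The intended workflow described above, in which a free-text query maps to a moment in a video, can be caricatured as a nearest-neighbor search between a query embedding and per-frame embeddings. This is a hypothetical sketch, not the authors' system; the embedding vectors are stand-ins for the output of a real pretrained video-language model:

```python
# Hypothetical sketch: locating a described action by comparing a text
# query embedding against per-frame embeddings. Real systems would get
# these vectors from a pretrained video-language model.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def locate_action(query_vec, frame_vecs):
    """Return the index (e.g., the second) of the frame whose embedding
    is most similar to the query embedding."""
    scores = [cosine(query_vec, f) for f in frame_vecs]
    return max(range(len(scores)), key=scores.__getitem__)
```

Here the "location" is just the best-scoring frame index; the research discussed below goes further by also localizing the action spatially within the frame.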

However, teaching machine-learning models to do this usually requires a great deal of expensive video data that have been painstakingly hand-labeled.

A new, more efficient approach from researchers at MIT and the MIT-IBM Watson AI Lab trains a model to perform this task, known as spatio-temporal grounding, using only videos and their automatically generated transcripts.

The researchers teach a model to understand an unlabeled video in two distinct ways: by looking at small details to figure out where objects are located (spatial information) and by looking at the bigger picture to understand when an action occurs (temporal information).

Compared to other AI approaches, their method more accurately identifies actions in longer videos with multiple activities. Interestingly, they found that simultaneously training on spatial and temporal information makes a model better at identifying each individually.

In addition to streamlining online learning and virtual training processes, this technique could also be useful in health care settings by rapidly finding key moments in videos of diagnostic procedures, for example.

“We disentangle the challenge of trying to encode spatial and temporal information all at once and instead think about it like two experts working on their own, which turns out to be a more explicit way to encode the information. Our model, which combines these two separate branches, leads to the best performance,” says Brian Chen, lead author of a paper on this technique.

Chen, a 2023 graduate of Columbia University who conducted this research while a visiting scholar at the MIT-IBM Watson AI Lab, is joined on the paper by James Glass, senior research scientist, member of the MIT-IBM Watson AI Lab, and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); Hilde Kuehne, a member of the MIT-IBM Watson AI Lab who is also affiliated with Goethe University Frankfurt; and others at MIT, Goethe University, the MIT-IBM Watson AI Lab, and Quality Match GmbH. The research will be presented at the Conference on Computer Vision and Pattern Recognition.

Global and local learning

Researchers usually teach models to perform spatio-temporal grounding using videos in which humans have annotated the start and end times of particular tasks.

Not only is generating these data expensive, but it can be difficult for humans to figure out exactly what to label. If the action is “cooking a pancake,” does that action start when the chef begins mixing the batter or when she pours it into the pan?

    “This time, the task may be about cooking, but next time, it might be about fixing a car. There are so many different domains for people to annotate. But if we can learn everything without labels, it is a more general solution,” Chen says.

For their approach, the researchers use unlabeled instructional videos and accompanying text transcripts from a website like YouTube as training data. These don’t need any special preparation.

They split the training process into two pieces. For one, they teach a machine-learning model to look at the entire video to understand what actions happen at certain times. This high-level information is called a global representation.

For the second, they teach the model to focus on a specific region in parts of the video where action is happening. In a large kitchen, for instance, the model might only need to focus on the wooden spoon a chef is using to mix pancake batter, rather than the entire counter. This fine-grained information is called a local representation.
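The two-branch idea can be caricatured as follows. This is a toy sketch, not the authors' architecture: one branch produces a temporal score per frame (when the action happens) and the other a spatial score per candidate region within each frame (where it happens), and a simple additive combination picks the best frame-region pair:

```python
# Toy sketch of a two-branch grounding model (illustrative only, not the
# paper's architecture): a temporal branch scores each frame, a spatial
# branch scores regions within each frame, and the model grounds the
# action at the (frame, region) pair with the best combined score.

def ground_action(temporal_scores, spatial_scores):
    """temporal_scores: list[float], one score per frame.
    spatial_scores: list[list[float]], region scores for each frame.
    Returns (best_frame, best_region) under an additive combination."""
    best = None  # (score, frame, region)
    for t, (t_score, regions) in enumerate(zip(temporal_scores, spatial_scores)):
        for r, s_score in enumerate(regions):
            combined = t_score + s_score
            if best is None or combined > best[0]:
                best = (combined, t, r)
    return best[1], best[2]
```

The quoted intuition above, of "two experts working on their own," corresponds here to the two score lists being produced independently before being combined.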

The researchers incorporate an additional component into their framework to mitigate misalignments that occur between narration and video. Perhaps the chef talks about cooking the pancake first and performs the action later.
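One simple way such misalignment can be tolerated, sketched here as an assumption rather than the paper's actual mechanism, is to match a transcript sentence not to the frame at its exact timestamp but to the best-matching frame within a window around it:

```python
# Hedged sketch of tolerating narration/video misalignment (illustrative
# only): search a window around a sentence's nominal timestamp and keep
# the best-matching frame, rather than trusting the timestamp exactly.

def align_sentence(sim_to_frames, nominal_t, window=5):
    """sim_to_frames: similarity of one transcript sentence to each frame.
    nominal_t: the frame index implied by the sentence's timestamp.
    Returns the best-matching frame index within +/- window of nominal_t."""
    lo = max(0, nominal_t - window)
    hi = min(len(sim_to_frames), nominal_t + window + 1)
    return max(range(lo, hi), key=sim_to_frames.__getitem__)
```

In the pancake example, the chef's narration at one timestamp would be allowed to ground to the visually matching frames a few seconds later.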

To develop a more realistic solution, the researchers focused on uncut videos that are several minutes long. In contrast, most AI techniques train using few-second clips that someone has trimmed to show only one action.

A new benchmark

But when they came to evaluate their approach, the researchers couldn’t find an effective benchmark for testing a model on these longer, uncut videos, so they created one.

To build their benchmark dataset, the researchers devised a new annotation technique that works well for identifying multistep actions. They had users mark the intersection of objects, like the point where a knife edge cuts a tomato, rather than drawing a box around important objects.

    “This is more clearly defined and speeds up the annotation process, which reduces the human labor and cost,” Chen says.

Plus, having multiple people do point annotation on the same video can better capture actions that occur over time, like the flow of milk being poured. All annotators won’t mark the exact same point in the flow of liquid.
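Combining several annotators' point clicks can be sketched as follows. This aggregation scheme is an assumption for illustration, not taken from the paper: the mean point gives a consensus location, and the average distance from that mean hints at how spread out the action is:

```python
# Sketch (assumed, not from the paper) of aggregating several annotators'
# point clicks for one action: the mean click is a consensus location,
# and the average distance from it measures the spatial spread.

def aggregate_points(points):
    """points: list of (x, y) clicks from different annotators.
    Returns (mean_point, mean_distance_from_mean)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    spread = sum(((p[0] - mx) ** 2 + (p[1] - my) ** 2) ** 0.5
                 for p in points) / n
    return (mx, my), spread
```

For a pour of milk, a large spread would reflect annotators marking different points along the flow, which is exactly the temporal extent the text describes.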

When they used this benchmark to test their approach, the researchers found that it was more accurate at pinpointing actions than other AI techniques.
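Accuracy on point annotations can be scored with a pointing-style metric. The exact metric the paper uses may differ; the sketch below simply counts a prediction as a hit when it lands within a tolerance radius of the annotated point:

```python
# Hedged sketch of a pointing-style evaluation (the paper's actual metric
# may differ): a prediction counts as a hit when the predicted point is
# within a tolerance radius of the annotated ground-truth point.

def pointing_accuracy(preds, truths, tol=10.0):
    """preds, truths: parallel lists of (x, y) points.
    Returns the fraction of predictions within tol of the ground truth."""
    hits = 0
    for (px, py), (tx, ty) in zip(preds, truths):
        if ((px - tx) ** 2 + (py - ty) ** 2) ** 0.5 <= tol:
            hits += 1
    return hits / len(preds)
```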

Their method was also better at focusing on human-object interactions. For instance, if the action is “serving a pancake,” many other approaches might focus only on key objects, like a stack of pancakes sitting on a counter. Instead, their method focuses on the actual moment when the chef flips a pancake onto a plate.

Next, the researchers plan to enhance their approach so models can automatically detect when text and narration are not aligned, and switch focus from one modality to the other. They also want to extend their framework to audio data, since there are usually strong correlations between actions and the sounds objects make.

This research is funded, in part, by the MIT-IBM Watson AI Lab.
