Looking for a specific action in a video? This AI-based method can find it for you

The web is awash in instructional videos that can teach curious viewers everything from cooking the perfect pancake to performing a life-saving Heimlich maneuver.

But pinpointing when and where a particular action happens in a long video can be tedious. To streamline the process, scientists are trying to teach computers to perform this task. Ideally, a user could simply describe the action they are looking for, and an AI model would skip to its location in the video.

However, teaching machine-learning models to do this typically requires a great deal of expensive video data that have been painstakingly hand-labeled.

A new, more efficient approach from researchers at MIT and the MIT-IBM Watson AI Lab trains a model to perform this task, known as spatio-temporal grounding, using only videos and their automatically generated transcripts.

The researchers teach a model to understand an unlabeled video in two distinct ways: by looking at small details to figure out where objects are located (spatial information) and by looking at the bigger picture to understand when an action occurs (temporal information).

Compared to other AI approaches, their method more accurately identifies actions in longer videos with multiple activities. Interestingly, they found that simultaneously training on spatial and temporal information makes a model better at identifying each individually.

In addition to streamlining online learning and virtual training processes, this technique could also be useful in health care settings by rapidly finding key moments in videos of diagnostic procedures, for example.

“We disentangle the challenge of trying to encode spatial and temporal information all at once and instead think about it like two experts working on their own, which turns out to be a more explicit way to encode the information. Our model, which combines these two separate branches, leads to the best performance,” says Brian Chen, lead author of a paper on this technique.

Chen, a 2023 graduate of Columbia University who conducted this research while a visiting scholar at the MIT-IBM Watson AI Lab, is joined on the paper by James Glass, senior research scientist, member of the MIT-IBM Watson AI Lab, and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); Hilde Kuehne, a member of the MIT-IBM Watson AI Lab who is also affiliated with Goethe University Frankfurt; and others at MIT, Goethe University, the MIT-IBM Watson AI Lab, and Quality Match GmbH. The research will be presented at the Conference on Computer Vision and Pattern Recognition.

Global and local learning

Researchers typically teach models to perform spatio-temporal grounding using videos in which humans have annotated the start and end times of particular tasks.

Not only is generating these data expensive, but it can be difficult for humans to figure out exactly what to label. If the action is “cooking a pancake,” does that action start when the chef begins mixing the batter or when she pours it into the pan?

“This time, the task may be about cooking, but next time, it might be about fixing a car. There are so many different domains for people to annotate. But if we can learn everything without labels, it is a more general solution,” Chen says.

For their approach, the researchers use unlabeled instructional videos and accompanying text transcripts from a website like YouTube as training data. These don’t need any special preparation.
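
Auto-generated captions already carry rough timing, so they can serve as free, if noisy, supervision. Below is a minimal sketch of how such a transcript might be turned into timed training pairs; the WebVTT parsing and the pairing rule are illustrative assumptions, not the paper’s actual pipeline.

    import re

    # Matches one WebVTT cue: "HH:MM:SS.mmm --> HH:MM:SS.mmm" plus its text.
    CUE_RE = re.compile(
        r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\n(.+?)(?:\n\n|\Z)",
        re.S,
    )

    def to_seconds(timestamp: str) -> float:
        hours, minutes, seconds = timestamp.split(":")
        return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

    def transcript_pairs(vtt_text: str):
        """Yield (start_sec, end_sec, sentence) tuples from an auto-generated
        WebVTT transcript. Each narrated sentence becomes a free label for
        the span of video it overlaps -- no human annotation required."""
        for start, end, text in CUE_RE.findall(vtt_text):
            yield to_seconds(start), to_seconds(end), " ".join(text.split())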

They split the training process into two pieces. For one, they teach a machine-learning model to look at the entire video to understand what actions happen at certain times. This high-level information is called a global representation.

For the second, they teach the model to focus on a specific region in parts of the video where action is happening. In a large kitchen, for instance, the model might only need to focus on the wooden spoon a chef is using to mix pancake batter, rather than the entire counter. This fine-grained information is called a local representation.

The researchers incorporate an additional component into their framework to mitigate misalignments that occur between narration and video. Perhaps the chef talks about cooking the pancake first and performs the action later.
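
To make the structure concrete, here is a minimal PyTorch sketch of the two-branch idea: a temporal head matches whole clips to the transcript sentences narrating them, a spatial head attends over regions within frames, and the two losses are summed so the branches train jointly. The module names, feature shapes, and loss choices are illustrative assumptions, not the authors’ architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoBranchGrounder(nn.Module):
        """Hypothetical two-branch model: a global (temporal) head and a
        local (spatial) head trained jointly against sentence embeddings."""

        def __init__(self, dim: int = 512):
            super().__init__()
            self.temporal_head = nn.Linear(dim, dim)  # clip-level features
            self.spatial_head = nn.Linear(dim, dim)   # region-level features

        def forward(self, clip_feats, region_feats, text_feats):
            # clip_feats:   (T, dim)    one feature per time step
            # region_feats: (T, R, dim) R spatial regions per time step
            # text_feats:   (T, dim)    one sentence embedding per time step
            g = F.normalize(self.temporal_head(clip_feats), dim=-1)
            l = F.normalize(self.spatial_head(region_feats), dim=-1)
            t = F.normalize(text_feats, dim=-1)

            # Global branch: contrastively match each sentence to the time
            # step it narrates (when does the action happen?).
            logits = g @ t.T
            targets = torch.arange(len(g), device=g.device)
            loss_temporal = F.cross_entropy(logits, targets)

            # Local branch: attend over regions within each time step and
            # pull the attended feature toward its sentence (where is it?).
            attn = torch.softmax((l * t[:, None, :]).sum(-1), dim=-1)  # (T, R)
            attended = (attn[..., None] * l).sum(dim=1)                # (T, dim)
            loss_spatial = 1.0 - (attended * t).sum(-1).mean()

            # Summing the losses trains both branches at once; the paper
            # reports that joint training improves each individually.
            return loss_temporal + loss_spatial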

To develop a more realistic solution, the researchers focused on uncut videos that are several minutes long. In contrast, most AI techniques train using few-second clips that someone trimmed to show only one action.

A new benchmark

But when they came to evaluate their approach, the researchers couldn’t find an effective benchmark for testing a model on these longer, uncut videos, so they created one.

To build their benchmark dataset, the researchers devised a new annotation technique that works well for identifying multistep actions. They had users mark the intersection of objects, like the point where a knife edge cuts a tomato, rather than drawing a box around important objects.

“This is more clearly defined and speeds up the annotation process, which reduces the human labor and cost,” Chen says.

Plus, having multiple people perform point annotation on the same video can better capture actions that occur over time, like the flow of milk being poured. All annotators won’t mark the exact same point in the flow of liquid.
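
Point annotations suggest a simple scoring rule in the spirit of the “pointing game”: a prediction counts as correct when the model’s peak response lands near any annotator’s click for that frame. The sketch below illustrates one such metric; the radius, data layout, and hit rule are assumptions for illustration, not the benchmark’s official protocol.

    import numpy as np

    def pointing_accuracy(pred_heatmaps, gt_points, radius=10.0):
        """pred_heatmaps: iterable of (H, W) arrays, one per annotated frame.
        gt_points: per-frame lists of (x, y) annotator clicks."""
        hits, total = 0, 0
        for heatmap, points in zip(pred_heatmaps, gt_points):
            total += 1
            # Location of the model's strongest response in this frame.
            y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
            # Several annotators give several valid points (e.g., along a
            # stream of pouring milk); matching any one counts as a hit.
            if any((px - x) ** 2 + (py - y) ** 2 <= radius ** 2
                   for px, py in points):
                hits += 1
        return hits / total if total else 0.0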

When they used this benchmark to test their approach, the researchers found that it was more accurate at pinpointing actions than other AI techniques.

Their method was also better at focusing on human-object interactions. For instance, if the action is “serving a pancake,” many other approaches might focus only on key objects, like a stack of pancakes sitting on a counter. Instead, their method focuses on the actual moment when the chef flips a pancake onto a plate.

Next, the researchers plan to enhance their approach so models can automatically detect when text and narration are not aligned, and switch focus from one modality to the other. They also want to extend their framework to audio data, since there are usually strong correlations between actions and the sounds objects make.

This research is funded, in part, by the MIT-IBM Watson AI Lab.
