Ztoog
    GPT-4o’s Chinese token-training data is polluted by spam and porn websites


    The new tokenizer has 200,000 tokens in total, and about 25% are in non-English languages, says Deedy Das, an AI investor at Menlo Ventures. He used language filters to count the number of tokens in different languages, and the top languages, besides English, are Russian, Arabic, and Vietnamese.
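To make the counting step concrete, here is a minimal sketch of one way to bucket vocabulary entries by writing system. It is an illustration under stated assumptions, not Das's actual method: the article does not say which language filters he used, the `toy_vocab` list is invented, and the helper names `script_of` and `count_scripts` are hypothetical. A script-based heuristic like this also cannot separate Vietnamese from English, since both use Latin letters, so a real analysis would need a proper language identifier.

```python
import unicodedata
from collections import Counter

# Coarse script labels, matched against the start of each Unicode character name.
SCRIPTS = ("CYRILLIC", "ARABIC", "CJK", "DEVANAGARI", "BENGALI", "LATIN")

def script_of(token: str) -> str:
    """Label a token by the script of its first alphabetic character."""
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            for script in SCRIPTS:
                if name.startswith(script):
                    return script
            return "OTHER"
    return "NONE"  # no letters at all: digits, punctuation, whitespace

def count_scripts(vocab):
    """Count how many tokens fall under each script label."""
    return Counter(script_of(token) for token in vocab)

# Invented stand-in for a real tokenizer vocabulary (assumption).
toy_vocab = [" the", "ing", " университет", "привет", "مرحبا",
             "विश्वविद्यालय", "大学", "123", " việt"]

counts = count_scripts(toy_vocab)
non_latin_share = 1 - (counts["LATIN"] + counts["NONE"]) / len(toy_vocab)
print(counts)
print(f"share of tokens in non-Latin scripts: {non_latin_share:.0%}")
```

Run against a real 200,000-entry vocabulary instead of the toy list, the same loop would give a rough per-script breakdown comparable to the figures quoted above.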

    “So the tokenizer’s main impact, in my opinion, is you get the cost down in these languages, not that the quality in these languages goes dramatically up,” Das says. When an LLM has better and longer tokens in non-English languages, it can analyze prompts faster and charge users less for the same answer. With the new tokenizer, “you’re looking at almost four times cost reduction,” he says.
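The cost argument is simple arithmetic: if billing is per token and the per-token price stays the same, a prompt that needs a quarter as many tokens costs a quarter as much. The sketch below works through that with made-up numbers. The token counts, the price, and the `prompt_cost` helper are all hypothetical and chosen only to illustrate a fourfold reduction; they are not measurements or published prices.

```python
import math

def prompt_cost(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Cost of a prompt when billing is linear in the token count."""
    return num_tokens / 1000 * price_per_1k_tokens

# Hypothetical figures (assumptions, not real prices or measurements).
PRICE_PER_1K = 0.01   # same per-token price under both tokenizers
old_tokens = 400      # the same sentence under the older tokenizer
new_tokens = 100      # the same sentence under the newer tokenizer

old_cost = prompt_cost(old_tokens, PRICE_PER_1K)
new_cost = prompt_cost(new_tokens, PRICE_PER_1K)
reduction = old_cost / new_cost

print(f"cost reduction factor: about {reduction:.1f}x")
```

Because the price cancels out, the reduction factor equals the ratio of token counts, which is why longer tokens in a language translate directly into cheaper prompts.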

    Das, who also speaks Hindi and Bengali, took a look at the longest tokens in those languages. The tokens reflect discussions happening in those languages, so they include words like “Narendra” or “Pakistan,” but common English terms like “Prime Minister,” “university,” and “international” also come up frequently. They also don’t exhibit the issues surrounding the Chinese tokens.

    That likely reflects the training data in those languages, Das says: “My working theory is the websites in Hindi and Bengali are very rudimentary. It’s like [mostly] news articles. So I would expect this to be the case. There are not many spam bots and porn websites trying to happen in these languages. It’s mostly going to be in English.”

    Polluted data and a lack of cleaning

    However, things are drastically different in Chinese. According to multiple researchers who have looked into the new library of tokens used for GPT-4o, the longest tokens in Chinese are almost exclusively spam words used in pornography, gambling, and scamming contexts. Even shorter tokens, like three-character-long Chinese words, reflect those topics to a significant degree.
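The kind of inspection described here, listing the longest Chinese tokens in a vocabulary, can be sketched in a few lines. This is an illustrative approach, not the researchers' actual procedure: the `toy_vocab` entries are benign invented examples, the helper names are hypothetical, and the check covers only the basic CJK Unified Ideographs range (U+4E00 to U+9FFF).

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def longest_cjk_tokens(vocab, n=3):
    """Return the n longest tokens made up entirely of CJK ideographs."""
    cjk_tokens = []
    for token in vocab:
        text = token.strip()  # ignore leading/trailing whitespace in tokens
        if text and all(is_cjk(ch) for ch in text):
            cjk_tokens.append(text)
    # sorted() is stable, so equal-length tokens keep their vocabulary order.
    return sorted(cjk_tokens, key=len, reverse=True)[:n]

# Invented, benign stand-in for a real tokenizer vocabulary (assumption).
toy_vocab = ["the", "大学", "人工智能", " international", "中华人民共和国", "国际"]

top_two = longest_cjk_tokens(toy_vocab, n=2)
print(top_two)
```

Reading the top of such a list by hand is what makes the pollution visible: in a clean corpus the longest tokens would be common phrases, whereas the researchers report finding mostly spam terms.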

    “The problem is clear: the corpus used to train [the tokenizer] is not clean. The English tokens seem fine, but the Chinese ones are not,” says Cai from Princeton University. It is not rare for a language model to crawl spam when collecting training data, but usually significant effort goes into cleaning up the data before it is used. “It’s possible that they didn’t do proper data clearing when it comes to Chinese,” he says.

    The content of these Chinese tokens could suggest that they have been polluted by a specific phenomenon: websites hijacking unrelated content in Chinese or other languages to boost spam messages.

    These messages are often advertisements for pornography videos and gambling websites. They could be real businesses or merely scams. The language is inserted into content farm websites, or sometimes legitimate websites, so that it can be indexed by search engines, circumvent spam filters, and come up in random searches. For example, Google indexed one search result page on a US National Institutes of Health website, which lists a porn site in Chinese. The same site name also appeared in at least five Chinese tokens in GPT-4o.

    © 2025 Ztoog.
