The new tokenizer has 200,000 tokens in total, and about 25% are in non-English languages, says Deedy Das, an AI investor at Menlo Ventures. He used language filters to count the number of tokens in different languages, and the top languages, besides English, are Russian, Arabic, and Vietnamese.
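A rough version of that kind of per-language count can be done by classifying each token by the Unicode script of its letters. The sketch below is illustrative only: the mini-vocabulary and script buckets are hypothetical, and a real analysis like Das's would use a proper language-detection library (Vietnamese, for instance, uses Latin script, so a script filter alone cannot separate it from English).

```python
import unicodedata
from collections import Counter

def dominant_script(token: str) -> str:
    """Classify a token by the Unicode script of the majority of its letters."""
    scripts = Counter()
    for ch in token:
        if ch.isalpha():
            # Unicode character names begin with the script, e.g.
            # "CYRILLIC SMALL LETTER PE" or "CJK UNIFIED IDEOGRAPH-4F60".
            name = unicodedata.name(ch, "")
            if name.startswith("CYRILLIC"):
                scripts["Cyrillic"] += 1
            elif name.startswith("ARABIC"):
                scripts["Arabic"] += 1
            elif name.startswith("CJK"):
                scripts["CJK"] += 1
            elif name.startswith("LATIN"):
                scripts["Latin"] += 1
            else:
                scripts["Other"] += 1
    return scripts.most_common(1)[0][0] if scripts else "Non-letter"

# Illustrative mini-vocabulary standing in for a 200,000-entry token list
vocab = ["hello", " the", "привет", "مرحبا", "你好", "123"]
counts = Counter(dominant_script(t) for t in vocab)
print(counts)  # Latin: 2, plus one each of Cyrillic, Arabic, CJK, Non-letter
```

Counting over the full vocabulary with a classifier like this gives the per-language breakdown Das describes.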
“So the tokenizer’s main impact, in my opinion, is you get the cost down in these languages, not that the quality in these languages goes dramatically up,” Das says. When an LLM has better and longer tokens in non-English languages, it can analyze prompts faster and charge users less for the same answer. With the new tokenizer, “you’re looking at almost four times cost reduction,” he says.
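The cost mechanics behind that claim are straightforward token arithmetic: API pricing is per token, so a vocabulary that covers a language with longer, whole-word tokens cuts the bill in proportion to the drop in token count. The numbers below are hypothetical, not OpenAI's published rates or measured counts:

```python
# Hypothetical per-token price, in dollars
PRICE_PER_TOKEN = 0.00001

def prompt_cost(num_tokens: int, price_per_token: float = PRICE_PER_TOKEN) -> float:
    """Cost of a prompt billed per token."""
    return num_tokens * price_per_token

# Hypothetical: the same non-English sentence under an old vocabulary
# (byte-level fallbacks, many tokens) vs. a new one (whole-word tokens).
old_tokens = 120
new_tokens = 30

reduction = prompt_cost(old_tokens) / prompt_cost(new_tokens)
print(f"{reduction:.0f}x cost reduction")  # → 4x, the scale Das describes
```

The same proportionality applies to latency: fewer tokens means fewer decoding steps for the same reply.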
Das, who also speaks Hindi and Bengali, took a look at the longest tokens in those languages. The tokens reflect discussions happening in these languages, so they include words like “Narendra” or “Pakistan,” but common English terms like “Prime Minister,” “university,” and “international” also come up frequently. They also don’t exhibit the issues surrounding the Chinese tokens.
That likely reflects the training data in these languages, Das says: “My working theory is the websites in Hindi and Bengali are very rudimentary. It’s like [mostly] news articles. So I would expect this to be the case. There are not many spam bots and porn websites trying to happen in these languages. It’s mostly going to be in English.”
Polluted data and a lack of cleaning
However, things are drastically different in Chinese. According to multiple researchers who have looked into the new library of tokens used for GPT-4o, the longest tokens in Chinese are almost exclusively spam words used in pornography, gambling, and scamming contexts. Even shorter tokens, like three-character-long Chinese words, reflect those topics to a significant degree.
“The problem is clear: the corpus used to train [the tokenizer] is not clean. The English tokens seem fine, but the Chinese ones are not,” says Cai from Princeton University. It is common for a language model to crawl spam when gathering training data, but usually there will be significant effort taken to clean up the data before it’s used. “It’s possible that they didn’t do proper data clearing when it comes to Chinese,” he says.
The content of these Chinese tokens could suggest that they have been polluted by a specific phenomenon: websites hijacking unrelated content in Chinese or other languages to boost spam messages.
These messages are often advertisements for pornography videos and gambling websites. They could be real businesses or merely scams. And the language is inserted into content farm websites, or sometimes legitimate websites, so they can be indexed by search engines, circumvent spam filters, and surface in random searches. For example, Google indexed one search result page on a US National Institutes of Health website, which lists a porn site in Chinese. The same site name also appeared in at least five Chinese tokens in GPT-4o.