In fact, among the few long Chinese tokens in GPT-4o that aren’t either pornography or gambling nonsense, two are “socialism with Chinese characteristics” and “People’s Republic of China.” The presence of these phrases suggests that a significant part of the training data actually comes from Chinese state media writings, where formal, long expressions are extremely common.
OpenAI has historically been very tight-lipped about the data it uses to train its models, and it will probably never tell us how much of its Chinese training data is state media and how much is spam. (OpenAI didn’t respond to MIT Technology Review’s detailed questions sent on Friday.)
But it’s not the only company struggling with this problem. People inside China who work in its AI industry agree there’s a lack of quality Chinese text data sets for training LLMs. One reason is that the Chinese internet used to be, and largely remains, divided up by big companies like Tencent and ByteDance. They own most of the social platforms and aren’t going to share their data with competitors or third parties to train LLMs.
In fact, this is also why search engines, including Google, kinda suck when it comes to searching in Chinese. Since WeChat content can only be searched on WeChat, and content on Douyin (the Chinese TikTok) can only be searched on Douyin, this data is not accessible to a third-party search engine, let alone an LLM. But these are the platforms where actual human conversations are happening, instead of some spam website that keeps trying to draw you into online gambling.
The lack of quality training data is a much bigger problem than the failure to filter out the porn and general nonsense in GPT-4o’s token-training data. If there isn’t an existing data set, AI companies have to put in significant work to identify, source, and curate their own data sets and filter out inappropriate or biased content.
It doesn’t seem OpenAI did that, which in fairness makes some sense, given that people in China can’t use its AI models anyway.
Still, there are many people living outside China who want to use AI services in Chinese. And they deserve a product that works properly as much as speakers of any other language do.
How can we solve the problem of the lack of good Chinese LLM training data? Tell me your idea at zeyi@technologyreview.com.