Alibaba Chief Says China AI 2 Years Behind US, How Humor Forum Unexpectedly Makes AI Smarter, and China Approves 117 Gen-AI Models
Weekly China AI News from March 25, 2024 to April 3, 2024
Hello readers, as Chinese families are honoring their ancestors in the Qingming tomb-sweeping festival, I'm delivering this week's issue early. In this edition, I highlight Alibaba Chair Joe Tsai's perspective on China's AI and the US chip restrictions. Surprisingly, a humor sub-forum on Baidu Tieba, Ruozhiba (弱智吧), has emerged as a goldmine for training Chinese LLMs. And China's Internet regulator has released a full list of 117 generative AI models now approved for public services.
Alibaba Chair Joe Tsai on China AI, Chip Restrictions, and Homegrown GPUs
What's New: Alibaba's co-founder and new chief Joe Tsai said in a recent public interview that China is two years behind the top LLMs from the U.S., and he believes the country can eventually produce its own high-end GPUs. Below are quick highlights.
US vs China on AI: "I think China is today behind. It's clear that American companies like OpenAI have really leaped ahead of everybody else, but China is trying to play catch-up. I think China could have a lag that will last for a long time because everybody else is running very fast as well. I think today we're probably two years behind the top models."
Chip Restrictions: "Last October the U.S. put in very stringent restrictions on the ability of companies like Nvidia to export high-end chips to every company in China, so they've sort of abandoned the entity-list approach and they put the entire China on their list. I think we're definitely affected by that. In fact, we've actually publicly communicated it did affect our cloud business and our ability to offer high-end computing services to our customers. So it is an issue in the short run and probably the medium run, but in the long run, China will develop its own ability to make these high-end GPUs."
Short-term impact: "I think in the next year or 18 months the training of large language models (LLMs) can still go ahead given the inventory that people have. I think there's more high computing that's required for training as opposed to the applications, what people call inference. So on the inference side, there are multiple options. You don't need to have as high-power and high-end chips such as the Nvidia, you know, the latest model."
Alibaba's AI strategy: "We're one of the largest cloud players in China, so AI is essential. Having a good large language model that is proprietarily developed in-house is very important because it helps our cloud business: if we have a great LLM and other people, other developers, are developing on top of it, they're using our computing services. So we see AI as very much the left hand and right hand for our cloud business. And the other aspect is the e-commerce business is one of the places where you can have the richest use cases for AI. So you can develop a lot of really cool products on top of our own models or even someone else's open-source model… You can try something on using virtual dressing rooms. Our merchants doing business on our marketplace will be able to use AI to self-generate photos, product descriptions, and things like that."
Chinese Reddit-like Humor Forum Ruozhiba (弱智吧) Unexpectedly Makes AI Smarter
What's New: Ruozhiba, which literally translates to "Idiot Sub-forum", is a bizarre corner of the Chinese internet. This sub-forum on Reddit-like Baidu Tieba is filled with ridiculous, pun-filled, logically challenging threads that will twist your brain into a pretzel. Here are some examples:
Is it a violation to drink all the water during a swimming race and then run?
Since prisons are full of criminals, why donāt the cops just go arrest people there?
Fresh sashimi is a dead fish slice (生鱼片是死鱼片).
But who would have guessed that this forum would become a treasure trove for training Chinese language AI models?
How it Works: A recent paper titled "COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning" introduced a high-quality Chinese dataset aimed at fine-tuning Chinese LLMs to better understand and respond to instructions like a native Chinese speaker.
The dataset contains over 48,000 instruction-response pairs collected from diverse sources on the Chinese internet like Q&A forums, Wiki articles, exams, and existing NLP datasets.
The authors then analyzed the effects of different data sources, including Ruozhiba.
The Ruozhiba dataset only contains 240 instruction-response pairs. The authors first collected the 500 most highly upvoted threads from Ruozhiba. They used the titles of these threads as instructions. For the responses, some were generated by humans and some by GPT-4.
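The construction recipe described above (rank threads by upvotes, take the titles as instructions, attach a human- or GPT-4-written response) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the field names (`title`, `upvotes`, `response`) and the helper `build_pairs` are hypothetical.

```python
# Hypothetical sketch of the Ruozhiba pair-construction step.
# Field names and function are illustrative assumptions, not from the paper.

def build_pairs(threads, top_k=500, keep=240):
    """Rank threads by upvotes, use titles as instructions,
    and keep only threads that have an attached response
    (written by a human or generated by GPT-4)."""
    ranked = sorted(threads, key=lambda t: t["upvotes"], reverse=True)[:top_k]
    pairs = [
        {"instruction": t["title"], "response": t["response"]}
        for t in ranked
        if t.get("response")
    ]
    return pairs[:keep]

# Toy input: one answered thread and one unanswered pun thread.
threads = [
    {
        "title": "Since prisons are full of criminals, why don't the cops just go arrest people there?",
        "upvotes": 9000,
        "response": "Prisoners have already been arrested and convicted; police arrest suspects of new crimes.",
    },
    {"title": "Unanswered pun thread", "upvotes": 100, "response": None},
]

pairs = build_pairs(threads)
```

Each resulting pair is a standard instruction/response record, ready to mix into a supervised fine-tuning set alongside the other COIG-CQIA sources.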
Surprisingly, the authors found that the Yi-34B model fine-tuned on the Ruozhiba data performed the best overall across different tasks on the BELLE-EVAL benchmark, outperforming other data sources like Zhihu, Xiaohongshu, and Wiki.
Additionally, the smaller Yi-6B model fine-tuned on the Ruozhiba subset also ranked third overall, behind only the carefully curated CQIA-Subset and the Exam data.
On the SafetyBench which evaluates ethical and safe behavior, the Yi-6B model trained on Ruozhiba data also secured the second-highest score.
The authors conjectured that Ruozhiba "may enhance the model's logical reasoning ability, thereby benefiting most of the instruct-following tasks."
Why it Matters: It's just a fun story that I really enjoyed writing about. You never would have guessed that a dataset filled with pure nonsense could actually help enhance AI!
China Approves 117 Generative AI Models for Public Use
What's New: China has approved 117 generative AI models for public use as of March 28, 2024, the Cyberspace Administration of China (CAC) disclosed for the first time.
Background: Under China's generative AI regulation, platforms, especially chatbots like Baidu's ERNIE Bot and Alibaba's Tongyi Qianwen, must seek approval from local CAC offices before launch. Since August last year, any generative AI services "capable of shaping public opinion or mobilizing society" must undergo a safety evaluation and registration process.
Local CAC offices then publicly disclose information about registered generative AI services.
Key Takeaways
While I haven't studied all 117 models, presumably the majority are language-based models (or LLMs).
No models from non-Chinese companies have made the cut yet.
Beijing and Shanghai stand at the forefront of Chinaās AI innovation, with 51 models from Beijing and 24 from Shanghai receiving approval.
Weekly News Roundup
Shenzhen-based robotics company UBTech has worked with Baidu to integrate LLMs into humanoid robots. Their demo features the Walker S robot folding clothes and sorting objects through natural language, using Baidu's ERNIE Bot for task interpretation and planning. (New Atlas)
For 20 yuan, Chinese internet users can now generate animated digital avatars of their departed loved ones, as per online advertisements. During this year's Tomb-Sweeping Festival on Thursday, mourners are embracing AI to connect with those who have passed away. (The Guardian)
Last Friday, the Biden administration tightened regulations to restrict Chinaās access to U.S. AI chips and chipmaking tools, aiming to curb Beijingās chip industry growth over national security fears. (Reuters)
On April 2, Kunlun Techās AI music generation model SkyMusic began free beta testing, offering 1,000 slots to media and interested music professionals. (Jiqizhixin)
Trending Research
DiJiang: Efficient Large Language Models through Compact Kernelization
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction