Alibaba Chief Says China AI 2 Years Behind US, How Humor Forum Unexpectedly Makes AI Smarter, and China Approves 117 Gen-AI Models
Weekly China AI News from March 25, 2024 to April 3, 2024
Hello readers, as Chinese families are honoring their ancestors in the Qingming tomb-sweeping festival, I'm delivering this week's issue early. In this edition, I highlight Alibaba Chair Joe Tsai's perspective on China's AI and the US chip restrictions. Surprisingly, a humor sub-forum on Baidu Tieba, Ruozhiba (弱智吧), has emerged as a goldmine for training Chinese LLMs. And China's Internet regulator has released a full list of 117 generative AI models now approved for public services.
Alibaba Chair Joe Tsai on China AI, Chip Restrictions, and Homegrown GPUs
What's New: Alibaba's co-founder and new chief Joe Tsai said in a recent public interview that China is two years behind the top LLMs from the U.S., and he believes the country can eventually produce its own high-end GPUs. Below are quick highlights.
US vs China on AI: "I think China is today behind. It's clear that American companies like OpenAI have really leaped ahead of everybody else, but China is trying to play catch-up. I think China could have a lag that will last for a long time because everybody else is running very fast as well. I think today we're probably two years behind the top models."
Chip Restrictions: "Last October the U.S. put in very stringent restrictions on the ability of companies like Nvidia to export high-end chips to every company in China, so they've sort of abandoned the entity-list approach and they put the entire China on their list. I think we're definitely affected by that. In fact, we've actually publicly communicated it did affect our cloud business and our ability to offer high-end computing services to our customers. So it is an issue in the short run and probably the medium run, but in the long run, China will develop its own ability to make these high-end GPUs."
Short-term impact: "I think in the next year or 18 months the training of large language models (LLMs) can still go ahead given the inventory that people have. I think there's more high computing that's required for training as opposed to the applications, what people call inference. So on the inference side, there are multiple options. You don't need to have as high-power and high-end chips such as the Nvidia, you know, the latest model."
Alibaba's AI strategy: "We're one of the largest cloud players in China, so AI is essential. Having a good large language model that is proprietarily developed in-house is very important because it helps our cloud business: if we have a great LLM and other people, other developers, are developing on top of it, they're using our computing services. So we see AI as very much the left hand and right hand for our cloud business. And the other aspect is the e-commerce business is one of the places where you can have the richest use cases for AI. So you can develop a lot of really cool products on top of our own models or even someone else's open-source model… You can try something on using virtual dressing rooms. Our merchants doing business on our marketplace will be able to use AI to self-generate photos, product descriptions, and things like that."
Chinese Reddit-like Humor Forum Ruozhiba (弱智吧) Unexpectedly Makes AI Smarter
What's New: Ruozhiba, which literally translates to "Idiot Sub-forum", is a bizarre corner of the Chinese internet. This sub-forum on Reddit-like Baidu Tieba is filled with ridiculous, pun-filled, logically challenging threads that will twist your brain into a pretzel. Here are some examples:
Is it a violation to drink all the water during a swimming race and then run?
Since prisons are full of criminals, why donāt the cops just go arrest people there?
Fresh sashimi is a dead fish slice (生鱼片是死鱼片).
But who would have guessed that this forum would become a treasure trove for training Chinese language AI models?
How it Works: A recent paper titled "COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning" introduced a high-quality Chinese dataset aimed at fine-tuning Chinese LLMs to better understand and respond to instructions like a native Chinese speaker.
The dataset contains over 48,000 instruction-response pairs collected from diverse sources on the Chinese internet like Q&A forums, Wiki articles, exams, and existing NLP datasets.
The authors then analyzed the effects of different data sources, including Ruozhiba.
The Ruozhiba dataset only contains 240 instruction-response pairs. The authors first collected the 500 most highly upvoted threads from Ruozhiba. They used the titles of these threads as instructions. For the responses, some were generated by humans and some by GPT-4.
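The construction recipe described above (rank threads by upvotes, take the titles as instructions, attach a human- or GPT-4-written response) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the field names (`title`, `upvotes`, `response`) and the helper `build_pairs` are hypothetical.

```python
# Hypothetical sketch of the Ruozhiba pair-construction step.
# Field names and function are illustrative assumptions, not from the paper.

def build_pairs(threads, top_k=500, keep=240):
    """Rank threads by upvotes, use titles as instructions,
    and keep only threads that have an attached response
    (written by a human or generated by GPT-4)."""
    ranked = sorted(threads, key=lambda t: t["upvotes"], reverse=True)[:top_k]
    pairs = [
        {"instruction": t["title"], "response": t["response"]}
        for t in ranked
        if t.get("response")
    ]
    return pairs[:keep]

# Toy input: one answered thread and one unanswered pun thread.
threads = [
    {
        "title": "Since prisons are full of criminals, why don't the cops just go arrest people there?",
        "upvotes": 9000,
        "response": "Prisoners have already been arrested and convicted; police arrest suspects of new crimes.",
    },
    {"title": "Unanswered pun thread", "upvotes": 100, "response": None},
]

pairs = build_pairs(threads)
```

Each resulting pair is a standard instruction/response record, ready to mix into a supervised fine-tuning set alongside the other COIG-CQIA sources.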
Surprisingly, the authors found that the Yi-34B model fine-tuned on the Ruozhiba data performed the best overall across different tasks on the BELLE-EVAL benchmark, outperforming other data sources like Zhihu, Xiaohongshu, and Wiki.
Additionally, the smaller Yi-6B model fine-tuned on the Ruozhiba subset also ranked third overall, behind only the carefully curated CQIA-Subset and the Exam data.
On the SafetyBench which evaluates ethical and safe behavior, the Yi-6B model trained on Ruozhiba data also secured the second-highest score.
The authors conjectured that Ruozhiba "may enhance the model's logical reasoning ability, thereby benefiting most of the instruct-following tasks."
Why it Matters: It's just a fun story that I really enjoyed writing about. You never would have guessed that a dataset filled with pure nonsense could actually help enhance AI!
China Approves 117 Generative AI Models for Public Use
What's New: China has approved 117 generative AI models for public use as of March 28, 2024, the Cyberspace Administration of China (CAC) disclosed for the first time.
Background: Under China's generative AI regulation, platforms, especially chatbots like Baidu's ERNIE Bot and Alibaba's Tongyi Qianwen, must seek approval from local CAC offices before launch. Since August last year, any generative AI services "capable of shaping public opinion or mobilizing society" must undergo a safety evaluation and registration process.
Local CAC offices then publicly disclose information about registered generative AI services.
Key Takeaways
While I haven't studied all 117 models, presumably the majority are language-based models (or LLMs).
No models from non-Chinese companies have made the cut yet.
Beijing and Shanghai stand at the forefront of Chinaās AI innovation, with 51 models from Beijing and 24 from Shanghai receiving approval.
Weekly News Roundup
Shenzhen-based robotics company UBTech has worked with Baidu to integrate LLMs into humanoid robots. Their demo features the Walker S robot folding clothes and sorting objects through natural language, using Baidu's ERNIE Bot for task interpretation and planning. (New Atlas)
For 20 yuan, Chinese internet users can now generate animated digital avatars of their departed loved ones, as per online advertisements. During this year's Tomb-Sweeping Festival on Thursday, mourners are embracing AI to connect with those who have passed away. (The Guardian)
Last Friday, the Biden administration tightened regulations to restrict Chinaās access to U.S. AI chips and chipmaking tools, aiming to curb Beijingās chip industry growth over national security fears. (Reuters)
On April 2, Kunlun Techās AI music generation model SkyMusic began free beta testing, offering 1,000 slots to media and interested music professionals. (Jiqizhixin)
Trending Research
DiJiang: Efficient Large Language Models through Compact Kernelization
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction