From Zero to One: A Brief History of BAAI's Wudao LLMs
A Collective Endeavor of Top Academic Minds to Build a Chinese Language Model at an Unprecedented Scale
Hello readers! Last week was a bit of a slow news week in China's AI world, so I took the opportunity to finally translate a feature story that's been on my to-do list for months. Published by Leiphone (雷峰网), one of China's premier deep tech publications, the piece delves into the origin story of Wudao (悟道), a series of Large Language Models (LLMs) from the Beijing Academy of Artificial Intelligence (BAAI). What's truly special about Wudao isn't just its capabilities, but how it became a breeding ground for young, talented Chinese scientists who went on to create their own LLMs and companies. The story is a long one, over 7,000 words (ChatGPT did most of the translation). You can find the original Chinese article here.
Chapter 1
The story began in the autumn of 2018 in Haidian District, Beijing. On October 11, a regular Thursday, Liu Zhiyuan opened the arXiv website as usual and browsed the latest artificial intelligence (AI) work uploaded by scholars from all over the world. Most of the time, the quality of papers on arXiv was uneven, and Liu Zhiyuan only skimmed them to get a general idea. But that day, he was deeply drawn in by a paper from Google's language group.
Originally he had only clicked in to take a look, but the more he read, the more fascinated and surprised he became. Even after closing his computer, he couldn't shake it off for a long time, overwhelmed by the ideas in it. Sure enough, he soon discovered that the paper had also attracted widespread attention from other AI scholars in China. Teachers and students from top universities like Tsinghua, Peking, Renmin, and Fudan were enthusiastically discussing the work.
Everyone vaguely felt: "This could be another technological paradigm shift in the field of AI."
This work was the paper that later became famous as BERT - "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" - which has now been cited over 700,000 times on Google Scholar.
In the Chinese context, "paradigm" (范式) is not a common word. But in Leiphone's interviews about LLMs, the word came up repeatedly: once describing deep learning in 2012, once describing BERT in 2018, and another time describing the direction of LLM startups in China before ChatGPT came out in 2022: "At that time, no one thought about aiming for artificial general intelligence (AGI), but felt that LLMs could become a universal AI paradigm." But that's a story for later.
Back to BERT.
Paradigm refers to the basic system and framework of a field, such as western suits and Hanfu being two different paradigms in the clothing field. On the basis of these two paradigms, fashion designers can design all kinds of styles and models. In short, the paradigm represents a change in underlying thinking, dividing the past from the future.
And BERT's "bidirectional pre-training" approach embodied this potential.
AI has three main directions: computer vision (CV), natural language processing (NLP), and machine learning (ML). The ultimate goal of NLP is to enable computers to understand human language. So how do we judge that a computer has understood human language? For a long time before BERT, the NLP research approach was to break language understanding down into small task directions, such as machine translation, text comparison, semantic analysis, and so on, and then design and train AI algorithms for each task separately. For example, Liu Zhiyuan's research direction during his Ph.D. (2006-2011) was a basic NLP task called "keyword extraction".
The difference between BERT and traditional methods is that in traditional statistical learning or deep learning, the AI algorithm learns directly from data for a specific task (such as text comparison). Before learning this data, the AI is a blank slate without any basic capabilities, and the trained algorithm can only perform that one task. BERT's pre-training method, by contrast, first has the AI read a huge amount of unlabeled text before it learns the task data, like working through a full set of practice papers before an exam, so the trained algorithm performs better in the subsequent "exams".
BERT was not the first language model to use pre-training. A few months earlier, OpenAI had released GPT-1, also a pre-trained language model. BERT's innovation was that, with its bidirectional training idea, it broke pre-training's dependence on a specific task framework.
GPT-1 had a unidirectional structure that could only learn textual information in a single direction (left to right or right to left), so the trained model was good at only one kind of language task: GPT-1 was strong at text generation but weak at understanding. BERT has a bidirectional structure that learns language representations from both the left and right context at once and learns from massive unlabeled data, so a single model can handle multiple language tasks, such as question answering, fill-in-the-blank, and text understanding. It outperformed all models at the time on each task and soon dominated the authoritative NLP leaderboard GLUE.
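To make the contrast concrete, here is a minimal sketch (mine, not from the article) using the Hugging Face transformers library: a BERT-style model fills in a blank using context on both sides, while a GPT-style model can only continue the text from the left.

```python
# Minimal sketch contrasting BERT-style bidirectional "fill in the blank"
# with GPT-style left-to-right generation, via Hugging Face transformers.
from transformers import pipeline

# BERT is trained to predict masked tokens using context on BOTH sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))  # top guess: "paris"

# GPT-style models predict the NEXT token from left context only,
# which makes them natural text generators.
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=5))
```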
Everyone was shocked by BERT's results, just like going back to when deep learning first demonstrated its power in 2012:
That year, Geoffrey Hinton, a professor at the University of Toronto, led two students, Alex Krizhevsky and Ilya Sutskever (now OpenAI's chief scientist), to use deep learning methods to train AlexNet, which swept the world computer vision competition ImageNet, leaving all other statistical learning algorithms far behind. "Deep learning" became famous overnight, and even NLP scholars kept discussing it.
Compared to deep learning, BERT made much smaller waves at the time, but a number of domestic NLP scholars also felt a sense of urgency that it was now or never.
Although there are no precise statistics, many scholars told Leiphone that after the rise of deep learning in 2012, whether in research or deployment, vision was the direction with the largest number of researchers and the hottest research enthusiasm in the domestic AI circle. From 2012 to 2018, the language field did not change as dramatically as vision did, nor did it stand out as much, particularly in embracing the wave of deep learning.
Liu Zhiyuan belonged to the Natural Language Processing Laboratory (THUNLP) at Tsinghua University. In 2012, Sun Maosong, the director of the lab, happened to lead the application for a 973 national key project, and in order to better determine the future technical route for NLP, organized several units, including Peking University, Harbin Institute of Technology, Institute of Automation of Chinese Academy of Sciences, and Baidu to discuss together. Everyone was optimistic about deep learning, so after the project was successfully applied for, THUNLP also turned to deep learning starting in 2013. Later, deep learning swept the globe as expected.
Since then, "daring to revolutionize oneself" has been THUNLP's research spirit. After BERT came out, Liu Zhiyuan quickly decided to turn to pre-training methods as well. Their idea was to use knowledge graph methods to extract pieces of abstract knowledge and inject them into pre-trained language models to make the models smarter. They cooperated with Liu Qun and Jiang Xin from Huawei's Noah's Ark Lab to quickly develop a pre-trained language model called "ERNIE" and submitted it to the top NLP academic conference ACL 2019.
Coincidentally, in 2018, Baidu's NLP team was also shocked by BERT, and completed a pre-trained language model at almost the same time, taking the lead in publishing it on arXiv, also named "ERNIE". Both teams drew on characters from the American children's show Sesame Street, since earlier pre-trained models like ELMo and BERT were also Sesame Street characters; Google had used BERT, so when each looked for a name to match Google's, they landed on the same one.
Both "ERNIE"s outperformed BERT on some tasks. Baidu's arXiv release came out before THUNLP's collaboration paper was accepted, so to distinguish theirs from Baidu's, Liu Zhiyuan et al. changed their model's name, while Baidu kept using it. Later, when Baidu built its LLMs, the Chinese name was "Wenxin" but the English name remained "ERNIE".
As expected, pre-training quickly became the mainstream method in NLP. At the same time, international teams with keen intuition were also quick to follow up on pre-training. In February 2019, OpenAI released GPT-2. Although GPT-2 generated better text than GPT-1, it was still inferior to BERT on many language tasks, so OpenAI's voice was completely drowned out by Google's at the time.
But a year and a half later, history was rewritten again: in June 2020, OpenAI suddenly released a research result that exceeded everyone's imagination - GPT-3, with 175 billion parameters. A fellow pre-trained language model, GPT-3 had roughly 500 times as many parameters as BERT. Not only could it generate language, it also surpassed BERT on various language understanding tasks.
Everyone's research worldview was overturned.
Chapter 2
No one had expected that enlarging the parameters of pre-trained language models would lead to so-called "Emergent Abilities". Google's paper corroborating this phenomenon was not published until a year later.
BERT had 340 million parameters, which was undoubtedly a large model compared to every language model of 2018. But people's focus was more on its pre-training method, and no one thought of directly "piling on parameters" the way OpenAI did. GPT-3's approach of piling on parameters was like having the AI model memorize the entire library outright.
As a result, the rote-memorizing GPT-3 not only had very strong language understanding abilities, but also some reasoning capabilities. Even for some unlabeled data and tasks, GPT-3 could learn as it went, achieving decent results.
Previously, when knowledge was injected into small language models, their intelligence levels would also improve, which everyone could understand. But OpenAI skipped the step of extracting knowledge from text data, and relied entirely on piling parameters to force GPT-3 to learn, which caught everyone completely off guard. Some even claimed that GPT-3 had actually passed the Turing test.
The Turing test was proposed by the "father of AI" Alan Turing in 1950. After 70 years of AI development around the world, this was the first time it was claimed to have been passed, so the impact on the AI community was huge. GPT-3 was not only a major breakthrough in natural language processing, but also a milestone for the field of AI as a whole. For a time, discussion of language intelligence reached unprecedented heights. Not only NLP scholars like Liu Zhiyuan, but also researchers in information retrieval were constantly discussing it.
Even more staggering was OpenAI's claim that they had used 10,000 GPUs to train GPT-3.
Normally in academic research, computing devices account for about 20% of a professor's total research funding, and having more than 500 cards already makes you a "wealthy" player in academia. Previously, most AI scientists in China and abroad had used a single card, or multiple cards on a single machine, when researching NLP. But GPT-3's training used 10,000 cards in total, a run estimated to cost about $12 million USD, over 80 million RMB.
From an engineering perspective, the difficulty of training GPT-3 was also unprecedented. Take BERT as an example: the engineering effort of training the 340-million-parameter BERT versus the 175-billion-parameter GPT-3 is like the difference between building a toy car and building an airplane. The engineering for a toy car doesn't carry over to an airplane; likewise, past knowledge about training small language models didn't apply to LLMs.
GPT-3 crushed BERT; in essence, it was "large-scale pre-trained language models" crushing "pre-trained language models".
On one hand, everyone was excited about GPT-3. On the other hand, deep down they also felt a huge gap. Before this, most domestic scholars had thought Chinese teams' papers were on par with those of top US universities. After GPT-3, they realized there was still such a big gap between themselves and the international state of the art.
In the summer of 2020 in Beijing's Wudaokou, computer and AI scholars from Tsinghua, Peking University, Renmin University, the Chinese Academy of Sciences, and elsewhere were all paying attention to GPT-3. Although no one could clearly explain the mechanisms behind GPT-3's power at the time, intuition told everyone that this was an important watershed in the field of AI. The impact of GPT-3 was so great that some scholars decided to research large pre-trained language models, no matter what.
Liu Zhiyuan was one of them. At the time, the most prominent obstacle to researching LLMs was computing power. Liu Zhiyuan reached out to Tsinghua professors in high-performance computing like Chen Wenguang and Han Wentao, to collaborate on using distributed acceleration computing to reduce the training costs of LLMs. He also looked beyond THUNLP to seek outside help.
At that time, Sun Maosong was the Chief Scientist of Natural Language Processing in an emerging AI research institution less than 100 meters from the east gate of Tsinghua, where Liu Zhiyuan was also a young scientist. Naturally, Liu Zhiyuan thought of going there to discuss collaboration.
This institution is the now famous Beijing Academy of Artificial Intelligence (BAAI).
But at the time, BAAI was a research unit that had just been established for a year and a half and was still developing.
BAAI's establishment was part of the blueprint for building the Beijing International Innovation Center, jointly guided by the Ministry of Science and Technology and the Beijing Municipal Government, with the mission of exploring the frontiers of AI. Through programs like "BAAI Scholars", "BAAI Conventions", and "Qingyuan Meetings", BAAI connected around 100 outstanding AI scientists in Beijing and worked with them to find the "next big thing" in AI.
BAAI President Huang Tiejun told Leiphone that the selection of BAAI Scholars was itself very strict, so once scholars were selected, BAAI would provide them with funding support without requiring them to deliver research results. What BAAI cared about more was everyone jointly exploring the major AI directions worth investing in.
In April 2019, BAAI identified several major directions, including natural language processing, machine learning, and information retrieval, each gathering 5-10 well-known scholars for discussion. The natural language processing direction had Sun Maosong, He Xiaodong, Liu Zhiyuan and others; the intelligent information retrieval direction had Wen Jirong, Tang Jie and others. After GPT-3 came out, the scholars in these groups discussed GPT-3 and how to build China's own LLMs.
Before reaching a consensus, there were several key discussions within BAAI.
The first two were at Yanqi Lake in Beijing. In July 2020, the machine learning group met; the scholars felt GPT-3 represented a major direction and that, now that LLMs had emerged, they should research visual LLMs. But after discussion, they concluded that visual LLMs would require even more computing power, so no action was taken. In August it was the information retrieval and mining group, where Wen Jirong, Tang Jie and others discussed LLMs. In September, at BAAI's office meeting, Liu Zhiyuan proposed researching universal language models.
After National Day, on October 10, BAAI held another discussion at Yanqi Lake, inviting scholars from different directions to attend. They finally reached a consensus at the meeting to form a task force and collaborate on LLMs.
After approval, BAAI sent out "hero recruitment posts" through various channels, inviting scholars interested in LLMs to participate, under the slogan "Heroes don't ask where you're from". The call resonated with scholars, and many signed up.
The first were professors from Tsinghua and Renmin, including Liu Zhiyuan, Wen Jirong, Tang Jie, Huang Minlie and others. Scholars from Peking University, the Chinese Academy of Sciences and other institutions also expressed interest, and some external BAAI members joined as well, such as Yang Hongxia, who was working at Alibaba DAMO Academy at the time. In the end, BAAI's LLM project gathered about 100 people, with then BAAI Deputy Dean Tang Jie appointed as project leader.
That October, BAAI reported this "100-person LLM plan" to then Beijing Mayor Chen Jining. Mayor Chen was very excited and said, "This (LLM) is the nuclear fission point for the future of AI, and will bring prosperous ecological development." Beijing decided to back it strongly and approved special funding for BAAI to purchase computing power.
In fact, at that time many people still didn't understand what LLMs were, and developing them was costly. But in October 2020, from the scholars to BAAI, from Beijing to the Ministry of Science and Technology, everyone reached a consensus: to fully advance the research and development of Chinese LLMs. Afterward, many scholars expressed amazement to Leiphone: "Strangely, everyone was decisive at the time."
Everyone felt LLMs could do something bigger. Beyond language models, the idea of "quantitative change leading to qualitative change" could also produce breakthroughs in other fields. After discussion, they decided to "divide into four groups" and explore Chinese LLMs from four directions: Chinese language models, multimodal models, cognitive models, and protein models, led respectively by Liu Zhiyuan, Wen Jirong and Tang Jie, with Tang Jie responsible for the latter two, essentially "three teams doing four things".
In November 2020, the teams discussed names during the NLP annual conference at Chunhui Garden in Shunyi. Sun Maosong noted that everyone was researching language, so he suggested using "Wen" (文, meaning language/literature). After discussion, the four teams were named after four of the seven imperial libraries that housed the Qing Dynasty's Complete Library of the Four Treasuries: "Wen Yuan", "Wen Lan", "Wen Hui", and "Wen Su".
To signal that they were one entity, BAAI suggested giving the teams a unified codename, and invited everyone to BAAI's then office in the Sai'er Building in Wudaokou. At the meeting, Tang Jie proposed relating the name to Wudaokou, since everyone had deep feelings for the place. Everyone tossed out names, and after the brainstorming, Song Ruihua from Renmin University suggested "Wudao" (悟道), a near-homophone of "Wudaokou", and everyone agreed.
That's how "Wudao" (悟道) came about.
Chapter 3
Wudao's original intention was very pure: to catch up with GPT-3 and research Chinese LLMs.
So what are "Chinese LLMs"?
Nowadays, there are many types of LLMs in China, to the point that the definition of LLMs has become blurred. But in 2020, Wudao members had a very focused understanding: fundamentally, GPT-3 was an English-centric language model, while China didn't have one at the time. Therefore, the "Chinese LLM" should first be a Chinese-centric large-scale pre-trained language model with over 175 billion parameters, like GPT-3.
Although later research showed that monolingual language models also have some multilingual capabilities, in the Chinese context, people found that using GPT-3 to solve many Chinese language tasks often led to semantic ambiguities, logical errors, etc. One reason is that GPT-3's training data is mainly English, and Chinese research teams have no way of knowing GPT-3's detailed training parameters for fine-tuning. So, whether subjectively or objectively, in 2020, independently developing domestic LLMs was an inevitable choice.
BAAI approved the project in October 2020. Since LLMs require large computing power, BAAI also began heavily investing resources like computing power from October. BAAI originally planned to purchase 300P with existing research funds. With Mayor Chen Jining's approval of strong support, it was decided to allocate another 700P from special funds, so the total was 1000P. However, the approval and purchasing process took over a year, so Wudao relied mainly on rented computing power at the start.
Everyone believed LLMs were the future major direction. Related scholars also brought their own resources to participate in BAAI's LLM project: in terms of manpower, each professor brought their teams of graduate students; for resources, when BAAI's computing power was not fully in place, scholars also obtained some computing power through their own channels. For example, Wen Jirong's team initially trained multimodal LLMs on Renmin University's machines, while Tang Jie's team ran on Alibaba Cloud.
Although GPT-3 made big waves, teams like BAAI that fully committed to LLMs were still rare in China at the time, and Wudao was even dismissed for a while. There were two main criticisms: first, developing LLMs was very costly, with computing costs easily reaching tens of millions; second, LLMs were not original innovations, relying only on piling up parameters, with little technical sophistication. But BAAI insisted on exploring.
Only after they truly started the research did they discover that OpenAI was no bluffing charlatan, and that the technical barriers to LLMs were not just a matter of "piling computing power" and "piling parameters". Take Chinese and multimodal LLMs, for example: before Wudao, global AI exploration in these two areas was a complete blank. As the first in China to train LLMs, they were starting from scratch, a very challenging process.
But precisely by relying on this fearless drive to forge ahead, Wudao's LLMs made leapfrog progress within six months.
In December 2020, two months after Wudao's approval, the Wenyuan team of Liu Zhiyuan, Huang Minlie and Han Wentao released the world's first open-source Chinese LLM, "CPM". CPM had only 2.6 billion parameters, negligible next to GPT-3, but its advantage was that it was trained on Chinese data. Moreover, compared with 2019's "ERNIE", CPM's parameter count was hundreds of times larger. This was not only an engineering feat; it also validated the viability of Wenyuan's approach to training Chinese LLMs.
Almost at the same time as CPM, Wenlan and Wenhui also found their solutions. Core Wenlan member Lu Zhiwu's "Twin Towers" approach was validated in December 2020, and Wenhui's 10-billion-parameter model was completed in January 2021. In March 2021, BAAI bundled together Wenyuan's CPM, Wenlan's multimodal model BriVL 1.0 (trained on 300 million image-text pairs), Wenhui's 10-billion-parameter English-Chinese bilingual LLM GLM-10B and multimodal model CogView 1.0, and other results, collectively called "Wudao 1.0", and released them.
Objectively, "Wudao 1.0" did not cause much of a sensation, but at a time when LLMs were still unfamiliar in China, Wudao showed people "what LLMs are": they could write poetry, answer questions, align text and images... more powerful than any previous NLP algorithms.
At the "Wudao 1.0" press conference, BAAI also first proposed the concept of "big models" (大模型), i.e., LLMs. BAAI President Huang Tiejun coined a phrase, saying that in recent years AI development had gradually shifted from "refining models" to "refining big models". That is, after the rise of deep learning in 2012, many small AI models appeared around the world, whereas "refining big models" means training large models intensively: designing more advanced algorithms, integrating more data, and pooling huge computing power, so that one model can serve many enterprises.
In other words, LLMs have not only large parameter counts but high intelligence. This press conference cleared up outside doubts about BAAI, and Wudao's LLMs began to emerge.
In Wenhui, led by Tang Jie, Alibaba DAMO Academy engineer Yang Hongxia and Recurrent AI co-founder Yang Zhilin were core members. BAAI did not restrict Wudao members' research freedom: Yang Hongxia participated in Alibaba's LLMs, and Yang Zhilin led Recurrent AI's cooperation with Huawei. In April 2021, Alibaba released its 27-billion-parameter LLM "PLUG", and Huawei released Pangu. Wudao not only connected scholars, but also strengthened cooperation between academia and industry.
Like Wenyuan, Wenhui also gathered young research talent from high-performance computing, such as Chen Wenguang and Zhai Jidong, who along with Han Wentao belonged to Academician Zheng Weimin's team. For LLMs, high-performance computing's distributed accelerated computing methods are crucial for improving training speed and reducing costs, so HPC talent was given important responsibilities in the Wudao project.
But for Chinese LLMs, high-performance computing's greater influence was in birthing China's first trillion-parameter model: "Wudao 2.0".
At the end of 2020, while advancing Wudao, Tang Jie, Chen Wenguang and Yang Hongxia were also planning something else: applying for the Gordon Bell Prize, known as the "Nobel Prize of supercomputing applications".
To apply for the Gordon Bell Prize, several requirements must be met: one, the supercomputer must be the world's largest; two, the project run on it must max out the machine; three, the project results must be impactful. After completing GLM-10B in January 2021, they decided to run LLMs on a supercomputer.
So they sent over 30 people to the Mountain Sea AI Lab in Qingdao to run LLMs on "Sunway TaihuLight". The students of Tang Jie and Zhai Jidong were the backbone; Zhai Jidong had been recruited by Tang Jie and Chen Wenguang for his outstanding capabilities in parallel training of low-level operators. Some Alibaba engineers also provided online support.
They brought all the data they had to Qingdao, including Chinese, English, and images, mixed together for training. To meet the Gordon Bell Prize requirement of maxing out the machine, they expanded the model to 1.74 trillion parameters, even though training had not converged. After running on the supercomputer for ten days, they had trained several versions of LLMs, each with hundreds of billions of parameters.
Although the scale was huge, the operating costs were also extremely high, beyond almost anyone's means. So they trained a better-converged MoE-based model with 1.75 trillion parameters, 10 times larger than GPT-3, surpassing Google's 1.6-trillion-parameter Switch Transformer released in April 2021 to become the world's largest model at the time. When it was unveiled at BAAI's June 2021 conference as "Wudao 2.0", it shocked the entire audience and received widespread acclaim from top technology teams at home and abroad.
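The article does not describe the model's internals, but as a rough illustration of what "MoE-based" means, here is a minimal PyTorch sketch of top-1 ("switch"-style) expert routing; all layer sizes and names are made up for illustration. The point is that total parameters grow with the number of experts, while each token only runs through one expert, which is how parameter counts can reach the trillions without a proportional rise in per-token compute.

```python
# Minimal PyTorch sketch of top-1 ("switch"-style) mixture-of-experts routing.
# Illustrative only; sizes and names are invented, not Wudao's actual code.
import torch
import torch.nn as nn

class SwitchMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # scores each token per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)
        expert_idx = gate.argmax(dim=-1)         # each token is routed to ONE expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # scale by the gate value so routing stays trainable
                out[mask] = expert(x[mask]) * gate[mask, i].unsqueeze(-1)
        return out

# Total parameters grow roughly linearly with num_experts, but each token only
# touches one expert, so per-token compute stays close to a single dense FFN.
layer = SwitchMoE()
tokens = torch.randn(16, 512)
print(layer(tokens).shape)   # torch.Size([16, 512])
```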
For a time, BAAI gained unmatched glory and joined the international forefront of LLMs.
Apart from this trillion-parameter model, "Wudao 2.0" also included Wenyuan's two 10-billion-scale models (an 11-billion-parameter Chinese model and an 11-billion-parameter English-Chinese bilingual model) and one hundred-billion-scale model (a 198-billion-parameter English-Chinese bilingual MoE model), collectively called "CPM 2.0"; and Wenlan's 5-billion-parameter image-text retrieval LLM BriVL 2.0, China's first multimodal LLM and the world's largest and most extensively trained multimodal model at the time.
Before Wenlan, academia's mainstream approach to multimodality was the "single tower": a 12-layer Transformer, shaped like one tower, into which text and image tokens are fed together to interact, with matches then scored by similarity. But at extremely large parameter scales, comparing candidates online one by one is very inefficient. So Lu Zhiwu proposed the "Twin Towers" approach:
Images are first processed by an image encoder and text by a text encoder, with no interaction between them. Only after each side has been encoded into a higher-level meaning is contrastive learning applied: if the image and the text mean similar things, the two towers' outputs are pulled close together, otherwise pushed apart. Because the images were pre-encoded in parallel into high-dimensional vectors and stored, retrieval only requires encoding the query text, and matching results can be found among the stored vectors in under a second. Wenlan verified the feasibility of the "Twin Towers" approach in November 2020. Two months later, OpenAI released its CLIP architecture (the engine behind DALL-E), built on the same idea.
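As a rough sketch of the "Twin Towers" idea just described (the encoders below are stand-ins, not BriVL's actual networks): the image tower is run once offline and its vectors cached, only the text tower runs at query time, and training pulls matching image-text pairs together with a contrastive loss.

```python
# Minimal sketch of a "Twin Towers" (dual-encoder) retrieval setup.
# The encoders are passed in as stand-ins; this is not BriVL's real code.
import torch
import torch.nn.functional as F

def encode_images(images, image_encoder):
    """Run the image tower once, offline, and cache the normalized vectors."""
    with torch.no_grad():
        return F.normalize(image_encoder(images), dim=-1)

def retrieve(query_text, text_encoder, image_bank, top_k=5):
    """At query time only the text tower runs; retrieval is a matrix multiply."""
    q = F.normalize(text_encoder(query_text), dim=-1)   # (1, d)
    scores = q @ image_bank.T                           # cosine similarities
    return scores.topk(top_k, dim=-1).indices           # best-matching images

def contrastive_loss(img_vecs, txt_vecs, temperature=0.07):
    """Matching pairs pulled together, mismatched pairs pushed apart
    (the same idea CLIP later popularized)."""
    img_vecs = F.normalize(img_vecs, dim=-1)
    txt_vecs = F.normalize(txt_vecs, dim=-1)
    logits = img_vecs @ txt_vecs.T / temperature        # (batch, batch)
    labels = torch.arange(len(logits))                  # i-th image matches i-th text
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```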
Afterward, Lu Zhiwu told Leiphone that they did not see themselves as "doing research by following others"; whether Chinese, multimodal, or trillion-parameter models, Wudao's three groups were all pioneering new frontiers in uncharted territory.
To research multimodal LLMs, Lu Zhiwu devoted all his students to Wenlan, and the team went a full year without publishing any academic papers. In academia, this was an enormous risk for both teachers and students.
Similarly, because high-quality Chinese data was scarce, many of Liu Zhiyuan's and Huang Minlie's students were assigned to data annotation and cleaning while researching Chinese LLMs. For CPM 2.0, Wenyuan's raw data collection reached 50TB, which after cleaning still came to 2.6TB. The students invested a huge amount of time and effort.
In general, BAAI's 100 Wudao members were going all in, "gambling their careers", and unexpectedly they won: after releasing "Wudao 2.0" in June 2021, BAAI Wudao became a prominent flag for Chinese LLMs, and Wudao members became the first pioneers of Chinese LLMs.
Chapter 4
In reality, 2021 was considered the "Year One of LLMs in China." After the release of Wudao 2.0, Baidu released its 10-billion-parameter model PLATO-X and 260-billion-parameter model ERNIE 3.0 Titan in September; in October, Alibaba DAMO Academy released an LLM with up to 10 trillion parameters, known as "M6."
Despite the high cost of training LLMs, a group of dedicated LLM followers emerged in 2021, and authoritative voices emerged both domestically and internationally. Two weeks after the launch of Wudao 2.0, Google published a paper claiming that language models would exhibit "Emergent Abilities" when scaled from tens to hundreds of billions of parameters. In August 2021, a review paper on "Foundation Models" co-authored by more than a hundred Stanford scholars, including Fei-Fei Li and Percy Liang, caused a significant international stir.
However, many Wudao team members knew that in 2021, a true domestically produced LLM with hundreds of billions of parameters had not yet appeared.
The underlying architecture of both the hundred-billion and trillion-parameter versions of Wudao 2.0 was sparse. The trillion-parameter model took up about 20TB of disk space and required over 500 A100 GPUs for inference. After copying the model from Shandong to Beijing, the Wudao team found it too expensive to operate and opened it to the industrial sector. Several companies copied the files but probably couldn't use them either.
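A quick back-of-the-envelope check, under my own assumptions (fp32 weights, 40 GB A100s; the article does not describe the checkpoint format), suggests why the reported figures land at that scale:

```python
# Back-of-the-envelope check of the figures above. Assumptions are mine,
# not the article's: ~1.75e12 parameters, fp32 weights plus overhead,
# and 40 GB of memory per A100.
params = 1.75e12
fp32_weights_tb = params * 4 / 1e12          # about 7 TB of raw fp32 weights
checkpoint_tb = 20                           # disk size reported in the article
a100_memory_gb = 40

print(f"fp32 weights alone: ~{fp32_weights_tb:.0f} TB")
print(f"A100-40GB cards just to hold {checkpoint_tb} TB: "
      f"~{checkpoint_tb * 1000 / a100_memory_gb:.0f}")
# About 500 cards, the same order of magnitude the Wudao team reported.
```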
Technically, the LLM also suffered from "catastrophic forgetting", particularly when image data was added. This significantly weakened the model's textual capabilities, making it even less effective than the 10-billion-parameter GLM-10B.
Compared with the technological breakthroughs themselves, the LLMs' greater contribution was cultivating a generation of young talent who truly understood how to train LLMs. That's why, after the launch of Wudao 2.0, the team members were even more determined to develop a model with hundreds of billions of parameters.
By the end of 2021, at an internal Wudao meeting, Tang Jie proposed several objectives: training a model with hundreds of billions of parameters, developing a text-to-video model, and building a code generation model. But achieving these goals would require 1,000 GPUs running flawlessly for two months, at very high training cost.
Wudao 2.0 had attracted a lot of attention, but computational resources were insufficient. Tang Jie's team was invited to use the 910A machines at the Pengcheng Laboratory. They also received nearly 2,000 Huawei 920 GPUs, which initially reached only 18% of the A100's operator efficiency; after modifications, the efficiency was raised to about 40%.
During this period, Tang Jie's team ported the model to the various cards available on the market. They found that a hundred-billion-parameter model could not be converged quickly with 2,000 910A cards, nor with tens of thousands of DCU cards running for two months. In the end, under the name of his startup Zhipu AI, Tang Jie rented 1,000 cards from the Jinan Supercomputing Center and committed a team of over 20 people to train for 8 months. Finally, in July 2022, they trained the hundred-billion-parameter model GLM-130B.
Meanwhile, other teams built many unprecedented applications on top of Wudao. For example, Liu Zhiyuan's student Qin Yujia wrote a program that used a Chinese LLM to call Bing's search engine to answer questions on Zhihu, accumulating thousands of upvotes. Lu Zhiwu's team used a multimodal LLM to edit short videos, accumulating 1.5 million views on TikTok.
However, the Chinese market was not yet willing to pay for LLMs. After setting up their LLM companies, the Wudao members went out to raise funds full of confidence, but not a single investor was willing to put up money.
All of Wudao's LLM achievements were open-source. But even after tens of millions of API calls following Wenlan's release, many interested large enterprises were unwilling to pay for usage.
In 2022, awareness of LLMs in China was still generally lacking. Everyone knew that LLMs were strong, and everyone also knew that a "hit product" was needed to showcase what LLMs could do. But no one had a solution. Technically, they had become giants; in terms of products, they were still dwarfs.
That was until the appearance of ChatGPT.
Chapter 5
Song Ruihua joined Renmin University in September 2020 and began participating in the Wudao Wenlan research in October. Prior to this, she was the Chief Scientist at Microsoft Xiaoice, specializing in text generation and leading the "Xiaoice Writes Poetry" project.
After moving from Microsoft to Xiaoice in 2018, Song Ruihua began to take an interest in cognitive intelligence and wanted to explore how AI understands human language. That summer, she read a book by Benjamin Bergen, a cognitive science professor at the University of California, San Diego, titled "Louder Than Words: The Science of How the Mind Makes Meaning." She found it inspiring.
The book points out that when humans read a good piece of writing, they often can't stop reading and imagining the scenes corresponding to the text. If the text is well crafted, these scenes come to life in the reader's mind. Therefore, a key indicator of true understanding is the ability to imagine a scene, or even to add content not present in the text.
Additionally, understanding language is not about using words to perform tasks, much like reading books is not about preparing for an exam the next day. However, in the past, scientists in the field of computing often evaluated whether AI understands human language by setting up specific, segmented tasks. For example, they would compare sports articles with financial articles to see if AI could distinguish between them.
Before ChatGPT, most of the technical staff researching AI dialogue in China came from the forum era. Their research ideas mainly originated from forum-style chats, such as thread-based conversations where A posts a topic and B and C reply underneath. In this pattern, a model conducting open dialogue would expose its lack of knowledge, because the knowledge simply wasn't present in these "pairs." One of Song Ruihua's colleagues found during a client visit that the AI was not good at beauty-related dialogues, because its outputs were mainly small talk.
At that time, Song Ruihua kept pondering the problem. She realized the issue was the lack of worldly knowledge in chat "pairs." She thought it would be great if all the text on the internet could be used. At Xiaoice, her idea was to use articles from public accounts, as these accounts often consciously follow hot topics and analyze them from various angles.
However, she took a detour: she approached it in an overly complex way, believing that the text should first be abstracted into a graph, which would then inform the dialogue. For example, if you input "Lu Han 鹿晗 (a Chinese male idol)," a mailbox would appear in the graph as a clue for the AI, because Lu Han took a photo next to a mailbox on the Bund in Shanghai in 2016; the event became news, and his fans would go to that mailbox to check in. But this method had drawbacks: sometimes the original sentences extracted from the articles were too formal or contained extra information, and were not suitable as replies.
When ChatGPT was launched by OpenAI, Song Ruihua had an epiphany and was both excited and shocked:
"Bingo! This is how it should be solved!"
As soon as ChatGPT came out, Song Ruihua tried it immediately and was very surprised. Although both are dialogue bots, "Xiaoice and ChatGPT are like two different species." ChatGPT doesn't accumulate knowledge around a specific task; it learns the knowledge into the model first. Just as humans accumulate knowledge through daily reading (the more you read, the more you accumulate), when the model encounters a certain "prompt," it can call upon this accumulated knowledge and apply it flexibly, rather than just reciting the original text.
Song Ruihua told Leiphone that she had observed that casual-chat dialogue bots lacked broad world knowledge. She too had thought of using all the articles on the internet to make up for the deficiency, but she didn't have the deep skills of Ilya Sutskever (the OpenAI Chief Scientist in charge of ChatGPT) to pull it off.
In Ilya's view, the abilities needed for all language tasks can be reduced to a single "AI reasoning" ability. And Ilya also believes that all reasoning can be accomplished by predicting the next word. For example, let an AI read a detective novel and absorb all the relationships and clues in it; then, in the novel's last sentence, the detective stands up and says to everyone: "The murderer is ____!" What the model fills in at this point is a strong test of its ability. Some AI models have strong logical abilities and fill in the correct name; some fill in a wrong name but still show some logic; and some fill in something that is not even a name at all.
For Ilya, reasoning comes down to whether the accuracy of predicting the next word has improved. Understanding language is difficult to define, but it can be replaced by "prediction": when an AI keeps learning to predict the next word better, it has already learned to understand and reason. That is why, when Ilya explains why GPT-4 is stronger than GPT-3.5, he emphasizes that "(GPT-4's) accuracy in predicting the next word has improved again." Scholars from Beijing Normal University, Cambridge, and Microsoft have also run IQ and psychological tests on GPT-3.5 and GPT-4 and found that GPT-4's level has improved significantly.
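As a toy illustration of the metric being described here, the snippet below uses GPT-2 via the Hugging Face transformers library as a stand-in to inspect a model's probability distribution over the next token; this is of course not how OpenAI evaluates its models.

```python
# Toy illustration of "how well does the model predict the next word".
# GPT-2 is used as a stand-in model; this is not OpenAI's evaluation setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "The detective stood up and said: the murderer is"
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits                 # (1, seq_len, vocab_size)

# Probability distribution over the NEXT token after the prompt.
next_token_probs = logits[0, -1].softmax(dim=-1)
top = next_token_probs.topk(5)
for p, tok in zip(top.values, top.indices):
    print(f"{tokenizer.decode(tok)!r:>12}  p={p:.3f}")

# Averaging -log p(correct next token) over a corpus gives the loss that
# pre-training minimizes; a lower loss means better next-word prediction.
```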
This was something the first generation of large-model researchers in China had not considered. Before this, scholars in China generally believed that humans excel at mathematical reasoning, so information should be symbolized and knowledge mathematized. Under this mindset, model architectures were often designed to be extremely complex, which limited their capabilities. ChatGPT, by contrast, embodies the aesthetic of "simplicity is best": a straightforward framework combined with a wealth of knowledge and a novel interactive form, which instantly brought the product's effectiveness to life.
The power of natural language was recognized for the first time. In a lecture at MIT in May this year, Geoffrey Hinton also pointed out that AI doesn't need to symbolize information to gain knowledge from text, because humans also rely on language for reasoning. He gave an example that left a deep impression on Song Ruihua. Hinton asked ChatGPT: "We have some rooms in our house that are white, blue, and yellow. The yellow paint will fade to white within a year. If I want all my walls to be white in two years, what should I do?" ChatGPT replied, "You can paint the blue room yellow." Hinton was shocked because, while ChatGPT may not have understood numbers, it seemed to understand what "fading" means.
Although some users have probed ChatGPT's limits with math questions, many early Wudao members believe that ChatGPT has already cracked some of the hardest technical challenges in today's NLP, such as coherence and internal logic in long texts. In some professional scenarios, the answers ChatGPT generates may not be satisfactory, "but these issues can be improved."
After the advent of ChatGPT, LLMs suddenly became popular, and previously overlooked LLM companies like Zhipu, Mianbi, Lingxin, Zhizi, and Shenyan became the rising stars of Chinese capital markets. Zhizi Engine, which previously couldn't raise funds, got a 100 million RMB valuation in its angel round after ChatGPT's release. Investors even asked Lu Zhiwu and his student, Zhizi Engine CEO Gao Yizhao, "Is 100 million enough?"
They firmly believed that LLMs were a major part of AI's future; they just didn't expect the future to arrive so quickly.
However, when the glitz of capital is brushed aside, for scientists seeking to explore language intelligence, the greater revelation of ChatGPT lies in its fundamental understanding of LLMs and its product imagination, which is closely tied to the grand goal OpenAI aims to achieve: AGI (Artificial General Intelligence).
As a product, ChatGPT is almost perfect: it can understand the user's intent and answer a wide variety of questions, and each question usually receives a reasonable answer. It even demonstrates a degree of "knowledge" in most answers, turning question answering into real productivity. This is undoubtedly due to the profound understanding of neural networks and of language that Ilya and his colleagues possess. But what matters even more is that OpenAI made bold bets on the future.
Since its founding in 2015, when everyone said AGI was a pipe dream, the OpenAI team dared to believe it was the future of AI; when everyone chose BERT, they firmly chose GPT. When BAAI Wudao was exploring LLMs, it did not have such grand ambitions; even when Wen Jirong and others proposed researching multimodal LLMs, it was simply because "humans also learn this way", not because they were thinking in the direction of AGI.
After ChatGPT was released, the various LLM teams in Wudaokou quickly launched similar products thanks to their earlier technical accumulation. Zhipu AI, for example, launched ChatGLM in less than two months; Zhizi Engine released ChatImg on March 8... But they know all too well that they are still far from delivering true language intelligence, let alone AGI.
Everyone understands full well that ChatGPT is an inspiration, but it is by no means the endpoint.
Chapter 6
After releasing Wudao 2.0 in June 2021, BAAI kept thinking about the future of LLMs and how they could empower economic and social development. At the launch of Wudao 2.0, Huang Tiejun proposed that LLMs are "carriers of intelligence." In his vision, technology hardware and software make up the base layer, AI applications sit on top, and LLMs act as the "trunk" in between. The significance of LLMs is to turn "intelligence" into a public utility, akin to water, electricity, and the internet. The concept of "Model as a Service" (MaaS) also originated from Wudao (a claim I doubt).
As Wudao reached its 2.0 version, BAAI's computing resources were becoming a bottleneck; only 480 A100 cards were available, insufficient to support multiple teams, and a new purchase of 960 A100 cards was on the way but hadn't yet arrived. With limited resources, BAAI decided to focus on algorithmic innovation for LLMs. All achievements from Wudao 1.0 and 2.0 were open-sourced to support collaborative innovation across academia and industry.
For an open-source project to succeed, it needs to unite a broad community of research and development contributors while also maintaining a stable core technical team. In addition to collaborating with academic scholars, BAAI started external recruitment to build an independent large-model team. In January 2022, Lin Yonghua, former head of the IBM China Research Institute, joined BAAI as Chief Engineer. By June 2022, the LLM training platform "Jiuding" was released, reaching a total computing power of 1000P, and dedicated large-model teams were gradually put in place.
In April 2023, Microsoft President Brad Smith named BAAI as one of the three organizations "at the absolute forefront" globally, alongside OpenAI and Google.
In June 2023, at the 5th BAAI Conference, "Wudao 3.0" was launched. It included the "Wudao-Aquila" series of language models and the "Wudao-Vision" series of visual and multimodal models. Unlike its predecessors, Wudao 3.0 is not just a single LLM but a comprehensive LLM technology system. It also includes the "FlagEval" LLM evaluation system and open platform, as well as the FlagOpen LLM open-source technology system, reflecting a more macroscopic vision of LLM development.
Additionally, Wudao 3.0 goes beyond the scope of BAAI; it represents the first-phase results of a new generation of AI flagship projects, "AI Foundation Model Support Platform and Evaluation Technology."
When Wudao 1.0 and 2.0 were launched in 2021, an expert group for the "New Generation of AI Major Science and Technology Projects" had already begun discussing how the state should support LLMs. BAAI's Wudao represented a bold exploration in this direction. However, there were issues of each entity acting on its own. Therefore, the expert group proposed an open mechanism to strengthen "organized scientific research" and guide the "large-scale training of LLMs" from a "brute-force" competition back to a track of rational innovation.
The proposed mechanism was a "1+X+Y" system. Here, "1" represents the flagship project, "AI Foundation Model Support Platform and Evaluation Technology," serving as the "aircraft carrier" leading the development of LLM technology and industry. "X" consists of a number of key technology projects supporting the core algorithms and technologies of LLMs, selected dynamically through a "horse-racing mechanism." "Y" includes a series of application demonstration projects aimed at significant application scenarios, using the technical systems constructed by the flagship projects to promote the deep application of AI.
This proposal for an LLM flagship project received strong support from the Ministry of Science and Technology and other relevant departments. It was included in the national "Science and Technology Innovation 2030" new generation of AI major science and technology projects guide for 2022. After the review process, in December 2022, a total of "1+8," or nine projects, were successfully approved and began implementation on January 1, 2023.
In the view of Huang Tiejun, "Our country has been forward-looking in the direction of LLMs. A year and a half before ChatGPT came out, we had already deployed an 'aircraft carrier fleet' to focus on LLMs."
Another commendable feature of OpenAI is its excellent organizational ability. In retrospect, BAAI has also managed to bring together a loosely connected group of AI researchers. However, compared to OpenAI, its cohesion is still not strong enough. While having multiple teams working on different directions has its advantages, the downside is obvious: the lack of focused efforts on achieving something big.
Wudao 1.0 and 2.0 have not only spawned the first batch of LLM companies in China but also influenced a group of post-90s AI master's and doctoral students: Yang Zhilin, Qi Fanchao, Zeng Guoyang, Gao Yizhao, Huo Yuqi, and others. More than 85% of the team members in Wudao 1.0 and 2.0 are post-90s young students. After experiencing the pioneering work on LLMs, they have witnessed the explosion of products like Midjourney and ChatGPT in the past year and have many different thoughts about the commercial use of AI in the era of LLMs.
Many of them have grand ambitions to solve the problems of language intelligence and even AGI, and to transform AI into a new productive force in society. As the momentum of economic development begins to wane, strengthening the country through technology has become a consensus. Whether it's visual AI, autonomous driving, or today's LLMs, they all represent society's active desire over the past decade to build new productive forces.
Each era has its own dilemmas, and each era also needs its own salvation. Only by walking a different path can we construct new ways of survival, and the world will always be in the hands of young people.