From Zero to One: A Brief History of BAAI's Wudao LLMs
A Collective Endeavor of Top Academic Minds to Build a Chinese Language Model at an Unprecedented Scale
Hello readers! Last week was a bit of a slow news week in China's AI world, so I took the opportunity to finally translate a feature story that's been on my to-do list for months. Published by Leiphone (雷峰网), one of China's premier deep tech publications, the piece delves into the origin story of Wudao (悟道), a series of Large Language Models (LLMs) from the Beijing Academy of Artificial Intelligence (BAAI). What's truly special about Wudao isn't just its capabilities, but how it became a breeding ground for young, talented Chinese scientists who went on to create their own LLMs and companies. The story is a long one, over 7,000 words (ChatGPT did most of the translation). You can find the original Chinese article here.
Chapter 1
The story began in the autumn of 2018 in Haidian District, Beijing. On October 11, a regular Thursday, Liu Zhiyuan opened the arXiv website as usual and browsed the latest artificial intelligence (AI) work uploaded by scholars from all over the world. Most of the time, the quality of papers on arXiv was uneven, and Liu Zhiyuan only skimmed them to get a general idea. But that day, he was deeply drawn in by a paper from Google's language group.
Originally he had only clicked in to take a look, but the more he read, the more fascinated and surprised he became. Even after closing his computer, he couldn't shake it off for a long time, overwhelmed by the ideas in it. Sure enough, he soon discovered that the paper had also attracted widespread attention from other AI scholars in China. Teachers and students from top universities like Tsinghua, Peking, Renmin, and Fudan were enthusiastically discussing the work.
Everyone vaguely felt: "This could be another technological paradigm shift in the field of AI."
This work was the paper that later became famous as BERT - "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" - which has now been cited over 700,000 times on Google Scholar.
In the Chinese context, "paradigm" (范式) is not a common word. But in Leiphone's interviews about LLMs, the word came up repeatedly: once describing deep learning in 2012, once describing BERT in 2018, and another time describing the direction of LLM startups in China before ChatGPT came out in 2022: "At that time, no one thought about aiming for artificial general intelligence (AGI), but felt that LLMs could become a universal AI paradigm." But that's a story for later.
Back to BERT.
Paradigm refers to the basic system and framework of a field, such as western suits and Hanfu being two different paradigms in the clothing field. On the basis of these two paradigms, fashion designers can design all kinds of styles and models. In short, the paradigm represents a change in underlying thinking, dividing the past from the future.
And BERT's "bidirectional pre-training" approach embodied this potential.
AI has three main directions: computer vision (CV), natural language processing (NLP), and machine learning (ML). The ultimate goal of NLP is to enable computers to understand human language. So how do we judge that a computer has understood human language? For a long time before BERT, the NLP research approach was to break language understanding down into small task directions, such as machine translation, text comparison, semantic analysis, and so on, and then design and train AI algorithms for each task separately. For example, Liu Zhiyuan's research direction during his Ph.D. (2006-2011) was a basic NLP task called "keyword extraction".
The difference between BERT and traditional methods is that in traditional statistical learning or deep learning, the AI algorithm learns directly from data for a specific task (such as text comparison). Before learning this data, the AI is a blank slate without any basic capabilities, and the trained algorithm can only perform that one task. BERT's pre-training method, by contrast, first has the AI read a huge amount of unlabeled text before it learns the task data, like working through a full set of practice papers before an exam, so the trained algorithm performs better in the subsequent "exams".
BERT was not the first language model to use pre-training. A few months earlier, OpenAI had released GPT-1, also a pre-trained language model. BERT's innovation was that, with its bidirectional training idea, it broke pre-training's dependence on a specific task framework.
GPT-1 had a unidirectional structure that could only learn textual information in a single direction (left to right or right to left), so the trained model was good at only one kind of language task: GPT-1 was strong at text generation but weak at understanding. BERT has a bidirectional structure that learns language representations from both the left and right context at once and learns from massive unlabeled data, so a single model can handle multiple language tasks, such as question answering, fill-in-the-blank, and text understanding. It outperformed all models at the time on each task and soon dominated the authoritative NLP leaderboard GLUE.
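To make the contrast concrete, here is a minimal sketch (mine, not from the article) using the Hugging Face transformers library: a BERT-style model fills in a blank using context on both sides, while a GPT-style model can only continue the text from the left.

```python
# Minimal sketch contrasting BERT-style bidirectional "fill in the blank"
# with GPT-style left-to-right generation, via Hugging Face transformers.
from transformers import pipeline

# BERT is trained to predict masked tokens using context on BOTH sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))  # top guess: "paris"

# GPT-style models predict the NEXT token from left context only,
# which makes them natural text generators.
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=5))
```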
Everyone was shocked by BERT's results, just like going back to when deep learning first demonstrated its power in 2012:
That year, Geoffrey Hinton, a professor at the University of Toronto, led two students, Alex Krizhevsky and Ilya Sutskever (now OpenAI's chief scientist), to use deep learning methods to train AlexNet, which swept the world computer vision competition ImageNet, leaving all other statistical learning algorithms far behind. "Deep learning" became famous overnight, and even NLP scholars kept discussing it.
Compared to deep learning, BERT made much smaller waves at the time, but a number of domestic NLP scholars also felt a sense of urgency that it was now or never.
Although there are no precise statistics, many scholars told Leiphone that after the rise of deep learning in 2012, whether in research or deployment, vision was the direction with the largest number of researchers and the hottest research enthusiasm in the domestic AI circle. From 2012 to 2018, the language field did not change as dramatically as vision did, nor did it stand out as much, particularly in embracing the wave of deep learning.
Liu Zhiyuan belonged to the Natural Language Processing Laboratory (THUNLP) at Tsinghua University. In 2012, Sun Maosong, the director of the lab, happened to lead the application for a 973 national key project, and in order to better determine the future technical route for NLP, organized several units, including Peking University, Harbin Institute of Technology, Institute of Automation of Chinese Academy of Sciences, and Baidu to discuss together. Everyone was optimistic about deep learning, so after the project was successfully applied for, THUNLP also turned to deep learning starting in 2013. Later, deep learning swept the globe as expected.
Since then, "daring to revolutionize oneself" has been THUNLP's research spirit. After BERT came out, Liu Zhiyuan quickly decided to turn to pre-training methods as well. Their idea was to use knowledge graph methods to extract pieces of abstract knowledge and inject them into pre-trained language models to make the models smarter. They cooperated with Liu Qun and Jiang Xin from Huawei's Noah's Ark Lab to quickly develop a pre-trained language model called "ERNIE" and submitted it to the top NLP academic conference ACL 2019.
Coincidentally, in 2018, Baidu's NLP team was also shocked by BERT, and completed a pre-trained language model at almost the same time, taking the lead in publishing it on arXiv, also named "ERNIE". Both teams drew on characters from the American children's show Sesame Street, since earlier pre-trained models like ELMo and BERT were also Sesame Street characters; Google had used BERT, so when each looked for a name to match Google's, they landed on the same one.
Both "ERNIE"s outperformed BERT on some tasks. Baidu's arXiv release came out before THUNLP's collaboration paper was accepted, so to distinguish theirs from Baidu's, Liu Zhiyuan et al. changed their model's name, while Baidu kept using it. Later, when Baidu built its LLMs, the Chinese name was "Wenxin" but the English name remained "ERNIE".
As expected, pre-training quickly became the mainstream method in NLP. At the same time, international teams with keen intuition were also quick to follow up on pre-training. In February 2019, OpenAI released GPT-2. Although GPT-2 generated better text than GPT-1, it was still inferior to BERT on many language tasks, so OpenAI's voice was completely drowned out by Google's at the time.
But a year and a half later, history was rewritten again: in June 2020, OpenAI suddenly released a research result that exceeded everyone's imagination - GPT-3, with 175 billion parameters. A fellow pre-trained language model, GPT-3 had roughly 500 times as many parameters as BERT. Not only could it generate language, it also surpassed BERT on various language understanding tasks.
Everyone's research worldview was overturned.
Chapter 2
No one had expected that enlarging the parameters of pre-trained language models would lead to so-called "Emergent Abilities". Google's paper corroborating this phenomenon was not published until a year later.
BERT had 340 million parameters, which was undoubtedly a large model compared to every language model of 2018. But people's focus was more on its pre-training method, and no one thought of directly "piling on parameters" the way OpenAI did. GPT-3's approach of piling on parameters was like having the AI model memorize the entire library outright.
As a result, the rote-memorizing GPT-3 not only had very strong language understanding abilities, but also some reasoning capabilities. Even for some unlabeled data and tasks, GPT-3 could learn as it went, achieving decent results.
Previously, when knowledge was injected into small language models, their intelligence levels would also improve, which everyone could understand. But OpenAI skipped the step of extracting knowledge from text data, and relied entirely on piling parameters to force GPT-3 to learn, which caught everyone completely off guard. Some even claimed that GPT-3 had actually passed the Turing test.
The Turing test was proposed by the "father of AI" Alan Turing in 1950. After 70 years of AI development around the world, this was the first time it was claimed to have been passed, so the impact on the AI community was huge. GPT-3 was not only a major breakthrough in natural language processing, but also a milestone for the field of AI as a whole. For a time, discussion of language intelligence reached unprecedented heights. Not only NLP scholars like Liu Zhiyuan, but also researchers in information retrieval were constantly discussing it.
Even more staggering was OpenAI's claim that they had used 10,000 GPUs to train GPT-3.
Normally in academic research, computing devices account for about 20% of a professor's total research funding, and having more than 500 cards already makes you a "wealthy" player in academia. Previously, most AI scientists in China and abroad had used a single card, or multiple cards on a single machine, when researching NLP. But GPT-3's training used 10,000 cards in total, a run estimated to cost about $12 million USD, over 80 million RMB.
From an engineering perspective, the difficulty of training GPT-3 was also unprecedented. Take BERT as an example: the engineering effort of training the 340-million-parameter BERT versus the 175-billion-parameter GPT-3 is like the difference between building a toy car and building an airplane. The engineering for a toy car doesn't carry over to an airplane; likewise, past knowledge about training small language models didn't apply to LLMs.
GPT-3 crushed BERT; in essence, it was "large-scale pre-trained language models" crushing "pre-trained language models".
On one hand, everyone was excited about GPT-3. On the other hand, deep down they also felt a huge gap. Before this, most domestic scholars had thought Chinese teams' papers were on par with those of top US universities. After GPT-3, they realized there was still such a big gap between themselves and the international state of the art.
In the summer of 2020 in Beijing's Wudaokou, computer and AI scholars from Tsinghua, Peking University, Renmin University, the Chinese Academy of Sciences, and elsewhere were all paying attention to GPT-3. Although no one could clearly explain the mechanisms behind GPT-3's power at the time, intuition told everyone that this was an important watershed in the field of AI. The impact of GPT-3 was so great that some scholars decided to research large pre-trained language models, no matter what.
Liu Zhiyuan was one of them. At the time, the most prominent obstacle to researching LLMs was computing power. Liu Zhiyuan reached out to Tsinghua professors in high-performance computing like Chen Wenguang and Han Wentao, to collaborate on using distributed acceleration computing to reduce the training costs of LLMs. He also looked beyond THUNLP to seek outside help.
At that time, Sun Maosong was the Chief Scientist of Natural Language Processing in an emerging AI research institution less than 100 meters from the east gate of Tsinghua, where Liu Zhiyuan was also a young scientist. Naturally, Liu Zhiyuan thought of going there to discuss collaboration.
This institution is the now famous Beijing Academy of Artificial Intelligence (BAAI).
But at the time, BAAI was a research unit that had just been established for a year and a half and was still developing.
BAAI's establishment was part of the blueprint for building the Beijing International Innovation Center, jointly guided by the Ministry of Science and Technology and the Beijing Municipal Government, with the mission of exploring the frontiers of AI. Through programs like "BAAI Scholars", "BAAI Conventions", and "Qingyuan Meetings", BAAI connected around 100 outstanding AI scientists in Beijing and worked with them to find the "next big thing" in AI.
BAAI President Huang Tiejun told Leiphone that the selection of BAAI Scholars was itself very strict, so once scholars were selected, BAAI would provide them with funding support without requiring them to deliver research results. What BAAI cared about more was everyone jointly exploring the major AI directions worth investing in.
In April 2019, BAAI identified several major directions, including natural language processing, machine learning, and information retrieval, each gathering 5-10 well-known scholars for discussion. The natural language processing direction had Sun Maosong, He Xiaodong, Liu Zhiyuan and others; the intelligent information retrieval direction had Wen Jirong, Tang Jie and others. After GPT-3 came out, the scholars in these groups discussed GPT-3 and how to build China's own LLMs.
Before reaching a consensus, there were several key discussions within BAAI.
The first two were at Yanqi Lake in Beijing. In July 2020, the machine learning group met; the scholars felt GPT-3 represented a major direction and that, now that LLMs had emerged, they should research visual LLMs. But after discussion, they concluded that visual LLMs would require even more computing power, so no action was taken. In August it was the information retrieval and mining group, where Wen Jirong, Tang Jie and others discussed LLMs. In September, at BAAI's office meeting, Liu Zhiyuan proposed researching universal language models.
After National Day, on October 10, BAAI held another discussion at Yanqi Lake, inviting scholars from different directions to attend. They finally reached a consensus at the meeting to form a task force and collaborate on LLMs.
After approval, BAAI sent out "hero recruitment posts" through various channels, inviting scholars interested in LLMs to participate, under the slogan "Heroes don't ask where you're from". The call resonated with scholars, and many signed up.
The first were professors from Tsinghua and Renmin, including Liu Zhiyuan, Wen Jirong, Tang Jie, Huang Minlie and others. Scholars from Peking University, the Chinese Academy of Sciences and other institutions also expressed interest, and some external BAAI members joined as well, such as Yang Hongxia, who was working at Alibaba DAMO Academy at the time. In the end, BAAI's LLM project gathered about 100 people, with then BAAI Deputy Dean Tang Jie appointed as project leader.
That October, BAAI reported this "100-person LLM plan" to then Beijing Mayor Chen Jining. Mayor Chen was very excited and said, "This (LLM) is the nuclear fission point for the future of AI, and will bring prosperous ecological development." Beijing decided to back it strongly and approved special funding for BAAI to purchase computing power.
In fact, at that time many people still didn't understand what LLMs were, and developing them was costly. But in October 2020, from the scholars to BAAI, from Beijing to the Ministry of Science and Technology, everyone reached a consensus: to fully advance the research and development of Chinese LLMs. Afterward, many scholars expressed amazement to Leiphone: "Strangely, everyone was decisive at the time."
Everyone felt LLMs could do something bigger. Beyond language models, the idea of "quantitative change leading to qualitative change" could also produce breakthroughs in other fields. After discussion, they decided to "divide into four groups" and explore Chinese LLMs from four directions: Chinese language models, multimodal models, cognitive models, and protein models, led respectively by Liu Zhiyuan, Wen Jirong and Tang Jie, with Tang Jie responsible for the latter two, essentially "three teams doing four things".
In November 2020, the teams discussed names during the NLP annual conference at Chunhui Garden in Shunyi. Sun Maosong noted that everyone was researching language, so he suggested using "Wen" (文, meaning language/literature). After discussion, the four teams were named after four of the seven imperial libraries that housed the Qing Dynasty's Complete Library of the Four Treasuries: "Wen Yuan", "Wen Lan", "Wen Hui", and "Wen Su".
To signal that they were one entity, BAAI suggested giving the teams a unified codename, and invited everyone to BAAI's then office in the Sai'er Building in Wudaokou. At the meeting, Tang Jie proposed relating the name to Wudaokou, since everyone had deep feelings for the place. Everyone tossed out names, and after the brainstorming, Song Ruihua from Renmin University suggested "Wudao" (悟道), a near-homophone of "Wudaokou", and everyone agreed.
That's how "Wudao" (悟道) came about.
Chapter 3
Wudao's original intention was very pure: to catch up with GPT-3 and research Chinese LLMs.
So what are "Chinese LLMs"?
Nowadays, there are many types of LLMs in China, to the point that the definition of LLMs has become blurred. But in 2020, Wudao members had a very focused understanding: fundamentally, GPT-3 was an English-centric language model, while China didn't have one at the time. Therefore, the "Chinese LLM" should first be a Chinese-centric large-scale pre-trained language model with over 175 billion parameters, like GPT-3.
Although later research showed that monolingual language models also have some multilingual capabilities, in the Chinese context, people found that using GPT-3 to solve many Chinese language tasks often led to semantic ambiguities, logical errors, etc. One reason is that GPT-3's training data is mainly English, and Chinese research teams have no way of knowing GPT-3's detailed training parameters for fine-tuning. So, whether subjectively or objectively, in 2020, independently developing domestic LLMs was an inevitable choice.
BAAI approved the project in October 2020. Since LLMs require large computing power, BAAI also began heavily investing resources like computing power from October. BAAI originally planned to purchase 300P with existing research funds. With Mayor Chen Jining's approval of strong support, it was decided to allocate another 700P from special funds, so the total was 1000P. However, the approval and purchasing process took over a year, so Wudao relied mainly on rented computing power at the start.
Everyone believed LLMs were the future major direction. Related scholars also brought their own resources to participate in BAAI's LLM project: in terms of manpower, each professor brought their teams of graduate students; for resources, when BAAI's computing power was not fully in place, scholars also obtained some computing power through their own channels. For example, Wen Jirong's team initially trained multimodal LLMs on Renmin University's machines, while Tang Jie's team ran on Alibaba Cloud.
Although GPT-3 made big waves, teams like BAAI that fully committed to LLMs were still rare in China at the time, and Wudao was even dismissed for a while. There were two main criticisms: first, developing LLMs was very costly, with computing costs easily reaching tens of millions; second, LLMs were not original innovations, relying only on piling up parameters, with little technical sophistication. But BAAI insisted on exploring.
Only after they truly started the research did they discover that OpenAI was no bluffing charlatan, and that the technical barriers to LLMs were not just a matter of "piling computing power" and "piling parameters". Take Chinese and multimodal LLMs, for example: before Wudao, global AI exploration in these two areas was a complete blank. As the first in China to train LLMs, they were starting from scratch, a very challenging process.
But precisely by relying on this fearless drive to forge ahead, Wudao's LLMs made leapfrog progress within six months.
In December 2020, two months after Wudao's approval, the Wenyuan team of Liu Zhiyuan, Huang Minlie and Han Wentao released the world's first open-source Chinese LLM, "CPM". CPM had only 2.6 billion parameters, negligible next to GPT-3, but its advantage was that it was trained on Chinese data. Moreover, compared with 2019's "ERNIE", CPM's parameter count was hundreds of times larger. This was not only an engineering feat; it also validated the viability of Wenyuan's approach to training Chinese LLMs.
Almost at the same time as CPM, Wenlan and Wenhui also found their solutions. Core Wenlan member Lu Zhiwu's "Twin Towers" approach was validated in December 2020, and Wenhui's 10-billion-parameter model was completed in January 2021. In March 2021, BAAI bundled together Wenyuan's CPM, Wenlan's multimodal model BriVL 1.0 (trained on 300 million image-text pairs), Wenhui's 10-billion-parameter English-Chinese bilingual LLM GLM-10B and multimodal model CogView 1.0, and other results, collectively called "Wudao 1.0", and released them.
Objectively, "Wudao 1.0" did not cause much of a sensation, but at a time when LLMs were still unfamiliar in China, Wudao showed people "what LLMs are": they could write poetry, answer questions, align text and images... more powerful than any previous NLP algorithms.
At the "Wudao 1.0" press conference, BAAI also first proposed the concept of "big models" (大模型), i.e., LLMs. BAAI President Huang Tiejun coined a phrase, saying that in recent years AI development had gradually shifted from "refining models" to "refining big models". That is, after the rise of deep learning in 2012, many small AI models appeared around the world, whereas "refining big models" means training large models intensively: designing more advanced algorithms, integrating more data, and pooling huge computing power, so that one model can serve many enterprises.
In other words, LLMs have not only large parameter counts but high intelligence. This press conference cleared up outside doubts about BAAI, and Wudao's LLMs began to emerge.
In Wenhui, led by Tang Jie, Alibaba DAMO Academy engineer Yang Hongxia and Recurrent AI co-founder Yang Zhilin were core members. BAAI did not restrict Wudao members' research freedom: Yang Hongxia participated in Alibaba's LLMs, and Yang Zhilin led Recurrent AI's cooperation with Huawei. In April 2021, Alibaba released its 27-billion-parameter LLM "PLUG", and Huawei released Pangu. Wudao not only connected scholars, but also strengthened cooperation between academia and industry.
Like Wenyuan, Wenhui also gathered young research talent from high-performance computing, such as Chen Wenguang and Zhai Jidong, who along with Han Wentao belonged to Academician Zheng Weimin's team. For LLMs, high-performance computing's distributed accelerated computing methods are crucial for improving training speed and reducing costs, so HPC talent was given important responsibilities in the Wudao project.
But for Chinese LLMs, high-performance computing's greater influence was in birthing China's first trillion-parameter model: "Wudao 2.0".
At the end of 2020, while advancing Wudao, Tang Jie, Chen Wenguang and Yang Hongxia were also planning something else: applying for the Gordon Bell Prize, known as the "Nobel Prize of supercomputing applications".
To apply for the Gordon Bell Prize, several requirements must be met: one, the supercomputer must be the world's largest; two, the project run on it must max out the machine; three, the project results must be impactful. After completing GLM-10B in January 2021, they decided to run LLMs on a supercomputer.
So they sent over 30 people to the Mountain Sea AI Lab in Qingdao to run LLMs on "Sunway TaihuLight". The students of Tang Jie and Zhai Jidong were the backbone; Zhai Jidong had been recruited by Tang Jie and Chen Wenguang for his outstanding capabilities in parallel training of low-level operators. Some Alibaba engineers also provided online support.
They brought all the data they had to Qingdao, including Chinese, English, and images, mixed together for training. To meet the Gordon Bell Prize requirement of maxing out the machine, they expanded the model to 1.74 trillion parameters, even though training had not converged. After running on the supercomputer for ten days, they had trained several versions of LLMs, each with hundreds of billions of parameters.
Although the scale was huge, the operating costs were also extremely high, beyond almost anyone's means. So they trained a better-converged MoE-based model with 1.75 trillion parameters, 10 times larger than GPT-3, surpassing Google's 1.6-trillion-parameter Switch Transformer released in April 2021 to become the world's largest model at the time. When it was unveiled at BAAI's June 2021 conference as "Wudao 2.0", it shocked the entire audience and received widespread acclaim from top technology teams at home and abroad.
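The article does not describe the model's internals, but as a rough illustration of what "MoE-based" means, here is a minimal PyTorch sketch of top-1 ("switch"-style) expert routing; all layer sizes and names are made up for illustration. The point is that total parameters grow with the number of experts, while each token only runs through one expert, which is how parameter counts can reach the trillions without a proportional rise in per-token compute.

```python
# Minimal PyTorch sketch of top-1 ("switch"-style) mixture-of-experts routing.
# Illustrative only; sizes and names are invented, not Wudao's actual code.
import torch
import torch.nn as nn

class SwitchMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # scores each token per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)
        expert_idx = gate.argmax(dim=-1)         # each token is routed to ONE expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # scale by the gate value so routing stays trainable
                out[mask] = expert(x[mask]) * gate[mask, i].unsqueeze(-1)
        return out

# Total parameters grow roughly linearly with num_experts, but each token only
# touches one expert, so per-token compute stays close to a single dense FFN.
layer = SwitchMoE()
tokens = torch.randn(16, 512)
print(layer(tokens).shape)   # torch.Size([16, 512])
```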
For a time, BAAI gained unmatched glory and joined the international forefront of LLMs.
Apart from this trillion-parameter model, "Wudao 2.0" also included Wenyuan's two 10-billion-scale models (an 11-billion-parameter Chinese model and an 11-billion-parameter English-Chinese bilingual model) and one hundred-billion-scale model (a 198-billion-parameter English-Chinese bilingual MoE model), collectively called "CPM 2.0"; and Wenlan's 5-billion-parameter image-text retrieval LLM BriVL 2.0, China's first multimodal LLM and the world's largest and most extensively trained multimodal model at the time.
Before Wenlan, academia's mainstream approach to multimodality was the "single tower": a 12-layer Transformer, shaped like one tower, into which text and image tokens are fed together to interact, with matches then scored by similarity. But at extremely large parameter scales, comparing candidates online one by one is very inefficient. So Lu Zhiwu proposed the "Twin Towers" approach:
Images are first processed by an image encoder and text by a text encoder, with no interaction between them. Only after each side has been encoded into a higher-level meaning is contrastive learning applied: if the image and the text mean similar things, the two towers' outputs are pulled close together, otherwise pushed apart. Because the images were pre-encoded in parallel into high-dimensional vectors and stored, retrieval only requires encoding the query text, and matching results can be found among the stored vectors in under a second. Wenlan verified the feasibility of the "Twin Towers" approach in November 2020. Two months later, OpenAI released its CLIP architecture (the engine behind DALL-E), built on the same idea.
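As a rough sketch of the "Twin Towers" idea just described (the encoders below are stand-ins, not BriVL's actual networks): the image tower is run once offline and its vectors cached, only the text tower runs at query time, and training pulls matching image-text pairs together with a contrastive loss.

```python
# Minimal sketch of a "Twin Towers" (dual-encoder) retrieval setup.
# The encoders are passed in as stand-ins; this is not BriVL's real code.
import torch
import torch.nn.functional as F

def encode_images(images, image_encoder):
    """Run the image tower once, offline, and cache the normalized vectors."""
    with torch.no_grad():
        return F.normalize(image_encoder(images), dim=-1)

def retrieve(query_text, text_encoder, image_bank, top_k=5):
    """At query time only the text tower runs; retrieval is a matrix multiply."""
    q = F.normalize(text_encoder(query_text), dim=-1)   # (1, d)
    scores = q @ image_bank.T                           # cosine similarities
    return scores.topk(top_k, dim=-1).indices           # best-matching images

def contrastive_loss(img_vecs, txt_vecs, temperature=0.07):
    """Matching pairs pulled together, mismatched pairs pushed apart
    (the same idea CLIP later popularized)."""
    img_vecs = F.normalize(img_vecs, dim=-1)
    txt_vecs = F.normalize(txt_vecs, dim=-1)
    logits = img_vecs @ txt_vecs.T / temperature        # (batch, batch)
    labels = torch.arange(len(logits))                  # i-th image matches i-th text
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```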
Afterward, Lu Zhiwu told Leiphone that they did not see themselves as "doing research by following others"; whether Chinese, multimodal, or trillion-parameter models, Wudao's three groups were all pioneering new frontiers in uncharted territory.
To research multimodal LLMs, Lu Zhiwu devoted all his students to Wenlan, and the team went a full year without publishing any academic papers. In academia, this was an enormous risk for both teachers and students.
Similarly, because high-quality Chinese data was scarce, many of Liu Zhiyuan's and Huang Minlie's students were assigned to data annotation and cleaning while researching Chinese LLMs. For CPM 2.0, Wenyuan's raw data collection reached 50TB, which after cleaning still came to 2.6TB. The students invested a huge amount of time and effort.
In general, BAAI's 100 Wudao members were going all in, "gambling their careers", and unexpectedly they won: after releasing "Wudao 2.0" in June 2021, BAAI Wudao became a prominent flag for Chinese LLMs, and Wudao members became the first pioneers of Chinese LLMs.
Chapter 4
In reality, 2021 was considered the "Year One of LLMs in China." After the release of Wudao 2.0, Baidu released its 10-billion-parameter model PLATO-X and 260-billion-parameter model ERNIE 3.0 Titan in September; in October, Alibaba DAMO Academy released an LLM with up to 10 trillion parameters, known as "M6."
Despite the high cost of training LLMs, a group of dedicated LLM followers emerged in 2021, and authoritative voices emerged both domestically and internationally. Two weeks after the launch of Wudao 2.0, Google published a paper claiming that language models would exhibit "Emergent Abilities" when scaled from tens to hundreds of billions of parameters. In August 2021, a review paper on "Foundation Models" co-authored by more than a hundred Stanford scholars, including Fei-Fei Li and Percy Liang, caused a significant international stir.
However, many Wudao team members knew that in 2021, a true domestically produced LLM with hundreds of billions of parameters had not yet appeared.
The underlying architecture of both the hundred-billion and trillion-parameter versions of Wudao 2.0 was sparse. The trillion-parameter model took up about 20TB of disk space and required over 500 A100 GPUs for inference. After copying the model from Shandong to Beijing, the Wudao team found it too expensive to operate and opened it to the industrial sector. Several companies copied the files but probably couldn't use them either.
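A quick back-of-the-envelope check, under my own assumptions (fp32 weights, 40 GB A100s; the article does not describe the checkpoint format), suggests why the reported figures land at that scale:

```python
# Back-of-the-envelope check of the figures above. Assumptions are mine,
# not the article's: ~1.75e12 parameters, fp32 weights plus overhead,
# and 40 GB of memory per A100.
params = 1.75e12
fp32_weights_tb = params * 4 / 1e12          # about 7 TB of raw fp32 weights
checkpoint_tb = 20                           # disk size reported in the article
a100_memory_gb = 40

print(f"fp32 weights alone: ~{fp32_weights_tb:.0f} TB")
print(f"A100-40GB cards just to hold {checkpoint_tb} TB: "
      f"~{checkpoint_tb * 1000 / a100_memory_gb:.0f}")
# About 500 cards, the same order of magnitude the Wudao team reported.
```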
Technically, the LLM also suffered from "catastrophic forgetting", particularly when image data was added. This significantly weakened the model's textual capabilities, making it even less effective than the 10-billion-parameter GLM-10B.
Compared with the technological breakthroughs themselves, the LLMs' greater contribution was cultivating a generation of young talent who truly understood how to train LLMs. That's why, after the launch of Wudao 2.0, the team members were even more determined to develop a model with hundreds of billions of parameters.
By the end of 2021, at an internal Wudao meeting, Tang Jie proposed several objectives: training a model with hundreds of billions of parameters, developing a text-to-video model, and building a code generation model. But achieving these goals would require 1,000 GPUs running flawlessly for two months, at very high training cost.
Wudao 2.0 had attracted a lot of attention, but computational resources were insufficient. Tang Jie's team was invited to use the 910A machines at the Pengcheng Laboratory. They also received nearly 2,000 Huawei 920 GPUs, which initially reached only 18% of the A100's operator efficiency; after modifications, the efficiency was raised to about 40%.
During this period, Tang Jie's team ported the model to the various cards available on the market. They found that a hundred-billion-parameter model could not be converged quickly with 2,000 910A cards, nor with tens of thousands of DCU cards running for two months. In the end, under the name of his startup Zhipu AI, Tang Jie rented 1,000 cards from the Jinan Supercomputing Center and committed a team of over 20 people to train for 8 months. Finally, in July 2022, they trained the hundred-billion-parameter model GLM-130B.
Meanwhile, other teams built many unprecedented applications on top of Wudao. For example, Liu Zhiyuan's student Qin Yujia wrote a program that used a Chinese LLM to call Bing's search engine to answer questions on Zhihu, accumulating thousands of upvotes. Lu Zhiwu's team used a multimodal LLM to edit short videos, accumulating 1.5 million views on TikTok.
However, the Chinese market was not yet willing to pay for LLMs. After setting up their LLM companies, the Wudao members went out to raise funds full of confidence, but not a single investor was willing to put up money.
All of Wudao's LLM achievements were open-source. But even after tens of millions of API calls following Wenlan's release, many interested large enterprises were unwilling to pay for usage.
In 2022, awareness of LLMs in China was still generally lacking. Everyone knew that LLMs were strong, and everyone also knew that a "hit product" was needed to showcase what LLMs could do. But no one had a solution. Technically, they had become giants; in terms of products, they were still dwarfs.
That was until the appearance of ChatGPT.
Chapter 5
Song Ruihua joined Renmin University in September 2020 and began participating in the Wudao Wenlan research in October. Prior to this, she was the Chief Scientist at Microsoft Xiaoice, specializing in text generation and leading the "Xiaoice Writes Poetry" project.
After moving from Microsoft to Xiaoice in 2018, Song Ruihua began to take an interest in cognitive intelligence and wanted to explore how AI understands human language. That summer, she read a book by Benjamin Bergen, a cognitive science professor at the University of California, San Diego, titled "Louder Than Words: The Science of How the Mind Makes Meaning." She found it inspiring.
The book points out that when humans read a good piece of writing, they often can't stop reading and imagining the scenes corresponding to the text. If the text is well crafted, these scenes come to life in the reader's mind. Therefore, a key indicator of true understanding is the ability to imagine a scene, or even to add content not present in the text.
Additionally, understanding language is not about using words to perform tasks, much like reading books is not about preparing for an exam the next day. However, in the past, scientists in the field of computing often evaluated whether AI understands human language by setting up specific, segmented tasks. For example, they would compare sports articles with financial articles to see if AI could distinguish between them.
Before ChatGPT, most of the technical staff researching AI dialogue in China came from the forum era. Their research ideas mainly originated from forum-style chats, such as thread-based conversations where A posts a topic and B and C reply underneath. In this pattern, a model conducting open dialogue would expose its lack of knowledge, because the knowledge simply wasn't present in these "pairs." One of Song Ruihua's colleagues found during a client visit that the AI was not good at beauty-related dialogues, because its outputs were mainly small talk.
At that time, Song Ruihua kept pondering the problem. She realized the issue was the lack of worldly knowledge in chat "pairs." She thought it would be great if all the text on the internet could be used. At Xiaoice, her idea was to use articles from public accounts, as these accounts often consciously follow hot topics and analyze them from various angles.
However, she took a detour: she approached it in an overly complex way, believing that the text should first be abstracted into a graph, which would then inform the dialogue. For example, if you input "Lu Han 鹿晗 (a Chinese male idol)," a mailbox would appear in the graph as a clue for the AI, because Lu Han took a photo next to a mailbox on the Bund in Shanghai in 2016; the event became news, and his fans would go to that mailbox to check in. But this method had drawbacks: sometimes the original sentences extracted from the articles were too formal or contained extra information, and were not suitable as replies.
When ChatGPT was launched by OpenAI, Song Ruihua had an epiphany and was both excited and shocked:
"Bingo! This is how it should be solved!"
As soon as ChatGPT came out, Song Ruihua tried it immediately and was very surprised. Although both are dialogue bots, "Xiaoice and ChatGPT are like two different species." ChatGPT doesn't accumulate knowledge around a specific task; it learns the knowledge into the model first. Just as humans accumulate knowledge through daily reading (the more you read, the more you accumulate), when the model encounters a certain "prompt," it can call upon this accumulated knowledge and apply it flexibly, rather than just reciting the original text.
Song Ruihua told Leiphone that she had observed that casual-chat dialogue bots lacked broad world knowledge. She too had thought of using all the articles on the internet to make up for the deficiency, but she didn't have the deep skills of Ilya Sutskever (the OpenAI Chief Scientist in charge of ChatGPT) to pull it off.
In Ilya's view, the abilities needed for all language tasks can be reduced to a single "AI reasoning" ability. And Ilya also believes that all reasoning can be accomplished by predicting the next word. For example, let an AI read a detective novel and absorb all the relationships and clues in it; then, in the novel's last sentence, the detective stands up and says to everyone: "The murderer is ____!" What the model fills in at this point is a strong test of its ability. Some AI models have strong logical abilities and fill in the correct name; some fill in a wrong name but still show some logic; and some fill in something that is not even a name at all.
For Ilya, reasoning comes down to whether the accuracy of predicting the next word has improved. Understanding language is difficult to define, but it can be replaced by "prediction": when an AI keeps learning to predict the next word better, it has already learned to understand and reason. That is why, when Ilya explains why GPT-4 is stronger than GPT-3.5, he emphasizes that "(GPT-4's) accuracy in predicting the next word has improved again." Scholars from Beijing Normal University, Cambridge, and Microsoft have also run IQ and psychological tests on GPT-3.5 and GPT-4 and found that GPT-4's level has improved significantly.
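As a toy illustration of the metric being described here, the snippet below uses GPT-2 via the Hugging Face transformers library as a stand-in to inspect a model's probability distribution over the next token; this is of course not how OpenAI evaluates its models.

```python
# Toy illustration of "how well does the model predict the next word".
# GPT-2 is used as a stand-in model; this is not OpenAI's evaluation setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "The detective stood up and said: the murderer is"
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits                 # (1, seq_len, vocab_size)

# Probability distribution over the NEXT token after the prompt.
next_token_probs = logits[0, -1].softmax(dim=-1)
top = next_token_probs.topk(5)
for p, tok in zip(top.values, top.indices):
    print(f"{tokenizer.decode(tok)!r:>12}  p={p:.3f}")

# Averaging -log p(correct next token) over a corpus gives the loss that
# pre-training minimizes; a lower loss means better next-word prediction.
```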
This was something the first generation of large-model researchers in China had not considered. Before this, scholars in China generally believed that humans excel at mathematical reasoning, so information should be symbolized and knowledge mathematized. Under this mindset, model architectures were often designed to be extremely complex, which limited their capabilities. ChatGPT, by contrast, embodies the aesthetic of "simplicity is best": a straightforward framework combined with a wealth of knowledge and a novel interactive form, which instantly brought the product's effectiveness to life.
The power of natural language was recognized for the first time. In a lecture at MIT in May this year, Geoffrey Hinton also pointed out that AI doesn't need to symbolize information to gain knowledge from text, because humans also rely on language for reasoning. He gave an example that left a deep impression on Song Ruihua. Hinton asked ChatGPT: "We have some rooms in our house that are white, blue, and yellow. The yellow paint will fade to white within a year. If I want all my walls to be white in two years, what should I do?" ChatGPT replied, "You can paint the blue room yellow." Hinton was shocked because, while ChatGPT may not have understood numbers, it seemed to understand what "fading" means.
Although some users have probed ChatGPT's limits with math questions, many early Wudao members believe that ChatGPT has already cracked some of the hardest technical challenges in today's NLP, such as coherence and internal logic in long texts. In some professional scenarios, the answers ChatGPT generates may not be satisfactory, "but these issues can be improved."
After the advent of ChatGPT, LLMs suddenly became popular, and previously overlooked LLM companies like Zhipu, Mianbi, Lingxin, Zhizi, and Shenyan became the rising stars of Chinese capital markets. Zhizi Engine, which previously couldn't raise funds, got a 100 million RMB valuation in its angel round after ChatGPT's release. Investors even asked Lu Zhiwu and his student, Zhizi Engine CEO Gao Yizhao, "Is 100 million enough?"
They firmly believed that LLMs were a major part of AI's future; they just didn't expect the future to arrive so quickly.
However, when the glitz of capital is brushed aside, for scientists seeking to explore language intelligence, the greater revelation of ChatGPT lies in its fundamental understanding of LLMs and its product imagination, which is closely tied to the grand goal OpenAI aims to achieve: AGI (Artificial General Intelligence).
As a product, ChatGPT is almost perfect: it can understand the user's intent and answer a wide variety of questions, and each question usually receives a reasonable answer. It even demonstrates a degree of "knowledge" in most answers, turning question answering into real productivity. This is undoubtedly due to the profound understanding of neural networks and of language that Ilya and his colleagues possess. But what matters even more is that OpenAI made bold bets on the future.
Since its founding in 2015, when everyone said AGI was a pipe dream, the OpenAI team dared to believe it was the future of AI; when everyone chose BERT, they firmly chose GPT. When BAAI Wudao was exploring LLMs, it did not have such grand ambitions; even when Wen Jirong and others proposed researching multimodal LLMs, it was simply because "humans also learn this way", not because they were thinking in the direction of AGI.
After ChatGPT was released, the various LLM teams in Wudaokou quickly launched similar products thanks to their earlier technical accumulation. Zhipu AI, for example, launched ChatGLM in less than two months; Zhizi Engine released ChatImg on March 8... But they know all too well that they are still far from delivering true language intelligence, let alone AGI.
Everyone understands full well that ChatGPT is an inspiration, but it is by no means the endpoint.
Chapter 6
After releasing Wudao 2.0 in June 2021, BAAI kept thinking about the future of LLMs and how they could empower economic and social development. At the launch of Wudao 2.0, Huang Tiejun proposed that LLMs are "carriers of intelligence." In his vision, technology hardware and software make up the base layer, AI applications sit on top, and LLMs act as the "trunk" in between. The significance of LLMs is to turn "intelligence" into a public utility, akin to water, electricity, and the internet. The concept of "Model as a Service" (MaaS) also originated from Wudao (a claim I doubt).
As Wudao reached its 2.0 version, BAAI's computing resources were becoming a bottleneck; only 480 A100 cards were available, insufficient to support multiple teams, and a new purchase of 960 A100 cards was on the way but hadn't yet arrived. With limited resources, BAAI decided to focus on algorithmic innovation for LLMs. All achievements from Wudao 1.0 and 2.0 were open-sourced to support collaborative innovation across academia and industry.
For an open-source project to succeed, it needs to unite a broad community of research and development contributors while also maintaining a stable core technical team. In addition to collaborating with academic scholars, BAAI started external recruitment to build an independent large-model team. In January 2022, Lin Yonghua, former head of the IBM China Research Institute, joined BAAI as Chief Engineer. By June 2022, the LLM training platform "Jiuding" was released, reaching a total computing power of 1000P, and dedicated large-model teams were gradually put in place.
In April 2023, Microsoft President Brad Smith named BAAI as one of the three organizations "at the absolute forefront" globally, alongside OpenAI and Google.
In June 2023, at the 5th BAAI Conference, "Wudao 3.0" was launched. It included the "Wudao-Aquila" series of language models and the "Wudao-Vision" series of visual and multimodal models. Unlike its predecessors, Wudao 3.0 is not just a single LLM but a comprehensive LLM technology system. It also includes the "FlagEval" LLM evaluation system and open platform, as well as the FlagOpen LLM open-source technology system, reflecting a more macroscopic vision of LLM development.
Additionally, Wudao 3.0 goes beyond the scope of BAAI; it represents the first-phase results of a new generation of AI flagship projects, "AI Foundation Model Support Platform and Evaluation Technology."
When Wudao 1.0 and 2.0 were launched in 2021, an expert group for the "New Generation of AI Major Science and Technology Projects" had already begun discussing how the state should support LLMs. BAAI's Wudao represented a bold exploration in this direction. However, there were issues of each entity acting on its own. Therefore, the expert group proposed an open mechanism to strengthen "organized scientific research" and guide the "large-scale training of LLMs" from a "brute-force" competition back to a track of rational innovation.
The proposed mechanism was a "1+X+Y" system. Here, "1" represents the flagship project, "AI Foundation Model Support Platform and Evaluation Technology," serving as the "aircraft carrier" leading the development of LLM technology and industry. "X" consists of a number of key technology projects supporting the core algorithms and technologies of LLMs, selected dynamically through a "horse-racing mechanism." "Y" includes a series of application demonstration projects aimed at significant application scenarios, using the technical systems constructed by the flagship projects to promote the deep application of AI.
This proposal for an LLM flagship project received strong support from the Ministry of Science and Technology and other relevant departments. It was included in the national "Science and Technology Innovation 2030" new generation of AI major science and technology projects guide for 2022. After the review process, in December 2022, a total of "1+8," or nine projects, were successfully approved and began implementation on January 1, 2023.
In the view of Huang Tiejun, "Our country has been forward-looking in the direction of LLMs. A year and a half before ChatGPT came out, we had already deployed an 'aircraft carrier fleet' to focus on LLMs."
Another commendable feature of OpenAI is its excellent organizational ability. In retrospect, BAAI has also managed to bring together a loosely connected group of AI researchers. However, compared to OpenAI, its cohesion is still not strong enough. While having multiple teams working on different directions has its advantages, the downside is obvious: the lack of focused efforts on achieving something big.
Wudao 1.0 and 2.0 have not only spawned the first batch of LLM companies in China but also influenced a group of post-90s AI master's and doctoral students: Yang Zhilin, Qi Fanchao, Zeng Guoyang, Gao Yizhao, Huo Yuqi, and others. More than 85% of the team members in Wudao 1.0 and 2.0 are post-90s young students. After experiencing the pioneering work on LLMs, they have witnessed the explosion of products like Midjourney and ChatGPT in the past year and have many different thoughts about the commercial use of AI in the era of LLMs.
Many of them have grand ambitions to solve the problems of language intelligence and even AGI, and to transform AI into a new productive force in society. As the momentum of economic development begins to wane, strengthening the country through technology has become a consensus. Whether it's visual AI, autonomous driving, or today's LLMs, they all represent society's active desire over the past decade to build new productive forces.
Each era has its own dilemmas, and each era also needs its own salvation. Only by walking a different path can we construct new ways of survival, and the world will always be in the hands of young people.