From Zero to One: A Brief History of BAAI's Wudao LLMs
A Collective Endeavor of Top Academic Minds to Build a Chinese Language Model at an Unprecedented Scale
Hello readers! Last week was a bit of a slow news week in China's AI world, so I took the opportunity to finally translate a feature story that's been on my to-do list for months. Published by Leiphone (雷峰网), one of China's premier deep tech publications, the piece delves into the origin story of Wudao (悟道), a series of Large Language Models (LLMs) from the Beijing Academy of Artificial Intelligence (BAAI). What's truly special about Wudao isn't just its capabilities, but how it became a breeding ground for young, talented Chinese scientists who went on to create their own LLMs and companies. The story is a long one, over 7,000 words (ChatGPT did most of the translation). You can find the original Chinese article here.
Chapter 1
The story began in the autumn of 2018 in Haidian District, Beijing. On October 11, an ordinary Thursday, Liu Zhiyuan opened the arXiv website as usual and browsed the latest artificial intelligence (AI) work uploaded by scholars from around the world. Most of the time the quality of arXiv papers was uneven, and Liu Zhiyuan only skimmed them to get a general idea. But that day, he was deeply drawn in by a paper from Google's language team.
Originally he had only clicked in for a quick look, but the more he read, the more fascinated and surprised he became. Even after closing his laptop, he couldn't calm down for a long time, still turning over the ideas in it. Sure enough, he soon discovered that the paper had also attracted widespread attention from other AI scholars in China. Faculty and students from top universities like Tsinghua, Peking, Renmin, and Fudan were enthusiastically discussing the work.
Everyone vaguely felt: "This could be another technological paradigm shift in the field of AI."
This work was what would become the famous BERT paper, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", which has now been cited over 700,000 times on Google Scholar.
In Chinese, "paradigm" (范式) is not a common word. But in Leiphone's interviews about LLMs, the word came up repeatedly: once describing deep learning in 2012, once BERT in 2018, and once the direction of LLM startups in China before ChatGPT came out in 2022: "At that time, no one thought about aiming for artificial general intelligence (AGI), but we felt that LLMs could become a universal AI paradigm." But that's a story for later.
Back to BERT.
A paradigm is the basic system and framework of a field; Western suits and Hanfu, for example, are two different paradigms in clothing, and on top of either one fashion designers can create all kinds of styles. In short, a paradigm shift represents a change in underlying thinking, dividing the past from the future.
And BERT's "bidirectional pre-training" approach embodied exactly this potential.
AI has three main directions: computer vision (CV), natural language processing (NLP), and machine learning (ML). The ultimate goal of NLP is to enable computers to understand human language. So how do we judge whether a computer has understood human language? For a long time before BERT, the NLP research approach was to break language understanding down into small task directions, such as machine translation, text matching, and semantic analysis, and then design and train AI algorithms for each task separately. For example, Liu Zhiyuan's research direction during his Ph.D. (2006-2011) was a basic NLP task called "keyword extraction".
The difference between BERT and traditional methods is that with traditional statistical learning or deep learning, an AI algorithm learns directly from data for one specific task (such as text matching). Before seeing that data, the AI is a blank slate with no basic capabilities, and the trained algorithm can only perform that one task. BERT's pre-training approach instead lets the AI first "read" a massive amount of unlabeled text before learning the task data, like working through a full set of practice papers before an exam, so the trained algorithm performs better in subsequent "exams".
BERT was not the first pre-trained language model. A few months earlier, OpenAI had released GPT-1, which was also a pre-trained language model. BERT's innovation was that, through bidirectional training, it freed pre-training from dependence on task-specific frameworks.
GPT-1 had a unidirectional structure that could only read text in one direction, so the trained algorithm was strong at only one kind of language task: GPT-1 was good at text generation but weaker at understanding. BERT has a bidirectional structure that learns language representations from both left and right context at once, pre-training on massive unlabeled data; the result could handle multiple language tasks at once, such as question answering, fill-in-the-blank, and text understanding, outperformed all models of the time on each of them, and soon dominated the authoritative NLP leaderboard GLUE.
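To make the contrast concrete, here is a minimal sketch using the Hugging Face transformers library (my own illustration, not from the original article; GPT-2 stands in for GPT-1 since both are left-to-right models):

```python
# Two pre-training styles, side by side. Assumes `pip install transformers torch`.
from transformers import pipeline

# BERT: bidirectional, pre-trained to fill in masked words using context on both sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))  # top guesses for the blank

# GPT-style: unidirectional (left to right), pre-trained to continue text.
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=5))
```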
Everyone was shocked by BERT's results, just as when deep learning first demonstrated its power in 2012:
That year, Geoffrey Hinton, a professor at the University of Toronto, led two students, Alex Krizhevsky and Ilya Sutskever (now OpenAI's chief scientist), in using deep learning to train AlexNet, which swept the world computer vision competition ImageNet, leaving all other statistical learning algorithms far behind. "Deep learning" became famous overnight, and even NLP scholars kept discussing it.
Compared to deep learning, BERT made much smaller waves at the time, but a number of domestic NLP scholars felt a now-or-never sense of urgency.
Although there are no precise statistics, many scholars told Leiphone that after the rise of deep learning in 2012, vision was the direction with the most researchers and the hottest enthusiasm in the domestic AI circle, in both research and deployment. From 2012 to 2018, the language field changed far less than vision did, and it did not particularly stand out in how it embraced the deep learning wave.
Liu Zhiyuan belonged to the Natural Language Processing Laboratory (THUNLP) at Tsinghua University. In 2012, Sun Maosong, the director of the lab, happened to be leading the application for a 973 national key project, and in order to better determine NLP's future technical route, he organized joint discussions among several institutions, including Peking University, Harbin Institute of Technology, the Institute of Automation of the Chinese Academy of Sciences, and Baidu. Everyone was optimistic about deep learning, so after the project was approved, THUNLP turned to deep learning starting in 2013. Deep learning then swept the globe as expected.
Since then, "daring to revolutionize oneself" has been THUNLP's research spirit. After BERT came out, Liu Zhiyuan quickly decided to turn to pre-training methods. Their idea was to use knowledge graph methods to extract pieces of abstract knowledge and inject them into pre-trained language models to make the models smarter. They cooperated with Liu Qun and Jiang Xin from Huawei's Noah's Ark Lab to quickly develop a pre-trained language model called "ERNIE" and submitted it to the top NLP academic conference ACL 2019.
Coincidentally, in 2018 Baidu's NLP team was also shocked by BERT, completed a pre-trained language model at almost the same time, and took the lead in publishing it on arXiv, also named "ERNIE". Both teams had named their models after characters from the American show Sesame Street, since earlier pre-trained models like ELMo and BERT were also Sesame Street characters. Google had used BERT, so when each team looked for a name to match Google's, they landed on the same one independently.
The two "ERNIE"s outperformed BERT on some tasks. Baidu's arXiv release came before THUNLP's collaboration paper, which was accepted later. To distinguish theirs from Baidu's, Liu Zhiyuan and his colleagues renamed their model, while Baidu kept the name. Later, when Baidu built its LLMs, the Chinese name was "Wenxin", but the English name remained "ERNIE".
As expected, pre-training quickly became the mainstream method in NLP. Some international teams with keen intuition followed up quickly as well. In February 2019, OpenAI released GPT-2. Although GPT-2 generated better text than GPT-1, it was still inferior to BERT on many language tasks, so OpenAI's voice was completely drowned out by Google's at the time.
But a year and a half later, history was rewritten again: in June 2020, OpenAI suddenly released a result that exceeded everyone's imagination, GPT-3, with 175 billion parameters. As a fellow pre-trained language model, GPT-3 had roughly 500 times as many parameters as BERT. It could not only generate language, but also surpassed BERT on various language understanding tasks.
Everyone's research worldview was subverted.
Chapter 2
No one expected that enlarging the parameters of pre-trained language models would lead to so-called "Emergent Abilities". Google's corroborating paper on this phenomenon was not published until a year later.
BERT had 340 million parameters, which counted as large compared to every other language model in 2018. But people's focus was on its pre-training method, and no one had thought of directly "piling on parameters" the way OpenAI did. GPT-3's approach of piling on parameters was like having the AI model memorize an entire library.
As a result, the rote-memorizing GPT-3 not only had very strong language understanding abilities, but also some reasoning capabilities. Even for some unlabeled data and tasks, GPT-3 could learn as it went, achieving decent results.
Previously, when knowledge was injected into small language models, their intelligence levels would also improve, which everyone could understand. But OpenAI skipped the step of extracting knowledge from text data, and relied entirely on piling parameters to force GPT-3 to learn, which caught everyone completely off guard. Some even claimed that GPT-3 had actually passed the Turing test.
The Turing test was proposed by the "father of AI" Alan Turing in 1950. After 70 years of global AI development, this was claimed to be the first time it had been passed, so the impact on the AI community was huge. GPT-3 was not only a major breakthrough in natural language processing but a milestone for AI as a whole. For a time, discussion of language intelligence reached unprecedented heights; not only NLP scholars like Liu Zhiyuan but also researchers in information retrieval were constantly talking about it.
Even more staggering was OpenAI's claim that it used 10,000 GPUs to train GPT-3.
Normally in academic research, computing equipment accounts for about 20% of a professor's total research funding, and having more than 500 cards already makes you a "wealthy" player in academia. Previously, most AI scientists in China and abroad used a single card, or several cards on a single machine, when researching NLP. But GPT-3's training reportedly used 10,000 cards in total, with a training cost of about $12 million USD, over 80 million RMB.
From an engineering perspective, the difficulty of training GPT-3 was also unprecedented. Take BERT as an example: the engineering effort for training the 340-million-parameter BERT versus the 175-billion-parameter GPT-3 is like the difference between building a toy car and building an airplane. The engineering of a toy car doesn't transfer to an airplane; similarly, past know-how about training small language models did not carry over to LLMs.
GPT-3's crushing of BERT was, in essence, "large-scale pre-trained language models" crushing "pre-trained language models".
On one hand, everyone was excited about GPT-3. On the other, they also felt a huge gap. Before this, most domestic scholars thought Chinese teams' papers were on par with those from top US universities. After GPT-3, they realized how large the distance between themselves and the international state of the art still was.
In the summer of 2020 in Beijing's Wudaokou, computer and AI scholars from Tsinghua, Peking University, Renmin University, the Chinese Academy of Sciences, and elsewhere were all paying attention to GPT-3. Although no one could clearly explain the mechanisms behind GPT-3's power at the time, intuition told everyone that this was an important watershed for AI. The impact was so great that some scholars decided to research large pre-trained language models no matter the cost.
Liu Zhiyuan was one of them. At the time, the most prominent obstacle to researching LLMs was computing power. Liu Zhiyuan reached out to Tsinghua professors in high-performance computing such as Chen Wenguang and Han Wentao to collaborate on using distributed, accelerated computing to reduce LLM training costs. He also looked beyond THUNLP for outside help.
At that time, Sun Maosong was the Chief Scientist of Natural Language Processing at an emerging AI research institution less than 100 meters from Tsinghua's east gate, where Liu Zhiyuan was also a young scientist. Naturally, Liu Zhiyuan thought of going there to discuss collaboration.
This institution is the now famous Beijing Academy of Artificial Intelligence (BAAI).
But at the time, BAAI was a research unit that had just been established for a year and a half and was still developing.
BAAI's establishment was part of the blueprint for building Beijing into an international innovation center, jointly guided by the Ministry of Science and Technology and the Beijing Municipal Government, with the mission of exploring the frontiers of AI. Through programs like "BAAI Scholars", "BAAI Conferences", and "Qingyuan Meetings", BAAI connected around 100 outstanding AI scientists in Beijing, while working with them to find the "next big thing" in AI.
BAAI President Huang Tiejun told Leiphone that the selection of BAAI Scholars was itself very strict, so once scholars were selected, BAAI provided them with funding support without requiring them to deliver research results. Instead, BAAI cared more about having everyone jointly explore major AI directions worth investing in.
In April 2019, BAAI identified several major directions, including natural language processing, machine learning, and information retrieval, each gathering 5-10 well-known scholars for discussion. The natural language processing direction included Sun Maosong, He Xiaodong, Liu Zhiyuan, and others; the intelligent information retrieval direction included Wen Jirong, Tang Jie, and others. After GPT-3 came out, scholars in these directions discussed GPT-3 and how China could research its own LLMs.
Before reaching a consensus, there were several key discussions within BAAI.
The first two were at Yanqi Lake in Beijing. In July 2020, the machine learning group met; scholars in this direction felt GPT-3 pointed to a major opportunity and that, now that LLMs had emerged, they should research visual LLMs. But after discussion, they concluded that visual LLMs would require even more computing power, so no action was taken. In August, the information retrieval and mining group met, and Wen Jirong, Tang Jie, and others discussed LLMs. In September, at a BAAI office meeting, Liu Zhiyuan proposed researching universal language models.
After National Day, on October 10, BAAI held another discussion at Yanqi Lake, inviting scholars from different directions to attend. They finally reached a consensus at the meeting to form a task force and collaborate on LLMs.
After approval, BAAI sent out "hero recruitment posts" through various channels, inviting scholars interested in LLMs to participate, with the slogan "Heroes don't ask where you're from". The call resonated with scholars, and many signed up.
The first were professors from Tsinghua and Renmin University, including Liu Zhiyuan, Wen Jirong, Tang Jie, Huang Minlie, and others. Scholars from Peking University, the Chinese Academy of Sciences, and other institutions soon expressed interest, and some external members joined as well, such as Yang Hongxia, who was then at Alibaba DAMO Academy. In the end, BAAI's LLM project gathered about 100 people, with then BAAI Deputy Dean Tang Jie appointed as project leader.
That October, BAAI reported this "100-person LLM plan" to then Beijing Mayor Chen Jining. Mayor Chen was very excited and said, "This (LLMs) is the nuclear fission point for the future of AI and will bring prosperous ecological development." Beijing decided to support it strongly and approved special funding for BAAI to purchase computing power.
In fact, at that time many people still didn't understand what LLMs were, and developing them was costly. But in October 2020, from scholars to BAAI, from Beijing to the Ministry of Science and Technology, everyone reached a consensus: fully advance the research and development of Chinese LLMs. Afterward, many scholars expressed amazement to Leiphone: "Strangely, everyone was decisive at the time."
Everyone felt LLMs could do something bigger. Beyond language, the idea of "quantitative change leading to qualitative change" might also bring breakthroughs in other fields. After discussion, they decided to "divide into four groups" and explore Chinese LLMs from four directions: Chinese language models, multimodal models, cognitive models, and protein models, led by Liu Zhiyuan, Wen Jirong, and Tang Jie, with Tang Jie responsible for the latter two, essentially "three teams doing four things".
In November 2020, the teams discussed names during the NLP annual conference at Chunhui Garden in Shunyi. Sun Maosong said that since everyone was researching language, he suggested using "Wen 文" (language/literature). After discussion, the four teams were named after four of the seven imperial pavilions that housed the Qing Dynasty's Complete Library of the Four Treasuries: "Wenyuan 文源", "Wenlan 文澜", "Wenhui 文汇", and "Wensu 文溯".
To show they were one entity, BAAI suggested giving them a unified codename and invited everyone to its then office in the Sai'er Building in Wudaokou. At the meeting, Tang Jie proposed relating the name to Wudaokou, since everyone had deep feelings for the area. People tossed out a few candidates, and after the brainstorm, Song Ruihua from Renmin University suggested "Wudao 悟道", a near-homophone of "Wudaokou", and everyone agreed.
That's how "Wudao 悟道" came about.
Chapter 3
Wudao's original intention was very pure: catch up with GPT-3 and research Chinese LLMs.
So what is a "Chinese LLM"?
Nowadays there are many kinds of LLMs in China, to the point that the definition has become blurred. But in 2020, Wudao members had a very focused understanding: GPT-3 was fundamentally an English-centric language model, and China did not have an equivalent. A "Chinese LLM" should therefore first be a Chinese-centric, large-scale pre-trained language model on the scale of GPT-3's 175 billion parameters.
Although later research showed that monolingual models also have some multilingual ability, in practice people found that using GPT-3 on many Chinese language tasks often produced semantic ambiguities, logical errors, and the like. One reason is that GPT-3's training data was mainly English, and Chinese research teams had no access to GPT-3's weights or training details for fine-tuning. So, both subjectively and objectively, independently developing domestic LLMs in 2020 was an inevitable choice.
BAAI approved the project in October 2020. Since LLMs require massive computing power, BAAI also began investing heavily from October. It originally planned to purchase 300P of computing power with existing research funds; with Mayor Chen Jining's backing, it decided to allocate another 700P from special funds, for a total of 1000P. However, the approval and purchasing process took over a year, so at the start Wudao relied mainly on rented computing power.
Everyone believed LLMs were the major direction of the future, and the scholars involved brought their own resources to BAAI's LLM project: in manpower, each professor brought their graduate students; in resources, while BAAI's computing power was not yet fully in place, scholars obtained some through their own channels. For example, Wen Jirong's team initially trained multimodal LLMs on Renmin University's machines, while Tang Jie's team ran on Alibaba Cloud.
Although GPT-3 made big waves, teams like BAAI that were fully committed to LLMs were still rare in China at the time, and Wudao was even belittled for a while. There were two main reasons for the dismissiveness: first, developing LLMs was very costly, with computing bills easily reaching tens of millions; second, LLMs were seen as lacking original innovation, relying only on piling up parameters, with little technical sophistication. But BAAI insisted on exploring.
Once they actually started the research, they discovered that OpenAI was no bluffing charlatan, and that the technical barriers to LLMs went far beyond "piling up computing power" and "piling up parameters". Take Chinese and multimodal LLMs: before Wudao, global AI exploration in these two areas was essentially blank. As the first in China to train LLMs, they were starting from scratch, a very challenging process.
But relying precisely on this fearless drive, Wudao's LLMs made leapfrog progress within six months.
In December 2020, two months after Wudao's approval, Liu Zhiyuan, Huang Minlie, and Han Wentao's Wenyuan team released the world's first open-source Chinese LLM, "CPM". CPM had only 2.6 billion parameters, negligible next to GPT-3, but its advantage was being trained on Chinese data. Moreover, compared to 2019's "ERNIE", CPM's parameter count had grown by more than an order of magnitude. This was not only an engineering feat but also validated Wenyuan's approach to training Chinese LLMs.
Almost at the same time as CPM, Wenlan and Wenhui also found their solutions. Core Wenlan member Lu Zhiwu's "Twin Towers" approach was validated in December 2020, and Wenhui's 10-billion-parameter model was completed in January 2021. In March 2021, BAAI combined Wenyuan's CPM, Wenlan's multimodal model BriVL 1.0 (trained on 300 million image-text pairs), Wenhui's 10-billion-parameter English-Chinese bilingual LLM GLM-10B and multimodal model CogView 1.0, and other results, collectively named them "Wudao 1.0", and released them that month.
Objectively, "Wudao 1.0" did not cause much of a sensation, but at a time when LLMs were still unfamiliar in China, Wudao showed people "what LLMs are": they could write poetry, answer questions, align text and images... more powerful than any previous NLP algorithms.
At the "Wudao 1.0" press conference, BAAI also first proposed the concept of "large models" (大模型, i.e., LLMs). BAAI President Huang Tiejun coined a phrase, saying that in recent years AI development had gradually shifted from "refining models" to "refining large models": after deep learning rose in 2012, many small AI models appeared around the world, whereas "refining large models" means training large models intensively, designing more advanced algorithms, integrating more data, and pooling huge computing power, so that one model can serve many enterprises.
In other words, LLMs are not only large in parameters but high in intelligence. This press conference cleared up outside doubts about BAAI, and Wudao's LLMs began to emerge.
In the Wenhui team led by Tang Jie, Alibaba DAMO Academy engineer Yang Hongxia and Recurrent AI co-founder Yang Zhilin were core members. BAAI did not restrict Wudao members' research freedom: Yang Hongxia participated in Alibaba's LLMs, and Yang Zhilin led Recurrent AI's cooperation with Huawei. In April 2021, Alibaba released its 27-billion-parameter LLM "PLUG", and Huawei released Pangu. Wudao not only connected scholars but also strengthened cooperation between academia and industry.
Like Wenyuan, Wenhui also gathered young research talent from high-performance computing, such as Chen Wenguang and Zhai Jidong, who along with Han Wentao belonged to Academician Zheng Weimin's team. For LLMs, the distributed acceleration techniques of high-performance computing are crucial for improving training speed and reducing costs, so HPC talent was given important responsibilities in the Wudao project.
But for Chinese LLMs, high-performance computing's greater contribution was helping to birth China's first trillion-parameter model: "Wudao 2.0".
At the end of 2020, while advancing Wudao, Tang Jie, Chen Wenguang, and Yang Hongxia were also planning something else: applying for the Gordon Bell Prize, known as the "Nobel Prize of supercomputing applications".
To compete for the Gordon Bell Prize, several requirements must be met: first, the supercomputer must be of world-leading scale; second, the project must push the machine to its limits; third, the results must be impactful. After completing GLM-10B in January 2021, they decided to run LLMs on a supercomputer.
So they sent over 30 people to the Mountain Sea AI Lab in Qingdao to run LLMs on the "Sunway TaihuLight" supercomputer. Students of Tang Jie and Zhai Jidong were the backbone; Zhai Jidong had been recruited by Tang Jie and Chen Wenguang for his outstanding capabilities in parallelizing low-level operators. Some Alibaba engineers also provided online support.
They brought all the data they had to Qingdao, Chinese, English, images, and more, mixed together for training. To meet the Gordon Bell requirement of pushing the machine to its limits, they expanded the model to 1.74 trillion parameters, though it did not converge. After ten days on the supercomputer, they had trained several versions of LLMs, each with hundreds of billions of parameters.
Although the scale was huge, the operating costs were also extremely high, beyond almost anyone's means. So they trained a better-converged MoE-based model with 1.75 trillion parameters, ten times the size of GPT-3, surpassing Google's 1.6-trillion-parameter Switch Transformer, released earlier in 2021, to become the world's largest model at the time. Unveiled at BAAI's June 2021 conference as "Wudao 2.0", it shocked the entire audience and received widespread acclaim from top domestic and foreign technology teams.
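For readers unfamiliar with MoE, here is a toy sketch of the Mixture-of-Experts idea behind such parameter counts (my own illustration in PyTorch, not the Wudao 2.0 implementation): total parameters grow with the number of experts, while each token activates only a few of them, so the compute per token stays close to that of a single expert.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """A sparse feed-forward layer: a router sends each token to its top-k experts."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=1):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # decides which experts see each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)         # routing probabilities
        weight, idx = gate.topk(self.top_k, dim=-1)      # top-k experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e).any(dim=-1)                # tokens routed to expert e
            if mask.any():
                w = weight[mask][idx[mask] == e].unsqueeze(-1)
                out[mask] += w * expert(x[mask])
        return out

# Parameter count scales with num_experts, but each token runs through only top_k experts.
layer = ToyMoELayer()
print(layer(torch.randn(16, 512)).shape)   # torch.Size([16, 512])
```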
For a time, BAAI gained unmatched glory and joined the international forefront of LLMs.
Apart from this trillion-parameter model, "Wudao 2.0" also included Wenyuan's two 10-billion-scale models (an 11-billion-parameter Chinese model and an 11-billion-parameter English-Chinese bilingual model) and one hundred-billion-scale model (a 198-billion-parameter English-Chinese bilingual MoE model), collectively called "CPM 2.0", as well as Wenlan's 5-billion-parameter image-text retrieval LLM BriVL 2.0, China's first multimodal LLM and, at the time, the world's largest multimodal model trained on the most data.
Before Wenlan, academia's mainstream approach to multimodality was the "single tower": a Transformer of, say, 12 layers, looking like one tower, where text and image tokens are fed in together to interact and the model then scores their similarity. But at extremely large scale, comparing candidates one by one online is very inefficient. So Lu Zhiwu proposed the "Twin Towers" approach:
Images are first processed by an image encoder and text by a text encoder, without interaction between them. Only after each side has been encoded into a higher-level representation is contrastive learning applied: if an image and a text are close in meaning, the two towers' outputs are pulled together; otherwise they are pushed apart. Because the images were pre-encoded in parallel into high-dimensional vectors and stored, retrieval only required encoding the query text, and matching results could be found among the stored vectors in under a second. Wenlan verified the feasibility of the "Twin Towers" approach in November 2020. Two months later, OpenAI released its CLIP architecture (the force behind DALL-E), built on the same idea.
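As an illustration of the dual-encoder idea (my own sketch in PyTorch, with toy stand-in encoders rather than BriVL's actual architecture or training code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Stand-in encoder: projects pre-extracted features into a shared embedding space."""
    def __init__(self, in_dim, emb_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim))

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)   # unit vectors, so dot product = cosine similarity

image_tower, text_tower = Tower(in_dim=2048), Tower(in_dim=768)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Matched (image, text) pairs lie on the diagonal of the similarity matrix:
    # pull them together, push mismatched pairs apart (InfoNCE in both directions).
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# One training step on a toy batch of aligned (image, text) feature pairs.
img_emb = image_tower(torch.randn(32, 2048))
txt_emb = text_tower(torch.randn(32, 768))
loss = contrastive_loss(img_emb, txt_emb)

# Retrieval: encode the whole image collection once and store the vectors...
gallery = F.normalize(torch.randn(100_000, 256), dim=-1)   # pretend: precomputed image embeddings
# ...then a text query needs only one encoder pass plus a dot product over the gallery.
query = text_tower(torch.randn(1, 768))
top5 = (gallery @ query.t()).squeeze(1).topk(5).indices    # indices of the 5 best-matching images
```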
Afterward, Lu Zhiwu told Leiphone that they do not see themselves as having "done research by following others": whether Chinese, multimodal, or trillion-parameter models, Wudao's three groups were all pioneering in uncharted territory.
To research multimodal LLMs, Lu Zhiwu devoted all his students to Wenlan, and the team went a full year without publishing any academic papers. In academia, this was an enormous risk for both teachers and students.
Similarly, because high-quality Chinese data was scarce, many of Liu Zhiyuan's and Huang Minlie's students were assigned to data annotation and cleaning while researching Chinese LLMs. For CPM 2.0, Wenyuan's raw data collection reached 50TB, which after cleaning still amounted to 2.6TB. Students invested enormous time and effort.
In short, BAAI's 100 Wudao members were going all in, "gambling their careers", and unexpectedly they won: after the release of "Wudao 2.0" in June 2021, BAAI Wudao became a prominent flag for Chinese LLMs, and Wudao members became the first pioneers of Chinese LLMs.
Chapter 4
In reality, 2021 came to be considered the "Year One of LLMs in China." After the release of Wudao 2.0, Baidu released its 10-billion-parameter model PLATO-X and 260-billion-parameter model ERNIE 3.0 Titan in September; in October, Alibaba's DAMO Academy released "M6", an LLM with up to 10 trillion parameters.
Despite the high cost of training LLMs, a group of dedicated followers emerged in 2021, and authoritative voices weighed in both domestically and internationally. Two weeks after the launch of Wudao 2.0, Google published a paper claiming that language models exhibit "Emergent Abilities" when scaled from tens to hundreds of billions of parameters. In August 2021, a review paper on "Foundation Models" authored by a hundred Stanford scholars, including Li Fei-Fei and Percy Liang, caused a significant international stir.
However, many Wudao team members knew that in 2021, a true domestically produced LLM with hundreds of billions of parameters had not yet appeared.
The underlying architectures of both the hundred-billion and trillion-parameter versions of Wudao 2.0 were sparse. The trillion-parameter model took up about 20TB of disk space and required over 500 A100 GPUs just for inference. After copying the model from Shandong back to Beijing, the Wudao team found it too expensive to operate and opened it to industry. Several companies copied the files but probably couldn't use them either.
Technically, the LLM also suffered from "catastrophic forgetting," particularly when image data was added, which significantly weakened its textual capabilities and made it even less effective than the 10-billion-parameter GLM-10B.
Compared to the technological breakthroughs, the LLMs' greater contribution was cultivating a generation of young talent who truly understood how to train LLMs. That is why, after the launch of Wudao 2.0, the team members were even more determined to build a genuine hundred-billion-parameter model.
By the end of 2021, Tang Jie proposed several objectives at an internal Wudao meeting: train a hundred-billion-parameter model, a text-to-video model, and a code generation model. But achieving these goals would require 1,000 GPUs running flawlessly for two months, with very high training costs.
Wudao 2.0 attracted a lot of attention, but computing resources were still insufficient. Tang Jie's team was invited to use the Ascend 910A machines at Pengcheng Laboratory, nearly 2,000 cards in all, which initially reached only about 18% of the A100's operator efficiency; after adaptation work, that was raised to about 40%.
During this period, Tang Jie's team tried adapting the various cards available on the market. They found they could not quickly converge a hundred-billion-parameter model on 2,000 910A cards, nor on tens of thousands of DCU cards running for two months. In the end, under the name of his startup Zhipu AI, Tang Jie rented 1,000 cards from the Jinan Supercomputing Center and committed a team of over 20 people to train for 8 months. Finally, in July 2022, they finished training the hundred-billion-parameter model GLM-130B.
Meanwhile, other teams built many unprecedented applications on top of Wudao. For example, Liu Zhiyuan's student Qin Yujia wrote a program that used a Chinese LLM to call Bing's search engine and answer questions on Zhihu, accumulating thousands of upvotes. Lu Zhiwu's team used a multimodal LLM to edit short videos, accumulating 1.5 million views on TikTok.
However, the Chinese market was not yet willing to pay for LLMs. After founding their LLM companies, the Wudao alumni went out to raise funds confidently, but not a single investor was willing to put up money.
All of Wudao's LLM achievements were open-sourced. But even after tens of millions of API calls following the release of Wenlan, many interested large enterprises were still unwilling to pay for usage.
In 2022, domestic awareness of LLMs was still generally lacking. Everyone knew that LLMs were strong, and everyone also knew that a "hit product" was needed to showcase their capabilities. But no one had a solution. Technically, they had become giants; in terms of products, they were still dwarfs.
That was until the appearance of ChatGPT.
Chapter 5
Song Ruihua joined Renmin University in September 2020 and began participating in the Wudao Wenlan research in October. Prior to this, she was the Chief Scientist at Microsoft Xiaoice, specializing in text generation and leading the "Xiaoice Writes Poetry" project.
After moving to the Xiaoice team in 2018, Song Ruihua began to take an interest in cognitive intelligence and wanted to explore how AI understands human language. That summer, she read a book by Benjamin Bergen, a cognitive science professor at the University of California, San Diego, titled "Louder Than Words: The New Science of How the Mind Makes Meaning," and found it inspiring.
The book points out that when humans read a good piece of writing, they often can't stop reading and imagining the scenes corresponding to the text. If the text is well crafted, these scenes come alive in the reader's mind. A key indicator of true understanding is therefore the ability to imagine a scene, or even to add content not present in the text.
Additionally, understanding language is not about using words to perform tasks, much like reading books is not about preparing for an exam the next day. However, in the past, scientists in the field of computing often evaluated whether AI understands human language by setting up specific, segmented tasks. For example, they would compare sports articles with financial articles to see if AI could distinguish between them.
Before ChatGPT, most of the technical staff researching AI dialogue in China came from the forum era. Their research ideas mainly originated from forum-style chats: thread-based conversations where A posts a topic and B and C reply underneath. Under this pattern, a model conducting open dialogue would expose its lack of knowledge, because the knowledge simply wasn't in these "pairs." One of Song Ruihua's colleagues found during a client visit that the AI was not good at beauty-related dialogue, because its outputs were mainly small talk.
Song Ruihua kept pondering the problem and realized that the issue was the lack of world knowledge in chat "pairs." She thought it would be great if all the text on the internet could be used. At Xiaoice, her idea was to use articles from WeChat public accounts, since those accounts consciously follow hot topics and analyze them from multiple angles.
However, she took one step too many, thinking too complexly: she believed the text should first be abstracted into a graph, which would then inform the dialogue. For example, if you input "Lu Han 鹿晗" (a Chinese male idol), a mailbox would appear in the graph as a clue for the AI, because Lu Han took a photo next to a mailbox on the Bund in Shanghai in 2016; the event became news, and fans would go to that mailbox to check in. But this method has drawbacks: the original sentences extracted from articles are sometimes too formal or carry extra information, making them unsuitable as replies.
When ChatGPT was launched by OpenAI, Song Ruihua had an epiphany and was both excited and shocked:
"Bingo! This is how it should be solved!"
As soon as ChatGPT came out, Song Ruihua tried it and was very surprised. Although both are dialogue bots, "Xiaoice and ChatGPT are like two different species." ChatGPT doesn't accumulate knowledge around a specific task; it learns the knowledge into the model first. Just as humans accumulate knowledge through daily reading, the more it reads, the more it accumulates; when a certain "prompt" arrives, it can call upon that accumulated knowledge and apply it in combination, rather than just reciting the original text.
Song Ruihua told Leiphone that she had observed that casual-chat dialogue bots lacked broad world knowledge, and she had also thought of using all the articles on the internet to make up for the deficiency, but she did not have the deep skill of Ilya Sutskever (the OpenAI Chief Scientist behind ChatGPT) to pull it off.
In Ilya's view, the abilities required by all language tasks can be reduced to a single ability: AI reasoning. And he believes that all reasoning can be accomplished by predicting the next word. For example, let an AI read a detective novel and absorb all the relationships and clues in it; then, in the novel's last sentence, the detective stands up and says to everyone: "The murderer is ____!" What the model fills in is a real test of its ability. Some models have strong logic and fill in the correct name; some fill in a wrong name but still show some logic; and some fill in something that is not even a name.
For Ilya, reasoning is measured by whether the accuracy of predicting the next word has improved. "Understanding" language is hard to define, but it can be replaced by "prediction": when an AI keeps learning to predict the next word better, it is already learning to understand and reason. That is why, when Ilya explains why GPT-4 is stronger than GPT-3.5, he emphasizes that "(GPT-4's) accuracy at predicting the next word has improved again." Scholars from Beijing Normal University, Cambridge, and Microsoft have also run GPT-3.5 and GPT-4 through IQ and psychological tests and found that GPT-4's level has improved significantly.
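To see what "predicting the next word" looks like mechanically, here is a small sketch that inspects a model's next-token probabilities (my own illustration using GPT-2 as a stand-in; it is not OpenAI's code, and GPT-2 will not solve the detective puzzle, it just makes the idea concrete):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The detective stood up and said to everyone: the murderer is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)
probs = logits[0, -1].softmax(dim=-1)        # distribution over the next token

# A stronger model should concentrate probability on a continuation consistent
# with everything it has read so far; here we just print the top guesses.
top = probs.topk(5)
for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.decode([idx])!r}: {p:.3f}")
```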
This was something the first generation of large-model researchers in China had not considered. Before this, scholars in China generally believed that humans excel at mathematical reasoning, so information should be symbolized and knowledge mathematized. Under this mindset, model architectures were often designed to be extremely complex, which limited their capabilities. ChatGPT, by contrast, embodies the aesthetic of "simplicity is best": a straightforward framework combined with a wealth of knowledge and a novel interactive form, which instantly brought the product's effectiveness to life.
The power of natural language was recognized for the first time. In a lecture at MIT in May this year, Geoffrey Hinton also pointed out that AI does not need information to be symbolized in order to gain knowledge from text, because humans also rely on language for reasoning. He gave an example that left a deep impression on Song Ruihua. Hinton asked ChatGPT, "We have some rooms in our house that are white, blue, and yellow. The yellow paint will fade to white within a year. If I want all my walls to be white in two years, what should I do?" ChatGPT replied, "You can paint the blue room yellow." Hinton was struck because, while ChatGPT may not have understood numbers, it seemed to understand what "fading" means.
Although some users have tested ChatGPT's capabilities with math questions, many early Wudao members believe ChatGPT has already cracked some of the hardest technical challenges in NLP, such as coherence and internal logic in long texts. In some professional scenarios its answers may not be satisfactory, "but these issues can be improved."
After the advent of ChatGPT, LLMs suddenly became popular, and previously overlooked LLM companies such as Zhipu, Mianbi, Lingxin, Zhizi, and Shenyan became the rising stars of China's capital markets. Zhizi Engine, which previously couldn't raise funds, was valued at 100 million RMB in its angel round after ChatGPT's release. Investors even asked Lu Zhiwu and his student, Zhizi Engine CEO Gao Yizhao, "Is 100 million enough?"
They had firmly believed that LLMs were a major part of AI's future; they just didn't expect the future to come so quickly.
However, once the glitz of capital is brushed aside, for scientists seeking to explore language intelligence the greater revelation of ChatGPT lies in its fundamental understanding of LLMs and its product imagination, both closely tied to the grand goal OpenAI aims to achieve: AGI (Artificial General Intelligence).
As a product, ChatGPT is almost perfect: it understands the user's intent and can answer a wide variety of questions, usually reasonably. It even demonstrates a degree of "knowledge" in most answers, turning Q&A into real productivity. This undoubtedly reflects the deep understanding of neural networks and of language that Ilya and his colleagues have. But what matters even more is that OpenAI made bold bets on the future.
Since its founding in late 2015, when everyone said AGI was a pipe dream, the OpenAI team dared to believe it was the future of AI; when everyone chose BERT, they firmly chose GPT. When BAAI's Wudao was exploring LLMs, it did not have such grand ambitions; even when Wen Jirong and others proposed researching multimodal LLMs, it was simply because "humans also learn this way," not because they were thinking in the direction of AGI.
After ChatGPT was released, the various LLM teams around Wudaokou quickly launched similar products thanks to their prior technical accumulation: Zhipu AI launched ChatGLM in less than two months; Zhizi Engine released ChatImg on March 8... But they know better than anyone that they are still far from truly delivering language intelligence, let alone AGI.
Everyone knows deeply that ChatGPT is an inspiration, but it is by no means the endpoint.
Chapter 6
Since releasing Wudao 2.0 in June 2021, BAAI has been thinking about the future of LLMs and how they can empower economic and social development. At the Wudao 2.0 launch, Huang Tiejun proposed that LLMs are "carriers of intelligence." In his vision, technology hardware and software form the base layer, AI applications sit on top, and LLMs act as the "trunk" in between; the significance of LLMs is to turn "intelligence" into a public utility, akin to water, electricity, and the internet. The concept of "Model as a Service" (MaaS) also supposedly originated from Wudao (a claim I doubt).
As Wudao reached version 2.0, BAAI's computing resources were becoming a bottleneck: only 480 A100 cards were available, not enough to support multiple teams, and a new purchase of 960 A100s was on the way but had not yet arrived. With limited resources, BAAI decided to focus on algorithmic innovation for LLMs, and all achievements from Wudao 1.0 and 2.0 were open-sourced to support collaborative innovation across academia and industry.
For an open-source project to succeed, it needs to unite a broad community of contributors while maintaining a stable core technical team. In addition to collaborating with academic scholars, BAAI started external recruitment to build its own large-model team. In January 2022, Lin Yonghua, former head of the IBM China Research Institute, joined BAAI as Chief Engineer. By June 2022, the LLM training platform "Jiuding" was released, reaching a total computing power of 1000P, and specialized large-model teams were gradually put in place.
In April 2023, Microsoft President Brad Smith named BAAI as one of the three organizations "at the absolute forefront" globally, alongside OpenAI and Google.
In June 2023, at the 5th BAAI Conference, "Wudao 3.0" was launched. It includes the "Wudao Aquila" series of language models and the "Wudao Vision" series of visual and multimodal models. Unlike its predecessors, Wudao 3.0 is not a single LLM but a comprehensive LLM technology system: it also includes the "FlagEval" LLM evaluation system and open platform, as well as the FlagOpen open-source LLM technology system, reflecting a more macroscopic vision of LLM development.
Additionally, Wudao 3.0 goes beyond BAAI itself: it represents the first-phase results of the new-generation AI flagship project "AI Foundation Model Support Platform and Evaluation Technology."
When Wudao 1.0 and 2.0 were launched in 2021, the expert group for the "New Generation AI Major Science and Technology Project" had already begun discussing how the state should support LLMs. BAAI's Wudao represented a bold exploration in this direction, but there was a problem of every entity acting on its own. The expert group therefore proposed an open mechanism to strengthen "organized scientific research" and guide large-model development from a "brute-force" arms race back onto a track of rational innovation.
The proposed mechanism was a "1+X+Y" system. "1" is the flagship project "AI Foundation Model Support Platform and Evaluation Technology," serving as the "aircraft carrier" leading the development of LLM technology and industry. "X" is a set of key technology projects supporting the core algorithms and technologies of LLMs, selected dynamically through a "horse-racing" mechanism. "Y" is a series of application demonstration projects aimed at significant application scenarios, using the technical systems built by the flagship project to drive the deep application of AI.
This proposal for an LLM flagship project received strong support from the Ministry of Science and Technology and other relevant departments, and was included in the 2022 guide for the national "Science and Technology Innovation 2030" new-generation AI major projects. After review, in December 2022 a total of "1+8," or nine projects, were approved and began implementation on January 1, 2023.
In Huang Tiejun's view, "Our country has been forward-looking in the direction of LLMs. A year and a half before ChatGPT came out, we had already deployed an 'aircraft carrier fleet' to focus on LLMs."
Another commendable trait of OpenAI is its excellent organizational ability. In retrospect, BAAI also managed to bring together a loosely connected group of AI researchers, but compared with OpenAI its cohesion was not strong enough. Having multiple teams working in different directions has its advantages, but the downside is obvious: a lack of focused effort on achieving one big thing.
Wudao 1.0 and 2.0 not only spawned the first batch of LLM companies in China but also shaped a group of post-90s AI master's and doctoral students: Yang Zhilin, Qi Fanchao, Zeng Guoyang, Gao Yizhao, Huo Yuqi, and others. More than 85% of the Wudao 1.0 and 2.0 team members were post-90s students. Having done the pioneering work on LLMs and then watched products like Midjourney and ChatGPT explode over the past year, they have many different thoughts about the commercial use of AI in the era of LLMs.
Many of them have grand ambitions to solve the problems of language intelligence and even AGI, and to turn AI into a new productive force for society. As the momentum of economic development begins to wane, using technology to strengthen the country has become a consensus. Whether it is visual AI, autonomous driving, or today's LLMs, they all reflect society's active desire over the past decade to build new productive forces.
Each era has its own dilemmas, and each era also needs its own salvation. Only by walking a different path can we construct new ways of survival, and the world will always be in the hands of young people.