👨🏻‍🏫 Generative AI Security Standards, LLM’s 200K Context Window, Alibaba’s Open-Source Obsession, and Baidu World 2023
Weekly China AI News from October 2 to October 15
Hi, my beloved readers! The past two weeks were so jam-packed with AI updates that I could hardly keep up. In this issue, I briefly discuss China’s first draft guideline focused on the security of generative AI; a Chinese startup whose LLM leaves GPT-4 in the dust when it comes to context window length; and Alibaba and BAAI, who aren’t slowing down on their open-source LLMs. Keep an eye out for Baidu’s upcoming tech festival, where the company will reveal its next-gen ERNIE model. Also, get this: a Chinese food company just bought over 2,000 NVIDIA H800 GPUs. Mind blown!
China Releases Draft Guideline on Generative AI Security Requirements
What’s new: China’s standards body, the National Information Security Standardization Technical Committee, has unveiled a draft guideline recommending security standards for generative AI services. Released on October 11, the document is titled Basic Requirements for Generative Artificial Intelligence Services. Under the guideline, generative AI service providers must blacklist any data corpus containing more than 5% illegal or unhealthy information. Additionally, AI models should not be trained on data containing copyrighted materials or commercial secrets.
Why it matters: This is the first standards guideline in China aimed specifically at the security of generative AI services. Although it is not legally binding, the committee’s influence suggests that its recommended practices will likely be folded into future regulations.
More details: The draft sets out security requirements that generative AI service providers should meet, covering dataset security, model security, safety measures, and security assessments. Providers must pass a security assessment before filing an application to launch a service.
Dataset Security: The draft proposes a blacklist of corpus sources that are prohibited for training. Any source in which more than 5% of the content is illegal or undesirable should be blacklisted. Data explicitly marked by others as not to be collected must not be used; such markings include, but are not limited to, robots exclusion protocols (robots.txt), which instruct automated web crawlers. (See the sketch after this list.)
Personal Data Protection: When using a corpus that contains personal information, providers should obtain the consent of the individuals concerned. This requirement extends to sensitive personal data and biometric information.
Intellectual Property Rights: Detailed advice is given for avoiding intellectual property infringements.
Model Security: Foundation models used for development should be approved by relevant authorities.
Monitoring and Quality Control: A monitoring team, sized in proportion to the scale of the service, should be set up to ensure the quality of generated content.
Security Assessment Criteria: A set of evaluation metrics is laid out, including manual inspections that require at least a 90% qualification rate for generated content. Both this threshold and the 5% blacklist rule are sketched below.
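To make the draft’s two quantitative thresholds concrete, here is a minimal Python sketch. The function names, the sampling scheme, and the numbers in the examples are my own illustration; the guideline itself only specifies the 5% and 90% ratios.

```python
# A minimal sketch of the draft's two quantitative thresholds.
# Function names and sampling scheme are illustrative, not from the guideline.

BLACKLIST_THRESHOLD = 0.05      # >5% illegal/undesirable content -> blacklist the source
QUALIFICATION_THRESHOLD = 0.90  # manual inspection must pass >=90% of sampled outputs

def should_blacklist(flagged: int, sampled: int) -> bool:
    """A corpus source is blacklisted if more than 5% of its sampled content is flagged."""
    return flagged / sampled > BLACKLIST_THRESHOLD

def passes_manual_inspection(qualified: int, inspected: int) -> bool:
    """Generated content passes if at least 90% of manually inspected samples qualify."""
    return qualified / inspected >= QUALIFICATION_THRESHOLD

# A source with 60 flagged items out of 1,000 sampled (6%) gets blacklisted.
assert should_blacklist(60, 1000)
# 92 qualified outputs out of 100 inspected (92%) clears the 90% bar.
assert passes_manual_inspection(92, 100)
```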
New Chinese LLM Can Take 200K Words as Inputs
What’s new: Moonshot AI (月之暗面), a startup established just six months ago, has announced a breakthrough in long-context language models with its new product, Kimi Chat, which accepts up to 200,000 Chinese characters as input. The company claims this is the longest context window of any commercially available LLM.
Who is Moonshot AI? Moonshot AI was one of the five companies The Information listed in June as most likely to become China’s OpenAI. Founder Yang Zhilin, a graduate of Tsinghua University and Carnegie Mellon University, is the first author of two influential LLM papers, Transformer-XL and XLNet, which together have been cited nearly 20,000 times on Google Scholar.
Why context window matters: The context window is the range of tokens an LLM can attend to when generating a response to a prompt. A longer window lets the model work with long documents, such as novels and legal papers, more effectively. Context length has consequently become an increasingly important research theme, according to the State of AI Report.
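As a toy illustration of the mechanic (not Kimi Chat’s actual tokenizer or architecture, which Moonshot AI has not disclosed), anything beyond the window is simply invisible to the model:

```python
# A toy illustration of a context window: tokens outside the window are
# never seen by the model. The flat character list and the window size
# are placeholders, not Kimi Chat's actual pipeline.

def truncate_to_window(tokens: list[str], window_size: int) -> list[str]:
    """Keep only the most recent window_size tokens; older context is dropped."""
    return tokens[-window_size:]

document = ["字"] * 300_000                      # e.g., a very long novel
visible = truncate_to_window(document, 200_000)  # a 200K-character window
print(len(document) - len(visible))              # 100000 characters the model never sees
```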
Kimi Chat supports around 200,000 Chinese characters of context, which is 2.5 times the capacity of Anthropic’s Claude-100k (about 80,000 Chinese words) and eight times that of OpenAI’s GPT-4-32k (about 25,000 Chinese words). According to Moonshot AI, its techniques avoid common pitfalls in long-context model development: “goldfish” models that quickly forget earlier context, “bee” models that attend only to localized regions, and “tadpole” models that sacrifice capability to handle longer inputs.
My two cents: The capability to process long text inputs opens up new possibilities for real-world applications. The question, however, is whether the model can live up to expectations. Research by Samaya.ai, UC Berkeley, Stanford, and LMSYS.org found that even the most advanced LLMs can fail at multi-document question answering and key-value retrieval over long inputs, especially when the relevant information sits in the middle of the context.
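That key-value retrieval probe is easy to approximate in a few lines of Python; query_llm below is a hypothetical stand-in for whatever chat API you want to stress-test:

```python
# A rough sketch of the key-value retrieval probe from that line of
# research: bury one target pair at a chosen depth in a long synthetic
# JSON context and check whether the model retrieves the right value.
import json
import uuid

def build_kv_prompt(num_pairs: int, target_position: int):
    """Return (prompt, target_key, expected_value) for one retrieval trial."""
    pairs = {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(num_pairs)}
    target_key = list(pairs)[target_position]  # depth at which the answer is buried
    prompt = (
        "Extract the value for the given key from the JSON object below.\n"
        f"JSON: {json.dumps(pairs)}\n"
        f"Key: {target_key}\nValue:"
    )
    return prompt, target_key, pairs[target_key]

prompt, key, expected = build_kv_prompt(num_pairs=3000, target_position=1500)
# answer = query_llm(prompt)            # hypothetical model call
# print(answer.strip() == expected)     # middle positions often fail
```

Sweeping target_position from the start to the end of the context is what exposes the “lost in the middle” effect.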
What’s next: Unlike Baichuan and Zhipu AI, which are currently focusing their operations on the enterprise market, Moonshot AI is targeting the consumer sector with the goal of developing a killer app.
Alibaba Cloud, BAAI Fuel Open-Source LLM Race with Qwen-14B and Aquila2
What’s new: Alibaba Cloud and the Beijing Academy of Artificial Intelligence (BAAI) have each made strides in the open-source LLM arena. Alibaba Cloud has open-sourced its 14-billion-parameter model, Qwen-14B, along with a conversational variant, Qwen-14B-Chat. This marks Alibaba Cloud’s third open-source large language model in just over a month, following Qwen-7B and Qwen-VL.
Meanwhile, BAAI has gone all out on open-sourcing, offering a suite of updated models and platforms: the Aquila2 model series, a new version of the semantic vector model BGE, FlagScale for efficient parallel training, and FlagAttention, a set of high-performance attention operators.
How it works: Alibaba’s models are available on its ModelScope community and on platforms like Hugging Face, accompanied by detailed technical reports explaining the training process. Qwen-14B was trained on a dataset of 3 trillion tokens, while Qwen-7B used 2.4 trillion.
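If you want to try Qwen-14B-Chat yourself, here is a minimal sketch using Hugging Face transformers. The repo id and the custom chat() helper reflect how Qwen repositories are typically used with trust_remote_code; treat the exact interface as an assumption and check the model card before running.

```python
# A minimal sketch of loading Qwen-14B-Chat from the Hugging Face Hub.
# The chat() helper comes from the repo's custom code loaded via
# trust_remote_code; verify the interface on the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-14B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-14B-Chat", device_map="auto", trust_remote_code=True
).eval()

response, history = model.chat(tokenizer, "Give me a short introduction to Alibaba Cloud.", history=None)
print(response)
```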
On the BAAI side, the new Aquila2 series includes base models with 34 billion and 7 billion parameters, chat models, and text-to-SQL models. BAAI says Aquila2 outperforms models of similar size on a series of benchmark datasets.
Why it matters: Open-source LLMs like Alibaba Cloud’s Qwen-14B and BAAI’s Aquila2 provide a viable and flexible alternative to closed-source counterparts, especially for enterprise clients who prioritize privacy, budget sensitivity, and the need for customization.
My two cents: While Alibaba Cloud’s effort to open-source Qwen is commendable, it’s worth noting that we haven’t heard much lately about its closed-source model and consumer product, Tongyi Qianwen (Qwen). Open-sourcing Qwen is undoubtedly a boon for developers, but it has yet to significantly narrow the gap between Chinese companies and Western tech giants like OpenAI on the journey toward AGI.
What to Expect at Baidu World 2023
What’s new: Baidu is set to host its annual flagship event, Baidu World 2023, in Beijing on October 17th. The company is expected to unveil multiple AI-native applications, present the latest advances in foundation models, and offer valuable insights on seizing new opportunities through AI-native thinking.
ERNIE 4.0: Baidu is expected to unveil its next-gen foundation model, ERNIE 4.0, according to multiple media reports. This successor to ERNIE 3.5, the model currently powering ERNIE Bot, is promoted as a transformative model with a substantially larger parameter count. In addition to ERNIE 4.0, Baidu plans to showcase more than 20 AI-native applications, including revamped versions of Baidu Search, Baidu Drive, and Baidu Maps, as well as new applications such as Generative Business Intelligence.
Live Streaming Details: Catch the event live on YouTube or Twitter. The broadcast starts at 10:00 a.m. Beijing Time on October 17th (equivalent to October 16th, 10:00 p.m. New York Time and 7:00 p.m. PDT). Don’t miss out!
Weekly News Roundup
😂 Lianhua Health, a Chinese maker of monosodium glutamate (MSG), a flavor enhancer common in Chinese households, has made the surprising purchase of 330 NVIDIA H800 GPU servers. At RMB2.1 million per server, the contract totals RMB693 million.
🌦️ Huawei Cloud collaborates with Shenzhen Meteorological Bureau to develop an AI-driven regional weather forecasting model. This innovative step aims to enhance the accuracy and computational speed of weather forecasts, especially for extreme weather events.
💸 Business magnate Li Ka-shing leads a $97 million Series B funding round for edge AI computing company Kneron. The funds will accelerate AI advancements, particularly lightweight GPT solutions for the automotive sector.
🗣️ NetEase Youdao announces the launch of the world’s first virtual speaking coach app, Hi Echo, powered by the Youdao “Zi Yue” education-specific LLM. Available as a standalone app and WeChat Mini Program, Hi Echo supports 8 conversational scenarios and 68 topics, offering grammar corrections and style refinement.
📊 Six departments, including the Ministry of Industry and Information Technology and the Central Cyberspace Affairs Commission, have released a “High-Quality Development Action Plan for Computational Infrastructure”. The plan sets quantitative development goals for 2025, including a total computing scale surpassing 300 EFLOPS, with AI computing making up 35% of it (roughly 105 EFLOPS).
🚫 According to Reuters, the Biden administration is contemplating measures to prevent Chinese companies from accessing U.S. AI chips through overseas cloud units.
Trending Research
Edge learning using a fully integrated neuro-inspired memristor chip: Researchers from Tsinghua University developed STELLAR, a memristor-based computing architecture, and integrated it into a full-system chip, achieving efficient on-chip learning for tasks like motion control, image classification, and speech recognition while maintaining software-comparable accuracy at lower hardware cost.
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning: The paper presents a method for fine-tuning open-source language models to enhance their mathematical reasoning, built on a new dataset called MathCodeInstruct. The resulting MathCoder models outperform existing solutions, including GPT-4, on complex math problems from the MATH and GSM8K datasets.
Octopus: Embodied Vision-Language Programmer from Environmental Feedback: The paper introduces Octopus, a novel vision-language model that, when embodied in an agent, can interpret visual and textual data to execute complex tasks, ranging from daily chores to intricate video game interactions.