Insights ¦ EVERY FLOP COUNTS: SCALING A 300B MIXTURE-OF-EXPERTS LLM WITHOUT PREMIUM GPUS

Published by: Ling Team, AI@Ant Group

Key Takeaways

  1. The development of Ling’s series of large language models demonstrates that state-of-the-art 300B MoE models can be effectively trained on lower-performance hardware, reducing costs significantly without sacrificing performance.

  2. Utilising heterogeneous computing infrastructure and optimisation techniques, the models achieved approximately 20% cost savings during training, making large-scale AI deployment more accessible for organisations with constrained budgets.

  3. The models, Ling-Lite (16.8B parameters) and Ling-Plus (290B parameters), exhibit performance comparable to industry benchmarks, highlighting scalable solutions suited for resource-limited environments.

  4. Innovative engineering strategies—including model architecture refinement, training anomaly handling, and efficient data and evaluation pipelines—are critical for stabilising training and improving model robustness.

  5. Emphasis on optimisation of model architecture, training frameworks, and storage solutions collectively enable effective large-model training on diverse hardware, enhancing resource efficiency and flexibility.

  6. The adoption of a systematic approach to data quality — including high-quality data curation, deduplication, and specialised data selection — underpins the model’s high performance in multilingual, knowledge-intensive, and reasoning tasks.

  7. Technical advancements such as multi-criteria evaluation and adaptive benchmarking improve training stability and evaluation reliability, especially in resource-constrained settings.

  8. The models excel in tool utilisation, demonstrating superior capability in handling complex real-world scenarios through extensive data synthesis and strategic tool integration.

  9. The research outlines robustness in multi-cluster cross-compatibility, highlighting solutions for heterogeneous infrastructure, data synchronization, and storage that optimise large-scale distributed training efficiency.

  10. Offline inference infrastructure ‘Flood’ significantly boosts throughput for long-context tasks, enabling better handling of extended sequences up to 16K tokens, relevant for sophisticated financial document processing.

  11. Technical solutions for training stability, including loss spike mitigation, expert load balancing, and platform alignment, are vital for reliable large-model deployment across diverse environments.

  12. The models’ safety profile shows effective balance, with Ling-Plus outperforming benchmarks in safety and refusal metrics, reinforcing the importance of responsible AI in high-stakes applications like finance.


Key Statistics

  • Ling-Lite contains 16.8 billion parameters with 2.75 billion activated; Ling-Plus contains 290 billion parameters with 28.8 billion activated.

  • Cost savings of approximately 20% achieved by training on lower-spec hardware, amounting to roughly 1.27 million RMB per trillion tokens trained.

  • High-quality pre-training dataset of approximately 9 trillion tokens, including multilingual (English, Chinese) and code data.

  • Training for 9 trillion tokens across multiple hardware configurations; the cost for 1 trillion tokens on high-performance hardware was around 6.35 million RMB, reduced to 5.08 million RMB on lower-performance hardware.

  • Performance achievements include top-tier results on benchmarks such as MMLU, GSM8K, and CMMLU, with specific scores like 82.33 on MMLU and 83.54 on HumanEval.

  • Infrastructure improvements include a storage system (PCache) enabling up to 8TB/s throughput across large clusters, reducing I/O bottlenecks.

  • Evaluation results show Ling models outperform comparable open-source models in key benchmarks, including safety scores (average 93.56%) and tool use accuracy.

  • The inference framework Flood achieves a speedup of up to 2.4 times over existing systems, supporting long sequence processing efficiently.

  • Cross-platform initiatives have allowed training consistency across various hardware setups, ensuring stable convergence and robust deployment.

  • Regular evaluation and mitigation strategies for technical issues such as loss spikes and expert imbalance have maintained training stability.

  • Safety assessment indicates Ling-Plus outperforms peers, with an average safety score of 89.50 and a score of 96.09 on refusal metrics (i.e. it rarely refuses benign requests).
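The cost and activation figures above can be sanity-checked with a few lines of arithmetic. This is a sketch using only the numbers quoted in this summary, not figures from the underlying paper itself:

```python
# Sanity-check the cost and MoE activation figures quoted in Key Statistics.

# Cost per 1 trillion training tokens (million RMB), as stated above.
cost_high_end = 6.35   # high-performance hardware
cost_low_end = 5.08    # lower-performance hardware

saving = cost_high_end - cost_low_end        # absolute saving per trillion tokens
saving_pct = saving / cost_high_end * 100    # relative saving

print(f"Saving: {saving:.2f}M RMB per trillion tokens ({saving_pct:.0f}%)")
# → Saving: 1.27M RMB per trillion tokens (20%)

# MoE models activate only a fraction of their parameters per token,
# which is where the efficiency comes from.
for name, total_b, active_b in [("Ling-Lite", 16.8, 2.75),
                                ("Ling-Plus", 290.0, 28.8)]:
    print(f"{name}: {active_b / total_b:.0%} of parameters active per token")
# → Ling-Lite: 16% of parameters active per token
# → Ling-Plus: 10% of parameters active per token
```

The ~20% saving quoted in the takeaways and the 1.27 million RMB figure are thus mutually consistent with the per-trillion-token costs.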


Key Discussion Points

  1. Large-scale MoE models can be cost-effectively trained on less specialised hardware, dramatically lowering barriers to entry for resource-constrained organisations.

  2. The critical role of technical optimisation—covering architecture, data, evaluation, and storage—in enabling scalable and stable training processes.

  3. Innovation in data quality control, including deduplication and high-quality curation, is essential for achieving robust multilingual and reasoning capabilities.

  4. The importance of asynchronous and heterogeneous training frameworks to facilitate compatibility across diverse computing environments.

  5. Significant efficiency gains are realised through new training algorithms such as EDiT, which reduce communication overhead and enhance scaling.

  6. Infrastructure solutions like PCache and Babel enhance distributed data management and synchronization, crucial for large models and datasets.

  7. The implementation of offline inference frameworks such as Flood improves long-sequence handling, with applications in complex document analysis in finance.

  8. Addressing training stability issues—such as loss spikes and expert load imbalance—is vital for dependable deployment of ultra-large models.

  9. The models’ ability to perform advanced tool utilisation and comprehension tasks demonstrates potential for deployment in complex, real-world financial services applications.

  10. Systematic evaluation improvements ensure consistent performance measurement, guiding data tuning and computational resource management.

  11. Safety protocols and responsible deployment metrics indicate advanced risk mitigation, aligning AI development with compliance standards.

  12. The research exemplifies how open-source collaboration and technical innovation enable responsible scaling and accessible deployment of large language models in financial sectors.
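Point 8 above refers to expert load imbalance, a standard failure mode in MoE training where the router sends most tokens to a few experts. The common mitigation (used across many MoE systems, not necessarily in exactly this form by the Ling team) is an auxiliary load-balancing loss on the router. A minimal sketch, with hypothetical function names:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route_with_balance_loss(logits, k=2):
    """Top-k expert routing with a Switch-Transformer-style auxiliary loss.

    logits: (tokens, experts) router scores. Returns per-token top-k expert
    indices and an auxiliary loss that is minimised (== 1.0) when tokens
    are spread evenly across experts.
    """
    n_tokens, n_experts = logits.shape
    probs = softmax(logits)                      # router probabilities
    topk = np.argsort(-probs, axis=-1)[:, :k]    # k chosen experts per token

    # f_i: fraction of tokens whose top-1 choice is expert i
    counts = np.bincount(topk[:, 0], minlength=n_experts)
    f = counts / n_tokens
    # p_i: mean router probability mass on expert i
    p = probs.mean(axis=0)
    aux_loss = n_experts * float(np.dot(f, p))
    return topk, aux_loss
```

With a perfectly balanced router the auxiliary loss equals 1; skewed routing pushes it above 1, so adding it (scaled by a small coefficient) to the training loss discourages expert collapse. This is one sketch of the general technique, not the Ling implementation.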


Document Description

This article provides a comprehensive overview of the development, optimisation, and deployment of a large-scale 300-billion-parameter Mixture of Experts language model series—Ling—focusing on cost-efficiency and resource adaptability. It explores innovative architectural strategies, infrastructure enhancements, data quality measures, and evaluation frameworks designed to facilitate training on lower-performance hardware across heterogeneous environments. The article also highlights performance benchmarks, safety assessments, and deployment techniques, demonstrating practical applications relevant for financial services seeking scalable, responsible AI solutions.

