How Meta trains large language models at scale #106

YeonwooSung · 2024-06-20T01:00:39Z

Meta Engineering blog post

Meta requires massive computational power to train large language models (LLMs).
Traditional AI model training involved training a large number of models, each needing a comparatively small number of GPUs.
With the advent of generative AI (GenAI), there are fewer training jobs, but each one is very large.

Challenges of training large-scale models

Hardware reliability: Rigorous testing and quality control are required to minimize training interruptions caused by hardware failures.
Fast recovery on failure: When hardware does fail, training must recover quickly. This requires low rescheduling overhead and fast training re-initialization.
Efficient preservation of training state: Training state must be saved and restored efficiently so that a failure does not lose much work.
Optimal connectivity between GPUs: Large-scale model training depends on data transfer between GPUs, which requires high-speed network infrastructure and efficient data transfer protocols.
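
To make the "save and recover training state" point concrete, here is a minimal sketch of checkpoint-and-resume. It is not Meta's implementation: the function names, the JSON format, and the toy `train` loop are all assumptions made for illustration. The one technique it shows is writing the checkpoint to a temporary file and renaming it into place, so a crash during the write never corrupts the last good checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write state to a temp file, then atomically rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, path)  # atomic: readers see old or new, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the saved state, or a copy of `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return dict(default)
    with open(path) as f:
        return json.load(f)


def train(total_steps: int, ckpt_path: str, every: int = 10, fail_at=None) -> dict:
    """Toy training loop (hypothetical) that resumes from the last checkpoint."""
    state = load_checkpoint(ckpt_path, {"step": 0, "loss_sum": 0.0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # stand-in for real work
        if state["step"] % every == 0:
            save_checkpoint(state, ckpt_path)
    save_checkpoint(state, ckpt_path)
    return state
```

If a run with `fail_at=17` and `every=10` crashes, the next call to `train` with the same path restarts from step 10 rather than step 0. Only the work done since the last checkpoint is lost.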

Improving all layers of the infrastructure stack is critical

Training software

Enable researchers to move quickly from research to production by using open-source software such as PyTorch.
Develop new algorithms and techniques for large-scale training, and integrate new software tools and frameworks.

Scheduling

Resources are allocated and dynamically scheduled according to each job's needs, using complex algorithms to optimize resource usage.
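
As a rough illustration of allocating GPUs "based on the needs of the job", the sketch below implements a very small priority scheduler with all-or-nothing allocation. The `Job` and `Cluster` classes are hypothetical and far simpler than any production scheduler; the post does not describe Meta's actual algorithms.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    gpus: int          # the job needs all of these at once
    priority: int = 0  # higher values are considered first


@dataclass
class Cluster:
    total_gpus: int
    running: dict = field(default_factory=dict)  # job name -> GPUs held
    pending: list = field(default_factory=list)  # jobs waiting for GPUs

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - sum(self.running.values())

    def submit(self, job: Job) -> None:
        self.pending.append(job)
        self.schedule()

    def finish(self, name: str) -> None:
        self.running.pop(name)  # release the job's GPUs
        self.schedule()

    def schedule(self) -> None:
        """Start every pending job that fits, highest priority first."""
        still_pending = []
        # sorted() is stable, so equal priorities keep submission order.
        for job in sorted(self.pending, key=lambda j: -j.priority):
            if job.gpus <= self.free_gpus:
                self.running[job.name] = job.gpus
            else:
                still_pending.append(job)
        self.pending = still_pending
```

Note the design trade-off: this sketch lets small jobs start while a large, higher-priority job is still waiting, which keeps GPUs busy but can starve the large job. A real scheduler would need reservations or preemption to handle that.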

Hardware

Large-scale model training requires high-performance hardware.
Meta optimized existing hardware: it modified the Grand Teton platform, built on NVIDIA H100 GPUs, by raising the GPU TDP to 700W and moving to HBM3.

Data Center Placement

GPUs and systems are placed in the data center so as to make the best use of resources such as power, cooling, and networking.
As many GPU racks as possible were deployed to maximize compute density.
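
The trade-off between power and density can be sketched with simple arithmetic. Only the 700W GPU TDP comes from the notes above; the hall power budget, servers per rack, GPUs per server, and the overhead factor for CPUs, fans, and networking are made-up illustrative values, not Meta's figures.

```python
def max_racks(hall_power_kw: float, servers_per_rack: int, gpus_per_server: int,
              gpu_tdp_w: float = 700.0, overhead_factor: float = 1.5) -> int:
    """Number of whole racks that fit in a power budget (illustrative model).

    overhead_factor is an assumed multiplier covering non-GPU power draw.
    """
    server_kw = gpus_per_server * gpu_tdp_w * overhead_factor / 1000.0
    rack_kw = servers_per_rack * server_kw
    return int(hall_power_kw // rack_kw)


# Hypothetical inputs: 1 MW hall, 2 servers per rack, 8 GPUs per server.
racks = max_racks(hall_power_kw=1000.0, servers_per_rack=2, gpus_per_server=8)
print(racks, "racks,", racks * 2 * 8, "GPUs")
```

In this toy model each rack draws 16.8 kW, so the budget, not floor space, caps the GPU count. This is one reason a higher TDP per GPU feeds directly into placement decisions.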

Reliability

Detection and recovery plans are in place to minimize downtime when hardware fails.
Common failure modes: GPUs not being detected by the host, DRAM and SRAM uncorrectable errors (UCEs), and faulty network cable hardware.
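
To see why these failures dominate at scale, it helps to estimate how often a job is interrupted. The sketch below assumes independent failures at a constant rate, so the mean time between failures (MTBF) of an N-GPU job is the per-GPU MTBF divided by N. It also uses Young's classic approximation for the checkpoint interval, sqrt(2 x checkpoint cost x MTBF). Neither formula is from the blog post, and the example numbers are hypothetical.

```python
import math


def job_mtbf_hours(gpu_mtbf_hours: float, num_gpus: int) -> float:
    """MTBF of a job that stops if any one of its GPUs fails.

    Assumes independent, exponentially distributed failures.
    """
    return gpu_mtbf_hours / num_gpus


def young_checkpoint_interval_hours(checkpoint_cost_hours: float,
                                    mtbf_hours: float) -> float:
    """Young's approximation for the checkpoint interval that minimizes lost time."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


# Hypothetical inputs: 50,000 h per-GPU MTBF, 16,000 GPUs, 3-minute checkpoints.
mtbf = job_mtbf_hours(50_000.0, 16_000)
interval = young_checkpoint_interval_hours(0.05, mtbf)
print(f"job MTBF: {mtbf:.3f} h, checkpoint every {interval:.3f} h")
```

With these assumed inputs, a component that fails once in several years per GPU still interrupts the whole job every few hours, which is why detection and fast recovery matter so much.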

Network

Large-scale model training requires high-speed network infrastructure and efficient data transfer protocols.
Meta built two network clusters, one on RoCE and one on InfiniBand, to learn from operating both.
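
A back-of-the-envelope model shows why link bandwidth matters for training. The sketch assumes gradients are synchronized with a ring all-reduce, in which each GPU sends 2(N-1)/N times the payload size. It ignores latency and overlap with compute. The choice of ring all-reduce and the example link speed are assumptions for illustration, not details from the post.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce (reduce-scatter + all-gather)."""
    if num_gpus < 2:
        return 0.0
    return 2.0 * (num_gpus - 1) / num_gpus * payload_bytes


def ring_allreduce_seconds(payload_bytes: float, num_gpus: int,
                           link_gbps: float) -> float:
    """Bandwidth-only time estimate; link_gbps is per-GPU link speed in Gbit/s."""
    bytes_per_second = link_gbps * 1e9 / 8.0
    return ring_allreduce_bytes_per_gpu(payload_bytes, num_gpus) / bytes_per_second


# Hypothetical inputs: 1 GB of gradients, 8 GPUs, 400 Gbit/s links.
t = ring_allreduce_seconds(1e9, 8, 400.0)
print(f"estimated all-reduce time: {t * 1000:.1f} ms")
```

Because the per-GPU traffic approaches twice the payload as N grows, this cost is paid on every synchronization step regardless of cluster size, so halving the link bandwidth roughly doubles it.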

Storage

Meta invested in high-capacity, high-speed storage technologies for large-scale data storage and developed new storage solutions for specific workloads.

Looking ahead

We will use hundreds of thousands of GPUs, handle even larger volumes of data, and deal with longer distances and latencies.
We plan to adopt new hardware technologies and GPU architectures and to evolve our infrastructure accordingly.
We will keep exploring the evolving AI landscape and strive to push the boundaries of what is possible.

YeonwooSung · 2024-06-20T01:02:11Z

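Challenges of training large-scale models

Hardware reliability: Rigorous testing and quality control are required to minimize training interruptions caused by hardware failures.
Fast recovery on failure: When hardware fails, training must recover quickly. This requires lower rescheduling overhead and fast training re-initialization.
Efficient preservation of training state: Training state must be saved and restored efficiently when a failure occurs.
Optimal connectivity between GPUs: Data transfer between GPUs is critical for large-scale model training, which requires high-speed network infrastructure and efficient data transfer protocols.

Improving every layer of the infrastructure stack is important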

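Training software

Supports researchers in moving quickly from research to production using open-source software such as PyTorch.
Develops new algorithms and techniques for large-scale training and integrates new software tools and frameworks.

Scheduling

Allocates and dynamically schedules resources according to each job's needs, using complex algorithms to optimize resource usage.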

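Hardware

High-performance hardware is required to handle large-scale model training.
Existing hardware was optimized: the Grand Teton platform, built on NVIDIA H100 GPUs, was modified to raise the GPU TDP to 700W and move to HBM3.

Data center placement

GPUs and systems are placed optimally in the data center to make the best use of resources (power, cooling, networking, etc.).
As many GPU racks as possible were deployed for maximum compute density.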

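Reliability

Detection and recovery plans were established to minimize downtime when hardware fails.
Frequent failure modes: GPU not recognized, DRAM and SRAM UCE (uncorrectable errors), and hardware network cable problems.

Network

Large-scale model training requires high-speed network infrastructure and efficient data transfer protocols.
Two network clusters, one on RoCE and one on InfiniBand, were built in order to learn from operational experience.

Storage

Invested in high-capacity, high-speed storage technologies for large-scale data storage and developed new data storage solutions tailored to specific workloads.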

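Looking ahead

Plans to use hundreds of thousands of GPUs to process more data and to deal with longer distances and latencies.
Plans to adopt new hardware technologies and GPU architectures and to evolve the infrastructure.
Will explore the evolving AI landscape and strive to push beyond current limits.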
