AI Factory

 

Why is it called "AI Factory"?

The term was popularized by NVIDIA's Jensen Huang. The analogy is:

  Traditional Factory:          AI Factory:

  Raw materials                 Raw Data
       ↓                             ↓
  Assembly line                 GPU Compute
       ↓                             ↓
  Finished product              Trained AI Model / Tokens

Just like a factory mass-produces physical goods, an AI Factory mass-produces intelligence — tokens, embeddings, model outputs.

Key characteristics that make it a "factory":

Factory concept      AI Factory equivalent
Assembly line        GPU pipeline (data → training → inference)
Raw material         Data (text, images, video)
Machines             GPUs (H100, H200, B200)
Factory floor        Data center / GPU cluster
Output               Trained models, tokens, predictions
Throughput metric    Tokens/sec, FLOPS
Uptime = revenue     GPU utilization = revenue

The entire infrastructure (networking, power, cooling, storage) is engineered around one goal: keeping the GPUs busy as close to 100% of the time as possible.
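To make the "GPU utilization = revenue" row concrete, here is a minimal Python sketch of the arithmetic. The cluster size, the per-GPU peak rate, and the function name are hypothetical values picked for easy math, not measurements of any real system.

```python
def effective_tokens_per_sec(num_gpus, peak_tokens_per_sec_per_gpu, utilization):
    """Cluster-wide token throughput when GPUs are busy only part of the time."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return num_gpus * peak_tokens_per_sec_per_gpu * utilization


# Hypothetical figures, chosen only to keep the arithmetic easy to follow.
NUM_GPUS = 1024
PEAK_PER_GPU = 1000.0  # tokens/sec per GPU (assumed)

full = effective_tokens_per_sec(NUM_GPUS, PEAK_PER_GPU, 1.0)
stalled = effective_tokens_per_sec(NUM_GPUS, PEAK_PER_GPU, 0.75)
lost = full - stalled

print(f"fully busy : {full:,.0f} tokens/sec")
print(f"75% busy   : {stalled:,.0f} tokens/sec")
print(f"lost output: {lost:,.0f} tokens/sec")
```

Under these assumed numbers, GPUs that wait on the network or storage a quarter of the time cost the factory a quarter of its output, which is why the surrounding infrastructure is sized around GPU utilization.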


What does "Rail-Optimized" mean?

A rail is a dedicated network path: the NIC in a given position on every GPU server (for example, NIC0 on each server) connects to the same ToR switch, so each NIC position gets its own switch.

Non-Rail (Traditional) topology:

Server has 1 uplink → ToR
All GPU traffic shares the same path

     Server
       │
      NIC (single uplink)
       │
      ToR
  • Simple, but a bottleneck: all 8 GPUs compete for one link
  • Poor ECMP for RoCE (a few large flows tend to hash onto the same path)
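The ECMP point can be illustrated with a small Python sketch. It assumes the common conceptual model in which a switch hashes a flow's 5-tuple and takes the result modulo the number of equal-cost paths. CRC32 is only a stand-in here, since real switches use vendor-specific hash functions, and the IP addresses, source ports, and path count are hypothetical.

```python
import zlib
from collections import Counter

ROCE_V2_UDP_PORT = 4791  # destination UDP port used by RoCEv2


def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    """Pick an equal-cost path by hashing the 5-tuple (CRC32 as a stand-in hash)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % num_paths


def spread(flows, num_paths):
    """Return how many flows land on each path index."""
    counts = Counter(ecmp_path(*flow, num_paths) for flow in flows)
    return [counts.get(p, 0) for p in range(num_paths)]


# Hypothetical: 8 GPUs on one server each open one long-lived flow to a peer.
flows = [
    ("10.0.0.1", "10.0.1.1", 49152 + gpu, ROCE_V2_UDP_PORT, 17)  # 17 = UDP
    for gpu in range(8)
]
loads = spread(flows, num_paths=4)
print("flows per path:", loads)
```

Because there are only a handful of large flows, nothing in the hash guarantees an even spread: one path can end up carrying several flows while another carries none, and with long-lived flows that imbalance persists.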

Rail-Optimized topology:

GPU Server (8x GPUs)

GPU0 ── NIC0 ──────────────── Rail-ToR-0
GPU1 ── NIC1 ──────────────── Rail-ToR-1
GPU2 ── NIC2 ──────────────── Rail-ToR-2
GPU3 ── NIC3 ──────────────── Rail-ToR-3
GPU4 ── NIC4 ──────────────── Rail-ToR-4
GPU5 ── NIC5 ──────────────── Rail-ToR-5
GPU6 ── NIC6 ──────────────── Rail-ToR-6
GPU7 ── NIC7 ──────────────── Rail-ToR-7

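Each GPU has its own dedicated NIC, and each NIC connects to its own ToR switch (rail), so GPUs in the same server never share an uplink. The same-numbered GPU on every other server connects to that same rail switch.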
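The wiring in the diagram above can be sketched in a few lines of Python. This is an illustrative model only: the function names and the four-server cluster are assumptions, and the switch names simply follow the Rail-ToR-N labels from the diagram.

```python
GPUS_PER_SERVER = 8


def rail_switch(gpu_index):
    """Name of the ToR switch (rail) that serves a given GPU/NIC index."""
    if not 0 <= gpu_index < GPUS_PER_SERVER:
        raise ValueError("gpu_index out of range")
    return f"Rail-ToR-{gpu_index}"


def build_rails(num_servers):
    """Map each rail switch to the (server, gpu) endpoints cabled to it."""
    rails = {rail_switch(g): [] for g in range(GPUS_PER_SERVER)}
    for server in range(num_servers):
        for gpu in range(GPUS_PER_SERVER):
            rails[rail_switch(gpu)].append((server, gpu))
    return rails


rails = build_rails(num_servers=4)  # hypothetical 4-server cluster
print(rails["Rail-ToR-3"])  # [(0, 3), (1, 3), (2, 3), (3, 3)]
```

Each rail switch ends up with exactly one port per server, and GPU N on one server reaches GPU N on any other server through a single switch.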


