Fabric Manager for NVIDIA NVSwitch Systems
User Guide / Virtualization / High Availability Modes
DU-09883-001_v0.7 | January 2021
Document History

Version  Date           Authors  Description of Change
0.1      Oct 25, 2019   SB       Initial Beta Release
0.2      Mar 23, 2020   SB       Updated error handling and bare metal mode
0.3      May 11, 2020   YL       Updated Shared NVSwitch APIs section with new API information
0.4      July 7, 2020   SB       Updated MIG interoperability and high availability details
0.5      July 17, 2020  SB       Updated running as non-root instructions
0.6      Aug 03, 2020   SB       Updated installation instructions based on CUDA repo and updated SXid error details
0.7      Jan 26, 2021   GT, CC   Updated with vGPU multitenancy virtualization mode
Table of Contents

Chapter 1. Overview
  1.1 Introduction
  1.2 Terminology
  1.3 NVSwitch Core Software Stack
  1.4 What is Fabric Manager?

Chapter 2. Getting Started With Fabric Manager
  2.1 Basic Components
  2.2 NVSwitch and NVLink Initialization
  2.3 Supported Platforms
  2.4 Supported Deployment Models
  2.5 Other NVIDIA Software Packages
  2.6 Installation
  2.7 Managing Fabric Manager Service
  2.8 Fabric Manager Startup Options
  2.9 Fabric Manager Service File
  2.10 Running Fabric Manager as Non-Root
  2.11 Fabric Manager Config Options

Chapter 3. Bare Metal Mode
  3.1 Introduction
  3.2 Fabric Manager Installation
  3.3 Runtime NVSwitch and GPU errors
  3.4 Interoperability With MIG

Chapter 4. Virtualization Models
  4.1 Introduction
  4.2 Supported Virtualization Models

Chapter 5. Fabric Manager SDK
  5.1 Data Structures
  5.2 Initializing FM API interface
  5.3 Shutting Down FM API interface
  5.4 Connect to Running FM Instance
  5.5 Disconnect from Running FM Instance
  5.6 Getting Supported Partitions
  5.7 Activate a GPU Partition
  5.8 Activate a GPU Partition with VFs
  5.9 Deactivate a GPU Partition
  5.10 Set Activated Partition List after Fabric Manager Restart
  5.11 Get NVLink Failed Devices
  5.12 Get Unsupported Partitions

Chapter 6. Full Passthrough Virtualization Model
  6.1 Supported VM Configurations
  6.2 Virtual Machines with 16 GPUs
  6.3 Virtual Machines with 8 GPUs
  6.4 Virtual Machines with 4 GPUs
  6.5 Virtual Machines with 2 GPUs
  6.6 Virtual Machine with 1 GPU
  6.7 Other Requirements
  6.8 Hypervisor Sequences
  6.9 Monitoring Errors
  6.10 Limitations

Chapter 7. Shared NVSwitch Virtualization Model
  7.1 Software Stack
  7.2 Preparing Service VM
  7.3 FM Shared Library and APIs
  7.4 Fabric Manager Resiliency
  7.5 Cycle Management
  7.6 Guest VM Life Cycle Management
  7.7 Error Handling
  7.8 Interoperability With MIG

Chapter 8. vGPU Virtualization Model
  8.1 Software Stack
  8.2 Preparing the vGPU Host
  8.3 FM Shared Library and APIs
  8.4 Fabric Manager Resiliency
  8.5 vGPU Partitions
  8.6 Guest VM Life Cycle Management
  8.7 Error Handling
  8.8 GPU Reset
  8.9 Interoperability with MIG

Chapter 9. Supported High Availability Modes
  9.1 Definition of Common Terms
  9.2 GPU Access NVLink Failure
  9.3 Trunk NVLink Failure
  9.4 NVSwitch Failure
  9.5 GPU Failure
  9.6 Manual Degradation

Appendix A. NVLink Topology
Appendix B. GPU Partitions
Appendix C. Resiliency
Appendix D. Error Handling
Overview
Introduction
NVIDIA DGX™ A100 and NVIDIA HGX™ A100 8-GPU¹ server systems use NVIDIA® NVLink® switches (NVIDIA® NVSwitch™), which enable all-to-all communication over the NVLink fabric. The DGX A100 and HGX A100 8-GPU systems both consist of a GPU baseboard with eight NVIDIA A100 GPUs and six NVSwitches. Each A100 GPU has two NVLink connections to each NVSwitch on the same GPU baseboard. Additionally, two GPU baseboards can be connected together to build a 16-GPU system. Between the two GPU baseboards, the only NVLink connections are between NVSwitches, with each switch from one GPU baseboard connected to a single NVSwitch on the other GPU baseboard for a total of sixteen NVLink connections.
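The per-baseboard link counts implied by this description can be checked with a few lines of arithmetic. The following is a minimal sketch in Python. The constant and function names are illustrative assumptions and are not part of any NVIDIA tool or API. Only the three counts (eight GPUs, six NVSwitches, two links per GPU-to-NVSwitch pair) come from the text.

```python
# Counts stated in the Introduction; all names here are illustrative only.
GPUS_PER_BASEBOARD = 8
NVSWITCHES_PER_BASEBOARD = 6
LINKS_PER_GPU_SWITCH_PAIR = 2


def access_links_per_gpu():
    """NVLink connections from one GPU to all NVSwitches on its baseboard."""
    return NVSWITCHES_PER_BASEBOARD * LINKS_PER_GPU_SWITCH_PAIR


def build_access_links():
    """Enumerate every GPU-to-NVSwitch link on one baseboard.

    Each entry is a (gpu_index, switch_index, link_index) tuple.
    """
    return [
        (gpu, switch, link)
        for gpu in range(GPUS_PER_BASEBOARD)
        for switch in range(NVSWITCHES_PER_BASEBOARD)
        for link in range(LINKS_PER_GPU_SWITCH_PAIR)
    ]


if __name__ == "__main__":
    print(access_links_per_gpu())     # 12 links per GPU
    print(len(build_access_links()))  # 96 GPU-to-NVSwitch links per baseboard
```

Under these counts, each GPU uses twelve NVLink connections and one baseboard carries 96 GPU-to-NVSwitch links in total.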
Terminology
Abbreviation   Definition
FM             Fabric Manager
MMIO           Memory Mapped IO
VM             Virtual Machine
GPU register   A location in the GPU MMIO space
SBR            Secondary Bus Reset
DCGM           NVIDIA Data Center GPU Manager
NVML           NVIDIA Management Library
Service VM     A privileged VM where the NVIDIA NVSwitch software stack runs
Access NVLink  NVLink between a GPU and an NVSwitch
Trunk NVLink   NVLink between two GPU baseboards
SMBPBI         NVIDIA SMBus Post-Box Interface
¹ The NVIDIA HGX A100 8-GPU will be referred to as the HGX A100 in the rest of this document.
Abbreviation   Definition
vGPU           NVIDIA GRID Virtual GPU
MIG            Multi-Instance GPU
SR-IOV         Single-Root IO Virtualization
PF             Physical Function
VF             Virtual Function
GFID           GPU Function Identification
Partition      A collection of GPUs which are allowed to perform NVLink Peer-to-Peer Communication among themselves
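The Partition definition can be read as a simple membership rule: two GPUs may use NVLink peer-to-peer communication only when some partition contains both of them. The sketch below illustrates that rule in Python. The `Partition` class, its fields, and `can_use_nvlink_p2p` are hypothetical names chosen for illustration. They are not the Fabric Manager SDK data structures, and the GPU groupings are made-up sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Hypothetical model of a partition: an ID plus the GPUs it contains."""
    partition_id: int
    gpu_ids: frozenset


def can_use_nvlink_p2p(partitions, gpu_a, gpu_b):
    """Return True if at least one partition contains both GPUs."""
    return any(
        gpu_a in p.gpu_ids and gpu_b in p.gpu_ids
        for p in partitions
    )


if __name__ == "__main__":
    # Sample data only: two 4-GPU partitions on an 8-GPU system.
    partitions = [
        Partition(partition_id=0, gpu_ids=frozenset({0, 1, 2, 3})),
        Partition(partition_id=1, gpu_ids=frozenset({4, 5, 6, 7})),
    ]
    print(can_use_nvlink_p2p(partitions, 0, 3))  # True: same partition
    print(can_use_nvlink_p2p(partitions, 3, 4))  # False: different partitions
```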
NVSwitch Core Software Stack
The core software stack required for NVSwitch management consists of an NVSwitch kernel driver and a privileged process called Fabric Manager (FM). The kernel driver performs low-level hardware management in response to Fabric Manager requests. The software stack also provides in-band and out-of-band monitoring solutions for reporting both NVSwitch and GPU errors and status information.
Figure 1. NVSwitch core software stack
[Figure labels: Third Party Integration Point for GPU & NVSwitch Monitoring; DCGM (GPU & NVSwitch Monitoring); NVML (Monitoring APIs); Fabric Manager Package (Fabric Manager Service, NVSwitch Audit Tool); User Mode / Kernel Mode boundary; NVIDIA Driver Package (GPU Driver, NVSwitch Driver); GPUs; NVSwitches; BMC (Out-of-Band).]
What is Fabric Manager?
NVIDIA Fabric Manager (FM) configures the NVSwitch memory fabrics to form a single memory fabric among all participating GPUs and monitors the NVLinks that support the fabric. At a high level, Fabric Manager has the following responsibilities:
- Coordinate with the NVSwitch driver to initialize and train NVSwitch-to-NVSwitch NVLink interconnects.
- Coordinate with the GPU driver to initialize and train NVSwitch-to-GPU NVLink interconnects.
- Configure routing among NVSwitch ports.