Fabric Manager for NVIDIA NVSwitch Systems
Virtualization/High Availability Modes User Guide
DU-09883-001_v1.3 | October 2023
Document History
Version | Date | Authors | Description of Change
0.1 | October 25, 2019 | SB | Initial Beta Release
0.2 | March 23, 2020 | SB | Updated error handling and bare metal mode
0.3 | May 11, 2020 | YL | Updated Shared NVSwitch APIs section with new API information
0.4 | July 7, 2020 | SB | Updated multi-instance GPU (MIG) interoperability and high availability details
0.5 | July 17, 2020 | SB | Updated running as non-root instructions
0.6 | August 3, 2020 | SB | Updated installation instructions based on CUDA repo and updated SXid error details
0.7 | January 26, 2021 | GT, CC | Updated with NVIDIA Virtual GPU (vGPU) multitenancy virtualization mode
0.8 | March 19, 2021 | SB | Updated High Availability section to reflect recent GPU excluded option changes
0.9 | October 19, 2022 | YL, SB, GT | Updated with NVIDIA® DGX™ H100 and NVIDIA HGX™ H100
1.0 | January 20, 2023 | YL | Updated GPU Module ID for DGX H100 and NVIDIA HGX H100
1.1 | June 23, 2023 | SB | Updated with log rotation options; updated NVIDIA HGX H100 NVIDIA NVLink® topology information; added support language for NVIDIA HGX A800 and NVIDIA HGX H800
1.2 | July 7, 2023 | EK, PKS | Updated D.4 Non-Fatal NVSwitch SXid Errors; updated D. Fatal NVSwitch SXid Errors; added D.9 GPU/VM/System Reset Capabilities and Limitations
1.3 | October 3, 2023 | YL, SB | Updated Shared NVSwitch 2 GPU partitions for DGX H100 and HGX H100; updated various FM package details, FM Service restart consideration for DGX H100 and HGX H100, Service VM memory requirements
Table of Contents
Overview
  Introduction
  NVSwitch-Based Systems
  Terminology
  NVSwitch Core Software Stack
  What is Fabric Manager?
Getting Started with Fabric Manager
  Basic Components
  NVSwitch and NVLink Initialization
  Supported Platforms
  Supported Deployment Models
  Other NVIDIA Software Packages
  Installation
  Managing the Fabric Manager Service
  Fabric Manager Startup Options
  Fabric Manager Service File
  Running Fabric Manager as Non-Root
  Fabric Manager Config Options
Bare Metal Mode
  Introduction
  Fabric Manager Installation
  Runtime NVSwitch and GPU Errors
  Interoperability With MIG
Virtualization Models
  Introduction
  Supported Virtualization Models
Fabric Manager SDK
  Data Structures
  Initializing the Fabric Manager API interface
  Shutting Down the Fabric Manager API interface
  Connect to Running the Fabric Manager Instance
  Disconnect from Running the Fabric Manager Instance
  Getting Supported Partitions
  Activate a GPU Partition
  Activate a GPU Partition with Virtual Functions
  Deactivate a GPU Partition
  Set Activated Partition List after a Fabric Manager Restart
  Get the NVLink Failed Devices
  Get Unsupported Partitions
Full Passthrough Virtualization Model
  Supported Virtual Machine Configurations
  Virtual Machines with 16 GPUs
  Virtual Machines with Eight GPUs
  Virtual Machines with Four GPUs
  Virtual Machines with Two GPUs
  Virtual Machine with One GPU
  Other Requirements
  Hypervisor Sequences
  Monitoring Errors
  Limitations
Shared NVSwitch Virtualization Model
  Software Stack
  Guest VM to Service VM Interaction
  Preparing the Service Virtual Machine
  FM Shared Library and APIs
  Fabric Manager Resiliency
  Service Virtual Machine Life Cycle Management
  Guest Virtual Machine Life Cycle Management
  Error Handling
  Interoperability With a Multi-Instance GPU
vGPU Virtualization Model
  Software Stack
  Preparing the vGPU Host
  Fabric Manager-Shared Library and APIs
  Fabric Manager Resiliency
  vGPU Partitions
  Guest Virtual Machine Life Cycle Management
  Error Handling
  GPU Reset
  Interoperability with MIG
Supported High Availability Modes
  Common Terms
  GPU Access NVLink Failure
  Trunk NVLink Failure
  NVSwitch Failure
  GPU Failure
  Manual Degradation
Appendix A. NVLink Topology
Appendix B. GPU Partitions
Appendix C. Resiliency
Appendix D. Error Handling