Fabric Manager for NVIDIA NVSwitch Systems

Virtualization/High Availability Modes User Guide

DU-09883-001_v1.3 | October 2023

Document History

DU-09883-001_v1.3

Version | Date | Authors | Description of Change
0.1 | October 25, 2019 | SB | Initial Beta Release
0.2 | March 23, 2020 | SB | Updated error handling and bare metal mode
0.3 | May 11, 2020 | YL | Updated Shared NVSwitch APIs section with new API information
0.4 | July 7, 2020 | SB | Updated multi-instance GPU (MIG) interoperability and high availability details
0.5 | July 17, 2020 | SB | Updated running as non-root instructions
0.6 | August 3, 2020 | SB | Updated installation instructions based on CUDA repo and updated SXid error details
0.7 | January 26, 2021 | GT, CC | Updated with NVIDIA Virtual GPU (vGPU) multitenancy virtualization mode
0.8 | March 19, 2021 | SB | Updated High Availability section to reflect recent GPU excluded option changes
0.9 | October 19, 2022 | YL, SB, GT | Updated with NVIDIA® DGX™ H100 and NVIDIA HGX™ H100
1.0 | January 20, 2023 | YL | Updated GPU Module ID for DGX H100 and NVIDIA HGX H100
1.1 | June 23, 2023 | SB | Updated with log rotation options; updated NVIDIA HGX H100 NVIDIA NVLink topology information; added support language for NVIDIA HGX A800 and NVIDIA HGX H800
1.2 | July 7, 2023 | EK, PKS | Updated D.4 Non-Fatal NVSwitch SXid Errors; updated D. Fatal NVSwitch SXid Errors; added D.9 GPU/VM/System Reset Capabilities and Limitations
1.3 | October 3, 2023 | YL, SB | Updated Shared NVSwitch 2 GPU partitions for DGX H100 and HGX H100; updated various FM package details, FM Service restart consideration for DGX H100 and HGX H100, and Service VM memory requirements

Table of Contents

Overview
  Introduction
  NVSwitch-Based Systems
  Terminology
  NVSwitch Core Software Stack
  What is Fabric Manager?

Getting Started with Fabric Manager
  Basic Components
  NVSwitch and NVLink Initialization
  Supported Platforms
  Supported Deployment Models
  Other NVIDIA Software Packages
  Installation
  Managing the Fabric Manager Service
  Fabric Manager Startup Options
  Fabric Manager Service File
  Running Fabric Manager as Non-Root
  Fabric Manager Config Options

Bare Metal Mode
  Introduction
  Fabric Manager Installation
  Runtime NVSwitch and GPU Errors
  Interoperability With MIG

Virtualization Models
  Introduction
  Supported Virtualization Models

Fabric Manager SDK
  Data Structures
  Initializing the Fabric Manager API interface
  Shutting Down the Fabric Manager API interface
  Connect to Running the Fabric Manager Instance
  Disconnect from Running the Fabric Manager Instance
  Getting Supported Partitions
  Activate a GPU Partition
  Activate a GPU Partition with Virtual Functions
  Deactivate a GPU Partition
  Set Activated Partition List after a Fabric Manager Restart
  Get the NVLink Failed Devices
  Get Unsupported Partitions

Full Passthrough Virtualization Model
  Supported Virtual Machine Configurations
  Virtual Machines with 16 GPUs
  Virtual Machines with Eight GPUs
  Virtual Machines with Four GPUs
  Virtual Machines with Two GPUs
  Virtual Machine with One GPU
  Other Requirements
  Hypervisor Sequences
  Monitoring Errors
  Limitations

Shared NVSwitch Virtualization Model
  Software Stack
  Guest VM to Service VM Interaction
  Preparing the Service Virtual Machine
  FM Shared Library and APIs
  Fabric Manager Resiliency
  Service Virtual Machine Life Cycle Management
  Guest Virtual Machine Life Cycle Management
  Error Handling
  Interoperability With a Multi-Instance GPU

vGPU Virtualization Model
  Software Stack
  Preparing the vGPU Host
  Fabric Manager-Shared Library and APIs
  Fabric Manager Resiliency
  vGPU Partitions
  Guest Virtual Machine Life Cycle Management
  Error Handling
  GPU Reset
  Interoperability with MIG

Supported High Availability Modes
  Common Terms
  GPU Access NVLink Failure
  Trunk NVLink Failure
  NVSwitch Failure
  GPU Failure
  Manual Degradation

Appendix A. NVLink Topology
Appendix B. GPU Partitions
Appendix C. Resiliency
Appendix D. Error Handling
