Class NvidiaGPUPluginForRuntimeV2
java.lang.Object
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.com.nvidia.NvidiaGPUPluginForRuntimeV2
- All Implemented Interfaces:
DevicePlugin, DevicePluginScheduler
public class NvidiaGPUPluginForRuntimeV2
extends Object
implements DevicePlugin, DevicePluginScheduler
Nvidia GPU plugin supporting both the Nvidia container runtime v2 for Docker and
non-Docker containers.
It provides topology-aware as well as simple scheduling ability.
-
Nested Class Summary
Nested Classes:
- static enum: Different types of link.
- class: A shell wrapper class for ease of testing. -
Field Summary
Fields:
- static final org.slf4j.Logger LOG
- static final String NV_RESOURCE_NAME
- static final String TOPOLOGY_POLICY_ENV_KEY: The container can set this environment variable.
- static final String TOPOLOGY_POLICY_PACK: Schedule policy that prefers faster GPU-GPU communication.
- static final String TOPOLOGY_POLICY_SPREAD: Schedule policy that prefers faster CPU-GPU communication. -
Constructor Summary
Constructors -
Method Summary
Methods:
- Set<Device> allocateDevices(Set<Device> availableDevices, int count, Map<String, String> envs): Called when allocating devices.
- void basicSchedule(Set<Device> allocation, int count, Set<Device> availableDevices)
- int computeCostOfDevices(Device[] devices): The cost function used to calculate the cost of a subset of devices.
- getCostTable()
- getDevicePairToWeight()
- getDevices(): Called when updating node resources.
- getRegisterRequestInfo(): Called first when the device plugin framework wants to register.
- void initCostTable()
- boolean isTopoInitialized()
- DeviceRuntimeSpec onDevicesAllocated(Set<Device> allocatedDevices, YarnRuntimeType yarnRuntime): Asks how these devices should be prepared and used before and during container launch.
- void onDevicesReleased(Set<Device> releasedDevices): Called after devices are released.
- void parseTopo(…): Parses a typical topo output (see method details).
- void setPathOfGpuBinary(String pOfGpuBinary)
- void setShellExecutor(NvidiaGPUPluginForRuntimeV2.NvidiaCommandExecutor shellExecutor)
- void topologyAwareSchedule(Set<Device> allocation, int count, Map<String, String> envs, Set<Device> availableDevices, Map<Integer, List<Map.Entry<Set<Device>, Integer>>> cTable): Topology-aware schedule algorithm.
-
Field Details
-
LOG
public static final org.slf4j.Logger LOG -
NV_RESOURCE_NAME
-
TOPOLOGY_POLICY_ENV_KEY
The container can set this environment variable to tell the scheduler which policy to use when scheduling.
-
TOPOLOGY_POLICY_PACK
Schedule policy that prefers faster GPU-GPU communication. Generally suitable for heavy GPU computation workloads.
-
TOPOLOGY_POLICY_SPREAD
Schedule policy that prefers faster CPU-GPU communication. Generally suitable for heavy CPU-GPU I/O operations.
-
-
Constructor Details
-
NvidiaGPUPluginForRuntimeV2
public NvidiaGPUPluginForRuntimeV2()
-
-
Method Details
-
getRegisterRequestInfo
Description copied from interface: DevicePlugin
Called first when the device plugin framework wants to register.
- Specified by:
getRegisterRequestInfo in interface DevicePlugin
- Returns:
DeviceRegisterRequest
- Throws:
Exception
-
getDevices
Description copied from interface: DevicePlugin
Called when updating node resources.
- Specified by:
getDevices in interface DevicePlugin
- Returns:
a set of Device; a TreeSet is recommended
- Throws:
Exception
-
onDevicesAllocated
public DeviceRuntimeSpec onDevicesAllocated(Set<Device> allocatedDevices, YarnRuntimeType yarnRuntime) throws Exception
Description copied from interface: DevicePlugin
Asks how these devices should be prepared and used before and during container launch. A plugin can do some tasks on its own or define them in a DeviceRuntimeSpec to let the framework do them. For instance, define a VolumeSpec to let the framework create the volume before running the container.
- Specified by:
onDevicesAllocated in interface DevicePlugin
- Parameters:
allocatedDevices - A set of allocated Device.
yarnRuntime - Indicates which runtime YARN will use. Could be RUNTIME_DEFAULT or RUNTIME_DOCKER in the DeviceRuntimeSpec constants. Default means YARN's non-Docker container runtime is used; Docker means YARN's Docker container runtime is used.
- Returns:
a DeviceRuntimeSpec description of the environment, VolumeSpec, MountVolumeSpec, etc.
- Throws:
Exception
-
onDevicesReleased
Description copied from interface: DevicePlugin
Called after devices are released.
- Specified by:
onDevicesReleased in interface DevicePlugin
- Parameters:
releasedDevices - A set of released devices
- Throws:
Exception
-
allocateDevices
public Set<Device> allocateDevices(Set<Device> availableDevices, int count, Map<String, String> envs)
Description copied from interface: DevicePluginScheduler
Called when allocating devices. The framework does all device bookkeeping and failure recovery, so this hook can be stateless and only schedule based on the available devices passed in. It may be invoked multiple times by the framework. The hints in the environment variables passed in can potentially be used to make better scheduling decisions. For instance, GPU scheduling might support different kinds of policies, which the container can set through environment variables.
- Specified by:
allocateDevices in interface DevicePluginScheduler
- Parameters:
availableDevices - Devices allowed to be chosen from.
count - Number of devices to be allocated.
envs - Environment variables of the container.
- Returns:
A set of allocated Device
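As a rough illustration of this stateless hook, the sketch below models devices as plain integer IDs and reads a scheduling-policy hint from the container environment. The env key name, method shape, and fallback behavior are assumptions for illustration; the real plugin uses YARN's Device class, its own TOPOLOGY_POLICY_ENV_KEY constant, and a topology cost table.

```java
import java.util.*;

// Hypothetical, self-contained sketch of a stateless device scheduler hook.
// Devices are modeled as integer IDs; the framework is assumed to handle
// bookkeeping and recovery, as the docs above describe.
public class AllocateSketch {
    // Assumed env key name; the real constant lives on the plugin class.
    static final String POLICY_KEY = "NVIDIA_TOPO_POLICY";

    static Set<Integer> allocate(Set<Integer> available, int count,
                                 Map<String, String> envs) {
        String policy = envs.getOrDefault(POLICY_KEY, "pack");
        boolean topologyAware = policy.equals("pack") || policy.equals("spread");
        // Without an initialized cost table this sketch always falls back to
        // the basic path: take the first `count` devices in sorted order.
        Set<Integer> allocation = new LinkedHashSet<>();
        for (Integer d : new TreeSet<>(available)) {
            if (allocation.size() == count) break;
            allocation.add(d);
        }
        return allocation;
    }
}
```

In the real plugin, a "pack" or "spread" hint would route to the topology-aware path instead of the basic one.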
-
initCostTable
- Throws:
IOException
-
computeCostOfDevices
The cost function used to calculate the cost of a subset of devices. It calculates the link weight of each pair in the non-duplicated combinations of devices. -
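The pairwise idea can be sketched as follows: sum the link weight of every unordered device pair in the subset. The weight map and its string keys are made up for illustration; the real plugin derives weights from the parsed GPU topology.

```java
import java.util.*;

// Sketch of the subset cost function: total link weight over all
// non-duplicated (unordered) device pairs in the given subset.
public class CostSketch {
    static int computeCost(int[] devices, Map<String, Integer> pairWeight) {
        int cost = 0;
        for (int i = 0; i < devices.length; i++) {
            for (int j = i + 1; j < devices.length; j++) {
                // Normalize the pair key so (a, b) and (b, a) match.
                int a = Math.min(devices[i], devices[j]);
                int b = Math.max(devices[i], devices[j]);
                cost += pairWeight.getOrDefault(a + "-" + b, 0);
            }
        }
        return cost;
    }
}
```

For n devices this visits n(n-1)/2 pairs, which is why the plugin precomputes a cost table per subset size rather than recomputing on every call.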
topologyAwareSchedule
@VisibleForTesting public void topologyAwareSchedule(Set<Device> allocation, int count, Map<String, String> envs, Set<Device> availableDevices, Map<Integer, List<Map.Entry<Set<Device>, Integer>>> cTable) Topology-aware schedule algorithm. It does not consider CPU affinity, NUMA, or bus bandwidth. It supports two policies, "spread" and "pack", which can be set through the container's environment variable. "Pack" is used by default and prefers faster GPU-GPU communication; "spread" prefers faster CPU-GPU communication. It could potentially be extended to take GPU attributes such as GPU chip memory into consideration. -
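A minimal sketch of the selection step, assuming the cost table maps a subset size to (subset, cost) entries and that larger weights mean slower links: "pack" then picks the cheapest subset and "spread" the most expensive one. Types and names here are illustrative, not the real plugin's.

```java
import java.util.*;

// Sketch of topology-aware selection over a precomputed cost table.
// cTable: subset size -> list of (device subset, total pairwise cost).
public class TopoScheduleSketch {
    static Set<Integer> schedule(int count, String policy,
            Map<Integer, List<Map.Entry<Set<Integer>, Integer>>> cTable) {
        Map.Entry<Set<Integer>, Integer> best = null;
        for (Map.Entry<Set<Integer>, Integer> e : cTable.get(count)) {
            // "spread" maximizes cost (far-apart GPUs, faster CPU-GPU);
            // "pack" (the default) minimizes it (close GPUs, faster GPU-GPU).
            if (best == null
                || ("spread".equals(policy) ? e.getValue() > best.getValue()
                                            : e.getValue() < best.getValue())) {
                best = e;
            }
        }
        return best.getKey();
    }
}
```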
basicSchedule
-
parseTopo
A typical sample topo output:

      GPU0  GPU1  GPU2  GPU3  CPU Affinity
GPU0   X    PHB   SOC   SOC   0-31
GPU1  PHB    X    SOC   SOC   0-31
GPU2  SOC   SOC    X    PHB   0-31
GPU3  SOC   SOC   PHB    X    0-31

Legend:
X = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks -
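Parsing a matrix like the one above can be sketched as below: each "GPUn" row is tokenized, the diagonal "X" is skipped, and only recognized link-type tokens become entries in a pair-to-link map. This is an illustrative stand-in for the real parseTopo, which maps link types to numeric weights instead.

```java
import java.util.*;

// Sketch of parsing an `nvidia-smi topo -m` style matrix into a map
// from GPU pair (e.g. "GPU0-GPU1") to link type (e.g. "PHB").
public class TopoParseSketch {
    static Map<String, String> parse(String[] lines) {
        Map<String, String> links = new LinkedHashMap<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length == 0 || !tokens[0].startsWith("GPU")) continue;
            String rowGpu = tokens[0];
            for (int col = 1; col < tokens.length; col++) {
                String cell = tokens[col];
                if ("X".equals(cell)) continue; // self link on the diagonal
                // Keep only known link types; this drops the CPU Affinity
                // column (e.g. "0-31") and any header tokens.
                if (!cell.matches("SOC|PHB|PXB|PIX|NV\\d+")) continue;
                links.put(rowGpu + "-GPU" + (col - 1), cell);
            }
        }
        return links;
    }
}
```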
setPathOfGpuBinary
-
setShellExecutor
@VisibleForTesting public void setShellExecutor(NvidiaGPUPluginForRuntimeV2.NvidiaCommandExecutor shellExecutor) -
isTopoInitialized
@VisibleForTesting public boolean isTopoInitialized() -
getCostTable
-
getDevicePairToWeight
-