java.lang.Object
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.com.nvidia.NvidiaGPUPluginForRuntimeV2
All Implemented Interfaces:
DevicePlugin, DevicePluginScheduler

public class NvidiaGPUPluginForRuntimeV2 extends Object implements DevicePlugin, DevicePluginScheduler
Nvidia GPU plugin that supports both the Nvidia container runtime v2 for Docker and non-Docker containers. It offers topology-aware scheduling as well as simple scheduling.
  • Field Details

    • LOG

      public static final org.slf4j.Logger LOG
    • NV_RESOURCE_NAME

      public static final String NV_RESOURCE_NAME
    • TOPOLOGY_POLICY_ENV_KEY

      public static final String TOPOLOGY_POLICY_ENV_KEY
      A container can set this environment variable to tell the scheduler which policy to use when scheduling.
    • TOPOLOGY_POLICY_PACK

      public static final String TOPOLOGY_POLICY_PACK
      Scheduling policy that prefers faster GPU-GPU communication. Generally suitable for workloads with heavy GPU computation.
    • TOPOLOGY_POLICY_SPREAD

      public static final String TOPOLOGY_POLICY_SPREAD
      Scheduling policy that prefers faster CPU-GPU communication. Generally suitable for workloads with heavy CPU-GPU I/O.
  • Constructor Details

    • NvidiaGPUPluginForRuntimeV2

      public NvidiaGPUPluginForRuntimeV2()
  • Method Details

    • getRegisterRequestInfo

      public DeviceRegisterRequest getRegisterRequestInfo() throws Exception
      Description copied from interface: DevicePlugin
      Called first, when the device plugin framework wants to register the plugin.
      Specified by:
      getRegisterRequestInfo in interface DevicePlugin
      Returns:
      a DeviceRegisterRequest
      Throws:
      Exception
    • getDevices

      public Set<Device> getDevices() throws Exception
      Description copied from interface: DevicePlugin
      Called when the node resource is updated.
      Specified by:
      getDevices in interface DevicePlugin
      Returns:
      a set of Device, TreeSet recommended
      Throws:
      Exception
    • onDevicesAllocated

      public DeviceRuntimeSpec onDevicesAllocated(Set<Device> allocatedDevices, YarnRuntimeType yarnRuntime) throws Exception
      Description copied from interface: DevicePlugin
      Asks how these devices should be prepared or used before or during container launch. A plugin can perform some tasks on its own, or describe them in a DeviceRuntimeSpec and let the framework perform them. For instance, it can define a VolumeSpec so that the framework creates the volume before running the container.
      Specified by:
      onDevicesAllocated in interface DevicePlugin
      Parameters:
      allocatedDevices - A set of allocated Device.
      yarnRuntime - Indicates which runtime YARN will use. It can be RUNTIME_DEFAULT or RUNTIME_DOCKER, both constants in DeviceRuntimeSpec. RUNTIME_DEFAULT means YARN's non-Docker container runtime is used. RUNTIME_DOCKER means YARN's Docker container runtime is used.
      Returns:
      a DeviceRuntimeSpec describing the environment, VolumeSpec, MountVolumeSpec, etc.
      Throws:
      Exception
    • onDevicesReleased

      public void onDevicesReleased(Set<Device> releasedDevices) throws Exception
      Description copied from interface: DevicePlugin
      Called after devices are released.
      Specified by:
      onDevicesReleased in interface DevicePlugin
      Parameters:
      releasedDevices - A set of released devices
      Throws:
      Exception
    • allocateDevices

      public Set<Device> allocateDevices(Set<Device> availableDevices, int count, Map<String,String> envs)
      Description copied from interface: DevicePluginScheduler
      Called when allocating devices. The framework handles all device bookkeeping and failure recovery, so this hook can be stateless and make scheduling decisions based only on the available devices passed in. The framework may invoke it multiple times. Hints in the environment variables passed in can be used to make better scheduling decisions. For instance, GPU scheduling might support different kinds of policy, and the container can select one through environment variables.
      Specified by:
      allocateDevices in interface DevicePluginScheduler
      Parameters:
      availableDevices - Devices allowed to be chosen from.
      count - Number of devices to be allocated.
      envs - Environment variables of the container.
      Returns:
      A set of Device allocated
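      A minimal sketch of what a stateless allocation hook can look like, under stated assumptions: devices are modeled as plain Integer IDs rather than Device objects, and the hint key EXAMPLE_ORDER_HINT and its value "highest-first" are hypothetical names invented for this illustration. It is not the plugin's implementation. The decision depends only on the arguments passed in, so repeated invocations with the same input give the same result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class StatelessAllocationSketch {
    // Hypothetical hint key and value, used only for this illustration.
    static final String ORDER_HINT_KEY = "EXAMPLE_ORDER_HINT";
    static final String HIGHEST_FIRST = "highest-first";

    // Picks `count` device IDs using only the arguments; no state is kept between calls.
    static Set<Integer> allocate(Set<Integer> available, int count, Map<String, String> envs) {
        if (count <= 0 || count > available.size()) {
            return Collections.emptySet();
        }
        // Sort IDs so the choice is deterministic for a given input.
        List<Integer> ordered = new ArrayList<>(new TreeSet<>(available));
        // The environment hint only changes the order in which candidates are considered.
        if (HIGHEST_FIRST.equals(envs.get(ORDER_HINT_KEY))) {
            Collections.reverse(ordered);
        }
        return new TreeSet<>(ordered.subList(0, count));
    }

    public static void main(String[] args) {
        Set<Integer> available = new HashSet<>(Arrays.asList(3, 1, 2));
        Map<String, String> envs = new HashMap<>();
        System.out.println(allocate(available, 2, envs));
        envs.put(ORDER_HINT_KEY, HIGHEST_FIRST);
        System.out.println(allocate(available, 2, envs));
    }
}
```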
    • initCostTable

      @VisibleForTesting public void initCostTable() throws IOException
      Throws:
      IOException
    • computeCostOfDevices

      @VisibleForTesting public int computeCostOfDevices(Device[] devices)
      The cost function used to calculate the cost of a subset of devices. It computes the link weight of each pair in the non-duplicated pairwise combinations of the devices.
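      The pairwise idea can be sketched as follows. This is an illustration under assumptions, not the plugin's code: devices are plain integer IDs, the pair key format "i-j" (smaller ID first) is hypothetical, and the sketch assumes the per-pair weights are summed into a single cost.

```java
import java.util.HashMap;
import java.util.Map;

class PairCostSketch {
    // Sums the weight of every unordered pair (i, j) with i < j, visiting each pair once.
    // The "a-b" key format is a hypothetical choice for this sketch.
    static int costOf(int[] deviceIds, Map<String, Integer> pairToWeight) {
        int cost = 0;
        for (int i = 0; i < deviceIds.length; i++) {
            for (int j = i + 1; j < deviceIds.length; j++) {
                int a = Math.min(deviceIds[i], deviceIds[j]);
                int b = Math.max(deviceIds[i], deviceIds[j]);
                // Pairs without a known weight contribute nothing in this sketch.
                cost += pairToWeight.getOrDefault(a + "-" + b, 0);
            }
        }
        return cost;
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new HashMap<>();
        weights.put("0-1", 1);
        weights.put("0-2", 4);
        weights.put("1-2", 4);
        // Three devices produce three pairs: (0,1), (0,2), (1,2).
        System.out.println(costOf(new int[] {0, 1, 2}, weights));
    }
}
```

      For n devices the loop visits n(n-1)/2 pairs, so three devices yield three weights to combine.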
    • topologyAwareSchedule

      @VisibleForTesting public void topologyAwareSchedule(Set<Device> allocation, int count, Map<String,String> envs, Set<Device> availableDevices, Map<Integer,List<Map.Entry<Set<Device>,Integer>>> cTable)
      Topology-aware scheduling algorithm. It does not consider CPU affinity, NUMA, or bus bandwidth. It supports two policies, "spread" and "pack", which can be set through the container's environment variable. "Pack" is the default and prefers faster GPU-GPU communication; "spread" prefers faster CPU-GPU communication. It could potentially be extended to take GPU attributes such as GPU chip memory into consideration.
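      One way to picture the two policies is as a walk over candidate device sets ordered by cost. The sketch below is an assumption-laden illustration, not the plugin's algorithm: it assumes a lower cost means faster GPU-GPU links, so "pack" takes the cheapest candidate that is fully available and "spread" takes the most expensive one. Devices are modeled as Integer IDs, and the class and helper names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TopologyPolicySketch {
    // Helper that builds one (device set, cost) candidate.
    static Map.Entry<Set<Integer>, Integer> entry(int cost, Integer... ids) {
        Set<Integer> set = new HashSet<>(Arrays.asList(ids));
        return new AbstractMap.SimpleEntry<>(set, Integer.valueOf(cost));
    }

    // pack = true: try cheapest candidates first; pack = false (spread): most expensive first.
    static Set<Integer> choose(List<Map.Entry<Set<Integer>, Integer>> candidates,
                               Set<Integer> available, boolean pack) {
        List<Map.Entry<Set<Integer>, Integer>> sorted = new ArrayList<>(candidates);
        Comparator<Map.Entry<Set<Integer>, Integer>> byCost = Map.Entry.comparingByValue();
        sorted.sort(pack ? byCost : byCost.reversed());
        for (Map.Entry<Set<Integer>, Integer> candidate : sorted) {
            // Accept the first candidate whose devices are all still available.
            if (available.containsAll(candidate.getKey())) {
                return candidate.getKey();
            }
        }
        return Collections.emptySet();
    }

    public static void main(String[] args) {
        List<Map.Entry<Set<Integer>, Integer>> candidates = new ArrayList<>();
        candidates.add(entry(1, 0, 1));
        candidates.add(entry(4, 0, 2));
        candidates.add(entry(1, 2, 3));
        Set<Integer> available = new HashSet<>(Arrays.asList(0, 2, 3));
        System.out.println("pack:   " + choose(candidates, available, true));
        System.out.println("spread: " + choose(candidates, available, false));
    }
}
```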
    • basicSchedule

      @VisibleForTesting public void basicSchedule(Set<Device> allocation, int count, Set<Device> availableDevices)
    • parseTopo

      public void parseTopo(String topo, Map<String,Integer> deviceLinkToWeight)
      A typical sample topo output:

              GPU0  GPU1  GPU2  GPU3  CPU Affinity
        GPU0   X    PHB   SOC   SOC   0-31
        GPU1  PHB    X    SOC   SOC   0-31
        GPU2  SOC   SOC    X    PHB   0-31
        GPU3  SOC   SOC   PHB    X    0-31

      Legend:

        X   = Self
        SOC = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
        PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
        PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
        PIX = Connection traversing a single PCIe switch
        NV# = Connection traversing a bonded set of # NVLinks
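      A rough sketch of how a matrix like the sample above could be turned into a pair-to-weight map. Everything here is an assumption made for illustration: the weight values, the "row-col" key format, and the parsing rules are hypothetical and do not reflect the plugin's actual link weights or implementation. NV# links are simply skipped in this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class TopoParseSketch {
    // Hypothetical weights: a smaller value stands for a faster link. Not the plugin's values.
    static final Map<String, Integer> WEIGHTS = new HashMap<>();
    static {
        WEIGHTS.put("PIX", 1);
        WEIGHTS.put("PXB", 2);
        WEIGHTS.put("PHB", 3);
        WEIGHTS.put("SOC", 4);
    }

    static final String SAMPLE = String.join("\n",
        "     GPU0 GPU1 GPU2 GPU3 CPU Affinity",
        "GPU0  X   PHB  SOC  SOC  0-31",
        "GPU1 PHB   X   SOC  SOC  0-31",
        "GPU2 SOC  SOC   X   PHB  0-31",
        "GPU3 SOC  SOC  PHB   X   0-31",
        "",
        "Legend:",
        "  X = Self");

    // Returns weights keyed by "row-col" with row < col, reading the upper triangle only.
    static Map<String, Integer> parse(String topo) {
        Map<String, Integer> pairToWeight = new TreeMap<>();
        int gpuCount = 0;
        for (String line : topo.split("\n")) {
            String[] t = line.trim().split("\\s+");
            if (t.length < 2 || !t[0].matches("GPU\\d+")) {
                continue; // blank lines and legend lines
            }
            if (t[1].matches("GPU\\d+")) {
                // Header row: count the GPU columns.
                gpuCount = 0;
                for (String s : t) {
                    if (s.matches("GPU\\d+")) gpuCount++;
                }
                continue;
            }
            int row = Integer.parseInt(t[0].substring(3));
            for (int col = row + 1; col < gpuCount && col + 1 < t.length; col++) {
                Integer weight = WEIGHTS.get(t[col + 1]); // t[0] is the row label
                if (weight != null) {
                    pairToWeight.put(row + "-" + col, weight);
                }
            }
        }
        return pairToWeight;
    }

    public static void main(String[] args) {
        System.out.println(parse(SAMPLE));
    }
}
```

      Reading only the upper triangle avoids recording each GPU pair twice, since the matrix is symmetric.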
    • setPathOfGpuBinary

      @VisibleForTesting public void setPathOfGpuBinary(String pOfGpuBinary)
    • setShellExecutor

      @VisibleForTesting public void setShellExecutor(NvidiaGPUPluginForRuntimeV2.NvidiaCommandExecutor shellExecutor)
    • isTopoInitialized

      @VisibleForTesting public boolean isTopoInitialized()
    • getCostTable

      @VisibleForTesting public Map<Integer,List<Map.Entry<Set<Device>,Integer>>> getCostTable()
    • getDevicePairToWeight

      @VisibleForTesting public Map<String,Integer> getDevicePairToWeight()