Hardware
The system is based around the IBM POWER9 CPU and NVIDIA Tesla GPUs. Within a node, the CPUs and GPUs are connected by an NVIDIA NVLink 2.0 bus; between nodes, a dual-rail Mellanox EDR InfiniBand interconnect allows GPUDirect RDMA communications (direct memory transfers to/from GPU memory).
Together with IBM’s software engineering, the POWER9 architecture is uniquely positioned for:

- Large-memory GPU use, as the GPUs are able to access main system memory via POWER9’s large model feature (see the sketch after this list).
- Multi-node GPU use, via IBM’s Distributed Deep Learning (DDL) software.
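As an illustrative sketch only (the kernel name and buffer size below are invented for this example, not taken from any system tooling), the following CUDA program allocates a managed buffer larger than a single V100's 32GB. On these nodes the driver can page data between the 512GB of system memory and the GPU on demand, so the oversubscribed allocation still works:

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: scales each element of the buffer in place.
__global__ void scale(double *x, size_t n, double a) {
    // 64-bit index arithmetic: the buffer has more than 2^32 elements.
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t bytes = 64ULL << 30;            // 64 GiB: exceeds one V100's 32GB
    const size_t n = bytes / sizeof(double);

    double *x = nullptr;
    if (cudaMallocManaged(&x, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed\n");
        return 1;
    }

    for (size_t i = 0; i < n; ++i) x[i] = 1.0;   // pages touched first on the CPU

    const int threads = 256;
    const size_t blocks = (n + threads - 1) / threads;
    scale<<<(unsigned)blocks, threads>>>(x, n, 2.0);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                 // pages migrate back on CPU access
    cudaFree(x);
    return 0;
}
```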
There are:

- 2x `login` nodes, each containing:
  - 2x POWER9 CPUs @ 2.4GHz (40 cores total and 4 hardware threads per core), with NVLink 2.0
  - 512GB DDR4 RAM
  - 4x Tesla V100 32GB NVLink 2.0
  - 1x Mellanox EDR (100Gbit/s) InfiniBand port
- 32x `gpu` nodes, each containing:
  - 2x POWER9 CPUs @ 2.7GHz (32 cores total and 4 hardware threads per core), with NVLink 2.0
  - 512GB DDR4 RAM
  - 4x Tesla V100 32GB NVLink 2.0
  - 2x Mellanox EDR (100Gbit/s) InfiniBand ports
- 4x `infer` nodes, each containing:
  - 2x POWER9 CPUs @ 2.9GHz (40 cores total and 4 hardware threads per core)
  - 256GB DDR4 RAM
  - 4x Tesla T4 16GB PCIe
  - 1x Mellanox EDR (100Gbit/s) InfiniBand port
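A job can confirm the intra-node GPU topology for itself. The minimal CUDA sketch below (not part of any system tooling) asks the runtime whether each pair of the four GPUs can access each other's memory directly, which on the `login` and `gpu` nodes is carried over NVLink:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("GPUs visible: %d\n", count);

    // Query peer access for every ordered pair of devices.
    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < count; ++j) {
            if (i == j) continue;
            int peer = 0;
            cudaDeviceCanAccessPeer(&peer, i, j);
            printf("GPU %d -> GPU %d peer access: %s\n",
                   i, j, peer ? "yes" : "no");
        }
    }
    return 0;
}
```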
The Mellanox EDR InfiniBand interconnect is organised in a 2:1 blocking fat tree topology. GPUDirect RDMA transfers are supported on the 32 `gpu` nodes only, as they require an InfiniBand port per POWER9 CPU socket.
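From application code, GPUDirect RDMA is typically exercised by passing device pointers straight to MPI. The sketch below assumes a CUDA-aware MPI build (for example, an Open MPI compiled with CUDA support); the buffer size and two-rank exchange are illustrative only:

```
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }  // needs two ranks

    const int n = 1 << 20;
    double *buf = nullptr;
    cudaMalloc(&buf, n * sizeof(double));        // device memory, not host

    if (rank == 0) {
        cudaMemset(buf, 0, n * sizeof(double));
        // Device pointer passed directly to MPI: a CUDA-aware MPI can
        // move the data from GPU memory over InfiniBand without a
        // host staging copy.
        MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d doubles into GPU memory\n", n);
    }

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}
```

On the `gpu` nodes, with an InfiniBand port per CPU socket, such a transfer can move directly between GPU memory and the network adapter.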
Storage is provided by a 2PB Lustre filesystem capable of reaching 10GB/s read or write performance, supplemented by an NFS service serving home and project directories with more modest needs.