PyTorch#
PyTorch is an end-to-end machine learning framework. PyTorch enables fast, flexible experimentation and efficient production through a user-friendly front-end, distributed training, and ecosystem of tools and libraries.
The main method of distribution for PyTorch for ppc64le is via Conda, with Open-CE providing a simple method for installing multiple machine learning frameworks into a single conda environment.
The upstream Conda and pip distributions do not provide ppc64le PyTorch packages at this time.
Installing via Conda#
With a working Conda installation (see Installing Miniconda), the following instructions can be used to create a Python 3.9 conda environment named torch with the latest Open-CE provided PyTorch:
Note
PyTorch installations via conda can be relatively large. Consider installing your Miniconda (and therefore your conda environments) to the /nobackup file store.
# Create a new conda environment named torch within your conda installation
conda create -y --name torch python=3.9
# Activate the conda environment
conda activate torch
# Add the OSU Open-CE conda channel to the current environment config
conda config --env --prepend channels https://ftp.osuosl.org/pub/open-ce/current/
# Also use strict channel priority
conda config --env --set channel_priority strict
# Install the latest available version of PyTorch
conda install -y pytorch
In subsequent interactive sessions, and when submitting batch jobs which use PyTorch, you will then need to re-activate the conda environment.
For example, to verify that PyTorch is available and print the version:
# Activate the conda environment
conda activate torch
# Invoke python
python3 -c "import torch;print(torch.__version__)"
Installation via the upstream Conda channel is not currently possible, due to the lack of ppc64le or noarch distributions.
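Beyond printing the version, a short script can confirm that basic tensor operations work and whether a GPU is visible to PyTorch. The following is a minimal sketch (it falls back to the CPU when no GPU is available, e.g. when run on a login node):

```python
import torch

# Select a GPU if one is visible to PyTorch, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# A small matrix multiplication as a smoke test
x = torch.ones(2, 3, device=device)
y = x @ x.T  # (2, 3) @ (3, 2) -> (2, 2), each element equal to 3.0

print(torch.__version__)
print("device:", device)
print("result sum:", y.sum().item())  # 4 elements * 3.0 = 12.0
```

Run this with the torch conda environment activated; on a GPU node it should report the cuda device, while on a login node it will report cpu.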
Warning
Conda builds of PyTorch for aarch64 do not include CUDA support as of July 2024.
For now, consider:
- Installing a 2.4.0 build using the CUDA 12.4 PyTorch channel via pip (see Installing via pip)
- Using containers provided by NVIDIA for a pip-based environment (see Using NGC PyTorch Containers)
- Building PyTorch from source into a conda environment.
Installing via pip#
Warning
pip does not provide ppc64le builds of PyTorch (from PyPI or the PyTorch wheel repositories). Instead, see Installing via Conda or build from source.
PyTorch pip packages for aarch64 prior to PyTorch 2.4 do not include CUDA support. CUDA support is only included in the PyTorch 2.4.0 wheels for aarch64 using CUDA 12.4.
Warning
- CUDA 11.8 and CUDA 12.1 aarch64 builds do not include CUDA support (as of PyTorch 2.4.0). You must use the CUDA 12.4 repository.
- CUDA-enabled aarch64 wheels are large (over 2GB). Consider creating your venv / conda env in /nobackup to avoid filling your home directory quota.
- As with other PyTorch 2.x builds, you may see a warning if you do not also install numpy into your python environment.
# Create a python venv in /nobackup, replacing your project name and following path as appropriate
python3 -m venv /nobackup/projects/bdXXXXX/pytorch-venv
# Activate the venv, replacing the path as appropriate
source /nobackup/projects/bdXXXXX/pytorch-venv/bin/activate
# Install the latest release using the CUDA 12.4 repository
python3 -m pip install torch --index-url https://download.pytorch.org/whl/cu124
# Ensure that CUDA support is enabled
python3 -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
In subsequent interactive sessions, and when submitting batch jobs which use PyTorch, you will then need to re-source the python venv.
For example, to verify that PyTorch is available and print the version:
# Activate the venv, replacing the path as appropriate
source /nobackup/projects/bdXXXXX/pytorch-venv/bin/activate
# Invoke python
python3 -c "import torch; print(torch.__version__)"
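To check that the wheel you installed actually carries CUDA support (rather than a CPU-only build), torch.version.cuda reports the CUDA toolkit version the wheel was built against. A minimal sketch:

```python
import torch

print("torch version:", torch.__version__)
# None for CPU-only builds; a string such as "12.4" for wheels from the cu124 repository
print("built with CUDA:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # Only meaningful on a GPU node
    print("device name:", torch.cuda.get_device_name(0))
```

Note that torch.cuda.is_available() will report False on a login node even for a CUDA-enabled wheel, so torch.version.cuda is the more reliable check of which build you installed.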
Using NGC PyTorch Containers#
Warning
NVIDIA do not provide ppc64le containers for PyTorch through NGC. This method should only be used for aarch64 partitions.
NVIDIA provide docker containers with CUDA-enabled PyTorch builds for x86_64 and aarch64 architectures through NGC.
The NGC PyTorch containers have included Hopper support since 22.09.
- 22.09 and 22.10 provide a conda-based install of PyTorch.
- 22.11+ provide a pip-based install in the default python environment.
For details of which PyTorch version is provided by each container release, see the NGC PyTorch container release notes.
Apptainer can be used to convert and run docker containers, or to build an apptainer container based on a docker container.
These can be built on the aarch64 nodes in Bede using Rootless Container Builds.
Note
PyTorch containers can consume a large amount of disk space. Consider setting APPTAINER_CACHEDIR to an appropriate location in /nobackup, e.g. export APPTAINER_CACHEDIR=/nobackup/projects/${SLURM_JOB_ACCOUNT}/${USER}/apptainer-cache.
Note
The following apptainer commands should be executed from an aarch64 node only, i.e. on ghlogin, gh or ghtest.
Docker containers can be fetched and converted using apptainer pull, prior to using apptainer exec to execute code within the container.
# Pull and convert the docker container. This may take a while.
apptainer pull docker://nvcr.io/nvidia/pytorch:24.03-py3
# Run a command in the container, i.e. showing the pytorch version
apptainer exec --nv docker://nvcr.io/nvidia/pytorch:24.03-py3 python3 -c "import torch;print(torch.__version__);"
Alternatively, if you require more than just PyTorch within the container, you can create an apptainer definition file. E.g. for a container based on pytorch:24.03-py3 which also installs HuggingFace Transformers 4.37.0, the following definition file could be used:
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:24.03-py3
%post
# Install other python dependencies, e.g. hugging face transformers
python3 -m pip install transformers[torch]==4.37.0
%test
# Print the torch version, if CUDA is enabled and which architectures
python3 -c "import torch;print(torch.__version__); print(torch.cuda.is_available());print(torch.cuda.get_arch_list());"
# Print the pytorch transformers version, demonstrating it is available.
python3 -c "import transformers;print(transformers.__version__);"
Assuming this is named pytorch-transformers.def, a corresponding apptainer image file named pytorch-transformers.sif can then be created via:
apptainer build --nv pytorch-transformers.sif pytorch-transformers.def
Commands within this container can then be executed using apptainer exec. E.g. to see the version of transformers installed within the container:
apptainer exec --nv pytorch-transformers.sif python3 -c "import transformers;print(transformers.__version__);"
Or, as this container definition includes a %test segment, run the test command:
apptainer test --nv pytorch-transformers.sif
Further Information#
For more information on the usage of PyTorch, see the Online Documentation.