
Wanderings in CUDA land - Exploring parallel computing with CUDA#

Some background terminology#

In the following, frequent reference is made to CPU and GPU actions, which can be confusing shorthand. As explained in most introductory tutorials on using and programming GPUs (for instance https://developer.nvidia.com/blog/cuda-refresher-reviewing-the-origins-of-gpu-computing/ ), the GPU, with its own memory and processors, is distinct from the conventional computer arrangement of one or more CPUs, memory attached directly to the CPUs, and connections to storage devices and external networks. In other words, data stored in GPU memory is not normally accessible from the CPU and, conversely, data in CPU memory is not normally accessible from the GPU. Until recently, users were responsible for explicitly copying data to and from the GPU memory. The reader should therefore understand that the term CPU is shorthand for the compute performed by the CPU using data in memory that is addressable (accessible) from that CPU, while GPU refers to the GPU-specific domain. One profound difference between the two domains, which is critical for this note, is that the way parallel computation is expressed and performed is fundamentally different in each.

Finally, in some of the comments of the accompanying codes, reference is often made to “the device”. In this context, the device always refers to the active GPU.
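
As a minimal illustration of the explicit-copy model just described (a sketch only; the array size and names here are invented for illustration):

// Illustrative sketch of the traditional explicit-copy workflow
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
	int N = 1024;
	float *h_x = (float*)malloc(N * sizeof(float)); // host (CPU) memory
	float *d_x;                                     // device (GPU) memory
	for (int i = 0; i < N; i++) h_x[i] = 1.0f;
	cudaMalloc((void **)&d_x, N * sizeof(float));
	cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice); // explicit copy CPU -> GPU
	// ... kernels or library calls operate on d_x here ...
	cudaMemcpy(h_x, d_x, N * sizeof(float), cudaMemcpyDeviceToHost); // explicit copy GPU -> CPU
	cudaFree(d_x);
	free(h_x);
	return EXIT_SUCCESS;
}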

In order to make computations on the GPU more accessible, NVIDIA, via its CUDA products, provides an extensive programming environment and a set of libraries that are invoked from the CPU but execute on the GPU, operating on data already resident in GPU memory. These libraries include FFTs (libcufft.so), operations on sparse matrices (libcusparse.so) and Basic Linear Algebra operations on dense vectors and matrices (libcublas.so). Some of our examples exploit routines in libcublas.

NVIDIA also provides a C++-compatible compiler wrapper, nvcc, that makes compiling CUDA-based codes easier. nvcc performs extensive preprocessing on C/C++ codes with the extension .cu and also hides the linkage to the basic CUDA support libraries.

A getting started example#

The first code is a traditional CPU-GPU use of the single-precision (float) Level 3 BLAS matrix-matrix multiplication, sgemm. The workflow in this program can be broken down as follows:

1. Create or read in matrices on the CPU side.

2. Create companion matrices on the GPU.

3. Create a handle for the CUBLAS context on the GPU.

4. Copy data from CPU to GPU.

5. Perform computations on the GPU.

6. Copy results back to CPU for output and final processing.

7. Free up the storage used on both CPU and GPU side.

The matrices are initialised on the CPU, copied to similarly sized arrays on the GPU, the sgemm call is made on the GPU and then the relevant array is copied back to the CPU to print out.

// nvcc 036sgemm.c -lcublas
// From "Matrix computations on the GPU CUBLAS, CUSOLVER and MAGMA by example. Version 2017" 
// by Andrzej Chrzeszczyk and Jacob Anders
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"
#define IDX2C(i,j,ld) (((j)*( ld ))+( i ))
#define m 6 // number of rows of a and c
#define n 4 // number of columns of b and c
#define k 5 // number of columns of a / rows of b
int main(void) {
	cudaError_t cudaStat; // cudaMalloc status
	cublasStatus_t stat; // CUBLAS functions status
	cublasHandle_t handle; // CUBLAS context
	int i, j; // i-row index ,j- column index
	float* a; // mxk matrix a on the host
	float* b; // kxn matrix b on the host
	float* c; // mxn matrix c on the host
	a = (float*)malloc(m*k * sizeof(float)); // host memory for a
	b = (float*)malloc(k*n * sizeof(float)); // host memory for b
	c = (float*)malloc(m*n * sizeof(float)); // host memory for c
	// define an mxk matrix a column by column
	int ind = 11; // a:
	for (j = 0;j < k;j++) {                           // 11 ,17 ,23 ,29 ,35
		for (i = 0;i < m;i++) {                   // 12 ,18 ,24 ,30 ,36
			a[IDX2C(i, j, m)] = (float)ind++; // 13 ,19 ,25 ,31 ,37
		}                                         // 14 ,20 ,26 ,32 ,38
	}                                                 // 15 ,21 ,27 ,33 ,39
	                                                  // 16 ,22 ,28 ,34 ,40
	// print a row by row
	printf("a:\n");
	for (i = 0;i < m;i++) {
		for (j = 0;j < k;j++) {
			printf(" %5.0f", a[IDX2C(i, j, m)]);
		}
		printf("\n");
	}
	// define a kxn matrix b column by column
	ind = 11; // b:
	for (j = 0;j < n;j++) {                           // 11 ,16 ,21 ,26
		for (i = 0;i < k;i++) { 		  // 12 ,17 ,22 ,27
			b[IDX2C(i, j, k)] = (float)ind++; // 13 ,18 ,23 ,28
		} 					  // 14 ,19 ,24 ,29
	}  						  // 15 ,20 ,25 ,30
	// print b row by row
	printf("b:\n");
	for (i = 0;i < k;i++) {
		for (j = 0;j < n;j++) {
			printf(" %5.0f", b[IDX2C(i, j, k)]);
		}
		printf("\n");
	}
	// define an mxn matrix c column by column
	ind = 11; // c:
	for (j = 0;j < n;j++) { 			  // 11 ,17 ,23 ,29
		for (i = 0;i < m;i++) { 		  // 12 ,18 ,24 ,30
			c[IDX2C(i, j, m)] = (float)ind++; // 13 ,19 ,25 ,31
		} 					  // 14 ,20 ,26 ,32
	} 					          // 15 ,21 ,27 ,33
	 						  // 16 ,22 ,28 ,34
	// print c row by row
	printf("c:\n");
	for (i = 0;i < m;i++) {
		for (j = 0;j < n;j++) {
			printf(" %5.0f", c[IDX2C(i, j, m)]);
		}
		printf("\n");
	}
	// on the device
	float* d_a; // d_a - a on the device
	float* d_b; // d_b - b on the device
	float* d_c; // d_c - c on the device
	cudaStat = cudaMalloc((void **)& d_a, m*k * sizeof(*a)); // device
	// memory alloc for a
	cudaStat = cudaMalloc((void **)& d_b, k*n * sizeof(*b)); // device
	// memory alloc for b
	cudaStat = cudaMalloc((void **)& d_c, m*n * sizeof(*c)); // device
	// memory alloc for c
	stat = cublasCreate(&handle); // initialize CUBLAS context
	// copy matrices from the host to the device
	stat = cublasSetMatrix(m, k, sizeof(*a), a, m, d_a, m); //a -> d_a
	stat = cublasSetMatrix(k, n, sizeof(*b), b, k, d_b, k); //b -> d_b
	stat = cublasSetMatrix(m, n, sizeof(*c), c, m, d_c, m); //c -> d_c
	float al = 1.0f; // al =1
	float bet = 1.0f; // bet =1
	// matrix - matrix multiplication : d_c = al*d_a *d_b + bet *d_c
	// d_a -mxk matrix , d_b -kxn matrix , d_c -mxn matrix ;
	// al ,bet -scalars
	stat = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &al, d_a,
		m, d_b, k, &bet, d_c, m);
	stat = cublasGetMatrix(m, n, sizeof(*c), d_c, m, c, m); // cp d_c ->c
	printf("c after Sgemm :\n");
	for (i = 0;i < m;i++) {
		for (j = 0;j < n;j++) {
			printf(" %7.0f", c[IDX2C(i, j, m)]); // print c after Sgemm
		}
		printf("\n");
	}
	cudaFree(d_a); // free device memory
	cudaFree(d_b); // free device memory
	cudaFree(d_c); // free device memory
	cublasDestroy(handle); // destroy CUBLAS context
	free(a); // free host memory
	free(b); // free host memory
	free(c); // free host memory
	return EXIT_SUCCESS;
}
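
Note that the example above stores the return codes in cudaStat and stat but never inspects them. In a longer code it is worth checking them; a minimal sketch, reusing the same variable names, might be:

	cudaStat = cudaMalloc((void **)& d_a, m*k * sizeof(*a));
	if (cudaStat != cudaSuccess) {
		printf("cudaMalloc for d_a failed: %s\n", cudaGetErrorString(cudaStat));
		return EXIT_FAILURE;
	}
	stat = cublasCreate(&handle);
	if (stat != CUBLAS_STATUS_SUCCESS) {
		printf("CUBLAS initialisation failed (status %d)\n", (int)stat);
		return EXIT_FAILURE;
	}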

The next code is nearly identical, but this time unified memory is used, so that the data is not copied explicitly from the CPU to the GPU.

Unified memory first appeared in 2014 but, spurred by the Summit and Sierra systems from IBM from 2018 onwards, it has undergone extensive revision behind the scenes to make it more efficient on different architectures. A good overview of unified memory can be found in the article https://developer.nvidia.com/blog/unified-memory-cuda-beginners/ written by Mark Harris in 2017. The presentation https://on-demand.gputechconf.com/gtc/2018/presentation/s8430-everything-you-need-to-know-about-unified-memory.pdf is also good; it goes into detail on how unified memory sits atop a virtual memory space shared between processors, with page accesses determining whether pages are physically located in CPU or GPU memory (or on different GPUs). Mark also has an earlier blog post about basic CUDA C/C++ programming that is worth reading at https://developer.nvidia.com/blog/even-easier-introduction-cuda/.

Now, rather than creating shadow arrays on the GPU side that mirror those on the CPU, with explicit copying between them, arrays are allocated once via calls such as cudaMallocManaged. Such arrays can be accessed easily from both the CPU and GPU, but there can be a significant performance hit if the access patterns between the two domains are not managed intelligently.

// nvcc 036sgemm.cu -lcublas
#include <stdio.h>
#include <stdlib.h>
#include "cublas_v2.h"
#define IDX2C(i,j,ld) (((j)*( ld ))+( i ))
#define m 6 // number of rows of a and c
#define n 4 // number of columns of b and c
#define k 5 // number of columns of a / rows of b
int main(void) {
	cublasHandle_t handle; // CUBLAS context
	int i, j; // i-row index ,j- column index
	float* a; // mxk matrix
	float* b; // kxn matrix
	float* c; // mxn matrix
	// unified memory for a,b,c
	cudaMallocManaged(&a, m*k * sizeof(float)); 
	cudaMallocManaged(&b, k*n * sizeof(float));
	cudaMallocManaged(&c, m*n * sizeof(float));
	// define an mxk matrix a column by column
	int ind = 11; // a:
	for (j = 0;j < k;j++) { 		  // 11 ,17 ,23 ,29 ,35
		for (i = 0;i < m;i++) { 	  // 12 ,18 ,24 ,30 ,36
		a[IDX2C(i, j, m)] = (float)ind++; // 13 ,19 ,25 ,31 ,37
		} 				  // 14 ,20 ,26 ,32 ,38
	} 					  // 15 ,21 ,27 ,33 ,39
						  // 16 ,22 ,28 ,34 ,40
	// print a row by row
	printf("a:\n");
	for (i = 0;i < m;i++) {
		for (j = 0;j < k;j++) {
			printf(" %5.0f", a[IDX2C(i, j, m)]);
		}
		printf("\n");
	}
	// define a kxn matrix b column by column
	ind = 11; // b:
	for (j = 0;j < n;j++) {  			  // 11 ,16 ,21 ,26
		for (i = 0;i < k;i++) { 		  // 12 ,17 ,22 ,27
			b[IDX2C(i, j, k)] = (float)ind++; // 13 ,18 ,23 ,28
		} 				   	  // 14 ,19 ,24 ,29
	} 						  // 15 ,20 ,25 ,30
	// print b row by row
	printf("b:\n");
	for (i = 0;i < k;i++) {
		for (j = 0;j < n;j++) {
			printf(" %5.0f", b[IDX2C(i, j, k)]);
		}
		printf("\n");
	}
	// define an mxn matrix c column by column
	ind = 11; // c:
	for (j = 0;j < n;j++) {  			  // 11 ,17 ,23 ,29
		for (i = 0;i < m;i++) { 		  // 12 ,18 ,24 ,30
			c[IDX2C(i, j, m)] = (float)ind++; // 13 ,19 ,25 ,31
		}  					  // 14 ,20 ,26 ,32
	} 						  // 15 ,21 ,27 ,33
							  // 16 ,22 ,28 ,34
	// print c row by row
	printf("c:\n");
	for (i = 0;i < m;i++) {
		for (j = 0;j < n;j++) {
			printf(" %5.0f", c[IDX2C(i, j, m)]);
		}
		printf("\n");
	}
	cublasCreate(&handle); // initialize CUBLAS context
	float al = 1.0f; // al =1
	float bet = 1.0f; // bet =1
	// matrix - matrix multiplication : c = al*a*b + bet *c
	// a -mxk matrix , b -kxn matrix , c -mxn matrix ;
	// al ,bet -scalars
	cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &al, a, m, b, k,
		&bet, c, m);
	cudaDeviceSynchronize();
	printf("c after Sgemm :\n");
	for (i = 0;i < m;i++) {
		for (j = 0;j < n;j++) {
			printf(" %7.0f", c[IDX2C(i, j, m)]); // print c after Sgemm
		}
		printf("\n");
	}
	cudaFree(a); // free memory
	cudaFree(b); // free memory
	cudaFree(c); // free memory
	cublasDestroy(handle); // destroy CUBLAS context
	return EXIT_SUCCESS;
}

Looking more carefully at unified memory#

If we are going to generate large matrices for our tests, it is sensible to do the generation / initialisation on the GPU. This requires a “slight” diversion to learn about kernel functions and parallelism within the GPU.

This first example is modified from Mark Harris’s code in https://developer.nvidia.com/blog/even-easier-introduction-cuda/ to include initialisation on the GPU. The code does a 1-D partitioning of the work among thread groups.

Note that the kernel functions init and add run on the GPU. CUDA GPUs provide many threads, each of which can perform a computation independently. In addition, there is a higher level of parallelism, across blocks of threads, that boosts performance further.

Quoting from Harris,

“CUDA GPUs have many parallel processors grouped into Streaming Multiprocessors, or SMs. Each SM can run multiple concurrent thread blocks. As an example, a Tesla P100 GPU based on the Pascal GPU Architecture has 56 SMs, each capable of supporting up to 2048 active threads.”

Without direction, all threads execute precisely the same code on the same data, which is effectively serial performance. Parallel performance requires tying particular pieces of work to specific thread ids so that the work is spread concurrently across all threads.

Again, from Harris: “CUDA C++ provides keywords that let kernels get the indices of the running threads. Specifically, threadIdx.x contains the index of the current thread within its block, and blockDim.x contains the number of threads in the block.”

    // From  https://developer.nvidia.com/blog/unified-memory-cuda-beginners/  by Mark Harris, 2017  

    #include <iostream>
    #include <math.h>
    #include <stdio.h> 
    // CUDA kernel to add elements of two 1D arrays (i.e. vectors)
    __global__
    void add(int n, float *x, float *y)
    {
      int index = blockIdx.x * blockDim.x + threadIdx.x;
      int stride = blockDim.x * gridDim.x;
      for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
    }
    __global__ void init(int n, float *x, float *y) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride) {
      x[i] = 1.0f;
      y[i] = 2.0f;
      }
    }
    int main(void)
    {
      int N = 1<<20;
      float *x, *y;
      int device = -1;

      printf(" N %d\n",N);
      // Allocate Unified Memory -- accessible from CPU or GPU
      cudaMallocManaged(&x, N*sizeof(float));
      cudaMallocManaged(&y, N*sizeof(float));
      cudaGetDevice(&device);
      cudaMemPrefetchAsync(x, N*sizeof(float), device, NULL);
      cudaMemPrefetchAsync(y, N*sizeof(float), device, NULL);    

     
      // Launch kernel on 1M elements on the GPU
      int blockSize = 256;
      int numBlocks = (N + blockSize - 1) / blockSize;
      init<<<numBlocks, blockSize>>>(N, x, y);
      add<<<numBlocks, blockSize>>>(N, x, y);
     
      // Wait for GPU to finish before accessing on host
      cudaDeviceSynchronize();
     
      // Check for errors (all values should be 3.0f)
      float maxError = 0.0f;
      for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i]-3.0f));
      std::cout << "Max error: " << maxError << std::endl;
     
      // Free memory
      cudaFree(x);
      cudaFree(y);
     
      return 0;
    }

It is worth examining a few parts of this code in more detail.

Notice that the invocation of the kernel functions involves passing an execution configuration specifying how many blocks of threads are used (referenced inside a kernel function as gridDim) and the size of each block of threads (referenced inside a kernel function as blockDim). Thread block sizes should be a multiple of 32 threads (the warp size), up to a maximum of 1024 threads per block.

int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
init<<<numBlocks, blockSize>>>(N, x, y);
add<<<numBlocks, blockSize>>>(N, x, y);

The specification of the gridDim (numBlocks) and blockDim (blockSize) for the kernel function is passed via metaparameters enclosed between <<< and >>>. In this example the int values are coerced into the underlying dim3 type, which can take three values, one for each dimension of the gridDim and blockDim. If only one dimension is specified, the other two default to 1. The need to preprocess such calls is part of the reason that CUDA C++ routines normally have the extension .cu; this alerts the CUDA nvcc compiler to the need for preprocessing.
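
For illustration, the launch above could equally be written with explicit dim3 variables (a small sketch; the behaviour is identical):

dim3 grid(numBlocks);   // gridDim  = (numBlocks, 1, 1) - unspecified dimensions default to 1
dim3 block(blockSize);  // blockDim = (blockSize, 1, 1)
init<<<grid, block>>>(N, x, y);
add<<<grid, block>>>(N, x, y);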

In the kernel functions, there are other predefined identifiers to assist in specifying unique thread ids.

int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < n; i += stride)
  y[i] = x[i] + y[i];

Provided the value of gridDim.x is sufficiently large, that is blockDim.x * gridDim.x >= n, each thread executes the loop body at most once, with i == index. The loop can therefore be considered defensive programming that ensures the full range of the array is covered exactly once by the pool of thread blocks. To confirm, in this case:

n = 1048576
blockDim = 256
gridDim = n / blockDim = 4096
stride = n

There are alternative ways to protect the computations from trying to access past the end of arrays. A common alternative, a simple bounds check, is sketched below for the 1-D case and is used in the 2-D example that follows. The critical thing to note is that by defining the number of grid blocks as a function of the array dimension and thread block size, it is possible to treat the array computation as being completely parallel in its execution. The reality is that some thread scheduling will be needed, but this is performed seamlessly by the GPU, in a fashion that is analogous to loop partitioning with OpenMP.
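
For the 1-D addition kernel above, that bounds-check alternative would look something like this (a sketch; the kernel name add_bounded is invented here, and each element is handled by exactly one thread):

// One element per thread, protected by a bounds check instead of a strided loop
__global__ void add_bounded(int n, float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)              // threads whose index falls past the end of the array do nothing
    y[i] = x[i] + y[i];
}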

One important comment about kernel functions: kernel launches are asynchronous with respect to the CPU. It is therefore essential to impose a synchronisation between the GPU and CPU, via the call cudaDeviceSynchronize(), before results are accessed from the CPU side.

This example also exploits another feature of unified memory. By default, entities are created in CPU memory and migrated on demand to GPU memory. This can be inefficient if we know in advance that the entities are going to be initialised and manipulated on the GPU. The call cudaMemPrefetchAsync can be used to migrate a data area to a particular GPU before it is first used there.
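
The destination of a prefetch can also be the CPU: passing the special value cudaCpuDeviceId migrates the pages back to host memory, for example before the host-side checking loop in the example above reads y (a short sketch, not used in the codes here):

cudaMemPrefetchAsync(y, N*sizeof(float), cudaCpuDeviceId, NULL); // prefetch y back to the CPU
cudaDeviceSynchronize();  // ensure preceding kernels and the prefetch have completed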

With matrices, it is more natural to do a 2-D partitioning. Things are slightly more complicated, but the basic idea scales fairly well. Note that rather than strided access across both problem dimensions, this code just checks that the array indices are in range.

// Based heavily on https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
#include <stdio.h>
#include <math.h>
const int N = 1024; const int blocksize = 16;

__global__ void add_matrix( float *a, float *b, float *c, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x; // blockIdx, blockDim and threadIdx are predefined
int j = blockIdx.y * blockDim.y + threadIdx.y; // variables - initialised from meta-arguments
int index = i + j*N;
if ( i < N && j < N ) // Keep indices in range 
  c[index] = a[index] + b[index]; 
}

int main(void){
const int size = N*N*sizeof(float);
float *a ; float *b; float *c ;
float maxError = 0.0f;
cudaMallocManaged( (void**)&a, size );
cudaMallocManaged( (void**)&b, size ); 
cudaMallocManaged( (void**)&c, size );
for ( int i = 0; i < N*N; ++i ) {
  a[i] = 1.0f; b[i] = 3.5f; }

dim3 dimBlock( blocksize, blocksize );     // dim3 structure to deal with 1D, 2D or 3D thread collections.
dim3 dimGrid( N/dimBlock.x, N/dimBlock.y); // dimBlock.x - first dimension, dimBlock.y - second dimension
					   // dimBlock.z for third dimension (not used)
add_matrix<<<dimGrid, dimBlock>>>( a, b, c, N);    // Note meta arguments that pass information on 
						   // Number of thread groups (Grid) and number of
						   // threads in each group (Block).

// Wait for GPU to finish before accessing on host - major source of errors
      cudaDeviceSynchronize();
					 	   

for (int j = 0; j < N; j++){		
 for (int i = 0; i < N;i++) {
     maxError = fmax(maxError, fabs(c[i+j*N]-4.5f));
 }
}
printf("Max error: %.16f\n", maxError );

cudaFree( a ); cudaFree( b ); cudaFree( c ); // CLEAN UP, RETURN
return 0;
}

Putting the pieces together#

We are now at the point where we can do a larger test. Matrices a and b are chosen so that, when square, their product is the identity matrix (ones on the diagonal and zeroes elsewhere). Matrix c is initialised as an identity matrix, so the final result is a matrix with twos down its diagonal. I cannot find a reference to the specific formulas used for these matrices, but mathematically we have:

\[a_{ij} = b_{ij} = \sqrt{\frac{2}{m+1}}\,\sin\!\left(\frac{(i+1)(j+1)\pi}{m+1}\right), \quad i=0,\ldots,m-1,\; j=0,\ldots,m-1\]
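
For completeness, the property being relied on here (assuming square matrices, m = k = n, as in the test below) is that this matrix is symmetric and orthogonal, so that

\[\sum_{l=0}^{m-1} a_{il}\,b_{lj} = \frac{2}{m+1}\sum_{l=0}^{m-1} \sin\!\left(\frac{(i+1)(l+1)\pi}{m+1}\right)\sin\!\left(\frac{(l+1)(j+1)\pi}{m+1}\right) = \delta_{ij}\]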

Notice that in the kernel function initmatrix every thread redundantly recomputes the constants pi, rkplus1 and rk, but the threads would have needed to wait for this computation anyway:

pi = two*asin(one);
rkplus1 = one/(float(m) + one);
rk = sqrt(two*rkplus1);
if ( i < m && j < n) {
  x[index] = rk*__sinf((float)(i+1)*(float)(j+1)*pi*rkplus1);
}

The full code is below:

// nvcc 036sgemm.cu -lcublas
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"


#define printlim 10
#define BILLION 1000000000L
#define IDX2C(i,j,ld) (((j)*( ld ))+( i ))

__global__  void initmatrix(int m, int n, float *x) {
	float rk,rkplus1,pi;
	float one=1.0f;
	float two=2.0f;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
        long index = i + j*m;
	pi = two*asin(one);
        rkplus1 = one/(float(m) + one);  
	rk = sqrt(two*rkplus1);
        if ( i < m && j < n) {
	  x[index] = rk*__sinf((float)(i+1)*(float)(j+1)*pi*rkplus1);
	   
        }
     }

__global__  void initident(int m, int n, float *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
        long index = i + j*m;
        if ( i < m && j < n) {
	  x[index] = 0.e0;
	  if ( i == j ) x[index] = 1.e0;
        }
     }


int main(void) {
	struct timespec start , stop ; // variables for timing
	double accum ; // elapsed time variable
	cublasHandle_t handle; // CUBLAS context	
	int m=20000; // #defines in cuda call get the literal swapped in - invalid code
	int k=20000;
	int n=20000;
	int i,j; // i-row index, j- column index
	float *a; // mxk matrix
	float *b; // kxn matrix
	float *c; // mxn matrix
        dim3 blockSize;
        dim3 numBlocks;
  	
	// unified memory for a,b,c
	cudaMallocManaged(&a, m*k * sizeof(float));
	cudaMallocManaged(&b, k*n * sizeof(float));
	cudaMallocManaged(&c, m*n * sizeof(float));
	
	clock_gettime ( CLOCK_REALTIME ,&start ); // timer start

        blockSize.x = 32; // Have a 32 by 32 block - 1024 threads - max allowed
	blockSize.y = 32;

        numBlocks.x = (m + blockSize.x - 1) / blockSize.x;
        numBlocks.y = (k + blockSize.y - 1) / blockSize.y;	
	printf("Numblks x %d Blksize x %d\n",numBlocks.x, blockSize.x);
        // a - orthogonal symmetric matrix
        initmatrix<<<numBlocks,blockSize>>>(m,k,a);

        numBlocks.x = (k + blockSize.x - 1) / blockSize.x;
        numBlocks.y = (n + blockSize.y - 1) / blockSize.y;
        // b - orthogonal symmetric matrix
        initmatrix<<<numBlocks,blockSize>>>(k,n,b);

        numBlocks.x = (m + blockSize.x - 1) / blockSize.x;
        numBlocks.y = (n + blockSize.y - 1) / blockSize.y;
	// c - ones along diagonal; zero elsewhere
        initident<<<numBlocks,blockSize>>>(m,n,c);
        cudaDeviceSynchronize(); // Need matrices to be defined before continue
	clock_gettime ( CLOCK_REALTIME ,&stop ); // timer stop
 	accum =( stop.tv_sec - start.tv_sec )+ // elapsed time for initialisation
	       ( stop.tv_nsec - start.tv_nsec )/(double)BILLION ;	
 	printf (" Initialise : %lf sec .\n",accum ); // print el. time
	// print a row by row
	printf("a:\n");
	
	cublasCreate(&handle); // initialize CUBLAS context
	float al = 1.0f; // al =1
	float bet = 1.0f; // bet =1
	// matrix - matrix multiplication : c = al*a*b + bet *c
	// a -mxk matrix , b -kxn matrix , c -mxn matrix ;
	// al ,bet -scalars
	clock_gettime ( CLOCK_REALTIME ,&start ); // timer start
	cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &al, a, m, b, k,
		&bet, c, m);
	cudaDeviceSynchronize();
	clock_gettime ( CLOCK_REALTIME ,&stop );  // timer stop
 	accum =( stop.tv_sec - start.tv_sec )+    // elapsed time for MN(2K+3) flops
	       ( stop.tv_nsec - start.tv_nsec )/(double)BILLION ;
        printf (" sgemm : %lf sec .\n",accum ); // print el. time
        printf (" Gflops : %lf \n",(double)m*(double)n*(2.e0*k+3.0)*1.e-9/accum); // print Gflops
	printf("c after Sgemm :\n");
	for (i = 0;i < printlim;i++) {
		for (j = 0;j < printlim;j++) {
			printf(" %7.4f", c[IDX2C(i, j, m)]); // print c after Sgemm
		}
		printf("\n");
	}

	cudaFree(a); // free memory
	cudaFree(b); // free memory
	cudaFree(c); // free memory
	cublasDestroy(handle); // destroy CUBLAS context
	return EXIT_SUCCESS;
}

The performance on different systems is interesting and it also demonstrates the ease of using nvprof (only an abbreviated output is shown).

On a system with a Quadro P4000 GPU and Skylake 6138 processor (2.0 GHz) - Driver Version: 460.32.03 CUDA Version: 11.2

Numblks x 625 Blksize x 32
Initialise : 0.438705 sec .
sgemm : 3.547108 sec .
Gflops : 4511.054646



           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:   88.99%  3.54694s         1  3.54694s  3.54694s  3.54694s  maxwell_sgemm_128x128_nn
                   7.99%  318.64ms         2  159.32ms  125.55ms  193.09ms  initmatrix(int, int, float*)
                   3.01%  119.98ms         1  119.98ms  119.98ms  119.98ms  initident(int, int, float*)
                   0.00%     960ns         1     960ns     960ns     960ns  [CUDA memcpy HtoD]

On a system with a v100 and Cascade Lake 5218 (2.30GHz) - Driver Version: 460.32.03 CUDA Version: 11.2

Numblks x 625 Blksize x 32
Initialise : 0.254990 sec .
sgemm : 1.299593 sec .
Gflops : 12312.468032

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:   82.77%  1.22389s         1  1.22389s  1.22389s  1.22389s  volta_sgemm_128x64_nn
                  11.72%  173.30ms         2  86.650ms  75.697ms  97.603ms  initmatrix(int, int, float*)
                   5.51%  81.526ms         1  81.526ms  81.526ms  81.526ms  initident(int, int, float*)
                   0.00%  1.8240us         1  1.8240us  1.8240us  1.8240us  [CUDA memcpy HtoD]

On login2 of Bede - AC922 - v100 with Power 9 (3.8GHz) - Driver Version: 440.95.01 CUDA Version: 10.2

Numblks x 625 Blksize x 32
Initialise : 0.420643 sec .
sgemm : 1.154472 sec .
Gflops : 13860.183378

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:   73.29%  1.15420s         1  1.15420s  1.15420s  1.15420s  volta_sgemm_128x32_sliced1x4_nn
                  18.77%  295.52ms         2  147.76ms  147.09ms  148.43ms  initmatrix(int, int, float*)
                   7.94%  125.04ms         1  125.04ms  125.04ms  125.04ms  initident(int, int, float*)
                   0.00%  1.6320us         1  1.6320us  1.6320us  1.6320us  [CUDA memcpy HtoD]

The slower initialise times on the Bede node are interesting. The reasons for this are not clear, but steps can be taken to improve the performance on all of the systems.

By default, Unified Memory allocates pages in CPU memory and migrates them over to the GPU on demand. It is possible, using the cudaMemPrefetchAsync call, to place the pages on a particular GPU / device in advance.

The relevant modification to our earlier sgemm code is:

int device = -1; // For GPU number
// unified memory for a,b,c
cudaMallocManaged(&a, m*k * sizeof(float));
cudaMallocManaged(&b, k*n * sizeof(float));
cudaMallocManaged(&c, m*n * sizeof(float));
cudaGetDevice(&device);
cudaMemPrefetchAsync(a, m*k*sizeof(float), device, NULL);
cudaMemPrefetchAsync(b, k*n*sizeof(float), device, NULL);
cudaMemPrefetchAsync(c, m*n*sizeof(float), device, NULL);

Running the modified version of the code on our test platforms shows the change:

On a system with a Quadro P4000 GPU and Skylake 6138 processor (2.0 GHz) - Driver Version: 460.32.03 CUDA Version: 11.2

Numblks x 625 Blksize x 32
Initialise : 0.049530 sec .
sgemm : 3.599292 sec .
Gflops : 4445.652504

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:   98.65%  3.59913s         1  3.59913s  3.59913s  3.59913s  maxwell_sgemm_128x128_nn
                   1.00%  36.340ms         2  18.170ms  18.168ms  18.172ms  initmatrix(int, int, float*)
                   0.36%  13.072ms         1  13.072ms  13.072ms  13.072ms  initident(int, int, float*)

On a system with a v100 and Cascade Lake 5218 (2.30GHz) - Driver Version: 460.32.03 CUDA Version: 11.2

Numblks x 625 Blksize x 32
Initialise : 0.006248 sec .
sgemm : 1.582949 sec .
Gflops : 10108.474267

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:   99.51%  1.23181s         1  1.23181s  1.23181s  1.23181s  volta_sgemm_128x64_nn
                   0.35%  4.3074ms         2  2.1537ms  2.1522ms  2.1552ms  initmatrix(int, int, float*)
                   0.14%  1.7876ms         1  1.7876ms  1.7876ms  1.7876ms  initident(int, int, float*)

On login1 of Bede - AC922 - v100 with Power 9 (3.8GHz) - Driver Version: 440.95.01 CUDA Version: 10.2

Numblks x 625 Blksize x 32
Initialise : 0.073072 sec .
sgemm : 1.130069 sec .
Gflops : 14159.493073

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:   99.49%  1.12983s         1  1.12983s  1.12983s  1.12983s  volta_sgemm_128x32_sliced1x4_nn
                   0.36%  4.0385ms         2  2.0193ms  2.0175ms  2.0210ms  initmatrix(int, int, float*)
                   0.16%  1.7935ms         1  1.7935ms  1.7935ms  1.7935ms  initident(int, int, float*)

The initialisation times are dramatically reduced; indeed they are sufficiently low that we are running into the timing resolution of the clock_gettime function. It is best to look at the times recorded directly by nvprof, where, as expected, there is little difference between the results on the two v100 systems.
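
As an aside, the GPU work can also be timed with CUDA events rather than the host clock; a sketch around the same cublasSgemm call (not part of the original code) is:

	cudaEvent_t t0, t1;
	cudaEventCreate(&t0);
	cudaEventCreate(&t1);
	cudaEventRecord(t0, 0);                   // record start event in the default stream
	cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &al, a, m, b, k,
		&bet, c, m);
	cudaEventRecord(t1, 0);                   // record stop event after the sgemm
	cudaEventSynchronize(t1);                 // wait until the stop event has completed
	float ms = 0.0f;
	cudaEventElapsedTime(&ms, t0, t1);        // elapsed GPU time in milliseconds
	printf(" sgemm (CUDA events) : %f ms\n", ms);
	cudaEventDestroy(t0);
	cudaEventDestroy(t1);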

Since unified memory “works” from the GPU side of things and we are interested in the performance difference that Bede’s fast CPU-GPU connections might provide, the above experiments were repeated with the matrices a, b, and c created on the CPU and then migrated across on demand to the GPU.
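
One way to do this (a sketch of the idea rather than the exact code used, and assuming m = k = n as in the tests) is to replace the initialisation kernels with host loops over the managed arrays, so that the pages are first touched in CPU memory:

	// No cudaMemPrefetchAsync calls: pages are created on the CPU and migrate to the GPU on demand
	float pi = 2.0f*asinf(1.0f);
	float rkplus1 = 1.0f/((float)m + 1.0f);
	float rk = sqrtf(2.0f*rkplus1);
	for (int j = 0; j < k; j++)                // a (and, identically, b)
		for (int i = 0; i < m; i++)
			a[IDX2C(i, j, m)] = rk*sinf((float)(i+1)*(float)(j+1)*pi*rkplus1);
	for (int j = 0; j < n; j++)                // c - identity matrix
		for (int i = 0; i < m; i++)
			c[IDX2C(i, j, m)] = (i == j) ? 1.0f : 0.0f;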

The results are interesting and do appear to show that there are performance advantages on Bede.

On a system with a Quadro P4000 GPU and Skylake 6138 processor (2.0 GHz) - Driver Version: 460.32.03 CUDA Version: 11.2

Initialise : 25.894818 sec .

sgemm : 4.593165 sec .
Gflops : 3483.698256

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:  100.00%  4.59301s         1  4.59301s  4.59301s  4.59301s  maxwell_sgemm_128x128_nn
...
Device "Quadro P4000 (0)"
  Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
  81282  57.669KB  4.0000KB  0.9961MB  4.470348GB  485.5766ms  Host To Device
     14  73.143KB  4.0000KB  476.00KB  1.000000MB  91.07200us  Device To Host
   6262         -         -         -           -   1.632346s  Gpu page fault groups
Total CPU Page faults: 13739

On a system with a v100 and Cascade Lake 5218 (2.30GHz) - Driver Version: 460.32.03 CUDA Version: 11.2

Initialise : 23.655406 sec .
sgemm : 2.265335 sec .
Gflops : 7063.503525

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:  100.00%  2.26486s         1  2.26486s  2.26486s  2.26486s  volta_sgemm_128x64_nn
...
Device "Tesla V100-PCIE-16GB (0)"
  Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
  74761  56.784KB  4.0000KB  0.9961MB  4.048588GB  565.0188ms  Host To Device
     12  80.000KB  4.0000KB  476.00KB  960.0000KB  92.25500us  Device To Host
   3272         -         -         -           -   1.172802s  Gpu page fault groups
Total CPU Page faults: 13738

On login1 of Bede - AC922 - v100 with Power 9 (3.8GHz) - Driver Version: 440.95.01 CUDA Version: 10.2

Initialise : 18.209140 sec .
sgemm : 1.926488 sec .
Gflops : 8305.890756

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:  100.00%  1.92625s         1  1.92625s  1.92625s  1.92625s  volta_sgemm_128x32_sliced1x4_nn
...
Device "Tesla V100-SXM2-32GB (0)"
  Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
  39141  119.52KB  64.000KB  960.00KB  4.461243GB  220.4134ms  Host To Device
      7  137.14KB  64.000KB  448.00KB  960.0000KB  39.52000us  Device To Host
   2085         -         -         -           -  877.0859ms  Gpu page fault groups
Total CPU Page faults: 13739

The sgemm performance is impacted by the time required to fetch the data, but the interesting numbers are the times spent in paging from Host to Device:

Device "Quadro P4000 (0)"
  Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
  81282  57.669KB  4.0000KB  0.9961MB  4.470348GB  485.5766ms  Host To Device

Device "Tesla V100-PCIE-16GB (0)"
  Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
  74761  56.784KB  4.0000KB  0.9961MB  4.048588GB  565.0188ms  Host To Device

Device "Tesla V100-SXM2-32GB (0)"
  Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
  39141  119.52KB  64.000KB  960.00KB  4.461243GB  220.4134ms  Host To Device

We can see that the time taken for moving the matrix data from the CPU to the GPU is more than halved on the Bede node. This does suggest there is some validity in the assumption that the faster sub-systems are exploited seamlessly.

Future Work#

More carefully designed experiments with Unified Memory are required. Unified Memory automatically exploits the faster coupling between the CPU and GPU on the AC922 compared with GPU systems based on x86_64 processors. Furthermore, Unified Memory can be oversubscribed, and it can be exploited both between the CPU and GPU and in multi-GPU programming within a node. Unified Memory can be exploited easily in TensorFlow; the PyTorch support appears not to be as advanced.

Some interesting results on using Unified Memory with the RAPIDS software layer for data analytics and PyTorch for deep learning can be found in the 30-minute video https://developer.nvidia.com/gtc/2019/video/s9726/video (registration and login are required).

Many important questions remain, such as the extent to which the IBM Large Memory Support API makes it easier to exploit the CPU memory on the GPU, but as a first step, Unified Memory appears to offer users a quick way to exploit Bede’s best features.