ScaLAPACK is the distributed version of LAPACK. The interface of most functions is almost similar. However, not much documentation and example code is available for scalapack in C++, which is why I’m writing this blog post to document my learnings. Hopefully this will be useful for others too. The full source code can be found in this repo. This program calls fortran routines from CPP and therefore all array storage is column major.

This post is part of a larger post where I’ve implemented and benchmarked synchronous and asynchronous block LU deocomposition. That post can be found here. This intel resource is also helpful for this purpose.

Table of Contents

Scalapack protips

There are certain terminologies that are pretty widely used in scalapack. They are as follows:

  • Scalapack docs assume that a matrix of K rows or columns is distributed over a process grid of dimensions p x q.
  • LOCr :: LOCr(K) denotes the number of elements of K that a process would receive if K were distributed over the p processes of its process column.
  • LOCc :: LOCc(K) denotes the number of elements of K that a process would receive if K were distributed over the q processes of its process row.
  • The values of LOCc and LOCr can be determined using a call to the numroc function.
  • IMPORTANT :: None of these functions have C interfaces the way there are for LAPACK via LAPACKE. Therefore, you must take care to pass all variables by address, not by value and store all your data in FORTRAN-style, i.e. column-major format not row-major.

Getting process block size with numroc

The numroc function is useful in almost every scalapack function. It computes the number of rows and columns of a distributed matrix owned by the process (the return value). Here’s an explanation alongwith the prototype:

int numroc_(
    const int *n, // (global) the number of rows/cols in dist matrix
    const int *nb, // (global) block size. size of blocks the distributed matrix is split into.
    const int *iproc, // (local input) coord of the process whose local array row is to be determined.
    const int *srcproc, // (global input) coord of the process that has the first row/col of distributed matrix.
    const int *nprocs // (global input) total no. of processes over which the matrix is distributed.

Usage of numroc

Pass the row or column number of the current matrix block in order to obtain the total number of rows/columns of the full matrix that will be contained in the process. For example, say you have a matrix of size 16384 that you want to split into blocks of 512 and distribute them over 9 processes. This will lead to an uneven splitting and will therefore require careful calculation of the number of elements.

The correct way to obtain the total elements that exist on each process would be as follows:

int N = 16384, NB = 512;
int rsrc = 0, csrc = 0;
int nprow = 3, npcol = 3;

int np = numroc_(&N, &NB, &myrow, &rsrc, &nprow);
int nc = numroc_(&N, &NB, &mycol, &csrc, &npcol);

std::cout << "ipr: " << proc_id << " np: " << np << " nc: " << nc << std::endl;

// ipr: 6 np: 5120 nc: 5632
// ipr: 5 np: 5632 nc: 5120
// ipr: 3 np: 5632 nc: 5632
// ipr: 7 np: 5120 nc: 5632
// ipr: 2 np: 5632 nc: 5120
// ipr: 4 np: 5632 nc: 5632
// ipr: 8 np: 5120 nc: 5120
// ipr: 0 np: 5632 nc: 5632
// ipr: 1 np: 5632 nc: 5632

Checking where an element exists with indxg2p

You can check which process a given global co-ordinate exists on using the indxg2l_() function. This function returns the process number of the element. Usage example:

int l = 513, NB = 512, rsrc = 0;
int p = indxg2p_(&l, &NB, &myrow, &rsrc, &nprocs);


Obtain local co-oridnates using indxg2l

You can pass global co-ordinates along with process co-ordinates to the indxg2l_() function in order to obtain the local co-ordinates of the data element. A sample usage is like so:



Scalapack reports errors using the XERBLA error handler. Here’s some resources for this:

Linking and compiling

If you’re using an intel MKL distribution, use their linker options tool for knowing exact link options.

Take care to link with the _lp64 libraries since they treat integers as 32 bit and that’s what you want for scalapack. This link offers some explanation.

Function usage protips

As with other PBLAS or ScaLAPACK functions, this function expects the matrix to be already distributed over the BLACS process grid (and of course the BLACS process grid should be initialized).

The function in scalapack for LU decomposition is pdgetrf_. The C++ prototype of this function is as follows:

void pdgetrf_(
    int *m,   // (global) The number of rows in the distributed matrix sub(A)
    int *n,   // (global) The number of columns in the distributed matrix sub(A)
    // (local) Pointer into the local memory to an array of local size.
    // Contains the local pieces of the distributed matrix sub(A) to be factored.
    double *a,
    int *ia,  // (global) row index in the global matrix A indicating first row matrix sub(A)
    int *ja,  // (global) col index in the global matrix A indicating first col matrix sub(A)
    int *desca, // array descriptor of A
    int *ipiv, // contains the pivoting information. array of size
    int *info // information about execution.

In the above prototype, m signifies the number of rows of the submatrix, meaning the matrix that is present in the current process. Similarly for n in case of cols.

A function descinit_ can be used for initializing the descriptor array. Its prototype is as follows:

void descinit_(int *desc, const int *m,  const int *n, const int *mb, 
    const int *nb, const int *irsrc, const int *icsrc, const int *ictxt, 
    const int *lld, int *info);

In the descinit_, the MB and NB parameters signify the size of the block into which the matrix is divided. Not the size of the block that each process will receive. See the sync_lu code for an example of block cyclic LU decomposition.

The ipiv array is not a synchronized data struture - it will be different for each process. According to the docs, ipiv(i) is the global row local row i was swapped with. This array is tied to the distributed matrix A.

Storage in the arrays

Each local array of a process should store a part of the global matrix. The global matrix is stored in a block cyclic manner and scalapack reads each local array expecting it in a particular format. It is important to be aware of this.

See this explanation on the scalapack site to get a complete understanding. This link has resources on data distribution in scalapack in general.

For example, say you have a 1024 x 1024 matrix which you want to distribute on a 3x3 grid of 9 processes.