ScaLAPACK is the distributed-memory version of LAPACK, and the interface of most functions is very similar. However, not much documentation or example code is available for ScaLAPACK in C++, which is why I'm writing this blog post to document my learnings. Hopefully it will be useful to others too. The full source code can be found in this repo. This program calls Fortran routines from C++, and therefore all array storage is column-major.
This post is part of a larger post where I've implemented and benchmarked synchronous and asynchronous block LU decomposition. That post can be found here. This intel resource is also helpful for this purpose.
Table of Contents
- Scalapack protips
- Function usage protips
- Source code
There are certain terms that are used widely throughout ScaLAPACK. They are as follows:
- The ScaLAPACK docs assume that a matrix of K rows or columns is distributed over a process grid of dimensions p x q.
- LOCr(K) denotes the number of elements of K that a process would receive if K were distributed over the p processes of its process column.
- LOCc(K) denotes the number of elements of K that a process would receive if K were distributed over the q processes of its process row.
- The values of LOCr and LOCc can be determined using a call to the numroc function, described below.
- IMPORTANT :: None of these functions have C interfaces the way LAPACK does via LAPACKE. Therefore, you must take care to pass all variables by address, not by value, and to store all your data FORTRAN-style, i.e. in column-major format, not row-major.
Getting process block size with numroc
The numroc function is used in almost every ScaLAPACK program. It computes the number of rows
or columns of a distributed matrix owned by the calling process (the return value). Here's an
explanation along with the prototype:
```cpp
int numroc_(const int *n,       // (global) the number of rows/cols in the distributed matrix
            const int *nb,      // (global) block size: size of the blocks the distributed matrix is split into
            const int *iproc,   // (local input) coord of the process whose local array row/col is to be determined
            const int *srcproc, // (global input) coord of the process that has the first row/col of the distributed matrix
            const int *nprocs); // (global input) total no. of processes over which the matrix is distributed
```
Usage of numroc
Pass the process's row or column coordinate to obtain the total number of rows/columns of
the full matrix that this process will hold. For example, say you have a matrix of size
16384 that you want to split into blocks of 512 and distribute over 9 processes. This
leads to an uneven split and therefore requires careful calculation of the number of elements.
The correct way to obtain the number of elements that live on each process is as follows:
```cpp
int N = 16384, NB = 512;
int rsrc = 0, csrc = 0;
int nprow = 3, npcol = 3;
int np = numroc_(&N, &NB, &myrow, &rsrc, &nprow);
int nc = numroc_(&N, &NB, &mycol, &csrc, &npcol);
std::cout << "ipr: " << proc_id << " np: " << np << " nc: " << nc << std::endl;
// ipr: 6 np: 5120 nc: 5632
// ipr: 5 np: 5632 nc: 5120
// ipr: 3 np: 5632 nc: 5632
// ipr: 7 np: 5120 nc: 5632
// ipr: 2 np: 5632 nc: 5120
// ipr: 4 np: 5632 nc: 5632
// ipr: 8 np: 5120 nc: 5120
// ipr: 0 np: 5632 nc: 5632
// ipr: 1 np: 5632 nc: 5632
```
Checking where an element exists with indxg2p
You can check which process a given global co-ordinate exists on using the indxg2p function.
It returns the coordinate (in the relevant grid dimension) of the process that owns the element. Usage example:
```cpp
int l = 513, NB = 512, rsrc = 0;
int p = indxg2p_(&l, &NB, &myrow, &rsrc, &nprocs);
```
Obtain local co-ordinates using indxg2l
You can pass a global co-ordinate (along with the block size and process count) to the indxg2l function
in order to obtain the local co-ordinate of the data element within its owning process.
Scalapack reports errors using the XERBLA error handler.
Linking and compiling
If you're using an intel MKL distribution, use their link line advisor tool to determine the exact linker options.
Take care to link with the _lp64 libraries, since they treat integers as 32-bit, and that's what you want for scalapack. This link offers some explanation.
Function usage protips
As with other PBLAS or ScaLAPACK functions, this function expects the matrix to be already distributed over the BLACS process grid (and of course the BLACS process grid should be initialized).
The scalapack function for LU decomposition is pdgetrf_. The C++ prototype of this function is:
```cpp
void pdgetrf_(int *m,     // (global) the number of rows in the distributed submatrix sub(A)
              int *n,     // (global) the number of columns in the distributed submatrix sub(A)
              double *a,  // (local) pointer into the local memory to an array of local size;
                          // contains the local pieces of the distributed matrix sub(A) to be factored
              int *ia,    // (global) row index in the global matrix A indicating the first row of sub(A)
              int *ja,    // (global) col index in the global matrix A indicating the first col of sub(A)
              int *desca, // array descriptor of A
              int *ipiv,  // (local output) contains the pivoting information; array of size LOCr(M_A) + MB_A
              int *info); // information about execution
```
In the above prototype, m signifies the number of rows of the submatrix sub(A), i.e. the
portion of the global matrix being factored (starting at row ia). Note that this is a
global size, not the number of rows stored in the current process. Similarly for n in case of cols.
descinit_ can be used for initializing the descriptor array. Its prototype is as follows:

```cpp
void descinit_(int *desc,        // (output) the array descriptor to fill
               const int *m,     // (global) rows in the global matrix
               const int *n,     // (global) cols in the global matrix
               const int *mb,    // (global) row block size
               const int *nb,    // (global) col block size
               const int *irsrc, // (global) process row holding the first block
               const int *icsrc, // (global) process col holding the first block
               const int *ictxt, // (global) BLACS context handle
               const int *lld,   // (local) leading dimension of the local array
               int *info);       // (output) 0 on success
```
The MB and NB parameters signify the size of the blocks into which the
matrix is divided, not the size of the block that each process will receive. See the
code for an example of block-cyclic LU decomposition.
The ipiv array is not a synchronized data structure; it will be different on each process.
According to the docs, ipiv(i) is the global row with which local row i was swapped. This array
is tied to the distributed matrix A.
Storage in the arrays
Each process's local array stores a part of the global matrix. The global matrix is distributed in a block-cyclic manner, and scalapack expects each local array in a particular format, so it is important to be aware of how this works.
For example, say you have a 1024 x 1024 matrix which you want to distribute over a 3x3 grid of 9 processes.