Distributed LU factorization using Chameleon
In this post I will detail the steps I took to reproduce a distributed LU factorization using the Chameleon library. Chameleon is a dense linear algebra library built on top of the StarPU runtime system.
Installing chameleon
Clone the sources, including submodules, from GitLab:
git clone --recursive https://gitlab.inria.fr/solverstack/chameleon.git
Configure with the following for a non-CUDA, MPI-enabled build:
cd chameleon
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug \
         -DCMAKE_INSTALL_PREFIX=$HOME/gitrepos/hicma/profiling/chameleon/chameleon/build \
         -DCHAMELEON_USE_CUDA=OFF \
         -DCHAMELEON_USE_MPI=ON \
         -DFXT_DIR=/home/1/17M38101/software/fxt-0.3.8 \
         -DSTARPU_DIR=/home/1/17M38101/software/starpu-1.3.2-test \
         -DSTARPU_FIND_COMPONENTS=ON \
         -DCHAMELEON_ENABLE_TRACING=ON
make install
Make sure that you’re using OpenMPI. For some reason Chameleon refuses to work with a StarPU build that was compiled against Intel MPI.
Compiling and linking your programs
Make sure the starpu and starpumpi packages are visible on your pkg-config search path
(PKG_CONFIG_PATH). You can then get the compiler flags with pkg-config --cflags chameleon
and the linker flags with pkg-config --libs --static chameleon.
Distributed LU factorization implementations
The chameleon_pzgetrf_nopiv(CHAM_desc_t*, RUNTIME_sequence_t *, RUNTIME_request_t *)
function performs a distributed LU factorization without pivoting, using Chameleon with StarPU
underneath. It implements the right-looking variant of LU factorization, a very common
algorithm made popular by ScaLAPACK.
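To make the term "right-looking" concrete, here is a minimal serial sketch in C of an LU factorization without pivoting. It is only an illustration under my own assumptions: the helper name lu_nopiv_inplace, the column-major layout and the return convention are hypothetical and are not part of Chameleon's API. It also works element by element on a single node, whereas Chameleon works on tiles distributed across processes.

```c
#include <assert.h>
#include <stddef.h>

/* Right-looking LU without pivoting on a dense n x n matrix stored
 * column-major with leading dimension lda.
 * On return, the strict lower triangle holds L (unit diagonal implied)
 * and the upper triangle holds U.
 * Returns 0 on success, or k+1 if the k-th pivot (0-based) is exactly zero.
 * NOTE: lu_nopiv_inplace is a hypothetical helper, not a Chameleon routine. */
int lu_nopiv_inplace(size_t n, double *a, size_t lda)
{
    assert(a != NULL && lda >= n);

    for (size_t k = 0; k < n; ++k) {
        double pivot = a[k + k * lda];
        if (pivot == 0.0)
            return (int)(k + 1); /* no pivoting, so we cannot recover */

        /* Panel step: scale the column below the pivot to form L(:, k). */
        for (size_t i = k + 1; i < n; ++i)
            a[i + k * lda] /= pivot;

        /* Right-looking step: update the entire trailing submatrix
         * immediately, before moving on to column k+1. */
        for (size_t j = k + 1; j < n; ++j) {
            double ukj = a[k + j * lda];
            for (size_t i = k + 1; i < n; ++i)
                a[i + j * lda] -= a[i + k * lda] * ukj;
        }
    }
    return 0;
}
```

The defining feature is the last loop nest: as soon as column k is factorized, everything to its right is updated. Because there is no pivoting, a zero on the diagonal stops the factorization, so this variant is only safe for matrices that do not need row exchanges (for example, diagonally dominant ones).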
The chameleon_pzgetrf_incpiv() function is used for a distributed LU factorization with
incremental pivoting, a newer LU algorithm presented in this paper. The paper claims superior
performance compared to right-looking LU, since communication and computation can be
overlapped better.