In this post I detail the steps I took to reproduce a distributed LU factorization using the Chameleon library, a dense linear algebra library built on the StarPU runtime system.


Installing Chameleon

Clone the sources from GitLab:

git clone --recursive https://gitlab.inria.fr/solverstack/chameleon.git

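Configure with the following for a non-CUDA, MPI-enabled build (the install prefix, FXT_DIR and STARPU_DIR paths are specific to my machine, so adjust them to yours):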

cd chameleon
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug \
         -DCMAKE_INSTALL_PREFIX=$HOME/gitrepos/hicma/profiling/chameleon/chameleon/build \
         -DCHAMELEON_USE_CUDA=OFF \
         -DCHAMELEON_USE_MPI=ON \
         -DFXT_DIR=/home/1/17M38101/software/fxt-0.3.8 \
         -DSTARPU_DIR=/home/1/17M38101/software/starpu-1.3.2-test \
         -DSTARPU_FIND_COMPONENTS=ON \
         -DCHAMELEON_ENABLE_TRACING=ON
make install

Make sure that you are using Open MPI. For some reason, Chameleon refuses to work with a StarPU build that was compiled against Intel MPI.

Compiling and linking your programs

Make sure the pkg-config files for starpu and starpumpi are on your PKG_CONFIG_PATH. You can then get the compiler flags with pkg-config --cflags chameleon and the linker flags with pkg-config --libs --static chameleon.

Distributed LU factorization implementations

The chameleon_pzgetrf_nopiv(CHAM_desc_t *, RUNTIME_sequence_t *, RUNTIME_request_t *) function performs a distributed LU factorization without pivoting, using Chameleon with StarPU underneath. It implements the right-looking variant of LU factorization, a very common algorithm popularized by ScaLAPACK.
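
To make "right-looking" concrete, here is a minimal sequential sketch in plain C. It is not Chameleon's code: Chameleon applies the same idea to tiles of a distributed, complex double precision matrix and submits the work as StarPU tasks. The names lu_nopiv and lu_residual are hypothetical helpers made up for this illustration, and the sketch assumes real doubles and a small row-major matrix for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Chameleon.
 * In-place LU factorization without pivoting of an n x n row-major matrix.
 * On success returns 0; a then holds U on and above the diagonal and the
 * strictly lower part of the unit lower triangular L below it.
 * Returns k + 1 if the pivot at step k is exactly zero. */
int lu_nopiv(int n, double *a)
{
    assert(n > 0 && a != NULL);
    for (int k = 0; k < n; k++) {
        double pivot = a[k * n + k];
        if (pivot == 0.0)
            return k + 1;
        /* Scale the column below the pivot: these become entries of L. */
        for (int i = k + 1; i < n; i++)
            a[i * n + k] /= pivot;
        /* Right-looking step: update the whole trailing submatrix to the
         * right of and below the pivot immediately. */
        for (int i = k + 1; i < n; i++)
            for (int j = k + 1; j < n; j++)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
    }
    return 0;
}

/* Hypothetical helper: largest absolute entry of A - L*U, where a0 is the
 * original matrix and lu is the packed output of lu_nopiv. */
double lu_residual(int n, const double *a0, const double *lu)
{
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int kmax = i < j ? i : j;
            double sum = 0.0;
            for (int k = 0; k <= kmax; k++) {
                /* L has an implicit unit diagonal. */
                double l = (k == i) ? 1.0 : lu[i * n + k];
                sum += l * lu[k * n + j];
            }
            double diff = a0[i * n + j] - sum;
            if (diff < 0.0)
                diff = -diff;
            if (diff > worst)
                worst = diff;
        }
    }
    return worst;
}
```

For example, calling lu_nopiv(3, a) on the row-major matrix {2,1,1, 4,3,3, 8,7,9} overwrites it with {2,1,1, 2,1,1, 4,3,2}, that is, L has the multipliers 2, 4 and 3 below its unit diagonal and U has rows (2,1,1), (0,1,1) and (0,0,2). Because there is no pivoting, a zero on the diagonal stops the factorization, which is why the function reports the failing step.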

The chameleon_pzgetrf_incpiv() function performs a distributed LU factorization using a newer algorithm with incremental pivoting, presented in this paper. The paper claims superior performance over right-looking LU because communication and computation can be overlapped better.