<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.0">Jekyll</generator><link href="https://v0dro.in//feed.xml" rel="self" type="application/atom+xml" /><link href="https://v0dro.in//" rel="alternate" type="text/html" /><updated>2021-05-25T06:57:55+00:00</updated><id>https://v0dro.in//feed.xml</id><title type="html">code, travel and explore.</title><subtitle>A place where I share my experiences with travel, programming and learning new things in general.</subtitle><entry><title type="html">Profiling and benchmarking Python programs</title><link href="https://v0dro.in//blog/2020/04/21/profiling-and-benchmarking-python-programs/" rel="alternate" type="text/html" title="Profiling and benchmarking Python programs" /><published>2020-04-21T08:09:00+00:00</published><updated>2020-04-21T08:09:00+00:00</updated><id>https://v0dro.in//blog/2020/04/21/profiling-and-benchmarking-python-programs</id><content type="html" xml:base="https://v0dro.in//blog/2020/04/21/profiling-and-benchmarking-python-programs/">&lt;p&gt;The number of ways in which one can profile and benchmark Python programs
is daunting. There are many options out there, and this post is about the ones
that I found suitable for profiling and benchmarking PRs that I submit to
PyTorch every now and then. Coming from a land of C++ and Ruby, one annoying
thing I find about the Python tools is the preference for providing the
code to be profiled inside a string as an argument to the profiling tool, so
I try to directly instrument calls within the code wherever possible.&lt;/p&gt;

&lt;h1 id=&quot;profiling-c-extensions&quot;&gt;Profiling C extensions&lt;/h1&gt;

&lt;p&gt;Say you want to profile the following PyTorch script to find out
where the &lt;code&gt;scatter_&lt;/code&gt; call is spending most of its time:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;512&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;input_one&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scatter_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input_one&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;using-cprofile&quot;&gt;Using cProfile&lt;/h2&gt;

&lt;p&gt;The default profiler for Python is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cProfile&lt;/code&gt;, which is a faster version of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;profile&lt;/code&gt; module.
While it is simple to use and does not require any extra dependencies, it does not show profiles
of C++ functions at all. You can use it by calling the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cProfile.run&lt;/code&gt; function and passing it
the code to be profiled as a string, like so:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;cProfile&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Do something
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cProfile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;res.scatter_(dim,index,input_one)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The profiler prints its output once the call completes.&lt;/p&gt;
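&lt;p&gt;If you prefer instrumenting the code directly instead of passing it as a string, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cProfile.Profile&lt;/code&gt; object can be enabled and disabled around the region of interest. A minimal sketch, where the squaring loop is just a stand-in for the code you actually want to profile:&lt;/p&gt;

```python
import cProfile
import io
import pstats

# Instrument a region directly instead of passing a string to cProfile.run.
profiler = cProfile.Profile()
profiler.enable()
total = sum(i * i for i in range(100000))  # stand-in for the real workload
profiler.disable()

# Print the ten most expensive calls by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```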

&lt;h2 id=&quot;using-yep&quot;&gt;Using yep&lt;/h2&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;yep&lt;/code&gt; is a &lt;a href=&quot;https://pypi.org/project/yep/&quot;&gt;utility&lt;/a&gt; that uses Google’s gperftools underneath and promises to
show profiles of C/C++ function calls made inside Python C extensions. On Ubuntu/Debian, first install the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;google-perftools&lt;/code&gt;
package. Then run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip install yep&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can set a region to profile as follows:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;yep&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;yep&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;file_name.prof&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# do something
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;yep&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stop&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This generates a file &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;file_name.prof&lt;/code&gt; that can be analysed using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pprof&lt;/code&gt;
&lt;a href=&quot;https://github.com/google/pprof&quot;&gt;utility&lt;/a&gt; (which can be installed with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go get -u github.com/google/pprof&lt;/code&gt;). You can
then get the top time consuming functions from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pprof&lt;/code&gt; as follows:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pprof -text -lines file_name.prof
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;For our same program, profiling the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scatter_&lt;/code&gt; loop shows the following output:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;File: python3.6
Type: cpu
Showing nodes accounting for 27.51s, 98.81% of 27.84s total
Dropped 151 nodes (cum &amp;lt;= 0.14s)
      flat  flat%   sum%        cum   cum%
     4.45s 15.98% 15.98%     27.49s 98.74%  _ZZZZZN2at6native12_GLOBAL__N_130cpu_scatter_gather_base_kernelILb1EEclERNS_6TensorElRKS4_S7_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEbRKNS0_17SCATTE
R_GATHER_OPEENKUlvE_clEvENKUlvE2_clEvENKUlRKT_E_clISt8functionIFvPfSR_EEEEDaSN_ENKUlPPcPKllE_clESV_SX_l /home/sameer/gitrepos/pytorch/build/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp.AVX2.cpp:375
     2.84s 10.20% 26.19%      2.84s 10.20%  _ZNK2at6native12_GLOBAL__N_1UlPT_PT0_E2_clIffEEDaS3_S5_ /home/sameer/gitrepos/pytorch/build/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp.AVX2.cpp:171
     2.54s  9.12% 35.31%      2.54s  9.12%  std::forward /usr/include/c++/7/bits/move.h:74
     1.91s  6.86% 42.17%      5.07s 18.21%  _ZNSt17_Function_handlerIFvPfS0_EN2at6native12_GLOBAL__N_1UlPT_PT0_E2_EE9_M_invokeERKSt9_Any_dataOS0_SE_ /usr/include/c++/7/bits/std_function.h:317
     1.39s  4.99% 47.16%     20.25s 72.74%  std::function::operator() /usr/include/c++/7/bits/std_function.h:706
     1.16s  4.17% 51.33%      1.16s  4.17%  std::forward /usr/include/c++/7/bits/move.h:73
     1.14s  4.09% 55.42%     11.48s 41.24%  _ZNSt17_Function_handlerIFvPfS0_EN2at6native12_GLOBAL__N_1UlPT_PT0_E2_EE9_M_invokeERKSt9_Any_dataOS0_SE_ /usr/include/c++/7/bits/std_function.h:316
     1.04s  3.74% 59.16%      1.04s  3.74%  _ZNSt14_Function_base13_Base_managerIN2at6native12_GLOBAL__N_1UlPT_PT0_E2_EE14_M_get_pointerERKSt9_Any_data /usr/include/c++/7/bits/std_function.h:176
     0.91s  3.27% 62.43%      0.91s  3.27%  _ZNSt14_Function_base13_Base_managerIN2at6native12_GLOBAL__N_1UlPT_PT0_E2_EE14_M_get_pointerERKSt9_Any_data /usr/include/c++/7/bits/std_function.h:175
     0.90s  3.23% 65.66%      0.90s  3.23%  std::_Any_data::_M_access /usr/include/c++/7/bits/std_function.h:107
     0.87s  3.12% 68.79%      0.87s  3.12%  _ZNK2at6native12_GLOBAL__N_1UlPT_PT0_E2_clIffEEDaS3_S5_ /home/sameer/gitrepos/pytorch/build/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp.AVX2.cpp:170
     0.86s  3.09% 71.88%      0.86s  3.09%  std::function::operator() /usr/include/c++/7/bits/std_function.h:701
     0.79s  2.84% 74.71%      0.79s  2.84%  [libtorch_cpu.so]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;some-notes-on-yep&quot;&gt;Some notes on yep&lt;/h2&gt;

&lt;p&gt;If you rebuild or replace the shared object file that your program was running against and then call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pprof&lt;/code&gt; on the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.prof&lt;/code&gt; file,
it will report nonsensical functions, since the profile only records addresses that are resolved
against the current contents of the shared object file.&lt;/p&gt;

&lt;h1 id=&quot;analyzing-performance-regressions&quot;&gt;Analyzing performance regressions&lt;/h1&gt;

&lt;p&gt;Analysis of performance regressions requires comparing the same interfaces over different implementations.&lt;/p&gt;

&lt;h2 id=&quot;time-regression-analysis&quot;&gt;Time regression analysis&lt;/h2&gt;

&lt;p&gt;The simplest performance regression to check for is execution time. The IPython &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timeit&lt;/code&gt; magic command
is a great way to obtain the mean and standard deviation of multiple executions of the same lines of code;
using it within a script requires embedding IPython. When used with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-o&lt;/code&gt; option, the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timeit&lt;/code&gt; magic also returns an object containing information about the
most recent timing run.&lt;/p&gt;
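&lt;p&gt;Outside of IPython, the standard-library &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timeit&lt;/code&gt; module allows the same kind of direct instrumentation from within a script. A minimal sketch, where the statement being timed is just an example:&lt;/p&gt;

```python
import statistics
import timeit

# Each entry in runs is the total time for `number` executions of stmt.
runs = timeit.repeat(stmt="sum(range(1000))", repeat=5, number=10000)
per_call = [t / 10000 for t in runs]
print("mean: {:.3e} s, stdev: {:.3e} s".format(
    statistics.mean(per_call), statistics.stdev(per_call)))
```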

&lt;h1 id=&quot;further-reading&quot;&gt;Further Reading&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;C extensions with PySpy: https://www.benfrederickson.com/profiling-native-python-extensions-with-py-spy/&lt;/li&gt;
  &lt;li&gt;Yep home page: https://pypi.org/project/yep/&lt;/li&gt;
  &lt;li&gt;Speedscope homepage: https://github.com/jlfwong/speedscope&lt;/li&gt;
  &lt;li&gt;Pyspy homepage: https://github.com/benfred/py-spy&lt;/li&gt;
  &lt;li&gt;Google perftools: https://github.com/gperftools/gperftools&lt;/li&gt;
  &lt;li&gt;Yep blog post:  https://www.camillescott.org/2013/12/06/yep/&lt;/li&gt;
  &lt;li&gt;Timeit -o: https://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=timeit#magic-timeit&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><summary type="html">The number of ways in which one can profile and benchmark Python programs is daunting. There are many options out there, and this post is about the ones that I found suitable for profiling and benchmarking PRs that I submit to PyTorch every now and then. Coming from a land of C++ and Ruby, one annoying thing I find about the Python tools is the preference for providing the code to be profiled inside a string as an argument to the profiling tool, so I try to directly instrument calls within the code wherever possible.</summary></entry><entry><title type="html">Installing latest MPI</title><link href="https://v0dro.in//blog/2020/04/12/installing-latest-mpi/" rel="alternate" type="text/html" title="Installing latest MPI" /><published>2020-04-12T07:23:00+00:00</published><updated>2020-04-12T07:23:00+00:00</updated><id>https://v0dro.in//blog/2020/04/12/installing-latest-mpi</id><content type="html" xml:base="https://v0dro.in//blog/2020/04/12/installing-latest-mpi/">&lt;p&gt;Installing the latest openMPI can be a challenge if you want to optimize
all of its parameters properly. Here is the right way of doing so:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;gdrcopy (https://github.com/NVIDIA/gdrcopy.git). Nothing special here; it figures out most of
the things by itself. Do a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make INSTALL prefix=&amp;lt;somewhere GDR&amp;gt;&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;UCX (git@github.com:uccs/ucx.git). Pick the version you want (the latest is 1.8) and check out the
corresponding branch with git. First run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;./autogen.sh&lt;/code&gt;, then
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;../../configure --prefix=&amp;lt;somewhere UCX&amp;gt; --disable-debug --with-cuda --with-avx --with-gdrcopy=&amp;lt;somewhere GDR&amp;gt; --enable-mt --with-hwloc&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Finally OMPI (git@github.com:open-mpi/ompi.git). Similarly to UCX, pick the version you want
(I stick with master most of the time, except when it is obviously broken), then run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;./autogen.sh&lt;/code&gt; and
then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;../../configure --prefix=&amp;lt;somewhere OMPI&amp;gt; --enable-picky --disable-debug --enable-contrib-no-build=vt --enable-mpirun-prefix-by-default --with-cma --enable-ipv6 --disable-oshmem --disable-spc --with-ucx=&amp;lt;somewhere UCX&amp;gt; --with-cuda&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;</content><author><name></name></author><summary type="html">Installing the latest openMPI can be a challenge if you want to optimize all of its parameters properly. Here is the right way of doing so:</summary></entry><entry><title type="html">Day hike to Mitsutoge from Tokyo</title><link href="https://v0dro.in//blog/2020/04/04/day-hike-to-mitsutoge-from-tokyo/" rel="alternate" type="text/html" title="Day hike to Mitsutoge from Tokyo" /><published>2020-04-04T07:14:00+00:00</published><updated>2020-04-04T07:14:00+00:00</updated><id>https://v0dro.in//blog/2020/04/04/day-hike-to-mitsutoge-from-tokyo</id><content type="html" xml:base="https://v0dro.in//blog/2020/04/04/day-hike-to-mitsutoge-from-tokyo/">&lt;p&gt;Mt. Mitsutoge is a great hike not too far from Tokyo. The full hike is about 20 KM long and takes
about 5.5 hours of walking to finish. I came across this while
browsing for things to do around the city on &lt;a href=&quot;https://ridgelineimages.com/hiking/mt-mitsutoge/&quot;&gt;ridgeline images&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I had a spare day remaining on an old Seishun 18 ticket (and not much time to finish
it) and therefore decided to take most of the route to and from Tokyo by train. I’d recommend
going by bus if you’re not in a situation like I was.&lt;/p&gt;

&lt;h1 id=&quot;travel-route&quot;&gt;Travel Route&lt;/h1&gt;

&lt;h1 id=&quot;access&quot;&gt;Access&lt;/h1&gt;

&lt;p&gt;The JR Chuo Line&lt;/p&gt;</content><author><name></name></author><summary type="html">Mt. Mitsutoge is a great hike not too far from Tokyo. The full hike is about 20 KM long and takes about 5.5 hours of walking to finish. I came across this while browsing for things to do around the city on ridgeline images.</summary></entry><entry><title type="html">Tokyo to Matsumoto with the Seishun18</title><link href="https://v0dro.in//blog/2020/04/04/tokyo-to-matsumoto-with-the-seishun18/" rel="alternate" type="text/html" title="Tokyo to Matsumoto with the Seishun18" /><published>2020-04-04T05:52:00+00:00</published><updated>2020-04-04T05:52:00+00:00</updated><id>https://v0dro.in//blog/2020/04/04/tokyo-to-matsumoto-with-the-seishun18</id><content type="html" xml:base="https://v0dro.in//blog/2020/04/04/tokyo-to-matsumoto-with-the-seishun18/">&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;We wanted to head to Matsumoto for a short 2-day trip to escape the hustle and bustle of
Tokyo. However, taking the direct Limited Express train from Shinjuku to Matsumoto is
prohibitively expensive (about 6000 JPY one way), so we decided to use the Seishun18
pass from Tokyo, which cost us about 2410 JPY one way.&lt;/p&gt;

&lt;p&gt;The only downside of using the Seishun18 is that it takes longer and you need three transfers.
This short post is about which trains we took for this particular trip.&lt;/p&gt;

&lt;h1 id=&quot;the-journey&quot;&gt;The Journey&lt;/h1&gt;</content><author><name></name></author><summary type="html">Introduction</summary></entry><entry><title type="html">Setting up Docker containers for testing pytorch</title><link href="https://v0dro.in//blog/2020/03/30/setting-up-docker-containers-for-testing-pytorch/" rel="alternate" type="text/html" title="Setting up Docker containers for testing pytorch" /><published>2020-03-30T10:18:00+00:00</published><updated>2020-03-30T10:18:00+00:00</updated><id>https://v0dro.in//blog/2020/03/30/setting-up-docker-containers-for-testing-pytorch</id><content type="html" xml:base="https://v0dro.in//blog/2020/03/30/setting-up-docker-containers-for-testing-pytorch/">&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;While developing pytorch we frequently run into issues that fail only
for a particular build configuration and are very hard to reproduce
on a local machine. Facebook uses Docker containers for running
CI setups for various build configurations, and you can use the same
containers to build your own local Docker images in order to reproduce
the issue easily. This post describes how you can use the Docker
functionality on QGPU1 in order to build such a Docker image.&lt;/p&gt;

&lt;h1 id=&quot;first-steps&quot;&gt;First steps&lt;/h1&gt;

&lt;p&gt;The first step is to ask an admin (Pearu/Sameer/Dharhas) to add you
to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;docker&lt;/code&gt; group on QGPU1. Once you’re in this group, find the
Amazon ECR API access keys on the facebook quip document. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Docker&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia-docker&lt;/code&gt; are already installed so you need not install
them.&lt;/p&gt;

&lt;p&gt;Then install the Amazon ECR client for your user from &lt;a href=&quot;https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ECS_CLI_installation.html&quot;&gt;here&lt;/a&gt;.
Then run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws ecr get-login&lt;/code&gt;, which prints a login command containing
temporary credentials; run that command to log in automatically.&lt;/p&gt;

&lt;p&gt;In order to know which docker image you need, you must know its full
name first. The name can be found from the circleCI build. The Amazon
ECR name can be a little different, so find the name in the
output of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws ecr describe-repositories&lt;/code&gt; command. Use it
in this manner to find which repo you need (typically the name
of your failing build on circleCI):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;aws ecr describe-repositories | &lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-C&lt;/span&gt; 3 xenial
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;From the output of this, pick up the exact &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;repositoryName&lt;/code&gt; value
and replace it in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;repo_name&lt;/code&gt; variable in the below ruby script
in order to get the full string that will be the name of your docker image:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'json'&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;repo_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7&quot;&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;puts&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;logging in...&quot;&lt;/span&gt;
&lt;span class=&quot;sb&quot;&gt;`aws ecr get-login --no-include-email | bash`&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;images&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;parse&lt;/span&gt; &lt;span class=&quot;sb&quot;&gt;`aws ecr describe-images --repository-name &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;repo_name&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;image_tag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;images&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'imageDetails'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'imageTags'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;repo_info&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;parse&lt;/span&gt; &lt;span class=&quot;sb&quot;&gt;`aws ecr describe-repositories --repository-names &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;repo_name&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;uri&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;repo_info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'repositories'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'repositoryUri'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;puts&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Docker image: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uri&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;image_tag&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You can then create your own docker image using a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dockerfile&lt;/code&gt; like so:&lt;/p&gt;
&lt;div class=&quot;language-Dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Insert the SHA key after the image name.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; &amp;lt;insert docker image name here&amp;gt;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;ENV&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; MAX_JOBS=20&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;conda &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; hypothesis

&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;workspace &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; git submodule &lt;span class=&quot;nb&quot;&gt;sync&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--recursive&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; git submodule update &lt;span class=&quot;nt&quot;&gt;--init&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--recursive&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;TORCH_CUDA_ARCH_LIST&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;Turing python setup.py &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--cmake&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The default docker container is built for a different GPU in some cases, so
it is important to specify the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TORCH_CUDA_ARCH_LIST&lt;/code&gt; env variable.&lt;/p&gt;

&lt;h1 id=&quot;further-reading&quot;&gt;Further reading&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;https://github.com/pytorch/pytorch/wiki/Docker-image-build-on-CircleCI&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><summary type="html">Introduction</summary></entry><entry><title type="html">Interpreting the results of the STREAM benchmark</title><link href="https://v0dro.in//blog/2020/02/27/interpreting-stream-benchmarks/" rel="alternate" type="text/html" title="Interpreting the results of the STREAM benchmark" /><published>2020-02-27T00:00:00+00:00</published><updated>2020-02-27T00:00:00+00:00</updated><id>https://v0dro.in//blog/2020/02/27/interpreting-stream-benchmarks</id><content type="html" xml:base="https://v0dro.in//blog/2020/02/27/interpreting-stream-benchmarks/">&lt;!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-refresh-toc --&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#introduction&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#machine-balance&quot;&gt;Machine Balance&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#stream-kernels&quot;&gt;STREAM kernels&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#triad&quot;&gt;TRIAD&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#useful-links&quot;&gt;Useful links&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;!-- markdown-toc end --&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;The STREAM benchmark is considered an important benchmark for understanding the memory
bandwidth and access latency of a particular computer. This benchmark was conceptualized
in the 1995 &lt;a href=&quot;http://www.cs.virginia.edu/~mccalpin/papers/bandwidth/bandwidth.html&quot;&gt;paper by John McCalpin&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;machine-balance&quot;&gt;Machine Balance&lt;/h1&gt;

&lt;p&gt;At the heart of the benchmark lies a definition of ‘machine balance’. Before STREAM,
machine balance was simply defined as the ratio of the number of floating point operations per clock
cycle to the number of memory operations per clock cycle. This is known as the ‘balance’
since it contrasts the time spent executing useful work (floating point operations) with
work that is absolutely necessary for performing the useful work but is always a bottleneck
in performance (memory access latency).&lt;/p&gt;

&lt;p&gt;However, this definition fails to capture the complexity
of hierarchical memory structures that use multiple layers of cache and parallelization
strategies such as pipelining and prefetching. This is because the number of floating point
operations per cycle can greatly vary depending on the location of the data that is being
operated on. The peak will be reached when the data resides in registers, whereas for
data being accessed from memory, the number of cycles taken to execute a single floating
point operation will be much higher due to latency.&lt;/p&gt;

&lt;p&gt;If this is the case, one might wonder why taking an average of this simple definition is
not adequate since working with a long-enough array will engage the registers and the
RAM too, and should give an estimate of the average number of floating point ops per cycle.
&lt;!-- explain why over here --&gt;&lt;/p&gt;

&lt;p&gt;The STREAM benchmark refines the definition of ‘machine balance’ and defines it as the PEAK
floating point operations per cycle divided by the number of sustained memory operations per
cycle.&lt;/p&gt;

&lt;h1 id=&quot;stream-kernels&quot;&gt;STREAM kernels&lt;/h1&gt;

&lt;p&gt;The benchmark is broken up into a number of kernels, each employing a different set
of instructions per kernel operation.&lt;/p&gt;

&lt;h2 id=&quot;sum&quot;&gt;SUM&lt;/h2&gt;

&lt;p&gt;The STREAM SUM kernel computes the vector operation &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A(i) = B(i) + C(i)&lt;/code&gt;.
Each iteration moves 24 bytes of data (two loads and one store, assuming doubles)
and performs one floating point addition.&lt;/p&gt;

&lt;h2 id=&quot;triad&quot;&gt;TRIAD&lt;/h2&gt;

&lt;p&gt;The STREAM TRIAD kernel computes the vector operation &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A(i) = B(i) + s * C(i)&lt;/code&gt;.
Each kernel execution involves two loads, one store and one FMA instruction.
If vectorized, it will perform a number of such kernel operations per loop iteration.&lt;/p&gt;

&lt;p&gt;Assuming that we are working with doubles, each iteration uses 24 bytes in reads and writes.&lt;/p&gt;

&lt;h1 id=&quot;running-the-benchmark&quot;&gt;Running the benchmark&lt;/h1&gt;

&lt;p&gt;The following is the methodology to run STREAM on an A64FX chip used in the FUGAKU supercomputer.&lt;/p&gt;

&lt;p&gt;Download the source from &lt;a href=&quot;https://github.com/jeffhammond/STREAM&quot;&gt;here&lt;/a&gt; and for FUGAKU use the
following compile command using the FUJITSU compiler on a compute node:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;fcc &lt;span class=&quot;nt&quot;&gt;-Nclang&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-O3&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-fopenmp&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-DSTREAM_ARRAY_SIZE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4194304 stream.c
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the above command we set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STREAM_ARRAY_SIZE&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;4194304&lt;/code&gt; since the STREAM benchmark
specifies that the size of each array should be either 4x the total size of the
last-level caches or 1 million elements, whichever is larger. We are running this test on
a single core, so only one of the 4 L2 caches on the chip is used, which is 8 MB.
Assuming we’re using doubles, that gives ( 8 * 1024 * 1024 / 8 ) * 4 = 4194304 elements.&lt;/p&gt;

&lt;p&gt;In order to figure out the single threaded bandwidth performance, set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OMP_NUM_THREADS=1&lt;/code&gt; and
run the executable. The following are the results on the A64FX:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           21619.0     0.074010     0.074009     0.074011
Scale:          54700.6     0.029265     0.029250     0.029288
Add:            73861.8     0.032498     0.032493     0.032500
Triad:          64291.6     0.037334     0.037330     0.037337
-------------------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It can be seen that TRIAD, the most complex of the 4 kernels,
shows a peak bandwidth utilization of about 64 GB/s for a single core.&lt;/p&gt;

&lt;h1 id=&quot;discussion-on-the-results&quot;&gt;Discussion on the results&lt;/h1&gt;

&lt;p&gt;The above results only demonstrate the peak achievable memory bandwidth
for certain kernels. In practice such speeds are rarely reached.&lt;/p&gt;

&lt;p&gt;In the STREAM paper, Dr. McCalpin says that the TRIAD benchmark is the standard
used for calculating the machine balance of the system. Why use the TRIAD even
though ADD seems to be utilizing more bandwidth on a single core?&lt;/p&gt;

&lt;h1 id=&quot;useful-links&quot;&gt;Useful links&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://stackoverflow.com/questions/56086993/what-does-stream-memory-bandwidth-benchmark-really-measure&quot;&gt;https://stackoverflow.com/questions/56086993/what-does-stream-memory-bandwidth-benchmark-really-measure&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sites.utexas.edu/jdm4372/tag/stream-benchmark/&quot;&gt;https://sites.utexas.edu/jdm4372/tag/stream-benchmark/&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://blogs.fau.de/hager/archives/8263&quot;&gt;https://blogs.fau.de/hager/archives/8263&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://stackoverflow.com/questions/39260020/why-is-skylake-so-much-better-than-broadwell-e-for-single-threaded-memory-throug&quot;&gt;https://stackoverflow.com/questions/39260020/why-is-skylake-so-much-better-than-broadwell-e-for-single-threaded-memory-throug&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://software.intel.com/content/www/us/en/develop/articles/optimizing-memory-bandwidth-on-stream-triad.html&quot;&gt;https://software.intel.com/content/www/us/en/develop/articles/optimizing-memory-bandwidth-on-stream-triad.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><summary type="html">Table of Contents</summary></entry><entry><title type="html">PyTorch TensorIterator Internals</title><link href="https://v0dro.in//blog/2020/01/19/pytorch-tensor-iterator-internals/" rel="alternate" type="text/html" title="PyTorch TensorIterator Internals" /><published>2020-01-19T07:00:00+00:00</published><updated>2020-01-19T07:00:00+00:00</updated><id>https://v0dro.in//blog/2020/01/19/pytorch-tensor-iterator-internals</id><content type="html" xml:base="https://v0dro.in//blog/2020/01/19/pytorch-tensor-iterator-internals/">&lt;!--
.. Title: PyTorch TensorIterator Internals
.. slug: pytorch-tensoriterator-internals
.. date: 2020-03-12 22:39:56 UTC-05:00
.. tags: 
.. category: 
.. link: 
.. description: 
.. type: text
--&gt;

&lt;!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-generate-toc again --&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#introduction&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#history-of-tensoriterator&quot;&gt;History of TensorIterator&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#th-iterators&quot;&gt;TH iterators&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#limitations-of-th-iterators&quot;&gt;Limitations of TH iterators&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#basics-of-tensoriterator&quot;&gt;Basics of TensorIterator&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#performing-iterations&quot;&gt;Performing iterations&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#iteration-details&quot;&gt;Iteration details&lt;/a&gt;
        &lt;ul&gt;
          &lt;li&gt;&lt;a href=&quot;#using-kernels-for-iterations&quot;&gt;Using kernels for iterations&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#setting-tensor-iteration-dimensions&quot;&gt;Setting tensor iteration dimensions&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;!-- markdown-toc end --&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;PyTorch is one of the leading frameworks for deep learning. Its core data structure is
the Tensor, a multi-dimensional array implementation with many advanced features like auto-differentiation. PyTorch is a massive
codebase (approx. &lt;a href=&quot;https://www.openhub.net/p/pytorch&quot;&gt;12 GB of files and about a million lines&lt;/a&gt; of
C++, Python and CUDA code), and a method for iterating over tensors efficiently, independently of
data type, dimension, striding and hardware, is a critical feature that massively
simplifies the codebase and makes distributed development faster and
smoother. The &lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;&lt;/a&gt; C++ class within PyTorch is a complex yet useful class that
is used for iterating over the elements of a tensor over any dimension and implicitly
parallelizing various operations in a device-independent manner.&lt;/p&gt;

&lt;p&gt;It does this through
a C++ API that is independent of type and device of the tensor, freeing the programmer
of having to worry about the datatype or device when writing iteration logic for PyTorch
tensors. For those coming from the NumPy universe, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NpyIter&lt;/code&gt; is a close cousin of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This post is a deep dive into how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; works, and is
an essential part of learning to contribute to the PyTorch codebase, since iterations
over tensors in the C++ codebase are extremely commonplace. This post is aimed at someone
who wants to contribute to PyTorch, and you should at least be familiar with some of the
basic terminology of the PyTorch codebase, which can be found in Edward Yang’s
excellent &lt;a href=&quot;http://blog.ezyang.com/2019/05/pytorch-internals/&quot;&gt;blog post&lt;/a&gt; on PyTorch internals.
Although &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; can be used for both CPUs and accelerators, this post has been
written with usage on the CPU in mind. While there can be some dissimilarities between
the two, the overall concepts are the same.&lt;/p&gt;

&lt;h1 id=&quot;history-of-tensoriterator&quot;&gt;History of TensorIterator&lt;/h1&gt;

&lt;h2 id=&quot;th-iterators&quot;&gt;TH iterators&lt;/h2&gt;

&lt;p&gt;TensorIterator was devised to simplify the implementation of PyTorch’s tensor operations over the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TH&lt;/code&gt; implementation. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TH&lt;/code&gt; uses preprocessor macros to write type-independent loops over tensors, instead of C++ templates. For example, consider this simple &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TH&lt;/code&gt; loop
for computing the product of all the numbers in a particular dimension (find the code 
&lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/TH/generic/THTensorMoreMath.cpp#L350&quot;&gt;here&lt;/a&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-C&quot;&gt;TH_TENSOR_DIM_APPLY2(scalar_t, t, scalar_t, r_, dimension,
    accreal prod = 1;
    int64_t i;
    for(i = 0; i &amp;lt; t_size; i++)
        prod *= t_data[i*t_stride];
    *r__data = (scalar_t)prod;
);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The above loop works by following a particular convention for the naming of the
types and variables. You specify the input and output types of your tensors in the first
and third arguments. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scalar_t&lt;/code&gt; is a type that can generically be used for denoting a PyTorch
scalar type such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;double&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;long&lt;/code&gt; etc. Internally, PyTorch uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scalar_t&lt;/code&gt;
to compile the file multiple times with different definitions of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scalar_t&lt;/code&gt; (i.e. for different
data types like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int&lt;/code&gt;, etc.). The input and output tensors are
specified in the second and fourth arguments (in this case &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;r_&lt;/code&gt;), and the dimension that
we want to iterate over is specified as the fifth argument (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dimension&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;We then follow these arguments with the main body of the iterator (which is accepted as the sixth
argument into the macro), and denote the data, stride and size of the particular tensor dimension
by using variables that are suffixed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_data&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_stride&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_size&lt;/code&gt; respectively after the
variable name that represents the tensor inside the iterator body. For example, the size of the
input tensor is denoted as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t_size&lt;/code&gt; in the above example and the pointer to the data of the output
tensor is denoted as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;r__data&lt;/code&gt;. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accreal&lt;/code&gt; in the second line is a custom type that specifies
a real number used as an accumulator (in this case for accumulating the product).&lt;/p&gt;

&lt;p&gt;Internally, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TH_TENSOR_DIM_APPLY2&lt;/code&gt; macro is expanded for generating various dispatch calls 
depending on the type of the tensor that needs to be iterated over. The implementation of 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TH_TENSOR_DIM_APPLY2&lt;/code&gt; can be found &lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/TH/THTensorDimApply.h#L138&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;limitations-of-th-iterators&quot;&gt;Limitations of TH iterators&lt;/h2&gt;

&lt;p&gt;Apart from the obvious complication of maintaining a codebase so dependent
on such insanely complex macro expansions, TH iterators have some fundamental shortcomings. For
one thing, they cannot be used for writing iterators in a device-independent manner - you
need separate iterators for CPU and CUDA. Also, parallelization does not happen implicitly
inside the iterator; you need to write the parallel looping logic yourself. Moreover, at a deeper
level, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TH&lt;/code&gt; iterators do not collapse the dimensions of the tensor (as we’ll see later in this
post), leading to loops that might not be as cache-optimized as possible.&lt;/p&gt;

&lt;p&gt;These limitations led to the creation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;, which is used by the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ATen&lt;/code&gt; tensor implementation for overcoming some of the shortcomings of the previous &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TH&lt;/code&gt;
iterators.&lt;/p&gt;

&lt;h1 id=&quot;basics-of-tensoriterator&quot;&gt;Basics of TensorIterator&lt;/h1&gt;

&lt;p&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; can be created using the default constructor. You must then add the tensors
that you want to use as inputs or outputs. A good example is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator::binary_op()&lt;/code&gt;
&lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp#L652&quot;&gt;method&lt;/a&gt;, which
allows you to create &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; objects for performing point-wise binary operations
between two tensors. The important parts look like so:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TensorIterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;As you can see, you add a tensor called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;out&lt;/code&gt; as the output tensor and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b&lt;/code&gt; as the
input tensors. Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build&lt;/code&gt; is then mandatory for creating the object and letting
the class perform other optimizations like collapsing dimensions.&lt;/p&gt;

&lt;h1 id=&quot;performing-iterations&quot;&gt;Performing iterations&lt;/h1&gt;

&lt;p&gt;Broadly, iterations using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; can be classified as point-wise iterations
or reduction iterations. This plays a fundamental role in how iterations using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;
are parallelized - point-wise iterations can be freely parallelized along any dimension
and grain size while reduction operations have to be either parallelized along dimensions
that you’re not iterating over or by performing bisect and reduce operations along the
dimension being iterated. Parallelization can also happen using vectorized operations.&lt;/p&gt;

&lt;h2 id=&quot;iteration-details&quot;&gt;Iteration details&lt;/h2&gt;

&lt;p&gt;The simplest iteration operation can be performed using the 
&lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp#L525&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;for_each&lt;/code&gt;&lt;/a&gt; 
function. This function has two overloads: one takes a function object that iterates over a
single dimension (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loop_t&lt;/code&gt;); the other takes a function object that iterates over two
dimensions simultaneously (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loop2d_t&lt;/code&gt;). Find their definitions &lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.h#L166&quot;&gt;here&lt;/a&gt;. The simplest
way of using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;for_each&lt;/code&gt; is to pass it a lambda of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loop_t&lt;/code&gt; (or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loop2d_t&lt;/code&gt;).
A code snippet using it this way looks like so:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TensorIterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dont_resize_outputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// call if out is allocated.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dont_compute_common_dtype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// call if inputs/outputs are of a different type.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int64_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strides&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out_data_bytes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;in_data_bytes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// assume float data type for this example.&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;reinterpret_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out_data_bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;reinterpret_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in_data_bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        
      &lt;span class=&quot;n&quot;&gt;out_data_bytes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strides&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;in_data_bytes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strides&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;for_each&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In the above example, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;char** data&lt;/code&gt; gives a pointer to the data within the
tensor in the same order that you specify when you build the iterator. Note
that in order to make the implementation agnostic of any particular data type, you
will always receive the pointer typecast to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;char&lt;/code&gt; (think of it as a bunch of bytes).&lt;/p&gt;

&lt;p&gt;The second argument is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int64_t* strides&lt;/code&gt; which is an array containing the strides of
each tensor in the dimension that you’re iterating over. We can add this stride to the
pointer received in order to reach the next element in the tensor. The last argument is
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int64_t n&lt;/code&gt; which is the size of the dimension being iterated over.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;for_each&lt;/code&gt; implicitly parallelizes the operation by executing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loop&lt;/code&gt; in parallel
if the number of iterations is more than the value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;internal::GRAIN_SIZE&lt;/code&gt;, which is a value
that is determined as the ‘right amount’ of data to iterate over in order to gain a significant
speedup using multi-threaded execution. If you want to explicitly specify that your
operation &lt;em&gt;must&lt;/em&gt; run in serial, then use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;serial_for_each&lt;/code&gt; loop.&lt;/p&gt;

&lt;h3 id=&quot;using-kernels-for-iterations&quot;&gt;Using kernels for iterations&lt;/h3&gt;

&lt;p&gt;Frequently we want to create a kernel that applies a simple point-wise function onto entire tensors.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;
provides various such generic kernels that can be used for iterating over the elements
of a tensor without having to worry about the stride, data type of the operands or details
of the parallelism.&lt;/p&gt;

&lt;p&gt;For example, if we want to build a function that performs the point-wise addition
of two tensors and stores the result in a third tensor, we can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cpu_kernel&lt;/code&gt;
function. Note that in this example we assume tensors of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float&lt;/code&gt;, but you can
use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AT_DISPATCH_ALL_TYPES_AND2&lt;/code&gt; macro to generate this code for multiple data types.&lt;/p&gt;
&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;TensorIterator&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a_tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b_tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c_tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cpu_kernel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Writing the kernel in this way ensures that the value returned by the lambda passed to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cpu_kernel&lt;/code&gt; will populate the corresponding place in the target output tensor.&lt;/p&gt;

&lt;h3 id=&quot;setting-tensor-iteration-dimensions&quot;&gt;Setting tensor iteration dimensions&lt;/h3&gt;

&lt;p&gt;The values of the sizes and strides determine which dimension of the tensor you will iterate over.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; performs optimizations to make sure that at least
most of the iterations happen on contiguous data in order to take advantage of hierarchical cache-based
memory architectures (think dimension coalescing and reordering for maximum data locality).&lt;/p&gt;
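&lt;p&gt;To make the coalescing idea concrete, here is a small Python sketch (a simplified model, not PyTorch’s actual implementation): two adjacent dimensions can be merged whenever the outer stride steps exactly over the inner dimension’s extent, so the pair is one contiguous run in memory:&lt;/p&gt;

```python
# NOTE: a simplified sketch of the coalescing idea, not PyTorch's actual code.
def coalesce(shape, strides):
    """Merge adjacent dimensions whose memory is jointly contiguous.

    Strides are in elements. Two adjacent dims can be fused whenever the
    outer stride equals inner_stride * inner_size.
    """
    out_shape, out_strides = [shape[0]], [strides[0]]
    for size, stride in zip(shape[1:], strides[1:]):
        if out_strides[-1] == stride * size:
            out_shape[-1] *= size      # fuse: one longer contiguous run
            out_strides[-1] = stride
        else:
            out_shape.append(size)     # gap in memory: keep dims separate
            out_strides.append(stride)
    return out_shape, out_strides

# A contiguous 2x3 tensor collapses into a single run of 6 elements,
# while a padded one (outer stride 4) cannot be collapsed.
print(coalesce([2, 3], [3, 1]))  # -> ([6], [1])
print(coalesce([2, 3], [4, 1]))  # -> ([2, 3], [4, 1])
```

&lt;p&gt;Fewer, longer dimensions mean longer inner loops over contiguous memory, which is exactly what the cache hierarchy rewards.&lt;/p&gt;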

&lt;p&gt;Now a multi-dimensional tensor has a different stride value for each dimension
you might want to iterate over, so &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; computes the strides that
get passed into the loop by itself within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build()&lt;/code&gt; function. How exactly it computes the dimension
to iterate over is something that should be properly understood in order to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;
effectively.&lt;/p&gt;

&lt;p&gt;If you’re performing a reduction operation (see the sum code in &lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/ReduceOps.cpp#L384&quot;&gt;ReduceOps.cpp&lt;/a&gt;),
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; will figure out the dimensions that will be reduced depending
on the shape of the input and output tensor, which determines how the input will be broadcast
over the output. If you’re
performing a simple pointwise operation between two tensors (like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;addcmul&lt;/code&gt; from 
&lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/PointwiseOps.cpp#L31&quot;&gt;PointwiseOps.cpp&lt;/a&gt;),
the iteration will happen over the entire tensor, without providing a choice of dimension.
This allows &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; to freely parallelize the computation, with no guarantees on
the order of execution (since the order does not matter anyway).&lt;/p&gt;

&lt;p&gt;For something like a cumulative sum operation, where you want to be able to choose the dimension
to reduce but iterate over multiple non-reduced dimensions (possibly in parallel), you
must first re-stride the tensors and then use the re-strided tensors
for creating a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;. In order to understand how this bit works, let’s go over
the code for the &lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp#L21&quot;&gt;kernel&lt;/a&gt; that executes the &lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp#L71&quot;&gt;cumsum&lt;/a&gt; function.&lt;/p&gt;

&lt;p&gt;The important bits of this function are like so:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self_sizes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ensure_nonempty_vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sizes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;self_sizes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result_restrided&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;restride_dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self_sizes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self_restrided&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;restride_dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self_sizes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TensorIterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dont_compute_common_dtype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dont_resize_outputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result_restrided&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self_restrided&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;You can see that we first change the size of the tensors to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1&lt;/code&gt; on the
reduction dimension so that the dimension collapsing logic inside
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator#build&lt;/code&gt; knows which dimension to skip.
Setting the size in this way is akin to telling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;
to skip that dimension. We then restride the tensors using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restride_dim&lt;/code&gt; and
use the restrided tensors for building the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;. You can
set any size for inputs/outputs, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; will check whether it
can come up with a common broadcast size.&lt;/p&gt;
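&lt;p&gt;A conceptual NumPy model of what this restriding buys you (this is only a sketch of the iteration pattern, not the actual kernel code): the iterator effectively hands the inner loop one independent 1-D lane along the reduced dimension at a time, and each lane can be scanned separately:&lt;/p&gt;

```python
# NOTE: a conceptual model of the iteration pattern only, not PyTorch code.
import numpy as np

def cumsum_along(a, dim):
    """Scan every 1-D lane along `dim`.

    Each lane is what the restrided iterator effectively hands to the
    inner loop; the lanes are independent, hence parallelizable.
    """
    out = np.empty_like(a)
    outer_shape = tuple(s for d, s in enumerate(a.shape) if d != dim)
    for outer in np.ndindex(*outer_shape):
        idx = list(outer)
        idx.insert(dim, slice(None))   # select the full lane along `dim`
        out[tuple(idx)] = np.cumsum(a[tuple(idx)])
    return out

a = np.arange(24).reshape(2, 3, 4)
assert np.array_equal(cumsum_along(a, 1), np.cumsum(a, axis=1))
```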

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;This post was a very short introduction to what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; is actually
capable of. If you want to learn more about how it works and what goes into
things like collapsing the tensor size for optimizing memory access, a good
place to start would be the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build()&lt;/code&gt; function in 
&lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp#L1030&quot;&gt;TensorIterator.cpp&lt;/a&gt;.
Also have a look at &lt;a href=&quot;https://github.com/pytorch/pytorch/wiki/How-to-use-TensorIterator&quot;&gt;this blog post&lt;/a&gt; from the PyTorch team
on using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;.&lt;/p&gt;</content><author><name></name></author><summary type="html"></summary></entry><entry><title type="html">Distributed LU factorization using Chameleon</title><link href="https://v0dro.in//blog/2019/09/16/distributed-lu-factorization-using-chameleon/" rel="alternate" type="text/html" title="Distributed LU factorization using Chameleon" /><published>2019-09-16T23:45:00+00:00</published><updated>2019-09-16T23:45:00+00:00</updated><id>https://v0dro.in//blog/2019/09/16/distributed-lu-factorization-using-chameleon</id><content type="html" xml:base="https://v0dro.in//blog/2019/09/16/distributed-lu-factorization-using-chameleon/">&lt;p&gt;In this post I will detail the steps I took to reproduce distributed
LU factorization using the Chameleon library. It is a linear algebra library
based on the StarPU runtime system.&lt;/p&gt;

&lt;!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-generate-toc again --&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#installing-chameleon&quot;&gt;Installing chameleon&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#compiling-and-linking-your-programs&quot;&gt;Compiling and linking your programs&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#distributed-lu-factorization-implementations&quot;&gt;Distributed LU factorization implementations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;!-- markdown-toc end --&gt;

&lt;h1 id=&quot;installing-chameleon&quot;&gt;Installing chameleon&lt;/h1&gt;

&lt;p&gt;Clone the sources from gitlab:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git clone --recursive https://gitlab.inria.fr/solverstack/chameleon.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Configure with the following for a non-CUDA, MPI-enabled build:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd chameleon
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug \
         -DCMAKE_INSTALL_PREFIX=$HOME/gitrepos/hicma/profiling/chameleon/chameleon/build \
         -DCHAMELEON_USE_CUDA=OFF \
         -DCHAMELEON_USE_MPI=ON \
         -DFXT_DIR=/home/1/17M38101/software/fxt-0.3.8 \
         -DSTARPU_DIR=/home/1/17M38101/software/starpu-1.3.2-test \
         -DSTARPU_FIND_COMPONENTS=ON \
         -DCHAMELEON_ENABLE_TRACING=ON
make install
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Make sure that you’re using OpenMPI. For some reason Chameleon refuses to work with a StarPU
that has been compiled with Intel MPI.&lt;/p&gt;

&lt;h1 id=&quot;compiling-and-linking-your-programs&quot;&gt;Compiling and linking your programs&lt;/h1&gt;

&lt;p&gt;Make sure you have starpu and starpumpi configured in your pkg-config path. You can then
get the compiler flags with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pkg-config --cflags chameleon&lt;/code&gt; and linker flags with
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pkg-config --libs --static chameleon&lt;/code&gt;.&lt;/p&gt;
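&lt;p&gt;Put together, a build command might look like the following (the file name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;my_getrf.c&lt;/code&gt; is just a placeholder, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mpicc&lt;/code&gt; is assumed since this is an MPI-enabled build):&lt;/p&gt;

```shell
# Placeholder file names; mpicc assumed for an MPI-enabled Chameleon build.
mpicc $(pkg-config --cflags chameleon) my_getrf.c \
      $(pkg-config --libs --static chameleon) -o my_getrf
```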

&lt;h1 id=&quot;distributed-lu-factorization-implementations&quot;&gt;Distributed LU factorization implementations&lt;/h1&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chameleon_pzgetrf_nopiv(CHAM_desc_t*, RUNTIME_sequence_t *, RUNTIME_request_t *)&lt;/code&gt;
function is used for a distributed LU factorization using Chameleon with StarPU underneath. It
implements the right-looking variant of the LU factorization, a very common algorithm
made popular by ScaLAPACK.&lt;/p&gt;
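&lt;p&gt;For reference, the right-looking variant is easy to sketch in NumPy (a toy, no-pivoting version for illustration only; Chameleon’s implementation is tiled and task-based): each step factors a panel, solves a block row, and then immediately updates the entire trailing submatrix, which is what exposes many independent tile updates to the runtime:&lt;/p&gt;

```python
# NOTE: a toy, no-pivoting sketch of the right-looking algorithm,
# not Chameleon's implementation -- only the math is the same.
import numpy as np

def lu_nopiv(A, nb=2):
    """Blocked right-looking LU without pivoting.

    On return, the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U.
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # 1. Panel factorization: unblocked LU of the block column A[k:n, k:e].
        for j in range(k, e):
            A[j+1:n, j] /= A[j, j]
            A[j+1:n, j+1:e] -= np.outer(A[j+1:n, j], A[j, j+1:e])
        # 2. Block-row solve: U12 = inv(L11) @ A12.
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        A[k:e, e:n] = np.linalg.solve(L11, A[k:e, e:n])
        # 3. Right-looking update: the whole trailing submatrix is updated
        #    at once, exposing many independent tile-sized GEMM tasks.
        A[e:n, e:n] -= A[e:n, k:e] @ A[k:e, e:n]
    return A
```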

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chameleon_pzgetrf_incpiv()&lt;/code&gt; function is used for a distributed LU using a newer
LU algorithm presented in &lt;a href=&quot;&quot;&gt;this paper&lt;/a&gt;. It claims to have superior performance compared
to right-looking LU since communication and computation can be overlapped better.&lt;/p&gt;</content><author><name></name></author><summary type="html">In this post I will detail the steps I took to reproduce distributed LU factorization using the Chameleon library. It is a linear algebra library based on the StarPU runtime system.</summary></entry><entry><title type="html">Ruby wrappers for the XND project</title><link href="https://v0dro.in//blog/2019/09/08/ruby-wrappers-for-the-xnd-project/" rel="alternate" type="text/html" title="Ruby wrappers for the XND project" /><published>2019-09-08T09:12:00+00:00</published><updated>2019-09-08T09:12:00+00:00</updated><id>https://v0dro.in//blog/2019/09/08/ruby-wrappers-for-the-xnd-project</id><content type="html" xml:base="https://v0dro.in//blog/2019/09/08/ruby-wrappers-for-the-xnd-project/">&lt;!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-generate-toc again --&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#introduction&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#ndtypes&quot;&gt;Ndtypes&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#usage&quot;&gt;Usage&lt;/a&gt;
        &lt;ul&gt;
          &lt;li&gt;&lt;a href=&quot;#basic-initialization&quot;&gt;Basic initialization&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#concrete-vs-abstract-types&quot;&gt;Concrete Vs. Abstract Types&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#typedefs&quot;&gt;Typedefs&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#usage-via-the-c-api&quot;&gt;Usage via The C API&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#implementation&quot;&gt;Implementation&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#xnd&quot;&gt;Xnd&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#basic-usage&quot;&gt;Basic Usage&lt;/a&gt;
        &lt;ul&gt;
          &lt;li&gt;&lt;a href=&quot;#data-type-support&quot;&gt;Data Type Support&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#missing-values&quot;&gt;Missing Values&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#usage-via-the-c-api&quot;&gt;Usage via The C API&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#implementation&quot;&gt;Implementation&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#gumath&quot;&gt;Gumath&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#usage&quot;&gt;Usage&lt;/a&gt;
        &lt;ul&gt;
          &lt;li&gt;&lt;a href=&quot;#usage-via-the-c-api&quot;&gt;Usage via The C API&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#implementation&quot;&gt;Implementation&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#automatic-kernel-generation&quot;&gt;Automatic Kernel Generation&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;!-- markdown-toc end --&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Lack of stable and reliable scientific computing software has been a persistent problem
for the Ruby community, making it hard for enthusiastic Ruby developers to use Ruby in
everything from their web applications to their data analysis projects. One of the most important
components of any successful scientific software stack is a well maintained and flexible
array computation library that can act as a fast and simple way of storing in-memory data
and interfacing it with various fast and battle-tested libraries like LAPACK and BLAS.&lt;/p&gt;

&lt;p&gt;Various projects have attempted to make such libraries in the past (and some are still thriving
and maintained). Some of the notable ones are &lt;a href=&quot;https://github.com/ruby-numo&quot;&gt;numo&lt;/a&gt;, &lt;a href=&quot;https://github.com/SciRuby/nmatrix&quot;&gt;nmatrix&lt;/a&gt;, and more recently, &lt;a href=&quot;https://github.com/SciRuby/numruby&quot;&gt;numruby&lt;/a&gt;.
These projects attempt to provide a simple Ruby-like API for creating and manipulating arrays
of various types. All of them are able to easily interface with libraries like ATLAS, FFTW
and LAPACK.&lt;/p&gt;

&lt;p&gt;However, all of the above projects fall short in two major aspects:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Lack of extensibility to adapt to modern use cases (read Machine Learning).&lt;/li&gt;
  &lt;li&gt;Lack of a critical mass of developers to maintain a robust and fast array library.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first problem is mainly due to the fact that they do not support very robust type systems.
The available data types are limited and are hard to extend to more complex use cases. Modern use cases like
Machine Learning require much more robust data types, as has been demonstrated by the tensor
implementations of various frameworks like Tensorflow and PyTorch.&lt;/p&gt;

&lt;p&gt;The second problem is due to the fact that all of the aforementioned projects are community
efforts that are maintained part-time by developers simply out of a sense of purpose and
passion. Sustaining such complex projects for extended periods of time without expectation
of any support is simply unfeasible even for the most driven engineers.&lt;/p&gt;

&lt;p&gt;This is where the XND project comes in. The &lt;a href=&quot;https://xnd.io/&quot;&gt;XND project&lt;/a&gt; is a project for
building a common library that is able to meet the needs of the various data analysis and
machine learning frameworks that have had to build their own array objects and programming 
languages. It is built with the premise of extending arrays with new types and various 
device types (CPUs, GPUs etc.) without loss of performance and ease of use.&lt;/p&gt;

&lt;p&gt;The XND project as a whole is a product of three C libraries : ndtypes, xnd and gumath. They
have been made such that they can work as standalone C libraries that can be interfaced
with any language binding (currently supporting Ruby and Python). Ndtypes is used for defining
the shape of data within memory, XND is a data container that holds that data and gumath provides
a multiple dispatch mechanism for performing computations on data held in XND containers. We will
elaborate on each of these in the post below.&lt;/p&gt;

&lt;p&gt;The XND project presents the perfect answer to Ruby’s lack of a mature array computation ecosystem. 
It is highly extensible, allows defining data types in almost any combination with a simple and
intuitive interface, is built with performance in mind and is backed by a team consisting of 
experts who have vast experience in this domain for the Python scientific computing stack.&lt;/p&gt;

&lt;p&gt;The biggest backer of XND as of now is Quansight, and I as a part-time engineer am responsible 
for maintaining the Ruby wrapper for XND. This post is a rather long and detailed introduction 
to the XND ruby wrapper including various use cases and benchmarks. There will also be some
details on the implementation of the wrapper and how it differs from the python wrapper (which
existed before the Ruby wrapper). Read on for further details.&lt;/p&gt;

&lt;p&gt;All the source code can be found in the &lt;a href=&quot;https://github.com/xnd-project/xnd-ruby&quot;&gt;xnd-ruby&lt;/a&gt; repo.&lt;/p&gt;

&lt;h1 id=&quot;ndtypes&quot;&gt;Ndtypes&lt;/h1&gt;

&lt;p&gt;Ndtypes is the library that is used for defining the shape of data.&lt;/p&gt;

&lt;p&gt;Run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gem install ndtypes --pre&lt;/code&gt; for easily installing ndtypes onto your machine. It has
been tested with Ruby 2.4.1 so far. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gem install&lt;/code&gt; will download the C sources and compile
them by itself.&lt;/p&gt;

&lt;h2 id=&quot;usage&quot;&gt;Usage&lt;/h2&gt;

&lt;h3 id=&quot;basic-initialization&quot;&gt;Basic initialization&lt;/h3&gt;

&lt;p&gt;The ndtypes Ruby wrapper provides a simple interface to the ndtypes C library for creating
complex data shapes with extreme simplicity. For example, for creating an array of 10 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int64&lt;/code&gt;
digits, all we need to do is create an instance of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NDT&lt;/code&gt; class:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;NDT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;10 * int64&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Not only can you create arrays, but also very complex types, for example a nested record (xnd 
terminology for a Ruby &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Hash&lt;/code&gt;) with the values as arrays of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float32&lt;/code&gt; of size 25 each:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;NDT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;{x: 25 * float32, y: {a: 25 * float64, 25 * float64}}&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;concrete-vs-abstract-types&quot;&gt;Concrete Vs. Abstract Types&lt;/h3&gt;

&lt;p&gt;Ndtypes distinguishes types depending on whether they are abstract or concrete. Abstract types
can have symbolic values like dimension or type variables and are used for type checking. Concrete 
types additionally have full memory layout information like alignment and data size.&lt;/p&gt;

&lt;p&gt;Some operations can only be performed on abstract types.&lt;/p&gt;

&lt;h3 id=&quot;typedefs&quot;&gt;Typedefs&lt;/h3&gt;

&lt;p&gt;One can also define typedefs using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NDT#typedef&lt;/code&gt; function and then use them in place of
the original type. Here’s an example of using typedefs to define a graph type:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;no&quot;&gt;NDT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;node&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;int32&quot;&lt;/span&gt;
&lt;span class=&quot;no&quot;&gt;NDT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;cost&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;int32&quot;&lt;/span&gt;
&lt;span class=&quot;no&quot;&gt;NDT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;graph&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;var * var * (node, cost)&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;usage-via-the-c-api&quot;&gt;Usage via The C API&lt;/h3&gt;

&lt;p&gt;Most of the C API functions of ndtypes deal with creating &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NDT&lt;/code&gt; Ruby objects or obtaining
internal struct data of an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NDT&lt;/code&gt; Ruby object. The complete specification can be found
in the &lt;a href=&quot;https://github.com/xnd-project/xnd-ruby/blob/master/ndtypes/ext/ruby_ndtypes/ruby_ndtypes.h&quot;&gt;ruby_ndtypes.h&lt;/a&gt; file. This is the file you should include if you want to use the
C API in any of your libraries.&lt;/p&gt;

&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;/h2&gt;

&lt;p&gt;The Ruby wrapper is a wrapper over the libndtypes library. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NDT&lt;/code&gt; Ruby object is a wrapper
over a C struct of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NdtObject&lt;/code&gt; that has the following definition:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NdtObject&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ndt_t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;                   &lt;span class=&quot;cm&quot;&gt;/* type */&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NdtObject&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This simple struct stores a pointer to a struct of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;const ndt_t *&lt;/code&gt; that is provided
by libndtypes for representing an ndtype. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;const ndt_t&lt;/code&gt; structs are allocated by
various libndtypes functions like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndt_from_string()&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndt_alloc()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Internally libndtypes uses a reference counting mechanism for keeping track of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndt_t&lt;/code&gt; allocations
that need to be destroyed. The reference count can be incremented using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndt_incref()&lt;/code&gt; or
decremented using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndt_decref()&lt;/code&gt;. Once the refcount reaches &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt;, the object is automatically
destroyed by libndtypes. Of course, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndt_t&lt;/code&gt; structs allocated via calls to functions like
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndt_alloc()&lt;/code&gt; already come with a refcount of 1.&lt;/p&gt;

&lt;h1 id=&quot;xnd&quot;&gt;Xnd&lt;/h1&gt;

&lt;p&gt;XND is the main storage library of the project. It uses types defined by ndtypes for defining
the shape of data and allows users to read and write data into buffers that are of the shape
of the data passed to it by ndtypes. It is responsible for maintaining the memory consistency
of data and has provisions for various operations such as slicing, copying and interfacing
data with 3rd party libraries like Apache Arrow. It also serves as a memory buffer for the
functions that are defined within gumath (explained later in this post).&lt;/p&gt;

&lt;p&gt;Similar to the ndtypes wrapper, the xnd Ruby wrapper can be installed with a call to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gem install xnd --pre&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;basic-usage&quot;&gt;Basic Usage&lt;/h2&gt;

&lt;p&gt;The xnd Ruby wrapper is extremely simple to use and provides a single class &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; for the
user that interfaces with libxnd. In the simplest case, one can create an XND object as follows:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;XND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;XND:47340720296980&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 type= 4 * int64&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 value= [1, 2, 3, 4]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Since we have not specified the data type, it will be inferred as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int64&lt;/code&gt; since we are supplying
an array composed entirely of integers. This can be seen using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND#dtype&lt;/code&gt; function, which
will return the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NDT&lt;/code&gt; object that holds the type of this &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; object:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;dtype&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;NDTypes:47340721833280&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	int64 &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;While &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND#dtype&lt;/code&gt; gives the general type of the object, a more precise description of the data
type (including shape etc.) can be obtained using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;type&lt;/code&gt; method:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;type&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;NDTypes:47340721846240&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#  4 * int64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The value within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; object can be obtained as a Ruby Array (or a Hash if it is an NDT record)
using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND#value&lt;/code&gt; method:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;value&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; [1, 2, 3, 4]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We can also check equality between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; objects using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;==&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!=&lt;/code&gt; operators:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;XND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;A nice thing about &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; is that it returns copy-free ‘views’ of data when you perform a slicing
operation. So say we define a 2D tensor &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tensor_2d&lt;/code&gt; like this:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;tensor_2d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;XND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tensor_2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;inspect&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;XND:47340720946720&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 type= 5 * 5 * int64&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 value= [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We can obtain a slice (say, the 2nd column) of the tensor using a Ruby Range. Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INF&lt;/code&gt; is shorthand for the entire axis (usually denoted as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0..-1&lt;/code&gt;):&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;vector_view&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tensor_2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;INF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;XND:47340720380980&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 type= 5 * int64&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 value= [3, 3, 3, 3, 3]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;When using slices, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; will always return a ‘view’ of the original &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; object. Changes
made to this slice will reflect on the original XND object as well:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;vector_view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;666&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tensor_2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;inspect&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;XND:47340720946720&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 type= 5 * 5 * int64&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 value= [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 666, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;However, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;type&lt;/code&gt; of the view and the original object differ as they should:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;vector_view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;type&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;NDTypes:47340720381100&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	5 * int64 &lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tensor_2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;type&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;NDTypes:47340720939360&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	5 * 5 * int64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;If you want separate storage for the view (i.e. you do not want changes to the view
to reflect on the parent object), use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND#dup&lt;/code&gt; method to make a copy. You
can also allocate a data container without storing any data in it using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND.empty&lt;/code&gt;
method, as the examples in the following sections show.&lt;/p&gt;
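&lt;p&gt;The view-versus-copy distinction can be illustrated with plain Ruby objects (no xnd required): assigning an object aliases it, while &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dup&lt;/code&gt; creates independent storage, analogous to slicing an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; object versus calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND#dup&lt;/code&gt;:&lt;/p&gt;

```ruby
# Plain-Ruby analogy (not xnd itself): assignment aliases, dup copies.
row = [1, 2, 3, 4, 5]

view = row        # same storage, like an XND slice
view[2] = 666
row               # change is visible through the parent

copy = row.dup    # independent storage, like XND#dup
copy[0] = 0
row               # parent unchanged by writes to the copy
```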

&lt;h3 id=&quot;data-type-support&quot;&gt;Data Type Support&lt;/h3&gt;

&lt;p&gt;Thanks to the flexibility of the ndtypes type definition interface, xnd can provide
type support for far more flexible data shapes than arrays with fixed
dimensions. For example, you can use records for storing Ruby Hashes and performing operations
on them:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'xnd'&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;XND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;empty&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;{x: complex64, y: bytes, z: string}&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'x'&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'y'&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;abc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'z'&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;any&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'x'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'x'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'y'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'y'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'z'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'z'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;XND:47340721378580&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 type= {x : complex64, y : bytes, z : string}&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 value= {&quot;x&quot;=&amp;gt;(1.0+20.0i), &quot;y&quot;=&amp;gt;&quot;abc&quot;, &quot;z&quot;=&amp;gt;&quot;any&quot;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;missing-values&quot;&gt;Missing Values&lt;/h3&gt;

&lt;p&gt;XND also supports optional data (represented by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nil&lt;/code&gt;). It can be created as follows:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;XND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;empty&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;2 * 4 * ?float64&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;10.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;2.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;100.12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;6.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;7.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;INF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# assign full slice&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; [[10.0, nil, 2.0, 100.12], [nil, nil, 6.0, 7.0]] &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;usage-via-the-c-api-1&quot;&gt;Usage via The C API&lt;/h3&gt;

&lt;p&gt;The primary purpose of the XND Ruby C API is creating and querying XND Ruby objects.
The full API can be found in the &lt;a href=&quot;https://github.com/xnd-project/xnd-ruby/blob/master/xnd/ext/ruby_xnd/ruby_xnd.h&quot;&gt;ruby_xnd.h&lt;/a&gt; file.&lt;/p&gt;

&lt;h2 id=&quot;implementation-1&quot;&gt;Implementation&lt;/h2&gt;

&lt;p&gt;The implementation of the Ruby wrapper differs from the Python wrapper largely due to the nature
of the garbage collection algorithms employed by the two languages: Ruby uses a mark-and-sweep
GC, while Python primarily uses reference counting. Ruby objects created within the C extension
therefore have to be kept ‘alive’ somehow, so that the GC does not deallocate them thinking that they
have gone out of scope and are no longer useful.&lt;/p&gt;

&lt;p&gt;For this purpose we utilize a ‘GC guard’ structure (inspired by the implementation of 
&lt;a href=&quot;https://github.com/mrkn/&quot;&gt;@mrkn&lt;/a&gt;’s &lt;a href=&quot;https://github.com/mrkn/matplotlib.rb&quot;&gt;matplotlib.rb&lt;/a&gt; gem). The GC guard is essentially a global Ruby Hash whose keys are the
Ruby objects created within the C extension; the values are irrelevant. We use a Hash because it
provides lookups in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;O(1)&lt;/code&gt; time, and all we need is to keep the object in some kind of global store so that Ruby is
aware of its presence (in the case of NDT we use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;true&lt;/code&gt; for the value). XND uses three different
GC guards for various internal objects, which can be found in the &lt;a href=&quot;https://github.com/xnd-project/xnd-ruby/blob/master/xnd/ext/ruby_xnd/gc_guard.h&quot;&gt;gc_guard.h&lt;/a&gt; file.&lt;/p&gt;
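&lt;p&gt;A minimal plain-Ruby sketch of the idea (hypothetical names; the real guard is implemented in C against the Ruby C API):&lt;/p&gt;

```ruby
# Hypothetical sketch of a GC guard: a global Hash holds a strong
# reference to each registered object, so the GC cannot collect it.
GC_GUARD = {}

def guard_register(obj)
  GC_GUARD[obj] = true    # the value is irrelevant; the key keeps obj alive
end

def guard_unregister(obj)
  GC_GUARD.delete(obj)    # dropping the key makes obj collectable again
end
```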

&lt;h1 id=&quot;gumath&quot;&gt;Gumath&lt;/h1&gt;

&lt;p&gt;While ndtypes and xnd allow us to define types and memory storage, gumath allows us to actually
do something with them. gumath is a library for defining functions over the various data types
stored within an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; object; the user calls them transparently
through a high-level interface that uses multiple dispatch to invoke the relevant function
on the appropriate type. The Ruby interface is a wrapper over the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libgumath&lt;/code&gt; C library.&lt;/p&gt;
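&lt;p&gt;The dispatch mechanism can be sketched in plain Ruby (a hypothetical, simplified model; real libgumath matches full ndtypes signatures): a table maps a type signature to the kernel implementing the function for that type:&lt;/p&gt;

```ruby
# Hypothetical sketch of multiple dispatch: pick the kernel whose
# type signature matches the arguments' types.
KERNELS = {
  ["int64", "int64"]     => lambda { |a, b| a.zip(b).map { |p, q| p * q } },
  ["float64", "float64"] => lambda { |a, b| a.zip(b).map { |p, q| p * q } }
}

def dispatch(sig, a, b)
  kernel = KERNELS.fetch(sig) { raise "no kernel for #{sig.inspect}" }
  kernel.call(a, b)
end

dispatch(["int64", "int64"], [1, 2, 3], [4, 5, 6])
# => [4, 10, 18]
```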

&lt;p&gt;Some functions (known as kernels) come bundled with libgumath and others can be written fairly
easily. Similar to the xnd and ndtypes wrappers, the gumath Ruby wrapper can be installed with a call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gem install gumath --pre&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;usage-1&quot;&gt;Usage&lt;/h2&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Gumath&lt;/code&gt; class is a top-level namespace for various modules that in turn serve as namespaces for
the functions bundled with the libgumath C library. These modules will keep expanding
as more interfaces are added to libgumath. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Gumath::Functions&lt;/code&gt; module contains the
functions that libgumath provides by default.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Gumath&lt;/code&gt; functions accept &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; objects as arguments and output &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; objects with the result
of the function. An example of a simple element-wise multiply kernel is the following:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'xnd'&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'gumath'&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;XND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;dtype: &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;float64&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;XND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;dtype: &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;float64&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;Gumath&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;Functions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;multiply&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;XND:47340721458320&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 type= 8 * float64&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 value= [2.0, 6.0, 12.0, 20.0, 30.0, 42.0, 56.0, 72.0]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
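&lt;p&gt;The result above is just the element-wise product, which can be verified in plain Ruby (no gumath required):&lt;/p&gt;

```ruby
# Plain-Ruby check of the element-wise product computed by the kernel.
x = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
z = x.zip(y).map { |a, b| a * b }
# => [2.0, 6.0, 12.0, 20.0, 30.0, 42.0, 56.0, 72.0]
```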

&lt;h3 id=&quot;usage-via-the-c-api-2&quot;&gt;Usage via The C API&lt;/h3&gt;

&lt;p&gt;Since the main purpose of the gumath C API is to allow adding kernels to a Ruby module,
it provides a single function with the following prototype:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;rb_gumath_add_functions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;VALUE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gm_tbl_t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tbl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;module&lt;/code&gt; parameter is a Ruby object, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tbl&lt;/code&gt; is a function table of gumath kernels.&lt;/p&gt;

&lt;h2 id=&quot;implementation-2&quot;&gt;Implementation&lt;/h2&gt;

&lt;p&gt;Compared to xnd and ndtypes, the gumath Ruby wrapper is much simpler
since its primary function is to take functions from libgumath and add them as module 
functions to Ruby modules.&lt;/p&gt;

&lt;p&gt;When the library is first loaded with a call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;require&lt;/code&gt;, the libgumath kernels
provided by default are loaded into the Ruby interpreter by interfacing each kernel with a
Ruby object. Further details on how method dispatch works within Ruby can be found
in the &lt;a href=&quot;https://github.com/xnd-project/xnd-ruby/blob/master/gumath/CONTRIBUTING.md&quot;&gt;CONTRIBUTING&lt;/a&gt; file.&lt;/p&gt;

&lt;p&gt;The most important part of the C implementation is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GufuncObject&lt;/code&gt; Ruby class,
defined within the C API, which interfaces with a single gumath function. It is
essentially a wrapper over a C struct of the same name that can be found in the &lt;a href=&quot;https://github.com/xnd-project/xnd-ruby/blob/master/gumath/ext/ruby_gumath/gufunc_object.h&quot;&gt;gufunc_object.h&lt;/a&gt;
file.&lt;/p&gt;

&lt;p&gt;The struct has the following definition:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gm_tbl_t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;          &lt;span class=&quot;cm&quot;&gt;/* kernel table */&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;                     &lt;span class=&quot;cm&quot;&gt;/* function name */&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flags&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;                 &lt;span class=&quot;cm&quot;&gt;/* memory target */&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;VALUE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;identity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;                 &lt;span class=&quot;cm&quot;&gt;/* identity element */&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GufuncObject&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table&lt;/code&gt; pointer points to the function’s definition within libgumath, which
holds the information that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gm_apply&lt;/code&gt; uses for making the actual call to
the function with the data. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;name&lt;/code&gt; is a string holding the name of the function. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;flags&lt;/code&gt;
signifies whether the function is a CPU function or a CUDA function (or targets any
other device that might be added in the future). &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;identity&lt;/code&gt; is a Ruby object used for identifying
this function; it is initially set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nil&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;automatic-kernel-generation&quot;&gt;Automatic Kernel Generation&lt;/h2&gt;

&lt;p&gt;Writing kernels can be painstaking if you’re not familiar with the various functionalities
that libgumath provides for this purpose. Therefore we also provide a kernel generator
called &lt;a href=&quot;https://xnd.readthedocs.io/en/latest/xndtools/index.html#kernel-generator&quot;&gt;xndtools&lt;/a&gt; that allows writing gumath kernels by simply providing the function
that needs to be wrapped. However, this functionality has not yet been tested for Ruby.&lt;/p&gt;</content><author><name></name></author><summary type="html">Table of Contents</summary></entry><entry><title type="html">Making research posters using latex and emacs.</title><link href="https://v0dro.in//blog/2018/11/15/making-research-posters-using-latex-and-emacs/" rel="alternate" type="text/html" title="Making research posters using latex and emacs." /><published>2018-11-15T07:54:25+00:00</published><updated>2018-11-15T07:54:25+00:00</updated><id>https://v0dro.in//blog/2018/11/15/making-research-posters-using-latex-and-emacs</id><content type="html" xml:base="https://v0dro.in//blog/2018/11/15/making-research-posters-using-latex-and-emacs/">&lt;p&gt;Using the a0poster package, you can use LaTeX for designing research posters.&lt;/p&gt;

&lt;h1 id=&quot;helpful-commands&quot;&gt;Helpful commands&lt;/h1&gt;

&lt;h2 id=&quot;various-programming-constructs&quot;&gt;Various programming constructs&lt;/h2&gt;

&lt;h3 id=&quot;variables&quot;&gt;Variables&lt;/h3&gt;

&lt;p&gt;Add variables or new commands using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\newcommand&lt;/code&gt;. For example &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\newcommand{\sidel}{6}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Link: &lt;a href=&quot;https://stackoverflow.com/questions/1211888/is-there-any-way-i-can-define-a-variable-in-latex&quot;&gt;Is there any way I can define a variable in LaTeX? (Stack Overflow)&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;adding-a-title-to-your-poster&quot;&gt;Adding a title to your poster&lt;/h2&gt;

&lt;p&gt;Simply use a separate set of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minipage&lt;/code&gt; elements and put the title within those.&lt;/p&gt;
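&lt;p&gt;A minimal sketch of this (the title text and sizes are placeholders; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\VeryHuge&lt;/code&gt; is one of the extra font sizes provided by a0poster):&lt;/p&gt;

```latex
% Hypothetical sketch: a full-width minipage holding the poster title.
\begin{minipage}{\textwidth}
  \centering
  {\VeryHuge \textbf{My Poster Title}} \\[0.5cm]
  {\Large Author Name \quad Institution Name}
\end{minipage}
```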

&lt;h2 id=&quot;changing-fonts-of-the-section-headings&quot;&gt;Changing fonts of the section headings&lt;/h2&gt;

&lt;p&gt;Use the titlesec package for this
&lt;a href=&quot;https://tex.stackexchange.com/questions/59726/change-size-of-section-subsection-subsubsection-paragraph-and-subparagraph-ti&quot;&gt;purpose&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A typical configuration for titlesec looks like so:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\usepackage{titlesec}

\titleformat*{\section}{\LARGE\bfseries}
\titleformat*{\subsection}{\Large\bfseries}
\titleformat*{\subsubsection}{\large\bfseries}
\titleformat*{\paragraph}{\large\bfseries}
\titleformat*{\subparagraph}{\large\bfseries}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;overlaying-text-on-an-image&quot;&gt;Overlaying text on an image&lt;/h2&gt;

&lt;p&gt;Use the ‘overpic’ package for this. See this link for details:
https://tex.stackexchange.com/questions/20792/how-to-superimpose-latex-on-a-picture&lt;/p&gt;

&lt;p&gt;Overpic full docs: http://mirrors.ibiblio.org/CTAN/macros/latex/contrib/overpic/overpic.pdf&lt;/p&gt;
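 
&lt;p&gt;A minimal sketch of overpic usage, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;figure&lt;/code&gt; is a placeholder image name. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grid&lt;/code&gt; option overlays a coordinate grid that helps you find the right placement, and can be removed once done:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\usepackage{overpic}

% \put coordinates are percentages of the image width by default.
\begin{overpic}[width=0.8\linewidth,grid,tics=10]{figure}
  \put(20,30){\Large some label}
\end{overpic}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;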

&lt;h2 id=&quot;drawing-boxes-filled-with-colors&quot;&gt;Drawing boxes filled with colors&lt;/h2&gt;

&lt;p&gt;Simply define a command &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;crule&lt;/code&gt; based on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rule&lt;/code&gt; command. Definition and usage:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\newcommand\crule[3][black]{\textcolor{#1}{\rule{#2}{#3}}}

\crule{1cm}{1cm} \crule[blue]{1cm}{1cm} \crule[red!50!white!100]{1cm}{1cm}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Link:
https://tex.stackexchange.com/questions/106984/how-to-draw-a-square-of-1cm-in-latex-filled-with-color&lt;/p&gt;

&lt;h2 id=&quot;splitting-into-multiple-rows-and-columns&quot;&gt;Splitting into multiple rows and columns&lt;/h2&gt;

&lt;p&gt;Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minipage&lt;/code&gt; for splitting a document into boxes is recommended. The optional first argument controls the vertical alignment of the minipage: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; is top-aligned and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b&lt;/code&gt; is
bottom-aligned.&lt;/p&gt;
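 
&lt;p&gt;For example, two top-aligned columns placed side by side (a sketch):&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\noindent
\begin{minipage}[t]{0.48\linewidth}
  \section*{Left column}
  Content for the left box.
\end{minipage}
\hfill
\begin{minipage}[t]{0.48\linewidth}
  \section*{Right column}
  Content for the right box.
\end{minipage}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;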

&lt;h2 id=&quot;graphics-with-tikz&quot;&gt;Graphics with tikz&lt;/h2&gt;

&lt;h3 id=&quot;lu-decomposition-diagram&quot;&gt;LU decomposition diagram&lt;/h3&gt;

&lt;p&gt;Drawing two boxes with L &amp;amp; U:
https://tex.stackexchange.com/questions/317230/lu-factorization-of-a-matrix-with-plot?newreg=991a708140a2446882fdd9bd3c445af9&lt;/p&gt;

&lt;p&gt;Drawing an arrow between tikzpicture objects:
https://tex.stackexchange.com/questions/260587/an-arrow-between-two-tikzpictures&lt;/p&gt;
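 
&lt;p&gt;A self-contained sketch of the idea from those links, drawing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; as a square and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U&lt;/code&gt; as triangles inside a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tikzpicture&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\begin{tikzpicture}
  % The full matrix A as a square.
  \draw (0,0) rectangle (2,2);
  \node at (1,1) {$A$};
  % Arrow pointing to the factorization.
  \draw[-&amp;gt;,thick] (2.5,1) -- (3.5,1);
  % Lower-triangular factor L.
  \draw (4,2) -- (4,0) -- (6,0) -- cycle;
  \node at (4.6,0.6) {$L$};
  % Upper-triangular factor U.
  \draw (6.5,2) -- (8.5,2) -- (8.5,0) -- cycle;
  \node at (7.9,1.4) {$U$};
\end{tikzpicture}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;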

&lt;h3 id=&quot;dependency-graphs&quot;&gt;Dependency graphs&lt;/h3&gt;

&lt;p&gt;Inspiration can be taken from this state machine &lt;a href=&quot;http://www.texample.net/tikz/examples/state-machine/&quot;&gt;tutorial&lt;/a&gt;
for drawing dependency graphs. Basically put things inside a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tikzpicture&lt;/code&gt; block. Use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\node&lt;/code&gt;
command for defining a node and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\path&lt;/code&gt; command for connecting these nodes.&lt;/p&gt;
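 
&lt;p&gt;A minimal sketch of a three-node dependency graph along those lines:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\begin{tikzpicture}[every node/.style={draw,circle},node distance=3cm]
  % Define the nodes.
  \node (a) {A};
  \node (b) [right of=a] {B};
  \node (c) [below of=a] {C};
  % Connect them with directed edges.
  \path[-&amp;gt;] (a) edge (b)
            (a) edge (c)
            (b) edge (c);
\end{tikzpicture}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;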

&lt;h3 id=&quot;drawing-things-on-pictures&quot;&gt;Drawing things on pictures&lt;/h3&gt;

&lt;p&gt;Using tikz one can annotate pictures with various shapes and labels.&lt;/p&gt;

&lt;p&gt;Link:
https://tex.stackexchange.com/questions/9559/drawing-on-an-image-with-tikz&lt;/p&gt;
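 
&lt;p&gt;Roughly, the pattern from that link: put the image inside a node, then draw in a scope whose unit vectors span the image, so &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(0,0)&lt;/code&gt;–&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(1,1)&lt;/code&gt; covers the whole picture (a sketch; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;photo&lt;/code&gt; is a placeholder):&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\begin{tikzpicture}
  % Place the image as a node anchored at the origin.
  \node[anchor=south west,inner sep=0] (img) at (0,0)
    {\includegraphics[width=0.6\linewidth]{photo}};
  % Scale coordinates so (0,0)--(1,1) spans the image.
  \begin{scope}[x={(img.south east)},y={(img.north west)}]
    \draw[red,very thick] (0.2,0.3) rectangle (0.45,0.55);
    \node[red] at (0.7,0.8) {a label};
  \end{scope}
\end{tikzpicture}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;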

&lt;h1 id=&quot;useful-links&quot;&gt;Useful links&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;Getting started PDF: https://www.tug.org/pracjourn/2008-3/morales/morales.pdf&lt;/li&gt;
  &lt;li&gt;Very good sample template poster: https://www.latextemplates.com/template/a0poster-portrait-poster&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><summary type="html">Using the a0poster package, you can use LaTeX to design research posters.</summary></entry></feed>