PyTorch TensorIterator Internals
Introduction
PyTorch is one of the leading frameworks for deep learning. Its core data structure is the Tensor, a multi-dimensional array implementation with many advanced features such as auto-differentiation. PyTorch is a massive codebase (approx. 12 GB of files and about a million lines of C++, Python and CUDA code), and having a method for iterating over tensors efficiently, independently of data type, dimension, striding and hardware, is a critical feature that greatly simplifies the codebase and makes distributed development faster and smoother. The TensorIterator C++ class within PyTorch is a complex yet useful class that is used for iterating over the elements of a tensor over any dimension and for implicitly parallelizing various operations in a device-independent manner. It does this through a C++ API that is independent of the type and device of the tensor, freeing the programmer from having to worry about the data type or device when writing iteration logic for PyTorch tensors. For those coming from the NumPy universe, NpyIter is a close cousin of TensorIterator.
This post is a deep dive into how TensorIterator works, and is an essential part of learning to contribute to the PyTorch codebase, since iterations over tensors in the C++ codebase are extremely commonplace. This post is aimed at someone who wants to contribute to PyTorch, and you should at least be familiar with some of the basic terminology of the PyTorch codebase, which can be found in Edward Yang's excellent blog post on PyTorch internals.
Although TensorIterator can be used for both CPUs and accelerators, this post has been written with CPU usage in mind. While there can be some dissimilarities between the two, the overall concepts are the same.
History of TensorIterator
TH iterators
TensorIterator was devised to simplify the implementation of PyTorch's tensor operations over the TH implementation. TH uses preprocessor macros, rather than C++ templates, to write type-independent loops over tensors. For example, consider this simple TH loop for computing the product of all the numbers in a particular dimension (find the code here):
TH_TENSOR_DIM_APPLY2(scalar_t, t, scalar_t, r_, dimension,
                     accreal prod = 1;
                     int64_t i;
                     for(i = 0; i < t_size; i++)
                       prod *= t_data[i*t_stride];
                     *r__data = (scalar_t)prod;
);
The above loop works by following a particular convention for the naming of types and variables. You specify the input and output types of your tensors in the first and third arguments. scalar_t is a type that can generically be used for denoting a PyTorch scalar type such as float, double, long, etc. Internally, PyTorch compiles the file multiple times with different definitions of scalar_t (i.e. for different data types like float, int, etc.). The input and output tensors are specified in the second and fourth arguments (in this case t and r_), and the dimension that we want to iterate over is specified as the fifth argument (dimension).
We then follow these arguments with the main body of the iterator (which is accepted as the sixth argument of the macro), and denote the data, stride and size of the particular tensor dimension by using variables suffixed with _data, _stride and _size after the variable name that represents the tensor inside the iterator body. For example, the size of the input tensor is denoted as t_size in the above example and the pointer to the data of the output tensor is denoted as r__data. The accreal in the second line is a custom type that denotes a real number used as an accumulator (in this case for accumulating the product).
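To see how the naming convention generalizes, here is a hypothetical sketch (not code from the TH sources) of a loop that would sum a dimension instead, following the same conventions: t is the input, r_ is the output, and dimension selects the dimension being iterated over.

// Hypothetical sketch following the TH_TENSOR_DIM_APPLY2 conventions:
// t_data/t_stride/t_size are generated from the input tensor name `t`,
// and r__data from the output tensor name `r_`.
TH_TENSOR_DIM_APPLY2(scalar_t, t, scalar_t, r_, dimension,
                     accreal sum = 0;
                     int64_t i;
                     for(i = 0; i < t_size; i++)
                       sum += t_data[i*t_stride];
                     *r__data = (scalar_t)sum;
);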
Internally, the TH_TENSOR_DIM_APPLY2 macro expands to generate various dispatch calls depending on the type of the tensor that needs to be iterated over. The implementation of TH_TENSOR_DIM_APPLY2 can be found here.
Limitations of TH iterators
Apart from the obvious complication of maintaining a codebase that depends on such complex macro expansions, TH iterators have some fundamental shortcomings. For one thing, they cannot be used for writing iterators in a device-independent manner: you need separate iterators for CPU and CUDA. Also, parallelization does not happen implicitly inside the iterator; you need to write the parallel looping logic yourself. Moreover, at a deeper level, TH iterators do not collapse the dimensions of the tensor (as we'll see later in this post), leading to looping that might not be as cache-optimized as possible. These limitations led to the creation of TensorIterator, which is used by the ATen tensor implementation to overcome the shortcomings of the previous TH iterators.
Basics of TensorIterator
A TensorIterator can be created using the default constructor. You must then add the tensors that you want as inputs or outputs. A good example can be found in the TensorIterator::binary_op() method, which allows you to create TensorIterator objects for performing point-wise binary operations between two tensors. The important parts look like this:
auto iter = TensorIterator();
iter.add_output(out);
iter.add_input(a);
iter.add_input(b);
iter.build();
As you can see, you add a tensor called out as the output tensor, and a and b as the input tensors. Calling build() is then mandatory for creating the object and letting the class perform other optimizations like collapsing dimensions.
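In practice, you would typically call the binary_op() helper itself rather than assembling the iterator by hand. A minimal sketch, assuming the helper's signature at the time of writing, might look like this:

// Hedged sketch: build the same iterator through the binary_op() helper.
// The exact signature may differ between PyTorch versions.
auto iter = TensorIterator::binary_op(out, a, b);
// `iter` is now ready to be handed to for_each() or a kernel (see below).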
Performing iterations
Broadly, iterations using TensorIterator can be classified as point-wise iterations or reduction iterations. This distinction plays a fundamental role in how iterations using TensorIterator are parallelized: point-wise iterations can be freely parallelized along any dimension and grain size, while reduction operations have to be parallelized either along dimensions that you're not iterating over or by performing bisect-and-reduce operations along the dimension being iterated over. Parallelization can also happen using vectorized operations.
Iteration details
The simplest iteration can be performed using the for_each function. This function has two overloads: one takes a function object that iterates over a single dimension (loop_t), and the other takes a function object that iterates over two dimensions simultaneously (loop2d_t). Find their definitions here. The simplest way of using for_each is to pass it a lambda of type loop_t (or loop2d_t). A code snippet using it this way looks like this:
auto iter = TensorIterator();
iter.add_output(out);
iter.add_input(a);
iter.dont_resize_outputs();        // call if out is already allocated.
iter.dont_compute_common_dtype();  // call if inputs/outputs are of different types.
iter.build();

auto loop = [&](char** data, const int64_t* strides, int64_t n) {
  auto* out_data_bytes = data[0];
  auto* in_data_bytes = data[1];
  // Assume a float data type for this example.
  for (int64_t i = 0; i < n; i++) {
    *reinterpret_cast<float*>(out_data_bytes) +=
        *reinterpret_cast<float*>(in_data_bytes);
    out_data_bytes += strides[0];
    in_data_bytes += strides[1];
  }
};

iter.for_each(loop);
In the above example, char** data gives a pointer to the data of each tensor, in the same order in which you specified them when building the iterator. Note that in order to keep the implementation agnostic of any particular data type, the pointers are always handed to you typecast to char* (think of them as bunches of bytes). The second argument, const int64_t* strides, is an array containing the stride (in bytes) of each tensor in the dimension being iterated over. We add this stride to the pointer in order to reach the next element of the tensor. The last argument, int64_t n, is the size of the dimension being iterated over.
for_each implicitly parallelizes the operation by executing loop in parallel if the number of iterations is more than internal::GRAIN_SIZE, a value that is determined to be the 'right amount' of data to iterate over in order to gain a significant speedup from multi-threaded execution. If you want to explicitly specify that your operation must run serially, use the serial_for_each loop instead.
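As a minimal sketch, assuming the Range-based overload of serial_for_each, running the loop lambda from the snippet above serially over the entire iteration range could look like this:

// Run `loop` serially over all elements; {0, iter.numel()} is the Range of
// linear indices to cover (an assumption about the overload being used).
iter.serial_for_each(loop, {0, iter.numel()});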
Using kernels for iterations
Frequently we want to create a kernel that applies a simple point-wise function to entire tensors. TensorIterator provides various such generic kernels that can be used for iterating over the elements of a tensor without having to worry about the strides, the data types of the operands, or the details of the parallelism. For example, say we want to build a function that performs the point-wise addition of two tensors and stores the result in a third tensor; we can use the cpu_kernel function. Note that in this example we assume tensors of float, but you can use the AT_DISPATCH_ALL_TYPES_AND2 macro to handle other data types.
TensorIterator iter;
iter.add_input(a_tensor);
iter.add_input(b_tensor);
iter.add_output(c_tensor);
iter.build();
cpu_kernel(iter, [] (float a, float b) -> float {
return a + b;
});
Writing the kernel in this way ensures that the value returned by the lambda passed to cpu_kernel populates the corresponding element of the target output tensor.
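To support data types other than float, the call to cpu_kernel is typically wrapped in a dispatch macro. A hedged sketch (the kernel name string "add_cpu" is just an illustrative label) might look like this:

AT_DISPATCH_ALL_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16,
    iter.dtype(), "add_cpu", [&]() {
  // scalar_t is defined by the macro for each dispatched data type.
  cpu_kernel(iter, [](scalar_t a, scalar_t b) -> scalar_t {
    return a + b;
  });
});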
Setting tensor iteration dimensions
The values of the sizes and strides determine which dimension of the tensor you iterate over. TensorIterator performs optimizations to make sure that at least most of the iteration happens on contiguous data in order to take advantage of hierarchical cache-based memory architectures (think dimension coalescing and reordering for maximum data locality). Since a multi-dimensional tensor has a separate stride value for each dimension, TensorIterator computes the strides that get passed into the loop by itself within the build() function. How exactly it computes the dimensions to iterate over is something that should be properly understood in order to use TensorIterator effectively.
If you're performing a reduction operation (see the sum code in ReduceOps.cpp), TensorIterator will figure out the dimensions to be reduced from the shapes of the input and output tensors, which determine how the input will be broadcast over the output. If you're performing a simple point-wise operation between two tensors (like addcmul from PointwiseOps.cpp), the iteration will happen over the entire tensor, without providing a choice of dimension. This allows TensorIterator to freely parallelize the computation, without any guarantee on the order of execution (since the order does not matter anyway).
For something like a cumulative sum operation, where you want to be able to choose the dimension to reduce over but iterate over multiple non-reduced dimensions (possibly in parallel), you must first re-stride the tensors and then use these re-strided tensors for creating the TensorIterator. In order to understand how this works, let's go over the code for the kernel that executes the cumsum function. The important bits of this function look like this:
auto self_sizes = ensure_nonempty_vec(self.sizes().vec());
self_sizes[dim] = 1;
auto result_restrided = restride_dim(result, dim, self_sizes);
auto self_restrided = restride_dim(self, dim, self_sizes);
auto iter = TensorIterator();
iter.dont_compute_common_dtype();
iter.dont_resize_outputs();
iter.add_output(result_restrided);
iter.add_input(self_restrided);
iter.build();
You can see that we first change the size of the tensors to 1 on the reduction dimension so that the dimension-collapsing logic inside TensorIterator::build() knows which dimension to skip. Setting the size in this way is akin to telling TensorIterator to skip that dimension. We then re-stride the tensors using restride_dim and use the re-strided tensors for building the TensorIterator. You can set any size for the inputs and outputs; TensorIterator will then check whether it can come up with a common broadcasted size for them.
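To make this concrete, here is a hedged sketch (not the actual kernel code from ReduceOpsKernel.cpp) of a loop that could be run on this iterator to compute the cumulative sum. Because the reduction dimension was collapsed to size 1, each of the n iterations of the loop lands on one slice along dim, and we walk along that dimension ourselves using the original tensors' sizes and strides. The names self_dim_size, self_dim_stride and result_dim_stride are introduced here for illustration, and float is assumed as the data type.

// Element (not byte) size/strides of the original tensors along `dim`.
int64_t self_dim_size = self.size(dim);
int64_t self_dim_stride = self.stride(dim);
int64_t result_dim_stride = result.stride(dim);

auto loop = [&](char** data, const int64_t* strides, int64_t n) {
  auto* result_bytes = data[0];
  auto* self_bytes = data[1];
  // Each of the n iterations corresponds to one slice along `dim`,
  // since that dimension was collapsed to size 1 before build().
  for (int64_t i = 0; i < n; i++) {
    auto* result_data = reinterpret_cast<float*>(result_bytes);
    auto* self_data = reinterpret_cast<const float*>(self_bytes);
    float cum = 0;
    // Walk along the reduction dimension manually.
    for (int64_t j = 0; j < self_dim_size; j++) {
      cum += self_data[j * self_dim_stride];
      result_data[j * result_dim_stride] = cum;
    }
    result_bytes += strides[0];  // byte strides supplied by the iterator.
    self_bytes += strides[1];
  }
};
iter.for_each(loop);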
Conclusion
This post was a very short introduction to what TensorIterator is actually capable of. If you want to learn more about how it works and what goes into things like collapsing the tensor size for optimizing memory access, a good place to start would be the build() function in TensorIterator.cpp. Also have a look at this blog post from the PyTorch team on using TensorIterator.