<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.0">Jekyll</generator><link href="https://v0dro.in//feed.xml" rel="self" type="application/atom+xml" /><link href="https://v0dro.in//" rel="alternate" type="text/html" /><updated>2021-05-25T06:57:55+00:00</updated><id>https://v0dro.in//feed.xml</id><title type="html">code, travel and explore.</title><subtitle>A place where I share my experiences with travel, programming and learning new things in general.</subtitle><entry><title type="html">Profiling and benchmarking Python programs</title><link href="https://v0dro.in//blog/2020/04/21/profiling-and-benchmarking-python-programs/" rel="alternate" type="text/html" title="Profiling and benchmarking Python programs" /><published>2020-04-21T08:09:00+00:00</published><updated>2020-04-21T08:09:00+00:00</updated><id>https://v0dro.in//blog/2020/04/21/profiling-and-benchmarking-python-programs</id><content type="html" xml:base="https://v0dro.in//blog/2020/04/21/profiling-and-benchmarking-python-programs/">&lt;p&gt;The number of ways in which one can profile and benchmark Python programs
is daunting. There are many options out there, and this post is about the ones
that I found suitable for profiling and benchmarking PRs that I submit to
PyTorch every now and then. Coming from a land of C++ and Ruby, one annoying
thing I find about the Python tools is the preference for providing the
code to be profiled inside a string as an argument to the profiling tool, so
I try to directly instrument calls within the code wherever possible.&lt;/p&gt;

&lt;h1 id=&quot;profiling-c-extensions&quot;&gt;Profiling C extensions&lt;/h1&gt;

&lt;p&gt;Say you want to profile the following PyTorch script to find out
where the &lt;code&gt;scatter_&lt;/code&gt; call is spending most of its time:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;512&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;input_one&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scatter_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input_one&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;using-cprofile&quot;&gt;Using cProfile&lt;/h2&gt;

&lt;p&gt;The default profiler for Python is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cProfile&lt;/code&gt;, which is a faster version of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;profile&lt;/code&gt; module.
While it is simple to use and does not require any extra dependencies, it does not show profiles
of C++ functions at all. You can use it by calling the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cProfile.run&lt;/code&gt; function and passing it
the code to be profiled as a string, like so:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;cProfile&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Do something
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cProfile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;res.scatter_(dim,index,input_one)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The profiler prints its output once the call completes.&lt;/p&gt;
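&lt;p&gt;If you prefer instrumenting the code directly instead of passing it as a string, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cProfile.Profile&lt;/code&gt; object can be enabled and disabled around the region of interest. A minimal sketch, where the squaring loop is just a stand-in for the code you actually want to profile:&lt;/p&gt;

```python
import cProfile
import io
import pstats

# Instrument a region directly instead of passing a string to cProfile.run.
profiler = cProfile.Profile()
profiler.enable()
total = sum(i * i for i in range(100000))  # stand-in for the real workload
profiler.disable()

# Print the ten most expensive calls by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```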

&lt;h2 id=&quot;using-yep&quot;&gt;Using yep&lt;/h2&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;yep&lt;/code&gt; is a &lt;a href=&quot;https://pypi.org/project/yep/&quot;&gt;utility&lt;/a&gt; that uses Google’s gperftools underneath and promises to
show profiles of C/C++ function calls made inside Python C extensions. On Ubuntu/Debian, first install the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;google-perftools&lt;/code&gt;
package. Then run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip install yep&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can set a region to profile as follows:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;yep&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;yep&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;file_name.prof&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# do something
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;yep&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stop&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This generates a file &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;file_name.prof&lt;/code&gt; that can be analysed using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pprof&lt;/code&gt;
&lt;a href=&quot;https://github.com/google/pprof&quot;&gt;utility&lt;/a&gt; (which can be installed with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go get -u github.com/google/pprof&lt;/code&gt;). You can
then get the top time consuming functions from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pprof&lt;/code&gt; as follows:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pprof -text -lines file_name.prof
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;For our same program, profiling the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scatter_&lt;/code&gt; loop shows the following output:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;File: python3.6
Type: cpu
Showing nodes accounting for 27.51s, 98.81% of 27.84s total
Dropped 151 nodes (cum &amp;lt;= 0.14s)
      flat  flat%   sum%        cum   cum%
     4.45s 15.98% 15.98%     27.49s 98.74%  _ZZZZZN2at6native12_GLOBAL__N_130cpu_scatter_gather_base_kernelILb1EEclERNS_6TensorElRKS4_S7_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEbRKNS0_17SCATTE
R_GATHER_OPEENKUlvE_clEvENKUlvE2_clEvENKUlRKT_E_clISt8functionIFvPfSR_EEEEDaSN_ENKUlPPcPKllE_clESV_SX_l /home/sameer/gitrepos/pytorch/build/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp.AVX2.cpp:375
     2.84s 10.20% 26.19%      2.84s 10.20%  _ZNK2at6native12_GLOBAL__N_1UlPT_PT0_E2_clIffEEDaS3_S5_ /home/sameer/gitrepos/pytorch/build/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp.AVX2.cpp:171
     2.54s  9.12% 35.31%      2.54s  9.12%  std::forward /usr/include/c++/7/bits/move.h:74
     1.91s  6.86% 42.17%      5.07s 18.21%  _ZNSt17_Function_handlerIFvPfS0_EN2at6native12_GLOBAL__N_1UlPT_PT0_E2_EE9_M_invokeERKSt9_Any_dataOS0_SE_ /usr/include/c++/7/bits/std_function.h:317
     1.39s  4.99% 47.16%     20.25s 72.74%  std::function::operator() /usr/include/c++/7/bits/std_function.h:706
     1.16s  4.17% 51.33%      1.16s  4.17%  std::forward /usr/include/c++/7/bits/move.h:73
     1.14s  4.09% 55.42%     11.48s 41.24%  _ZNSt17_Function_handlerIFvPfS0_EN2at6native12_GLOBAL__N_1UlPT_PT0_E2_EE9_M_invokeERKSt9_Any_dataOS0_SE_ /usr/include/c++/7/bits/std_function.h:316
     1.04s  3.74% 59.16%      1.04s  3.74%  _ZNSt14_Function_base13_Base_managerIN2at6native12_GLOBAL__N_1UlPT_PT0_E2_EE14_M_get_pointerERKSt9_Any_data /usr/include/c++/7/bits/std_function.h:176
     0.91s  3.27% 62.43%      0.91s  3.27%  _ZNSt14_Function_base13_Base_managerIN2at6native12_GLOBAL__N_1UlPT_PT0_E2_EE14_M_get_pointerERKSt9_Any_data /usr/include/c++/7/bits/std_function.h:175
     0.90s  3.23% 65.66%      0.90s  3.23%  std::_Any_data::_M_access /usr/include/c++/7/bits/std_function.h:107
     0.87s  3.12% 68.79%      0.87s  3.12%  _ZNK2at6native12_GLOBAL__N_1UlPT_PT0_E2_clIffEEDaS3_S5_ /home/sameer/gitrepos/pytorch/build/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp.AVX2.cpp:170
     0.86s  3.09% 71.88%      0.86s  3.09%  std::function::operator() /usr/include/c++/7/bits/std_function.h:701
     0.79s  2.84% 74.71%      0.79s  2.84%  [libtorch_cpu.so]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;some-notes-on-yep&quot;&gt;Some notes on yep&lt;/h2&gt;

&lt;p&gt;If you rebuild or replace the shared object file that your program was running against and then call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pprof&lt;/code&gt; on the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.prof&lt;/code&gt; file,
it will report nonsensical functions, since the profile only records addresses that are resolved
against the current contents of the shared object file.&lt;/p&gt;

&lt;h1 id=&quot;analyzing-performance-regressions&quot;&gt;Analyzing performance regressions&lt;/h1&gt;

&lt;p&gt;Analysis of performance regressions requires comparing the same interfaces over different implementations.&lt;/p&gt;

&lt;h2 id=&quot;time-regression-analysis&quot;&gt;Time regression analysis&lt;/h2&gt;

&lt;p&gt;The simplest performance regression to check for is execution time. The IPython &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timeit&lt;/code&gt; magic command
is a great way to obtain the mean and standard deviation of multiple executions of the same lines of code;
using it within a script requires embedding IPython. When used with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-o&lt;/code&gt; option, the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timeit&lt;/code&gt; magic also returns an object containing information about the
most recent timing run.&lt;/p&gt;
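&lt;p&gt;Outside of IPython, the standard-library &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timeit&lt;/code&gt; module allows the same kind of direct instrumentation from within a script. A minimal sketch, where the statement being timed is just an example:&lt;/p&gt;

```python
import statistics
import timeit

# Each entry in runs is the total time for `number` executions of stmt.
runs = timeit.repeat(stmt="sum(range(1000))", repeat=5, number=10000)
per_call = [t / 10000 for t in runs]
print("mean: {:.3e} s, stdev: {:.3e} s".format(
    statistics.mean(per_call), statistics.stdev(per_call)))
```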

&lt;h1 id=&quot;further-reading&quot;&gt;Further Reading&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;C extensions with PySpy: https://www.benfrederickson.com/profiling-native-python-extensions-with-py-spy/&lt;/li&gt;
  &lt;li&gt;Yep home page: https://pypi.org/project/yep/&lt;/li&gt;
  &lt;li&gt;Speedscope homepage: https://github.com/jlfwong/speedscope&lt;/li&gt;
  &lt;li&gt;Pyspy homepage: https://github.com/benfred/py-spy&lt;/li&gt;
  &lt;li&gt;Google perftools: https://github.com/gperftools/gperftools&lt;/li&gt;
  &lt;li&gt;Yep blog post:  https://www.camillescott.org/2013/12/06/yep/&lt;/li&gt;
  &lt;li&gt;Timeit -o: https://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=timeit#magic-timeit&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><summary type="html">The number of ways in which one can profile and benchmark Python programs is daunting. There are many options out there, and this post is about the ones that I found suitable for profiling and benchmarking PRs that I submit to PyTorch every now and then. Coming from a land of C++ and Ruby, one annoying thing I find about the Python tools is the preference for providing the code to be profiled inside a string as an argument to the profiling tool, so I try to directly instrument calls within the code wherever possible.</summary></entry><entry><title type="html">Installing latest MPI</title><link href="https://v0dro.in//blog/2020/04/12/installing-latest-mpi/" rel="alternate" type="text/html" title="Installing latest MPI" /><published>2020-04-12T07:23:00+00:00</published><updated>2020-04-12T07:23:00+00:00</updated><id>https://v0dro.in//blog/2020/04/12/installing-latest-mpi</id><content type="html" xml:base="https://v0dro.in//blog/2020/04/12/installing-latest-mpi/">&lt;p&gt;Installing the latest openMPI can be a challenge if you want to optimize
all of its parameters properly. Here is the right way of doing so:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;gdrcopy (https://github.com/NVIDIA/gdrcopy.git). Nothing special here; it figures out most of
the things by itself. Do a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make INSTALL prefix=&amp;lt;somewhere GDR&amp;gt;&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;UCX (git@github.com:uccs/ucx.git). Pick the version you want (the latest is 1.8) and check out the
corresponding branch with git. First run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;./autogen.sh&lt;/code&gt;, then
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;../../configure --prefix=&amp;lt;somewhere UCX&amp;gt; --disable-debug --with-cuda --with-avx --with-gdrcopy=&amp;lt;somewhere GDR&amp;gt; --enable-mt --with-hwloc&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Finally OMPI (git@github.com:open-mpi/ompi.git). Similarly to UCX, pick the version you want
(I stick with master most of the time, except when it is obviously broken), then run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;./autogen.sh&lt;/code&gt; and
then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;../../configure --prefix=&amp;lt;somewhere OMPI&amp;gt; --enable-picky --disable-debug --enable-contrib-no-build=vt --enable-mpirun-prefix-by-default --with-cma --enable-ipv6 --disable-oshmem --disable-spc --with-ucx=&amp;lt;somewhere UCX&amp;gt; --with-cuda&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;</content><author><name></name></author><summary type="html">Installing the latest openMPI can be a challenge if you want to optimize all of its parameters properly. Here is the right way of doing so:</summary></entry><entry><title type="html">Day hike to Mitsutoge from Tokyo</title><link href="https://v0dro.in//blog/2020/04/04/day-hike-to-mitsutoge-from-tokyo/" rel="alternate" type="text/html" title="Day hike to Mitsutoge from Tokyo" /><published>2020-04-04T07:14:00+00:00</published><updated>2020-04-04T07:14:00+00:00</updated><id>https://v0dro.in//blog/2020/04/04/day-hike-to-mitsutoge-from-tokyo</id><content type="html" xml:base="https://v0dro.in//blog/2020/04/04/day-hike-to-mitsutoge-from-tokyo/">&lt;p&gt;Mt. Mitsutoge is a great hike not too far from Tokyo. The full hike is about 20 KM long and takes
about 5.5 hours of walking to finish. I came across this while
browsing for things to do around the city on &lt;a href=&quot;https://ridgelineimages.com/hiking/mt-mitsutoge/&quot;&gt;ridgeline images&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I had a spare day remaining on an old Seishun 18 ticket (and not much time to finish
it) and therefore decided to take most of the route to and from Tokyo by train. I’d recommend
going by bus if you’re not in a situation like I was.&lt;/p&gt;

&lt;h1 id=&quot;travel-route&quot;&gt;Travel Route&lt;/h1&gt;

&lt;h1 id=&quot;access&quot;&gt;Access&lt;/h1&gt;

&lt;p&gt;The JR Chuo Line&lt;/p&gt;</content><author><name></name></author><summary type="html">Mt. Mitsutoge is a great hike not too far from Tokyo. The full hike is about 20 KM long and takes about 5.5 hours of walking to finish. I came across this while browsing for things to do around the city on ridgeline images.</summary></entry><entry><title type="html">Tokyo to Matsumoto with the Seishun18</title><link href="https://v0dro.in//blog/2020/04/04/tokyo-to-matsumoto-with-the-seishun18/" rel="alternate" type="text/html" title="Tokyo to Matsumoto with the Seishun18" /><published>2020-04-04T05:52:00+00:00</published><updated>2020-04-04T05:52:00+00:00</updated><id>https://v0dro.in//blog/2020/04/04/tokyo-to-matsumoto-with-the-seishun18</id><content type="html" xml:base="https://v0dro.in//blog/2020/04/04/tokyo-to-matsumoto-with-the-seishun18/">&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;We wanted to head to Matsumoto for a short 2-day trip to escape the hustle and bustle of
Tokyo. However, taking the direct Limited Express train from Shinjuku to Matsumoto is
prohibitively expensive (about 6000 JPY one way), so we decided to use the Seishun18
pass from Tokyo, which cost us about 2410 JPY one way.&lt;/p&gt;

&lt;p&gt;The only downside of using the Seishun18 is that it takes longer and you need three transfers.
This short post is about which trains we took for this particular trip.&lt;/p&gt;

&lt;h1 id=&quot;the-journey&quot;&gt;The Journey&lt;/h1&gt;</content><author><name></name></author><summary type="html">Introduction</summary></entry><entry><title type="html">Setting up Docker containers for testing pytorch</title><link href="https://v0dro.in//blog/2020/03/30/setting-up-docker-containers-for-testing-pytorch/" rel="alternate" type="text/html" title="Setting up Docker containers for testing pytorch" /><published>2020-03-30T10:18:00+00:00</published><updated>2020-03-30T10:18:00+00:00</updated><id>https://v0dro.in//blog/2020/03/30/setting-up-docker-containers-for-testing-pytorch</id><content type="html" xml:base="https://v0dro.in//blog/2020/03/30/setting-up-docker-containers-for-testing-pytorch/">&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;While developing pytorch we frequently run into issues that fail only
for a particular build configuration and are very hard to reproduce
on a local machine. Facebook uses Docker containers for running
CI setups for various build configurations, and you can use the same
containers to build your own local Docker images in order to reproduce
the issue easily. This post describes how you can use the Docker
functionality on QGPU1 in order to build such a Docker image.&lt;/p&gt;

&lt;h1 id=&quot;first-steps&quot;&gt;First steps&lt;/h1&gt;

&lt;p&gt;The first step is to ask an admin (Pearu/Sameer/Dharhas) to add you
to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;docker&lt;/code&gt; group on QGPU1. Once you’re in this group, find the
Amazon ECR API access keys on the facebook quip document. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Docker&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia-docker&lt;/code&gt; are already installed so you need not install
them.&lt;/p&gt;

&lt;p&gt;Then install the Amazon ECR client for your user from &lt;a href=&quot;https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ECS_CLI_installation.html&quot;&gt;here&lt;/a&gt;.
Then run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws ecr get-login&lt;/code&gt;, which prints a login command containing
temporary credentials; run that command to log in automatically.&lt;/p&gt;

&lt;p&gt;In order to know which docker image you need, you must know its full
name first. The name can be found from the circleCI build. The Amazon
ECR name can be a little different, so find the name in the
output of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws ecr describe-repositories&lt;/code&gt; command. Use it
in this manner to find which repo you need (typically the name
of your failing build on circleCI):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;aws ecr describe-repositories | &lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-C&lt;/span&gt; 3 xenial
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;From the output of this, pick up the exact &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;repositoryName&lt;/code&gt; value
and replace it in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;repo_name&lt;/code&gt; variable in the below ruby script
in order to get the full string that will be the name of your docker image:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'json'&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;repo_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7&quot;&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;puts&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;logging in...&quot;&lt;/span&gt;
&lt;span class=&quot;sb&quot;&gt;`aws ecr get-login --no-include-email | bash`&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;images&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;parse&lt;/span&gt; &lt;span class=&quot;sb&quot;&gt;`aws ecr describe-images --repository-name &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;repo_name&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;image_tag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;images&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'imageDetails'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'imageTags'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;repo_info&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;parse&lt;/span&gt; &lt;span class=&quot;sb&quot;&gt;`aws ecr describe-repositories --repository-names &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;repo_name&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;uri&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;repo_info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'repositories'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'repositoryUri'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;puts&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Docker image: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uri&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;image_tag&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You can then create your own docker image using a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dockerfile&lt;/code&gt; like so:&lt;/p&gt;
&lt;div class=&quot;language-Dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Insert the SHA key after the image name.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; &amp;lt;insert docker image name here&amp;gt;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;ENV&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; MAX_JOBS=20&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;conda &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; hypothesis

&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;workspace &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; git submodule &lt;span class=&quot;nb&quot;&gt;sync&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--recursive&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; git submodule update &lt;span class=&quot;nt&quot;&gt;--init&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--recursive&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;TORCH_CUDA_ARCH_LIST&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;Turing python setup.py &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--cmake&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The default docker container is built for a different GPU in some cases, so
it is important to specify the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TORCH_CUDA_ARCH_LIST&lt;/code&gt; env variable.&lt;/p&gt;

&lt;h1 id=&quot;further-reading&quot;&gt;Further reading&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;https://github.com/pytorch/pytorch/wiki/Docker-image-build-on-CircleCI&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><summary type="html">Introduction</summary></entry><entry><title type="html">Interpreting the results of the STREAM benchmark</title><link href="https://v0dro.in//blog/2020/02/27/interpreting-stream-benchmarks/" rel="alternate" type="text/html" title="Interpreting the results of the STREAM benchmark" /><published>2020-02-27T00:00:00+00:00</published><updated>2020-02-27T00:00:00+00:00</updated><id>https://v0dro.in//blog/2020/02/27/interpreting-stream-benchmarks</id><content type="html" xml:base="https://v0dro.in//blog/2020/02/27/interpreting-stream-benchmarks/">&lt;!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-refresh-toc --&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#introduction&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#machine-balance&quot;&gt;Machine Balance&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#stream-kernels&quot;&gt;STREAM kernels&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#triad&quot;&gt;TRIAD&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#useful-links&quot;&gt;Useful links&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;!-- markdown-toc end --&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;The STREAM benchmark is considered an important benchmark for understanding the memory
bandwidth and access latency of a particular computer. This benchmark was conceptualized
in the 1995 &lt;a href=&quot;http://www.cs.virginia.edu/~mccalpin/papers/bandwidth/bandwidth.html&quot;&gt;paper by John McCalpin&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;machine-balance&quot;&gt;Machine Balance&lt;/h1&gt;

&lt;p&gt;At the heart of the benchmark lies a definition of ‘machine balance’. Before STREAM,
machine balance was simply defined as the ratio of the number of floating point operations per clock
cycle to the number of memory operations per clock cycle. This is known as the ‘balance’
since it contrasts the time spent executing useful work (floating point operations) with
work that is absolutely necessary for performing the useful work but is always a bottleneck
in performance (memory access latency).&lt;/p&gt;

&lt;p&gt;However, this definition fails to capture the complexity
of hierarchical memory structures that use multiple layers of cache and parallelization
strategies such as pipelining and prefetching. This is because the number of floating point
operations per cycle can greatly vary depending on the location of the data that is being
operated on. The peak will be reached when the data resides in registers, whereas for
data being accessed from memory, the number of cycles taken to execute a single floating
point operation will be much higher due to latency.&lt;/p&gt;

&lt;p&gt;If this is the case, one might wonder why taking an average of this simple definition is
not adequate since working with a long-enough array will engage the registers and the
RAM too, and should give an estimate of the average number of floating point ops per cycle.
&lt;!-- explain why over here --&gt;&lt;/p&gt;

&lt;p&gt;The STREAM benchmark refines the definition of ‘machine balance’ and defines it as the PEAK
floating point operations per cycle divided by the number of sustained memory operations per
cycle.&lt;/p&gt;

&lt;h1 id=&quot;stream-kernels&quot;&gt;STREAM kernels&lt;/h1&gt;

&lt;p&gt;The benchmark is broken up into a number of kernels, each employing a different set
of instructions per kernel operation.&lt;/p&gt;

&lt;h2 id=&quot;sum&quot;&gt;SUM&lt;/h2&gt;

&lt;p&gt;The STREAM SUM kernel computes the vector operation &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A(i) = B(i) + C(i)&lt;/code&gt;.
Each iteration moves 24 bytes of data (two loads and one store, assuming doubles)
and performs one floating point addition.&lt;/p&gt;

&lt;h2 id=&quot;triad&quot;&gt;TRIAD&lt;/h2&gt;

&lt;p&gt;The STREAM TRIAD kernel computes the vector operation &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A(i) = B(i) + s * C(i)&lt;/code&gt;.
Each kernel execution involves two loads, one store and one FMA instruction.
If vectorized, it will perform a number of such kernel operations per loop iteration.&lt;/p&gt;

&lt;p&gt;Assuming that we are working with doubles, each iteration uses 24 bytes in reads and writes.&lt;/p&gt;

&lt;h1 id=&quot;running-the-benchmark&quot;&gt;Running the benchmark&lt;/h1&gt;

&lt;p&gt;The following is the methodology to run STREAM on an A64FX chip used in the FUGAKU supercomputer.&lt;/p&gt;

&lt;p&gt;Download the source from &lt;a href=&quot;https://github.com/jeffhammond/STREAM&quot;&gt;here&lt;/a&gt; and for FUGAKU use the
following compile command using the FUJITSU compiler on a compute node:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;fcc &lt;span class=&quot;nt&quot;&gt;-Nclang&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-O3&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-fopenmp&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-DSTREAM_ARRAY_SIZE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4194304 stream.c
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the above command we set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STREAM_ARRAY_SIZE&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;4194304&lt;/code&gt; since the STREAM benchmark
specifies that the size of each array should be either 4x the total size of the
last-level caches or 1 million elements, whichever is larger. We are running this test on
a single core, so only one of the 4 L2 caches on the chip is used, which is 8 MB.
Assuming we’re using doubles, that gives ( 8 * 1024 * 1024 / 8 ) * 4 = 4194304 elements.&lt;/p&gt;

&lt;p&gt;In order to figure out the single threaded bandwidth performance, set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OMP_NUM_THREADS=1&lt;/code&gt; and
run the executable. The following are the results on the A64FX:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           21619.0     0.074010     0.074009     0.074011
Scale:          54700.6     0.029265     0.029250     0.029288
Add:            73861.8     0.032498     0.032493     0.032500
Triad:          64291.6     0.037334     0.037330     0.037337
-------------------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It can be seen that TRIAD, the most complex of the 4 kernels,
shows a peak bandwidth utilization of about 64 GB/s for a single core.&lt;/p&gt;

&lt;h1 id=&quot;discussion-on-the-results&quot;&gt;Discussion on the results&lt;/h1&gt;

&lt;p&gt;The above results only demonstrate the peak achievable memory bandwidth
for certain kernels. In practice such speeds are rarely reached.&lt;/p&gt;

&lt;p&gt;In the STREAM paper, Dr. McCalpin says that the TRIAD benchmark is the standard
used for calculating the machine balance of the system. Why use the TRIAD even
though ADD seems to be utilizing more bandwidth on a single core?&lt;/p&gt;

&lt;h1 id=&quot;useful-links&quot;&gt;Useful links&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://stackoverflow.com/questions/56086993/what-does-stream-memory-bandwidth-benchmark-really-measure&quot;&gt;https://stackoverflow.com/questions/56086993/what-does-stream-memory-bandwidth-benchmark-really-measure&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sites.utexas.edu/jdm4372/tag/stream-benchmark/&quot;&gt;https://sites.utexas.edu/jdm4372/tag/stream-benchmark/&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://blogs.fau.de/hager/archives/8263&quot;&gt;https://blogs.fau.de/hager/archives/8263&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://stackoverflow.com/questions/39260020/why-is-skylake-so-much-better-than-broadwell-e-for-single-threaded-memory-throug&quot;&gt;https://stackoverflow.com/questions/39260020/why-is-skylake-so-much-better-than-broadwell-e-for-single-threaded-memory-throug&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://software.intel.com/content/www/us/en/develop/articles/optimizing-memory-bandwidth-on-stream-triad.html&quot;&gt;https://software.intel.com/content/www/us/en/develop/articles/optimizing-memory-bandwidth-on-stream-triad.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><summary type="html">Table of Contents</summary></entry><entry><title type="html">PyTorch TensorIterator Internals</title><link href="https://v0dro.in//blog/2020/01/19/pytorch-tensor-iterator-internals/" rel="alternate" type="text/html" title="PyTorch TensorIterator Internals" /><published>2020-01-19T07:00:00+00:00</published><updated>2020-01-19T07:00:00+00:00</updated><id>https://v0dro.in//blog/2020/01/19/pytorch-tensor-iterator-internals</id><content type="html" xml:base="https://v0dro.in//blog/2020/01/19/pytorch-tensor-iterator-internals/">&lt;!--
.. Title: PyTorch TensorIterator Internals
.. slug: pytorch-tensoriterator-internals
.. date: 2020-03-12 22:39:56 UTC-05:00
.. tags: 
.. category: 
.. link: 
.. description: 
.. type: text
--&gt;

&lt;!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-generate-toc again --&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#introduction&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#history-of-tensoriterator&quot;&gt;History of TensorIterator&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#th-iterators&quot;&gt;TH iterators&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#limitations-of-th-iterators&quot;&gt;Limitations of TH iterators&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#basics-of-tensoriterator&quot;&gt;Basics of TensorIterator&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#performing-iterations&quot;&gt;Performing iterations&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#iteration-details&quot;&gt;Iteration details&lt;/a&gt;
        &lt;ul&gt;
          &lt;li&gt;&lt;a href=&quot;#using-kernels-for-iterations&quot;&gt;Using kernels for iterations&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#setting-tensor-iteration-dimensions&quot;&gt;Setting tensor iteration dimensions&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;!-- markdown-toc end --&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;PyTorch is one of the leading frameworks for deep learning. Its core data structure is
the Tensor, a multi-dimensional array implementation with many advanced features like auto-differentiation. PyTorch is a massive
codebase (approx. &lt;a href=&quot;https://www.openhub.net/p/pytorch&quot;&gt;12 GB of files and about a million lines&lt;/a&gt; of
C++, Python and CUDA code), and a method for iterating over tensors efficiently, independently of
data type, dimension, striding and hardware, is a critical feature that massively
simplifies the codebase and makes distributed development faster and
smoother. The &lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;&lt;/a&gt; C++ class within PyTorch is a complex yet useful class that
is used for iterating over the elements of a tensor over any dimension and implicitly
parallelizing various operations in a device-independent manner.&lt;/p&gt;

&lt;p&gt;It does this through
a C++ API that is independent of type and device of the tensor, freeing the programmer
of having to worry about the datatype or device when writing iteration logic for PyTorch
tensors. For those coming from the NumPy universe, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NpyIter&lt;/code&gt; is a close cousin of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This post is a deep dive into how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; works, and is
an essential part of learning to contribute to the PyTorch codebase, since iterations
over tensors in the C++ codebase are extremely commonplace. This post is aimed at someone
who wants to contribute to PyTorch, and you should at least be familiar with some of the
basic terminology of the PyTorch codebase, which can be found in Edward Yang’s
excellent &lt;a href=&quot;http://blog.ezyang.com/2019/05/pytorch-internals/&quot;&gt;blog post&lt;/a&gt; on PyTorch internals.
Although &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; can be used for both CPUs and accelerators, this post has been
written with usage on the CPU in mind. While there can be some dissimilarities between
the two, the overall concepts are the same.&lt;/p&gt;

&lt;h1 id=&quot;history-of-tensoriterator&quot;&gt;History of TensorIterator&lt;/h1&gt;

&lt;h2 id=&quot;th-iterators&quot;&gt;TH iterators&lt;/h2&gt;

&lt;p&gt;TensorIterator was devised to simplify the implementation of PyTorch’s tensor operations over the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TH&lt;/code&gt; implementation. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TH&lt;/code&gt; uses preprocessor macros to write type-independent loops over tensors, instead of C++ templates. For example, consider this simple &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TH&lt;/code&gt; loop
for computing the product of all the numbers in a particular dimension (find the code 
&lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/TH/generic/THTensorMoreMath.cpp#L350&quot;&gt;here&lt;/a&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-C&quot;&gt;TH_TENSOR_DIM_APPLY2(scalar_t, t, scalar_t, r_, dimension,
    accreal prod = 1;
    int64_t i;
    for(i = 0; i &amp;lt; t_size; i++)
        prod *= t_data[i*t_stride];
    *r__data = (scalar_t)prod;
);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The above loop works by following a particular convention for the naming of the
types and variables. You specify the input and output types of your tensors in the first
and third arguments. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scalar_t&lt;/code&gt; is a type that can generically be used for denoting a PyTorch
scalar type such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;double&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;long&lt;/code&gt; etc. Internally, PyTorch uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scalar_t&lt;/code&gt;
to compile the file multiple times with different definitions of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scalar_t&lt;/code&gt; (i.e. for different
data types like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int&lt;/code&gt;, etc.). The input and output tensors are
specified in the second and fourth arguments (in this case &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;r_&lt;/code&gt;), and the dimension that
we want to iterate over is specified as the fifth argument (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dimension&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;We then follow these arguments with the main body of the iterator (which is accepted as the sixth
argument into the macro), and denote the data, stride and size of the particular tensor dimension
by using variables that are suffixed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_data&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_stride&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_size&lt;/code&gt; respectively after the
variable name that represents the tensor inside the iterator body. For example, the size of the
input tensor is denoted as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t_size&lt;/code&gt; in the above example and the pointer to the data of the output
tensor is denoted as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;r__data&lt;/code&gt;. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accreal&lt;/code&gt; in the second line is a custom type that specifies
a real number used as an accumulator (in this case for accumulating the product).&lt;/p&gt;

&lt;p&gt;Internally, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TH_TENSOR_DIM_APPLY2&lt;/code&gt; macro is expanded for generating various dispatch calls 
depending on the type of the tensor that needs to be iterated over. The implementation of 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TH_TENSOR_DIM_APPLY2&lt;/code&gt; can be found &lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/TH/THTensorDimApply.h#L138&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;limitations-of-th-iterators&quot;&gt;Limitations of TH iterators&lt;/h2&gt;

&lt;p&gt;Apart from the obvious complication of maintaining a codebase so dependent
on such insanely complex macro expansions, TH iterators have some fundamental shortcomings. For
one thing, they cannot be used for writing iterators in a device-independent manner - you
need separate iterators for CPU and CUDA. Also, parallelization does not happen implicitly
inside the iterator; you need to write the parallel looping logic yourself. Moreover, at a deeper
level, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TH&lt;/code&gt; iterators do not collapse the dimensions of the tensor (as we’ll see later in this
post), leading to loops that might not be as cache-optimized as possible.&lt;/p&gt;

&lt;p&gt;These limitations led to the creation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;, which is used by the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ATen&lt;/code&gt; tensor implementation for overcoming some of the shortcomings of the previous &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TH&lt;/code&gt;
iterators.&lt;/p&gt;

&lt;h1 id=&quot;basics-of-tensoriterator&quot;&gt;Basics of TensorIterator&lt;/h1&gt;

&lt;p&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; can be created using the default constructor. You must then add the tensors
that you want to use as inputs or outputs. A good example is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator::binary_op()&lt;/code&gt;
&lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp#L652&quot;&gt;method&lt;/a&gt;, which
allows you to create &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; objects for performing point-wise binary operations
between two tensors. The important parts look like so:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TensorIterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;As you can see, you add a tensor called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;out&lt;/code&gt; as the output tensor and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b&lt;/code&gt; as the
input tensors. Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build&lt;/code&gt; is then mandatory for creating the object and letting
the class perform other optimizations like collapsing dimensions.&lt;/p&gt;

&lt;h1 id=&quot;performing-iterations&quot;&gt;Performing iterations&lt;/h1&gt;

&lt;p&gt;Broadly, iterations using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; can be classified as point-wise iterations
or reduction iterations. This plays a fundamental role in how iterations using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;
are parallelized - point-wise iterations can be freely parallelized along any dimension
and grain size while reduction operations have to be either parallelized along dimensions
that you’re not iterating over or by performing bisect and reduce operations along the
dimension being iterated. Parallelization can also happen using vectorized operations.&lt;/p&gt;

&lt;h2 id=&quot;iteration-details&quot;&gt;Iteration details&lt;/h2&gt;

&lt;p&gt;The simplest iteration operation can be performed using the 
&lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp#L525&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;for_each&lt;/code&gt;&lt;/a&gt; 
function. This function has two overloads: one takes a function object that iterates over a
single dimension (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loop_t&lt;/code&gt;); the other takes a function object that iterates over two
dimensions simultaneously (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loop2d_t&lt;/code&gt;). Find their definitions &lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.h#L166&quot;&gt;here&lt;/a&gt;. The simplest
way of using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;for_each&lt;/code&gt; is to pass it a lambda of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loop_t&lt;/code&gt; (or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loop2d_t&lt;/code&gt;).
A code snippet using it this way looks like so:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TensorIterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dont_resize_outputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// call if out is allocated.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dont_compute_common_dtype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// call if inputs/outputs are of a different type.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int64_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strides&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out_data_bytes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;in_data_bytes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// assume float data type for this example.&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;reinterpret_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out_data_bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;reinterpret_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in_data_bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        
      &lt;span class=&quot;n&quot;&gt;out_data_bytes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strides&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;in_data_bytes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strides&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;for_each&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In the above example, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;char** data&lt;/code&gt; gives a pointer to the data within the
tensor in the same order that you specify when you build the iterator. Note
that in order to make the implementation agnostic of any particular data type, you
will always receive the pointer typecast to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;char&lt;/code&gt; (think of it as a bunch of bytes).&lt;/p&gt;

&lt;p&gt;The second argument is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int64_t* strides&lt;/code&gt; which is an array containing the strides of
each tensor in the dimension that you’re iterating over. We can add this stride to the
pointer received in order to reach the next element in the tensor. The last argument is
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int64_t n&lt;/code&gt; which is the size of the dimension being iterated over.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;for_each&lt;/code&gt; implicitly parallelizes the operation by executing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loop&lt;/code&gt; in parallel
if the number of iterations is more than the value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;internal::GRAIN_SIZE&lt;/code&gt;, which is a value
that is determined as the ‘right amount’ of data to iterate over in order to gain a significant
speedup using multi-threaded execution. If you want to explicitly specify that your
operation &lt;em&gt;must&lt;/em&gt; run in serial, then use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;serial_for_each&lt;/code&gt; loop.&lt;/p&gt;

&lt;h3 id=&quot;using-kernels-for-iterations&quot;&gt;Using kernels for iterations&lt;/h3&gt;

&lt;p&gt;Frequently we want to create a kernel that applies a simple point-wise function onto entire tensors.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;
provides various such generic kernels that can be used for iterating over the elements
of a tensor without having to worry about the stride, data type of the operands or details
of the parallelism.&lt;/p&gt;

&lt;p&gt;For example, if we want to build a function that performs the point-wise addition
of two tensors and stores the result in a third tensor, we can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cpu_kernel&lt;/code&gt;
function. Note that in this example we assume tensors of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float&lt;/code&gt;, but you can
use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AT_DISPATCH_ALL_TYPES_AND2&lt;/code&gt; macro to generate this code for multiple data types.&lt;/p&gt;
&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;TensorIterator&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a_tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b_tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c_tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cpu_kernel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Writing the kernel in this way ensures that the value returned by the lambda passed to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cpu_kernel&lt;/code&gt; will populate the corresponding place in the target output tensor.&lt;/p&gt;

&lt;h3 id=&quot;setting-tensor-iteration-dimensions&quot;&gt;Setting tensor iteration dimensions&lt;/h3&gt;

&lt;p&gt;The values of the sizes and strides determine which dimension of the tensor you will iterate over.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; performs optimizations to make sure that at least
most of the iterations happen on contiguous data in order to take advantage of hierarchical cache-based
memory architectures (think dimension coalescing and reordering for maximum data locality).&lt;/p&gt;
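&lt;p&gt;To make the coalescing idea concrete, here is a small Python sketch (a simplified model, not PyTorch’s actual implementation): two adjacent dimensions can be merged whenever the outer stride steps exactly over the inner dimension’s extent, so the pair is one contiguous run in memory:&lt;/p&gt;

```python
# NOTE: a simplified sketch of the coalescing idea, not PyTorch's actual code.
def coalesce(shape, strides):
    """Merge adjacent dimensions whose memory is jointly contiguous.

    Strides are in elements. Two adjacent dims can be fused whenever the
    outer stride equals inner_stride * inner_size.
    """
    out_shape, out_strides = [shape[0]], [strides[0]]
    for size, stride in zip(shape[1:], strides[1:]):
        if out_strides[-1] == stride * size:
            out_shape[-1] *= size      # fuse: one longer contiguous run
            out_strides[-1] = stride
        else:
            out_shape.append(size)     # gap in memory: keep dims separate
            out_strides.append(stride)
    return out_shape, out_strides

# A contiguous 2x3 tensor collapses into a single run of 6 elements,
# while a padded one (outer stride 4) cannot be collapsed.
print(coalesce([2, 3], [3, 1]))  # -> ([6], [1])
print(coalesce([2, 3], [4, 1]))  # -> ([2, 3], [4, 1])
```

&lt;p&gt;Fewer, longer dimensions mean longer inner loops over contiguous memory, which is exactly what the cache hierarchy rewards.&lt;/p&gt;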

&lt;p&gt;Now a multi-dimensional tensor has a different stride value for each dimension
you might want to iterate over, so &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; computes the strides that
get passed into the loop by itself within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build()&lt;/code&gt; function. How exactly it computes the dimension
to iterate over is something that should be properly understood in order to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;
effectively.&lt;/p&gt;

&lt;p&gt;If you’re performing a reduction operation (see the sum code in &lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/ReduceOps.cpp#L384&quot;&gt;ReduceOps.cpp&lt;/a&gt;),
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; will figure out the dimensions that will be reduced depending
on the shape of the input and output tensor, which determines how the input will be broadcast
over the output. If you’re
performing a simple pointwise operation between two tensors (like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;addcmul&lt;/code&gt; from 
&lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/PointwiseOps.cpp#L31&quot;&gt;PointwiseOps.cpp&lt;/a&gt;),
the iteration will happen over the entire tensor, without providing a choice of dimension.
This allows &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; to freely parallelize the computation, with no guarantees on
the order of execution (since the order does not matter anyway).&lt;/p&gt;

&lt;p&gt;For something like a cumulative sum operation, where you want to be able to choose the dimension
to reduce but iterate over multiple non-reduced dimensions (possibly in parallel), you
must first re-stride the tensors and then use the re-strided tensors
for creating a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;. In order to understand how this bit works, let’s go over
the code for the &lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp#L21&quot;&gt;kernel&lt;/a&gt; that executes the &lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp#L71&quot;&gt;cumsum&lt;/a&gt; function.&lt;/p&gt;

&lt;p&gt;The important bits of this function are like so:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self_sizes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ensure_nonempty_vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sizes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;self_sizes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result_restrided&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;restride_dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self_sizes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self_restrided&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;restride_dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self_sizes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TensorIterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dont_compute_common_dtype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dont_resize_outputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result_restrided&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self_restrided&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;You can see that we first change the size of the tensors to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1&lt;/code&gt; on the
reduction dimension so that the dimension collapsing logic inside
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator#build&lt;/code&gt; knows which dimension to skip.
Setting the size in this way is akin to telling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;
to skip that dimension. We then restride the tensors using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restride_dim&lt;/code&gt; and
use the restrided tensors for building the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;. You can
set any size for inputs/outputs, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; will check whether it
can come up with a common broadcast size.&lt;/p&gt;
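&lt;p&gt;A conceptual NumPy model of what this restriding buys you (this is only a sketch of the iteration pattern, not the actual kernel code): the iterator effectively hands the inner loop one independent 1-D lane along the reduced dimension at a time, and each lane can be scanned separately:&lt;/p&gt;

```python
# NOTE: a conceptual model of the iteration pattern only, not PyTorch code.
import numpy as np

def cumsum_along(a, dim):
    """Scan every 1-D lane along `dim`.

    Each lane is what the restrided iterator effectively hands to the
    inner loop; the lanes are independent, hence parallelizable.
    """
    out = np.empty_like(a)
    outer_shape = tuple(s for d, s in enumerate(a.shape) if d != dim)
    for outer in np.ndindex(*outer_shape):
        idx = list(outer)
        idx.insert(dim, slice(None))   # select the full lane along `dim`
        out[tuple(idx)] = np.cumsum(a[tuple(idx)])
    return out

a = np.arange(24).reshape(2, 3, 4)
assert np.array_equal(cumsum_along(a, 1), np.cumsum(a, axis=1))
```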

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;This post was a very short introduction to what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt; is actually
capable of. If you want to learn more about how it works and what goes into
things like collapsing the tensor size for optimizing memory access, a good
place to start would be the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build()&lt;/code&gt; function in 
&lt;a href=&quot;https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp#L1030&quot;&gt;TensorIterator.cpp&lt;/a&gt;.
Also have a look at &lt;a href=&quot;https://github.com/pytorch/pytorch/wiki/How-to-use-TensorIterator&quot;&gt;this blog post&lt;/a&gt; from the PyTorch team
on using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TensorIterator&lt;/code&gt;.&lt;/p&gt;</content><author><name></name></author><summary type="html"></summary></entry><entry><title type="html">Distributed LU factorization using Chameleon</title><link href="https://v0dro.in//blog/2019/09/16/distributed-lu-factorization-using-chameleon/" rel="alternate" type="text/html" title="Distributed LU factorization using Chameleon" /><published>2019-09-16T23:45:00+00:00</published><updated>2019-09-16T23:45:00+00:00</updated><id>https://v0dro.in//blog/2019/09/16/distributed-lu-factorization-using-chameleon</id><content type="html" xml:base="https://v0dro.in//blog/2019/09/16/distributed-lu-factorization-using-chameleon/">&lt;p&gt;In this post I will detail the steps I took to reproduce distributed
LU factorization using the Chameleon library. It is a linear algebra library
based on the StarPU runtime system.&lt;/p&gt;

&lt;!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-generate-toc again --&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#installing-chameleon&quot;&gt;Installing chameleon&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#compiling-and-linking-your-programs&quot;&gt;Compiling and linking your programs&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#distributed-lu-factorization-implementations&quot;&gt;Distributed LU factorization implementations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;!-- markdown-toc end --&gt;

&lt;h1 id=&quot;installing-chameleon&quot;&gt;Installing chameleon&lt;/h1&gt;

&lt;p&gt;Clone the sources from gitlab:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git clone --recursive https://gitlab.inria.fr/solverstack/chameleon.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Configure with the following for a non-CUDA, MPI-enabled build:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd chameleon
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug \
         -DCMAKE_INSTALL_PREFIX=$HOME/gitrepos/hicma/profiling/chameleon/chameleon/build \
         -DCHAMELEON_USE_CUDA=OFF \
         -DCHAMELEON_USE_MPI=ON \
         -DFXT_DIR=/home/1/17M38101/software/fxt-0.3.8 \
         -DSTARPU_DIR=/home/1/17M38101/software/starpu-1.3.2-test \
         -DSTARPU_FIND_COMPONENTS=ON \
         -DCHAMELEON_ENABLE_TRACING=ON
make install
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Make sure that you’re using OpenMPI. For some reason Chameleon refuses to work with a StarPU
that has been compiled with Intel MPI.&lt;/p&gt;

&lt;h1 id=&quot;compiling-and-linking-your-programs&quot;&gt;Compiling and linking your programs&lt;/h1&gt;

&lt;p&gt;Make sure you have starpu and starpumpi configured in your pkg-config path. You can then
get the compiler flags with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pkg-config --cflags chameleon&lt;/code&gt; and linker flags with
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pkg-config --libs --static chameleon&lt;/code&gt;.&lt;/p&gt;
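&lt;p&gt;Put together, a build command might look like the following (the file name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;my_getrf.c&lt;/code&gt; is just a placeholder, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mpicc&lt;/code&gt; is assumed since this is an MPI-enabled build):&lt;/p&gt;

```shell
# Placeholder file names; mpicc assumed for an MPI-enabled Chameleon build.
mpicc $(pkg-config --cflags chameleon) my_getrf.c \
      $(pkg-config --libs --static chameleon) -o my_getrf
```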

&lt;h1 id=&quot;distributed-lu-factorization-implementations&quot;&gt;Distributed LU factorization implementations&lt;/h1&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chameleon_pzgetrf_nopiv(CHAM_desc_t*, RUNTIME_sequence_t *, RUNTIME_request_t *)&lt;/code&gt;
function is used for a distributed LU factorization using Chameleon with StarPU underneath. It
implements the right-looking variant of the LU factorization, a very common algorithm
made popular by ScaLAPACK.&lt;/p&gt;
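&lt;p&gt;For reference, the right-looking variant is easy to sketch in NumPy (a toy, no-pivoting version for illustration only; Chameleon’s implementation is tiled and task-based): each step factors a panel, solves a block row, and then immediately updates the entire trailing submatrix, which is what exposes many independent tile updates to the runtime:&lt;/p&gt;

```python
# NOTE: a toy, no-pivoting sketch of the right-looking algorithm,
# not Chameleon's implementation -- only the math is the same.
import numpy as np

def lu_nopiv(A, nb=2):
    """Blocked right-looking LU without pivoting.

    On return, the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U.
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # 1. Panel factorization: unblocked LU of the block column A[k:n, k:e].
        for j in range(k, e):
            A[j+1:n, j] /= A[j, j]
            A[j+1:n, j+1:e] -= np.outer(A[j+1:n, j], A[j, j+1:e])
        # 2. Block-row solve: U12 = inv(L11) @ A12.
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        A[k:e, e:n] = np.linalg.solve(L11, A[k:e, e:n])
        # 3. Right-looking update: the whole trailing submatrix is updated
        #    at once, exposing many independent tile-sized GEMM tasks.
        A[e:n, e:n] -= A[e:n, k:e] @ A[k:e, e:n]
    return A
```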

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chameleon_pzgetrf_incpiv()&lt;/code&gt; function is used for a distributed LU using a newer
LU algorithm presented in &lt;a href=&quot;&quot;&gt;this paper&lt;/a&gt;. It claims to have superior performance compared
to right-looking LU since communication and computation can be overlapped better.&lt;/p&gt;</content><author><name></name></author><summary type="html">In this post I will detail the steps I took to reproduce distributed LU factorization using the Chameleon library. It is a linear algebra library based on the StarPU runtime system.</summary></entry><entry><title type="html">Ruby wrappers for the XND project</title><link href="https://v0dro.in//blog/2019/09/08/ruby-wrappers-for-the-xnd-project/" rel="alternate" type="text/html" title="Ruby wrappers for the XND project" /><published>2019-09-08T09:12:00+00:00</published><updated>2019-09-08T09:12:00+00:00</updated><id>https://v0dro.in//blog/2019/09/08/ruby-wrappers-for-the-xnd-project</id><content type="html" xml:base="https://v0dro.in//blog/2019/09/08/ruby-wrappers-for-the-xnd-project/">&lt;!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-generate-toc again --&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#introduction&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#ndtypes&quot;&gt;Ndtypes&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#usage&quot;&gt;Usage&lt;/a&gt;
        &lt;ul&gt;
          &lt;li&gt;&lt;a href=&quot;#basic-initialization&quot;&gt;Basic initialization&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#concrete-vs-abstract-types&quot;&gt;Concrete Vs. Abstract Types&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#typedefs&quot;&gt;Typedefs&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#usage-via-the-c-api&quot;&gt;Usage via The C API&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#implementation&quot;&gt;Implementation&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#xnd&quot;&gt;Xnd&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#basic-usage&quot;&gt;Basic Usage&lt;/a&gt;
        &lt;ul&gt;
          &lt;li&gt;&lt;a href=&quot;#data-type-support&quot;&gt;Data Type Support&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#missing-values&quot;&gt;Missing Values&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#usage-via-the-c-api&quot;&gt;Usage via The C API&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#implementation&quot;&gt;Implementation&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#gumath&quot;&gt;Gumath&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#usage&quot;&gt;Usage&lt;/a&gt;
        &lt;ul&gt;
          &lt;li&gt;&lt;a href=&quot;#usage-via-the-c-api&quot;&gt;Usage via The C API&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#implementation&quot;&gt;Implementation&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#automatic-kernel-generation&quot;&gt;Automatic Kernel Generation&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;!-- markdown-toc end --&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Lack of stable and reliable scientific computing software has been a persistent problem
for the Ruby community, making it hard for enthusiastic Ruby developers to use Ruby in
everything from their web applications to their data analysis projects. One of the most important
components of any successful scientific software stack is a well maintained and flexible
array computation library that can act as a fast and simple way of storing in-memory data
and interfacing it with various fast and battle-tested libraries like LAPACK and BLAS.&lt;/p&gt;

&lt;p&gt;Various projects have attempted to make such libraries in the past (and some are still thriving
and maintained). Some of the notable ones are &lt;a href=&quot;https://github.com/ruby-numo&quot;&gt;numo&lt;/a&gt;, &lt;a href=&quot;https://github.com/SciRuby/nmatrix&quot;&gt;nmatrix&lt;/a&gt;, and more recently, &lt;a href=&quot;https://github.com/SciRuby/numruby&quot;&gt;numruby&lt;/a&gt;.
These projects attempt to provide a simple Ruby-like API for creating and manipulating arrays
of various types. All of them are able to easily interface with libraries like ATLAS, FFTW
and LAPACK.&lt;/p&gt;

&lt;p&gt;However, all of the above projects fall short in two major aspects:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Lack of extensibility to adapt to modern use cases (read Machine Learning).&lt;/li&gt;
  &lt;li&gt;Lack of a critical mass of developers to maintain a robust and fast array library.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first problem is mainly due to the fact that they do not support very robust type systems.
The available data types are limited and are hard to extend to more complex use cases. Modern use cases like
Machine Learning require much more robust data types, as has been demonstrated by the tensor
implementations of various frameworks like Tensorflow and PyTorch.&lt;/p&gt;

&lt;p&gt;The second problem is due to the fact that all of the aforementioned projects are community
efforts that are maintained part-time by developers simply out of a sense of purpose and
passion. Sustaining such complex projects for extended periods of time without expectation
of any support is simply unfeasible even for the most driven engineers.&lt;/p&gt;

&lt;p&gt;This is where the XND project comes in. The &lt;a href=&quot;https://xnd.io/&quot;&gt;XND project&lt;/a&gt; is a project for
building a common library that is able to meet the needs of the various data analysis and
machine learning frameworks that have had to build their own array objects and programming 
languages. It is built with the premise of extending arrays with new types and various 
device types (CPUs, GPUs etc.) without loss of performance and ease of use.&lt;/p&gt;

&lt;p&gt;The XND project as a whole is a product of three C libraries : ndtypes, xnd and gumath. They
have been made such that they can work as standalone C libraries that can be interfaced
with any language binding (currently supporting Ruby and Python). Ndtypes is used for defining
the shape of data within memory, XND is a data container that holds that data and gumath provides
a multiple dispatch mechanism for performing computations on data held in XND containers. We will
elaborate on each of these in the post below.&lt;/p&gt;

&lt;p&gt;The XND project presents the perfect answer to Ruby’s lack of a mature array computation ecosystem. 
It is highly extensible, allows defining data types in almost any combination with a simple and
intuitive interface, is built with performance in mind and is backed by a team consisting of 
experts who have vast experience in this domain for the Python scientific computing stack.&lt;/p&gt;

&lt;p&gt;The biggest backer of XND as of now is Quansight, and I as a part-time engineer am responsible 
for maintaining the Ruby wrapper for XND. This post is a rather long and detailed introduction 
to the XND ruby wrapper including various use cases and benchmarks. There will also be some
details on the implementation of the wrapper and how it differs from the python wrapper (which
existed before the Ruby wrapper). Read on for further details.&lt;/p&gt;

&lt;p&gt;All the source code can be found in the &lt;a href=&quot;https://github.com/xnd-project/xnd-ruby&quot;&gt;xnd-ruby&lt;/a&gt; repo.&lt;/p&gt;

&lt;h1 id=&quot;ndtypes&quot;&gt;Ndtypes&lt;/h1&gt;

&lt;p&gt;Ndtypes is the library that is used for defining the shape of data.&lt;/p&gt;

&lt;p&gt;Run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gem install ndtypes --pre&lt;/code&gt; for easily installing ndtypes onto your machine. It has
been tested with Ruby 2.4.1 so far. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gem install&lt;/code&gt; will download the C sources and compile
them by itself.&lt;/p&gt;

&lt;h2 id=&quot;usage&quot;&gt;Usage&lt;/h2&gt;

&lt;h3 id=&quot;basic-initialization&quot;&gt;Basic initialization&lt;/h3&gt;

&lt;p&gt;The ndtypes Ruby wrapper provides a simple interface to the ndtypes C library for creating
complex data shapes with extreme simplicity. For example, for creating an array of 10 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int64&lt;/code&gt;
digits, all we need to do is create an instance of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NDT&lt;/code&gt; class:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;NDT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;10 * int64&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Not only can you create arrays, but also very complex types, for example a nested record (xnd 
terminology for a Ruby &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Hash&lt;/code&gt;) with the values as arrays of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float32&lt;/code&gt; of size 25 each:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;NDT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;{x: 25 * float32, y: {a: 25 * float64, 25 * float64}}&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;concrete-vs-abstract-types&quot;&gt;Concrete Vs. Abstract Types&lt;/h3&gt;

&lt;p&gt;Ndtypes distinguishes types depending on whether they are abstract or concrete. Abstract types
can have symbolic values like dimension or type variables and are used for type checking. Concrete 
types additionally have full memory layout information like alignment and data size.&lt;/p&gt;

&lt;p&gt;Some operations can only be performed on abstract types.&lt;/p&gt;

&lt;h3 id=&quot;typedefs&quot;&gt;Typedefs&lt;/h3&gt;

&lt;p&gt;One can also define typedefs using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NDT#typedef&lt;/code&gt; function and then use them in place of
the original type. Here’s an example of using typedefs to define a graph type:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;no&quot;&gt;NDT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;node&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;int32&quot;&lt;/span&gt;
&lt;span class=&quot;no&quot;&gt;NDT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;cost&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;int32&quot;&lt;/span&gt;
&lt;span class=&quot;no&quot;&gt;NDT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;graph&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;var * var * (node, cost)&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;usage-via-the-c-api&quot;&gt;Usage via The C API&lt;/h3&gt;

&lt;p&gt;Most of the C API functions of ndtypes deal with creating &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NDT&lt;/code&gt; Ruby objects or obtaining
internal struct data of an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NDT&lt;/code&gt; Ruby object. The complete specification can be found
in the &lt;a href=&quot;https://github.com/xnd-project/xnd-ruby/blob/master/ndtypes/ext/ruby_ndtypes/ruby_ndtypes.h&quot;&gt;ruby_ndtypes.h&lt;/a&gt; file. This is the file you should include if you want to use the
C API in any of your libraries.&lt;/p&gt;

&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;/h2&gt;

&lt;p&gt;The Ruby wrapper is a wrapper over the libndtypes library. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NDT&lt;/code&gt; Ruby object is a wrapper
over a C struct of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NdtObject&lt;/code&gt; that has the following definition:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NdtObject&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ndt_t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;                   &lt;span class=&quot;cm&quot;&gt;/* type */&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NdtObject&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This simple struct stores a pointer to a struct of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;const ndt_t *&lt;/code&gt; that is provided
by libndtypes for representing an ndtype. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;const ndt_t&lt;/code&gt; structs are allocated by
various libndtypes functions like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndt_from_string()&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndt_alloc()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Internally libndtypes uses a reference counting mechanism for keeping track of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndt_t&lt;/code&gt; allocations
that need to be destroyed. The reference count can be incremented using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndt_incref()&lt;/code&gt; or
decremented using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndt_decref()&lt;/code&gt;. Once the refcount reaches &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt;, the object is automatically
destroyed by libndtypes. Of course, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndt_t&lt;/code&gt; structs allocated via calls to functions like
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndt_alloc()&lt;/code&gt; already come with a refcount of 1.&lt;/p&gt;

&lt;h1 id=&quot;xnd&quot;&gt;Xnd&lt;/h1&gt;

&lt;p&gt;XND is the main storage library of the project. It uses types defined by ndtypes for defining
the shape of data and allows users to read and write data into buffers that are of the shape
of the data passed to it by ndtypes. It is responsible for maintaining the memory consistency
of data and has provisions for various operations such as slicing, copying and interfacing
data with 3rd party libraries like Apache Arrow. It also serves as a memory buffer for the
functions that are defined within gumath (explained later in this post).&lt;/p&gt;

&lt;p&gt;Similar to the ndtypes wrapper, the xnd Ruby wrapper can be installed with a call to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gem install xnd --pre&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;basic-usage&quot;&gt;Basic Usage&lt;/h2&gt;

&lt;p&gt;The xnd Ruby wrapper is extremely simple to use and provides a single class &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; for the
user that interfaces with libxnd. In the simplest case, one can create an XND object as follows:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;XND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;XND:47340720296980&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 type= 4 * int64&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 value= [1, 2, 3, 4]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Since we have not specified the data type, it will be inferred as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int64&lt;/code&gt; since we are supplying
an array composed entirely of integers. This can be seen using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND#dtype&lt;/code&gt; function, which
will return the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NDT&lt;/code&gt; object that holds the type of this &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; object:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;dtype&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;NDTypes:47340721833280&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	int64 &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;While &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND#dtype&lt;/code&gt; gives the general type of the object, a more precise description of the data
type (including shape etc.) can be obtained using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;type&lt;/code&gt; method:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;type&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;NDTypes:47340721846240&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#  4 * int64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The value within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; object can be obtained as a Ruby Array (or a Hash if it is an NDT record)
using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND#value&lt;/code&gt; method:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;value&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; [1, 2, 3, 4]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We can also check equality between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; objects using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;==&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!=&lt;/code&gt; operators:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;XND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;A nice thing about &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; is that it returns copy-free ‘views’ of data when you perform a slicing
operation. So say we define a 2D tensor &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tensor_2d&lt;/code&gt; like this:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;tensor_2d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;XND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tensor_2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;inspect&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;XND:47340720946720&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 type= 5 * 5 * int64&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 value= [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We can obtain a slice (say, the 2nd column) of the tensor using a Ruby Range. Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INF&lt;/code&gt; is shorthand for the entire axis (usually denoted as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0..-1&lt;/code&gt;):&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;vector_view&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tensor_2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;INF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;XND:47340720380980&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 type= 5 * int64&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 value= [3, 3, 3, 3, 3]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;When using slices, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; will always return a ‘view’ of the original &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; object. Changes
made to this slice will reflect on the original XND object as well:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;vector_view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;666&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tensor_2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;inspect&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;XND:47340720946720&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 type= 5 * 5 * int64&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 value= [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 666, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;However, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;type&lt;/code&gt; of the view and the original object differ as they should:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;vector_view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;type&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;NDTypes:47340720381100&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	5 * int64 &lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tensor_2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;type&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;NDTypes:47340720939360&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	5 * 5 * int64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;If you want separate storage for the view (i.e. you do not want changes to the view
to reflect on the parent object), use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND#dup&lt;/code&gt; method to make a copy. You
can also allocate a data container without storing any data in it using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND.empty&lt;/code&gt;
method, as the examples in the following sections show.&lt;/p&gt;
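&lt;p&gt;The view-versus-copy distinction can be illustrated with plain Ruby objects (no xnd required): assigning an object aliases it, while &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dup&lt;/code&gt; creates independent storage, analogous to slicing an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; object versus calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND#dup&lt;/code&gt;:&lt;/p&gt;

```ruby
# Plain-Ruby analogy (not xnd itself): assignment aliases, dup copies.
row = [1, 2, 3, 4, 5]

view = row        # same storage, like an XND slice
view[2] = 666
row               # change is visible through the parent

copy = row.dup    # independent storage, like XND#dup
copy[0] = 0
row               # parent unchanged by writes to the copy
```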

&lt;h3 id=&quot;data-type-support&quot;&gt;Data Type Support&lt;/h3&gt;

&lt;p&gt;Thanks to the flexibility of the ndtypes type definition interface, xnd can provide
type support for far more flexible data shapes than arrays with fixed
dimensions. For example, you can use records for storing Ruby Hashes and performing operations
on them:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'xnd'&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;XND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;empty&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;{x: complex64, y: bytes, z: string}&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'x'&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'y'&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;abc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'z'&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;any&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'x'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'x'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'y'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'y'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'z'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'z'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;XND:47340721378580&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 type= {x : complex64, y : bytes, z : string}&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 value= {&quot;x&quot;=&amp;gt;(1.0+20.0i), &quot;y&quot;=&amp;gt;&quot;abc&quot;, &quot;z&quot;=&amp;gt;&quot;any&quot;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;missing-values&quot;&gt;Missing Values&lt;/h3&gt;

&lt;p&gt;XND also supports optional data (represented by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nil&lt;/code&gt;). It can be created as follows:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;XND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;empty&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;2 * 4 * ?float64&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;10.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;2.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;100.12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;6.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;7.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;INF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# assign full slice&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; [[10.0, nil, 2.0, 100.12], [nil, nil, 6.0, 7.0]] &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;usage-via-the-c-api-1&quot;&gt;Usage via The C API&lt;/h3&gt;

&lt;p&gt;The primary purpose of the XND Ruby C API is creating and querying XND Ruby objects.
The full API can be found in the &lt;a href=&quot;https://github.com/xnd-project/xnd-ruby/blob/master/xnd/ext/ruby_xnd/ruby_xnd.h&quot;&gt;ruby_xnd.h&lt;/a&gt; file.&lt;/p&gt;

&lt;h2 id=&quot;implementation-1&quot;&gt;Implementation&lt;/h2&gt;

&lt;p&gt;The implementation of the Ruby wrapper differs from the Python wrapper largely due to the nature
of the garbage collection algorithms employed by the two languages: Ruby uses a mark-and-sweep
GC, while Python primarily uses reference counting. Ruby objects created within the C extension
therefore have to be kept ‘alive’ somehow, so that the GC does not deallocate them thinking that they
have gone out of scope and are no longer useful.&lt;/p&gt;

&lt;p&gt;For this purpose we utilize a ‘GC guard’ structure (inspired by the implementation of 
&lt;a href=&quot;https://github.com/mrkn/&quot;&gt;@mrkn&lt;/a&gt;’s &lt;a href=&quot;https://github.com/mrkn/matplotlib.rb&quot;&gt;matplotlib.rb&lt;/a&gt; gem). The GC guard is essentially a global Ruby Hash whose keys are the
Ruby objects created within the C extension; the values are irrelevant. We use a Hash because it
provides lookups in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;O(1)&lt;/code&gt; time, and all we need is to keep the object in some kind of global store so that Ruby is
aware of its presence (in the case of NDT we use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;true&lt;/code&gt; for the value). XND uses three different
GC guards for various internal objects, which can be found in the &lt;a href=&quot;https://github.com/xnd-project/xnd-ruby/blob/master/xnd/ext/ruby_xnd/gc_guard.h&quot;&gt;gc_guard.h&lt;/a&gt; file.&lt;/p&gt;
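&lt;p&gt;A minimal plain-Ruby sketch of the idea (hypothetical names; the real guard is implemented in C against the Ruby C API):&lt;/p&gt;

```ruby
# Hypothetical sketch of a GC guard: a global Hash holds a strong
# reference to each registered object, so the GC cannot collect it.
GC_GUARD = {}

def guard_register(obj)
  GC_GUARD[obj] = true    # the value is irrelevant; the key keeps obj alive
end

def guard_unregister(obj)
  GC_GUARD.delete(obj)    # dropping the key makes obj collectable again
end
```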

&lt;h1 id=&quot;gumath&quot;&gt;Gumath&lt;/h1&gt;

&lt;p&gt;While ndtypes and xnd allow us to define types and memory storage, gumath allows us to actually
do something with them. gumath is a library for defining functions over the various data types
stored within an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; object; the user calls them transparently
through a high-level interface that uses multiple dispatch to invoke the relevant function
on the appropriate type. The Ruby interface is a wrapper over the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libgumath&lt;/code&gt; C library.&lt;/p&gt;
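&lt;p&gt;The dispatch mechanism can be sketched in plain Ruby (a hypothetical, simplified model; real libgumath matches full ndtypes signatures): a table maps a type signature to the kernel implementing the function for that type:&lt;/p&gt;

```ruby
# Hypothetical sketch of multiple dispatch: pick the kernel whose
# type signature matches the arguments' types.
KERNELS = {
  ["int64", "int64"]     => lambda { |a, b| a.zip(b).map { |p, q| p * q } },
  ["float64", "float64"] => lambda { |a, b| a.zip(b).map { |p, q| p * q } }
}

def dispatch(sig, a, b)
  kernel = KERNELS.fetch(sig) { raise "no kernel for #{sig.inspect}" }
  kernel.call(a, b)
end

dispatch(["int64", "int64"], [1, 2, 3], [4, 5, 6])
# => [4, 10, 18]
```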

&lt;p&gt;Some functions (known as kernels) come bundled with libgumath and others can be written fairly
easily. Similar to the xnd and ndtypes wrappers, the gumath Ruby wrapper can be installed with a call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gem install gumath --pre&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;usage-1&quot;&gt;Usage&lt;/h2&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Gumath&lt;/code&gt; class is a top-level namespace for various modules that in turn serve as namespaces for
the functions bundled with the libgumath C library. These modules will keep expanding
as more interfaces are added to libgumath. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Gumath::Functions&lt;/code&gt; module contains the
functions that libgumath provides by default.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Gumath&lt;/code&gt; functions accept &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; objects as arguments and output &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XND&lt;/code&gt; objects with the result
of the function. An example of a simple element-wise multiply kernel is the following:&lt;/p&gt;
&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'xnd'&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'gumath'&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;XND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;dtype: &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;float64&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;XND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;dtype: &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;float64&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;Gumath&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;Functions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;multiply&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# =&amp;gt; #&amp;lt;XND:47340721458320&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 type= 8 * float64&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#	 value= [2.0, 6.0, 12.0, 20.0, 30.0, 42.0, 56.0, 72.0]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
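&lt;p&gt;The result above is just the element-wise product, which can be verified in plain Ruby (no gumath required):&lt;/p&gt;

```ruby
# Plain-Ruby check of the element-wise product computed by the kernel.
x = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
z = x.zip(y).map { |a, b| a * b }
# => [2.0, 6.0, 12.0, 20.0, 30.0, 42.0, 56.0, 72.0]
```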

&lt;h3 id=&quot;usage-via-the-c-api-2&quot;&gt;Usage via The C API&lt;/h3&gt;

&lt;p&gt;Since the main purpose of the gumath C API is to allow adding kernels to a Ruby module,
it provides a single function with the following prototype:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;rb_gumath_add_functions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;VALUE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gm_tbl_t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tbl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;module&lt;/code&gt; parameter is a Ruby object, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tbl&lt;/code&gt; is a function table of gumath kernels.&lt;/p&gt;

&lt;h2 id=&quot;implementation-2&quot;&gt;Implementation&lt;/h2&gt;

&lt;p&gt;Compared to xnd and ndtypes, the gumath Ruby wrapper is much simpler
since its primary function is to take functions from libgumath and add them as module 
functions to Ruby modules.&lt;/p&gt;

&lt;p&gt;When the library is first loaded with a call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;require&lt;/code&gt;, the libgumath kernels
provided by default are loaded into the Ruby interpreter by interfacing each kernel with a
Ruby object. Further details on how method dispatch works within Ruby can be found
in the &lt;a href=&quot;https://github.com/xnd-project/xnd-ruby/blob/master/gumath/CONTRIBUTING.md&quot;&gt;CONTRIBUTING&lt;/a&gt; file.&lt;/p&gt;

&lt;p&gt;The most important part of the C implementation is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GufuncObject&lt;/code&gt; Ruby class,
defined within the C API, which interfaces with a single gumath function. It is
essentially a wrapper over a C struct of the same name that can be found in the &lt;a href=&quot;https://github.com/xnd-project/xnd-ruby/blob/master/gumath/ext/ruby_gumath/gufunc_object.h&quot;&gt;gufunc_object.h&lt;/a&gt;
file.&lt;/p&gt;

&lt;p&gt;The struct has the following definition:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gm_tbl_t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;          &lt;span class=&quot;cm&quot;&gt;/* kernel table */&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;                     &lt;span class=&quot;cm&quot;&gt;/* function name */&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flags&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;                 &lt;span class=&quot;cm&quot;&gt;/* memory target */&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;VALUE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;identity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;                 &lt;span class=&quot;cm&quot;&gt;/* identity element */&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GufuncObject&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table&lt;/code&gt; pointer points to the function’s definition within libgumath, which
holds the information that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gm_apply&lt;/code&gt; uses for making the actual call to
the function with the data. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;name&lt;/code&gt; is a string holding the name of the function. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;flags&lt;/code&gt;
signifies whether the function is a CPU function or a CUDA function (or targets any
other device that might be added in the future). &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;identity&lt;/code&gt; is a Ruby object used for identifying
this function; it is initially set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nil&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;automatic-kernel-generation&quot;&gt;Automatic Kernel Generation&lt;/h2&gt;

&lt;p&gt;Writing kernels can be painstaking if you’re not familiar with the various functionalities
that libgumath provides for this purpose. Therefore we also provide a kernel generator
called &lt;a href=&quot;https://xnd.readthedocs.io/en/latest/xndtools/index.html#kernel-generator&quot;&gt;xndtools&lt;/a&gt; that allows writing gumath kernels by simply providing the function
that needs to be wrapped. However, this functionality has not yet been tested for Ruby.&lt;/p&gt;</content><author><name></name></author><summary type="html">Table of Contents</summary></entry><entry><title type="html">Making research posters using latex and emacs.</title><link href="https://v0dro.in//blog/2018/11/15/making-research-posters-using-latex-and-emacs/" rel="alternate" type="text/html" title="Making research posters using latex and emacs." /><published>2018-11-15T07:54:25+00:00</published><updated>2018-11-15T07:54:25+00:00</updated><id>https://v0dro.in//blog/2018/11/15/making-research-posters-using-latex-and-emacs</id><content type="html" xml:base="https://v0dro.in//blog/2018/11/15/making-research-posters-using-latex-and-emacs/">&lt;p&gt;Using the a0poster package, you can use LaTeX for designing research posters.&lt;/p&gt;

&lt;h1 id=&quot;helpful-commands&quot;&gt;Helpful commands&lt;/h1&gt;

&lt;h2 id=&quot;various-programming-constructs&quot;&gt;Various programming constructs&lt;/h2&gt;

&lt;h3 id=&quot;variables&quot;&gt;Variables&lt;/h3&gt;

&lt;p&gt;Add variables or new commands using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\newcommand&lt;/code&gt;. For example &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\newcommand{\sidel}{6}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Link: &lt;a href=&quot;https://stackoverflow.com/questions/1211888/is-there-any-way-i-can-define-a-variable-in-latex&quot;&gt;Is there any way I can define a variable in LaTeX? (Stack Overflow)&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;adding-a-title-to-your-poster&quot;&gt;Adding a title to your poster&lt;/h2&gt;

&lt;p&gt;Simply use a separate set of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minipage&lt;/code&gt; elements and put the title within those.&lt;/p&gt;
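&lt;p&gt;A minimal sketch of this (the title text and sizes are placeholders; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\VeryHuge&lt;/code&gt; is one of the extra font sizes provided by a0poster):&lt;/p&gt;

```latex
% Hypothetical sketch: a full-width minipage holding the poster title.
\begin{minipage}{\textwidth}
  \centering
  {\VeryHuge \textbf{My Poster Title}} \\[0.5cm]
  {\Large Author Name \quad Institution Name}
\end{minipage}
```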

&lt;h2 id=&quot;changing-fonts-of-the-section-headings&quot;&gt;Changing fonts of the section headings&lt;/h2&gt;

&lt;p&gt;Use the titlesec package for this
&lt;a href=&quot;https://tex.stackexchange.com/questions/59726/change-size-of-section-subsection-subsubsection-paragraph-and-subparagraph-ti&quot;&gt;purpose&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A typical configuration for titlesec looks like so:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\usepackage{titlesec}

\titleformat*{\section}{\LARGE\bfseries}
\titleformat*{\subsection}{\Large\bfseries}
\titleformat*{\subsubsection}{\large\bfseries}
\titleformat*{\paragraph}{\large\bfseries}
\titleformat*{\subparagraph}{\large\bfseries}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;overlaying-text-on-an-image&quot;&gt;Overlaying text on an image&lt;/h2&gt;

&lt;p&gt;Use the ‘overpic’ package for this. See this link for details:
https://tex.stackexchange.com/questions/20792/how-to-superimpose-latex-on-a-picture&lt;/p&gt;

&lt;p&gt;Overpic full docs: http://mirrors.ibiblio.org/CTAN/macros/latex/contrib/overpic/overpic.pdf&lt;/p&gt;
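 
&lt;p&gt;A minimal sketch of overpic usage, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;figure&lt;/code&gt; is a placeholder image name. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grid&lt;/code&gt; option overlays a coordinate grid that helps you find the right placement, and can be removed once done:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\usepackage{overpic}

% \put coordinates are percentages of the image width by default.
\begin{overpic}[width=0.8\linewidth,grid,tics=10]{figure}
  \put(20,30){\Large some label}
\end{overpic}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;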

&lt;h2 id=&quot;drawing-boxes-filled-with-colors&quot;&gt;Drawing boxes filled with colors&lt;/h2&gt;

&lt;p&gt;Simply define a command &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;crule&lt;/code&gt; based on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rule&lt;/code&gt; command. Definition and usage:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\newcommand\crule[3][black]{\textcolor{#1}{\rule{#2}{#3}}}

\crule{1cm}{1cm} \crule[blue]{1cm}{1cm} \crule[red!50!white!100]{1cm}{1cm}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Link:
https://tex.stackexchange.com/questions/106984/how-to-draw-a-square-of-1cm-in-latex-filled-with-color&lt;/p&gt;

&lt;h2 id=&quot;splitting-into-multiple-rows-and-columns&quot;&gt;Splitting into multiple rows and columns&lt;/h2&gt;

&lt;p&gt;Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minipage&lt;/code&gt; for splitting a document into boxes is recommended. The optional first argument controls the vertical alignment of the minipage: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; is top-aligned and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b&lt;/code&gt; is
bottom-aligned.&lt;/p&gt;
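 
&lt;p&gt;For example, two top-aligned columns placed side by side (a sketch):&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\noindent
\begin{minipage}[t]{0.48\linewidth}
  \section*{Left column}
  Content for the left box.
\end{minipage}
\hfill
\begin{minipage}[t]{0.48\linewidth}
  \section*{Right column}
  Content for the right box.
\end{minipage}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;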

&lt;h2 id=&quot;graphics-with-tikz&quot;&gt;Graphics with tikz&lt;/h2&gt;

&lt;h3 id=&quot;lu-decomposition-diagram&quot;&gt;LU decomposition diagram&lt;/h3&gt;

&lt;p&gt;Drawing two boxes with L &amp;amp; U:
https://tex.stackexchange.com/questions/317230/lu-factorization-of-a-matrix-with-plot?newreg=991a708140a2446882fdd9bd3c445af9&lt;/p&gt;

&lt;p&gt;Drawing an arrow between tikzpicture objects:
https://tex.stackexchange.com/questions/260587/an-arrow-between-two-tikzpictures&lt;/p&gt;
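 
&lt;p&gt;A self-contained sketch of the idea from those links, drawing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; as a square and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U&lt;/code&gt; as triangles inside a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tikzpicture&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\begin{tikzpicture}
  % The full matrix A as a square.
  \draw (0,0) rectangle (2,2);
  \node at (1,1) {$A$};
  % Arrow pointing to the factorization.
  \draw[-&amp;gt;,thick] (2.5,1) -- (3.5,1);
  % Lower-triangular factor L.
  \draw (4,2) -- (4,0) -- (6,0) -- cycle;
  \node at (4.6,0.6) {$L$};
  % Upper-triangular factor U.
  \draw (6.5,2) -- (8.5,2) -- (8.5,0) -- cycle;
  \node at (7.9,1.4) {$U$};
\end{tikzpicture}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;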

&lt;h3 id=&quot;dependency-graphs&quot;&gt;Dependency graphs&lt;/h3&gt;

&lt;p&gt;Inspiration can be taken from this state machine &lt;a href=&quot;http://www.texample.net/tikz/examples/state-machine/&quot;&gt;tutorial&lt;/a&gt;
for drawing dependency graphs. Basically put things inside a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tikzpicture&lt;/code&gt; block. Use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\node&lt;/code&gt;
command for defining a node and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\path&lt;/code&gt; command for connecting these nodes.&lt;/p&gt;
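 
&lt;p&gt;A minimal sketch of a three-node dependency graph along those lines:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\begin{tikzpicture}[every node/.style={draw,circle},node distance=3cm]
  % Define the nodes.
  \node (a) {A};
  \node (b) [right of=a] {B};
  \node (c) [below of=a] {C};
  % Connect them with directed edges.
  \path[-&amp;gt;] (a) edge (b)
            (a) edge (c)
            (b) edge (c);
\end{tikzpicture}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;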

&lt;h3 id=&quot;drawing-things-on-pictures&quot;&gt;Drawing things on pictures&lt;/h3&gt;

&lt;p&gt;Using tikz one can annotate pictures with various shapes and labels.&lt;/p&gt;

&lt;p&gt;Link:
https://tex.stackexchange.com/questions/9559/drawing-on-an-image-with-tikz&lt;/p&gt;
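 
&lt;p&gt;Roughly, the pattern from that link: put the image inside a node, then draw in a scope whose unit vectors span the image, so &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(0,0)&lt;/code&gt;–&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(1,1)&lt;/code&gt; covers the whole picture (a sketch; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;photo&lt;/code&gt; is a placeholder):&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\begin{tikzpicture}
  % Place the image as a node anchored at the origin.
  \node[anchor=south west,inner sep=0] (img) at (0,0)
    {\includegraphics[width=0.6\linewidth]{photo}};
  % Scale coordinates so (0,0)--(1,1) spans the image.
  \begin{scope}[x={(img.south east)},y={(img.north west)}]
    \draw[red,very thick] (0.2,0.3) rectangle (0.45,0.55);
    \node[red] at (0.7,0.8) {a label};
  \end{scope}
\end{tikzpicture}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;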

&lt;h1 id=&quot;useful-links&quot;&gt;Useful links&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;Getting started PDF: https://www.tug.org/pracjourn/2008-3/morales/morales.pdf&lt;/li&gt;
  &lt;li&gt;Very good sample template poster: https://www.latextemplates.com/template/a0poster-portrait-poster&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><summary type="html">Using the a0poster package, you can use LaTeX to design research posters.</summary></entry></feed>