The number of ways in which one can profile and benchmark Python programs is daunting. There’s many options out there, and this post is about the ones that I found suitable for profiling and benchmarking PRs that I submit to PyTorch every now and then. Coming from a land of C++ and Ruby, one annoying thing I find about the Python tools is the preference for providing the code to profiled inside a string as an argument to profiling tool, so I try to directly instrument calls within the code wherever possible.

Profiling C extensions

Say you want to know the function profiles of the following PyTorch script, where we want to know where the scatter_ call is spending most of its time:

import torch
import numpy

M=256
N=512
dim = 0

input_one = torch.rand(M, N)
index = torch.tensor(numpy.random.randint(0, M, (M, N)))
res = torch.randn(M, N)

for _i in range(10000):
    res.scatter_(dim, index, input_one)

Using cProfile

The default profiler for Python is cProfile which is a faster version of the profile module. While this is simple to use and does not require any extra dependencies, it does not show profiles of C++ functions at all. You can use it by calling the cProfile.run function and passing it the code to be profiled as a string like so:

import cProfile

# Do something
cProfile.run("res.scatter_(dim,index,input_one)")

You can see the output of the profiler

Using yep

yepis a utility that uses Google’s gperftools underneath and promises to show profiles of C/C++ functions made inside Python C extensions. On Ubuntu/Debian, first install the google-perftools package. Then run pip install yep.

You can set a region to profile as follows:

import yep

yep.start("file_name.prof")
# do something
yep.stop()

This generates a file file_name.prof that be can be analysed using the pprof utility (which can be installed with go get -u github.com/google/pprof). You can then get the top time consuming functions from pprof as follows:

pprof -text -lines file_name.prof

For our same program, profiling the scatter_ loop shows the following output:

File: python3.6
Type: cpu
Showing nodes accounting for 27.51s, 98.81% of 27.84s total
Dropped 151 nodes (cum <= 0.14s)
      flat  flat%   sum%        cum   cum%
     4.45s 15.98% 15.98%     27.49s 98.74%  _ZZZZZN2at6native12_GLOBAL__N_130cpu_scatter_gather_base_kernelILb1EEclERNS_6TensorElRKS4_S7_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEbRKNS0_17SCATTE
R_GATHER_OPEENKUlvE_clEvENKUlvE2_clEvENKUlRKT_E_clISt8functionIFvPfSR_EEEEDaSN_ENKUlPPcPKllE_clESV_SX_l /home/sameer/gitrepos/pytorch/build/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp.AVX2.cpp:375
     2.84s 10.20% 26.19%      2.84s 10.20%  _ZNK2at6native12_GLOBAL__N_1UlPT_PT0_E2_clIffEEDaS3_S5_ /home/sameer/gitrepos/pytorch/build/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp.AVX2.cpp:171
     2.54s  9.12% 35.31%      2.54s  9.12%  std::forward /usr/include/c++/7/bits/move.h:74
     1.91s  6.86% 42.17%      5.07s 18.21%  _ZNSt17_Function_handlerIFvPfS0_EN2at6native12_GLOBAL__N_1UlPT_PT0_E2_EE9_M_invokeERKSt9_Any_dataOS0_SE_ /usr/include/c++/7/bits/std_function.h:317
     1.39s  4.99% 47.16%     20.25s 72.74%  std::function::operator() /usr/include/c++/7/bits/std_function.h:706
     1.16s  4.17% 51.33%      1.16s  4.17%  std::forward /usr/include/c++/7/bits/move.h:73
     1.14s  4.09% 55.42%     11.48s 41.24%  _ZNSt17_Function_handlerIFvPfS0_EN2at6native12_GLOBAL__N_1UlPT_PT0_E2_EE9_M_invokeERKSt9_Any_dataOS0_SE_ /usr/include/c++/7/bits/std_function.h:316
     1.04s  3.74% 59.16%      1.04s  3.74%  _ZNSt14_Function_base13_Base_managerIN2at6native12_GLOBAL__N_1UlPT_PT0_E2_EE14_M_get_pointerERKSt9_Any_data /usr/include/c++/7/bits/std_function.h:176
     0.91s  3.27% 62.43%      0.91s  3.27%  _ZNSt14_Function_base13_Base_managerIN2at6native12_GLOBAL__N_1UlPT_PT0_E2_EE14_M_get_pointerERKSt9_Any_data /usr/include/c++/7/bits/std_function.h:175
     0.90s  3.23% 65.66%      0.90s  3.23%  std::_Any_data::_M_access /usr/include/c++/7/bits/std_function.h:107
     0.87s  3.12% 68.79%      0.87s  3.12%  _ZNK2at6native12_GLOBAL__N_1UlPT_PT0_E2_clIffEEDaS3_S5_ /home/sameer/gitrepos/pytorch/build/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp.AVX2.cpp:170
     0.86s  3.09% 71.88%      0.86s  3.09%  std::function::operator() /usr/include/c++/7/bits/std_function.h:701
     0.79s  2.84% 74.71%      0.79s  2.84%  [libtorch_cpu.so]

Some notes on yep

If you change the shared object file that your program was running and call pprof on the same .prof file, the program will show nonsensical functions since it only maps the function hex code to the hex code from the shared object file.

Analyzing performance regressions

Analysis of performance regressions requires comparing the same interfaces over different implementations.

Time regression analysis

The simplest performance regression can be in terms of time of execution. Using the ipython magic command is a great way to know mean and standard deviation of multiple executions of the same lines of code. Using this within a script requires usage of embedded ipython. The timeit magic method allows for timing code, and when used with the -o option will also return the object containing information about the recent timing run.

Profiling and benchmarking Python programs