- 0_tree
  - step1.cxx
  - step2.cxx
  - step3.cxx
  - step04.cxx
- 1_traversal
  - step1.cxx
  - step2.cxx
- 2_kernels
  - kernel.h
  - vector.h
  - exafmm.h
  - exafmm2d.h and step1.cxx
  - step2.cxx

This program simply populates some bodies with random numbers, creates hypothetical X and Y axes, and figures out the quadrant of each of the bodies.

Each node of the tree has a maximum of 4 immediate children. We first initialize 100 `struct Body` objects, and then set the X and Y co-ordinates of each of them to a random number between 0 and 1.

In order to actually build the tree, we follow these steps:

- First get the bounds between which the random numbers lie. That is, we figure out the min and max random number present in the bodies.
- We then get a ‘center’ and a ‘radius’. This is useful for creating ‘quadrants’ and partitioning points into different quadrants in later steps. The center is calculated by adding the min and max numbers (which we treat as the diameter) and dividing by 2. This step is necessary since there is no ‘square’ space that can be partitioned into multiple spaces like there was in the lecture series. The way of calculating the radius `r0` is a little peculiar. It does not use the distance formula; its main purpose is….
- Then simply count the bodies in each quadrant and display them.
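The steps above can be sketched in Ruby like so. The names here are assumptions for illustration, and since the exact `r0` formula in the original is elided above, half the range is used as a stand-in radius:

```ruby
# Sketch of step 1: random bodies, bounds, center/radius, quadrant counts.
bodies = Array.new(100) { { x: rand, y: rand } }

# bounds: min and max over all co-ordinates
min, max = bodies.flat_map { |b| [b[:x], b[:y]] }.minmax

center = (min + max) / 2.0  # min + max treated as the diameter
r0     = (max - min) / 2.0  # an illustrative radius covering every body

# count the bodies in each quadrant
counts = Array.new(4, 0)
bodies.each do |b|
  quadrant = (b[:x] > center ? 1 : 0) + (b[:y] > center ? 2 : 0)
  counts[quadrant] += 1
end
```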

Ruby code: The body is represented as the Ruby class `Body`:

There is an interesting way of knowing the quadrant in this code. It goes like this:

The above code basically plays with 0 and 1 and returns a number between 0 and 3 as the correct quadrant number.
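The trick can be sketched like this (assumed names, not the exact code from the post): each comparison yields a 0 or a 1, the X test contributes bit 0 and the Y test contributes bit 1, so the sum is a quadrant number from 0 to 3.

```ruby
# 0 = lower-left, 1 = lower-right, 2 = upper-left, 3 = upper-right
def quadrant_of(center, body)
  (body[:x] > center[:x] ? 1 : 0) + (body[:y] > center[:y] ? 2 : 0)
end
```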

This code basically takes the bodies created in the previous step, counts the number of bodies in each quadrant and sorts them by quadrant.

The new steps introduced in this program can be summarized as follows:

- Count the bodies in each quadrant and store the count in an array (the `size` array in the Ruby implementation).
- In the next step we successively add the number of elements in each quadrant, which gives us the offset at which elements from a new quadrant will start in the `bodies` array (of course, after it is sorted).
- We then sort the bodies according to the quadrant that they belong to. Something peculiar that I notice about this part is that `counter[quadrant]` also gets incremented after each iteration during sorting. Why is this the case even though the counters have been set to the correct offsets previously?

This program introduces a new method called `buildTree`, inside of which we actually build the tree. It removes some of the sorting logic from `main` and puts it inside `buildTree`. The `buildTree` function does the following:

- Most of the logic relating to sorting is the same. The only difference is that the `bodies` array is now sorted in place and the `buffer` array no longer stores elements.
- A new step is that we re-calculate the center and the radius based on the sorted co-ordinates. This is done because we want new centers and radii for the children.
- The `buildTree` function is called recursively, so that the quadrants are divided until the innermost quadrant in the hierarchy contains no more than 4 elements.
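The recursive structure described above can be sketched like this. The names (`NCRIT`, `offsets`, the hash-based bodies) are assumptions for illustration, not the actual step code:

```ruby
NCRIT = 4  # a leaf holds at most 4 bodies

def quadrant_index(center, b)
  (b[:x] > center[0] ? 1 : 0) + (b[:y] > center[1] ? 2 : 0)
end

def build_tree(bodies, start, finish, center, radius, cells)
  cells << { center: center, radius: radius, nbody: finish - start }
  return if finish - start <= NCRIT
  # count bodies per quadrant, then turn the counts into offsets
  size = Array.new(4, 0)
  (start...finish).each { |n| size[quadrant_index(center, bodies[n])] += 1 }
  offsets = [start]
  1.upto(3) { |i| offsets[i] = offsets[i - 1] + size[i - 1] }
  # sort this slice of bodies in place, going through a buffer
  counter = offsets.dup
  buffer = bodies[start...finish]
  buffer.each do |b|
    q = quadrant_index(center, b)
    bodies[counter[q]] = b
    counter[q] += 1
  end
  # recurse into each quadrant with a shifted center and halved radius
  4.times do |q|
    child_center = [center[0] + radius / 2 * ((q & 1) * 2 - 1),
                    center[1] + radius / 2 * (((q >> 1) & 1) * 2 - 1)]
    build_tree(bodies, offsets[q], offsets[q] + size[q], child_center,
               radius / 2, cells)
  end
end
```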

Implementation:

There is an interesting piece of code in the part for calculating new center and radius:

In the above code, there is some bit shifting and interleaving taking place whose main purpose is to split the quadrant number into X and Y dimensions and then use these to calculate the center of the child cell.
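A hypothetical helper illustrating that bit manipulation: bit 0 of the quadrant number picks the X direction and bit 1 picks the Y direction, each mapped from {0, 1} to {-1, +1}.

```ruby
def child_center(center, radius, quadrant)
  dx = (quadrant & 1) * 2 - 1          # bit 0: 0 -> -1, 1 -> +1
  dy = ((quadrant >> 1) & 1) * 2 - 1   # bit 1: 0 -> -1, 1 -> +1
  [center[0] + dx * radius / 2.0, center[1] + dy * radius / 2.0]
end
```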

Another piece of code is this:

```ruby
counter = Array.new 4, start
1.upto(3) do |i|
  counter[i] = size[i-1] + counter[i-1]
end

# sort bodies and store them in buffer
buffer = bodies.dup
start.upto(finish-1) do |n|
  quadrant = quadrant_of x0, buffer[n]
  bodies[counter[quadrant]] = buffer[n]
  counter[quadrant] += 1
end
```

In the above code, the `counter` variable is first used to store the offsets of the elements in different quadrants. In the next loop it acts as a running index that tracks where the next body belonging to that quadrant should be placed.

In this step we use the code written in the previous steps and actually build the tree. The tree is built recursively by splitting into quadrants and then assigning them to cells based on the quadrant. The ‘tree’ is actually stored in an array.

The cells are stored in a C++ vector called `cells`.

In the `Cell` struct, I wonder why the body is stored as a pointer and not a variable.

Some implementation details in the Ruby code, like saving the size of an Array during a recursive call, are slightly different since Ruby does not support pointers, but the data structures and overall code are more or less a direct port.

This code is for traversal of the tree that was created in the previous step. The full code can be found in the 1_traversal.rb file.

This step implements the P2M and M2M passes of the FMM.

One major difference between the C++ and Ruby implementations is that since Ruby does not have pointers, I have used array indices of the elements instead. For this purpose there are two attributes in the `Cell` class: `first_child_index`, which holds the index in the `cells` array of the first child of this cell, and `first_body_index`, which holds the index in the `bodies` array of this cell's first body.

This step does this by introducing a method called `upwardPass`, which iterates through nodes and their children and computes the P2M and M2M kernels.
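A structural sketch of that pass (assumed names, not the actual code): the real P2M and M2M kernels compute multipole coefficients; here they just sum body counts so the post-order flow of data from leaves to the root is visible.

```ruby
def p2m(cell)
  cell[:multipole] = cell[:nbody]            # leaf: particles -> multipole
end

def m2m(parent, child)
  parent[:multipole] = (parent[:multipole] || 0) + child[:multipole]
end

def upward_pass(cells, index = 0)
  cell = cells[index]
  if cell[:nchild].zero?
    p2m(cell)
  else
    cell[:nchild].times do |i|
      child_index = cell[:first_child_index] + i
      upward_pass(cells, child_index)        # visit children first
      m2m(cell, cells[child_index])          # child multipole -> parent
    end
  end
end
```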

This step implements the rest of the kernels, i.e. M2L, L2L, L2P and P2P. It also introduces two new methods: `downward_pass`, which calculates the local forces from other local forces and L2P interactions, and `horizontal_pass`, which calculates the inter-particle interactions and M2L.

No special code as such over here; it's just the regular FMM machinery.

This code is quite different from the previous two. While the previous programs were mostly restricted to a single file, this program substantially increases complexity and spreads the implementation across several files. We start using 3-dimensional co-ordinates too.

In this code, we start to move towards a spherical co-ordinate system to represent the particles in 3D. A few notable algorithms taken from research papers have been implemented in this code.

Let's describe each file and see what implementation lies inside.

The `kernel.h` header file implements all the FMM kernels. It also implements two special functions called `evalMultipole` and `evalLocal` that evaluate the multipole and local expansions for spherical co-ordinates using the algorithm actually used in exafmm. An implementation of this algorithm can be found on page 16 of the paper “Treecode and fast multipole method for N-body simulation with CUDA” by Yokota sensei. A preliminary implementation of this algorithm can be found in “A Fast Adaptive Multipole Algorithm in Three Dimensions” by Cheng.

The Ruby implementation of this file is in `kernel.rb`.

I will now describe this algorithm here as best I can:

This is a vector that defines the spherical harmonics of degree *n* and order *m*. A primitive version for computing this exists in the paper by Cheng, and a newer, faster version in the paper by Yokota.

Spherical harmonics allow us to define the series of a function in 3D rather than in 1D, which is usually the case for things like the expansion of *sin(x)*. They are representations of functions on the surface of a sphere instead of on a circle, which is usually the case with other 2D expansion functions. They are like the Fourier series of the sphere. This article explains the notation used nicely.

The degree (*n*) and order (*m*) correspond to the degree and order of the Legendre polynomial that is used for obtaining the spherical harmonic. *n* is an integer and *m* goes from *0..n*.

For optimization purposes, the values stored inside `ynm` are not the ones that correspond directly to the spherical harmonic, but values that yield optimized results when the actual computation happens.

This file is a new and improved version of the laplace.h file from the exafmm-alpha repo. Due to the enhancements made, the code in this file performs calculations that are significantly more accurate than those in laplace.h.

laplace.h consists of a C++ class inside which all the functions reside, along with a constructor that computes pre-determined values for subsequent computation of the kernels. For example, in the constructor of the `Kernel` class, there is a line like so:

```cpp
Anm[nm] = oddOrEven(n)/std::sqrt(fnmm*fnpm);
```

This line computes the value of `Anm` as given by Cheng’s paper (equation 14). This value is used in the M2L and L2L kernels later. However, this value is never directly computed in the new and optimized `kernel.h` file. Instead, it modifies the computation of the `Ynm` vector such that it no longer becomes necessary to involve the `Anm` term in any kernel computation.

This function converts cartesian co-ordinates (X, Y, Z) to spherical co-ordinates involving `radius`, `theta` and `phi`. `radius` is simply the square root of the norm of the co-ordinates (norm is defined as the sum of squares of the co-ordinates in `vec.h`).
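A direct Ruby transcription of such a conversion might look like this. This is a sketch; the exact angle conventions in the source may differ:

```ruby
def cart2sph(x, y, z)
  r = Math.sqrt(x * x + y * y + z * z)       # sqrt of the norm (sum of squares)
  theta = r.zero? ? 0.0 : Math.acos(z / r)   # polar angle, measured from the z axis
  phi = Math.atan2(y, x)                     # azimuthal angle in the x-y plane
  [r, theta, phi]
end
```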

This algorithm calculates the multipole of a cell. It uses spherical harmonics so that the net effect of the forces inside a sphere can be estimated on the surface of the sphere, which can then be treated as a single body for estimating forces.

The optimizations present in the `kernel.h` version of this file are quite complex to understand since they look quite different from the original equation.

For code that is still sane and easier to read, head over to the laplace.h file in exafmm-alpha. The explanations that follow are from that file. We will then see how the same functions in `kernel.h` have been modified to make computation faster and less dependent on large-number divisions, which reduce the accuracy of the system.

The `evalMultipole` function basically tries to populate the `Ynm` array with data computed with the following equation:

It starts with evaluating terms that need not be computed for every iteration of `n`, and computes those terms in the outer loop itself. The terms in the outer loop correspond to the condition `m=n`. The first of these is the exponential term.

After this is a curious case of the computation of some indexes called `npn` and `nmn`. These are computed as follows:

The corresponding index calculation for the inner loop is like this:

This indexes the `Ynm` array. This is done because we are visualizing the `Ynm` array as a pyramid whose base spans from `-m` to `m` and whose height is `n`. A rough visualization of this pyramid would be like so:

```
-m ---------- m
n 10 11 12 13 14
| 6 7 8 9
| 3 4 5
| 1 2
V 0
```

The above formulas give the indexes for each half of the pyramid. Since the values of one half of the pyramid are conjugates of the other half, we need only iterate from `m=0` to `m<P` and use this indexing method to obtain the index of the other half of the pyramid.

Now let us talk about the evaluation of the Associated Legendre Polynomial, where *m* is the order of the differential equation and *n* is the degree. The Associated Legendre Polynomial is the solution to the Associated Legendre Equation. The Legendre polynomial can be expressed in terms of the Rodrigues form for computation without dependence on the simple Legendre Polynomial. However, due to the factorials and rather large divisions that need to be performed to compute the Associated Legendre Polynomial in this form, computing it for large values of *m* and *n* quickly becomes unstable. Therefore, we use a recurrence relation of the polynomial in order to compute different values.

The recurrence relation looks like so:

$$(n - m + 1)\,P_{n+1}^m(x) = (2n + 1)\,x\,P_n^m(x) - (n + m)\,P_{n-1}^m(x)$$

This is expressed in the code with the following line:

```ruby
p = (x * (2 * n + 1) * p1 - (n + m) * p2) / (n - m + 1)
```

It can be seen that `p` is equivalent to $P_{n+1}^m$, `p1` is equivalent to $P_n^m$ and `p2` is equivalent to $P_{n-1}^m$. This convention is followed everywhere in the code.

Observe that the above equation requires the values of *P* at *n* and *n-1* to be known so that the value of *P* at *n+1* can be computed. Therefore, we need a starting value: the first step up from $P_m^m$ is $P_{m+1}^m$, which can be expressed like this:

$$P_{m+1}^m(x) = x\,(2m + 1)\,P_m^m(x)$$

The above equation is expressed by the following line in the code:

```ruby
p = x * (2 * m + 1) * p1
```

If you read the code closely, you will see that just at the beginning of the `evalMultipole` function, we initialize `p1 = 1` the first time the loop runs. This is because `p1` at the first instance corresponds to `m = 0`, and if we substitute `m = 0` in this equation:

$$P_m^m(x) = (-1)^m\,(2m - 1)!!\,(1 - x^2)^{m/2}$$

we get $P_0^0(x) = 1$.

When you look at the code initially, there might be some confusion regarding the significance of having two `rho` terms, `rhom` and `rhon`. This is written because each term of `Ynm` depends on a particular power of `rho` raised to `n`. So just before the inner loop, you can see the line `rhon = rhom`, which basically reduces the number of times that `rho` needs to be multiplied, since the outer loop’s value of `rho` is already set to what it should be for that particular iteration.

Finally, see that there is a line right after the inner loop which reads like this:

```ruby
pn = -pn * fact * y
```

This line calculates the value of `p1` (i.e. $P_{m+1}^{m+1}$) for the next iteration of the loop. Since the double-factorial term in the equation basically just deals with odd numbers, the calculation of this term can be simplified by simply incrementing `fact` by `2` with `fact += 2`. The `y` term in the above equation is in fact `sin(alpha)` (defined at the top of this function). This is because, if you look at the original equation, you will see that the third term is $(1 - x^2)^{m/2}$, and *x* is in fact `cos(alpha)`. Therefore, using the trigonometric identity $\sin^2\alpha + \cos^2\alpha = 1$, we can simply substitute the entire term with `y`.
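Putting the pieces together, here is a hedged Ruby sketch (assumed names, not the actual laplace.h code) of how the two recurrences above generate every $P_n^m$ for `0 <= m <= n < p`:

```ruby
def legendre_table(p, x)
  y = Math.sqrt(1.0 - x * x)      # sin(alpha) when x = cos(alpha)
  table = {}
  pn = 1.0                        # P_m^m, starting from P_0^0 = 1
  fact = 1.0                      # odd numbers 1, 3, 5, ... of the double factorial
  (0...p).each do |m|
    table[[m, m]] = pn
    p1 = pn
    pcur = x * (2 * m + 1) * p1   # P_{m+1}^m = x (2m + 1) P_m^m
    table[[m + 1, m]] = pcur if m + 1 < p
    ((m + 1)...(p - 1)).each do |n|
      p2 = p1
      p1 = pcur
      # (n - m + 1) P_{n+1}^m = (2n + 1) x P_n^m - (n + m) P_{n-1}^m
      pcur = (x * (2 * n + 1) * p1 - (n + m) * p2) / (n - m + 1)
      table[[n + 1, m]] = pcur
    end
    pn = -pn * fact * y           # step P_m^m up to P_{m+1}^{m+1}
    fact += 2
  end
  table
end
```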

Now that a background of the basic implementation of `evalMultipole` has been established, we can move over to understanding the code that is placed inside the kernel.h file of the `exafmm/learning` branch. This code is more optimized and can compute results with much higher accuracy than the code present in the `exafmm-alpha` repo that we previously saw. The main inspiration for this code comes from the Treecode paper posted above.

In this code, most of the stuff relating to indexing and calculation of the powers of `rho` is pretty much the same. However, there are some important changes with regard to the computation of the values that go inside the `Ynm` array. This change is also reflected in the subsequent kernels.

For instance, this new function derives an important term from Epton’s paper (equation `2.20`).

The Ruby implementation is here.

A major difference exists between the computation of M2M in the `kernel.h` and `laplace.h` files.

This file defines a new custom type for storing 1D vectors called `vec` as a C++ class. It also defines various functions that can be used on vectors, like `norm`, `exp` and other simple arithmetic.

The Ruby implementation of this file is in `vector.rb`.
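A minimal sketch (not the actual vector.rb) of what such a vec-like class might look like, with elementwise arithmetic plus `norm` and `exp`:

```ruby
class Vec
  attr_reader :data

  def initialize(data)
    @data = data
  end

  def +(other)
    Vec.new(@data.zip(other.data).map { |a, b| a + b })
  end

  def *(scalar)
    Vec.new(@data.map { |a| a * scalar })
  end

  def norm
    @data.sum { |a| a * a }  # sum of squares, as in vec.h
  end

  def exp
    Vec.new(@data.map { |a| Math.exp(a) })
  end
end
```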

Shows a very simple preliminary implementation of the actual exafmm code. Mostly useful for understanding purposes only.

This blog post is meant to be a summary of the work that SciRuby did over the summer and also of my experience at the GSOC 2016 mentor’s summit.

For the 2016 edition of GSOC we had 4 students - Lokesh Sharma, Prasun Anand, Gaurav Tamba and Rajith Vidanaarachchi. All four were undergraduate computer engineering students from colleges in India or Sri Lanka at the time of GSOC 2016.

Lokesh worked on making improvements to daru, a Ruby DataFrame library. He made very significant contributions to daru by adding functionality for storing and performing operations on categorical data, and also significantly sped up the sorting and grouping functionality of daru. His work has now been successfully integrated into the main branch and has also been released on rubygems. Lokesh has remained active as a daru contributor and regularly contributes code and replies to Pull Requests and issues. You can find a wrap up of the work he did throughout the summer in this blog post.

Prasun worked on creating a Java backend for NMatrix, a Ruby library for performing linear algebra operations similar to numpy in Python. This project opened the doors for scientific computation on JRuby. Prasun was able to complete all his project objectives, and his work is currently awaiting review because of the sheer size of the Pull Request and the variety of changes to the library that he had to make in order to accomplish his project goals. You can read about his summer’s work here. Prasun will also be speaking at Ruby Conf India 2017 about his GSOC work and scientific computing on JRuby in general.

Gaurav worked on creating a Ruby wrapper for NASA’s SPICE toolkit. A need for this was felt since Gaurav’s mentor John is a rocket scientist and was keen on having a Ruby wrapper for a library that he used regularly in his work. This resulted in the spice_rub gem. It exposes a very intuitive Ruby interface to the SPICE toolkit. Gaurav also gave a lightning talk about his work at Deccan Ruby Conf (Pune, India). Blog posts summarizing his work can be found here, here and here.

Rajith worked on growing the Ruby wrapper over symengine. His mentor Abinash was a student with SciRuby for GSOC 2015 and volunteered to mentor Rajith so that Rajith could build upon the work that he had done the previous summer. This resulted in a huge increase in functionality for the symengine.rb ruby gem.

To summarize, all four of our students could execute their chosen tasks within the stipulated time and we did not have to fail anyone. All in all, we mentors had a great time working with the students and hope to keep doing this year on year!

The GSOC 2016 mentor’s summit was fantastic. It was great meeting all the contributors and listening to ideas from projects that I had never heard about previously. I also had the opportunity to conduct an unconference session and share my ideas on Scientific Computation in Ruby with like minded people from other organizations.

Here are some photos that I took at the summit:

GSOC has now come to a close. I have learned a great deal myself in the past 3 months, and thought I would share some of my learnings in this blog post in the interest of future GSOC students and mentors.

**Writing a proposal**

Research your ideas for at least a day before asking your first question. Mentors are volunteers and it’s important to respect the time and effort that they’re putting into FOSS. When you do propose an idea, you should also have a good knowledge of why you’re working on that idea in the first place and what kind of long term impact the realization of that idea can have. Putting this across through your proposal can have a positive effect on your selection. Know how to ask questions on mailing lists. A properly researched question should show that you have first taken the effort to understand the topic and are asking the question only because you got stuck somewhere.

**Community bonding**

Make sure you figure out what exactly you have to research and learn during the community bonding phase. There are a lot of things out there that can be learned, but only a few specific things will be helpful for your project. Focus only on those. Ask your mentor specific questions.

**Coding**

Set up a daily schedule for coding and stick to it. Constantly keep in touch with your mentor and make sure they know your progress as well as you do. If you run into previously unseen problems (frequent in programming), tell your mentor about this ASAP and work out a solution together.

Don’t burn yourself out in your enthusiasm. Take regular breaks. Overworking does more harm than good.

**Student selection**

Short story:
If you’re unsure about a student, *don’t select them*. It’s better to have quality than quantity.

Long story: First and foremost, it is very important to establish some organization-wide procedure that will be followed when selecting a student. As a start, consider making a proposal template that contains all the information and details that the student needs to fill up when submitting the proposal. Have a look at the SciRuby application template as an example.

When students start asking questions on the mailing list, it is important for the org admins to keep a watch on students and get a rough idea of who asks the better questions and who doesn’t. Community participation is a great measure for understanding whether a student will live up to your expectations or not. A proposal with a bad first draft might just turn out to be great, simply because the student is open to feedback and is willing to put in the effort to work on it.

We have 3 rounds:
In the first round every mentor rates their *own* student only. In the next round all mentors rate *all* students (students without a mentor and bad proposals drop off).

In each case, when rating a student, mentors put in a comment, making sure to describe how the student has interacted in the proposal phase, what their current coding looks like, and how responsive they are. Mentors can still push their students to do stuff. We like it when students stay responsive in this phase.

In the 3rd round the *org admins* make the final ranking and set the number of slots. By this stage we are pretty clear about the individuals involved (and note that mentor activity counts). When Google allocates the slots, the top-ranked students get in.

**Coding period**

Make sure you communicate to your student that they are supposed to send you *daily* updates of their progress. One paragraph about their work on that particular day should suffice.

Let’s demonstrate the basic working of a lexical analyser and parser in action with a very simple addition program. Before you start, please make sure rake, oedipus_lex and racc are installed on your computer.

The most fundamental need of any parser is string tokens to work with, which we will provide by way of lexical analysis using the oedipus_lex gem (the logical successor of rexical). Go ahead and create a file `lexer.rex` with the following code:

```ruby
class AddLexer
macro
  DIGIT /\d+/
rule
  /#{DIGIT}/ { [:DIGIT, text.to_i] }
  /.|\n/     { [text, text] }
inner
  def do_parse; end # this is a stub.
end # AddLexer
```

In the above code, we have defined the lexical analyser using Oedipus Lex’s syntax inside the `AddLexer` class. Let’s go over each element of the lexer one by one:

**macro**

The `macro` keyword lets you define macros for certain regular expressions that you might need to write repeatedly. In the above lexer, the macro `DIGIT` is a regular expression (`\d+`) for detecting one or more digits. We place the regular expression inside forward slashes (`/../`) because oedipus_lex requires it that way. The lexer can handle any valid Ruby regular expression. See the Ruby docs for details on Ruby regexps.

**rule**

The section under the `rule` keyword defines your rules for the lexical analysis. Now, it so happens that we’ve defined a macro for detecting digits, and in order to use that macro in the rules, it must be placed inside a Ruby string interpolation (`#{..}`). The code to the right of `/#{DIGIT}/` states the action that must be taken if such a regular expression is encountered: the lexer will return a Ruby Array whose first element is `:DIGIT`. The second element uses the `text` variable. This is a reserved variable in lex that holds the text that the lexer has matched. Similarly, the second rule will match any character (`.`) or a newline (`\n`) and return an `Array` with `[text, text]` inside it.

**inner**

Under the `inner` keyword you can specify any code that you want to appear inside your lexer class. This can be any logic that you want your lexer to execute. The Ruby code under the `inner` section is copied as-is into the final lexer class. In the above example, we’ve written an empty method called `do_parse` inside this section. This method is mandatory if you want your lexer to successfully execute. We’ll be coupling the lexer with `racc` shortly, so unless you want to write your own parsing logic, you should leave this method empty.

In order for our addition program to be successful, it needs to know what to do with the tokens that are generated by the lexer. For this purpose, we need racc, an LALR(1) parser generator for Ruby. It is similar to yacc or bison and lets you specify grammars easily.

Go ahead and create a file called `parser.racc` in the same folder as the previous `lexer.rex` and `Rakefile`, and put the following code inside it:

```ruby
class AddParser
rule
  target: exp { result = 0 }

  exp: exp '+' exp { result += val[2]; puts result }
     | DIGIT
end

---- header
require_relative 'lexer.rex.rb'

---- inner
def next_token
  @lexer.next_token
end

def prepare_parser file_name
  @lexer = AddLexer.new
  @lexer.parse_file file_name
end
```

As you can see, we’ve put the logic for the parser inside the `AddParser` class. Yacc’s `$$` is the `result`; `$0`, `$1`… is an array called `val`; and `$-1`, `$-2`… is an array called `_values`. Notice that in racc, only the parsing logic exists inside the class and everything else (i.e. under `header` and `inner`) exists *outside* the class. Let’s go over each part of the parser one by one:

**class AddParser**

This is the core class that contains the parsing logic for the addition parser. Similar to `oedipus_lex`, it contains a `rule` section that specifies the grammar. The parser expects tokens in the form of `[:TOKEN_NAME, matched_text]`. The `:TOKEN_NAME` must be a symbol. This token name is matched to literal characters in the grammar (`DIGIT` in the above case). `target` and `exp` are variables. Have a look at this introduction to LALR(1) grammars for further information.

**header**

The `header` keyword tells racc what code should be put at the top of the parser that it generates. You usually put your `require` statements here. In this case, we load the lexer class so that the parser can use it for accessing the tokens generated by the lexer. Notice that `header` is preceded by 4 hyphens (`-`) and a space. This is mandatory; without it your program will malfunction.

**inner**

The `inner` keyword tells racc what should be put *inside* the generated parser class. As you can see, there are two methods in the above example: `next_token` and `prepare_parser`. The `next_token` method is mandatory for the parser to function and you must include it in your code. It should contain logic that returns the next token for the parser to consider. Moving on to the `prepare_parser` method: it takes the name of the file that is to be parsed as an argument (how we pass that argument in will be seen later), and initializes the lexer. It then calls the `parse_file` method, which is present in the lexer class by default.

The `next_token` method in turn uses the `@lexer` object’s `next_token` method to get a token generated by the lexer so that it can be used by the parser.

Our lexical analyser and parser are now coupled to work with each other, and we can now use them in a Ruby program to parse a file. Create a new file called `adder.rb` and put the following code in it:

```ruby
require_relative 'parser.racc.rb'

file_name = ARGV[0]
parser = AddParser.new
parser.prepare_parser(file_name)
parser.do_parse
```

The `prepare_parser` method is the same one that was defined in the `inner` section of `parser.racc` above. The `do_parse` method called on the parser signals the parser to start doing its job.

In a separate file called `text.txt`, put the following text:

```
2+2
```

Oedipus Lex does not have a command line tool like rexical for generating a lexer from the logic specified, but rather defines a bunch of rake tasks for doing this job. So now create a `Rakefile` in the same folder and put this code inside it:

```ruby
require 'oedipus_lex'

Rake.application.rake_require "oedipus_lex"

desc "Generate Lexer"
task :lexer => "lexer.rex.rb"

desc "Generate Parser"
task :parser => :lexer do
  `racc parser.racc -o parser.racc.rb`
end
```

Running `rake parser` will generate two new files, `lexer.rex.rb` and `parser.racc.rb`, which house the classes and logic for the lexer and parser, respectively. You can use your newly written lexer + parser with the `ruby adder.rb text.txt` command. It should output `4` as the answer.

You can find all the code in this blogpost here.

Was checking out this video (Contortionist - Language 1) and learned about standard C# tuning on a 6 string bass guitar today. He’s used tuning G# C# F# B E A. Killer bass tone. This wiki says something different about C# standard, though.

Trying out some interval training with this video today. Supposed to be really good.

So there are two types of intervals: harmonic and melodic. Harmonic is when two or more notes are played at the same time, and melodic is when the notes are played one after another.

Intervals are described by some properties:

- Quality: Whether it is perfect, major, minor, augmented or diminished. Perfect intervals, if raised by a half step, become augmented; if lowered by a half step, they become diminished. If perfect intervals are inverted, they remain perfect intervals. So a perfect fifth inverted becomes a perfect fourth, and vice versa: a perfect fourth inverted becomes a perfect fifth. Minor or major intervals can become augmented or diminished but never perfect.
- Number: Unison, 2nd, 3rd, 4th, 5th, 6th, 7th, 8th, etc. The number of the interval is the number of letter names that the interval spans. For example, C to G is a fifth because it spans 5 letter names: C-D-E-F-G.

A dyad is a two note chord.

Aural characteristics of intervals: Consonant category: Perfect fifths and octaves are open consonances. Major and minor thirds and sixths are called soft consonances.

Dissonant category: Minor sevenths (C-Bb) and major seconds (C-D) are called mild dissonances. Minor seconds (like C-Db) and major sevenths (C-B) are called sharp dissonances.

The perfect fourth is characterized as a consonant or dissonant interval depending on how it is used in context. If a perfect 4th is part of a second inversion major triad…

The major 6th interval can be remembered with ‘My Bonnie Lies…’.

To identify a minor 6th interval, play the first inversion of the triad and then play the 1st and 3rd of the inversion.

To identify a major 6th, play the second inversion of the triad so you get the 1st and 3rd notes at a major 6th interval.

Songs for remembering ascending intervals:

- Major 2nd - Happy Birthday to You.
- Major 3rd - Oh when the saints go marching.
- Perfect 4th - Star Trek Theme (TNG).
- Perfect 5th - Scarborough Fair. (are we ^going….)
- Major 6th - My Bonnie Lies Over…
- Major 7th - Superman theme
- Octave - The Christmas Song

Searching for options in Japan, starting with the University of Tokyo. Most of their courses seem to be in Japanese, but there are a few in English as well. This page has some starting info about the English courses. Also found a collection of colleges here.

So apparently the process for getting into a Japanese college for a Master’s can take two paths. The first is like so:

- Talk to a professor and gain a research assistantship with him/her.
- Take an exam and enroll in a 2 year master’s course if you pass.

The second is to directly take the exam, but I’m not sure how that can be done since they all appear to be written examinations that are conducted in Japan.

Having a look at the graduate schools of the University of Tokyo, Tokyo Institute of Technology and Kyoto University today.

**University of Tokyo**

UoT seems to have some special selection process for international applicants (link), though it’s not useful for me. There’s a decent contact page here. They’ve also put up a check list for applications here.

**Tokyo Inst. of Technology**

This one also has a good graduate program. Tokyo Inst. of Technology has an international graduate program for overseas applicants, and the courses seem to be mostly in English. The school of computer science has also participated in the IGP and accepts the IGP(A), IGP(B)3 and IGP(C) types of applicants. I seem to be most qualified for the IGP(A) and IGP(C) applications.

The ‘Education Program of Advanced Information Technology Leaders’ seems most relevant to my case. This PDF looks like a good brief about the program.

All the courses require students to arrange for a Tokyo Tech faculty member to serve as their academic supervisor. This handy web application allows you to do that. They also have the MEXT scholarship for outstanding students.

**University of Kyoto**

Page of dept. of information science.

Continuing my research on Tokyo Inst. of Technology. The PDF I pointed to yesterday brought out an interesting observation - IGP(A) students and IGP(C) students seem to have different course work.

It seems the IGP(C) program at Tokyo Tech is best for me. I will research that further today. Most probably I'll need to do a 6-month research assistantship first. Here's a list of the research groups of the Computer Science department at Tokyo Tech.

**Tokyo Inst. of Technology**

Found a list of faculties under the IGP(C) program here.

Had a look at Kyushu Inst. of Technology today. The program for international students looks good.

Also checked out scholarship opportunities at Tokyo Inst. of Technology. Links - 1, 2, 3. There are a bunch of scholarships that can be applied for before you enrol in university. Have a look here.

There’s also the MEXT scholarship from the Japanese government.

Found an interesting FAQ on the UoT website.

Also having a look at JASSO scholarships. Found some great scholarships here.

Found some scholarships. I can also enrol as a privately funded research student at Tokyo Tech.

This is a PDF that talks about privately funded research students.

Also checking out Keio University today. They have a program for international graduate students. Have a look here.

I also had a look at the Kyoto University IGP. Here’s a listing of Japanese universities.

Found a Computer Engineering IGP at Kyoto University, though I still can't find anything related to HPC. This is a link that has some details on admissions.

More details on Tokyo Tech.’s IGP(A) can be found here. This looks like a good resource for curriculum. This has resources for scholarships without recommendation.

Found a good resource on IGP programs at Tokyo Tech here. Here’s a PPT about IGP(A) in particular. IGP(A) coursework can be found here.

Posting after quite a while!

I’m currently having a look at Linz University, Austria. I came to know that one of the research groups there is really good and making solid progress on high-performance software.

Here’s the admissions page of the dept. of computer science. Here’s more info on admissions. This is a PDF on the Computer Science degree.

The System Software group looks nice.

Checking out the Computer Science program at UIC and that at University of Houston.

This is UIC's website. This is a detailed PDF of the MS in CS requirements.

- Laney RB2 amplifier
- Tech 21 Sansamp Bass Driver Programmable DI
- Fender Mexican Standard Jazz Bass (4 string)

I will be updating this post as and when I learn something new that I'd like to document or share. Suggestions are welcome. You can email me (see the ‘about’ section) or post a comment below.

As of now I'm tweaking the sansamp and trying to achieve a good tone that will complement the post/prog rock sound of my band Cat Kamikazee. I'm also reading up on different terminologies and use cases on the internet. For instance, I found this explanation of DI boxes quite useful, and I learned that the ‘XLR Out Pad’ button on the sansamp actually provides a 20 dB cut to the soundboard if your signal is too hot.

I am trying to couple the sansamp with a basic overdrive pedal I picked up from a friend. This thread on talkbass is pretty useful for that. The guy who answered the question states that it’s better to place the sansamp last in the chain so that the DI can deliver the output of the sound chain.

So the BLEND knob on the sansamp modulates how much of the dry signal is mixed with the sansamp's tube amplifier emulation circuitry. This can be useful when chaining effects pedals with the sansamp: reduce the blend to let more of the dry signal pass through. Btw, the *bass*, *treble* and *level* controls remain active irrespective of the position of BLEND.

One thing that was a little confusing was the whole business of ‘harmonic partials’. I found a pretty informative discussion about it in this TalkBass thread.

Here’s an interesting piece on compressors.

Some more useful links I came across over the course of the past few days:

- https://theproaudiofiles.com/amp-overdrive-vs-pedal-overdrive/
- http://www.offbeatband.com/2009/08/the-difference-between-gain-volume-level-and-loudness/

Found an interesting and informative piece on bass pedals here. It’s a good walkthrough of different pedal types and their functionality and purpose.

I wanted to check out some overdrive pedals today but was soon sinking in a sea of terminologies. One thing that intrigued me is the difference between an overdrive, distortion and fuzz. I found a pretty informative article on this topic. The author has the following to say about these 3 different but seemingly similar things.

I had a look at the Darkglass b3k and b7k pedals too. They look like promising overdrive pedals. I’ll explore the b3k more since the only difference between the 3 and the 7 is that the 7 also functions as a DI box and has an EQ, while the 3 doesn’t. I already have a DI with a 2 band EQ in the sansamp.

One thing I noticed when tweaking my sansamp is that the level of ‘distortion’ in my tone varies a LOT when you change the bass or treble while keeping the drive at the same level. Why does this happen?

Trying to dive further into distortion today. Found this article kind of useful. It relates mostly to lead guitar tones, but I think it applies in a general case too. I learned about symmetric and asymmetric clipping in that article.

According to the article, symmetric clipping is more focused and clear, because it generates only one set of harmonic overtones. Since asymmetric clipping can be hard-clipped on one side and soft-clipped on the other, it has the potential to create very thick, complex sounds. This means that if you want plenty of overtones but do not want a lot of gain, asymmetric clipping can be useful. For full-blown distortion, symmetric clipping is usually more suitable, since high-gain tones are already very harmonically complex. *Typically asymmetric clipping will have a predominant first harmonic, which symmetric clipping will not* (that’s probably why in this video the SD1 sounds brighter than the TS-9). High-gain distortion tones sound best with most of the distortion coming from the pre-amp, so try to use a fairly neutral pickup or even a slightly ‘bright’ pickup.
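The symmetric/asymmetric distinction can be illustrated with a tiny hard-clipping sketch in Ruby. The limit values here are made up purely for illustration and have nothing to do with any particular pedal:

```ruby
# Hard-clip a waveform to the given limits (illustrative values only).
def clip(wave, lo, hi)
  wave.map { |s| s.clamp(lo, hi) }
end

wave = [-0.9, -0.4, 0.0, 0.4, 0.9]

symmetric  = clip(wave, -0.5, 0.5)  # both halves clipped equally
asymmetric = clip(wave, -0.8, 0.3)  # one half clipped much harder

symmetric  # => [-0.5, -0.4, 0.0, 0.4, 0.5]
asymmetric # => [-0.8, -0.4, 0.0, 0.3, 0.3]
```

The asymmetric version flattens the positive half of the wave much earlier than the negative half, which is what produces the extra (even) harmonics the article talks about.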

The follow-up to the above post talks about EQ in relation to distortion. It has stuff on pre- and post-distortion EQ and how it can affect the overall tone. If you place the EQ before the distortion, you can actually shape which frequencies will be clipped. However, if you place it after the distortion, the EQ will only shape the already distorted tone. Pre-distortion EQ is more useful in most cases since it lets you control the frequencies for clipping.

It also says that humbucking pickups have a mid-boost focused on the lower part of the frequency range. Single-coil pickups, on the other hand, have a mid-boost focused on the upper part of the frequency range. Single coils generally have a clearer, more articulate bass end.

Read something about bass DI in this article today.

Posting after quite a while!

Reading about the use of compression for bass guitars. Found this article which explains why we need compression in the first place.

Also, my band’s installation of Main Stage 3 has started giving some really weird problems. More about that soon.

Coming back to Main Stage. For some reason, pressing Space Bar for play/pause reduces the default sampling rate and makes the tracks sound weird. We need to go to Preferences and increase the sampling rate to 48 kHz again (that's what our backing tracks are recorded at). I think it's something to do with the key mappings, but I'm not sure. Will need to check it out.

It also so happens that after the space bar has been pressed and the issue with the sampling rate is resolved, the samples (which come from an M-Audio M-Track) start emitting a strange crackling sound. This sound persists only if the headphones are connected to the audio jack (we use the onboard Mac sound card too). The sound goes away if the headphones are unplugged. Restarting the Mac resolves the issue. I suspect there might be a way without having to restart. Will investigate.

Turns out you just restart and it solves the problem (and be careful about what keys you press when on stage!). Not worth scratching your head too much.

I just got a new EHX Micro POG octaver pedal and a TC Electronic booster pedal. Also got a TC Electronic PolyTune. Finally on my way to creating a pedal chain :)

So for now I’m using the pedals in this order:

Tuner -> Octaver -> Booster -> Sansamp

I think this works fine for me for now, though I might change something later on.

I read in this thread that using one octave down with an overdrive (on the sansamp) works wonders. Gonna try that now!

I am also having a look at this guide on setting up a pedal board.

Also found an interesting rig rundown by Tim Commerford (RATM).

I thought I'd try something new by recording screencasts for some of my work on Ruby open source libraries.

This is quite a change for me since I’m primarily focused on the programming and designing side of things. Creating documentation is something I’ve not ventured into a lot except the usual YARD markup for Ruby methods and classes.

In this blog post (which I will keep updating as time progresses) I hope to document my efforts in creating screencasts. Mind you this is the first time I’m creating a screencast so if you find any potential improvements in my methods please point them out in the comments.

My first ever screencast will be for my benchmark-plot gem. For creating the video I'm mainly using two tools - Kdenlive for video editing and Kazam for recording screen activity. I initially tried using Pitivi and OpenShot for video editing, but the former did not seem user friendly and the latter kept crashing on my system. For the desktop recording I first tried RecordMyDesktop, but gave up on it since it's too heavy on resources and recorded poor-quality screencasts with not too many customization options.

For creating informative visuals, I'm using LibreOffice Impress so that I can create a slide, take its screenshot in slideshow mode and put it in the screencast. However, I've generally found that slides do not serve content delivery in a screencast well, so I will probably not feature too many slides in future screencasts.

Sublime Text 3 is my primary text editor. I use its in-built code execution functionality (by pressing `Ctrl + Shift + B`) to execute a code snippet and display the results immediately.

I am using Audacity for recording sound. Sadly my mic produces a lot of noise, so for removing that noise in Audacity, I use the inbuilt noise reduction tools.

Noise reduction in Audacity can be achieved by first selecting a small part of the sound that does not contain speech, then going to Effects -> Noise Reduction and clicking ‘Get Noise Profile’. Then select the whole sound wave with `Ctrl + A`, go to Effects -> Noise Reduction again and click ‘OK’. It should considerably reduce static noise from your sound file.

All files are exported to Ogg Vorbis.

I did some research on the screencasting process and found this article by Avdi Grimm and this one by Sayanee Basu extremely helpful.

I first started by writing the transcript along with any code samples that I had to show. I made it a point to describe the code being typed/displayed on the screen, since it's generally more useful to have a voice-over explaining the code than having to pause the video and go over it yourself.

Then I recorded the voice over just for the part that featured slides. I imported the screenshots of the slides in kdenlive and adjusted them such that they fit the voice over. Recording the code samples was a bit of a challenge. I started typing out the code and talking about it into the mic. This was more difficult than I thought, almost like playing a Guitar and singing at the same time. I ended up recording the screencast in 4 separate takes, with several retakes for each take.

After importing the screencast with voice over into kdenlive and separating the audio and video components, I did some cuts to reduce redundancy or imperfections in my VO. Some of the parts of the video where there was a lot of typing had to be sped up by using kdenlive’s Speed tool.

Once this was up to my satisfaction, I exported it to mp4.

The video of my first screencast is now up on YouTube in the video below. Have a look and leave your feedback in the comments!

The new features led to the inclusion of daru in many of SciRuby's gems, which use daru's data storage, access and indexing features for storing and carrying around data. Statsample, statsample-glm, statsample-timeseries and statsample-bivariate-extensions are all now compatible with daru and use Vector and DataFrame as their primary data structures. Daru's plotting functionality, which interfaced with nyaplot for creating interactive plots directly from the data, was also significantly overhauled.

Also, new gems developed by other GSOC students, notably Ivan’s GnuplotRB gem and Alexej’s mixed_models gem both accept data from daru data structures. Do see their repo pages for seeing interesting ways of using daru.

The work on daru is also proving to be quite useful for other people, which led to a talk/presentation at DeccanRubyConf 2015, one of the three major ruby conferences in India. You can see the slides and notebooks presented at the talk here. Given the current interest in data analysis and the need for a viable solution in ruby, I plan to take daru much further. Keep watching the repo for interesting updates :)

In the rest of this post I’ll elaborate on all the work done this summer.

Daru as a gem before GSOC was not exactly user friendly. There were many cases, particularly the iterators, that required some thinking before anybody used them. This is against the design philosophy of daru, or even ruby in general, where surprising programmers with unintuitive constructs is usually frowned upon by the community. So the first thing that I did mainly concerned overhauling daru's many iterators for both `Vector` and `DataFrame`.

For example, the `#map` iterator from `Enumerable` returns an `Array` no matter what object you call it on. This was not the case before, where `#map` would return a `Daru::Vector` or `Daru::DataFrame`. This behaviour was changed, and now `#map` returns an `Array`. If you want a `Vector` or a `DataFrame` of the modified values, you should call `#recode` on `Vector` or `DataFrame`.

Each of these iterators also accepts an optional argument, `:row` or `:vector`, which defines the axis over which iteration is to be carried out. So now there are `#each`, `#map`, `#map!`, `#recode`, `#recode!`, `#collect`, `#collect_matrix`, `#all?`, `#any?`, `#keep_vector_if` and `#keep_row_if`. To iterate over elements along with their respective indexes (or labels), you can likewise use `#each_row_with_index`, `#each_vector_with_index`, `#map_rows_with_index`, `#map_vector_with_index`, `#collect_rows_with_index`, `#collect_vector_with_index` or `#each_index`. I urge you to go over the docs of each of these methods to utilize the full power of daru.

Apart from this there was also quite a bit of refactoring involved for many methods (courtesy Alexej). This has made daru much faster than previous versions.

The next (major) thing to do was making daru compatible with statsample. This was essential since statsample is a very important tool for statistics in ruby, and it was using its own `Vector` and `Dataset` classes, which weren't very robust as computation tools and were very difficult to use when it came to cleaning or munging data. So I replaced statsample's Vector and Dataset classes with Daru::Vector and Daru::DataFrame. It involved a significant amount of work on both statsample and daru: statsample because many constructs had to be changed to make them compatible with daru, and daru because there was a lot of essential functionality in these classes that had to be ported to daru.

Porting code from statsample to daru improved daru significantly. There were a whole host of statistics methods in statsample that were imported into daru, and you can now use all of them from daru. Statsample also works well with rubyvis, a great tool for visualization. You can now do that with daru as well.

Many new methods for reading and writing data to and from files were also added to daru. You can now read and write data to and from CSV, Excel, plain text files or even SQL databases.

In effect, daru is now completely compatible with statsample (and all the other statsample extensions). You can use daru data structures for storing data and pass them to statsample for performing computations. The biggest advantage of this approach is that the analysed data can be passed around to other scientific ruby libraries (some of which listed above) that use daru as well. Since daru offers in-built functions to better ‘see’ your data, better visualization is possible.

See these blogs and notebooks for a complete overview of daru’s new features.

Also see the notebooks in the statsample README for using daru with statsample.

Most of the time after the mid-term submissions was spent implementing the time series functions for daru.

I implemented a new index, the DateTimeIndex, which can be used for indexing data on time stamps. It enables users to query data based on time stamps. Time stamps can either be specified with precise ruby DateTime objects or as strings, which will lead to retrieval of all the data falling under that time. For example, specifying ‘2012’ returns all data that falls in the year 2012. See detailed usage of `DateTimeIndex` in conjunction with other daru constructs in the daru README.

An essential utility in implementing `DateTimeIndex` was `DateOffset`, a new set of classes that offset dates based on certain rules or business logic. It can advance or lag a ruby `DateTime` to the nearest day, any day of the week, the end or beginning of the month, etc. `DateOffset` is an essential part of `DateTimeIndex` and can also be used as a standalone utility for advancing/lagging `DateTime` objects. This blog post elaborates more on the nuances of `DateOffset` and its usage.

The last thing done during the post-mid-term period was complete compatibility with statsample-timeseries, which was created by Ankur Goel during GSOC 2013. It offers many useful functions for analysis of time series data. It now works with daru containers. See some use cases here.

That's all, as far as I can remember.

This post is primarily intended to serve as documentation for me and future contributors. If readers have any inputs on improving this post, I'd be happy to accept new contributions :)

Daru currently supports three types of indexes, Index, MultiIndex and DateTimeIndex.

It became very tedious to write if statements in the Vector or DataFrame codebase whenever a new data structure was to be created, since there were 3 possible indexes that could be attached with every data set. This mainly depended on what kind of data was present in the index, i.e. tuples would create a MultiIndex, DateTime objects or date-like strings would create a DateTimeIndex, and everything else would create a Daru::Index.

This looked like the perfect use case for the factory pattern, the only hurdle being that the factory pattern in the pure sense of the term would be a superclass, something called `Daru::IndexFactory`, that created an Index, DateTimeIndex or MultiIndex using some methods and logic. The problem is that I did not want to call a separate class for creating indexes. This would break existing code and possibly cause problems in libraries that were already using daru (viz. statsample), not to mention confusing users about which class they're actually supposed to be using.

The solution came after I read this blog post, which demonstrates that the `.new` method for any class can be overridden. Thus, instead of calling `initialize` for creating the instance of a class, Ruby calls the overridden `new`, which can then call `initialize` for instantiating an instance of that class. It so happens that you can make `new` return any object you want, unlike `initialize`, which must return an instance of the class it is declared in. Thus, for the factory pattern implementation of Daru::Index, we override the `.new` method of Daru::Index and write logic such that it manufactures the appropriate kind of index based on the data that is passed to `Daru::Index.new(data)`. The pseudo code for doing this looks something like this:

```ruby
class Daru::Index
  # some stuff...

  def self.new *args, &block
    source = args[0]

    if source_looks_like_a_multi_index
      create_multi_index_and_return
    elsif source_looks_like_date_time_index
      create_date_time_index_and_return
    else
      # Create the Daru::Index by calling initialize
      i = self.allocate
      i.send :initialize, *args, &block
      i
    end
  end

  # more stuff...
end
```

Also, since overriding `.new` tampers with the subclasses of the class as well, an `inherited` hook that replaces the overridden `.new` of the inherited class with the original one was added to `Daru::Index`.
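To see the pattern in isolation, here is a self-contained sketch with made-up class names (`Index` and `MultiIndex` here are stand-ins, not the daru classes). A simple `self == Index` guard plays the role daru's `inherited` hook plays, letting subclasses keep the normal `.new` behaviour:

```ruby
# Stand-in classes to demonstrate the overridden-.new factory; not daru code.
class Index
  def self.new(*args, &block)
    # Subclasses fall through to the normal Class#new behaviour.
    return super unless self == Index

    source = args[0]
    if source.is_a?(Array) && source.first.is_a?(Array)
      MultiIndex.new(source)               # manufacture a different class
    else
      i = allocate                         # same path as the pseudo code above
      i.send(:initialize, *args, &block)
      i
    end
  end

  def initialize(source = nil)
    @source = source
  end
end

class MultiIndex < Index; end

Index.new([[:a, 1], [:b, 2]]).class # => MultiIndex
Index.new([:a, :b]).class           # => Index
```

Calling `Index.new` with an array of tuples hands back a `MultiIndex`, while anything else instantiates a plain `Index`, which is exactly the dispatch daru performs on the data passed to `Daru::Index.new`.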

The where clause in daru lets users query data with an Array containing boolean values. So whenever you call `where` on a Daru::Vector or DataFrame and pass in an Array containing true or false values, all the rows corresponding with `true` will be returned as a Vector or DataFrame respectively.

Since the where clause works in conjunction with the comparator methods of Daru::Vector (which return a boolean Array), it was essential for these boolean arrays to be combined together such that piecewise AND and OR operations could be performed between multiple boolean arrays. Hence, the `Daru::Core::Query::BoolArray` class was created, which is specialized for handling boolean arrays and performing piecewise boolean operations.

The BoolArray defines the `#&` method for piecewise AND operations and the `#|` method for piecewise OR operations. They work as follows:

```ruby
require 'daru'

a = Daru::Core::Query::BoolArray.new([true,false,false,true,false,true])
#=> (Daru::Core::Query::BoolArray:84314110 bool_arry=[true, false, false, true, false, true])

b = Daru::Core::Query::BoolArray.new([false,true,false,true,false,true])
#=> (Daru::Core::Query::BoolArray:84143650 bool_arry=[false, true, false, true, false, true])

a & b
#=> (Daru::Core::Query::BoolArray:83917880 bool_arry=[false, false, false, true, false, true])

a | b
#=> (Daru::Core::Query::BoolArray:83871560 bool_arry=[true, true, false, true, false, true])
```
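A minimal plain-Ruby sketch of how such piecewise operations might be implemented (this is an illustration only, not daru's actual `BoolArray` code):

```ruby
# Illustrative mini version of a boolean array with piecewise AND/OR.
class MiniBoolArray
  attr_reader :to_a

  def initialize(arr)
    @to_a = arr
  end

  def &(other)
    MiniBoolArray.new(to_a.zip(other.to_a).map { |x, y| x && y })
  end

  def |(other)
    MiniBoolArray.new(to_a.zip(other.to_a).map { |x, y| x || y })
  end
end

a = MiniBoolArray.new([true, false, false, true, false, true])
b = MiniBoolArray.new([false, true, false, true, false, true])

(a & b).to_a # => [false, false, false, true, false, true]
(a | b).to_a # => [true, true, false, true, false, true]
```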

Arel is a very popular ruby gem that is one of the major components of the most popular ruby framework, Rails. It is an ORM helper of sorts that exposes a beautiful and intuitive syntax for creating SQL strings by chaining Ruby methods.

Daru successfully adopts this syntax and the result is a very intuitive and readable syntax for obtaining any sort of data from a DataFrame or Vector.

As a quick demonstration, let's create a DataFrame which looks like this:

```ruby
require 'daru'

df = Daru::DataFrame.new({
  a: [1,2,3,4,5,6]*100,
  b: ['a','b','c','d','e','f']*100,
  c: [11,22,33,44,55,66]*100
}, index: (1..600).to_a.shuffle)
df.head(5)

#=> #<Daru::DataFrame:80543480 @name = 3fc642f2-bd9a-4f6f-b4a8-0779253720f5 @size = 5>
#         a   b   c
#   109   1   a  11
#   381   2   b  22
#   598   3   c  33
#   390   4   d  44
#   344   5   e  55
```

To select all rows where `df[:a]` equals 2 or `df[:c]` equals 55, just write this:

```ruby
selected = df.where(df[:a].eq(2) | df[:c].eq(55))
selected.head(5)

#=> #<Daru::DataFrame:79941980 @name = 74175f76-9dce-4b5d-b85b-bdfbb650953e @size = 5>
#         a   b   c
#   381   2   b  22
#   344   5   e  55
#   135   2   b  22
#   524   5   e  55
#   266   2   b  22
```

As is easily seen above, the Daru::Vector class has special comparators defined on it, which allow it to check each value of the Vector and return an object that can be evaluated by the `DataFrame#where` method.

**Notice that to club the two comparators above, we have used the union OR (`|`) operator.**

Daru::Vector has a bunch of comparator methods defined on it, which can be used with `#where` for obtaining the desired results. All of these return an object of type `Daru::Core::Query::BoolArray`, which is read by `#where`. `BoolArray` uses the methods `|` (also aliased as `#or`) and `&` (also aliased as `#and`) for piecewise logical operations on other `BoolArray` objects.

BoolArray consists of an internal Array that contains `true` for every entry in the Vector that returns `true` for an operation between the comparable operand and a Vector entry.

For example,

```ruby
require 'daru'

vector = Daru::Vector.new([1,2,3,4,5,6,7,8,2,3])
vector.eq(3)
#=> (Daru::Core::Query::BoolArray:82379030 bool_arry=[false, false, true, false, false, false, false, false, false, true])
```

The `#&` (or `#and`) and `#|` (or `#or`) methods on BoolArray apply a logical `and` and a logical `or` respectively between each element of the BoolArray and return another BoolArray that contains the results. For example:

```ruby
require 'daru'

vector = Daru::Vector.new([1,2,3,4,5,6,7,7,8,9,9,9,7,5,4,3,4])
vector.eq(4).or(vector.mt(8))
#=> (Daru::Core::Query::BoolArray:82294620 bool_arry=[false, false, false, true, false, false, false, false, false, true, true, true, false, false, true, false, true])
```

The following comparators can be used with a `Daru::Vector`:

| Comparator Method | Description |
|---|---|
| `eq` | Uses `==` and returns `true` for each equal entry |
| `not_eq` | Uses `!=` and returns `true` for each unequal entry |
| `lt` | Uses `<` and returns `true` for each entry less than the supplied object |
| `lteq` | Uses `<=` and returns `true` for each entry less than or equal to the supplied object |
| `mt` | Uses `>` and returns `true` for each entry more than the supplied object |
| `mteq` | Uses `>=` and returns `true` for each entry more than or equal to the supplied object |
| `in` | Uses `==` for each element in the collection (Array, Daru::Vector, etc.) passed and returns `true` for a match |
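The semantics of these comparators can be mimicked on a bare Array to see the kind of mask `#where` consumes. These are hypothetical plain-Ruby analogues written for illustration, not daru's implementation:

```ruby
# Plain-Ruby analogue of eq/mt and the where mask (illustration only).
values = [1, 2, 3, 4, 5, 6, 7, 8, 2, 3]

eq3 = values.map { |v| v == 3 }            # like vector.eq(3)
mt5 = values.map { |v| v > 5 }             # like vector.mt(5)
mask = eq3.zip(mt5).map { |a, b| a || b }  # like .or / |

values.zip(mask).select { |_, m| m }.map(&:first)
# => [3, 6, 7, 8, 3]
```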

A major advantage of using the `#where` clause over `DataFrame#filter` or `Vector#keep_if`, apart from better readability and usability, is that it is much faster. These benchmarks prove my point.

I’ll conclude this chapter with a slightly more complex example of using the arel-like query syntax with a `Daru::Vector` object:

```ruby
require 'daru'

vec = Daru::Vector.new([1,2,3,4,5,6,3,336,3,6,2,6,2,35,346,7,3,45,23,26,7,345,2525,22,66,2])
vec.where((vec.eq(4) | vec.eq(1) | vec.mt(300)) & vec.lt(2000))

#=> #<Daru::Vector:70585830 @name = nil @size = 5 >
#        nil
#    0     1
#    3     4
#    7   336
#   14   346
#   21   345
```

For more examples on using the arel-like query syntax, see this notebook.

Daru::DataFrame offers the `#join` method for performing SQL-style joins between two DataFrames. Currently `#join` supports inner, left outer, right outer and full outer joins between DataFrames.

In order to demonstrate joins, let's consider a single example of an inner join on two DataFrames:

```ruby
require 'daru'

left = Daru::DataFrame.new({
  :id   => [1,2,3,4],
  :name => ['Pirate', 'Monkey', 'Ninja', 'Spaghetti']
})
right = Daru::DataFrame.new({
  :id   => [1,2,3,4],
  :name => ['Rutabaga', 'Pirate', 'Darth Vader', 'Ninja']
})
left.join(right, on: [:name], how: :inner)

#=> #<Daru::DataFrame:73134350 @name = 7cc250a9-108c-4ea3-99ab-dcb828ff2b88 @size = 2>
#       id_1   name  id_2
#   0      1 Pirate     2
#   1      3  Ninja     4
```

For more examples please refer to this notebook.
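For intuition, the inner join shown above can be sketched in plain Ruby over arrays of hashes. This is an illustration of what an inner join computes, not daru's implementation, and the sample data is made up:

```ruby
# Plain-Ruby sketch of an inner join on :name (illustration only).
left  = [{ id: 1, name: 'Pirate' }, { id: 2, name: 'Monkey' }, { id: 3, name: 'Ninja' }]
right = [{ id: 2, name: 'Pirate' }, { id: 4, name: 'Ninja' }, { id: 5, name: 'Darth Vader' }]

# Keep only rows whose :name appears on both sides, pairing up the ids.
inner = left.flat_map do |l|
  right.select { |r| r[:name] == l[:name] }
       .map    { |r| { id_1: l[:id], name: l[:name], id_2: r[:id] } }
end

inner
# => [{:id_1=>1, :name=>"Pirate", :id_2=>2}, {:id_1=>3, :name=>"Ninja", :id_2=>4}]
```

Rows without a match on the other side (Monkey, Darth Vader) are dropped, which is exactly what distinguishes an inner join from the outer variants.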

A time series is any data that is indexed (or labelled) by time. This includes the stock market index, prices of crude oil or precious metals, or even geo-locations over a period of time.

The primary manner in which daru implements a time series is by indexing data objects (i.e. Daru::Vector or Daru::DataFrame) on a new index called the DateTimeIndex. A DateTimeIndex consists of dates, which can be queried individually or sliced.

A very basic time series can be created with something like this:

```ruby
require 'distribution'
require 'daru'

rng = Distribution::Normal.rng
index = Daru::DateTimeIndex.date_range(
  :start => '2012-4-2', :periods => 1000, :freq => 'D')
vector = Daru::Vector.new(1000.times.map { rng.call }, index: index)
```

In the above code, the `DateTimeIndex.date_range` function creates a `DateTimeIndex` starting from a particular date and spanning 1000 periods, with a frequency of 1 day between periods. For a complete coverage of DateTimeIndex see this notebook. For an introduction to the date offsets used by daru see this blog post.

The index is passed into the Vector like a normal `Daru::Index` object.
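What a daily (`'D'` frequency) range covers can be sketched with the Ruby stdlib alone. This is not daru's `date_range`, just an illustration of the dates such an index would contain:

```ruby
require 'date'

# Stdlib-only sketch of a daily date range (not Daru::DateTimeIndex).
start = DateTime.new(2012, 4, 2)
dates = 5.times.map { |i| start + i }  # adding an Integer advances whole days

dates.map { |d| d.strftime('%Y-%m-%d') }
# => ["2012-04-02", "2012-04-03", "2012-04-04", "2012-04-05", "2012-04-06"]
```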

Many functions are available in daru for computing useful statistics and analysis. A brief summary of the statistics methods available on time series is as follows:

| Method Name | Description |
|---|---|
| `rolling_mean` | Calculate moving average |
| `rolling_median` | Calculate moving median |
| `rolling_std` | Calculate moving standard deviation |
| `rolling_variance` | Calculate moving variance |
| `rolling_max` | Calculate moving maximum value |
| `rolling_min` | Calculate moving minimum value |
| `rolling_count` | Count moving non-missing values |
| `rolling_sum` | Calculate moving sum |
| `ema` | Calculate exponential moving average |
| `macd` | Moving average convergence-divergence |
| `acf` | Calculate autocorrelation coefficients of the series |
| `acvf` | Provide the auto-covariance value |

To demonstrate, the rolling mean of a Daru::Vector can be computed as follows:

```ruby
require 'daru'
require 'distribution'

rng = Distribution::Normal.rng
vector = Daru::Vector.new(
  1000.times.map { rng.call },
  index: Daru::DateTimeIndex.date_range(
    :start => '2012-4-2', :periods => 1000, :freq => 'D')
)

# Compute the cumulative sum
vector = vector.cumsum
rolling = vector.rolling_mean 60
rolling.tail
```
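What `rolling_mean` computes can be shown with a tiny stdlib-only sketch (a window size of 2 here; this is an illustration, not daru's implementation):

```ruby
# Stdlib-only moving average over a sliding window of size n.
def rolling_mean(values, n)
  values.each_cons(n).map { |window| window.sum / n.to_f }
end

rolling_mean([1, 2, 3, 4, 5], 2) # => [1.5, 2.5, 3.5, 4.5]
```

Each output value is the mean of the last `n` observations, which is why the result is shorter than the input by `n - 1` entries (daru instead pads the head of the series so the index lines up).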

This time series can be very easily plotted with its rolling mean by using the GnuplotRB gem:

```ruby
require 'gnuplotrb'

GnuplotRB::Plot.new(
  [vector , with: 'lines', title: 'Vector'],
  [rolling, with: 'lines', title: 'Rolling Mean'])
```

These methods are also available on DataFrame, which results in calling them on each of the numeric vectors:

```ruby
require 'daru'
require 'distribution'

rng = Distribution::Normal.rng
index = Daru::DateTimeIndex.date_range(
  :start => '2012-4-2', :periods => 1000, :freq => 'D')
df = Daru::DataFrame.new({
  a: 1000.times.map { rng.call },
  b: 1000.times.map { rng.call },
  c: 1000.times.map { rng.call }
}, index: index)
```

In a manner similar to that done with Vectors above, we can easily plot each Vector of the DataFrame with gnuplot:

```
require 'gnuplotrb'

# Calculate cumulative sum of each Vector
df = df.cumsum

# Compute rolling sum of each Vector with a loopback length of 60.
r_sum = df.rolling_sum(60)

plots = []
r_sum.each_vector_with_index do |vec,n|
  plots << GnuplotRB::Plot.new([vec, with: 'lines', title: n])
end
GnuplotRB::Multiplot.new(*plots, layout: [3,1], title: 'Rolling sums')
```

Daru now integrates with statsample-timeseries, a statsample extension that provides many useful statistical analysis tools commonly applied to time series.

Some posts with working examples of daru and statsample-timeseries are coming soon. Stay tuned!

Daru’s (Data Analysis in RUby) latest release (0.2.0) brings in a host of new features, the most important among them being time series manipulation functionality. In this post, we will go over the date offsets that daru offers, which can be used for creating date indexes with specific intervals. The offsets offer a host of options for easy creation of different intervals, and even work with standalone DateTime objects to increase or decrease time.

The date offsets are contained in the `Daru::Offsets` sub-module. A number of classes are offered, each of which implements business logic for advancing or retracting date times by a specific interval.

To demonstrate with a quick example:

```
require 'daru'

offset = Daru::Offsets::Hour.new
offset + DateTime.new(2012,4,5,4)
#=> #<DateTime: 2012-04-05T05:00:00+00:00 ((2456023j,18000s,0n),+0s,2299161j)>
```

As you can see in the above example, an hour was added to the specified DateTime and the result returned. All the offset classes work in a similar manner. The following offset classes are available to users:

Offset Class | Description
---|---
`Daru::DateOffset` | Generic offset class
`Second` | One second
`Minute` | One minute
`Hour` | One hour
`Day` | One day
`Week` | One week. Can be anchored on any day of the week.
`Month` | One month
`MonthBegin` | Calendar month begin
`MonthEnd` | Calendar month end
`Year` | One year
`YearBegin` | Calendar year begin
`YearEnd` | Calendar year end

The generic Daru::DateOffset class creates an offset from the number of intervals you want, passed as the value for a key that describes the type of interval. For example, to create an offset of 3 days, you pass the option `days: 3` into the Daru::DateOffset constructor.

```
require 'daru'

offset = Daru::DateOffset.new(days: 3)
offset + DateTime.new(2012,4,5,2)
#=> #<DateTime: 2012-04-08T02:00:00+00:00 ((2456026j,7200s,0n),+0s,2299161j)>
```

On a similar note, the DateOffset class constructor can accept the options `:secs`, `:mins`, `:hours`, `:days`, `:weeks`, `:months` or `:years`. Optionally, specifying the `:n` option will tell DateOffset to apply a particular offset more than once. To elaborate:

```
require 'daru'

offset = Daru::DateOffset.new(months: 2, n: 4)
offset + DateTime.new(2011,5,2)
#=> #<DateTime: 2012-01-02T00:00:00+00:00 ((2455929j,0s,0n),+0s,2299161j)>
```

The specialized offset classes like MonthBegin, YearEnd, etc. all reside inside the `Daru::Offsets` namespace and can be used by simply calling `.new` on them. All accept an optional Integer argument that works like the `:n` option for Daru::DateOffset, i.e. it applies the offset multiple times.

To elaborate, consider the YearEnd offset. This offsets the date to the nearest year end after itself:

```
require 'daru'

offset = Daru::Offsets::YearEnd.new
offset + DateTime.new(2012,5,1,5,2,1)
#=> #<DateTime: 2012-12-31T05:02:01+00:00 ((2456293j,18121s,0n),+0s,2299161j)>

# Passing an Integer into an Offsets object will apply the offset that many times:
offset = Daru::Offsets::MonthBegin.new(3)
offset + DateTime.new(2015,3,5)
#=> #<DateTime: 2015-06-01T00:00:00+00:00 ((2457175j,0s,0n),+0s,2299161j)>
```

Of special note is the `Week` offset. This offset can be ‘anchored’ to any day of the week that you specify. When this is done, the DateTime being offset will be moved to that day of the week. For example, to anchor the Week offset to a Wednesday, pass ‘3’ as the value of the `:weekday` option:

```
require 'daru'

offset = Daru::Offsets::Week.new(weekday: 3)
date = DateTime.new(2012,1,6)
date.wday #=> 5

o = offset + date
#=> #<DateTime: 2012-01-11T00:00:00+00:00 ((2455938j,0s,0n),+0s,2299161j)>
o.wday #=> 3
```

Likewise, the Week offset can be anchored on any day of the week by simply specifying the `:weekday` option. Indexing for days of the week starts at 0 for Sunday and goes up to 6 for Saturday.

The most obvious use of date offsets is for creating `DateTimeIndex` objects with a fixed time interval between each date index. To make creation of indexes easy, each of the offset classes has been linked to a *string alias*, which can be passed directly to the DateTimeIndex class.

For example, to create a DateTimeIndex of 100 periods with a frequency of 1 hour between each period:

```
require 'daru'

index = Daru::DateTimeIndex.date_range(
  :start => '2015-4-4', :periods => 100, :freq => 'H')
#=> #<DateTimeIndex:86417320 offset=H periods=100 data=[2015-04-04T00:00:00+00:00...2015-04-08T03:00:00+00:00]>
```

Likewise, all of the offsets listed above can be aliased with strings, which can be used for specifying the offset of a DateTimeIndex. The string aliases of each offset class are as follows:

Alias String | Offset Class / Description
---|---
‘S’ | Second
‘M’ | Minute
‘H’ | Hour
‘D’ | Day
‘W’ | Default week. Anchored on SUN.
‘W-SUN’ | Week anchored on Sunday
‘W-MON’ | Week anchored on Monday
‘W-TUE’ | Week anchored on Tuesday
‘W-WED’ | Week anchored on Wednesday
‘W-THU’ | Week anchored on Thursday
‘W-FRI’ | Week anchored on Friday
‘W-SAT’ | Week anchored on Saturday
‘MONTH’ | Month
‘MB’ | MonthBegin
‘ME’ | MonthEnd
‘YEAR’ | Year
‘YB’ | YearBegin
‘YE’ | YearEnd

See this notebook on daru’s time series functions in order to get a good overview of daru’s time series manipulation functionality.

Statsample is the most comprehensive statistical computation suite in Ruby as of now.

Previously, statsample depended on rb-gsl to speed up a lot of computations. This worked, but the biggest drawback of this approach is that rb-gsl depends on narray, which is incompatible with nmatrix - the numerical storage and linear algebra library from the SciRuby foundation - due to namespace collisions.

NMatrix is used by many current and upcoming Ruby scientific gems, most notably daru, mikon, nmatrix-fftw, etc., and a big hurdle these gems faced was that they could not leverage the advanced functionality of rb-gsl or statsample because nmatrix cannot co-exist with narray. Furthermore, daru’s DataFrame and Vector data structures are to replace statsample’s Dataset and Vector, so that a dedicated library can be used for data storage and munging while statsample focuses on statistical analysis.

The most promising solution to this problem was to make rb-gsl depend on nmatrix instead of narray. This was achieved by the gsl-nmatrix gem, a port of rb-gsl that uses nmatrix instead of narray. Gsl-nmatrix also allows conversion of GSL objects to NMatrix and vice versa. Moreover, the latest changes to statsample make it completely independent of GSL, so all the methods in statsample now work with or without GSL.

To make your installation of statsample work with gsl-nmatrix, follow these instructions:

- Install nmatrix, then clone, build and install the latest gsl-nmatrix from https://github.com/v0dro/gsl-nmatrix
- Clone the latest statsample from https://github.com/SciRuby/statsample
- Open the Gemfile of statsample and add the line `gem 'gsl-nmatrix', '~>1.17'`
- Build statsample using `rake gem` and install the resulting `.gem` file with `gem install`.

You should now be able to use statsample with gsl-nmatrix on your system. To use rb-gsl instead, install it from rubygems (`gem install rb-gsl`) and put `gem 'rb-gsl', '~>1.16.0.4'` in the Gemfile in place of gsl-nmatrix. This will activate the rb-gsl gem and you can use rb-gsl with statsample.

However, please note that narray and nmatrix cannot co-exist in the same gem list. Therefore, you should have either rb-gsl or gsl-nmatrix installed at any given time, otherwise things will malfunction.

This leaves us with quite a few choices about the library that can be used. The most common and obvious interfaces for performing fast linear algebra calculations are LAPACK and BLAS. Thus the library bundled with the nmatrix extension must expose an interface similar to LAPACK and BLAS. Since Ruby running on MRI can only interface with libraries having a C interface, the contenders in this regard are CLAPACK or LAPACKE for a LAPACK in C, and openBLAS or ATLAS for a BLAS interface.

I need to choose an appropriate BLAS and LAPACK interface based on speed and usability. To do so, I decided to build some quick Ruby interfaces to these libraries and benchmark the `?gesv` function (used for solving *n* linear equations in *n* unknowns) present in all LAPACK interfaces, so as to get an idea of which would be the fastest. This also tests the speed of the BLAS implementation, since LAPACK primarily depends on BLAS for the actual computations.

To create these benchmarks, I made a couple of simple Ruby gems which link against the binaries of these libraries. All these gems define a module containing a method `solve_gesv`, which calls the C extension that interfaces with the C library. Each library got its own little Ruby gem so as to nullify any unknown side effects and also to provide more clarity.

To test these libraries against each other, I used the following test code:

```
require 'benchmark'

Benchmark.bm do |x|
  x.report do
    10000.times do
      a = NMatrix.new([3,3], [76, 25, 11, 27, 89, 51, 18, 60, 32], dtype: :float64)
      b = NMatrix.new([3,1], [10, 7, 43], dtype: :float64)
      NMatrix::CLAPACK.solve_gesv(a,b)
      # `NMatrix::CLAPACK` is replaced with NMatrix::LAPACKE
      # or NMatrix::LAPACKE_ATLAS as per the underlying binding.
      # Read the source code for more details.
    end
  end
end
```

Here I will list the libraries that I used, the functions I interfaced with, the pros and cons of using each of these libraries, and of course the reported benchmarks:

CLAPACK is an f2c’d version of the original LAPACK written in FORTRAN. The creators have made some changes by hand because f2c spews out unnecessary code at times, but otherwise it is pretty much as fast as the original LAPACK.

To interface with a BLAS implementation, CLAPACK uses a blas wrapper (blaswrap) to generate wrappers to the relevant CBLAS functions exposed by any BLAS implementation. The blaswrap source files and F2C source files are provided with the CLAPACK library.

The BLAS implementation that we’ll be using is openBLAS, which is a very stable and tested BLAS exposing a C interface. It is extremely simple to use and install, and configures itself automatically according to the computer it is being installed upon. It claims to achieve performance comparable to intel MKL, which is phenomenal.

To compile CLAPACK with openBLAS, do the following:

- `cd` to your openBLAS directory and run `make NO_LAPACK=1`. This will create an openBLAS binary with the object files for only BLAS and CBLAS; LAPACK will not be compiled even though the source is present. This generates a `.a` file named after your processor. Mine was `libopenblas_sandybridgep-r0.2.13.a`.
- Rename the openBLAS binary to `libopenblas.a` so it is easier to type and you lessen your chances of mistakes, then copy it to your CLAPACK directory.
- `cd` to your CLAPACK directory and open the `make.inc` file in your editor. In it you should find a `BLASDIR` variable that points to the BLAS files to link against. Change its value to `../../libopenblas.a`.
- Run `make f2clib` to build the F2C library. This is needed for interconversion between C and FORTRAN data types.
- Run `make lapacklib` from the CLAPACK root directory to compile CLAPACK against your specified CBLAS implementation (openBLAS in this case).
- At the end of this process, you should have the CLAPACK, F2C and openBLAS binaries in your directory.
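The steps above can be condensed into a shell sketch; the directory layout and the library file name are illustrative, not prescriptive:

```shell
# Assumed layout: OpenBLAS/ and CLAPACK/ side by side; version in the
# generated file name will differ on your machine.
cd OpenBLAS
make NO_LAPACK=1                        # build BLAS + CBLAS only
cp libopenblas_*.a ../CLAPACK/libopenblas.a

cd ../CLAPACK
# Edit make.inc so BLASDIR points at the copied binary:
#   BLASDIR = ../../libopenblas.a
make f2clib                             # F2C runtime for C <-> FORTRAN types
make lapacklib                          # CLAPACK linked against openBLAS
```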

Since automating this compilation process would take time, I copied these binaries into the gem and wrote the extconf.rb so that the extension links against them.

On testing this with a ruby wrapper, the benchmarking code listed above yielded the following results:

```
user system total real
0.190000 0.000000 0.190000 ( 0.186355)
```

LAPACKE is the ‘official’ C interface to the FORTRAN-written LAPACK. It consists of two levels; a high level C interface for use with C programs and a low level one that talks to the original FORTRAN LAPACK code. This is not just an f2c’d version of LAPACK, and hence the design of this library is such that it is easy to create a bridge between C and FORTRAN.

For example, C stores arrays in row-major format while FORTRAN stores them column-major. To perform any computation, a matrix needs to be transposed to column-major form first and then re-transposed to row-major form so as to yield correct results. This must be done by the programmer when using CLAPACK, but LAPACKE’s higher level interface accepts arguments (LAPACKE_ROW_MAJOR or LAPACKE_COL_MAJOR) which specify whether the matrices passed to it are in row-major or column-major format. Thus extra (often unoptimized) code on the part of the programmer for performing the transposes is avoided.

To build binaries of LAPACKE compiled with openBLAS, just `cd` to your openBLAS source code directory and run `make`. This will generate a `.a` file with the binaries for LAPACKE and the CBLAS interface of openBLAS.

The LAPACKE benchmarks turn out to be faster mainly due to the absence of manual transposing by high-level code written in Ruby (the NMatrix#transpose function in this case). I think performing the transposing with openBLAS functions should remedy this problem for CLAPACK.

The benchmarks for LAPACKE are:

```
user system total real
0.150000 0.000000 0.150000 ( 0.147790)
```

As you can see, these are quite a bit faster than CLAPACK with openBLAS, listed above.

This is the combination currently in use with nmatrix. It involves installing the `libatlas-base-dev` package from the Debian repositories. This package will load all the relevant clapack, atlas, blas and cblas binaries onto your computer.

The benchmarks turned out to be:

```
user system total real
0.130000 0.000000 0.130000 ( 0.130056)
```

This is fast. But a big limitation of this approach is that the CLAPACK library exposed by `libatlas-base-dev` is outdated and no longer maintained. To top it off, it does not have all the functions that a LAPACK library is supposed to have.

For this test case I compiled LAPACKE (downloaded from netlib) with an ATLAS implementation from the Debian repositories. I then included the generated static libraries in the sample ruby gem and compiled the gem against those.

To do this on your machine:

- Install the package `libatlas-base-dev` with your package manager. This will install the ATLAS and CBLAS shared objects onto your system.
- `cd` to the lapack directory and in the `make.inc` file set `BLASLIB = -lblas -lcblas -latlas`. Then run `make`. This will compile LAPACK against the ATLAS installed on your system.
- Then `cd` to the lapack/lapacke folder and run `make`.

Again the function chosen was `LAPACKE_?gesv`. This test should tell us a great deal about the speed difference between openBLAS and ATLAS, since transposing overheads are handled by LAPACKE and no Ruby code interferes with the benchmarks.

The benchmarks turned out to be:

```
user system total real
0.140000 0.000000 0.140000 ( 0.140540)
```

As you can see from the benchmarks above, the approach currently followed by nmatrix (CLAPACK with ATLAS) is the fastest, but it has certain limitations:

- Requires tedious-to-install dependencies.
- Many packages offer the same binaries, causing confusion.
- The CLAPACK library is outdated and no longer maintained.
- ATLAS-CLAPACK does not expose all the functions present in LAPACK.

The LAPACKE-openBLAS and LAPACKE-ATLAS combinations, though a little slower (~10-20 ms), offer a huge advantage over CLAPACK-ATLAS:

- LAPACKE is the ‘standard’ C interface to the LAPACK libraries and is actively maintained, with regular release cycles.
- LAPACKE is compatible with intel’s MKL, in case a future need arises.
- LAPACKE bridges the differences between C and FORTRAN with a well thought out interface.
- LAPACKE exposes the entire LAPACK interface.
- openBLAS is trivial to install.
- ATLAS is a little non-trivial to install but is fast.

For a further explanation of the differences between CBLAS, CLAPACK and LAPACKE, read this blog post.

The new features include extensive support for missing data, hierarchical sorting of data frames and vectors with preserved indexing, the ability to group, split and aggregate data with group by, and quick summarizing of data by generating excel-style pivot tables. This release also includes new arithmetic and statistical functions on DataFrames and Vectors. Both DataFrame and Vector are now mostly compatible with statsample, allowing for a much larger scope of statistical analysis by leveraging the methods already provided in statsample.

The interface for interacting with nyaplot for plotting has also been revamped, allowing much greater control over the way graphs are handled by giving direct access to the graph object. A new class for hierarchical indexing of data (called MultiIndex) has also been added, which is immensely useful when grouping/splitting/aggregating data.

Let's look at all these features one by one:

You can now use either Ruby Arrays or NMatrix as the underlying implementation. Since NMatrix is fast and makes use of C storage, it is recommended when dealing with large sets of data. Daru will store data in a Ruby Array unless explicitly specified otherwise.

Thus, to specify the data type of a Vector, use the option `:dtype` and supply it with either `:array` or `:nmatrix`. If using the NMatrix dtype, you can also specify the C data type that NMatrix will use internally with the option `:nm_dtype`, supplying it with one of the NMatrix data types (ints, floats, rationals and complex numbers are currently supported; check the docs for further details).

As an example, consider creating a Vector which uses NMatrix underneath and stores data using the `:float64` NMatrix data type, which stands for double precision floating point numbers.

```
v = Daru::Vector.new([1.44,55.54,33.2,5.6], dtype: :nmatrix, nm_dtype: :float64)
#     nil
# 0   1.44
# 1  55.54
# 2   33.2
# 3    5.6

v.dtype #=> :nmatrix
v.type  #=> :float64
```

Another distinction between types of data that daru offers is `:numeric` versus `:object`. This is a generic feature for distinguishing numerical data from other types of data (like Strings or DateTime objects) that might be contained inside Vectors or DataFrames. The distinction is important because statistical and arithmetic operations can only be applied to structures of type numeric.

To query the data structure for its type, use the `#type` method. If the underlying implementation is an NMatrix, it will return the NMatrix data type; otherwise, for Ruby Arrays, it will be either `:numeric` or `:object`.

Thus daru exposes two methods for querying the type of data:

- `#type` - Get the generic type of data, to know whether numeric computation can be performed on the object. Returns the C data type used by NMatrix in case of an NMatrix dtype.
- `#dtype` - Get the underlying data representation (either `:array` or `:nmatrix`).

Any data scientist knows how common missing data is in real-life data sets, and to address that need, daru provides a host of functions for this purpose. This functionality is still in its infancy but should be up to speed soon.

The `#is_nil?` function will return a Vector object with `true` if a value is `nil` and `false` otherwise.

```
v = Daru::Vector.new([1,2,3,nil,nil,4], index: [:a, :b, :c, :d, :e, :f])
v.is_nil?
#=> #<Daru::Vector:93025420 @name = nil @size = 6 >
#      nil
# a    nil
# b    nil
# c    nil
# d   true
# e   true
# f    nil
```

The `#nil_positions` function returns an Array containing the indexes of all the nils in the Vector.

The `#replace_nils` function replaces nils with a supplied value.

```
v.replace_nils 69
#=> #<Daru::Vector:92796730 @name = nil @size = 6 >
#     nil
# a     1
# b     2
# c     3
# d    69
# e    69
# f     4
```

The statistics functions implemented on Vectors ensure that missing data is not considered during computation and are thus safe to call on missing data.

It is now possible to use the `#sort` function on Daru::DataFrame such that sorting happens hierarchically according to the order of the specified vector names.

In case you want to sort according to a certain attribute of the data in a particular vector, for example sorting a Vector of strings by length, you can supply a code block to the `:by` option of the sort method.

Supply the `:ascending` option with an Array containing `true` or `false` depending on whether you want the corresponding vector sorted in ascending or descending order.

```
df = Daru::DataFrame.new({
  a: ['ff' , 'fwwq', 'efe', 'a', 'efef', 'zzzz', 'efgg', 'q', 'ggf'],
  b: ['one' , 'one', 'one', 'two', 'two', 'one', 'one', 'two', 'two'],
  c: ['small','large','large','small','small','large','small','large','small'],
  d: [-1,2,-2,3,-3,4,-5,6,7],
  e: [2,4,4,6,6,8,10,12,14]
})
df.sort([:a,:d],
  by: {
    a: lambda { |a,b| a.length <=> b.length },
    d: lambda { |a,b| a.abs <=> b.abs }
  },
  ascending: [false, true]
)
```

Vector objects have a similar sorting method implemented. Check the docs for more details. Indexing is preserved while sorting both DataFrames and Vectors.

Previously, plotting with daru required a lot of arguments to be supplied by the user. The interface did not take advantage of Ruby’s blocks, nor did it expose many functionalities of nyaplot. All that changes with this new version, which brings in a new DSL for easy plotting (recommended usage is with the iruby notebook).

Thus to plot a line graph with data present in a DataFrame:

```
df = Daru::DataFrame.new({a: [1,2,3,4,5], b: [10,14,15,17,44]})
df.plot type: :line, x: :a, y: :b do |p,d|
  p.yrange [0,100]
  p.legend true
  d.color "green"
end
```

As you can see, the `#plot` function exposes the `Nyaplot::Plot` and `Nyaplot::Diagram` objects to the user after populating them with the relevant data. So the new interface lets experienced users utilize the full power of nyaplot, but keeps basic plotting very simple for new users or for quick and dirty visualization needs. Unfortunately for now, until a more viable solution to interfacing with nyaplot is found, advanced usage will need to call the nyaplot API directly.

Refer to this notebook for advanced plotting tutorials.

Daru includes a host of methods for simple statistical analysis of numeric data. You can call `mean`, `std`, `sum`, `product`, etc. directly on the DataFrame. The corresponding computation is performed on the numeric Vectors within the DataFrame, and missing data, if any, is excluded from the calculation by default.

So for this DataFrame:

```
df = Daru::DataFrame.new({
  a: ['foo' , 'foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
  b: ['one' , 'one', 'one', 'two', 'two', 'one', 'one', 'two', 'two'],
  c: ['small','large','large','small','small','large','small','large','small'],
  d: [1,2,2,3,3,4,5,6,7],
  e: [2,4,4,6,6,8,10,12,14],
  f: [10,20,20,30,30,40,50,60,70]
})
```

To calculate the mean of numeric vectors:

Apart from that, you can use the `#describe` method to calculate many statistical features of the numeric Vectors in one shot and see a summary of statistics for the numerical vectors in the returned DataFrame. For example:

The covariance and correlation coefficients between the numeric vectors can also be found with `#cov` and `#corr`:

```
df.cov
# =>
# #<Daru::DataFrame:91700830 @name = f5ae5d7e-9fcb-46c8-90ac-a6420c9dc27f @size = 3>
#      d    e    f
# d    4    8   40
# e    8   16   80
# f   40   80  400
```

A new way of hierarchically indexing data has been introduced in version 0.0.5, via the new `Daru::MultiIndex` class. Hierarchical indexing allows grouping sets of similar data by index, and lets you select subsets of data by specifying an index name in the upper hierarchy.

A MultiIndex can be created by passing a bunch of tuples into the Daru::MultiIndex class. A DataFrame or Vector can be created with one by passing a MultiIndex object into the `index` option. A MultiIndex can be used for determining the order of Vectors in a DataFrame too.

```
tuples = [
  [:a,:one,:bar], [:a,:one,:baz], [:a,:two,:bar], [:a,:two,:baz],
  [:b,:one,:bar], [:b,:two,:bar], [:b,:two,:baz], [:b,:one,:foo],
  [:c,:one,:bar], [:c,:one,:baz], [:c,:two,:foo], [:c,:two,:bar]
]
multi_index = Daru::MultiIndex.new(tuples)

vector_arry1 = [11,12,13,14,11,12,13,14,11,12,13,14]
vector_arry2 = [1,2,3,4,1,2,3,4,1,2,3,4]

order_mi = Daru::MultiIndex.new([
  [:a,:one,:bar], [:a,:two,:baz], [:b,:two,:foo], [:b,:one,:foo]])

df_mi = Daru::DataFrame.new([
  vector_arry1, vector_arry2, vector_arry1, vector_arry2],
  order: order_mi, index: multi_index)
```

Selecting a top-level index from the hierarchy will select all the rows under that name and return a new DataFrame with just that data and those indexes.

Alternatively, passing the entire tuple will return just that row as a `Daru::Vector`, indexed according to the column index.

Hierarchical indexing is especially useful when aggregating or splitting data, or generating data summaries, as we’ll see in the following examples.

When dealing with large sets of scattered data, it is often useful to ‘see’ the data grouped according to similar values in a Vector instead of it being scattered all over the place.

The `#group_by` function does exactly that. For those familiar with SQL, `#group_by` works much like the GROUP BY clause, but is easier to use since it's all Ruby.

The `#group_by` function accepts one or more Vector names and scans those vectors for common elements that can be grouped together. In case multiple names are specified, it checks for common attributes across rows.

So for example consider this DataFrame:

```
df = Daru::DataFrame.new({
  a: %w{foo bar foo bar foo bar foo foo},
  b: %w{one one two three two two one three},
  c: [1  ,2  ,3  ,1  ,3  ,6  ,3  ,8],
  d: [11 ,22 ,33 ,44 ,55 ,66 ,77 ,88]
})
#<Daru::DataFrame:88462950 @name = 0dbc2869-9a82-4044-b72d-a4ef963401fc @size = 8>
#        a      b      c      d
# 0    foo    one      1     11
# 1    bar    one      2     22
# 2    foo    two      3     33
# 3    bar  three      1     44
# 4    foo    two      3     55
# 5    bar    two      6     66
# 6    foo    one      3     77
# 7    foo  three      8     88
```

To group this DataFrame by the columns `:a` and `:b`, pass them as arguments to the `#group_by` function, which returns a `Daru::Core::GroupBy` object. Calling `#groups` on the returned `GroupBy` object returns a `Hash` with the grouped rows.

```
grouped = df.group_by([:a, :b])
grouped.groups
# => {
#  ["bar", "one"]   => [1],
#  ["bar", "three"] => [3],
#  ["bar", "two"]   => [5],
#  ["foo", "one"]   => [0, 6],
#  ["foo", "three"] => [7],
#  ["foo", "two"]   => [2, 4]}
```

To see the first member of each group in this collection, call `#first` on the `grouped` variable. Calling `#last` will return the last member of each group.

```
grouped.first
#=>     a      b      c      d
# 1    bar    one      2     22
# 3    bar  three      1     44
# 5    bar    two      6     66
# 0    foo    one      1     11
# 7    foo  three      8     88
# 2    foo    two      3     33
```

On a similar note, `#head(n)` will return the first `n` members of each group and `#tail(n)` the last `n`.

The `#get_group` function selects only the rows that belong to a particular group and returns a DataFrame containing those rows. The original indexing is of course preserved.

```
grouped.get_group(["foo", "one"])
# =>
# #<Daru::DataFrame:90777050 @name = cdd0afa8-252d-4d07-ad0f-76c7581a492a @size = 2>
#        a      b      c      d
# 0    foo    one      1     11
# 6    foo    one      3     77
```

The `Daru::Core::GroupBy` object contains a bunch of methods for creating summaries of the grouped data. These currently include `#mean`, `#std`, `#product`, `#sum`, etc., with many more to be added in the future. Calling any of the aggregation methods creates a new DataFrame with the groups as the index and the aggregated data of the non-grouped vectors as the corresponding values. Of course, aggregation applies only to `:numeric` type Vectors, and missing data is not considered during aggregation.

A hierarchically indexed DataFrame is returned. Check the `GroupBy` docs for more aggregation methods.

You can generate an excel-style pivot table with the `#pivot_table` function. The levels of the pivot table are stored in MultiIndex objects.

To demonstrate with an example, consider this CSV file on sales data.

To look at the data from the point of view of the manager and rep:

You can see that the pivot table has summarized the data and grouped it according to the manager and representative.

To see the sales broken down by the products:

Daru is now completely compatible with statsample: you can perform statistical analysis by simply passing a Daru::DataFrame or Daru::Vector to statsample's methods.

Find more examples of using daru for statistics in these notebooks.

Here's an example to demonstrate:

```
df = Daru::DataFrame.new({a: [1,2,3,4,5,6,7], b: [11,22,33,44,55,66,77]})

Statsample::Analysis.store(Statsample::Test::T) do
  t_2 = Statsample::Test.t_two_samples_independent(df[:a], df[:b])
  summary t_2
end
Statsample::Analysis.run_batch

# Analysis 2015-02-25 13:34:32 +0530
# = Statsample::Test::T
#   == Two Sample T Test
#     Mean and standard deviation
# +----------+---------+---------+---+
# | Variable |  mean   |   sd    | n |
# +----------+---------+---------+---+
# | a        | 4.0000  | 2.1602  | 7 |
# | b        | 44.0000 | 23.7627 | 7 |
# +----------+---------+---------+---+
#
#     Levene test for equality of variances : F(1, 12) = 13.6192 , p = 0.0031
#     T statistics
# +--------------------+---------+--------+----------------+
# |        Type        |    t    |   df   | p (both tails) |
# +--------------------+---------+--------+----------------+
# | Equal variance     | -4.4353 | 12     | 0.0008         |
# | Non equal variance | -4.4353 | 6.0992 | 0.0042         |
# +--------------------+---------+--------+----------------+
#
#     Effect size
# +-------+----------+
# | x1-x2 | -40.0000 |
# | d     | -12.0007 |
# +-------+----------+
```

- Pivot Tables example taken from here.

This involved solving a system of linear equations using forward substitution followed by back substitution on the LU factorization of the matrix of co-efficients.

The reduction techniques were quite baffling at first, because I had always solved equations in the traditional way and this was something completely new. I eventually figured it out and also implemented it in NMatrix. Here I will document how I did that. Hopefully, this will be useful to others like me!

I’m assuming that you are familiar with the LU decomposed form of a square matrix. If not, read this resource first.

Throughout this post, I will refer to *A* as the square matrix of co-efficients, *x* as the column matrix of unknowns and *b* as column matrix of right hand sides.

Let's say that the equation you want to solve is represented by:

The basic idea behind an LU decomposition is that a square matrix A can be represented as the product of two matrices *L* and *U*, where *L* is a lower triangular matrix and *U* is an upper triangular matrix.

Given this, equation (1) can be represented as:

L . (U . x) = b

Which we can use for solving the vector *y* such that:

L . y = b   ... (2)

and then solving:

U . x = y   ... (3)

The LU decomposed matrix is typically carried in a single matrix to reduce storage overhead, and thus the diagonal elements of *L* are assumed to have a value *1*. The diagonal elements of *U* can have any value.

The reason for breaking down *A* into triangular factors is that a triangular system of equations is trivial to solve. The lower triangular system (2) is solved using the technique of *forward substitution*.

Forward substitution is a technique that involves scanning a lower triangular matrix from top to bottom, computing a value for the topmost variable and substituting that value into the variables below it. This proved to be quite intimidating at first, because according to Numerical Recipes, the whole process of forward substitution can be represented by the following equation:

y[i] = ( b[i] - Σ(j=1..i-1) l[i][j] * y[j] ) / l[i][i]   ... (4)

Figuring out what exactly is going on was quite a daunting task, but I did figure it out eventually and here is how I went about it:

Let *L* in equation (2) be the lower triangular part of a 3x3 matrix *A* (as per (1)). Equation (2) can then be represented in matrix form as:

| l11    0     0  |   | y1 |   | b1 |
| l21   l22    0  | . | y2 | = | b2 |
| l31   l32   l33 |   | y3 |   | b3 |

Our task now is to calculate the column matrix containing the *y* unknowns.
Thus, by equation (4), each of them can be calculated with the following set of equations (if you find them confusing, just correlate each value with the ones present in the matrices above and it should be clear):

y1 = b1 / l11
y2 = (b2 - l21 * y1) / l22
y3 = (b3 - l31 * y1 - l32 * y2) / l33

It's now quite obvious why forward substitution is called so: we start from the topmost row of the matrix and use the value of the variable calculated in that row to calculate the *y* values for the following rows.
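
To make the mechanics concrete, here is a minimal plain-Ruby sketch of forward substitution (an illustration of the technique, not NMatrix's actual internals; the matrix is represented as an array of rows, and the example numbers are made up):

```ruby
# Forward substitution: solve L * y = b for y, where L is lower triangular.
# Following equation (4): subtract the already-computed y values from b[i],
# then divide by the diagonal element.
def forward_substitute(l, b)
  n = b.size
  y = Array.new(n, 0.0)
  (0...n).each do |i|
    sum = b[i]
    (0...i).each { |j| sum -= l[i][j] * y[j] }
    y[i] = sum / l[i][i]
  end
  y
end

l = [[2.0, 0.0, 0.0],
     [3.0, 1.0, 0.0],
     [1.0, 2.0, 4.0]]
b = [4.0, 7.0, 14.0]
forward_substitute(l, b) # => [2.0, 1.0, 2.5]
```

Note how `y[0]` is computed first and then reused by every later row, exactly as the three equations above describe.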

Now that we have the solution to equation (2), we can use the values generated in the *y* column vector to compute *x* in equation (3). Recall that the matrix *U* is the upper triangular decomposed part of *A* (equation (1)). This system can be solved using a technique called *backward substitution*. It is the exact reverse of the *forward substitution* that we just saw, i.e. the values of the bottom-most variables are calculated first and then substituted into the rows above to calculate subsequent variables.

The equation describing backward substitution is described in Numerical Recipes as:

x[i] = ( y[i] - Σ(j=i+1..N) u[i][j] * x[j] ) / u[i][i]   ... (5)

Let's try to understand this equation by extending the example we used above to understand forward substitution. To gain a better understanding of this concept, consider equation (3) written in matrix form (keeping the same 3x3 matrix *A*):

| u11   u12   u13 |   | x1 |   | y1 |
|  0    u22   u23 | . | x2 | = | y2 |
|  0     0    u33 |   | x3 |   | y3 |

Using the matrix representation above as reference, equation (5) can be expanded in terms of a 3x3 matrix as:

x3 = y3 / u33
x2 = (y2 - u23 * x3) / u22
x1 = (y1 - u12 * x2 - u13 * x3) / u11

Looking at the above equations it's easy to see how backward substitution can be used to solve for unknown quantities given an upper triangular matrix of co-efficients, by starting at the lowermost variable and gradually moving upward.
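
The reverse pass can be sketched the same way (again a plain-Ruby illustration with made-up numbers, not NMatrix's internals):

```ruby
# Back substitution: solve U * x = y for x, where U is upper triangular.
# Following equation (5): start at the last row and move upward, substituting
# the x values already computed into the rows above.
def back_substitute(u, y)
  n = y.size
  x = Array.new(n, 0.0)
  (n - 1).downto(0) do |i|
    sum = y[i]
    (i + 1...n).each { |j| sum -= u[i][j] * x[j] }
    x[i] = sum / u[i][i]
  end
  x
end

u = [[2.0, 1.0, 1.0],
     [0.0, 3.0, 2.0],
     [0.0, 0.0, 4.0]]
y = [9.0, 13.0, 8.0]
back_substitute(u, y) # => [2.0, 3.0, 2.0]
```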

Now that the methodology behind solving sets of linear equations is clear, let's consider a set of 3 linear equations in 3 unknowns and compute the values of the unknown quantities using the NMatrix `#solve` method.

The `#solve` method can be called on any nxn square matrix of a floating point data type, and expects its sole argument to be a column matrix containing the right hand sides. It returns a column NMatrix object containing the computed values of the unknowns.

For this example, consider these 3 equations:

These can be translated to Ruby code by creating an NMatrix only for the co-efficients and another one only for right hand sides:

One dimensional interpolation involves considering consecutive points along the X-axis with known Y co-ordinates and predicting the Y co-ordinate for a given X co-ordinate.

There are several types of interpolation depending on the number of known points used for predicting the unknown point, and several methods to compute them, each with their own varying accuracy. Methods for interpolation include the classic Polynomial interpolation with Lagrange’s formula or spline interpolation using the concept of spline equations between points.

The spline method is found to be more accurate and hence that is what is used in the interpolation gem.

Install the `interpolation` gem with `gem install interpolation`. Now let's see a few common interpolation routines and their implementation in Ruby:

This is the simplest kind of interpolation. It involves considering two points such that *x[j]* < *num* < *x[j+1]*, where *num* is the unknown point, and, using the slope of the straight line between *(x[j], y[j])* and *(x[j+1], y[j+1])*, predicting the Y co-ordinate with a simple linear polynomial.

Linear interpolation uses this equation:

y = y[j] + (interpolant - x[j]) * (y[j+1] - y[j]) / (x[j+1] - x[j])

Here *interpolant* is the value of the X co-ordinate whose corresponding Y-value needs to be found.

Ruby code:

```ruby
require 'interpolation'

x = (0..100).step(3).to_a
y = x.map { |a| Math.sin(a) }

int = Interpolation::OneDimensional.new x, y, type: :linear
int.interpolate 35
# => -0.328
```
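
Under the hood, the linear rule reduces to the slope formula above. A hand-rolled sketch in plain Ruby (an illustration of the formula, not the gem's actual implementation) reproduces the same result:

```ruby
# Plain-Ruby linear interpolation: find the bracketing pair
# x[j] <= num <= x[j+1], then evaluate the straight line through
# (x[j], y[j]) and (x[j+1], y[j+1]) at num.
def linear_interpolate(x, y, num)
  j = x.index { |xv| xv >= num } - 1
  j = 0 if j < 0
  slope = (y[j + 1] - y[j]) / (x[j + 1] - x[j]).to_f
  y[j] + slope * (num - x[j])
end

x = (0..100).step(3).to_a
y = x.map { |a| Math.sin(a) }
linear_interpolate(x, y, 35).round(3) # => -0.328
```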

Cubic Spline interpolation defines a cubic spline equation for each set of points between the *1st* and *nth* points. Each equation is smooth in its first derivative and continuous in its second derivative.

So for example, if the points on a curve are labelled *i*, where *i = 1..n*, the spline between any two points *i-1* and *i* will look like this:

y = A * y[i-1] + B * y[i] + C * y''[i-1] + D * y''[i]

where A = (x[i] - x) / (x[i] - x[i-1]), B = 1 - A, C = (A^3 - A) * (x[i] - x[i-1])^2 / 6 and D = (B^3 - B) * (x[i] - x[i-1])^2 / 6.

Cubic spline interpolation involves finding the second derivative at all the known points, which can then be used for evaluating the cubic spline polynomial, which is a function of *x*, *y* and the second derivatives of *y*.
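
The two passes just described (a tridiagonal solve for the second derivatives, then evaluation of the cubic polynomial on the bracketing interval) can be sketched in plain Ruby. This is an illustration of the standard natural-spline algorithm, not the gem's actual code:

```ruby
# Pass 1: compute second derivatives y2 at every known point with the
# natural boundary condition y2[0] = y2[n-1] = 0 (tridiagonal solve).
def spline_second_derivatives(x, y)
  n = x.size
  y2 = Array.new(n, 0.0)
  u  = Array.new(n, 0.0)
  (1..n - 2).each do |i|
    sig = (x[i] - x[i - 1]) / (x[i + 1] - x[i - 1])
    p = sig * y2[i - 1] + 2.0
    y2[i] = (sig - 1.0) / p
    u[i] = (y[i + 1] - y[i]) / (x[i + 1] - x[i]) -
           (y[i] - y[i - 1]) / (x[i] - x[i - 1])
    u[i] = (6.0 * u[i] / (x[i + 1] - x[i - 1]) - sig * u[i - 1]) / p
  end
  (n - 2).downto(0) { |k| y2[k] = y2[k] * y2[k + 1] + u[k] }
  y2
end

# Pass 2: evaluate the cubic spline polynomial at v using the y2 values.
def spline_eval(x, y, y2, v)
  j = x.rindex { |xv| xv <= v } || 0
  j = x.size - 2 if j > x.size - 2
  h = x[j + 1] - x[j]
  a = (x[j + 1] - v) / h
  b = (v - x[j]) / h
  a * y[j] + b * y[j + 1] +
    ((a**3 - a) * y2[j] + (b**3 - b) * y2[j + 1]) * h * h / 6.0
end

xs = (0..20).map { |i| i * 0.5 }
ys = xs.map { |v| Math.sin(v) }
y2 = spline_second_derivatives(xs, ys)
spline_eval(xs, ys, y2, 3.3) # closely approximates Math.sin(3.3)
```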

For more information read this resource.

In this first article on daru, I will show you some aspects of how daru handles data and some operations that can be performed on a real-life data set.

daru consists of two major data structures:

- **Vector** - A named one-dimensional array-like structure.
- **DataFrame** - A named spreadsheet-like two-dimensional frame of data.

A *Vector* can either be represented by a Ruby Array, NMatrix(MRI) or MDArray(JRuby) internally. This allows for fast data manipulation in native code. Users can change the underlying implementation at will (demonstrated in the next blog post).

Both of these can be indexed by the `Daru::Index` or `Daru::MultiIndex` class, which allows us to reference and operate on data by name instead of the traditional numeric indexing, and also perform index-based manipulation, equality and plotting operations.

The easiest way to create a vector is to simply pass the elements to a `Daru::Vector` constructor:

```ruby
v = Daru::Vector.new [23,44,66,22,11]
# This will create a Vector object

v
# => #<Daru::Vector:78168790 @name = nil @size = 5 >
#        nil
#    0    23
#    1    44
#    2    66
#    3    22
#    4    11
```

Since no name has been specified, the vector is named `nil`, and since no index has been specified either, a numeric index from 0..4 has been generated for the vector (leftmost column).

A better way to create vectors would be to specify the name and the indexes:

```ruby
sherlock = Daru::Vector.new [3,2,1,1,2], name: :sherlock,
  index: [:pipe, :hat, :violin, :cloak, :shoes]
# => #<Daru::Vector:78061610 @name = sherlock @size = 5 >
#            sherlock
#    pipe       3
#    hat        2
#    violin     1
#    cloak      1
#    shoes      2
```

This way we can clearly see the quantity of each item possessed by Sherlock.

Data can be retrieved with the `[]` operator.

A basic DataFrame can be constructed by simply specifying the names of columns and their corresponding values in a hash:

```ruby
df = Daru::DataFrame.new({a: [1,2,3,4,5], b: [10,20,30,40,50]}, name: :normal)
# => #<Daru::DataFrame:77782370 @name = normal @size = 5>
#        a    b
#    0    1   10
#    1    2   20
#    2    3   30
#    3    4   40
#    4    5   50
```

You can also specify an index for the DataFrame along with the data, and also specify the order in which the vectors should appear. Every vector in the DataFrame will carry the same index as the DataFrame once it has been created.

```ruby
plus_one = Daru::DataFrame.new(
  {a: [1,2,3,4,5], b: [10,20,30,40,50], c: [11,22,33,44,55]},
  name: :plus_one, index: [:a, :e, :i, :o, :u], order: [:c, :a, :b])
# => #<Daru::DataFrame:77605450 @name = plus_one @size = 5>
#        c    a    b
#    a   11    1   10
#    e   22    2   20
#    i   33    3   30
#    o   44    4   40
#    u   55    5   50
```

daru will also add `nil` values to vectors that fall short of elements.

```ruby
missing = Daru::DataFrame.new({a: [1,2,3], b: [1]}, name: :missing)
# => #<Daru::DataFrame:76043900 @name = missing @size = 3>
#        a    b
#    0    1    1
#    1    2   nil
#    2    3   nil
```

Creating a DataFrame by specifying `Vector` objects in place of the values in the hash will correctly align the values according to the index of each vector. If a vector is missing an index present in another vector, that index will be added to the vector with the corresponding value set to `nil`.

```ruby
a = Daru::Vector.new [1,2,3,4,5], index: [:a, :e, :i, :o, :u]
b = Daru::Vector.new [43,22,13], index: [:i, :a, :queen]

on_steroids = Daru::DataFrame.new({a: a, b: b}, name: :on_steroids)
# => #<Daru::DataFrame:75841450 @name = on_steroids @size = 6>
#            a     b
#    a       1    22
#    e       2   nil
#    i       3    43
#    o       4   nil
#    queen nil    13
#    u       5   nil
```

A DataFrame can be constructed from multiple sources:

- To construct by columns:
  - **Array of hashes** - Where the key of each hash is the name of the column to which the value belongs.
  - **Name-Array Hash** - Where the hash key is set as the name of the vector and the data the corresponding value.
  - **Name-Vector Hash** - This is the most advanced way of creating a DataFrame. Treats the hash key as the name of the vector. Also aligns the data correctly based on index.
  - **Array of Arrays** - Each sub-array will be considered as a Vector in the DataFrame.

- To construct by rows using the `.rows` class method:
  - **Array of Arrays** - This will treat each sub-array as an independent row.
  - **Array of Vectors** - Uses each Vector in the Array as a row of the DataFrame. Sets vector names according to the index of the Vector. Aligns vector elements by index.

Now that you have a basic idea about representing data in daru, let's see some more features of daru by loading some real-life data from a CSV file and performing some operations on it.

For this purpose, we will use iruby notebook, with which daru is compatible. iruby provides a great interface for visualizing and playing around with data. I highly recommend installing it for full utilization of this tutorial.

Let us load some data about the music listening history of one user from this subset of the Last.fm data set:

As you can see, the *timestamp* field is in a somewhat non-Ruby format which is pretty difficult for the default Time class to understand, so we destructively map time zone information (IST in this case) and then change every *timestamp* string field into a Ruby *Time* object, so that operations on time can be easily performed.

Notice the syntax for referencing a particular vector. Use ‘row’ for referencing any row.

```ruby
require 'date'

df = df.recode(:row) do |row|
  row[:timestamp] = DateTime.strptime(row[:timestamp],
    '%Y-%m-%dT%H:%M:%SZ%z').to_time
  row
end
```

A bunch of rows can be selected by specifying a range:

`df.row[900..923]`

Let's dive deeper by actually trying to extract something useful from the data that we have. Say we want to know the name of the artist heard the maximum number of times. So we create a Vector which has the names of the artists as the index and the number of times each name appears in the data as the corresponding values:

```ruby
# Group by artist name and call 'size' to see the number of
# rows each artist populates.
artists = df.group_by(:artname).size
```

To get the maximum value out of these, use `#max_index`. This will return a Vector which holds the max:

`artists.max_index`

daru uses Nyaplot for plotting, which is an optional dependency. Install nyaplot with `gem install nyaplot` and proceed.

To demonstrate, let's find the top ten artists heard by this user and plot the number of times their songs have been heard against their names in a bar graph. For this, use the `#sort` method, which will preserve the indexing of the vector.

```ruby
top_ten = artists.sort(ascending: false)[0..9]

top_ten.plot type: :bar do |plt|
  plt.width 1120
  plt.height 500
  plt.legend true
end
```

More examples can be found in the notebooks section of the daru README.

- This was but a very small subset of the capabilities of daru. Go through the documentation for more methods of analysing your data with daru.
- You can find all the above examples implemented in this notebook.
- Contribute to daru on github. Any contributions will be greatly appreciated!
- Many thanks to last.fm for providing the data.
- Check out the next blog post in this series, elaborating on the next release of daru.

Recently, I was working on implementing a matrix inversion routine using the Gauss-Jordan elimination technique in C++. This was part of the NMatrix Ruby gem, and because of the limitations imposed by trying to interface a dynamic language like Ruby with C++, the elements of the NMatrix object had to be expressed as a 1D contiguous C++ array for computation of the inverse.

The in-place Gauss-Jordan matrix inversion technique uses many matrix elements in every pass. Let's see some simple equations that can be used for accessing different types of elements in a matrix in a loop.

Let's say we have a square matrix A with shape *M*, stored row-major in a flat array. If *k* is the iterator we are using for going over each diagonal element of the matrix, then the index equation will be `k * (M + 1)`.

A for loop using the equation should look like this:

```cpp
for (k = 0; k < M; ++k) {
  cout << A[k * (M + 1)];
}
// This will print all the diagonal elements of a square matrix.
```

To iterate over each element in a given row of a matrix, use `row * M + col`. Here `row` is the fixed row and `col` goes from 0 to M-1.

To iterate over each element in a given column of a matrix, use `row * M + col`. Here `col` is the fixed column and `row` goes from 0 to M-1.

In general, the equation `row * M + col` will yield the matrix element with row index `row` and column index `col`.
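
Since NMatrix is a Ruby gem, the same flat-array arithmetic can be sanity-checked from Ruby, with an ordinary Array standing in for the 1-D storage (the 3x3 matrix here is purely hypothetical):

```ruby
# A hypothetical 3x3 matrix stored row-major in a flat 1-D array (M = 3):
#   0 1 2
#   3 4 5
#   6 7 8
m = 3
a = (0...m * m).to_a

# Diagonal elements: index k * (M + 1).
diagonal = (0...m).map { |k| a[k * (m + 1)] }    # => [0, 4, 8]

# Row 1, with col running from 0 to M-1: index row * M + col.
row_one  = (0...m).map { |col| a[1 * m + col] }  # => [3, 4, 5]

# Column 2, with row running from 0 to M-1: same index equation.
col_two  = (0...m).map { |row| a[row * m + 2] }  # => [2, 5, 8]

# An arbitrary element: row 2, column 1.
a[2 * m + 1]                                     # => 7
```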