Elaboration on certain internals of daru
In this blog post I will elaborate on how a few of the features in daru were implemeted. Notably I will stress on what spurred a need for that particular design of the code.
This post is primarily intended to serve as documentation for me and future contributors. If readers have any inputs on improving this post, I’d be happy to accept new contributions :)
Index factory architecture
Daru currently supports three types of indexes, Index, MultiIndex and DateTimeIndex.
It became very tedious to write if statements in the Vector or DataFrame codebase whenever a new data structure was to be created, since there were 3 possible indexes that could be attached with every data set. This mainly depended on what kind of data was present in the index, i.e. tuples would create a MultiIndex, DateTime objects or date-like strings would create a DateTimeIndex, and everything else would create a Daru::Index.
This looked something like the perfect use case for the factory pattern, the only hurdle being that the factory pattern in the pure sense of the term would be a superclass, something called Daru::IndexFactory
that created an Index, DateTimeIndex or MultiIndex index using some methods and logic. The problem is that I did not want to call a separate class for creating Indexes. This would break existing code and possibly cause problems in libraries that were already using daru (viz. statsample), not to mention confusing users about which class they’re actually supposed to be using.
The solution came after I read this blog post, which demonstrates that the .new
method for any class can be overridden. Thus, instead of calling initialize
for creating the instance of a class, it calls the overridden new
, which can then call initialize for instantiating an instance of that class. It so happens that you can make new
return any object you want, unlike initialize which must an instance of the class it is declared in. Thus, for the factory pattern implementation of Daru::Index, we over-ride the .new
method of the Daru::Index and write logic such that it manufactures the appropriate kind of index based on the data that is passed to Daru::Index.new(data)
. The pseudo code for doing this looks something like this:
class Daru::Index
# some stuff...
def self.new *args, &block
source = args[0]
if source_looks_like_a_multi_index
create_multi_index_and_return
elsif source_looks_like_date_time_index
create_date_time_index_and_return
else # Create the Daru::Index by calling initialize
i = self.allocate
i.send :initialize, *args, &block
i
end
end
# more stuff...
end
Also, since over-riding .new
tampers with the subclasses of the class as well, an inherited
hook that replaces the over-ridden .new
of the inherited class with the original one was added to Daru::Index
.
Working of the where clause
The where clause in daru lets users query data with a Array containing boolean variables. So whenever you call where
on Daru::Vector or DataFrame, and pass in an Array containing true or false values, all the rows corresponding with true
will be returned as a Vector or DataFrame respectively.
Since the where clause works in cojunction with the comparator methods of Daru::Vector (which return a Boolean Array), it was essential for these boolean arrays to be combined together such that piecewise AND and OR operations could be performed between multiple boolean arrays. Hence, the Daru::Core::Query::BoolArray
class was created, which is specialized for handling boolean arrays and performing piecewise boolean operations.
The BoolArray defines the #&
method for piecewise AND operations and it defines the #|
method for piecewise OR operations. They work as follows:
require 'daru'
a = Daru::Core::Query::BoolArray.new([true,false,false,true,false,true])
#=> (Daru::Core::Query::BoolArray:84314110 bool_arry=[true, false, false, true, false, true])
b = Daru::Core::Query::BoolArray.new([false,true,false,true,false,true])
#=> (Daru::Core::Query::BoolArray:84143650 bool_arry=[false, true, false, true, false, true])
a & b
#=> (Daru::Core::Query::BoolArray:83917880 bool_arry=[false, false, false, true, false, true])
a | b
#=> (Daru::Core::Query::BoolArray:83871560 bool_arry=[true, true, false, true, false, true])