finding and combining data in daru
Arel-like query syntax
Arel is a very popular ruby gem that is one of the major components of the most popular ruby framework, Rails. It is an ORM-helper of sorts that exposes a beatiful and intuitive syntax for creating SQL strings by chaining Ruby methods.
Daru successfully adopts this syntax and the result is a very intuitive and readable syntax for obtaining any sort of data from a DataFrame or Vector.
As a quick demonstration, lets create a DataFrame which looks like this:
require 'daru'
df = Daru::DataFrame.new({
a: [1,2,3,4,5,6]*100,
b: ['a','b','c','d','e','f']*100,
c: [11,22,33,44,55,66]*100
}, index: (1..600).to_a.shuffle)
df.head(5)
#=>
##<Daru::DataFrame:80543480 @name = 3fc642f2-bd9a-4f6f-b4a8-0779253720f5 @size = 5>
# a b c
# 109 1 a 11
# 381 2 b 22
# 598 3 c 33
# 390 4 d 44
# 344 5 e 55
To select all rows where df[:a] equals 2 or df[:c] equals 55, just write this:
selected = df.where(df[:a].eq(2) | df[:c].eq(55))
selected.head(5)
# =>
##<Daru::DataFrame:79941980 @name = 74175f76-9dce-4b5d-b85b-bdfbb650953e @size = 5>
# a b c
# 381 2 b 22
# 344 5 e 55
# 135 2 b 22
# 524 5 e 55
# 266 2 b 22
As is easily seen above, the Daru::Vector class has special comparators defined on it, which allow it to check each value of the Vector and return an object that can be evaluated by the DataFrame#where method.
Notice that to club the two comparators above, we have used the union OR (|) operator.
Daru::Vector has a bunch of comparator methods defined on it, which can be used with #where for obtaining the desired results. All of these return an object of type Daru::Core::Query::BoolArray, which is read by #where. BoolArray uses the methods | (also aliased as #or) and & (also aliased as #and) for piecewise logical operations on other BoolArray objects.
BoolArray consists of an internal Array that contains true for every entry in the Vector that returns true for an operation between the comparable operand and a Vector entry.
For example,
require 'daru'
vector = Daru::Vector.new([1,2,3,4,5,6,7,8,2,3])
vector.eq(3)
#=>(Daru::Core::Query::BoolArray:82379030 bool_arry=[false, false, true, false, false, false, false, false, false, true])
The #& (or #and) and #| (or #or) methods on BoolArray apply a logical and and a logical or respectively between each element of the BoolArray and return another BoolArray that contains the results. For example:
require 'daru'
vector = Daru::Vector.new([1,2,3,4,5,6,7,7,8,9,9,9,7,5,4,3,4])
vector.eq(4).or(vector.mt(8))
#=> (Daru::Core::Query::BoolArray:82294620 bool_arry=[false, false, false, true, false, false, false, false, false, true, true, true, false, false, true, false, true])
The following comparators can be used with a Daru::Vector:
| Comparator Method | Description |
|---|---|
eq |
Uses == and returns true for each equal entry |
not_eq |
Uses != and returns true for each unequal entry |
lt |
Uses < and returns true for each entry less than the supplied object |
lteq |
Uses <= and returns true for each entry less than or equal to the supplied object |
mt |
Uses > and returns true for each entry more than the supplied object |
mteq |
Uses >= and returns true for each entry more than or equal to the supplied object |
in |
Uses == for each element in the collection (Array, Daru::Vector, etc.) passed and returns true for a match |
A major advantage of using the #where clause over DataFrame#filter or Vector#keep_if, apart from better readability and usability, is that it is much faster. These benchmarks prove my point.
I’ll conclude this chapter with a little more complex example of using the arel-like query syntax with a Daru::Vector object:
require 'daru'
vec = Daru::Vector.new([1,2,3,4,5,6,3,336,3,6,2,6,2,35,346,7,3,45,23,26,7,345,2525,22,66,2])
vec.where((vec.eq(4) | vec.eq(1) | vec.mt(300)) & vec.lt(2000))
# =>
# #<Daru::Vector:70585830 @name = nil @size = 5 >
# nil
# 0 1
# 3 4
# 7 336
# 14 346
# 21 345
For more examples on using the arel-like query syntax, see this notebook.
Joins
Daru::DataFrame offers the #join method for performing SQL style joins between two DataFrames. Currently #join supports inner, left outer, right outer and full outer joins between DataFrames.
In order to demonstrate joins, lets consider a single example of an inner on two DataFrames:
require 'daru'
left = Daru::DataFrame.new({
:id => [1,2,3,4],
:name => ['Pirate', 'Monkey', 'Ninja', 'Spaghetti']
})
right = Daru::DataFrame.new({
:id => [1,2,3,4],
:name => ['Rutabaga', 'Pirate', 'Darth Vader', 'Ninja']
})
left.join(right, on: [:name], how: :inner)
#=>
##<Daru::DataFrame:73134350 @name = 7cc250a9-108c-4ea3-99ab-dcb828ff2b88 @size = 2>
# id_1 name id_2
# 0 1 Pirate 2
# 1 3 Ninja 4
For more examples please refer this notebook.