Setting up a lexical analyser and parser in Ruby
I wrote this post as I was setting up the lexer and parser for Rubex, a new superset of Ruby that I’m developing.
Let’s demonstrate the basic working of a lexical analyser and parser in action with a demonstration of a very simple addition program. Before you start, please make sure rake, oedipus_lex and racc are installed on your computer.
Configuring the lexical analyser
The most fundamental need of any parser is that it needs string tokens to work with, which we will provide by way of lexical analysis by using the oedipus_lex gem (the logical successor of rexical). Go ahead and create a file lexer.rex
with the following code:
class AddLexer
macro
DIGIT /\d+/
rule
/#{DIGIT}/ { [:DIGIT, text.to_i] }
/.|\n/ { [text, text] }
inner
def do_parse; end # this is a stub.
end # AddLexer
In the above code, we have defined the lexical analyser using Oedipus Lex’s syntax inside the AddLexer
class. Let’s go over each element of the lexer one by one:
macro
The macro keyword lets you define macros for certain regular expressions
that you might need to write repeatedly. In the above lexer, the macro DIGIT
is a regular expression (\d+
) for detecting one or more integers. We place the regular expression inside forward slashes (/../
) because oedipus_lex requires it that way. The lexer can handle any valid Ruby regular expression. See the Ruby docs for details on Ruby regexps.
rule
The section under the rule
keyword defines your rules for the lexical analysis. Now it so happens that we’ve defined a macro for detecting digits, and in order to use that macro in the rules, it must be inside a Ruby string interpolation (#{..}
). The line to the right of the /#{DIGIT}/
states the action that must be taken if such a regular expression is encountered. Thus the lexer will return a Ruby Array that contains the first element as :DIGIT
. The second element uses the text
variable. This is a reserved variable in lex that holds the text that the lexer has matched. Similar the second rule will match any character (.
) or a newline (/n
) and return an Array
with [text, text]
inside it.
inner
Under the inner
keyword you can specify any code that you want to occur inside your lexer class. This can be any logic that you want your lexer to execute. The Ruby code under the inner
section is copied as-is into the final lexer class. In the above example, we’ve written an empty method called do_parse
inside this section. This method is mandatory if you want your lexer to sucessfully execute. We’ll be coupling the lexer with racc
shortly, so unless you want to write your own parsing logic, you should leave this method empty.
Configuring the parser
In order for our addition program to be successful, it needs to know what to do with the tokens that are generated by the lexer. For this purpose, we need racc, an LALR(1) parser generator for Ruby. It is similar to yacc or bison and let’s you specify grammars easily.
Go ahead and create a file called parser.racc
in the same folder as the previous lexer.rex
and Rakefile
, and put the following code inside it:
class AddParser
rule
target: exp { result = 0 }
exp: exp '+' exp { result += val[2]; puts result }
| DIGIT
end
---- header
require_relative 'lexer.rex.rb'
---- inner
def next_token
@lexer.next_token
end
def prepare_parser file_name
@lexer = AddLexer.new
@lexer.parse_file file_name
end
As you can see, we’ve put the logic for the parser inside the AddParser
class. Yacc’s $$
is the result
; $0
, $1
… is an array called val
, and $-1
, $-2
… is an array called _values
. Notice that in racc, only the parsing logic exists inside the class and everything else (i.e under header
and inner
) exists outside the class. Let’s go over each part of the parser one by one:
class AddParser
This is the core class that contains the parsing logic for the addition parser. Similar to oedipus_lex
, it contains a rule
section that specifies the grammar. The parser expects tokens in the form of [:TOKEN_NAME, matched_text]
. The :TOKEN_NAME
must be a symbol. This token name is matched to literal characters in the grammar (DIGIT
in the above case). token
and expr
are varibles. Have a look at this introduction to LALR(1) grammars for further information.
header
The header
keyword tells racc what code should be put at the top of the parser that it generates. You usually put your require
statements here. In this case, we load the lexer class so that the parser can use it for accessing the tokens generated by the lexer. Notice that header
has 4 hyphens (-
) and a space before it. This is mandatory if your program is to not malfunction.
inner
The inner
keyword tells racc what should be put inside the generated parser class. As you can see there are two methods in the above example - next_token
and prepare_parser
. The next_token
method is mandatory for the parser to function and you must include it in your code. It should contain logic that will return the next token for the parser to consider. Moving on the prepare_parser
method, it takes a file name that is to be parsed as an argument (how we pass that argument in will be seen later), and initialzes the lexer. It then calls the parse_file
method, which is present in the lexer class by default.
The next_token
method in turn uses the @lexer
object’s next_token
method to get a token generated by the lexer so that it can be used by the parser.
Putting it all together
Our lexical analyser and parser are now coupled to work with each other, and we now use them in a Ruby program to parse a file. Create a new file called adder.rb
and put the following code in it:
require_relative 'parser.racc.rb'
file_name = ARGV[0]
parser = AddParser.new
parser.prepare_parser(file_name)
parser.do_parse
The prepare_parser
is the same one that was defined in the inner
section of the parser.racc
above. The do_parse
method called on the parser will signal the parser to start doing it’s job.
In a separate file called text.txt
put the following text:
2+2
Oedipus Lex does not have a command line tool like rexical for generating a lexer from the logic specified, but rather has a bunch of rake tasks defined for doing this job.
So now create a Rakefile
in the same folder and put this code inside it:
require 'oedipus_lex'
Rake.application.rake_require "oedipus_lex"
desc "Generate Lexer"
task :lexer => "lexer.rex.rb"
desc "Generate Parser"
task :parser => :lexer do
`racc parser.racc -o parser.racc.rb`
end
Running rake parser
will generate a two new files - lexer.rex.rb
and parser.racc.rb
- which will house the classes and logic for the lexer and parser, respectively. You can use your newly written lexer + parser with a ruby adder.rb text.txt
command. It should output 4
as the answer.
You can find all the code in this blogpost here.