Writing the Grammar and Parser in Ruby first has the advantage of interactivity. Ruby is interpreted and has a very quick startup time.
To get going I’ll start writing a JSON parser. This may sound strange as there are already JSON parsers out there, but I don’t try to replace them, but rather show the concepts along that line rather simple grammar.
At the end, I will also give you a peek at a grammar for a more complex DSL for alert trigger creation.
To get an overview of the grammar, have a look at JSON.org.
The Ruby parser
For Ruby, I have chosen a treetop parser, which is a Parsing Expression Grammar (PEG).
Grammars in it look like this:
grammar Grammarname (1)
rule FirstRule (2)
.. projections..
end
rule NextRule
.. projections..
end
end
1 | The name of the grammar determines the name of the parser. See below |
2 | The first rule is the starting point of the grammar. |
For our Json parser the top of the file json-simple.treetop
would look like this:
# A treetop grammar for JSON
grammar JSONGrammar
rule json (1)
array / object (2)
end
1 | Topmost rule, parsing starts here |
2 | Content of the rule used for matching |
A Json object would thus be either an array
or an object
. As both are not terminal symbols in a simple character or string (like 'this'
), more rules are needed to resolve this. We will look at them below.
Before we continue with the grammar, let’s have a look at how to compile the grammar and use it.
Compiling the grammar
Treetop brings a command line tool called tt
, which will compile the grammar into Ruby code.
tt.
$ tt json-simple.treetop (1)
$ ls json.rb (2)
json.rb
1 | The input grammar file |
2 | The resulting Ruby parser |
This worked well, but calling tt
during a development cycle is a bit too heavy. Luckily, there is a way to compile grammar files on the fly when you first use them.
Here you see how a parser driver can load the definition and instantiate the parser.
require 'treetop'
class JsonParser
Treetop.load(File.join(__dir__, 'json-simple.treetop'))
@@parser = JSONGrammarParser.new
def self.parse(data, env = {})
# Pass the data over to the parser instance
tree = @@parser.parse(data)
# If the AST is nil then there was an error during parsing
# we need to report a simple error message to help the user
The key part is those two lines:
Treetop.load(File.join(base_path, 'json-simple.treetop')) (1)
@@parser = JSONGrammarParser.new (2)
1 | We are loading the .treetop grammar file |
2 | The name of the parser is determined from the grammar |
The name of our parser is determined from the grammar
instruction by appending Parser
to the grammar name given there.
Let’s try our grammar.
$ irb
2.3.0 :001 > require_relative 'json_parser' (1)
=> true
2.3.0 :002 > JsonParser.parse '{}' (2)
NameError: undefined local variable or method `_nt_array' for #<JSONGrammarParser:0x007f895a9b31c0>
from (eval):21:in `_nt_json'
1 | Load the parser |
2 | Try to parse the empty object |
You see that the parsing fails with an exception about an unknown method _nt_array
. If you look at the grammar above you can see that we have said that a json object is either an array
or an object
but did not define them anywhere.
While the compiler is not telling us where in our source file it thinks the error is, the message above (which could be less cryptic) tells us that _nt_array
is missing from _nt_json
or in other words the rule json
uses a non-terminal ( _nt
) named array
, which is not defined.
TIP | As the grammar is evaluated left to right, it also means that the error reporting is left to right.
In the above case, the missing |
Continuing with the grammar
So let’s continue with the grammar.
array.
rule array
'[' space a_value? (space ',' space a_value)* space ']'
end
In this example, you find two new things terminals enclosed in single quotes like the opening and
closing brackets, '['
and ']'
. The parser expects such a symbol or keyword literally at this place.
So for the above, an array
needs to start with a [
character and end with ]
to be recognized.
The other new concept you see here are the quantifiers, that you probably already know from
regular expressions:
?
The item may show up zero or one time+
The item must show up at least one time (not shown above)*
The item may be given any number of times (including zero)
And last but not least items can be grouped together with parens (
and ).
So the array has to start with [
has a space
rule, zero or one value
and then zero or many times a block, which consists of a space, a literal colon, space and a value.
Now let’s look at the a_value
.
a_value.
rule a_value
object / array / string / number / true / false / null
end
Here you see the slash /
which means or. A value
can thus be an object, an array, a string, a number or one of true, false and null. As treetop is a PEG parser generator, it will evaluate the list from left to right.
Sometimes you want to have a rule match different options like the above, but where only a part of the matching expression is a choice. In this case, you can put a sub-expression in parens:
rule eval_boolean
boolean ( 'and' / 'or' ) boolean
end
The last excerpt from the grammar is the space.
space.
rule space
[\s\n\t\r]*
end
Here you see that the space
consists of zero or more space, newline or tab characters.
TIP | The full grammar up to this point is available in the file json-simple.treetop . |
Generating values from the input
Up to this point, we have a grammar that can parse JSON input, but for the parser to be useful we actually want to transform the input into an object that contains nested lists and objects.
You may have seen above that the parser returns the abstract syntax tree (AST) it has generated:
tree = @@parser.parse(data)
We could now walk the tree and determine the values from the walk. But there is a better way to do this in the treetop.
Individual grammar nodes can provide methods that can be called during the parsing run. There are several options to do this - I will show two.
Inline in the grammar
In this case, the function is directly inside the .treetop
file. This is nice for a very simple function but otherwise is probably more confusing than helpful. Also, even if it is Ruby code, your IDE is most likely not helping you.
rule val_in_parens
'(' string ')' (1)
{ (2)
def value (3)
elements[1].text_value (4)
end
}
end
1 | The usual matching |
2 | The inline function definition is delimited by { and } |
3 | The name of the function can be anything with and without parameters. |
4 | Evaluation of the current node, see below |
Even if the name and signature of the function is arbitrary, it needs to be known to a potential calling node. |
Via Mixin
Here we externalize the functions into a Module
and only provide a pointer to it inside the
grammar.
rule val_in_parens
'(' string ')' <ValueNode> (1)
end
1 | The match definition gets the name of the Module passed in angle brackets |
We then have a Ruby file that contains the definition:
module ValueNode
def value
elements[1].text_value
end
end
You see that this is the same as inside the grammar, but you get all the help from the IDE. It is now needed though to require
that additional file from our parser driver.json_parser.rb
# Our JsonParser
require_relative 'parser_extensions'
require 'treetop'
class JsonParser
Evaluation from the parser driver
Now that we have the functions defined, we can start evaluating the input. This is simply done by calling a function on the generated AST:
def self.parse(data, env = {})
# Pass the data over to the parser instance
tree = @@parser.parse(data)
[...]
# Compute the hash and return it
tree.value
end
As we are calling the function value
we also need to make sure that the topmost grammar node has it defined. Otherwise, an error like this is raised:
In our case this is not needed, as our json rule only diverts to array or object and thus does not add any value (pun intended). |
2.3.0 :003 > JsonParser.parse '{}' NoMethodError: undefined method 'value' for #<Treetop::Runtime::SyntaxNode:0x007fc204068970>
Now the big question is how can we access the elements in a rule to evaluate them?
Internally the parsed result is out into a Ruby-array called elements
. Data starts at index 0.
This is a bit tricky sometimes and I have added the element positions below the matches.
rule kv_pair
string space? ':' space? a_value
# 0 1 2 3 4
end
To obtain the key (a string
) you can thus use elements[0]
and elements[4]
for the value. Our value
function for kv_pair
could thus look like this:
def value (parent)
parent[elements[0].value.to_sym] = elements[4].value
end
This works quite nicely and if you compile the grammar with tt
as seen above, you can see that the compiled parser uses the same elements to provide some aliases:
module KvPair0
def string
elements[0]
end
def a_value
elements[4]
end
end
It is thus possible to use the aliases in the above for it to look like this:
def value (parent)
parent[string.value.to_sym] = a_value.value
end
The parser gets a lot less brittle this way. If your rule now prepends a matching character literal,
the aliases stay the same while the indexes into the elements would have changed.
Now suppose you need another string
in your input. Treetop would then name them string1
and string2
. A more elegant solution here is to "tag" those input rules with the desired alias as in,
rule kv_pair
key:string space ':' space a_value
end
So that you can access the key string as key
:
module KVNode
def value(parent)
parent[key.value.to_sym] = a_value.value
end
end
Still, the value part of the key-value-pair uses the un-aliased a_node
. If you ever decide to
rename this, you also need to rename the ruby code.
Treetop v1.6.8 seems to only provide the aliases for non-terminal symbols that are required. For cases where you have optional symbols, you still need to fall back to tags. |
Handling of bad input
Now sometimes parsing an input fails because the input does not match the defined grammar. The parser does not return an AST in this case, which we can test for and return some debugging
information to the user:
# If the AST is nil then there was an error during parsing
# we need to report a simple error message to help the user
if(tree.nil?)
puts "Input : >>|#{data}|<<"
puts @@parser.terminal_failures.join("\n")
raise JsonParserException, "Parse error: #{@@parser.failure_reason}"
end
# Compute the hash and return it
tree.value
Let’s see how this looks in irb.
>> require_relative 'json_parser' => true >> JsonParser.parse '{ x' Input : >>|{ x|<< String matching [\s\n\t] expected. String matching '"' expected. JsonParserException: Parse error: Expected one of [\s\n\t\r], '"' at line 1, column 3 (byte 3) after { from /Users/hrupp/src/dsl_writing/ruby/json_parser.rb:21:in `parse' from (irb):2
While working in your own code, you can change the error handling to your liking.
Traps
In this section, I want to point out a few traps that one can easily fall into.
Too much space
Suppose the following grammar
rule object
'{' space kv_pair? space '}'
end
rule kv_pair
space string space ':' space value space
end
Parsing or filling of the elements
array may fail in unexpected ways. The parser is unable to
determine to which element the various space characters should be matched. This becomes a bit more obvious when we 'insert' the kv_pair ` declaration into the `object
one:
rule object
'{' space space string space ':' space value space space '}' <ObjectNode>
# 0 1 2 3 4 5 6 7 8 9 10
end
At positions 1,2 and 8,9, each time the parser is asked to (greedily) consume white space. One could argue that positions 1 and 8 do this and 2 and 9 could be no-ops, but apparently, this is not the case.
A solution can be to remove the surrounding space
in kv_pair
:
rule kv_pair
string space ':' space value
end
Adjacent things that should not be
Take this grammar snippet:
rule array
'[' a_value?']'
end
Treetop will fail to compile the grammar, as there is no whitespace between a_value?
and ']'
, so the expression makes no sense to it.
Forgotten grouping
In the next snippet, we have two aBool
that can be tied together via 'and'
or 'or'
.
rule eval_boolean
aBool 'and' / 'or' aBool
end
Treetop will, in this case, try to match aBool 'and'
or 'or' aBool
, which is most of the time not what we intended. To fix this we need to put the alternatives into a group.
rule eval_boolean
aBool ( 'and' / 'or' ) aBool
end
Naming things
Treetop will internally provide aliases for matches as we have seen before. Take this example:
rule object
'{' string ':' value '}'
It will create an alias 'value'. Now when you define a function value
on it, you will get
a silent clash.
rule object
'{' string ':' value '}'
{
def value
Best here is to name your functions in a way that you would never name the rules or grammar elements.
Looking at a DSL for alert trigger creation
The Hawkular project also has an alerting component that can be used to inform the user when conditions are met. Those could be when certain values are larger than a threshold, when a string is matching or when the availability of a resource goes down. The definition of the alert trigger with its conditions etc. is done via REST-api.
With HawkFX, a pet project of mine, I have created an explorer for Hawkular. And in this regard, I have also started to add a DSL for trigger definition. I have blogged about it as well.
I will not show the full code here, but some excerpts to give you some ideas how a more complex grammar may look like. The full code is available in the HawkFX repo on GitHub.
Some examples
define trigger "MyTrigger" (1)
disabled (2)
severity HIGH (3)
( availability "mymetric" is DOWN ) (4)
1 | We define a new trigger with the name MyTrigger |
2 | It is disabled by at creation time |
3 | When it fires an alert, then the alert severity is High |
4 | It fires when the availability of mymetric is Down. |
define trigger "MyTrigger"
AND( (1)
( threshold counter "mycount" < 5 ) (2)
( string "mymetric" CO "ERROR" ) (3)
)
1 | The trigger only fires if all conditions are true |
2 | The condition is true if the counter-metric 'mycount' is less than 5 |
3 | This condition is true if the string-metric 'mymetric' contains the string 'ERROR' |
define trigger "MyTrigger"
( threshold "myvalue" > 3 )
auto-disable (1)
1 | When this keyword is present, the Trigger will auto-disable after firing. |
The grammar
Now that we have seen some examples, we can look at the grammar rules.
rule define
'define trigger ' space? name:string space? (1)
disabled:disabled? space?
sev:severity? space?
id:id? space? conditions space? act:action? (2)
ae:auto_enable? space? ad:auto_disable? space?
<DefineNode> (3)
end
1 | 'define trigger' needs to be written as is |
2 | We need conditions. Pretty much everything else is optional |
3 | Evaluation happens in the DefineNode module |
The rule for disabled is pretty simple and just checks if the keyword 'disabled' is present.
rule disabled
'disabled'
end
Inside the Ruby code it is queried on the value
method for the DefineNode
module:
module DefineNode
def value(env = {})
t = Hawkular::Alerts::Trigger.new({}) (1)
t.enabled = true if disabled.empty? (2)
t.name = name.value env
[..]
1 | Trigger object that we want to prepare |
2 | We check if disabled is set. |
Remember from above that disabled
is available as a variable for us to use because we set it in the define
rule via disabled:disabled?
.
rule conditions
conds:(and_cond / or_cond / condition) <ConditionsNode>
end
Conditions can be either a single definition, or conditions joined by 'AND' or 'OR'. Next is the rule
for the 'AND' case:
rule and_cond
'AND(' space? first:condition more:( space condition )* space? ')' <AndConditionNode>
end
Here you see a trick that is often used: we parse the first condition in first:condition
and then define the remaining conditions as a list of any number of conditions in the more
part.
In code this looks as follows:
module AndConditionNode
def value(env = {})
first.value env (1)
more.elements.each {|elem| elem.condition.value env} (2)
env[:trigger].firing_match = :ALL
end
end
1 | Evaluate the first condition |
2 | Evaluate the additional and optional conditions. |
I am at the end of the example. Again, the full code is available in the HawkFX repository on GitHub.
References
Treetop home page http://treetop.rubyforge.org/index.html
Parsing time ranges in Treetop https://github.com/jcinnamond/treetop-time-range
Json.org http://json.org/index.html
Join Red Hat Developers, a developer program for you to learn, share, and code faster – and get access to Red Hat software for your development. The developer program and software are both free!
Download Red Hat Enterprise Linux Developer Suite, which includes Red Hat Enterprise Linux 7 server, a collection of development tools, and much more.