This tool generates syntax LALR (Look-Ahead LR parser) analyzer based on
grammar specified using BNF (Backus–Naur or context-free
Form). Its input file is very similar to the input of standard yacc tool by
structure and syntax. But contrary to yacc instead of generating full analyzer with inlined action handlers parsegen
generates only tables and parser engine C-compliant function as files, which can be included into the rest analyzer's
implementation.
This README file briefly describes this tool and can be not up-do-date, it also gives guidelines how to use this stuff. For more detailed information see wiki pages.
Assume that we already have built this tool as parsegen executable, and we want to create a simple arithmetical
calculator, which supports '+', '-' (unary and binary), '*', and '/' operators. Brackets '(' and ')' can also be used to
override precedence.
Here is a source input file test.gr for the desired parser:
# This is an arithmetical expression syntax analyzer.
# This section contains token and action definitions as well as token priority and
# associativity specifications.
# Start conditions are also can be defined in this section using the form: %start <sc-name>.
# By default only one `initial` start condition is implicitly defined.
# Tokens
%token num # number token
%token eof # end of file token
# Operator precedence
# Tokens in upper lines has less priority than lower ones
%left '+' '-'
%left '*' '/'
%right $unary # Names preceded with symbol '$' are used for internal tokens, which are declared
# in-place and are not exposed to final analyzer. They are only used as priority
# references. Names $default, $empty, and $error are reserved.
# Actions
%action add # add action ('+')
%action sub # subtract action ('-')
%action mul # multiply action ('*')
%action div # divide action ('/')
%action uminus # unary minus
%% # section separator
# This section describes grammar using Backus–Naur form.
# By default its first production is a starting production with default `initial` start condition.
# If one of productions is needed to be marked as a starting production, then start condition
# name should be specified in triangle braces just after the left part of the production.
# Note, that starting production must be ended with a token.
result : expr [eof] ; # or result<initial> : expr [eof]
expr : '(' expr ')' # character ':' separates left and right production parts
| expr '+' expr {add} # character '|' separates right parts of productions with the same left parts
| expr '-' expr {sub} # actions to be performed can be specified in curly braces in any point of a production
| expr '*' expr {mul} # single quotation marks can be used for single-character tokens
| expr '/' expr {div}
| '-' expr {uminus} %prec $unary # explicit precedence specification is used for this production
| [num] # tokens are specified in square brackets
; # character ';' finishes production
%% # mandatory end of section
Then, in order to generate our syntax analyzer let's issue the following:
$ ./parsegen test.gr
test.gr: info: building analyzer...
test.gr: info: - action table row size: max 6, avg 2
test.gr: info: - goto table row size: max 7, avg 4
test.gr: info: done: 0 shift/reduce, 0 reduce/reduce conflict(s) foundAs the result two files with default names parser_defs.h and parser_analyzer.inl are generated. If it is needed to
specify the names explicitly the following should be issued:
$ ./parsegen test.gr -o <new-analyzer-file-name> --header-file=<new-defs-file-name>File parser_defs.h contains numerical identifiers for tokens, actions, and start conditions (or start analyzer
states). Only one sc_initial start condition is defined for our example.
File parser_analyzer.inl contains necessary tables and parse() function implementation, defined as static. This
function has the following prototype:
static int parse(int tt, int* sptr0, int** p_sptr, int rise_error);where:
tt- look-ahead token identifiersptr0- pointer to the beginning of user-provided state stackp_sptr- pointer to current user-provided state stack pointerrise_error- if is<0then an error is risen with coderise_error, error recovering procedure is initiated so as the analyzer faced a syntax error
returns: action identifier to perform on reduction or
predef_act_shiftto shift look-ahead tokenpredef_act_reduceto reduce without specified action
The analyzer returns the decision result whether to shift next look-ahead token or to reduce some production based on current state stack and current look-ahead token identifier.
The starting state must be on the top of user-provided state stack before the first call of parse(). Current stack
pointer *p_sptr must point to the position after the last state (the first free stack cell).
Before each parse() call one free stack cell must be reserved. The caller is responsible for it.
In case of reduction as well as in case of erroneous reduction (involving $error token) the length of this reduction can be calculated as a difference between stack top pointers before and after the call plus 1.
The calculators's code can be something like this:
#include <cctype>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <memory>
#include <vector>
namespace parser_detail {
#include "parser_defs.h"
#include "parser_analyzer.inl"
} // namespace parser_detail
int lex(const char*& p, double& lval) {
while (std::isspace(static_cast<unsigned char>(*p))) { ++p; }
if (!*p) { return parser_detail::tt_eof; }
if ((*p < '0' || *p > '9') && *p != '.') { return *p++; }
char* str_end{};
lval = std::strtod(p, &str_end);
p = str_end;
return parser_detail::tt_num;
}
int main(int argc, char* argv[]) {
constexpr std::size_t kInitialStackSize = 64;
if (argc < 2) { return -1; }
const char* p = argv[1];
auto state_stack = std::make_unique<int[]>(kInitialStackSize);
int* slast = state_stack.get() + kInitialStackSize;
int* sptr = state_stack.get();
*sptr++ = parser_detail::sc_initial;
double lval;
std::vector<double> eval_stack;
int tt = lex(p, lval);
while (true) {
if (sptr == slast) {
const std::size_t old_stack_size = static_cast<std::ptrdiff_t>(sptr - state_stack.get());
const std::size_t new_stack_size = 2 * old_stack_size;
auto new_state_stack = std::make_unique<int[]>(new_stack_size);
std::memcpy(new_state_stack.get(), state_stack.get(), old_stack_size * sizeof(*sptr));
slast = new_state_stack.get() + new_stack_size;
sptr = new_state_stack.get() + old_stack_size;
state_stack = std::move(new_state_stack);
}
int act = parser_detail::parse(tt, state_stack.get(), &sptr, 0);
if (act < 0) {
std::cout << "syntax error" << std::endl;
return -1;
} else if (act != parser_detail::predef_act_shift) {
switch (act) {
case parser_detail::act_add: {
*(eval_stack.end() - 2) += eval_stack.back();
eval_stack.pop_back();
} break;
case parser_detail::act_sub: {
*(eval_stack.end() - 2) -= eval_stack.back();
eval_stack.pop_back();
} break;
case parser_detail::act_mul: {
*(eval_stack.end() - 2) *= eval_stack.back();
eval_stack.pop_back();
} break;
case parser_detail::act_div: {
if (eval_stack.back() == 0) {
std::cout << "division by zero" << std::endl;
return -1;
}
*(eval_stack.end() - 2) /= eval_stack.back();
eval_stack.pop_back();
} break;
case parser_detail::act_uminus: {
*(eval_stack.end() - 1) = -eval_stack.back();
} break;
default: break;
}
} else if (tt != parser_detail::tt_eof) {
if (tt == parser_detail::tt_num) { eval_stack.push_back(lval); }
tt = lex(p, lval);
} else {
// we can accept the whole expression
break;
}
}
std::cout << "result = " << eval_stack.back() << std::endl;
return 0;
}Build and run result:
$ ./parsegen test.gr
$ gcc example.cpp
$ ./a.out "9*(1+1)"
result = 18$ ./parsegen --help
OVERVIEW: A tool for LALR-grammar based parser generation
USAGE: ./parsegen file [-o <file>] [--header-file=<file>] [-h] [-V]
OPTIONS:
-o, --outfile=<file> Place the output analyzer into <file>.
--header-file=<file> Place the output definitions into <file>.
-h, --help Display this information.
-V, --version Display version.Perform these steps to build the project:
-
Clone
parsegenrepository and enter it$ git clone https://github.com/gbuzykin/parsegen.git $ cd parsegen -
Initialize and update
uxssubmodule$ git submodule update --init
-
Create compilation script using
cmaketool$ cmake --preset default
Default C++ compiler will be used.
-
Build
parsegen$ cmake --build build --config Release
To run several parallel processes (e.g. 8) for building use
-jkey$ cmake --build build --config Release -j 8
-
Install
parsegen$ cmake --install build --config Release --prefix <install-dir>