Parser Specification

NMFU source files support C++-style line comments (text after // on a line is ignored until the end of the line)

Top-Level Constructs

At the top-level, all NMFU parsers consist of a set of output-variables, macro-definitions, hook-definitions and the parser code itself.

Outputs

The output-variables are specified with the output-declaration:

out_decl: "out" out_type IDENTIFIER ";"
        | "out" out_type IDENTIFIER "=" atom ";"

out_type: "bool" -> bool_type
        | "int" ("{" int_attr ("," int_attr)* "}")? -> int_type
        | "enum" "{" IDENTIFIER ("," IDENTIFIER)+ "}" -> enum_type
        | "str" "[" RADIX_NUMBER "]" -> str_type
        | "unterminated" "str" "[" RADIX_NUMBER "]" -> unterm_str_type
        | "raw" "{" IDENTIFIER "}" -> raw_type

int_attr: SIGNED -> signed_attr
        | "size" NUMBER -> width_attr

For example:

out int{unsigned, size 16} content_length = 32;
out bool supports_gzip = false;

// note you can't set default values for strings and enums

out str[32] url;
out enum{GET,POST} method;
out raw{uint64_t} unsigned_out;

Integers can have their signedness and bit size set (defaulting to signed 32-bits). All strings have a defined maximum size, which includes the automatically added null-terminator (if not suppressed with the unterminated option). There is also the raw type, which acts like a string except it's exposed to you as whatever type you provide. No byte-order conversions are performed by NMFU.

Macros

Macros in NMFU are simple parse-tree level replacements. They look like:

macro_decl: "macro" IDENTIFIER macro_args "{" statement* "}"

macro_args: "(" macro_arg ("," macro_arg)* ")"
          | "(" ")" -> macro_arg_empty

macro_arg: "macro"     IDENTIFIER -> macro_macro_arg
         | "out"       IDENTIFIER -> macro_out_arg
         | "match"     IDENTIFIER -> macro_match_expr_arg
         | "expr"      IDENTIFIER -> macro_int_expr_arg
         | "hook"      IDENTIFIER -> macro_hook_arg
         | "loop"      IDENTIFIER -> macro_breaktgt_arg
         | RESULT_CODE IDENTIFIER -> macro_rescode_arg

RESULT_CODE: /yieldcode|finishcode/

For example:

macro ows() { // optional white space
   optional {
      " ";
   }
}

When macros are "called", or instantiated, all NMFU does is copy the contents of the parse tree from the macro declaration to the call-site. Note that although macros can call other macros, they cannot recurse.

Macros can also take arguments, which are similarly treated as parse-tree level replacements, with the added restriction that their types are checked. For example:

macro read_number(out target, match delimit) {
    target = 0;
    foreach {
        /\d+/;
    } do {
        target = [target * 10 + ($last - '0')];
    }

    delimit;
}

There are 6 types of arguments:

  • macro: a reference to another macro
  • hook: a reference to a hook
  • out: a reference to an output-variable
  • match: an arbitrary match-expression
  • expr: an arbitrary integer-expression
  • loop: an arbitrary named loop-statement, for use in break-statements.
  • finishcode/yieldcode: an arbitrary result-code of the given type.

Hooks

Hooks (which are callbacks to user code which the parser can call at certain points) are defined with a hook-declaration:

hook_decl: "hook" IDENTIFIER ";"

For example:

hook got_header;

Custom Result Codes

Both finish-statements and yield-statements take (optionally for the former) a result-code to return. These are declared using the result-code-declaration:

code_decl: RESULT_CODE IDENTIFIER ("," IDENTIFIER)* ";"

RESULT_CODE: /yieldcode|finishcode/

For example:

finishcode TOO_BIG;
yieldcode GOT_HEADER, GOT_BODY;

Parser Declaration

The parser proper is declared with the parser-declaration,

parser_decl: "parser" "{" statement+ "}"

and contains a set of statements which are "executed" in order to parse the input.

Basic Statements

Basic statements are statements which do not have an associated code block, and which end with a semicolon.

simple_stmt: expr -> match_stmt
           | IDENTIFIER "=" expr -> assign_stmt
           | IDENTIFIER "+=" expr -> append_stmt
           | IDENTIFIER "(" (expr ("," expr)*)? ")" -> call_stmt
           | "break" IDENTIFIER? -> break_stmt
           | "delete" IDENTIFIER -> delete_stmt
           | "finish" -> finish_stmt
           | "finish" IDENTIFIER -> custom_finish_stmt
           | "yield" IDENTIFIER -> custom_yield_stmt
           | "wait" expr -> wait_stmt

The most basic form of statement in NMFU is a match-statement, which matches any match-expression (explained in the next section).

The next two statements are the assign-statement and append_statement. The assign-statement parses an integer-expression (which are not limited to just integers, again explained in the next section). and assigns its result into the named output-variable. The append-statement instead appends whatever is matched by the match-expression into the named output-variable which must by a string type. Additionally, if the argument to an append-statement is a math-expression, then the result of evaluating the expression will be treated as a character code and appended to the string.

The delete-statement resets a string or raw-typed output-variable to an empty state.

The call-statement instantiates a macro or calls a hook. Note that there is currently no valid way to pass parameters to a hook, and as such the expressions provided in that case will be ignored. Macro arguments are always parsed as generic expressions and then interpreted according to the type given to them at declaration.

If a hook and macro have the same name, the macro will take priority. Priority is undefined if a macro argument and global hook or macro share a name.

The break-statement is explained along with loops in a later section.

The finish-statement causes the parser to immediately stop and return a DONE status code, which should be interpreted by the calling application as a termination condition. Alternatively, a custom result-code can be provided, which the parser will return instead of DONE.

The yield-statement is similar, except it must be provided with a result-code, and instead of the returned code indicating termination, it instead acts like an OK status code, indicating some application-specific state but that parsing can and should still continue.

The wait-statement spins and consumes input until the match-expression provided matches successfully. Importantly, no event (including end of input!) can stop the wait statement, which makes it useful primarily in error handlers.

It is also important to note that this is not the same as using a regex like /.*someterminator/, as the wait statement does not "try" different starting positions for a string when its match fails. More concretely, something like wait abcdabce would not match abcdabcdabce, as the statement would bail and restart matching from the beginning at the second d.

Expressions

There are three types of expressions in NMFU, match-expressions, integer-expressions and math-expressions.

A match-expression is anything that can consume input to the parser and check it:

?expr: atom // string match
     | regex // not an atom to simplify things
     | binary_regex
     | "end" -> end_expr
     | "(" expr+ ")" -> concat_expr

atom: STRING "i" -> string_case_const
    | STRING "b" -> binary_string_const
    | STRING -> string_const

The simplest form of match-expression is the direct-match, which matches a literal string. It can optionally match with case insensitivity by suffixing the literal string with an "i", and can match binary data with the "b" suffix -- see the section on binary matches.

The end-match-expression is a match expression which only matches the end of input.

The concat-expression matches any number of match-expressions in order.

The regex-expression and binary-regex-expression matches a subset of regexes. The quirks of the regex dialect NMFU uses can be found in a later section.

Additionally, the name of a macro argument of type match can replace any match-expression, including the sub-expressions inside concat-expressions.

An integer-expression is anything that can be directly assigned to an output variable, including strings:

?expr: atom 
     | "[" sum_expr "]"

atom: BOOL_CONST -> bool_const
    | RADIX_NUMBER -> number_const
    | CHAR_CONSTANT -> char_const
    | STRING -> string_const
    | IDENTIFIER -> enum_const

RADIX_NUMBER: HEX_NUMBER | BIN_NUMBER | NUMBER

The only two kinds of integer-expressions are literal-expressions, which are just literal strings, integers (with support for binary and hexadecimal modes), character constants (which behave as they do in C, just becoming integers), enumeration constants (which are resolved based on the context and which output-variable is being assigned) and booleans ("true"/"false"); and math-expressions, which are surrounded in square brackets:

_math_expr: disjunction_expr

?disjunction_expr: conjunction_expr ("||" conjunction_expr)*
?conjunction_expr: bit_or_expr ("&&" bit_or_expr)*
?bit_or_expr: bit_xor_expr ("|" bit_xor_expr)*
?bit_xor_expr: bit_and_expr ("^" bit_and_expr)*
?bit_and_expr: comp_expr ("&" comp_expr)*
?comp_expr: shift_expr (CMP_OP shift_expr)?
// nmfu does not allow (1 << 2 << 3) because that is dumb
?shift_expr: sum_expr (SHIFT_OP sum_expr)?
?sum_expr: mul_expr (SUM_OP mul_expr)*
?mul_expr: math_unary (MUL_OP math_unary)*
?math_unary: math_atom
           | "!" math_atom -> not_expr
           | "-" math_atom -> negate_expr
?math_atom: RADIX_NUMBER -> math_num
          | IDENTIFIER -> math_var
          | IDENTIFIER ".len" -> math_str_len
          | IDENTIFIER "[" _math_expr "]" -> math_str_index
          | CHAR_CONSTANT -> math_char_const
          | BOOL_CONST -> bool_const
          | "$" IDENTIFIER -> builtin_math_var
          | "(" _math_expr ")"

SUM_OP: /[+-]/
MUL_OP: /[*\/%]/
CMP_OP: /[!=]=/ | /[<>]=?/
CHAR_CONSTANT: /'[^'\\]'/ | /'\\.'/
SHIFT_OP: "<<" | ">>" 

(named _math_expr in the grammar)

math-expressions support most of the expressions in C, including bit manipulation and comparisons. They operate on a few atomic values:

  • Literals (as above)
  • References to the current value of output-variables
  • The currents size of a string/raw-typed output-variable
  • Specific byte positions inside string/raw-typed output-variables (which by default are runtime range-checked)
  • Or, a builtin-math-variable:

The current list of builtin-math-variables is:

Name Meaning
last The codepoint of the last matched character. Useful for interpreting numbers in input streams.

For example:

content_length = [content_length * 10 + ($last - '0')];
into = [into | (($last & 127) << (7 * varint_counter))];
[!($last == ' ' || $last == '\t') || advisory.len > 0];

Additionally, the name of a macro argument of type expr can replace any integer-expression. Priority versus output-variable names is undefined.

Warning

The precise semantics of "last matched" character for the purposes of $last are somewhat vague. It is only guaranteed to be well-defined in the following situations:

  • Immediately after (and in the same block) as a match-statement, in which case it is the last-matched character of the match-expression.
  • Inside a case-statement clause which itself cannot match characters, in which case it is the last-matched character of the predicate expression.
  • During a foreach-statement, where it takes on precisely the values matched by the contents of the first block.

In other places it may bind to the first character of the next match expression depending on the surrounding context.

Block Statements

Block statements are statements which contain a block of other statements:

block_stmt: "loop" IDENTIFIER? "{" statement+ "}" -> loop_stmt
          | "case" "{" case_clause+ "}" -> case_stmt
          | "greedy" "case" "{" greedy_prio_block+ "}" -> greedy_case_stmt
          | "optional" "{" statement+ "}" -> optional_stmt
          | "try" "{" statement+ "}" catch_block -> try_stmt
          | "foreach" "{" statement+ "}" "do" "{" foreach_actions "}" -> foreach_stmt
          | "if" if_condition ("elif" if_condition)* else_condition? -> if_stmt

catch_block: "catch" catch_options? "{" statement* "}"

?greedy_prio_block: "prio" NUMBER case_clause
                  | "prio" NUMBER "{" case_clause+ "}"
                  | case_clause

case_clause: case_predicate ("," case_predicate)* "->" "{" statement* "}"
case_predicate: "else" -> else_predicate
              | expr -> expr_predicate

catch_options: "(" CATCH_OPTION ("," CATCH_OPTION)* ")" 

if_condition: _math_expr "{" statement+ "}"
else_condition: "else" "{" statement+ "}"

CATCH_OPTION: /nomatch|outofspace/

loop-statements repeat their block forever until broken out of using a break-statement. loop-statements can optionally have names, which can be referred to in the break-statement to break out of nested loops. If a bare break-statement is encountered, the innermost loop is broken from.

optional-statements either execute their contents if the first match within them matches, otherwise do nothing. It is an error to have anything that does not match as the first statement in an optional-statement.

case-statements match one or more match-expressions and upon successful matching of one of those expressions execute a given block.

For example:

case {
   "GET" -> {method = GET;}
   "POST" -> {method = POST;}
   "PUT","PATCH" -> {wait "\r\n"; result=INVALID_METHOD; finish;}
   else, "OPTIONS" -> {}
}

The special constant "else" means anything not matched by any of the other clauses.

greedy-case-statements are a special version of case-statements. Instead of disallowing all ambiguity, greedy-case-statements follow two rules:

  • In a conflict where one match could end but another could continue, the finishing match wins
  • In a conflict where two or more matches could end on the same character, the one with the greatest priority wins. If there is no match with highest priority, an error is raised.

Priority is assigned with the greedy-priority-declaration; if unspecified, priority defaults to 0.

For example:

greedy case {
    /\s+/ -> {yield _RESET;}
    "(" -> {yield LP;}
    ")" -> {yield RP;}
    /\w*[a-zA-Z_]\w*/ -> {yield SYMBOL;}
    /-?\d+/ -> {yield INTEGER;}

    prio 1 {
        // higher priority over "SYMBOL"
        "define" -> {yield DEFINE;}
        "display" -> {yield DISPLAY;}
    }

    prio 2 "critical" -> {} // highest priority
}

try-except-statements can redirect certain error conditions to the beginning of a different block. They follow the general structure of

try {
   // ... some set of statements ...
}
catch {
   // ... an error handler ...
}

The specific error conditions which they match can be limited by placing a parenthesized comma-separated list of catch-options after the "catch" token, like

try {
   url = /\w+/;
}
catch (outofspace) {
   // ... something to deal with this ...
}

foreach-statements allow you to run various statements for every character read by some other set of statements. For example:

number = 0;
foreach {
   /\d+/;
} do {
   number = [number * 10 + ($last - '0')];
}

which will read a number in base 10 and store it into a variable. It accomplishes this by executing number = [number * 10 + ($last - '0')]; for each digit. Only statements which do not consume any input themselves are allowed in the do section of a foreach-statement.

if-statements allow you to execute statements conditional on things not necessarily directly related to the input. Conditions are math-expressions without the parentheses that must evaluate to either a boolean or an integer (in which case anything other than 0 is truthy). The specific value of $last when evaluating the condition is undefined in all cases except where the only statements in the body of the conditionals do not consume input, although this restriction may be relaxed in future versions.

For example:

parser {
    loop {
        /asd\w/;
        if $last == 'f' {
            break;
        }
    }
    "end";
}

out int{unsigned, size 2} number;

parser {
    foreach {
        /\d+/;
    } do {
        number = [number * 10 + ($last - 48)];
    }
    if number > 100 {
       "additional matches";
    }
    elif number < 4 {
        finish;
    }
    else {
        number = 0;
    }
    wait "\r\n";
}

NMFU Regexes

The regex grammar in NMFU is

regex: "/" regex_alternation "/"

regex_char_class: "\\" REGEX_CHARCLASS

?regex_alternation: regex_group ("|" regex_group)*

?regex_group: regex_alternation_element+

?regex_alternation_element: regex_literal
                          | regex_literal REGEX_OP -> regex_operation
                          | regex_literal "{" NUMBER "}" -> regex_exact_repeat
                          | regex_literal "{" NUMBER "," NUMBER "}" -> regex_range_repeat
                          | regex_literal "{" NUMBER "," "}" -> regex_at_least_repeat

?regex_literal: REGEX_UNIMPORTANT -> regex_raw_match // creates multiple char classes
              | regex_char_class // creates a char class
              | "(" regex_alternation ")" -> regex_group
              | "[^" regex_set_element+ "]" -> regex_inverted_set // creates an inverted char class
              | "[" regex_set_element+ "]" -> regex_set // creates a char class
              | "." -> regex_any // creates an inverted char class

?regex_set_element: REGEX_CHARGROUP_ELEMENT_RAW
                  | REGEX_CHARGROUP_ELEMENT_RAW "-" REGEX_CHARGROUP_ELEMENT_RAW -> regex_set_range
                  | regex_char_class

REGEX_UNIMPORTANT: /[^.?*()\[\]\\+{}|\/]|\\\.|\\\*|\\\(|\\\)|\\\[|\\\]|\\\+|\\\\|\\\{|\\\}|\\\||\\\//
REGEX_OP: /[+*?]/
REGEX_CHARGROUP_ELEMENT_RAW: /[^\-\]\\\/]|\\-|\\\]|\\\\|\\\//
REGEX_CHARCLASS: /[wWdDsSntr ]/

NMFU regexes support the following common features:

  • matching characters
  • matching character classes (\w, \S, \d)
  • matching character sets ([ab], [a-z\w], [^b])
  • the plus, star and question operators
  • the wildcard dot
  • groups
  • alternation (also called OR) (abc|def)
  • repetition

There are some key differences though:

  • There is no direct way to match the end of input, even with the normal regex syntax $ (it just matches the literal $)
  • The wildcard dot matches every single input except end
  • Groups are non-capturing, in the sense that they serve no function other than to logically group components together
  • The space character must be escaped, due to limitations in the lexer.

Binary Matches

Most of NMFU's core matches support a binary mode, where characters can be specified as hexadecimal bytes instead of having to use \xHH escapes.

For direct-matches, this is accomplished with the "b" suffix:

parser {
   "11 22 05"b;
}

For binary-regex-expressions, use instead the "b" prefix. Binary regexes don't support character classes, but still support most other features. For example:

parser {
   b/00 [10-15]+|(44 56? 12)/;
}