Parser Specification
NMFU source files support C++-style line comments (text after //
on a line is ignored until the end of the line)
Top-Level Constructs
At the top-level, all NMFU parsers consist of a set of output-variables, macro-definitions, hook-definitions and the parser code itself.
Outputs
The output-variables are specified with the output-declaration:
out_decl: "out" out_type IDENTIFIER ";"
| "out" out_type IDENTIFIER "=" atom ";"
out_type: "bool" -> bool_type
| "int" ("{" int_attr ("," int_attr)* "}")? -> int_type
| "enum" "{" IDENTIFIER ("," IDENTIFIER)+ "}" -> enum_type
| "str" "[" RADIX_NUMBER "]" -> str_type
| "unterminated" "str" "[" RADIX_NUMBER "]" -> unterm_str_type
| "raw" "{" IDENTIFIER "}" -> raw_type
int_attr: SIGNED -> signed_attr
| "size" NUMBER -> width_attr
For example:
out int{unsigned, size 16} content_length = 32;
out bool supports_gzip = false;
// note you can't set default values for strings and enums
out str[32] url;
out enum{GET,POST} method;
out raw{uint64_t} unsigned_out;
Integers can have their signedness and bit size set (defaulting to signed 32-bits). All strings have a defined maximum size, which includes the automatically added null-terminator (if not suppressed with the unterminated
option).
There is also the raw
type, which acts like a string except it's exposed to you as whatever type you provide. No byte-order conversions are performed by NMFU.
Macros
Macros in NMFU are simple parse-tree level replacements. They look like:
macro_decl: "macro" IDENTIFIER macro_args "{" statement* "}"
macro_args: "(" macro_arg ("," macro_arg)* ")"
| "(" ")" -> macro_arg_empty
macro_arg: "macro" IDENTIFIER -> macro_macro_arg
| "out" IDENTIFIER -> macro_out_arg
| "match" IDENTIFIER -> macro_match_expr_arg
| "expr" IDENTIFIER -> macro_int_expr_arg
| "hook" IDENTIFIER -> macro_hook_arg
| "loop" IDENTIFIER -> macro_breaktgt_arg
| RESULT_CODE IDENTIFIER -> macro_rescode_arg
RESULT_CODE: /yieldcode|finishcode/
For example:
When macros are "called", or instantiated, all NMFU does is copy the contents of the parse tree from the macro declaration to the call-site. Note that although macros can call other macros, they cannot recurse.
Macros can also take arguments, which are similarly treated as parse-tree level replacements, with the added restriction that their types are checked. For example:
macro read_number(out target, match delimit) {
target = 0;
foreach {
/\d+/;
} do {
target = [target * 10 + ($last - '0')];
}
delimit;
}
There are 6 types of arguments:
macro
: a reference to another macrohook
: a reference to a hookout
: a reference to an output-variablematch
: an arbitrary match-expressionexpr
: an arbitrary integer-expressionloop
: an arbitrary named loop-statement, for use in break-statements.finishcode
/yieldcode
: an arbitrary result-code of the given type.
Hooks
Hooks (which are callbacks to user code which the parser can call at certain points) are defined with a hook-declaration:
For example:
Custom Result Codes
Both finish-statements and yield-statements take (optionally for the former) a result-code to return. These are declared using the result-code-declaration:
For example:
Parser Declaration
The parser proper is declared with the parser-declaration,
and contains a set of statements which are "executed" in order to parse the input.
Basic Statements
Basic statements are statements which do not have an associated code block, and which end with a semicolon.
simple_stmt: expr -> match_stmt
| IDENTIFIER "=" expr -> assign_stmt
| IDENTIFIER "+=" expr -> append_stmt
| IDENTIFIER "(" (expr ("," expr)*)? ")" -> call_stmt
| "break" IDENTIFIER? -> break_stmt
| "delete" IDENTIFIER -> delete_stmt
| "finish" -> finish_stmt
| "finish" IDENTIFIER -> custom_finish_stmt
| "yield" IDENTIFIER -> custom_yield_stmt
| "wait" expr -> wait_stmt
The most basic form of statement in NMFU is a match-statement, which matches any match-expression (explained in the next section).
The next two statements are the assign-statement and append_statement. The assign-statement parses an integer-expression (which are not limited to just integers, again explained in the next section). and assigns its result into the named output-variable. The append-statement instead appends whatever is matched by the match-expression into the named output-variable which must by a string type. Additionally, if the argument to an append-statement is a math-expression, then the result of evaluating the expression will be treated as a character code and appended to the string.
The delete-statement resets a string or raw-typed output-variable to an empty state.
The call-statement instantiates a macro or calls a hook. Note that there is currently no valid way to pass parameters to a hook, and as such the expressions provided in that case will be ignored. Macro arguments are always parsed as generic expressions and then interpreted according to the type given to them at declaration.
If a hook and macro have the same name, the macro will take priority. Priority is undefined if a macro argument and global hook or macro share a name.
The break-statement is explained along with loops in a later section.
The finish-statement causes the parser to immediately stop and return a DONE
status code, which should be interpreted by the calling application as a termination condition. Alternatively,
a custom result-code can be provided, which the parser will return instead of DONE
.
The yield-statement is similar, except it must be provided with a result-code, and instead of the returned code indicating termination, it instead acts like an OK
status code,
indicating some application-specific state but that parsing can and should still continue.
The wait-statement spins and consumes input until the match-expression provided matches successfully. Importantly, no event (including end of input!) can stop the wait statement, which makes it useful primarily in error handlers.
It is also important to note that this is not the same as using a regex like /.*someterminator/
, as
the wait statement does not "try" different starting positions for a string when its match fails. More concretely, something like wait abcdabce
would not match abcdabcdabce
, as
the statement would bail and restart matching from the beginning at the second d
.
Expressions
There are three types of expressions in NMFU, match-expressions, integer-expressions and math-expressions.
A match-expression is anything that can consume input to the parser and check it:
?expr: atom // string match
| regex // not an atom to simplify things
| binary_regex
| "end" -> end_expr
| "(" expr+ ")" -> concat_expr
atom: STRING "i" -> string_case_const
| STRING "b" -> binary_string_const
| STRING -> string_const
The simplest form of match-expression is the direct-match, which matches a literal string. It can optionally match with case insensitivity by suffixing the literal string with an "i", and can match binary data with the "b" suffix -- see the section on binary matches.
The end-match-expression is a match expression which only matches the end of input.
The concat-expression matches any number of match-expressions in order.
The regex-expression and binary-regex-expression matches a subset of regexes. The quirks of the regex dialect NMFU uses can be found in a later section.
Additionally, the name of a macro argument of type match
can replace any match-expression, including the sub-expressions inside concat-expressions.
An integer-expression is anything that can be directly assigned to an output variable, including strings:
?expr: atom
| "[" sum_expr "]"
atom: BOOL_CONST -> bool_const
| RADIX_NUMBER -> number_const
| CHAR_CONSTANT -> char_const
| STRING -> string_const
| IDENTIFIER -> enum_const
RADIX_NUMBER: HEX_NUMBER | BIN_NUMBER | NUMBER
The only two kinds of integer-expressions are literal-expressions, which are just literal strings, integers (with support for binary and hexadecimal modes), character constants (which behave as they do in C, just becoming integers), enumeration constants (which are resolved based on the context and which output-variable is being assigned) and booleans ("true"/"false"); and math-expressions, which are surrounded in square brackets:
_math_expr: disjunction_expr
?disjunction_expr: conjunction_expr ("||" conjunction_expr)*
?conjunction_expr: bit_or_expr ("&&" bit_or_expr)*
?bit_or_expr: bit_xor_expr ("|" bit_xor_expr)*
?bit_xor_expr: bit_and_expr ("^" bit_and_expr)*
?bit_and_expr: comp_expr ("&" comp_expr)*
?comp_expr: shift_expr (CMP_OP shift_expr)?
// nmfu does not allow (1 << 2 << 3) because that is dumb
?shift_expr: sum_expr (SHIFT_OP sum_expr)?
?sum_expr: mul_expr (SUM_OP mul_expr)*
?mul_expr: math_unary (MUL_OP math_unary)*
?math_unary: math_atom
| "!" math_atom -> not_expr
| "-" math_atom -> negate_expr
?math_atom: RADIX_NUMBER -> math_num
| IDENTIFIER -> math_var
| IDENTIFIER ".len" -> math_str_len
| IDENTIFIER "[" _math_expr "]" -> math_str_index
| CHAR_CONSTANT -> math_char_const
| BOOL_CONST -> bool_const
| "$" IDENTIFIER -> builtin_math_var
| "(" _math_expr ")"
SUM_OP: /[+-]/
MUL_OP: /[*\/%]/
CMP_OP: /[!=]=/ | /[<>]=?/
CHAR_CONSTANT: /'[^'\\]'/ | /'\\.'/
SHIFT_OP: "<<" | ">>"
(named _math_expr
in the grammar)
math-expressions support most of the expressions in C, including bit manipulation and comparisons. They operate on a few atomic values:
- Literals (as above)
- References to the current value of output-variables
- The currents size of a string/raw-typed output-variable
- Specific byte positions inside string/raw-typed output-variables (which by default are runtime range-checked)
- Or, a builtin-math-variable:
The current list of builtin-math-variables is:
Name | Meaning |
---|---|
last |
The codepoint of the last matched character. Useful for interpreting numbers in input streams. |
For example:
content_length = [content_length * 10 + ($last - '0')];
into = [into | (($last & 127) << (7 * varint_counter))];
[!($last == ' ' || $last == '\t') || advisory.len > 0];
Additionally, the name of a macro argument of type expr
can replace any integer-expression. Priority versus output-variable names is undefined.
Warning
The precise semantics of "last matched" character for the purposes of $last
are somewhat vague. It is only guaranteed to be well-defined in the following
situations:
- Immediately after (and in the same block) as a match-statement, in which case it is the last-matched character of the match-expression.
- Inside a case-statement clause which itself cannot match characters, in which case it is the last-matched character of the predicate expression.
- During a foreach-statement, where it takes on precisely the values matched by the contents of the first block.
In other places it may bind to the first character of the next match expression depending on the surrounding context.
Block Statements
Block statements are statements which contain a block of other statements:
block_stmt: "loop" IDENTIFIER? "{" statement+ "}" -> loop_stmt
| "case" "{" case_clause+ "}" -> case_stmt
| "greedy" "case" "{" greedy_prio_block+ "}" -> greedy_case_stmt
| "optional" "{" statement+ "}" -> optional_stmt
| "try" "{" statement+ "}" catch_block -> try_stmt
| "foreach" "{" statement+ "}" "do" "{" foreach_actions "}" -> foreach_stmt
| "if" if_condition ("elif" if_condition)* else_condition? -> if_stmt
catch_block: "catch" catch_options? "{" statement* "}"
?greedy_prio_block: "prio" NUMBER case_clause
| "prio" NUMBER "{" case_clause+ "}"
| case_clause
case_clause: case_predicate ("," case_predicate)* "->" "{" statement* "}"
case_predicate: "else" -> else_predicate
| expr -> expr_predicate
catch_options: "(" CATCH_OPTION ("," CATCH_OPTION)* ")"
if_condition: _math_expr "{" statement+ "}"
else_condition: "else" "{" statement+ "}"
CATCH_OPTION: /nomatch|outofspace/
loop-statements repeat their block forever until broken out of using a break-statement. loop-statements can optionally have names, which can be referred to in the break-statement to break out of nested loops. If a bare break-statement is encountered, the innermost loop is broken from.
optional-statements either execute their contents if the first match within them matches, otherwise do nothing. It is an error to have anything that does not match as the first statement in an optional-statement.
case-statements match one or more match-expressions and upon successful matching of one of those expressions execute a given block.
For example:
case {
"GET" -> {method = GET;}
"POST" -> {method = POST;}
"PUT","PATCH" -> {wait "\r\n"; result=INVALID_METHOD; finish;}
else, "OPTIONS" -> {}
}
The special constant "else" means anything not matched by any of the other clauses.
greedy-case-statements are a special version of case-statements. Instead of disallowing all ambiguity, greedy-case-statements follow two rules:
- In a conflict where one match could end but another could continue, the finishing match wins
- In a conflict where two or more matches could end on the same character, the one with the greatest priority wins. If there is no match with highest priority, an error is raised.
Priority is assigned with the greedy-priority-declaration; if unspecified, priority defaults to 0.
For example:
greedy case {
/\s+/ -> {yield _RESET;}
"(" -> {yield LP;}
")" -> {yield RP;}
/\w*[a-zA-Z_]\w*/ -> {yield SYMBOL;}
/-?\d+/ -> {yield INTEGER;}
prio 1 {
// higher priority over "SYMBOL"
"define" -> {yield DEFINE;}
"display" -> {yield DISPLAY;}
}
prio 2 "critical" -> {} // highest priority
}
try-except-statements can redirect certain error conditions to the beginning of a different block. They follow the general structure of
The specific error conditions which they match can be limited by placing a parenthesized comma-separated list of catch-options after the "catch" token, like
foreach-statements allow you to run various statements for every character read by some other set of statements. For example:
which will read a number in base 10 and store it into a variable. It accomplishes this by executing number = [number * 10 + ($last - '0')];
for each digit. Only
statements which do not consume any input themselves are allowed in the do
section of a foreach-statement.
if-statements allow you to execute statements conditional on things not necessarily directly related to the input. Conditions are math-expressions without
the parentheses that must evaluate to either a boolean or an integer (in which case anything other than 0 is truthy). The specific value of $last
when
evaluating the condition is undefined in all cases except where the only statements in the body of the conditionals do not consume input, although this
restriction may be relaxed in future versions.
For example:
parser {
loop {
/asd\w/;
if $last == 'f' {
break;
}
}
"end";
}
out int{unsigned, size 2} number;
parser {
foreach {
/\d+/;
} do {
number = [number * 10 + ($last - 48)];
}
if number > 100 {
"additional matches";
}
elif number < 4 {
finish;
}
else {
number = 0;
}
wait "\r\n";
}
NMFU Regexes
The regex grammar in NMFU is
regex: "/" regex_alternation "/"
regex_char_class: "\\" REGEX_CHARCLASS
?regex_alternation: regex_group ("|" regex_group)*
?regex_group: regex_alternation_element+
?regex_alternation_element: regex_literal
| regex_literal REGEX_OP -> regex_operation
| regex_literal "{" NUMBER "}" -> regex_exact_repeat
| regex_literal "{" NUMBER "," NUMBER "}" -> regex_range_repeat
| regex_literal "{" NUMBER "," "}" -> regex_at_least_repeat
?regex_literal: REGEX_UNIMPORTANT -> regex_raw_match // creates multiple char classes
| regex_char_class // creates a char class
| "(" regex_alternation ")" -> regex_group
| "[^" regex_set_element+ "]" -> regex_inverted_set // creates an inverted char class
| "[" regex_set_element+ "]" -> regex_set // creates a char class
| "." -> regex_any // creates an inverted char class
?regex_set_element: REGEX_CHARGROUP_ELEMENT_RAW
| REGEX_CHARGROUP_ELEMENT_RAW "-" REGEX_CHARGROUP_ELEMENT_RAW -> regex_set_range
| regex_char_class
REGEX_UNIMPORTANT: /[^.?*()\[\]\\+{}|\/]|\\\.|\\\*|\\\(|\\\)|\\\[|\\\]|\\\+|\\\\|\\\{|\\\}|\\\||\\\//
REGEX_OP: /[+*?]/
REGEX_CHARGROUP_ELEMENT_RAW: /[^\-\]\\\/]|\\-|\\\]|\\\\|\\\//
REGEX_CHARCLASS: /[wWdDsSntr ]/
NMFU regexes support the following common features:
- matching characters
- matching character classes (
\w
,\S
,\d
) - matching character sets (
[ab]
,[a-z\w]
,[^b]
) - the plus, star and question operators
- the wildcard dot
- groups
- alternation (also called OR) (
abc|def
) - repetition
There are some key differences though:
- There is no direct way to match the end of input, even with the normal regex syntax
$
(it just matches the literal$
) - The wildcard dot matches every single input except end
- Groups are non-capturing, in the sense that they serve no function other than to logically group components together
- The space character must be escaped, due to limitations in the lexer.
Binary Matches
Most of NMFU's core matches support a binary mode, where characters can be specified as hexadecimal bytes instead of having to
use \xHH
escapes.
For direct-matches, this is accomplished with the "b" suffix:
For binary-regex-expressions, use instead the "b" prefix. Binary regexes don't support character classes, but still support most other features. For example: