29 Jan 2024
Over the course of about four days, I wrote Frostwalker, a configuration language parser similar to TOML, in Rust for my web server herb. I wanted the experience of writing a parser stack, and using it in herb also lowers the number of dependencies (the toml crate requires 8 dependencies at build time).
Frostwalker's main library file contains the public data types used by all components of the parser stack, a wrapper function around the components in the stack, and public module exports for the three parser stack modules: the lexer, the validator, and the formatter.
At a high level, the shared data types are mainly for passing lexeme data around and are the Class enum and Token struct. These are in the main library purely so that the validator and the formatter aren't dependent on the lexer. This was an intentional design decision so that the Frostwalker validator and formatter can be used for other languages if someone else wants to write a new lexer for another language.
The Class enum varied through development as I decided how to properly tokenise the subset of TOML features I wanted implemented: the BOOLEAN class used to be two separate classes, TRUE and FALSE, and the LITERAL class also used to be two separate classes, STRING and INTEGER.
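For illustration, here is a rough sketch of what those shared types might look like. The field and variant names below are my own shorthand (the real enum presumably uses the uppercase class names mentioned throughout this post), so treat it as a sketch rather than the published API.

```rust
// Hypothetical sketch of the shared lexeme types; names and fields are
// illustrative, not Frostwalker's exact definitions.
#[derive(Debug, Clone, PartialEq)]
pub enum Class {
    Identifier, // a key name, e.g. `port`
    Equals,     // the `=` sign
    Literal,    // strings and integers (formerly STRING and INTEGER)
    Boolean,    // `true` or `false` (formerly TRUE and FALSE)
    Separator,  // `[`, `]`, and `,` in arrays
    NewLine,    // the end of a line
    Unknown,    // anything the lexer could not classify
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub class: Class,
    pub value: Option<String>, // e.g. the identifier name or literal text
}
```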
The task of the lexer (also known as a tokeniser or lexical analyser) is to take the source file as an input, break it down into chunks, and build a sequence of tokens, herein referred to as a token tree, which can be used by further stages of a parser or compiler to turn the human-readable source into a usable, machine-friendly format. In the case of a configuration language, this would be an associative array like a hash map or dictionary.
Most lexers nowadays are generated by lex-family tools due to the sheer number of lexemes featured in modern languages, which makes it impractical to write a lexer for more complicated languages by hand, but the subset of TOML I was aiming for didn't need many lexemes, so I chose to write mine by hand.
The core loop of the lexer runs over each word in the source file, after splitting the source file by lines and then by words, and applies rules to the words.
Broken down, the rules are, in order of precedence (a simplified sketch of the loop that applies them follows the list):
If the current word ends with a comma and the previous word is an open square bracket or a comma, cut the comma off the end of the word and push it into the words list at the next index.
If the current word ends with a hashtag, isn't just a hashtag, and doesn't start with a quotation mark, cut the hashtag off the end of the word and push it into the words list at the next index.
If the current word is a comma, push a comma separator token into the token tree and move to the next word.
If the current word, after removing all spaces, is empty, move to the next word.
If the current word is an open or close square bracket, push an equivalent separator token into the token tree and move to the next word.
If the current word is equal to true or false, push an equivalent boolean token into the token tree and move to the next word.
If the current word can be converted into a 32 bit signed integer, push an equivalent literal token into the token tree and move to the next word.
If the current word is an equals sign, push an equals token into the token tree and move to the next word.
If the current word starts with a quotation mark, enter the string handling loop.
If the current word starts with a hashtag, break the loop and move to the next line.
If the number of words remaining after the current word is less than 2, and the previous token is a newline, push an identifier token to the token tree with the value of the word and move to the next word.
If no other rule applies that moves to the next word, add an unknown lexeme token to the token tree, which will be caught in the validator.
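To make the shape of this loop concrete, here is a heavily simplified sketch built on the Token and Class sketch from earlier. It covers only the comment, equals, boolean, integer, and identifier rules and skips the array and string handling entirely, so it illustrates the structure rather than Frostwalker's actual lexer.

```rust
// Heavily simplified sketch of the word-by-word lexing loop; the real lexer
// also handles arrays, trailing comments and commas, and quoted strings.
fn lex(source: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    for line in source.lines() {
        for word in line.split_whitespace() {
            if word.starts_with('#') {
                break; // the rest of the line is a comment
            } else if word == "=" {
                tokens.push(Token { class: Class::Equals, value: None });
            } else if word == "true" || word == "false" {
                tokens.push(Token { class: Class::Boolean, value: Some(word.to_string()) });
            } else if word.parse::<i32>().is_ok() {
                tokens.push(Token { class: Class::Literal, value: Some(word.to_string()) });
            } else if tokens.last().map_or(true, |t| t.class == Class::NewLine) {
                // Simplification of the identifier rule described above.
                tokens.push(Token { class: Class::Identifier, value: Some(word.to_string()) });
            } else {
                tokens.push(Token { class: Class::Unknown, value: Some(word.to_string()) });
            }
        }
        tokens.push(Token { class: Class::NewLine, value: None });
    }
    tokens
}
```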
As you can see, the number of rules used to tokenise the language is fairly small (the lexer itself being only 165 lines long), reflecting the complexity of the TOML subset in comparison to something like a functional programming language, which may require many more rules to cover floating points, functions, and definitions amongst other features.
There was one thing I left out of the explanation and that was the string handling loop. This is arguably the most complex part of the lexer as it involves watching for the end of the string and deciding where that is.
The loop starts by detecting whether the current word ends with double quotes, signifying that the string spans only the current word; if so, it drops the quotes and adds the word as a literal token to the tree before moving on to the next word. If not, it uses "format!()" to build up a string variable within the lexer, first dropping the opening quote mark from the starting word and adding it to the built string. It then iterates over each following word, appending it to the built string until it finds a word ending with a quotation mark, at which point it drops that quotation mark and finalises the built string. The built string is then added to the token tree.
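Under the assumptions of the earlier sketches, that string handling loop might look something like the following; the details, and the function's name and signature, are mine rather than Frostwalker's.

```rust
// Simplified sketch of the string handling loop: `word` is the word that
// opened the string with a quotation mark, and `rest` yields the remaining
// words on the line.
fn lex_string<'a>(word: &str, rest: &mut impl Iterator<Item = &'a str>) -> String {
    // The string fits inside a single word: drop both quotation marks.
    if word.len() > 1 && word.ends_with('"') {
        return word[1..word.len() - 1].to_string();
    }
    // Otherwise drop the opening quote and keep appending words until one
    // ends with the closing quote.
    let mut built = word[1..].to_string();
    for next in rest {
        if next.ends_with('"') {
            built = format!("{} {}", built, &next[..next.len() - 1]);
            break;
        }
        built = format!("{} {}", built, next);
    }
    built
}
```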
All in all, the lexer is fairly stable and produces an accurate representation of the source configuration file. The token tree output by the lexer could easily be read to rebuild the original source, sans comments.
The validator's task is to step through the token tree and determine, following a set of rules, whether it is valid. This serves as a sanity check not only on the input, but also on the lexer: any error in the input or the lexer is treated the same, and the validator will kick out an error.
The validator relies on an integrated enum called ExpectedClass, which contains copies of most of the classes, with the BOOLEAN class folded into the generic LITERAL expected class, and a new expected class called LITERALORSEPARATOR, used after an equals sign where the value could be either a single literal or an array.
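Based on that description, ExpectedClass presumably looks something along these lines; the variant names are my guess, mirroring the Class sketch above.

```rust
// Hypothetical sketch of the validator's internal expectation enum.
enum ExpectedClass {
    Identifier,         // a key name at the start of a line
    Equals,             // the `=` after a key
    Literal,            // a literal or boolean value (BOOLEAN folds in here)
    LiteralOrSeparator, // after `=`: either a single value or an opening `[`
    Separator,          // a `,` or `]` inside an array
    NewLine,            // the end of a key/value pair
}
```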
The validator's rules are as follows, in order of significance (a condensed sketch of the resulting state machine follows the list):
If the current token is a new line, and it is the start of the token tree, move onto the next token.
If the current and expected token is a new line, expect an identifier and move to the next token.
If the current and expected token is an identifier and there are more tokens in the tree, expect an equals and move to the next token.
If the current and expected token is an equals and there are more tokens in the tree, expect a literal or separator and move to the next token.
If the current token is a literal or boolean and the expected token is a literal or separator, expect a new line and move to the next token.
If the current and expected token is a literal or boolean, expect a separator and move to the next token.
If the current token is an open square bracket separator, the expected token is a literal or separator, and there are more tokens in the tree, expect a literal and move to the next token.
If the current and expected token is a separator, expect a new line if the separator is a closed square bracket, or expect a literal if the separator is a comma, and move to the next token.
If none of the above rules are met, the token is unknown or an error has occurred.
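Condensed down to the non-array rules, and reusing the sketched types from earlier, the validator is essentially a small state machine like this; again, a sketch under my own assumptions, not the real code.

```rust
// Condensed sketch of the validator's state machine; the array rules
// (separators, literals between commas) are omitted for brevity.
fn validate(tokens: &[Token]) -> Result<(), String> {
    let mut expected = ExpectedClass::Identifier;
    for token in tokens {
        expected = match (&token.class, expected) {
            (Class::NewLine, ExpectedClass::NewLine | ExpectedClass::Identifier) => ExpectedClass::Identifier,
            (Class::Identifier, ExpectedClass::Identifier) => ExpectedClass::Equals,
            (Class::Equals, ExpectedClass::Equals) => ExpectedClass::LiteralOrSeparator,
            (Class::Literal | Class::Boolean, ExpectedClass::LiteralOrSeparator) => ExpectedClass::NewLine,
            _ => return Err(format!("unexpected token: {:?}", token)),
        };
    }
    Ok(())
}
```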
The validator is very simple and straightforward and can detect syntax errors like using the wrong separators in the wrong places, leaving keys undefined, and unknown tokens. One use for the validator outside of the current application could be sanity checking a newly written lexer against the expected outcome, provided the new lexer produces the same token classes as the current one, amongst others.
The formatter's job is to consume the token tree and return the final hash map. It is very simple and much shorter than the rest of the components, clocking in at 58 lines.
The formatter expects that the token tree has already been validated; otherwise it will throw errors. This was an intentional design decision, as users may not want the formatter to also act as the validator: someone may want to replace the formatter with another component that turns the token tree into a different format, such as a converter from the TOML subset to JSON.
The formatter's rules are as follows, in order of significance (an illustration of the resulting hash map layout follows the list):
If the current token is an identifier, store its value in the current key variable and move to the next token.
If the current token is a literal or boolean, and the current key variable is not empty, push the value of the literal to the hash map with the current key as its key.
If the current token is a separator, loop over the following tokens until a closed square bracket separator is met. Each array item, separated by a comma, is inserted into the hash map with a key built from the current key plus zero-indexed array syntax (e.g. "array[0]", "array[1]"). The current key without any selector will contain the number of items in the array.
If no rules are matched, skip the token.
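To make the array rule concrete, here is what I'd expect the output map to look like for a small example; the key name and the exact representation of the count are my reading of the description rather than documented behaviour.

```rust
use std::collections::HashMap;

// Expected flattened layout for the hypothetical input line:
//     hosts = ["alpha", "beta", "gamma"]
fn expected_layout() -> HashMap<String, String> {
    let mut map = HashMap::new();
    map.insert("hosts".to_string(), "3".to_string()); // element count, no selector
    map.insert("hosts[0]".to_string(), "alpha".to_string());
    map.insert("hosts[1]".to_string(), "beta".to_string());
    map.insert("hosts[2]".to_string(), "gamma".to_string());
    map
}
```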
The formatter is intentionally simple, as there is no need to pile complexity onto a simple configuration language subset. The same goes for the rest of the parser stack: I wanted to keep the complexity low and get the project done at a fair speed so that I could integrate it into herb. Personally, I feel I did a good job of this, as Frostwalker stands now as a fairly robust but modular parser stack.
The parser wrapper exists to tie the pieces of the parser stack together into one simple function usable by outside programs that don't need to interact with the individual pieces of the stack. It is implemented in the "frostwalker::parse()" function, which takes in a borrowed string and returns a Result containing either a HashMap or a String holding the error kicked out by the validator.
It removes the need to handle the token tree directly and pass it to the individual components, as the function does all of this for you. It is also pragmatic: because the result is wrapped in a Result, error handling is easy and all errors within the parser are non-catastrophic and recoverable. I used this quality within herb so that when a configuration file can't be read, it falls back on its own default settings, allowing the web server to run with a broken config file, or none at all, with no problems.
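A caller in the style herb uses might look roughly like this; the file name, the empty-map fallback, and the assumption that the map holds String values are all placeholders for the example.

```rust
use std::collections::HashMap;

// Hypothetical caller demonstrating the fallback behaviour described above.
fn load_config() -> HashMap<String, String> {
    let source = std::fs::read_to_string("herb.conf").unwrap_or_default();
    match frostwalker::parse(&source) {
        Ok(config) => config,
        Err(error) => {
            eprintln!("config error, falling back to defaults: {error}");
            HashMap::new() // herb substitutes its own default settings here
        }
    }
}
```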
Each component of the parser stack features unit tests covering how it handles different cases. These cases range from a simple single key input to arrays containing multiple literals, unicode characters, emoji, and broken syntax. I also wrote integration tests to make sure the parser stack works nicely as a whole and collectively creates the correct output.
The integration tests also cover the parser wrapper. It is imperative that the parser wrapper returns the expected result, as it will be the most commonly used portion of the library, being its front and center function.
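As an illustration of the style of those tests, an integration-style check through the wrapper might look like this; the inputs and expectations are made up, though they mirror the kinds of cases described above (a single key, and a syntax error the validator should reject).

```rust
// Hypothetical integration-style tests through the wrapper function.
#[test]
fn parses_single_key() {
    let result = frostwalker::parse("name = \"herb\"").expect("valid input should parse");
    assert_eq!(result.get("name").map(String::as_str), Some("herb"));
}

#[test]
fn rejects_missing_value() {
    // A key with no value should be kicked out by the validator.
    assert!(frostwalker::parse("name =").is_err());
}
```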
Writing this parser stack has been one of the most enjoyable experiences in my programming career. It has allowed me to get really stuck into algorithmic problems, lexeme tokenising, validation, and Rust's many systems, quirks, and concepts, and it has really made me fall in love with programming again. Rust's compiler driven development and testing features definitely made this project's development fly by and made the experience incredibly enjoyable.
I'd definitely recommend that any more experienced programmer looking for a project take on a parser, whether it be for a configuration language or something more advanced like a general purpose language such as Rust. It will be something you'll be very proud of, like I am of Frostwalker, and you can totally show it to all your programmer friends and they'll go "Wowee"!