Anatomy of a parser

29 Jan 2024


Over the course of about four days, I wrote Frostwalker, a Rust parser for a configuration language similar to TOML, for my web server herb. I wanted the experience of writing a parser stack, and using it in herb also lowers the number of dependencies (the toml crate pulls in 8 dependencies at build time).

Overall architecture

Frostwalker's main library file contains public data types used by all components of the parser stack, a wrapper function around each of the components in the stack, and public module exports for the three parser stack modules: the lexer, the validator, and the formatter.

Shared data types

At a high level, the shared data types are mainly for passing lexeme data around and are the Class enum and Token struct. These are in the main library purely so that the validator and the formatter aren't dependent on the lexer. This was an intentional design decision so that the Frostwalker validator and formatter can be used for other languages if someone else wants to write a new lexer for another language.

The Class enum varied through development as I decided how to properly tokenise the subset of TOML features I wanted implemented: the BOOLEAN class used to be two separate classes, TRUE and FALSE, and the LITERAL class also used to be two separate classes, STRING and INTEGER.
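
To make this concrete, here is a rough sketch of what the shared types might look like. Only the BOOLEAN and LITERAL classes are described above; the remaining variants are my assumptions based on the TOML subset being parsed.

    #[derive(Debug, Clone, PartialEq)]
    #[allow(non_camel_case_types)] // matching the all-caps class names used above
    pub enum Class {
        IDENTIFIER, // assumed: a key name
        EQUALS,     // assumed: the = between a key and its value
        LITERAL,    // a string or integer value
        BOOLEAN,    // true or false
        SEPARATOR,  // assumed: array punctuation
        NEWLINE,    // assumed: marks the end of a key/value line
    }

    #[derive(Debug, Clone, PartialEq)]
    pub struct Token {
        pub class: Class,
        pub value: Option<String>, // the lexeme text, where one exists
    }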

Lexer

The task of the lexer (also known as a tokeniser or lexical analyser) is to take the source file as input, break it down into chunks, and build a sequence of tokens, herein referred to as a token tree, which further stages of a parser or compiler can turn from this machine-readable format into a usable one. In the case of a configuration language, that would be an associative array like a hash map or dictionary.

Most lexers nowadays are generated by lex-family tools, because the sheer number of lexemes featured in modern languages makes it impractical to manually create a lexer capable of building a tree for more complicated languages. The subset of TOML I was aiming for didn't need many lexemes, though, so I chose to write mine by hand.

The core loop of the lexer splits the source file into lines and then into words, runs over each word, and applies a set of rules to it.

Broken down, the rules are, in order of prominence:

As you can see, the number of rules used to tokenise the language is fairly small (the lexer itself is only 165 lines long), reflecting the simplicity of the TOML subset in comparison to something like a functional programming language, which may need many more rules to cover floating points, functions, and definitions, amongst other features.
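
As a rough illustration of the core loop, here is a condensed sketch using the types from earlier. The rule bodies are simplified stand-ins rather than Frostwalker's actual rules, and string handling is deferred to the loop described below:

    fn lex(source: &str) -> Vec<Token> {
        let mut tree = Vec::new();
        for line in source.lines() {
            for word in line.split_whitespace() {
                if word.starts_with('#') {
                    break; // comments run to the end of the line, so drop the rest
                } else if word == "=" {
                    tree.push(Token { class: Class::EQUALS, value: None });
                } else if word == "true" || word == "false" {
                    tree.push(Token { class: Class::BOOLEAN, value: Some(word.into()) });
                } else if word.chars().all(|c| c.is_ascii_digit()) {
                    tree.push(Token { class: Class::LITERAL, value: Some(word.into()) });
                } else if word.starts_with('"') {
                    // hand off to the string handling loop described below
                } else {
                    tree.push(Token { class: Class::IDENTIFIER, value: Some(word.into()) });
                }
            }
            tree.push(Token { class: Class::NEWLINE, value: None });
        }
        tree
    }

Running this over a line like port = 8080 would yield an IDENTIFIER, an EQUALS, a LITERAL, and a NEWLINE token.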

One thing I left out of the explanation is the string handling loop. It is arguably the most complex part of the lexer, as it involves watching for the end of the string and deciding where that is.

The loop starts by checking whether the current word ends with double quotes, signifying that the string spans only that word; if so, it drops the quotes and adds the word to the tree as a literal token before moving on. If not, it uses "format!()" to build up a string within the lexer, first dropping the opening quote mark from the starting word and adding it to the built string. It then iterates over each following word, appending it to the built string, until it finds a word ending with a quotation mark, at which point it drops that quotation mark and finalises the built string. This built string is then added to the token tree.
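
Under the same assumptions as the earlier sketches, the string loop might look something like this, where words is the iterator over the remaining words on the line:

    fn lex_string(first: &str, words: &mut std::str::SplitWhitespace<'_>) -> Token {
        // The string spans only the current word: drop both quotes.
        if first.len() > 1 && first.ends_with('"') {
            let inner = &first[1..first.len() - 1];
            return Token { class: Class::LITERAL, value: Some(inner.to_string()) };
        }
        // Multi-word string: drop the opening quote, then keep appending
        // words until one ends with the closing quote.
        let mut built = first[1..].to_string();
        for word in words {
            if word.ends_with('"') {
                built = format!("{} {}", built, &word[..word.len() - 1]);
                break;
            }
            built = format!("{} {}", built, word);
        }
        Token { class: Class::LITERAL, value: Some(built) }
    }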

All in all, the lexer is fairly stable and produces an accurate representation of the source configuration file. The token tree output by the lexer could easily be read back to rebuild the original source, sans comments.

Validator

The validator's task is to step through the token tree and determine, following a set of rules, whether it is valid. This serves as a sanity check not only on the input but also on the lexer: any error in the input or the lexer is treated the same, and the validator will kick out an error.

The validator relies on an integrated enum called ExpectedClass, which contains copies of most of the classes, with the BOOLEAN class folded into the generic LITERAL expected class, plus a new expected class called LITERALORSEPARATOR, used after an equals where the value could be either a single literal or an array.
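
Based on that description, ExpectedClass might look something like this sketch (the variants beyond LITERALORSEPARATOR mirror my assumed Class enum from earlier):

    #[allow(non_camel_case_types)]
    enum ExpectedClass {
        IDENTIFIER,
        EQUALS,
        LITERAL,            // also matched by BOOLEAN tokens
        SEPARATOR,
        LITERALORSEPARATOR, // after an equals: a single literal or the start of an array
        NEWLINE,
    }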

The validator's rules are as follows, in order of significance:

The validator is very simple and straightforward, and can detect syntax errors such as separators used in the wrong places, keys left undefined, and unknown tokens. Outside of the current application, the validator could be used to sanity check a new lexer: if the new lexer produces the same classes as the current one, its output can be checked against the same rules, amongst other uses.
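
A loose sketch of the stepping mechanism, ignoring arrays and covering only the simple key = value shape, might look like this; Frostwalker's actual rule set is more involved than this stand-in:

    fn validate(tree: &[Token]) -> Result<(), String> {
        let mut expected = ExpectedClass::IDENTIFIER;
        for token in tree {
            expected = match (&expected, &token.class) {
                (ExpectedClass::IDENTIFIER, Class::IDENTIFIER) => ExpectedClass::EQUALS,
                (ExpectedClass::EQUALS, Class::EQUALS) => ExpectedClass::LITERAL,
                // BOOLEAN folds into the LITERAL expectation, as described above
                (ExpectedClass::LITERAL, Class::LITERAL | Class::BOOLEAN) => ExpectedClass::NEWLINE,
                (ExpectedClass::NEWLINE, Class::NEWLINE) => ExpectedClass::IDENTIFIER,
                _ => return Err(format!("unexpected {:?} token", token.class)),
            };
        }
        Ok(())
    }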

Formatter

The formatter's job is to consume the token tree and return the final hash map. It is very simple and much shorter than the rest of the components, clocking in at 58 lines.

The formatter expects the token tree to have been validated already; otherwise it will throw errors. This was an intentional design decision: users may not want the formatter to double as the validator, as they may want to replace the formatter with another component that turns the token tree into a different format, such as a converter from the TOML subset to JSON.
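
As a rough sketch of what consuming the tree might look like, again covering only the simple key = value case and assuming string values in the map (the real formatter also handles arrays):

    use std::collections::HashMap;

    fn format(tree: &[Token]) -> HashMap<String, String> {
        let mut map = HashMap::new();
        let mut i = 0;
        while i + 2 < tree.len() {
            // A validated tree guarantees well-formed runs, so the
            // formatter can step through without re-checking much.
            if tree[i].class == Class::IDENTIFIER && tree[i + 1].class == Class::EQUALS {
                let key = tree[i].value.clone().unwrap_or_default();
                let value = tree[i + 2].value.clone().unwrap_or_default();
                map.insert(key, value);
                i += 3;
            } else {
                i += 1; // skip NEWLINE tokens between entries
            }
        }
        map
    }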

The formatter's rules are as follows, in order of significance:

The formatter is intentionally simple, as there is no need to pile complexity onto a simple configuration language subset. The same goes for the rest of the parser stack: I wanted to keep the complexity low and the project moving at a fair speed so that I could integrate it into herb. Personally, I feel I did a good job of this, as Frostwalker stands now as a fairly robust yet modular parser stack.

Parser wrapper

The parser wrapper exists to tie the pieces of the parser stack together as one simple function usable by outside programs that don't need to interact with the individual pieces of the stack. It is implemented as the "frostwalker::parse()" function, which takes a borrowed string and returns a Result: either a HashMap or a String containing the error kicked out by the validator.

It removes the need to handle the token tree directly and pass it between the individual components, as the function does all of this for you. It is also pragmatic: because the result is wrapped in a Result, error handling is easy, and all errors within the parser are non-catastrophic and recoverable. I used this quality in herb so that when a configuration file can't be read, the server falls back on its own default settings, allowing it to run with no config file, or a broken one, without problems.
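
In practice, using the wrapper looks something like the sketch below. The signature matches the description above; the value type of the returned HashMap is my assumption here:

    fn main() {
        let source = "name = \"herb\"\nport = 8080";
        match frostwalker::parse(source) {
            Ok(config) => println!("port = {:?}", config.get("port")),
            Err(e) => {
                // Non-catastrophic: fall back on defaults, as herb does.
                eprintln!("config error, using defaults: {e}");
            }
        }
    }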

Tests

Each component of the parser stack features unit tests which check how the component handles different cases. These range from a simple single-key input to arrays containing multiple literals, unicode characters, emoji, and broken syntax. I also wrote integration tests to make sure the parser stack works nicely together and can collectively create the correct output.

The integration tests also cover the parser wrapper. It is imperative that the parser wrapper also returns the expected result, as it will be the most commonly used portion of the library, being its front-and-centre function.
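
For illustration, an integration test over the wrapper might look something like this; the exact cases in the real suite aren't shown here:

    #[test]
    fn single_key_parses() {
        let map = frostwalker::parse("greeting = \"hello world\"")
            .expect("valid input should parse");
        assert_eq!(map.get("greeting").map(String::as_str), Some("hello world"));
    }

    #[test]
    fn broken_syntax_errors() {
        assert!(frostwalker::parse("= = =").is_err());
    }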

Plenary

Writing this parser stack has been one of the most enjoyable experiences in my programming career. It let me get really stuck into algorithmic problems, lexeme tokenising, validation, and Rust's many systems, quirks, and concepts, and it has really made me fall in love with programming again. Rust's compiler-driven development and testing features definitely made this project's development fly by and made the experience incredibly enjoyable.

I'd definitely recommend that a more experienced programmer looking for a project take on a parser, whether it be for a configuration language or something more advanced like a general-purpose language such as Rust. It will be something you'll be very proud of, like I am of Frostwalker, and you can totally show it to all your programmer friends and they'll go "Wowee"!

The source code for Frostwalker is available on my Gitea instance and if you want to use it in your software, it is on crates.io as frostwalker too!

