Good error reporting is crucial in programming languages. Doing it at compile time was easy in ArkScript, as we have all the context we need at hand. But since we compile code down to bytecode, which is then run by the virtual machine, we lose a lot of context: the source files, their paths, their contents; none of that is available anymore! Our runtime errors could only show the VM’s internal state. This article is about how it all changed.
Multiple solutions
I went to the drawing board, and three solutions presented themselves to me:
- create a source location table in the bytecode, mapping an instruction to a file and line;
- emit special instructions that would be skipped by the VM, and used only when the VM crashed, backtracking to find the nearest SOURCE instruction;
- extend the size of instructions to 8 bytes and use the 4 new bytes to track the source file (e.g. 2 bytes for an identifier) and line (2 bytes for the line seemed enough to track files of 64k+ lines).
The second one was off the table pretty quickly, because I had a hunch it would hurt performance too much for my liking. It would also disrupt the IR optimizer: I would have had to detect more instruction patterns, as SOURCE instructions could be interleaved with optimizable instructions.
The third solution felt like a lot of work for a small gain, as the extra data would be used only when handling errors. It would also double the size of the bytecode files, and constrain future evolutions of the VM, as I wouldn’t be able to use those 4 additional bytes for anything else.
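To make the cost of that third option concrete, here is a minimal sketch of what such a widened instruction could look like. The struct and field names are hypothetical, and the existing 4-byte instruction is kept as an opaque word, since its internal layout does not matter for the argument.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical layout for the rejected "8-byte instruction" option.
struct WideInstruction
{
    std::uint32_t instruction;  // the existing 4-byte instruction, kept opaque
    std::uint16_t file_id;      // assumed: identifier of the source file
    std::uint16_t line;         // assumed: line number in that file
};

// Every instruction pays for the 4 extra bytes, whether an error occurs or not.
static_assert(sizeof(WideInstruction) == 8, "instruction size doubles from 4 to 8 bytes");

// Largest line number that fits in the 2-byte field.
constexpr std::uint32_t max_trackable_line = std::numeric_limits<std::uint16_t>::max();
```

With this layout, the location data travels with every single instruction, which is exactly why the bytecode doubles in size.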
ℹ️ Note
As Robert Nystrom noted on Reddit, making the bytecode larger would also cause more cache misses in the VM, making performance worse.
As you might have guessed, I went with the first solution.
Implementation
The source location data is added to each AST node by the parser, and the last compiler pass that can access the AST is the AST lowerer, whose job is to generate the IR. Hence, it felt logical to add two fields, source_file and source_line, to the IR::Entity structure.
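As a rough illustration, the lowering step could copy the location from the AST node onto each IR entity it emits. This is only a sketch: apart from source_file and source_line, every name below (AstNode, its fields, the opcode and argument members, the lower function) is a simplified placeholder, not the actual ArkScript definition.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical, simplified AST node: the parser records where it came from.
struct AstNode
{
    std::string symbol;
    std::string filename;
    std::size_t line;
};

// Simplified IR entity. Only source_file and source_line come from the
// article; opcode and argument are placeholders for the real payload.
struct Entity
{
    std::uint8_t opcode;
    std::uint16_t argument;
    std::string source_file;
    std::size_t source_line;
};

// Lowering: each entity generated for a node inherits that node's location.
Entity lower(const AstNode& node, std::uint8_t opcode, std::uint16_t argument)
{
    return Entity { opcode, argument, node.filename, node.line };
}
```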
Bonus point: this source tracking solution required (nearly) zero modifications to the IR optimizer! Nearly zero, because each optimized IR entity has to inherit its source location from the IR entities it was compacted from.
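A minimal sketch of that one modification, assuming the optimizer fuses two entities into a single one and that the fused entity takes the location of the first. The Entity fields other than source_file and source_line, the fuse function, and the choice of the first entity are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Simplified IR entity with room for two arguments once fused (hypothetical).
struct Entity
{
    std::uint8_t opcode;
    std::uint16_t primary_arg;
    std::uint16_t secondary_arg;
    std::string source_file;
    std::size_t source_line;
};

// Compact two entities into one, carrying over the first one's location
// so the fused instruction can still be traced back to the source.
Entity fuse(std::uint8_t fused_opcode, const Entity& first, const Entity& second)
{
    return Entity {
        fused_opcode,
        first.primary_arg,
        second.primary_arg,
        first.source_file,
        first.source_line
    };
}
```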
Which instructions should we track?
All of them!
You might ask yourself “But wouldn’t this make the generated bytecode twice as big, as solution 3 would?”, and you’re partly right. To avoid that, we have to introduce de-duplication!
The proposed solution was to track every instruction’s source location, but many instructions would point to the same file and line, as a single statement like (if (< value 5) (print "you can pass!") (go-to-jail)) would involve around 10-12 instructions.
If we only keep track of the source location of the first instruction in a series of instructions from the same file and on the same line, that’s more than enough!
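The de-duplication described above can be sketched as follows, assuming each instruction of a page already knows its file id and line. The type names, field names and integer widths are assumptions for the example, not the actual ArkScript code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-instruction location before de-duplication (hypothetical type).
struct Located
{
    std::uint16_t file_id;
    std::uint32_t line;
};

// One entry of the de-duplicated table (assumed field names and widths).
struct InstLoc
{
    std::uint16_t page_pointer;
    std::uint16_t inst_pointer;
    std::uint16_t filename_id;
    std::uint32_t line;
};

// Emit an entry only when the file or the line changes, i.e. for the first
// instruction of each run of instructions sharing the same location.
std::vector<InstLoc> deduplicate(std::uint16_t page, const std::vector<Located>& instructions)
{
    std::vector<InstLoc> table;
    for (std::size_t ip = 0; ip < instructions.size(); ++ip)
    {
        const Located& current = instructions[ip];
        const bool same_as_previous = !table.empty() &&
            table.back().filename_id == current.file_id &&
            table.back().line == current.line;
        if (!same_as_previous)
            table.push_back({ page, static_cast<std::uint16_t>(ip), current.file_id, current.line });
    }
    return table;
}
```

A page made of 12 instructions from line 1 followed by 3 instructions from line 2 yields only two entries, one at instruction 0 and one at instruction 12.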
Exploiting the new source location for our errors
We now need to track filenames and (page, instruction, filename id, line) tuples so that we have source locations for our errors. Those are split into two tables in the bytecode.
Having those tables, given a page pointer and instruction pointer, we can retrieve the nearest source location information and the associated file by its id (index in the filenames table).
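A minimal sketch of that lookup, assuming the two tables have already been loaded into memory. DebugTables, find_location and describe are hypothetical names, and a linear scan is used for simplicity: the nearest location is the entry of the same page with the greatest instruction pointer that does not exceed the faulting one.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One (page, instruction, filename id, line) tuple; widths are assumed.
struct InstLoc
{
    std::uint16_t page_pointer;
    std::uint16_t inst_pointer;
    std::uint16_t filename_id;
    std::uint32_t line;
};

// The two tables read from the bytecode (hypothetical container).
struct DebugTables
{
    std::vector<std::string> filenames;
    std::vector<InstLoc> inst_locations;
};

// Returns the nearest entry at or before (pp, ip), or nullptr if the page
// has no location information.
const InstLoc* find_location(const DebugTables& tables, std::uint16_t pp, std::uint16_t ip)
{
    const InstLoc* nearest = nullptr;
    for (const InstLoc& loc : tables.inst_locations)
    {
        if (loc.page_pointer != pp || loc.inst_pointer > ip)
            continue;
        if (nearest == nullptr || loc.inst_pointer >= nearest->inst_pointer)
            nearest = &loc;
    }
    return nearest;
}

// Formats the location as "file:line" for an error message.
std::string describe(const DebugTables& tables, std::uint16_t pp, std::uint16_t ip)
{
    const InstLoc* loc = find_location(tables, pp, ip);
    if (loc == nullptr || loc->filename_id >= tables.filenames.size())
        return "<unknown location>";
    return tables.filenames[loc->filename_id] + ":" + std::to_string(loc->line);
}
```

Because this only runs when an error is reported, a simple scan is likely good enough; a binary search per page would be a natural refinement if the table were kept sorted.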
Results!
Given the following erroneous code:
| |
We expect an arity error upon calling foo, as we passed too many arguments.
| |
In terms of bytecode, it generated the following:
| |
As you can see, the instruction locations table is quite small thanks to the de-duplication, and we have all the information we need to report errors correctly!