Instruction source location tracking in ArkScript

Good error reporting is crucial in programming languages. Doing it at compile time was easy in ArkScript as we have all the context we need at hand, but since we compile code down to bytecode, which is then run by the virtual machine, we loose a lot of context: the source files, their path, their content, we don’t have that anymore! Our runtime errors could only show the VM internal state. This article is about how it all changed.

Multiple solutions

I went to the drawing board, and three solutions presented themselve to me:

create a source location table in the bytecode, mapping an instruction to a file and line ;
emit special instructions that would be skipped by the VM, and used only when the VM crashed, backtracking to find the nearest SOURCE instruction ;
extend the size of instructions to 8 bytes and use the 4 new bytes to track the source file (eg 2 bytes for an identifier) and line (2 bytes for the line seemed enough to track 64k+ lines files) ;

The second one was off the table pretty quickly, because I had a hunch it would hinder performances too much to my liking. It would also disrupt the IR optimizer, and I would have had to detect more instruction patterns as SOURCE instructions could be interleaved with optimizable instructions.

The third solution felt like a lot of work for a small gain, as it would be used only when handling errors. It would also double the size of the bytecode files, and lock the future evolutions of the VM as I wouldn’t be able to use those additional 4 bytes for anything else.

ℹ️ Note
As Robert Nystrom noted on Reddit, making the bytecode larger the VM would have more cache misses, making performance worse.

As you might have guessed, I went with the first solution.

Implementation

The source location data is added to each AST node by the parser, and the last compiler pass that can access the AST is the AST lowerer, whose job is to generate the IR, hence it felt logical to add two fields, source_file and source_line to the IR::Entity structure.

Bonus point: using this source tracking solution, there are (nearly) 0 modifications on the IR Optimizer! Nearly 0, because I had to add source location to each optimized IR entity from the compacted IR entities.

Which instruction should we track?

All of them!

You might ask yourself “But wouldn’t this make the generated bytecode twice bigger, as solution 3 would?”, and you’re partly right. To make this right, we have to introduce de-duplication!

The proposed solution was to track every instruction source location, but many instructions would point to the same file and line, as a single statement like (if (< value 5) (print "you can pass!") (go-to-jail)) would involve around 10-12 instructions.

If we keep track of the source location of the first instruction in a serie of instructions from the same file and on the same line, that’s more than enough!

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
std::optional<InstLocation> prev = std::nullopt;

for (const auto& inst : page)
{
  if (inst.hasValidSourceLocation())
  {
    // we are guaranteed to have a value since we listed all
    // existing filenames in IRCompiler::process before,
    // thus we do not have to check if std::ranges::find
    // returned a valid iterator.
    auto file_id = static_cast<uint16_t>(
      std::distance(
        m_filenames.begin(),
        std::ranges::find(m_filenames, inst.filename())));

    std::optional<internal::InstLoc> prev = std::nullopt;
    if (!locations.empty())
      prev = locations.back();

     // skip redundant instruction location
     if (!(
           prev.has_value() &&
           prev->filename_id == file_id &&
           prev->line == inst.sourceLine() &&
           prev->page_pointer == i))
       locations.push_back(
         { .page_pointer = static_cast<uint16_t>(i),
           .inst_pointer = ip,
           .filename_id = file_id,
           .line = static_cast<uint32_t>(inst.sourceLine()) });
  }

  if (inst.kind() != IR::Kind::Label)
    ++ip;
}

Exploiting the new source location for our errors

We now need to track filenames and (page, instruction, filename id, line) tuples so that we have source locations for our errors. Those are split in two data tables in the bytecode.

Having those tables, given a page pointer and instruction pointer, we can retrieve the nearest source location information and the associated file by its id (index in the filenames table).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
std::optional<InstLoc> VM::findSourceLocation(
    const std::size_t ip,
    const std::size_t pp
) {
  std::optional<InstLoc> match = std::nullopt;

  for (const auto location : m_state.m_inst_locations)
  {
    if (location.page_pointer == pp && !match)
      match = location;

    // select the best match: we want to find the location
    // that's nearest our instruction pointer, but not equal
    // to it as the IP will always be pointing to the next
    // instruction, not yet executed. Thus, the erroneous
    // instruction is the previous one.
    if (location.page_pointer == pp && match &&
        location.inst_pointer < ip / 4)
      match = location;

    // early exit because we won't find anything better, as
    // inst locations are ordered by ascending (pp, ip)
    if (location.page_pointer > pp || (
          location.page_pointer == pp &&
          location.inst_pointer >= ip / 4))
      break;
  }

  return match;
}

Results!

Given the following erroneous code:

1
2
3
(let foo (fun (a b) (+ a b)))

(foo 1 2 3)

We expect an arity error upon calling foo, as we passed too many arguments.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
ArityError: When calling `(foo)', received 3 arguments, but expected 2: `(foo a b)'

In file a.ark
    1 | (let foo (fun (a b) (+ a b)))
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    2 |
    3 | (foo 1 2 3)

[   2] In function `foo' (a.ark:1)
[   1] In global scope (a.ark:3)

Current scope variables values:
foo = Function@1
At IP: 0, PP: 1, SP: 5

In terms of bytecode, it generated the following:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
Symbols table (length: 3)
0) foo
1) a
2) b

Constants table (length: 4)
0) (PageAddr) 1
1) (Number) 1
2) (Number) 2
3) (Number) 3

Instruction locations table (length: 3)
 PP, IP
  0,  0 -> a.ark:0
  0,  4 -> a.ark:2
  1,  4 -> a.ark:0

Code segment 0 (length: 24)
   0 39 00 00 00 LOAD_CONST_STORE
   1 38 00 20 01 LOAD_CONST_LOAD_CONST
   2 03 00 00 03 LOAD_CONST 3 (Number)
   3 02 00 00 00 LOAD_SYMBOL_BY_INDEX (0)
   4 0b 00 00 03 CALL (3)
   5 0a 00 00 00 HALT

Code segment 1 (length: 28)
   0 05 00 00 01 STORE a
   1 05 00 00 02 STORE b
   2 02 00 00 01 LOAD_SYMBOL_BY_INDEX (1)
   3 02 00 00 00 LOAD_SYMBOL_BY_INDEX (0)
   4 20 00 00 00 ADD
   5 09 00 00 00 RET
   6 0a 00 00 00 HALT

As you can see, the instruction locations table is quite small thanks the de-duplication, and we have all the information we need to report errors correctly!