Revisiting Forth
2025-06-03
When I was at Recurse I briefly explored Forth. I've been looking into it again lately. To be specific, I have been trying to understand Jonesforth by rewriting my own Forth to more closely resemble it.
Trying to understand Forth
I don't know how useful it is but trying to understand Forth is an interesting intellectual challenge for me. It's a low level language that you can write high level code in, but only if you're willing and able to dive into the low level details when needed. It's an odd combination.
When I played Microcorruption I learned about how you can use buffer overflows and other bugs to corrupt the control flow of other people's C programs. In Forth you are allowed and sometimes required to corrupt the control flow of your own program.
Forth taxonomy
Below, when I say "Forth", I often mean "Jonesforth". I think Jonesforth follows a lot of Forth conventions but I don't know enough about it to say when it's being more "Jones" than "Forth" so I won't bother to make the distinction.
The basic building blocks of Forth are called "words". You can think of these as functions. There are two kinds of words: code words and colon words. A "code word" is a word defined in something other than Forth. In the case of Jonesforth this is assembler, in my own Forth it's C. A "colon word" is implemented as an array of Forth words. They are named after the Forth :
(colon) operator which you can use to define them.
Colon words behave like function calls in a language like C. Their call behavior is a tree structure with code words as the leaves. To implement colon words in the interpreter you need (something like) a return stack. Forth has two stacks; one for parameters (the "parameter stack" or "data stack") and one for return addresses (the "return stack").
Code words do not use the return stack. They (or rather pointers to them) are arranged sequentially in memory. Most of the time, code words end in a macro called NEXT
which jumps to the next code word in the list.
I don't know what to call this "end in NEXT
" control flow. It's certainly not a function call, because there is no return. Code words just jump to the next word and somehow it all works out.
Memory management
Like C, you could say Forth has a notion of "stack memory" and "heap memory". In C, "stack memory" means local variables. In Forth, "stack memory" would be the parameter stack and sometimes, if you can get away with it, also the return stack.
In C, "heap memory" means malloc
: the thing you make arbitrary sized long-lived allocations with at run time. In Forth, there is also something like "heap memory" except it is not organized as a heap. Heap memory means that you can free bits of memory in between other bits that are still in use. In C, any memory allocated with malloc
can be freed with free
at any time. Forth only has a simple bump allocator. You can reset its state to a previous point but that means you can only free memory the most recently allocated memory.
The bump allocator uses a pointer named HERE
. Working with long lived allocations usually means storing data at HERE
and post-incrementing it. The Forth word ,
("comma") pulls an integer from the stack and stores it at HERE
. In C this would be *HERE++ = x;
.
Interpreter state
The Forth interpreter can be in one of two states: immediate or compiling. In "immediate mode" it reads Forth words from standard input one at a time executing each in turn. In "compiling mode", it also reads words, but instead of executing them it writes pointers to each successive word at the HERE
pointer.
How do you transition between these two states? Going from immediate to compiling is straightforward: code words can change the internal state of the compiler. But once it is compiling how does the compiler ever stop? While compiling it doesn't execute words, it only stores pointers to their definitions. To break out of this, Forth words can be marked as being an "immediate word". Immediate words get executed immediately even if the compiler is in compile mode.
To give an example, consider this code. First we define a colon word called HELLO
that prints Hello world!
to stdout, then we call it.
: HELLO ." Hello world!" CR ;
HELLO
Without going into the specifics of how this works, note that :
is a word that the interpreter sees while in immediate mode, and which (among other things) transitions the interpreter state to compiling mode. And ;
is an immediate word, read while in compiling mode, which sends the interpreter back into immediate mode. Then when the interpreter sees the second HELLO
it looks it up in the dictionary and executes the definition we compiled into memory just one line above.
Conclusion
I'll wrap up this blog post for now because I could keep going on for a long time.
Forth outwardly appears to use nested function calls like many other programming languages but some of its words are executed sequentially with hard to follow jumps. It has two stacks which allow it to get away with letting local variables be anonymous and implicit. Long lived allocations are made sequentially using the HERE
pointer. The interpreter alternates between two states that govern how it treats input words, but some words get to override the "compiling" state and run immediately anyway.
Tags: forth