Quirks of the es interpreter
Although the es interpreter runtime is written in ANSI C and should build and run with just about any reasonably standards-compliant compiler and OS, it is an odd-looking codebase that contains a number of less-common patterns and mechanisms.
A certain type of person will be excited to read that most of these mechanisms involve “advanced” use of preprocessor macros.
A different type of person, upon reading that, will have an urge to close this page immediately.
Understanding some of these aspects of es can take some time without any documentation available, so this page is intended to help.
The garbage collector
Es uses a bespoke copying garbage collector, and precisely and explicitly tracks variable references in order that the GC knows what can and cannot be collected at any time.
This reference tracking is one of the most conspicuous oddities visible throughout es code:
if (list == NULL) {
    sethistory(NULL);
    return NULL;
}
Ref(List *, lp, list);
sethistory(getstr(lp->term));
RefReturn(lp);
In this example, the Ref and RefReturn macros delimit a block of code in which the variable lp is known to the garbage collector and kept up to date across collections.
The macros are expanded so that the critical block resembles the following.
{
    List *lp = list;
    add_root(lp);
    sethistory(getstr(lp->term));
    remove_root(lp);
    return lp;
}
Laying it out in prose, the behavior here is:
Ref(List *, lp, list):
- introduces a new block,
- declares a new List * called lp, which is assigned the value of list, and
- adds lp to the GC's root list, ensuring lp remains valid across collections.
sethistory(getstr(lp->term)) converts the first term of lp to a string (an action which may trigger a garbage collection) and calls sethistory on that string.
RefReturn(lp):
- removes lp from the root list,
- returns lp, and
- ends the block.
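To make the block-opening and block-closing behavior concrete, here is a minimal sketch of how macros of this shape could be written. It is not the definition used in es: the Root record, the rootlist variable, move_object(), and survive_collection() are hypothetical names invented for the illustration, and the "collection" is a stand-in that moves a single object and patches any rooted variable pointing at it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct List {
    const char *term;
    struct List *next;
} List;

/* Hypothetical root record: the address of one rooted pointer variable. */
typedef struct Root {
    void *var;            /* points at the local pointer variable */
    struct Root *next;    /* the root registered before this one */
} Root;

static Root *rootlist = NULL;   /* innermost root first */

/* Opens a block, declares v, and pushes a root for it. */
#define Ref(t, v, init) \
    { \
        t v = (init); \
        Root v##_root; \
        v##_root.var = &v; \
        v##_root.next = rootlist; \
        rootlist = &v##_root

/* Pops the root for v and closes the block opened by Ref. */
#define RefEnd(v) \
        assert(rootlist == &v##_root); \
        rootlist = rootlist->next; \
    }

/* Like RefEnd, but returns v before the block closes. */
#define RefReturn(v) \
        assert(rootlist == &v##_root); \
        rootlist = rootlist->next; \
        return v; \
    }

/* Stand-in for a copying collection: one object moves from `from` to `to`,
 * and every rooted variable that pointed at it is updated. memcpy is used
 * so the update does not depend on the variable's exact pointer type
 * (this assumes all object pointers share one representation). */
static void move_object(void *from, void *to) {
    Root *r;
    for (r = rootlist; r != NULL; r = r->next) {
        void *cur;
        memcpy(&cur, r->var, sizeof cur);
        if (cur == from)
            memcpy(r->var, &to, sizeof to);
    }
}

/* Roots lp, "collects" while it is live, and returns the updated pointer. */
static List *survive_collection(List *list, List *newhome) {
    Ref(List *, lp, list);
    *newhome = *lp;              /* copy the object to its new location */
    move_object(lp, newhome);    /* lp is rooted, so it is rewritten */
    RefReturn(lp);
}
```

Because Ref leaves a brace open and RefReturn or RefEnd closes it, forgetting either half leaves the braces unbalanced, which the compiler reports as a syntax error rather than as anything mentioning roots.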
Note that these macros, and similar ones like RefEnd, which has the same behavior as RefReturn except it does not return, involve beginning and ending blocks implicitly as part of the macro definition.
This has the benefit of making it more difficult to accidentally leak GC roots.
Unfortunately, the compiler errors produced due to missing Refs or RefEnds have very little to do with the semantics of the macros themselves, which can be confusing.
There are a couple of challenges in working with GCed objects in es: not everything in es is garbage collected, and collections can't occur at just any moment.
This can make it somewhat difficult to predict exactly when a Ref needs to be used.
A good rule of thumb for collections is that they can occur any time the shell needs more memory, which can happen in one of two cases: during an allocation, or when the GC is re-enabled with a call to gcenable() after having previously been disabled with gcdisable().
A GC in the latter case is typically rare, but if enough allocations happen while the GC is disabled then it is possible for the shell to need to perform a collection immediately.
Knowing exactly which objects are GCed and which aren't can be even trickier.
Generally, List *s, Term *s, and Tree *s are nearly always in the GCed space in memory.
char *s can go either way, and in some cases very similar functions can produce either a GC-space or standard malloc-space string.
For example, the str() and mprint() functions both work much like sprintf(3), and differ from each other only in that str() produces a garbage-collected string, while mprint() produces a string that needs to be free()-ed.
The most foolproof way to build confidence in GC behavior is to build with GCDEBUG enabled, as in the following test invocation.
The GCDEBUG define enables both GCALWAYS, which forces a GC pass to happen any time it is possible, and GCPROTECT, which disables access to collected areas of memory using mprotect(2), causing references to stale pointers to trigger immediate crashes.
With both of these enabled, it is reasonably easy to know if a necessary Ref has been missed.
; make ADDCFLAGS=-DGCDEBUG=1 clean test
Triggering segfaults the easy way.
The exception mechanism
Es has an exception mechanism which, like the garbage collector, is implemented with macros that effectively add exceptions to the C runtime itself.
Here is how these macros look in the runesrc() function, which runs the .esrc script:
char *esrc = str("%L/.esrc", varlookup("home", NULL), "\001");
int fd = eopen(esrc, oOpen);
ExceptionHandler
    runfd(fd, esrc, 0);
CatchException (e)
    if (termeq(e->term, "exit"))
        exit(exitstatus(e->next));
    else if (termeq(e->term, "error")) {
        eprint("%L\n",
               e->next == NULL ? NULL : e->next->next,
               " ");
        return;
    }
    if (!issilentsignal(e))
        eprint("uncaught exception: %L\n", e, " ");
    return;
EndExceptionHandler
The exceptions handled by CatchException blocks are List *s.
These are produced by calls to throw(), which is often itself called by the fail() function typically used to generate error exceptions.
The definition of the $&throw primitive makes use of both of these exception-generating functions:
if (list == NULL)
    fail("$&throw", "usage: throw exception [args ...]");
throw(list);
Under the hood, the ExceptionHandler macros manage a dynamic stack of setjmp(3) targets, and the throw() function pops the top target of the stack and performs a longjmp(3) to it; there is also some bookkeeping so that these exceptions “just work” in the face of GCed memory and dynamic variables.
However, exiting an ExceptionHandler block early with a return or goto will probably cause problems in later code, because the necessary exception handler cleanup never happens. (Exiting a handler early with a throw() should be fine, because the throw() function performs its own cleanup.)
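The handler stack described above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the es implementation: exceptions are plain strings rather than List *s, there is no GC or dynamic-variable bookkeeping, and the names Handler, tophandler, pending, throw_sketch(), checked_divide(), and divide_or_default() are all hypothetical.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical handler record: one setjmp target per active handler. */
typedef struct Handler {
    jmp_buf label;
    struct Handler *up;          /* the enclosing handler, if any */
} Handler;

static Handler *tophandler = NULL;   /* innermost active handler */
static const char *pending = NULL;   /* exception currently in flight */

/* Pops the innermost handler and jumps to it; never returns. */
static void throw_sketch(const char *e) {
    Handler *h = tophandler;
    assert(h != NULL);           /* an uncaught throw is a bug in this sketch */
    tophandler = h->up;
    pending = e;
    longjmp(h->label, 1);
}

/* Pushes a handler and opens the protected block. */
#define ExceptionHandler \
    { \
        Handler handler_; \
        handler_.up = tophandler; \
        tophandler = &handler_; \
        if (!setjmp(handler_.label)) {

/* Pops the handler on normal exit, then opens the catch block. */
#define CatchException(e) \
            tophandler = handler_.up; \
        } else { \
            const char *e = pending;

#define EndExceptionHandler \
        } \
    }

/* Throws "error" instead of dividing by zero. */
static int checked_divide(int a, int b) {
    if (b == 0)
        throw_sketch("error");
    return a / b;
}

/* Returns a / b, or fallback if the division throws an "error". */
static int divide_or_default(int a, int b, int fallback) {
    /* volatile: the value must stay well defined across a longjmp */
    volatile int result = fallback;
    ExceptionHandler
        result = checked_divide(a, b);
    CatchException (e)
        if (strcmp(e, "error") != 0)
            throw_sketch(e);     /* rethrow anything unexpected */
        result = fallback;
    EndExceptionHandler
    return result;
}
```

Note where the pop happens: on the normal path it is the first statement of CatchException, and on the exceptional path throw_sketch() has already done it. A return placed inside the protected block would skip both, leaving tophandler pointing at a stack frame that no longer exists.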
This setup is an es-specific implementation of a reasonably well-established set of conventions to add exceptions to C code.
This paper by Eric S. Roberts shows a version very similar to es', but implementations of exceptions in C date back at least as far as 1985.
Signals
One of es' jobs as a shell is to handle a steady stream of signals coming from the terminal, child processes, users, and elsewhere.
Es models signals in-language as exceptions: any signal that isn't ignored leads to a call to the throw() function discussed above. This has to be done explicitly, however, so signal-specific breadcrumbs are present throughout the code base, primarily in the form of the SIGCHK() macro.
SIGCHK(), which simply wraps the sigchk() function, checks whether the shell has received any signals and handles each one as the user has configured: either ignoring it or converting it into a signal exception, which it then throws.
This follows the best practices of signal handling in C, where the signal handler itself does essentially nothing except increment a counter, which later code must check and handle more thoroughly.
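The "record now, handle later" split can be sketched as follows. This is an illustration only, using hypothetical names (sigcount, lastsig, catcher(), sigchk_sketch(), deliver_and_check()); where the real sigchk() would consult the user's configuration and possibly throw, this version just reports which signal was pending.

```c
#include <assert.h>
#include <signal.h>

/* Written by the handler, read by normal code. */
static volatile sig_atomic_t sigcount = 0;   /* signals seen, not yet handled */
static volatile sig_atomic_t lastsig = 0;    /* most recent signal number */

/* The handler only records that something happened. */
static void catcher(int sig) {
    lastsig = sig;
    ++sigcount;
}

/* Called from ordinary code at points where acting on a signal is safe.
 * Returns the pending signal number, or 0 if nothing is pending.
 * (A real implementation would also have to consider a signal arriving
 * between the test and the reset; that is ignored here.) */
static int sigchk_sketch(void) {
    int sig;
    if (sigcount == 0)
        return 0;
    sigcount = 0;
    sig = lastsig;
    return sig;
}

/* Installs the handler, delivers one signal, and reports what the check saw. */
static int deliver_and_check(int sig) {
    void (*old)(int) = signal(sig, catcher);
    int seen;
    assert(old != SIG_ERR);
    assert(sigchk_sketch() == 0);   /* nothing pending yet */
    raise(sig);                     /* the handler runs before raise() returns */
    seen = sigchk_sketch();
    signal(sig, old);               /* restore the previous disposition */
    return seen;
}
```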
The obvious follow-up question is, when should SIGCHK() be invoked?
The signal handler in es does one more thing, in addition to setting things up for SIGCHK().
Right at the end of the handler is the following statement:
if (slow)
    longjmp(slowlabel, 1);
These variables, slow and slowlabel, are also referenced where es calls the readline library:
if (!setjmp(slowlabel)) {
    slow = TRUE;
    r = readline(prompt);
} else {
    r = NULL;
    errno = EINTR;
}
slow = FALSE;
SIGCHK();
This nine-line snippet is essentially equivalent to the expression r = readline(prompt), except with defined signal-handling behavior.
It is a reasonably standard pattern for calling any library function (or syscall) when two things are true:
- You want the call to end when a signal comes in
- You do not trust the function to cede control on its own
This pattern arises because there is no universal behavior for what library functions, or syscalls, do when they are interrupted by a signal.
In some cases, the call will fail with an EINTR error, indicating it was interrupted, while in other cases the call will try to resume after the signal handler finishes.
Over the years, standard libraries across Unixes were inconsistent about which behavior they implemented.
This required applications like es that wanted consistent behavior between OSes to do this little longjmp dance to force control away from the call.
Since the “bad old days”, POSIX has largely standardized the behavior so that any call to, for example, open(2) will behave consistently between Unixes.
But not every function everywhere has settled on the same pattern, and readline is an example of a library that wants to try to handle signals itself without giving control to the calling application.
So, for signals to be able to interrupt the prompt in es, the readline() call has to be wrapped inside this block, and the block itself wrapped inside a loop, so that readline has a chance to restart if a signal comes in that does not get turned into an exception.
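The block-inside-a-loop arrangement can be sketched as a small self-contained program fragment. Several things here are assumptions made for the sake of a runnable example: fake_slow_call() stands in for readline() and simulates a signal arriving mid-call with raise(), call_with_retry() is a hypothetical name, and the sketch swaps in sigsetjmp/siglongjmp for the plain setjmp/longjmp shown above so that the signal mask is restored after each jump out of the handler.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

static sigjmp_buf slowlabel;
static volatile sig_atomic_t slow = 0;      /* inside the slow call? */
static volatile sig_atomic_t pending = 0;   /* a signal was seen */

static void catcher(int sig) {
    (void) sig;
    pending = 1;
    if (slow)
        siglongjmp(slowlabel, 1);   /* force control away from the call */
}

/* Stand-in for a blocking call such as readline(). */
static int fake_slow_call(int interrupted) {
    if (interrupted)
        raise(SIGINT);              /* does not return while slow is set */
    return 42;
}

/* Retries the slow call until it completes; returns the number of attempts,
 * or -1 if the handler could not be installed. */
static int call_with_retry(int interruptions, int *result) {
    struct sigaction sa, old;
    volatile int attempts = 0;
    volatile int left = interruptions;
    int r;

    sa.sa_handler = catcher;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGINT, &sa, &old) != 0)
        return -1;

    for (;;) {
        attempts++;
        if (!sigsetjmp(slowlabel, 1)) {
            slow = 1;
            r = fake_slow_call(left > 0);
        } else {
            r = -1;                 /* arrived here from the handler */
            errno = EINTR;
        }
        slow = 0;
        if (pending) {              /* roughly where SIGCHK() would sit */
            pending = 0;
            left--;
            assert(left >= 0);
        }
        if (r != -1)
            break;                  /* the call completed; stop retrying */
    }
    sigaction(SIGINT, &old, NULL);
    *result = r;
    return attempts;
}
```

In this sketch no signal is ever turned into an exception, so every interruption simply leads to another trip around the loop.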
Curiously, this use of longjmp from a signal handler is well-known and common, and also broadly considered unsafe.
That's C for you.
File descriptors
Another distinctive part of es is its handling of file descriptors.
This results from a simple problem: shells generally give their users access to file descriptors identified by number.
This means that, if implemented in the naïve way, users have access to every file descriptor used by the shell, and can mess with those file descriptors in ways that make it difficult for the shell to function or for the user to reason about.
To prevent this, es presents a set of file descriptors to the user that differ from its own internal idea of them, and, when fork/execing, places the user-visible descriptors in the right places for the child processes.
As a side benefit, this user-fd/runtime-fd split also enables running shell built-ins within redirections without requiring a fork.
For example, the $&read primitive starts with a call to fdmap(0).
This call lets the primitive operate on the real file descriptor behind what appears to it as standard input, whether that is really fd 0 or some other file that has been redirected.
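One way to picture the user-fd/runtime-fd split is as a small stack of mappings consulted on every lookup. The following is a sketch of that idea only, not es's actual data structure: FdMapping, maptab, fd_redirect(), fd_unredirect(), and fdmap_sketch() are hypothetical names, and the fixed-size table is an assumption made to keep the example short.

```c
#include <assert.h>

#define MAXMAP 16

/* One redirection: the fd the user names, and the fd actually backing it. */
typedef struct {
    int user;
    int real;
} FdMapping;

static FdMapping maptab[MAXMAP];
static int mapcount = 0;

/* Records that user-visible descriptor `user` is now backed by `real`.
 * Returns 0 on success, -1 if the table is full. */
static int fd_redirect(int user, int real) {
    if (mapcount >= MAXMAP)
        return -1;
    maptab[mapcount].user = user;
    maptab[mapcount].real = real;
    mapcount++;
    return 0;
}

/* Drops the most recent redirection, as when a redirected command ends. */
static void fd_unredirect(void) {
    assert(mapcount > 0);
    mapcount--;
}

/* Translates a user-visible fd to the real one; the newest mapping wins,
 * and an unmapped fd is its own real fd. */
static int fdmap_sketch(int user) {
    int i;
    for (i = mapcount - 1; i >= 0; i--)
        if (maptab[i].user == user)
            return maptab[i].real;
    return user;
}
```

With a table like this, a built-in running inside a redirection can read from fdmap_sketch(0) directly, with no need to fork or to move the shell's own descriptors around.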