This is part fourteen in a series on the 0.3 version of the language spec for the Merg-E Domain Specific Language for the InnuenDo Web 3.0 stack. I'll add more parts to the below list as the spec progresses:
- part 1 : coding style, files, merging, scoping, name resolution and synchronisation
- part 2 : reverse markdown for documentation
- part 3 : Actors and pools.
- part 4 : Semantic locks, blockers, continuation points and hazardous blockers
- part 5 : Semantic lexing, DAGs, prune / ent and alias.
- part 6 : DAGs and DataFrames as only data structures, and inline lambdas for pure compute.
- part 7 : Freezing
- part 8 : Attenuation, decomposition, and membranes
- part 9 : Sensitive data in immutables and future vault support.
- part 10 : Scalars and High Fidelity JSON
- part 11 : Operators, expressions and precedence.
- part 12 : Robust integers and integer bitwidth generic programming
- part 13 : The Merg-E ownership model, capture rules, and the --trustmebro compiler flag.
- part 14 : Actorcitos and structural iterators
- part 15 : Explicit actorcitos, non-inline structural iterators, runtimes, and abstract scheduler pipeline.
- part 16 : Async functions and resources and the full use of InnuenDo VaultFS
- part 17 : RAM-Points, RAM-points normalization bag, and the quota-membrane.
In this blog post we are going to look at a possibly somewhat weird language feature: iterators. Just like semantic locks, the iterators in Merg-E aren't actual iterators in the traditional meaning of the word. They are structural iterators: thin abstractions on top of the Merg-E scheduler and its continuation point based design. They are so structural, in fact, that what look like language primitives, such as while and if(/else), are higher level abstractions on top of these structural iterators. Merg-E doesn't have for loops, because the language tries to be as simple as possible to implement, and a separate abstraction for for loops doesn't seem to add the same level of cognitive value that while and if do.
The user should consider if and while convenient aliases for specific uses, aliases that aid language adoption for new users. While use of if and while isn't discouraged, use of the lower abstraction structural iterators is deemed more idiomatic. Using iterators and actorcitos helps prevent deeply nested flow control inside functions, which is the actual thing that Merg-E considers non-idiomatic.
The actorcito
So far we have seen ephemeral functions with somewhat nondeterministic but usually short lifetimes, and we have seen actors with lifetimes close to that of the program's main process. Now we are going to explore the actorcito: an actor-like callable that, just like the purely computational inline lambdas, is expected to run in a synchronous, blocking way with respect to its invoking parent. That is only the usual case, though. With actorcito pools the user has fine-grained control over its usage in situations where order doesn't matter and borrows and shares aren't needed. So what does an actorcito look like?
Well, just like a function. In fact it is just a function, but one with a very specific fingerprint.
The most basic actorcito will act just like a function:
mutable function uppercase ( c character |||
partial_return callable<character> |||
full_return callable<string> |||
is_last boolean )::{
}{
partial_return( lang.unicode.uppercase c );
};
You can also choose to make it act more like an actor:
mutable string rval;
mutable function uppercase ( c character |||
partial_return callable<character> |||
full_return callable<string> |||
is_last boolean )::{
rval;
}{
rval = rval lang.unicode.uppercase c;
if is_last {
full_return( rval );
};
};
The use of either of the return callables is not needed in most structural iterator usage patterns, but they do need to be part of the fingerprint for the function to be a usable actorcito.
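To make the fingerprint a bit more tangible, here is a rough Python model of the two actorcitos above. This is not Merg-E and not how the runtime works internally; inline_foreach is a hypothetical driver, and the rule that a full return wins over collected partial returns is my assumption for the sake of the sketch.

```python
def uppercase(c, partial_return, full_return, is_last):
    """Stateless actorcito: hands back one uppercased character per call."""
    partial_return(c.upper())


def make_collecting_uppercase():
    """Actor-like actorcito: keeps state and returns once, on the last call."""
    rval = []  # stands in for the captured mutable string rval

    def collecting_uppercase(c, partial_return, full_return, is_last):
        rval.append(c.upper())
        if is_last:
            full_return("".join(rval))

    return collecting_uppercase


def inline_foreach(text, actorcito):
    """Hypothetical driver: invoke the actorcito once per character."""
    partials, fulls = [], []
    for i, c in enumerate(text):
        actorcito(c, partials.append, fulls.append, i == len(text) - 1)
    # Assumption: a full return wins; otherwise partial returns are joined.
    return fulls[-1] if fulls else "".join(partials)


stateless_result = inline_foreach("Mixed", uppercase)
collected_result = inline_foreach("Mixed", make_collecting_uppercase())
```

Both variants end up with the same string here; the difference is only in who holds the intermediate state.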
Like all callables, an actorcito can have an error body as well as a happy path body.
The structural iterator
Now that we have actorcitos, we can create a structural iterator from a string, dataframe or DAG. When we later invoke the iterator in a foreach expression, we can optionally do so using a filter, and the filter has matches and non-matches. By default non-matches are silently discarded, unless the iterator was defined using a non-match actorcito.
Here is the most trivial version of an iterator and its usage:
inert string mixedcase = "Mixed-case String that We are going to iterate over.";
iterator mc_iter = iteratify mixedcase uppercase;
inert string upper = inline foreach mc_iter;
Remember that inline tells us that things are handled in the same execution context as the parent, and that, lacking awaits, they will run non-preemptively. Next to that, inline enables the actual use of the return callables.
Before we get into the complexities, let's first look into what we can create iterators from and what modifiers we can use when creating an iterator. Let's start with the filter. To understand the filter, we need to discuss the special node token first. The __node__ or rather lang.Itterator.__node__ is something that can only be used in a filter definition inside an iteratify:
iterator mc_iter = iteratify mixedcase |||
uppercase_ascii uppercase_unicode |||
iterfilter lang.unicode.isascii __node__;
So what is happening here? We are wiring up our iterator with a kind of if/else logic. We are saying:
- Make an iterator for our mixedcase string
- Use the uppercase_ascii actorcito for each ascii character as defined by our filter
- Use the uppercase_unicode actorcito for each non ascii character as defined by our filter
- Define a filter using the lang.unicode.isascii expression to check if node (a single character) is an ascii character.
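The wiring described in the list above can be modeled in a few lines of Python. Again, this is just a sketch of the routing idea, not Merg-E: foreach_filtered is a hypothetical helper, the two actorcitos are simplified to plain one-argument functions, and str.isascii stands in for lang.unicode.isascii.

```python
def uppercase_ascii(c):
    # Hypothetical fast path: plain ASCII arithmetic on a..z only.
    return chr(ord(c) - 32) if "a" <= c <= "z" else c


def uppercase_unicode(c):
    # Slower general path for everything the filter does not match.
    return c.upper()


def foreach_filtered(source, on_match, on_non_match=None, iterfilter=None):
    """Route each node to the match or the non-match actorcito."""
    out = []
    for node in source:
        if iterfilter is None or iterfilter(node):
            out.append(on_match(node))
        elif on_non_match is not None:
            out.append(on_non_match(node))
        # No non-match actorcito defined: the node is silently discarded.
    return "".join(out)


both = foreach_filtered("héllo wörld", uppercase_ascii, uppercase_unicode, str.isascii)
ascii_only = foreach_filtered("héllo wörld", uppercase_ascii, iterfilter=str.isascii)
```

With both actorcitos wired up, every character is handled by one of them. With only the match actorcito, the non-ascii characters simply drop out of the result.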
We are working with strings. As we discussed before, strings have two faces in Merg-E: the unicode face and the uint8 face. This isn't limited to strings, though: dataframes have string fields and DAG nodes have names, and these have the same two faces. That is why the asciistring modifier is implicitly used if none is defined. For a dataframe or DAG, relying on this implicit default is considered perfectly OK, but for strings, being explicit, while not required, is idiomatic:
iterator mc_iter = iteratify utf8strings mixedcase |||
uppercase_ascii uppercase_unicode |||
iterfilter lang.unicode.isascii __node__;
The utf8strings modifier tells the iterator to treat strings as utf8 strings, not raw strings. If we want to iterate over the actual uint8 bytes, we can say so:
iterator mc_iter = iteratify rawbytes mixedcase |||
uppercase_ascii uppercase_unicode |||
iterfilter lang.unicode.isascii __node__;
Here rawbytes gives us an iterator that works with uint8 bytes rather than characters. Please note that we will need an actorcito function with a different function signature.
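The difference between the two faces, and why the actorcito signature has to change, is easy to see in Python, where a str iterates as characters and its UTF-8 encoding iterates as integers. The function names below are made up for this sketch.

```python
text = "naïve"


def uppercase_char(c: str) -> str:
    # Character face: one call per unicode character.
    return c.upper()


def uppercase_byte(b: int) -> int:
    # Byte face: one call per uint8; only ASCII a..z (0x61..0x7A) is mapped,
    # the bytes of multi-byte sequences are passed through untouched.
    return b - 32 if 0x61 <= b <= 0x7A else b


as_chars = [uppercase_char(c) for c in text]
as_bytes = bytes(uppercase_byte(b) for b in text.encode("utf-8"))
```

The character version sees five nodes, the byte version six, because the ï is encoded as two bytes. The byte version also leaves the ï in lowercase, since it never sees it as a single character.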
There is another modifier: sorted. We shouldn't use it on string iterators because it makes no sense there, but let's do it anyway, just to illustrate, going back to our simplest example:
inert string mixedcase = "Mixed-case String that We are going to iterate over.";
iterator mc_iter = iteratify sorted mixedcase uppercase;
inert string upper = inline foreach mc_iter;
Now when inline foreach is called, the characters in the string are sorted before being given to the actorcito.
The result is going to be eight spaces followed by "-.MSWAAAACDEEEEEEEGGGHIIIINNOOORRRRSTTTTTTVX" (the input contains eight spaces, and they all sort to the front), which is pretty useless but illustrates how things work.
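For those who want to check the letter soup, the same thing can be reproduced in Python, assuming that sorted orders characters by code point (which is what Python's sorted does for single-character strings, and what the result above suggests):

```python
mixedcase = "Mixed-case String that We are going to iterate over."

# Sort the characters first, then uppercase each one, like the actorcito would.
upper = "".join(c.upper() for c in sorted(mixedcase))

# Spaces have the lowest code point in this string, so they all come first.
leading_spaces = len(upper) - len(upper.lstrip(" "))
```

Note that in this model all eight spaces from the input end up at the front of the result, before the dash and the dot.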
Where this sorting "could" be useful is when iterating over (or tree-walking) a DAG, but even here, only use it if you really need it, because sorting isn't free.
iterator tree_iter = iteratify sorted mydag |||
tree_data_size |||
iterfilter lang.Type.data.scalar.type in __node__.types;
This will create an iterator over a DAG, mydag, that traverses the tree part of the DAG (aliases are treated as leaf nodes), depth first, and invokes the actorcito named tree_data_size with each scalar leaf node.
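As a mental model of that traversal, here is a small Python sketch. The Node class, the walk helper and the idea that sorted means "children ordered by name" are all assumptions made for illustration; the filter is reduced to "does this node carry a scalar value".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    value: Optional[str] = None                     # scalar payload, if any
    children: List["Node"] = field(default_factory=list)
    alias_of: Optional["Node"] = None               # set for alias nodes


def walk(node, actorcito, iterfilter):
    """Depth-first walk; aliases are leaves, their targets are not entered."""
    if iterfilter(node):
        actorcito(node)
    if node.alias_of is not None:
        return
    for child in sorted(node.children, key=lambda n: n.name):
        walk(child, actorcito, iterfilter)


sizes = []


def tree_data_size(node):
    # Stand-in actorcito: record the size of each scalar node's data.
    sizes.append((node.name, len(node.value)))


b = Node("b", children=[Node("y", value="hello")])
root = Node("root", children=[Node("z", alias_of=b), b, Node("a", value="abc")])
walk(root, tree_data_size, lambda n: n.value is not None)
```

Because the alias z is treated as a leaf, the node y underneath b is only visited once, through b itself.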
Now that we have covered the basics of DAG and string iterators, let's have a look at dataframes. These are mostly the same, but both our sorting and our filters are a bit more expressive.
iterator df_iter = iteratify myDf sortedby "id,seq" |||
coordinate_split |||
iterfilter __node__.columns.seq != 0;
Now instead of generic sorting we tell the iterator to sort by the id column first, and then by the seq column. And we filter so that we skip the rows where the value of the seq column is zero.
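In Python terms, with a list of dicts standing in for the dataframe, the combination of sortedby and a row filter comes down to something like the sketch below. The foreach_rows helper and the body of coordinate_split are placeholders I made up; what coordinate_split really does is not part of this example.

```python
rows = [
    {"id": 2, "seq": 1},
    {"id": 1, "seq": 2},
    {"id": 1, "seq": 0},
    {"id": 1, "seq": 1},
    {"id": 2, "seq": 0},
]


def foreach_rows(rows, sortedby, actorcito, iterfilter):
    """Sort by the listed columns in order, then feed matching rows through."""
    columns = sortedby.split(",")
    ordered = sorted(rows, key=lambda row: tuple(row[c] for c in columns))
    return [actorcito(row) for row in ordered if iterfilter(row)]


def coordinate_split(row):
    # Placeholder actorcito: just report which row was visited.
    return (row["id"], row["seq"])


visited = foreach_rows(rows, "id,seq", coordinate_split, lambda row: row["seq"] != 0)
```

The rows with seq equal to zero never reach the actorcito, and the remaining rows arrive ordered by id and then by seq.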
Error handling and the if and while abstractions
Next to a reduction in nested flow control constructs like nested ifs and while loops (and the for loops that are missing from the language completely), one extra bonus of actorcitos and inline iterators is that they allow slightly more granular error handling. To illustrate, let's punch through the syntactic sugar of if and while.
function if_body ( node dag |||
partial_return callable<bool> |||
full_return callable<bool> |||
is_last boolean )::{
IFBODY
};
function else_body ( node dag |||
partial_return callable<bool> |||
full_return callable<bool> |||
is_last boolean )::{
ELSEBODY
};
iterator if_iter = iteratify shallow<0> scope |||
if_body else_body |||
iterfilter __node__.export.myvar % 17;
inline foreach if_iter;
Notice the fingerprint is a bit different here because we don't expect to assign anything from the return functions. Instead these booleans are meant for flow control.
The parameterized shallow<0> modifier is important here: it makes the iterator visit only our own scope node, without actually traversing anything.
The above is roughly the equivalent of:
if myvar % 17 {
IFBODY
}{
ELSEBODY
};
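The reason this works is that a filter over exactly one node, with a match and a non-match actorcito, is an if/else. A minimal Python model, with a dict standing in for the scope node and foreach_shallow0 as a hypothetical helper:

```python
def foreach_shallow0(scope, if_body, else_body, iterfilter):
    """shallow<0>: visit only the scope node itself, exactly once."""
    node = scope
    if iterfilter(node):
        if_body(node)      # match actorcito
    else:
        else_body(node)    # non-match actorcito


log = []


def if_body(node):
    log.append(("if", node["myvar"]))


def else_body(node):
    log.append(("else", node["myvar"]))


def myvar_mod_17(node):
    # Non-zero remainder is truthy, so multiples of 17 take the else branch.
    return node["myvar"] % 17


foreach_shallow0({"myvar": 35}, if_body, else_body, myvar_mod_17)
foreach_shallow0({"myvar": 34}, if_body, else_body, myvar_mod_17)
```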
It may seem like a lot of boilerplate, but imagine you have 6 or even 10 levels of nested if/else, making the code harder to reason about than the verbose low abstraction actorcito alternative.
And also, there is no room for granular error handling in this high abstraction equivalent.
Now before we look into error handling, let's dive into the while abstraction.
This may be a bit surprising, but the while condition isn't actually in the filter, as you might expect.
function while_body ( node dag |||
partial_return callable<bool> |||
full_return callable<bool> |||
is_last boolean )::{
partial_return( __node__.export.myvar < 100 );
WHILEBODY
};
iterator while_iter = iteratify shallow<0> scope while_body;
inline foreachloop while_iter;
Notice that instead of foreach we are using foreachloop, which will just keep iterating until a False is injected by either a full_return invocation or a partial_return invocation. This is also the reason for our strange fingerprint with booleans.
So the above is the equivalent of:
while myvar < 100 {
WHILEBODY
};
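A rough Python model of foreachloop could look like the sketch below. One big assumption here is mine and not taken from the spec: to get real while semantics, injecting a False has to end the current invocation right away, so that the body doesn't run one extra time. The sketch models that with a private exception.

```python
class _StopLoop(Exception):
    """Raised when a False is injected through one of the return callables."""


def foreachloop(scope, body):
    """Keep invoking body on the scope node until a False is injected."""

    def inject(flag):
        # Assumption: a False ends the current invocation immediately.
        if not flag:
            raise _StopLoop()

    completed = 0
    while True:
        try:
            body(scope, inject, inject, True)
        except _StopLoop:
            return completed
        completed += 1


def while_body(node, partial_return, full_return, is_last):
    partial_return(node["myvar"] < 100)  # the while condition
    node["myvar"] += 1                   # stands in for WHILEBODY


scope = {"myvar": 97}
runs = foreachloop(scope, while_body)
runs_when_false = foreachloop({"myvar": 100}, while_body)
```

Starting from 97 the body completes three times and then the condition injects a False; starting from 100 the body never runs at all, just like the while loop it models.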
So now for the granular error handling. It is really very simple: every actorcito can define an error body.
function while_body ( node dag |||
partial_return callable<bool> |||
full_return callable<bool> |||
is_last boolean )::{
partial_return( __node__.export.myvar < 100 );
WHILEBODY
}!!{
WHILEERRORBODY
};
This works the same as for any other callable. When using the higher abstraction if and while, this more granular error handling isn't available, which often makes the verbose low abstraction alternative the more idiomatic one. Having said that, if your if or while wouldn't benefit from an error body and the nesting level stays at two or maybe three, then using the higher level abstraction of if and while is completely idiomatic.
Coming up
This was a first post on structural iterators and actorcitos, where we exposed that if and while in Merg-E are actually abstractions on top of actorcitos and structural iterators. We showed that while structural iterators aren't what people usually think of as iterators, they are the easiest to think about as iterators nonetheless. We discussed that while higher abstractions like while and if are great for entry into the language, and are often sufficient and an idiomatic choice, there can be advantages to abandoning these abstractions to avoid deep nesting and unlock more granular error handling.
In an upcoming post I will try to bridge from iterators and actorcitos, that allow for other more concurrent and sometimes messy but fast and parallel solutions, to the parallelism models that different planned runtimes aim to implement, and chances are I'll need a few more posts to tie up a few loose ends.