This is part eleven in a series on the 0.3 version of the language spec for the Merg-E Domain Specific Language for the InnuenDo Web 3.0 stack. I'll add more parts to the below list as the spec progresses:
- part 1 : coding style, files, merging, scoping, name resolution and synchronisation
- part 2 : reverse markdown for documentation
- part 3 : Actors and pools.
- part 4 : Semantic locks, blockers, continuation points and hazardous blockers
- part 5 : Semantic lexing, DAGs, prune / ent and alias.
- part 6 : DAGs and DataFrames as only data structures, and inline lambdas for pure compute.
- part 7 : Freezing
- part 8 : Attenuation, decomposition, and membranes
- part 9 : Sensitive data in immutables and future vault support.
- part 10 : Scalars and High Fidelity JSON
- part 11 : Operators, expressions and precedence.
- part 12 : Robust integers and integer bitwidth generic programming
- part 13 : The Merg-E ownership model, capture rules, and the --trustmebro compiler flag.
- part 14 : Actorcitos and structural iterators
- part 15: Explicit actorcitos, non-inline structural iterators, runtimes, and abstract schedular pipeline.
- part 16 : Async functions and resources and the full use of InnuenDo VaultFS
- part 17 : RAM-Points, RAM-points normalization bag, and the quota-membrane.
In this post we are going to do a bit of a deep dive into the guts of the Merg-E parser. Merg-E doesn't exactly have something like operator overloading, at least not in the way that languages like Python and C++ do. There are hoops you can jump through to make operator overloading actually work, but there are just too many things you need to get exactly right, too many shotguns pointing at your feet, to consider the idea anywhere close to idiomatic. What Merg-E does offer, however, is operator extension in the Unicode space, and that is considered very much idiomatic.
As we wrote before, a Merg-E source file is considered to be UTF-8, though the core language itself and its operators are all defined in the ASCII space. To elaborate on this we need to look into Unicode categories a little bit. Unicode defines 29 different character categories within 7 groups. Most of these are not relevant here, but a few of them are.
The 'C*', 'L*', 'M*', 'N*' and 'Z*' groups are not relevant for us here, but three of the four 'S*' SYMBOL categories and a selected subset of the 'P*' PUNCTUATION categories are up for grabs. Let's look at the categories in question:
| category name | description | count | examples | available |
|---|---|---|---|---|
| Sm | Maths symbols | 948 | ∞ ∩ ⊍ | non-blacklisted |
| Sk | Modifier symbols | 123 | ꜠ ¨ | NO |
| Sc | Currency Symbols | 62 | £ ₱ 𞋿 | non-blacklisted |
| So | Other symbols | 6,431 | ࿌ ᭩ ℧ | non-blacklisted |
| Pd | Dash Punctuation | 25 | ⸻ 〰 | whitelisted |
| Ps | Open Punctuation | 75 | ༼ ༺ | NO |
| Pe | Close Punctuation | 73 | ༽ ༻ | NO |
| Pi | Initial Punctuation | 12 | « ⸉ | NO |
| Pf | Final Punctuation | 10 | » ⸊ | NO |
| Po | Other Punctuation | 593 | ¡ ߷ ༒ | non-blacklisted |
We won't allow symbols that linguistically should be expected in pairs, and characters that merely modify other characters are not allowed either. On top of that there will be a blacklist for characters that resemble core syntax too closely. For example, the blacklist contains 〉 because it resembles > just a little too much, and there are many such examples.
For dash punctuations most characters would be blacklisted, so there we use a tiny whitelist.
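As a rough illustration of the table and the two lists, here is a minimal Python sketch of an eligibility check built on the standard `unicodedata` module. The category sets are copied from the table; the blacklist and whitelist entries, the rejection of ASCII characters, and the function name `is_operator_candidate` are assumptions made for the example, since the actual lists are not part of this post.

```python
import unicodedata

# Availability per general category, copied from the table above.
OPEN_CATEGORIES = {"Sm", "Sc", "So", "Po"}  # usable unless blacklisted
WHITELIST_CATEGORIES = {"Pd"}               # usable only if whitelisted

# Hypothetical list contents; the real lists are not given in this post.
BLACKLIST = {"\u2215"}  # DIVISION SLASH (Sm), assumed too close to '/'
WHITELIST = {"\u3030"}  # WAVY DASH (Pd), assumed acceptable


def is_operator_candidate(ch: str) -> bool:
    """Return True if a single character could become an operator alias."""
    # Assumption: ASCII stays reserved for the core language operators.
    if len(ch) != 1 or ord(ch) < 128:
        return False
    category = unicodedata.category(ch)
    if category in OPEN_CATEGORIES:
        return ch not in BLACKLIST
    if category in WHITELIST_CATEGORIES:
        return ch in WHITELIST
    # Paired punctuation, modifier symbols, letters, digits and so on.
    return False
```

With these assumed lists, `is_operator_candidate("\u2127")` (the ℧ used later in this post) is accepted, while « (Pi) and ¨ (Sk) are rejected by category alone.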
So how do we add an operator? Well, let's start at the end this time, because it is easier to explain that way. The built-in language operators that live in ASCII space are basically static aliases.
For example '+' is an alias for __operator_add__, where __operator_add__, like any other token, will be resolved according to name resolution rules defined earlier in this series.
While we can define new versions of these, and we can even change the name resolution so our version would be picked over the standard language one, if we do that we run into operator precedence logic, or rather expression precedence logic, because of Merg-E's layered operator precedence system that keeps the user from giving their operators higher precedence than built-in ones. More on this later.
The convention in Merg-E is that names starting and ending with double underscore are language internal by default. Redefine them at your own peril, but in this case, if you do, you better redefine every single operator and you better do it for each and every supported type, or you will run into unexpected behaviour in operator precedence.
For self-defined operators, the convention is to use single underscores in names. Let's assume we had defined an expression token named _my_operator_ (don't worry, we will explain how to do this later).
We can add it to the global operator alias list like this:
operatoralias "℧" _my_operator_;
Please note that operatoralias is add-only: no overwriting and no deletion, just extending the language. Unlike most of the rest of the language, there is no attenuation of this power. There is no security benefit to restraining it, because the changed language only extends downward and has no influence on what happens upwards (there is some copy-on-mute magic going on there, as discussed earlier in the membrane attenuation section).
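The add-only, downward-only behaviour can be pictured with a small Python sketch. This is not how the Merg-E implementation stores aliases; the class `OperatorAliases`, its methods, and the parent-chain lookup are hypothetical stand-ins chosen to show that an extension in an inner scope is invisible to the outer one and that existing aliases cannot be replaced.

```python
class OperatorAliases:
    """Add-only alias table; child tables extend a parent without changing it."""

    def __init__(self, parent=None):
        self._parent = parent
        self._own = {}

    def lookup(self, symbol):
        """Resolve a symbol here first, then in the enclosing tables."""
        if symbol in self._own:
            return self._own[symbol]
        if self._parent is not None:
            return self._parent.lookup(symbol)
        return None

    def add(self, symbol, target):
        """Register a new alias; overwriting any visible alias is refused."""
        if self.lookup(symbol) is not None:
            raise ValueError(f"alias {symbol!r} already exists")
        self._own[symbol] = target

    def child(self):
        """Create the table seen by a nested (downward) scope."""
        return OperatorAliases(parent=self)


lang = OperatorAliases()
lang.add("+", "__operator_add__")

scope = lang.child()
scope.add("\u2127", "_my_operator_")  # visible here and further down only
```

After this, `scope.lookup("\u2127")` resolves to `_my_operator_`, `lang.lookup("\u2127")` still returns `None`, and `scope.add("+", ...)` raises because `+` is already taken.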
So after this, depending on what the ℧ actually does and for what types, the following might now be a valid expression:
inert string foo = "apples" ℧ hazardous 74;
It looks like nonsense, and quite frankly for now it is, but it illustrates how things are extendable in Merg-E.
If _my_operator_ is written differently, it might enable code that looks something like:
inert uint32 bar = baz + 7 ℧ 17 14 * quaz + 5;
Let's get ahead of ourselves a bit to show how this might work out; in the following sections we will see why it works out like that.
Imagine we are the Merg-E lexer first, and then the parser. Think of the lexer as producing a flat stream of resolved operator tokens, and the parser as repeatedly collapsing the highest-precedence capturable token. As the lexer, we look at the line above and canonicalize it, resolving names because we are a semantic lexer. Let's ignore the stuff on the left and focus on the right side of the = sign. We shall write it a bit nested to keep things readable, but this is just presentation:
scope.Export.baz
lang.Operator.__operator_add__
7
scope.Export._my_operator_
17
14
lang.Operator.__operator_multiply__
scope.Export.quaz
lang.Operator.__operator_add__
5;
This looks very complex now, but this, plus representing these canonical identifiers as tokens is what the lexer does.
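To make the canonicalization step concrete, here is a small Python sketch that turns the right-hand side of the example into the flat list shown above. It assumes whitespace-separated tokens and uses two hand-written lookup tables, `ALIASES` and `SCOPE`, as stand-ins for the operator alias list and for name resolution; the real semantic lexer is of course more involved.

```python
# Stand-in for the operator alias list (built-ins plus our own addition).
ALIASES = {
    "+": "lang.Operator.__operator_add__",
    "*": "lang.Operator.__operator_multiply__",
    "\u2127": "scope.Export._my_operator_",
}

# Stand-in for name resolution of ordinary identifiers.
SCOPE = {
    "baz": "scope.Export.baz",
    "quaz": "scope.Export.quaz",
}


def canonicalize(source: str) -> list:
    """Map each whitespace-separated word to its canonical form."""
    canonical = []
    for word in source.split():
        if word in ALIASES:
            canonical.append(ALIASES[word])
        elif word in SCOPE:
            canonical.append(SCOPE[word])
        elif word.isdigit():
            canonical.append(int(word))
        else:
            raise NameError(f"cannot resolve {word!r}")
    return canonical


tokens = canonicalize("baz + 7 \u2127 17 14 * quaz + 5")
```

The resulting `tokens` list has the same ten entries, in the same order, as the canonical listing above.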
Now the parser is going to have a look. If the token is an expression token, it has a precedence number that determines which one of this soup of canonicals gets to do its thing first. In our case it is lang.Operator.__operator_multiply__. Now capture rules come into play: lang.Operator.__operator_multiply__ defines a capture of one left, one right, no modifiers. After this first collapse we have:
scope.Export.baz
lang.Operator.__operator_add__
7
scope.Export._my_operator_
17
lang.Operator.__operator_multiply__(14, scope.Export.quaz)
lang.Operator.__operator_add__
5;
The parser looks at the token stream again. Now lang.Operator.__operator_add__ has the highest precedence, but there are two of these, so we go left to right in two steps. That gives us
lang.Operator.__operator_add__(scope.Export.baz, 7)
scope.Export._my_operator_
17
lang.Operator.__operator_multiply__(14, scope.Export.quaz)
lang.Operator.__operator_add__
5;
after the first step, and then, after the second step:
lang.Operator.__operator_add__(scope.Export.baz, 7)
scope.Export._my_operator_
17
lang.Operator.__operator_add__(
lang.Operator.__operator_multiply__(14, scope.Export.quaz),
5
);
Now, finally, it's our _my_operator_'s turn. If we define that _my_operator_ has one left and two right captures, all of uint32, then the result could be:
scope.Export._my_operator_(
lang.Operator.__operator_add__(scope.Export.baz, 7),
17,
lang.Operator.__operator_add__(
lang.Operator.__operator_multiply__(14, scope.Export.quaz),
5
)
);
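The collapsing steps above can be sketched in a few lines of Python. The `Op` record, the `collapse` function, and the concrete precedence numbers for multiply and add are assumptions made for this example (the post only says that both sit in the built-in band and that multiply goes first); the default of 1,048,576 for the user-defined operator comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    precedence: int  # lower number collapses first
    left: int        # operands captured from the left
    right: int       # operands captured from the right


MUL = Op("lang.Operator.__operator_multiply__", 100, 1, 1)  # assumed number
ADD = Op("lang.Operator.__operator_add__", 200, 1, 1)       # assumed number
MY = Op("scope.Export._my_operator_", 1_048_576, 1, 2)      # user default


def collapse(tokens):
    """Repeatedly fold the pending operator with the lowest precedence number."""
    items = list(tokens)
    while True:
        pending = [(t.precedence, i) for i, t in enumerate(items)
                   if isinstance(t, Op)]
        if not pending:
            return items
        _, i = min(pending)  # ties resolve to the leftmost operator
        op = items[i]
        lo, hi = i - op.left, i + op.right + 1
        if lo < 0 or hi > len(items):
            raise SyntaxError(f"{op.name} cannot capture its operands")
        args = items[lo:i] + items[i + 1:hi]
        if any(isinstance(a, Op) for a in args):
            raise SyntaxError(f"{op.name} captured an unresolved operator")
        items[lo:hi] = [(op.name, *args)]  # replace the span by one node


stream = ["scope.Export.baz", ADD, 7, MY, 17, 14, MUL,
          "scope.Export.quaz", ADD, 5]
tree = collapse(stream)[0]
```

The nested tuple in `tree` mirrors the final nested call shown above: the user-defined operator sits at the root, with the left addition, the literal 17, and the right addition as its three operands.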
In reality this would all happen in parsing structures, something close to an Abstract Parse Tree, not in code like here, but we are illustrating the principle. We come last with our user level expression because our default precedence number is 1,048,576 and lower numbers come first. The only operator with a higher precedence number than us in the original line was the =, and while we have the ability to choose lower and higher numbers for our own expression precedence, we can only do so in a band that neither allows us to pick a lower number than * and +, nor a higher number than =.
Let's look at that band some more:
| precedence | min | max | value | example |
|---|---|---|---|---|
| Merg-E high precedence | 0 | 65535 | | * + |
| User defined | 65536 | 2147483647 | | ℧ |
| User default | - | - | 1048576 | |
| Merg-E low precedence | 2147483648 | 4294967295 | | = |
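Read as a rule, the table says that a user-chosen precedence number must fall inside the middle band, and that omitting it yields the default. A minimal Python sketch of that check follows; the function name `user_precedence` and the use of an exception for out-of-band values are illustrative assumptions, while the three numbers come straight from the table.

```python
USER_MIN = 65_536
USER_MAX = 2_147_483_647
USER_DEFAULT = 1_048_576


def user_precedence(requested=None):
    """Return the precedence number for a user-defined expression token."""
    if requested is None:
        return USER_DEFAULT
    if not USER_MIN <= requested <= USER_MAX:
        raise ValueError(
            f"precedence {requested} is outside the user band "
            f"{USER_MIN}..{USER_MAX}"
        )
    return requested
```

So `user_precedence(65_536)` is accepted as the strongest possible user choice, while `user_precedence(65_535)` or `user_precedence(2_147_483_648)` would be refused because they reach into the bands reserved for the language.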
Defining and prioritizing an expression token
In the previous section we just stepped over the definition of _my_operator_, so let's get into the meat of the operator definition and the operator precedence and capture rules.
We need to revisit the inline from our previous discussion of lambdas. Let's make a real example. We start off with the pure compute inlineable from part 6 of this series:
function power (base uint8 exponent uint8 hashazardous boolean callable<uint2048> return)::{
}{
if hashazardous {
return typecast uint2048 lang.math.power typecast float16 base typecast float16 exponent;
}{
mutable uint8 run = 0;
mutable uint2048 rval = 1;
while run < exponent {
rval = rval * base;
run = run + 1;
};
return(rval);
};
};
As we know, if we want to use this function we can actually call it using inline:
int x = inline power( 2 , 11, False);
but we have no use for that right now. Instead we want to register it as an expression token. Note that this token lives under scope.Export, and for our descendants under scope.Imports, so we need to make sure name resolution includes these in order to actually use it.
expressiontoken _power_ = expression leftcapture uint8 power rightcapture uint8 65536 hazardous;
operatoralias "⏻" _power_;
We now have a bit of a puzzle: what exactly is going on? Let's look a bit closer at our function. As we had seen before, the return isn't an actual return but a callable. We knew about that one already, but now there is a boolean called hashazardous. If this boolean is true, we use a language builtin that is actually meant for floating point numbers. This should be safe in theory because we use uint8 numbers, but we are type casting twice, from integers to floats and from floats back to integers, so we are not sure about our implementation. We do know that this hack is way faster than the loop solution, and because the language defines a hazardous modifier and we are creating an expressiontoken, we have the ability to use that modifier if we feel we need to.
Things become clearer when we look at the expression line. The expression line starts with a leftcapture expression indicating we are taking the left most function argument from the left. Then the function name power is there, indicating this is the inlineable pure compute function that we are wrapping. Then a rightcapture follows telling the runtime the second argument should be captured from the right. Then the optional operator precedence, in our case 65536, the lowest number we can possibly choose, thus the highest precedence that is possible for a user defined expressiontoken, and finally a list of modifiers we are able to work with, in this case just hazardous. This last one is then mapped to a boolean indicating if the modifier was actually used.
So let's see it in action:
inert uint2048 mybigone = 42 ⏻ 42;
This will work. However, note that even if we choose the highest precedence we possibly can, we can't out-precede the ASCII operators that the language defines:
inert uint2048 mybigone = 3 + 42 ⏻ 42 + 1;
Because + has a precedence in the 0-65535 range and ⏻ is at 65536, the parser will resolve the additions first. The internal execution will look like: power( (3+42), (42+1), False ). Even at its 'strongest', a user-defined operator respects the core math of the language.
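To tie the expression line and the grouping together, here is a Python stand-in, written under a few stated assumptions. The `ExpressionToken` record and its `apply` method are hypothetical names for whatever the compiler keeps per expression token; `power` mirrors the Merg-E function but uses Python's arbitrary-size integers for uint2048 and Python's double-precision floats where the original casts to float16.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


def power(base: int, exponent: int, hashazardous: bool) -> int:
    """Python stand-in for the Merg-E power function above."""
    if hashazardous:
        # Fast float shortcut (Python doubles here, not float16).
        return int(float(base) ** float(exponent))
    rval = 1
    run = 0
    while run < exponent:
        rval = rval * base
        run = run + 1
    return rval


@dataclass(frozen=True)
class ExpressionToken:
    function: Callable[..., int]
    left_captures: int
    right_captures: int
    precedence: int = 1_048_576
    modifiers: Tuple[str, ...] = ()

    def apply(self, left: Sequence[int], right: Sequence[int],
              used: Sequence[str] = ()) -> int:
        """Call the wrapped function; each known modifier becomes a boolean."""
        if (len(left) != self.left_captures
                or len(right) != self.right_captures):
            raise TypeError("wrong number of captured operands")
        unknown = set(used) - set(self.modifiers)
        if unknown:
            raise TypeError(f"unsupported modifiers: {sorted(unknown)}")
        flags = [name in used for name in self.modifiers]
        return self.function(*left, *right, *flags)


# expression leftcapture uint8 power rightcapture uint8 65536 hazardous
POWER_TOKEN = ExpressionToken(power, 1, 1, 65_536, ("hazardous",))

# 3 + 42 ⏻ 42 + 1: both additions are resolved before the capture.
grouped = POWER_TOKEN.apply([3 + 42], [42 + 1])
```

Here `grouped` is the exact value of 45 to the power 43, computed by the loop. Passing `["hazardous"]` as the third argument switches to the float path, which in this Python stand-in is exact for small cases such as 2 and 11 but loses precision for 45 and 43, which is the kind of risk the modifier is meant to flag.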
Coming up
In this post we looked at what in other languages is operator precedence and operator overloading, but what in Merg-E is a bit different. We looked at how operators in Merg-E are downward extensible but not replaceable sets of aliases to special expression nodes. We saw how we can use this to make user space unicode operators that extend the language downwards in a least authority way. We explored how the language allows operator precedence between user defined operators, but only within a band so as not to disrupt the language primitive operators. We showed how to define capture rules for custom operators, and how to set operator precedence. We even showed how to make our operators respect and absorb modifiers like hazardous.
I'll need at least one more post to talk about parallelism models, iterators, and possibly a few more.