Version 0.3 of the Merg-E language specification : Literal operators & Rational and Complex numbers.

This is part eighteen in a series on the 0.3 version of the language spec for the Merg-E Domain Specific Language for the InnuenDo Web 3.0 stack. I'll add more parts to the list below as the spec progresses:

part 1 : coding style, files, merging, scoping, name resolution and synchronisation
part 2 : reverse markdown for documentation
part 3 : Actors and pools.
part 4 : Semantic locks, blockers, continuation points and hazardous blockers
part 5 : Semantic lexing, DAGs, prune / ent and alias.
part 6 : DAGs and DataFrames as only data structures, and inline lambdas for pure compute.
part 7 : Freezing
part 8 : Attenuation, decomposition, and membranes
part 9 : Sensitive data in immutables and future vault support.
part 10 : Scalars and High Fidelity JSON
part 11 : Operators, expressions and precedence.
part 12 : Robust integers and integer bitwidth generic programming
part 13 : The Merg-E ownership model, capture rules, and the --trustmebro compiler flag.
part 14 : Actorcitos and structural iterators
part 15 : Explicit actorcitos, non-inline structural iterators, runtimes, and abstract scheduler pipeline.
part 16 : async functions and resources and the full use of InnuenDo VaultFS
part 17: RAM-Points, RAM-points normalization bag, and the quota-membrane.
part 18: Literal operators & Rational and Complex numbers.

In this post I'm discussing a last-minute patch to the v0.3 language spec. This addition is needed now because delaying it to a later version would break backward compatibility for operator extension.

Moving the core language / user extension boundary

In the previous v0.3 language spec post we defined three things:

The core language lives completely in ascii codespace
Overloadable operators live in specific groups of the unicode space past 127
Naming tokens start with an alpha character (optionally uppercase) or with one or two underscores, followed by lowercase alphanumerics and a matching number of underscores at the end.

But while I was drafting the 0.4 spec, there were two pesky characters that would break all of this: the '÷' at unicode codepoint 247, and the '¡' at codepoint 161.

For this reason, and to give the core language some more overall space, we are making a big change to these rules.

The core language lives completely in latin-1 codespace
Overloadable operators live in specific groups of the unicode space past 255; the 128..255 range is now reserved for future core language usage.
Naming tokens can contain characters in the 223..255 range as small letters and in the 192..222 range as capital letters.
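
To make the new boundaries concrete, here is a minimal Python sketch that sorts a single character into the regions listed above. It is not the actual Merg-E lexer: the function name `classify_char` and the returned labels are hypothetical, and the sketch assumes that the two literal operators take priority over the letter ranges, because ÷ (247) falls inside the 223..255 range and the rules above do not spell out how that overlap is resolved.

```python
# Hypothetical sketch of the revised codespace rules; not the actual Merg-E lexer.
LITERAL_OPERATORS = {"÷", "¡"}  # codepoints 247 and 161


def classify_char(ch: str) -> str:
    """Sort one character into the regions described by the new rules."""
    cp = ord(ch)
    if ch in LITERAL_OPERATORS:
        # Assumption: the literal operators win over the letter ranges.
        return "literal operator"
    if cp <= 127:
        return "core ascii"
    if 192 <= cp <= 222:
        return "capital letter in naming tokens"
    if 223 <= cp <= 255:
        return "small letter in naming tokens"
    if cp <= 255:
        return "reserved for the core language"
    # Only specific groups past 255 hold overloadable operators.
    return "past 255 (user extension space)"
```

With this sketch, 'é' (233) counts as a small letter, 'É' (201) as a capital letter, and '§' (167) lands in the reserved part of the latin-1 range.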

When making this change, we aren't just going to be doing so in preparation for 0.4, we are going to be practical and actually pull the ÷ and the ¡ into the language.

So what do we need the ÷ and the ¡ for? We need them for a third and a fourth base type of numbers.

Rational numbers

Merg-E being a language that tries to make cryptography a first class citizen of the runtime, we need a solid foundation of numerical types, and floats and ints aren't quite enough, even if we have bigger floats and ints than most languages (up to int16384, uint16384 and float256). What we need is a type for rational numbers. This is not a Merg-E language quirk; many number-friendly languages already have a rational type.

This is where we need the ÷ character. If we want to write for example ⅓ as a rational literal, we will do so using the ÷ character.

inert rational16 third = 1 ÷ 3;

This notation doesn't just call for the codespace update. It also calls for a new concept, the concept of a literal operator. We are going to define ÷ as an operator, effectively a high precedence operator, but more specifically as a literal operator, meaning that most operator expressions using it are semantically invalid. Take for example:

inert int8 three = 3;
inert rational16 third = 1 ÷ three;

This is invalid at semantic lex time already because ÷ is a literal operator and three is a name token, not a literal.

To fully understand how a literal operator differs from a regular operator we need to consider the three phases that Merg-E uses to lex and parse the source code. Rather than having a strict lexer phase and a strict parser phase, parsing is concentrated in the third (skin) phase, but partial parsing happens in the first two phases (bones and muscle) too. Lexing is divided over the first two phases. In the bones phase, the overall program structure is turned into a tree and literals are lexed and normalized, and this is the phase where we will process literal operators. Because this happens this early, conceptually we can consider literal operators as very high precedence operators, but technically they don't have precedence numbers. In the muscle phase, symbol resolution happens according to the capture and least authority rules of the language. Then in the skin phase the final parsing happens, which includes operator resolution and operator precedence.
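
The early folding described above can be sketched in Python. This is only an illustration under stated assumptions, not the actual bones implementation: the `(kind, value)` token tuples, the `fold_rational_literals` function and the `SemanticLexError` exception are all hypothetical names, and Python's `Fraction` stands in for a Merg-E rational.

```python
from fractions import Fraction


class SemanticLexError(Exception):
    """Hypothetical error for misuse of a literal operator."""


def fold_rational_literals(tokens):
    """Fold `int ÷ int` token triples into one rational literal token.

    Tokens are hypothetical (kind, value) tuples, with kind being
    "int", "name" or "op". Regular operators are left untouched so a
    later phase can resolve them with normal precedence rules.
    """
    out = []
    i = 0
    while i < len(tokens):
        kind, value = tokens[i]
        if kind == "op" and value == "÷":
            if not out or i + 1 >= len(tokens):
                raise SemanticLexError("÷ needs a literal on both sides")
            left = out.pop()
            right = tokens[i + 1]
            if left[0] != "int" or right[0] != "int":
                # A name token such as `three` is rejected right here.
                raise SemanticLexError("÷ only accepts integer literals")
            out.append(("rational", Fraction(left[1], right[1])))
            i += 2  # skip the operator and its right operand
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Because the fold happens before any regular operator is looked at, `1 ÷ 3 + x` reaches the later phases as a single rational literal followed by `+ x`, which is why no precedence number is needed for ÷ in this sketch.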

Under the hood a rational is a combination of a uint and an equally sized int. Rationals exist for all power-of-two sizes, from rational16, which consists of an int8 and a uint8, up to rational32768.
But just as with integers, virtual rationals are defined from Vrational4 and up.

It is easier to think of ÷ not as an operator but as part of an expression defining the rational literal. Doing so also helps make the use of virtual rationals a bit more intuitive.

Let's look at an example:

inert rational16 myconst = 256 ÷ 48;

You may think this shouldn't fit. The signed integer 256 takes up 10 bits, so intuitively this should be a Vrational20, which doesn't fit in a rational16 but needs a rational32.
But 256 and 48 have a greatest common divisor of 16, so after normalization this expression can be written as:

inert rational16 myconst = 16 ÷ 3;

16 takes up 6 bits as a signed integer, so this fits in a Vrational12, which in turn fits in a rational16.
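
The arithmetic of this example can be checked with a small Python sketch. The function names are hypothetical, and the sketch assumes that the numerator is the signed half, that the denominator is the unsigned half, and that negative numerators are counted in two's complement; only the 256 ÷ 48 example itself comes from the text above.

```python
from math import gcd


def signed_bits(n: int) -> int:
    """Bits needed for n as a signed (two's complement) integer."""
    magnitude = n if n >= 0 else -n - 1
    return magnitude.bit_length() + 1


def normalize(num: int, den: int) -> tuple:
    """Divide out the greatest common divisor."""
    g = gcd(num, den)
    return num // g, den // g


def virtual_rational_bits(num: int, den: int) -> int:
    """Width of the smallest virtual rational with two equally sized halves."""
    if den <= 0:
        raise ValueError("denominator must be a positive integer")
    return 2 * max(signed_bits(num), den.bit_length())


def concrete_rational_bits(vbits: int) -> int:
    """Smallest power-of-two rational size, starting at rational16."""
    size = 16
    while size < vbits:
        size *= 2
    if size > 32768:
        raise OverflowError("does not fit in rational32768")
    return size


raw = virtual_rational_bits(256, 48)            # 20, so a Vrational20
num, den = normalize(256, 48)                   # (16, 3)
reduced = virtual_rational_bits(num, den)       # 12, so a Vrational12
print(concrete_rational_bits(raw), concrete_rational_bits(reduced))
```

Without normalization the literal would need a rational32; after dividing out the common factor of 16 it fits in a rational16.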

It is important to note that operators on integers and rationals result in rationals and operators on rationals and floats result in floats.
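
Python's standard `fractions` module happens to follow the same promotion rule, so it can serve as an analogy. This is Python behaviour, not Merg-E semantics, and the variable names are only for illustration.

```python
from fractions import Fraction

third = Fraction(1, 3)

# Integer combined with rational stays rational.
with_int = third + 2

# Rational combined with float becomes a float.
with_float = third * 0.5

print(type(with_int).__name__, type(with_float).__name__)
```

The exactness of the rational survives any mix with integers, but is lost as soon as a float enters the expression.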

Complex numbers

Now that we have discussed ÷ we need to look at ¡ and a fourth family of numbers, complex numbers. While less fundamental to crypto, as far as I know, complex numbers complete number theory. Getting into the theory of complex numbers goes beyond the scope of this blog post, but in short a complex number consists of a real part and an imaginary part.

In math, a complex number is usually written like this, for example:

17 + 12i

Because the + sign is already an operator, and 12i isn't a valid literal, we use the ¡ literal operator to compose complex literals:

inert complex32 mycomplex = 17 ¡ 12;

Like rationals, a complex number is composed of two more fundamental numbers, in this case floats. A complex32 is composed of a real float16 part and an imaginary float16 part. This allows us to have complex numbers from complex32 up to complex512.
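
The size relation can be sketched in a few lines of Python. The names `complex_component` and `fold_complex_literal` are hypothetical, the list of float widths between float16 and float256 is an assumption (only the two end points follow from the text), and Python's built-in `complex` stands in for the Merg-E type.

```python
# Assumed float widths; only float16 and float256 follow from the post.
FLOAT_BITS = (16, 32, 64, 128, 256)


def complex_component(type_name: str) -> str:
    """Return the float type used for both halves of a complex type."""
    prefix = "complex"
    if not type_name.startswith(prefix):
        raise ValueError("not a complex type name")
    bits = int(type_name[len(prefix):])
    half = bits // 2
    if bits % 2 or half not in FLOAT_BITS:
        raise ValueError("unsupported complex size")
    return f"float{half}"


def fold_complex_literal(real, imag) -> complex:
    """Python stand-in for the literal `real ¡ imag`."""
    return complex(real, imag)


print(complex_component("complex32"), fold_complex_literal(17, 12))
```

In this sketch `17 ¡ 12` corresponds to what Python writes as `17+12j`.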

I'm not going to go into details on this spec for now, and expanding integer length generics to rational length generics is a subject that might need to wait till v0.4 of the language spec.

The future: rational complex numbers?

It might be a good idea to add a fifth numeric type, specifically for rational complex numbers, or it might be futile. That remains a possibility for future versions of the language spec. To keep the option open, we will implement the bones code for ÷ and ¡ in such a way that the following expression is valid for bones but invalid for skin:

sensitive crat2048 myconst = 199 ÷ 1001 ¡ 83 ÷ 77;

As stated, while it might be good to have high precision complex numbers, there is a lot of π and e (which are irrational numbers) in complex math, eliminating most but not all practical use cases. Lattice cryptography seems like it could have use for complex rationals, but whether that narrow use case is enough to add them to the language remains a decision for version 0.4 or 0.5 of the language spec.

Coming up

Once more, this post was unplanned, at least for the 0.3 version of the spec. Leaving the move of the core vs user defined boundary in unicode to a later version would have broken backwards compatibility too much without justification.

After this post, my priority remains with completing InnuenDo VaultFS and the bones/muscle/skin phases for Merg-E, the parser, and the scheduler pipeline (Yggdrasyl and Níðhöggr) for the development runtime, so expect posts on my progress first before extensions to this series on the v0.3 language specs.
