This is a description of implementation specific comun bytecode, which from now on we will just call comun bytecode. Note that this bytecode is NOT part of comun specification, it is a specific invention of this particular implementation -- other implementations may use this bytecode or a different bytecode or even no bytecode at all.
Fixed-size 2 byte (16 bits) instructions are used. The bytecode supports type environments 0, 8, 16 and 32 (this can't easily be extended, for a support of more type environments a different bytecode would have to be designed). Except for header everything (even data in code and metadata information) is encoded as a type of instruction, i.e. in the bytecode section each aligned 2 byte sequence can be read and interpreted as an instruction, without knowing the context. Not all instructions represent a direct action, e.g. instructions recording continuation of immediate constant or meta information.
All memory cells and pointer addresses are implicitly considered to be zeros, so it is e.g. not necessary to explicitly initialize a specific pointer to zero.
Jump address is address of an instruction that will be executed in the next cycle.
It is recommended for files containing binary form of this bytecode to have the extension .cmb.
The bytecode starts with a header of length 8 bytes whose meaning and possible values are following:
After the header instruction section follows. It consists of a sequence of instructions which end with the END
instruction.
After the instruction section any data may follow which have no meaning from the bytecode point of view but may be used to store extra information, e.g. debug symbols etc.
Well formed bytecode should follow these rules:
DES
instructions immediately precede all higher level constructs to mark them (e.g. signify that a jump is start of an if statement).END
instruction, at the end of bytecode (to halt program early use jump instruction to skip to the END
instruction.)JMA
instruction that skips it (this jump follows right after the function's DES
instruction). There must be exactly one RET
instruction in the function, at its end (to return early use jump instruction to skip to the RET
instruction). After the initial JMA
instructions NOP
instructions may follow and then the first non-NOP
instruction's address will be the one that's stored in CAL
instructions that call this function (this rule simplifies translation of the bytecode to specific languages).PSC
instructions, then executes the INI
instruction and only then the actual main program starts being executed. The initialization phase can happen e.g. at the very beginning of the bytecode or inside a special initialization function that's called right at start.Each instruction is of the following format:
TTTTUUMM EEPFNNNN
-------- --------
msb lsb msb lsb
1st byte 2nd byte
TTTTUUMM
: opcode, identifies instruction, is of format:
TTTT
< 2 the instruction is special and UUMM
has no special meaning.TTTT
>= 2) the opcode represents a typical stack operation and the following holds:TTTTUU
: Identifies instruction type.MM
: Identifies instruction mode, possible values are:
00
: pop 2, push 101
: pop 1, use immediate constant C, push 110
: pop 1, push 111
: push 1EE
: Type environment, possible values are:
00
: environment 001
: environment 810
: environment 1611
: environment 32P
: If 1
, the instruction is non-popping, otherwise it potetnially pops.F
: If 1
, immediate constant (C) continues in the next word (which has to be COC
), otherwise not.NNNN
: Immediate constant (C) bits, with rightmost bit being least significant.Immediate constants: some instructions make use of immediate constants embedded in the instruction code. If an instruction uses immediate constant C, its bits are held in the NNNN
part of the instruction; if these bits don't suffice for the constant, further higher bits will follow in immediately following COC
instructions. Some instructions use two immediate constants C1 and C2 -- in this case these are encoded the same way: lower half of what would normally be C represents C1, the upper half C2.
Sign interpretation of constants: if an instruction is to interpret immediate constant as signed, it will do so in the following way -- the signed value is the value of the constant in B bit two's complement encoding where B is 4 for if the instruction is PAC, otherwise 32 if the instruction's type environment is 0, otherwise it is equal to the number of bits of the type environment.
Pointer indexing: pointer indices N = 0 to 15, including both bounds, mean the special values stack top minus N. Pointer with index 16 stands for the first user pointer, 17 for second user pointer etc.
Let x, y and z be the values at top, top - 1 and top - 2 addresses before the operation is performed, respectively. C, C1 and C2 represent immediate constants. sig()
signifies interpretation of value as signed, ptrs
is the pointer table in given type environment, mem
is the memory of given type environment.
opcode | mnemonic | group | description |
---|---|---|---|
00000000 |
END |
end, end of bytecode | |
00000001 |
NOP |
no operation, do nothing | |
00000010 |
DES |
description, holds meta information | |
00000011 |
COC |
constant continuation, extra bits for C | |
00000100 |
ERR |
error, raise error specified by C | |
00000111 |
CAL |
call, call function at address C | |
00001000 |
CAE |
call external, call external function C | |
00001001 |
RET |
return, return from function | |
00001010 |
JIA |
jump if address, pop, if x then jump to addr. C | |
00001011 |
JNA |
jump if not address, pop, if not x, jump to addr. C | |
00001100 |
JMA |
jump address, jump to addr. C | |
00001111 |
INI |
initialize, do initialization (push args etc.) | |
00010000 |
PSC |
pointer set constant, ptrs[C1] = C2 |
|
00010001 |
PAC |
pointer add constant, ptrs[C1] += sig(C2) |
|
00010010 |
PAX |
pointer add, pop, ptrs[C] += sig(x) |
|
00010011 |
PCO |
pointer copy, ptrs[C1] = ptrs[C2] |
|
00010100 |
MEX |
memory set, pop, mem[ptrs[C]] = x |
|
00010101 |
MGE |
memory get, push mem[ptrs[C]] |
|
00010110 |
PCM |
pointer compare, push 0/1/2 if ptrs[C2] =/>/< ptrs[C1] |
|
00010111 |
PUX |
push xth, push *x*th value below stack top | |
00011010 |
CON |
constant, push C, pop; NOTE: pop is needed for optims. | |
00011011 |
CND |
conditional, pop 3, push z ? y : x |
|
00011100 |
SWP |
swap, pop 2, push x then y | |
00011101 |
TRA |
transfer, pop, mem[C][$] = x |
|
00011110 |
POP |
pop, pop C + 1 values | |
00011111 |
OUT |
output, pop and output | |
00100000 |
ADX |
AD |
add, pop 2, push y + x |
00100001 |
ADC |
AD |
add constant, pop 1, push x + C |
00100100 |
SUX |
SU |
subtract, pop 2, push y - x |
00100101 |
SUC |
SU |
subtract constant, pop 1, push x - C |
00101000 |
MUX |
MU |
multiply, pop 2, push y * x |
00101001 |
MUC |
MU |
multiply constant, pop 1, push x * C |
00101100 |
DIX |
DI |
divide, pop 2, push y / x |
00101101 |
DIC |
DI |
divide constant, pop 1, push x / C |
00110000 |
DSX |
DS |
divide signed, pop 2, push sig(y) / sig(x) |
00110001 |
DSC |
DS |
divide signed constant, pop 1, push sig(x) / sig(C) |
00110100 |
MOX |
MO |
modulo, pop 2, push y % x |
00110101 |
MOC |
MO |
modulo constant, pop 1, push x % C |
00111000 |
MSX |
MS |
modulo signed, pop 2, push sig(y) % sig(x) |
00111001 |
MSC |
MS |
modulo signed constant, pop 1, push sig(x) % sig(C) |
01000000 |
GRX |
GR |
greater, pop 2, push y > x |
01000001 |
GRC |
GR |
greater constant, pop 1, push x > C |
01000100 |
GEX |
GE |
greater or equal, pop 2, push y >= x |
01000101 |
GEC |
GE |
greater or equal constant, pop 1, push x >= C |
01001000 |
SMX |
SM |
smaller, pop 2, push y < x |
01001001 |
SMC |
SM |
smaller constant, pop 1, push x < C |
01001100 |
SEX |
SE |
smaller or equal, pop 2, push y <= x |
01001101 |
SEC |
SE |
smaller or equal constant, pop 1, push x <= C |
01010000 |
GSX |
GS |
greater signed, pop 2, push sig(y) > sig(x) |
01010001 |
GSC |
GS |
greater signed constant, pop 1, push sig(x) > sig(C) |
01010100 |
BSX |
BS |
greater or equal signed, pop 2, push sig(y) >= sig(x) |
01010101 |
BSC |
BS |
greater or equal signed constant, pop 1, push sig(x) >= sig(C) |
01011000 |
SSX |
SS |
smaller signed, pop 2, push sig(y) < sig(x) |
01011001 |
SSC |
SS |
smaller signed constant, pop 1, push sig(x) < sig(C) |
01011100 |
LSX |
LS |
smaller or equal signed, pop 2, push sig(y) <= sig(x) |
01011101 |
LSC |
LS |
smaller or equal signed constant, pop 1, push sig(x) <= sig(C) |
01100000 |
EQX |
EQ |
equals, pop 2, push y == x |
01100001 |
EQC |
EQ |
equals constant, pop 1, push x == C |
01100100 |
NEX |
NE |
not equals, pop 2, push y != x |
01100101 |
NEC |
NE |
not equals constant, pop 1, push x != C |
01101000 |
BAX |
BA |
binary and, pop 2, push y & x |
01101001 |
BAC |
BA |
binary and constant, pop 1, push x & C |
01101100 |
BOX |
BO |
binary or, pop 2, push `y |
01101101 |
BOC |
BO |
binary or constant, pop 1, push `x |
01110000 |
BXX |
BX |
binary xor, pop 2, push y ^ x |
01110001 |
BXC |
BX |
binary xor constant, pop 1, push x ^ C |
01110100 |
LAX |
LA |
logical and, pop 2, push y && x |
01110101 |
LAC |
LA |
logical and constant, pop 1, push x && C |
01111000 |
LOX |
LO |
logical or, pop 2, push `y |
01111001 |
LOC |
LO |
logical or constant, pop 1, push `x |
01111100 |
LXX |
LX |
logical xor, pop 2, push y ^^ x |
01111101 |
LXC |
LX |
logical xor constant, pop 1, push x ^^ C |
10000010 |
BNO |
BN |
binary not, pop 1, push ~x |
10000100 |
SRX |
SR |
shift right, pop 2, push y >> x |
10000101 |
SRC |
SR |
shift right constant, pop 1, push x >> C |
10001000 |
SLX |
SL |
shift left, pop 2, push y << x |
10001001 |
SLC |
SL |
shift left constant, pop 1, push x << C |
11110011 |
ADR |
address, push address of stack top | |
11111011 |
INU |
input unfinished, push 0 if input has finished, else 1 | |
11111111 |
INP |
input, read and push value |
Possible C values for the DES
instruction are:
Here there will be additional notes and examples to give more clarity about the bytecode.
Instruction example: let's analyze an instruction encoded as 00100001 00000001
. The first byte (00100001
) is opcode, the highest 4 bits of the opcode (0010
) are greater or equal to value 2, therefore the instruction is a typical stack operation. Thanks to this the lowest two bits of the opcode tell us more about the instruction operation mode; these bits are are 01
which signifies a mode by which instruction pops 1 value from the stack, makes use of an immediate constant (stored right in the instruction) and finally pushes 1 value back onto the stack. From the instruction table above we see the opcode actually stands for ADC
-- add constant -- the instruction's semantic is to pop a value from the stack, add a constant to it (the immediate constant stored in instruction itself) and push the result back onto the stack. Now for the second part of the instruction (00000001
): the two highest bits are 0, meaning the instruction operates in type environment 0 (the default one). The 3rd highest bit is also 0, meaning the instruction is NOT prevented from popping, i.e. it may perform a pop (and it will, by the definition of the ADC
instruction). The 4th highest bit is also 0, saying the immediate constant stored in the instruction will NOT continue, i.e. the constant is small enough to fit into this single instruction. The remaining lower 4 bits (0001
) store the immediate constant itself, in this case 1. To sum up, this instruction will work on the stack in type environment 0, it will pop one value from it, add 1 to it and push the result back onto the stack.
Instruction example 2: now let's analyze a more complex sequence of bytes:
00010001 11010000 00000011 00010001 00000011 00011111 00000011 00000000
I0a I0b I1a I1b I2a I2b I3a I3b
We may look up the sequence of opcodes here is PAC COC COC COC
, i.e. the sequence starts with a pointer add constant instruction and is followed by three constant continuation instructions that will hold additional data for the PAC
instruction. The PAC
instruction is a special one (because its opcode's highest 4 bits, 0001
, are lower than the value 2), so we can't reason anything from its opcode alone, we have to look up its definition to see what exactly it does and what operands it takes. In this case we see the instruction uses two immediate constants, denoted C1 and C2, one will say an index of a pointer and the other a constant to add to this pointer. Whenever an instruction utilizes two immediate constants, they are actually stored as one immediate constant which is to be split in the half to obtain the two constants -- this is why we have so many COC
instructions following here. We see that the 4th highest bit in bytes I0b, I1b and I2b is set, which indicates an immediate constant continuation, while the same bit in byte I3b is 0, meaning the constant stops there. So to obtain the whole immediate constant we chain the immediate constants (lowest 4 bits) stored in I3b, I2b, I1b and I0b to get 0000111100010000
; splitting this into two gives us the two immediate constants C2 = 00001111
and C1 = 00010000
. C1 holds decimal value 16, which by the described pointer indexing rules signifies the first user defined pointer (lower indices signify special stack top offset pointers). C2 is for the PAC
instruction to be interpreted as a 4 bit two's complement signed value, i.e. in this case 1111
= -1. So the whole instruction adds a value -1 to the first user pointer (i.e. shifts it one address left).
When implementing the bytecode, be careful especially about the following:
CON
instruction performs not only push of a constant, but also immediately pops it, unless its non-popping bit is set. This is to allow optimization of e.g. CON' 2; MUX
to CON 2; MUC 2
-- the latter performs multiplication by immediate constant instead of a value on the stack (something potentially simpler to optimize), however the value 2 still has to be pushed onto the stack to keep the behavior consistent with the former. The fact that the CON
instruction may perform a pop here allows us to just replace two instructions with other two instructions, without shifting the rest of the bytecode.CON
instruction performs pop and push in different order from other instructions (first pushes, then pops)! Treating CON
with code that handles other instructions may therefore lead to a bug.CON 255
(as 255 and -1 are the same values there), in type env. 0 integer size is unknown, so the same has to be done differently, for example as CON 0; SUC 1
.DSC 4294967295
under assumption that integer size will be e.g. 64 bits, this instruction will still consider 32bit two's complement and so the value here means always -1.INI
instruction which will most likely push program arguments on the type env. 0 stack.