bytecode.md 21 KB

Comun Bytecode

This is a description of implementation specific comun bytecode, which from now on we will just call comun bytecode. Note that this bytecode is NOT part of comun specification, it is a specific invention of this particular implementation -- other implementations may use this bytecode or a different bytecode or even no bytecode at all.

Basics

Fixed-size 2 byte (16 bits) instructions are used. The bytecode supports type environments 0, 8, 16 and 32 (this can't easily be extended, for a support of more type environments a different bytecode would have to be designed). Except for header everything (even data in code and metadata information) is encoded as a type of instruction, i.e. in the bytecode section each aligned 2 byte sequence can be read and interpreted as an instruction, without knowing the context. Not all instructions represent a direct action, e.g. instructions recording continuation of immediate constant or meta information.

All memory cells and pointer addresses are implicitly considered to be zeros, so it is e.g. not necessary to explicitly initialize a specific pointer to zero.

Jump address is address of an instruction that will be executed in the next cycle.

It is recommended for files containing binary form of this bytecode to have the extension .cmb.

Overall Structure

The bytecode starts with a header of length 8 bytes whose meaning and possible values are following:

  1. 67 ('C')
  2. 66 ('B')
  3. reserved
  4. reserved
  5. checksum: Sum of all bytes in the instruction section modulo 256.
  6. payload: This can be used to store any information by the compiler (e.g. its signature etc.).
  7. reserved
  8. reserved

After the header instruction section follows. It consists of a sequence of instructions which end with the END instruction.

After the instruction section any data may follow which have no meaning from the bytecode point of view but may be used to store extra information, e.g. debug symbols etc.

Well Formed Bytecode

Well formed bytecode should follow these rules:

  • There are no instructions in the bytecode that aren't defined in this document.
  • Bits in instruction codes that have no meaning and reserved header bits are set to 0.
  • DES instructions immediately precede all higher level constructs to mark them (e.g. signify that a jump is start of an if statement).
  • There is exactly one END instruction, at the end of bytecode (to halt program early use jump instruction to skip to the END instruction.)
  • Function code is preceded with JMA instruction that skips it (this jump follows right after the function's DES instruction). There must be exactly one RET instruction in the function, at its end (to return early use jump instruction to skip to the RET instruction). After the initial JMA instructions NOP instructions may follow and then the first non-NOP instruction's address will be the one that's stored in CAL instructions that call this function (this rule simplifies translation of the bytecode to specific languages).
  • The bytecode is made so that when run it firstly sets initial pointer addresses with PSC instructions, then executes the INI instruction and only then the actual main program starts being executed. The initialization phase can happen e.g. at the very beginning of the bytecode or inside a special initialization function that's called right at start.

Instruction Format

Each instruction is of the following format:

TTTTUUMM EEPFNNNN
-------- --------
msb  lsb msb  lsb
1st byte 2nd byte
  • TTTTUUMM: opcode, identifies instruction, is of format:
    • If TTTT < 2 the instruction is special and UUMM has no special meaning.
    • Else (if TTTT >= 2) the opcode represents a typical stack operation and the following holds:
    • TTTTUU: Identifies instruction type.
    • MM: Identifies instruction mode, possible values are:
      • 00: pop 2, push 1
      • 01: pop 1, use immediate constant C, push 1
      • 10: pop 1, push 1
      • 11: push 1
  • EE: Type environment, possible values are:
    • 00: environment 0
    • 01: environment 8
    • 10: environment 16
    • 11: environment 32
  • P: If 1, the instruction is non-popping, otherwise it potetnially pops.
  • F: If 1, immediate constant (C) continues in the next word (which has to be COC), otherwise not.
  • NNNN: Immediate constant (C) bits, with rightmost bit being least significant.

Immediate constants: some instructions make use of immediate constants embedded in the instruction code. If an instruction uses immediate constant C, its bits are held in the NNNN part of the instruction; if these bits don't suffice for the constant, further higher bits will follow in immediately following COC instructions. Some instructions use two immediate constants C1 and C2 -- in this case these are encoded the same way: lower half of what would normally be C represents C1, the upper half C2.

Sign interpretation of constants: if an instruction is to interpret immediate constant as signed, it will do so in the following way -- the signed value is the value of the constant in B bit two's complement encoding where B is 4 for if the instruction is PAC, otherwise 32 if the instruction's type environment is 0, otherwise it is equal to the number of bits of the type environment.

Pointer indexing: pointer indices N = 0 to 15, including both bounds, mean the special values stack top minus N. Pointer with index 16 stands for the first user pointer, 17 for second user pointer etc.

Instruction List

Let x, y and z be the values at top, top - 1 and top - 2 addresses before the operation is performed, respectively. C, C1 and C2 represent immediate constants. sig() signifies interpretation of value as signed, ptrs is the pointer table in given type environment, mem is the memory of given type environment.

opcode mnemonic group description
00000000 END end, end of bytecode
00000001 NOP no operation, do nothing
00000010 DES description, holds meta information
00000011 COC constant continuation, extra bits for C
00000100 ERR error, raise error specified by C
00000111 CAL call, call function at address C
00001000 CAE call external, call external function C
00001001 RET return, return from function
00001010 JIA jump if address, pop, if x then jump to addr. C
00001011 JNA jump if not address, pop, if not x, jump to addr. C
00001100 JMA jump address, jump to addr. C
00001111 INI initialize, do initialization (push args etc.)
00010000 PSC pointer set constant, ptrs[C1] = C2
00010001 PAC pointer add constant, ptrs[C1] += sig(C2)
00010010 PAX pointer add, pop, ptrs[C] += sig(x)
00010011 PCO pointer copy, ptrs[C1] = ptrs[C2]
00010100 MEX memory set, pop, mem[ptrs[C]] = x
00010101 MGE memory get, push mem[ptrs[C]]
00010110 PCM pointer compare, push 0/1/2 if ptrs[C2] =/>/< ptrs[C1]
00010111 PUX push xth, push *x*th value below stack top
00011010 CON constant, push C, pop; NOTE: pop is needed for optims.
00011011 CND conditional, pop 3, push z ? y : x
00011100 SWP swap, pop 2, push x then y
00011101 TRA transfer, pop, mem[C][$] = x
00011110 POP pop, pop C + 1 values
00011111 OUT output, pop and output
00100000 ADX AD add, pop 2, push y + x
00100001 ADC AD add constant, pop 1, push x + C
00100100 SUX SU subtract, pop 2, push y - x
00100101 SUC SU subtract constant, pop 1, push x - C
00101000 MUX MU multiply, pop 2, push y * x
00101001 MUC MU multiply constant, pop 1, push x * C
00101100 DIX DI divide, pop 2, push y / x
00101101 DIC DI divide constant, pop 1, push x / C
00110000 DSX DS divide signed, pop 2, push sig(y) / sig(x)
00110001 DSC DS divide signed constant, pop 1, push sig(x) / sig(C)
00110100 MOX MO modulo, pop 2, push y % x
00110101 MOC MO modulo constant, pop 1, push x % C
00111000 MSX MS modulo signed, pop 2, push sig(y) % sig(x)
00111001 MSC MS modulo signed constant, pop 1, push sig(x) % sig(C)
01000000 GRX GR greater, pop 2, push y > x
01000001 GRC GR greater constant, pop 1, push x > C
01000100 GEX GE greater or equal, pop 2, push y >= x
01000101 GEC GE greater or equal constant, pop 1, push x >= C
01001000 SMX SM smaller, pop 2, push y < x
01001001 SMC SM smaller constant, pop 1, push x < C
01001100 SEX SE smaller or equal, pop 2, push y <= x
01001101 SEC SE smaller or equal constant, pop 1, push x <= C
01010000 GSX GS greater signed, pop 2, push sig(y) > sig(x)
01010001 GSC GS greater signed constant, pop 1, push sig(x) > sig(C)
01010100 BSX BS greater or equal signed, pop 2, push sig(y) >= sig(x)
01010101 BSC BS greater or equal signed constant, pop 1, push sig(x) >= sig(C)
01011000 SSX SS smaller signed, pop 2, push sig(y) < sig(x)
01011001 SSC SS smaller signed constant, pop 1, push sig(x) < sig(C)
01011100 LSX LS smaller or equal signed, pop 2, push sig(y) <= sig(x)
01011101 LSC LS smaller or equal signed constant, pop 1, push sig(x) <= sig(C)
01100000 EQX EQ equals, pop 2, push y == x
01100001 EQC EQ equals constant, pop 1, push x == C
01100100 NEX NE not equals, pop 2, push y != x
01100101 NEC NE not equals constant, pop 1, push x != C
01101000 BAX BA binary and, pop 2, push y & x
01101001 BAC BA binary and constant, pop 1, push x & C
01101100 BOX BO binary or, pop 2, push `y
01101101 BOC BO binary or constant, pop 1, push `x
01110000 BXX BX binary xor, pop 2, push y ^ x
01110001 BXC BX binary xor constant, pop 1, push x ^ C
01110100 LAX LA logical and, pop 2, push y && x
01110101 LAC LA logical and constant, pop 1, push x && C
01111000 LOX LO logical or, pop 2, push `y
01111001 LOC LO logical or constant, pop 1, push `x
01111100 LXX LX logical xor, pop 2, push y ^^ x
01111101 LXC LX logical xor constant, pop 1, push x ^^ C
10000010 BNO BN binary not, pop 1, push ~x
10000100 SRX SR shift right, pop 2, push y >> x
10000101 SRC SR shift right constant, pop 1, push x >> C
10001000 SLX SL shift left, pop 2, push y << x
10001001 SLC SL shift left constant, pop 1, push x << C
11110011 ADR address, push address of stack top
11111011 INU input unfinished, push 0 if input has finished, else 1
11111111 INP input, read and push value

Possible C values for the DES instruction are:

  • 1: Next instruction is an if branch.
  • 2: Next instruction is else label.
  • 3: This is end of an if branch.
  • 4: Next instruction is loop start. This is used for both conditional and infinite loops; distinction between the two can be made based on whether the next instruction is jump or not.
  • 5: Next instruction is loop break.
  • 6: Next instruction is loop end.
  • 7: Next instruction is start of a fun def.
  • 8: Next instruction is exit command.
  • 9: Next instruction is goto command.
  • 10: Marks location of a label.
  • 11: Marks start of a new line in source code (for debugging).
  • 12: Marks start of a new source code file (for debugging).
  • 13: Marks end of current source code file (for debugging).
  • (reserved)
  • 32 and above: Can be used by compiler to embed any information.

Notes And Examples

Here there will be additional notes and examples to give more clarity about the bytecode.

Instruction example: let's analyze an instruction encoded as 00100001 00000001. The first byte (00100001) is opcode, the highest 4 bits of the opcode (0010) are greater or equal to value 2, therefore the instruction is a typical stack operation. Thanks to this the lowest two bits of the opcode tell us more about the instruction operation mode; these bits are are 01 which signifies a mode by which instruction pops 1 value from the stack, makes use of an immediate constant (stored right in the instruction) and finally pushes 1 value back onto the stack. From the instruction table above we see the opcode actually stands for ADC -- add constant -- the instruction's semantic is to pop a value from the stack, add a constant to it (the immediate constant stored in instruction itself) and push the result back onto the stack. Now for the second part of the instruction (00000001): the two highest bits are 0, meaning the instruction operates in type environment 0 (the default one). The 3rd highest bit is also 0, meaning the instruction is NOT prevented from popping, i.e. it may perform a pop (and it will, by the definition of the ADC instruction). The 4th highest bit is also 0, saying the immediate constant stored in the instruction will NOT continue, i.e. the constant is small enough to fit into this single instruction. The remaining lower 4 bits (0001) store the immediate constant itself, in this case 1. To sum up, this instruction will work on the stack in type environment 0, it will pop one value from it, add 1 to it and push the result back onto the stack.

Instruction example 2: now let's analyze a more complex sequence of bytes:

00010001 11010000 00000011 00010001 00000011 00011111 00000011 00000000
I0a      I0b      I1a      I1b      I2a      I2b      I3a      I3b

We may look up the sequence of opcodes here is PAC COC COC COC, i.e. the sequence starts with a pointer add constant instruction and is followed by three constant continuation instructions that will hold additional data for the PAC instruction. The PAC instruction is a special one (because its opcode's highest 4 bits, 0001, are lower than the value 2), so we can't reason anything from its opcode alone, we have to look up its definition to see what exactly it does and what operands it takes. In this case we see the instruction uses two immediate constants, denoted C1 and C2, one will say an index of a pointer and the other a constant to add to this pointer. Whenever an instruction utilizes two immediate constants, they are actually stored as one immediate constant which is to be split in the half to obtain the two constants -- this is why we have so many COC instructions following here. We see that the 4th highest bit in bytes I0b, I1b and I2b is set, which indicates an immediate constant continuation, while the same bit in byte I3b is 0, meaning the constant stops there. So to obtain the whole immediate constant we chain the immediate constants (lowest 4 bits) stored in I3b, I2b, I1b and I0b to get 0000111100010000; splitting this into two gives us the two immediate constants C2 = 00001111 and C1 = 00010000. C1 holds decimal value 16, which by the described pointer indexing rules signifies the first user defined pointer (lower indices signify special stack top offset pointers). C2 is for the PAC instruction to be interpreted as a 4 bit two's complement signed value, i.e. in this case 1111 = -1. So the whole instruction adds a value -1 to the first user pointer (i.e. shifts it one address left).

When implementing the bytecode, be careful especially about the following:

  • The CON instruction performs not only push of a constant, but also immediately pops it, unless its non-popping bit is set. This is to allow optimization of e.g. CON' 2; MUX to CON 2; MUC 2 -- the latter performs multiplication by immediate constant instead of a value on the stack (something potentially simpler to optimize), however the value 2 still has to be pushed onto the stack to keep the behavior consistent with the former. The fact that the CON instruction may perform a pop here allows us to just replace two instructions with other two instructions, without shifting the rest of the bytecode.
  • The CON instruction performs pop and push in different order from other instructions (first pushes, then pops)! Treating CON with code that handles other instructions may therefore lead to a bug.
  • Pay careful attention to handling immediate constants in type env. 0. While e.g. in type env. 8 pushing -1 onto the stack can just be translated to CON 255 (as 255 and -1 are the same values there), in type env. 0 integer size is unknown, so the same has to be done differently, for example as CON 0; SUC 1.
  • Similarly note that in type env. 0 immediate constants that are to be treated as signed are considered to be 32bit two's complement values, regardless of the target platform's actual integer size, so e.g. signed division by 4294967295 can NOT be implemented as DSC 4294967295 under assumption that integer size will be e.g. 64 bits, this instruction will still consider 32bit two's complement and so the value here means always -1.
  • A bytecode translator/interpreter will likely perform a preliminary analysis of the bytecode before processing it, e.g. to assess which type environments are ever used by the bytecode. If a type environment is never used, there is no need to allocate stack and pointer table for it. However be careful about type env. 0, which will likely be always used just due to the presence of INI instruction which will most likely push program arguments on the type env. 0 stack.