Comun Bytecode

This is a description of implementation specific comun bytecode, which from now on we will just call comun bytecode. Note that this bytecode is NOT part of comun specification, it is a specific invention of this particular implementation -- other implementations may use this bytecode or a different bytecode or even no bytecode at all.

Basics

Fixed-size 2 byte (16 bits) instructions are used. The bytecode supports type environments 0, 8, 16 and 32 (this can't easily be extended, for a support of more type environments a different bytecode would have to be designed). Except for header everything (even data in code and metadata information) is encoded as a type of instruction, i.e. in the bytecode section each aligned 2 byte sequence can be read and interpreted as an instruction, without knowing the context. Not all instructions represent a direct action, e.g. instructions recording continuation of immediate constant or meta information.

All memory cells and pointer addresses are implicitly considered to be zeros, so it is e.g. not necessary to explicitly initialize a specific pointer to zero.

Jump address is address of an instruction that will be executed in the next cycle.

It is recommended for files containing binary form of this bytecode to have the extension .cmb.

Overall Structure

The bytecode starts with a header of length 8 bytes whose meaning and possible values are following:

67 ('C')
66 ('B')
reserved
reserved
checksum: Sum of all bytes in the instruction section modulo 256.
payload: This can be used to store any information by the compiler (e.g. its signature etc.).
reserved
reserved

After the header instruction section follows. It consists of a sequence of instructions which end with the END instruction.

After the instruction section any data may follow which have no meaning from the bytecode point of view but may be used to store extra information, e.g. debug symbols etc.

Well Formed Bytecode

Well formed bytecode should follow these rules:

There are no instructions in the bytecode that aren't defined in this document.
Bits in instruction codes that have no meaning and reserved header bits are set to 0.
DES instructions immediately precede all higher level constructs to mark them (e.g. signify that a jump is start of an if statement).
There is exactly one END instruction, at the end of bytecode (to halt program early use jump instruction to skip to the END instruction.)
Function code is preceded with JMA instruction that skips it (this jump follows right after the function's DES instruction). There must be exactly one RET instruction in the function, at its end (to return early use jump instruction to skip to the RET instruction). After the initial JMA instructions NOP instructions may follow and then the first non-NOP instruction's address will be the one that's stored in CAL instructions that call this function (this rule simplifies translation of the bytecode to specific languages).
The bytecode is made so that when run it firstly sets initial pointer addresses with PSC instructions, then executes the INI instruction and only then the actual main program starts being executed. The initialization phase can happen e.g. at the very beginning of the bytecode or inside a special initialization function that's called right at start.

Instruction Format

Each instruction is of the following format:

TTTTUUMM EEPFNNNN
-------- --------
msb  lsb msb  lsb
1st byte 2nd byte

TTTTUUMM: opcode, identifies instruction, is of format:
- If TTTT < 2 the instruction is special and UUMM has no special meaning.
- Else (if TTTT >= 2) the opcode represents a typical stack operation and the following holds:
- TTTTUU: Identifies instruction type.
- MM: Identifies instruction mode, possible values are:
  - 00: pop 2, push 1
  - 01: pop 1, use immediate constant C, push 1
  - 10: pop 1, push 1
  - 11: push 1
EE: Type environment, possible values are:
- 00: environment 0
- 01: environment 8
- 10: environment 16
- 11: environment 32
P: If 1, the instruction is non-popping, otherwise it potetnially pops.
F: If 1, immediate constant (C) continues in the next word (which has to be COC), otherwise not.
NNNN: Immediate constant (C) bits, with rightmost bit being least significant.

Immediate constants: some instructions make use of immediate constants embedded in the instruction code. If an instruction uses immediate constant C, its bits are held in the NNNN part of the instruction; if these bits don't suffice for the constant, further higher bits will follow in immediately following COC instructions. Some instructions use two immediate constants C1 and C2 -- in this case these are encoded the same way: lower half of what would normally be C represents C1, the upper half C2.

Sign interpretation of constants: if an instruction is to interpret immediate constant as signed, it will do so in the following way -- the signed value is the value of the constant in B bit two's complement encoding where B is 4 for if the instruction is PAC, otherwise 32 if the instruction's type environment is 0, otherwise it is equal to the number of bits of the type environment.

Pointer indexing: pointer indices N = 0 to 15, including both bounds, mean the special values stack top minus N. Pointer with index 16 stands for the first user pointer, 17 for second user pointer etc.

Instruction List

Let x, y and z be the values at top, top - 1 and top - 2 addresses before the operation is performed, respectively. C, C1 and C2 represent immediate constants. sig() signifies interpretation of value as signed, ptrs is the pointer table in given type environment, mem is the memory of given type environment.

opcode	mnemonic	group	description
`00000000`	`END`		end, end of bytecode
`00000001`	`NOP`		no operation, do nothing
`00000010`	`DES`		description, holds meta information
`00000011`	`COC`		constant continuation, extra bits for C
`00000100`	`ERR`		error, raise error specified by C
`00000111`	`CAL`		call, call function at address C
`00001000`	`CAE`		call external, call external function C
`00001001`	`RET`		return, return from function
`00001010`	`JIA`		jump if address, pop, if x then jump to addr. C
`00001011`	`JNA`		jump if not address, pop, if not x, jump to addr. C
`00001100`	`JMA`		jump address, jump to addr. C
`00001111`	`INI`		initialize, do initialization (push args etc.)
`00010000`	`PSC`		pointer set constant, `ptrs[C1] = C2`
`00010001`	`PAC`		pointer add constant, `ptrs[C1] += sig(C2)`
`00010010`	`PAX`		pointer add, pop, `ptrs[C] += sig(x)`
`00010011`	`PCO`		pointer copy, `ptrs[C1] = ptrs[C2]`
`00010100`	`MEX`		memory set, pop, `mem[ptrs[C]] = x`
`00010101`	`MGE`		memory get, push `mem[ptrs[C]]`
`00010110`	`PCM`		pointer compare, push 0/1/2 if `ptrs[C2]` =/>/< `ptrs[C1]`
`00010111`	`PUX`		push xth, push xth value below stack top
`00011010`	`CON`		constant, push C, pop; NOTE: pop is needed for optims.
`00011011`	`CND`		conditional, pop 3, push `z ? y : x`
`00011100`	`SWP`		swap, pop 2, push x then y
`00011101`	`TRA`		transfer, pop, `mem[C][$] = x`
`00011110`	`POP`		pop, pop C + 1 values
`00011111`	`OUT`		output, pop and output
`00100000`	`ADX`	`AD`	add, pop 2, push `y + x`
`00100001`	`ADC`	`AD`	add constant, pop 1, push `x + C`
`00100100`	`SUX`	`SU`	subtract, pop 2, push `y - x`
`00100101`	`SUC`	`SU`	subtract constant, pop 1, push `x - C`
`00101000`	`MUX`	`MU`	multiply, pop 2, push `y * x`
`00101001`	`MUC`	`MU`	multiply constant, pop 1, push `x * C`
`00101100`	`DIX`	`DI`	divide, pop 2, push `y / x`
`00101101`	`DIC`	`DI`	divide constant, pop 1, push `x / C`
`00110000`	`DSX`	`DS`	divide signed, pop 2, push `sig(y) / sig(x)`
`00110001`	`DSC`	`DS`	divide signed constant, pop 1, push `sig(x) / sig(C)`
`00110100`	`MOX`	`MO`	modulo, pop 2, push `y % x`
`00110101`	`MOC`	`MO`	modulo constant, pop 1, push `x % C`
`00111000`	`MSX`	`MS`	modulo signed, pop 2, push `sig(y) % sig(x)`
`00111001`	`MSC`	`MS`	modulo signed constant, pop 1, push `sig(x) % sig(C)`
`01000000`	`GRX`	`GR`	greater, pop 2, push `y > x`
`01000001`	`GRC`	`GR`	greater constant, pop 1, push `x > C`
`01000100`	`GEX`	`GE`	greater or equal, pop 2, push `y >= x`
`01000101`	`GEC`	`GE`	greater or equal constant, pop 1, push `x >= C`
`01001000`	`SMX`	`SM`	smaller, pop 2, push `y < x`
`01001001`	`SMC`	`SM`	smaller constant, pop 1, push `x < C`
`01001100`	`SEX`	`SE`	smaller or equal, pop 2, push `y <= x`
`01001101`	`SEC`	`SE`	smaller or equal constant, pop 1, push `x <= C`
`01010000`	`GSX`	`GS`	greater signed, pop 2, push `sig(y) > sig(x)`
`01010001`	`GSC`	`GS`	greater signed constant, pop 1, push `sig(x) > sig(C)`
`01010100`	`BSX`	`BS`	greater or equal signed, pop 2, push `sig(y) >= sig(x)`
`01010101`	`BSC`	`BS`	greater or equal signed constant, pop 1, push `sig(x) >= sig(C)`
`01011000`	`SSX`	`SS`	smaller signed, pop 2, push `sig(y) < sig(x)`
`01011001`	`SSC`	`SS`	smaller signed constant, pop 1, push `sig(x) < sig(C)`
`01011100`	`LSX`	`LS`	smaller or equal signed, pop 2, push `sig(y) <= sig(x)`
`01011101`	`LSC`	`LS`	smaller or equal signed constant, pop 1, push `sig(x) <= sig(C)`
`01100000`	`EQX`	`EQ`	equals, pop 2, push `y == x`
`01100001`	`EQC`	`EQ`	equals constant, pop 1, push `x == C`
`01100100`	`NEX`	`NE`	not equals, pop 2, push `y != x`
`01100101`	`NEC`	`NE`	not equals constant, pop 1, push `x != C`
`01101000`	`BAX`	`BA`	binary and, pop 2, push `y & x`
`01101001`	`BAC`	`BA`	binary and constant, pop 1, push `x & C`
`01101100`	`BOX`	`BO`	binary or, pop 2, push `y
`01101101`	`BOC`	`BO`	binary or constant, pop 1, push `x
`01110000`	`BXX`	`BX`	binary xor, pop 2, push `y ^ x`
`01110001`	`BXC`	`BX`	binary xor constant, pop 1, push `x ^ C`
`01110100`	`LAX`	`LA`	logical and, pop 2, push `y && x`
`01110101`	`LAC`	`LA`	logical and constant, pop 1, push `x && C`
`01111000`	`LOX`	`LO`	logical or, pop 2, push `y
`01111001`	`LOC`	`LO`	logical or constant, pop 1, push `x
`01111100`	`LXX`	`LX`	logical xor, pop 2, push `y ^^ x`
`01111101`	`LXC`	`LX`	logical xor constant, pop 1, push `x ^^ C`
`10000010`	`BNO`	`BN`	binary not, pop 1, push `~x`
`10000100`	`SRX`	`SR`	shift right, pop 2, push y >> x
`10000101`	`SRC`	`SR`	shift right constant, pop 1, push x >> C
`10001000`	`SLX`	`SL`	shift left, pop 2, push y << x
`10001001`	`SLC`	`SL`	shift left constant, pop 1, push x << C
`11110011`	`ADR`		address, push address of stack top
`11111011`	`INU`		input unfinished, push 0 if input has finished, else 1
`11111111`	`INP`		input, read and push value

Possible C values for the DES instruction are:

1: Next instruction is an if branch.
2: Next instruction is else label.
3: This is end of an if branch.
4: Next instruction is loop start. This is used for both conditional and infinite loops; distinction between the two can be made based on whether the next instruction is jump or not.
5: Next instruction is loop break.
6: Next instruction is loop end.
7: Next instruction is start of a fun def.
8: Next instruction is exit command.
9: Next instruction is goto command.
10: Marks location of a label.
11: Marks start of a new line in source code (for debugging).
12: Marks start of a new source code file (for debugging).
13: Marks end of current source code file (for debugging).
(reserved)
32 and above: Can be used by compiler to embed any information.

Notes And Examples

Here there will be additional notes and examples to give more clarity about the bytecode.

Instruction example: let's analyze an instruction encoded as 00100001 00000001. The first byte (00100001) is opcode, the highest 4 bits of the opcode (0010) are greater or equal to value 2, therefore the instruction is a typical stack operation. Thanks to this the lowest two bits of the opcode tell us more about the instruction operation mode; these bits are are 01 which signifies a mode by which instruction pops 1 value from the stack, makes use of an immediate constant (stored right in the instruction) and finally pushes 1 value back onto the stack. From the instruction table above we see the opcode actually stands for ADC -- add constant -- the instruction's semantic is to pop a value from the stack, add a constant to it (the immediate constant stored in instruction itself) and push the result back onto the stack. Now for the second part of the instruction (00000001): the two highest bits are 0, meaning the instruction operates in type environment 0 (the default one). The 3rd highest bit is also 0, meaning the instruction is NOT prevented from popping, i.e. it may perform a pop (and it will, by the definition of the ADC instruction). The 4th highest bit is also 0, saying the immediate constant stored in the instruction will NOT continue, i.e. the constant is small enough to fit into this single instruction. The remaining lower 4 bits (0001) store the immediate constant itself, in this case 1. To sum up, this instruction will work on the stack in type environment 0, it will pop one value from it, add 1 to it and push the result back onto the stack.

Instruction example 2: now let's analyze a more complex sequence of bytes:

00010001 11010000 00000011 00010001 00000011 00011111 00000011 00000000
I0a      I0b      I1a      I1b      I2a      I2b      I3a      I3b

We may look up the sequence of opcodes here is PAC COC COC COC, i.e. the sequence starts with a pointer add constant instruction and is followed by three constant continuation instructions that will hold additional data for the PAC instruction. The PAC instruction is a special one (because its opcode's highest 4 bits, 0001, are lower than the value 2), so we can't reason anything from its opcode alone, we have to look up its definition to see what exactly it does and what operands it takes. In this case we see the instruction uses two immediate constants, denoted C1 and C2, one will say an index of a pointer and the other a constant to add to this pointer. Whenever an instruction utilizes two immediate constants, they are actually stored as one immediate constant which is to be split in the half to obtain the two constants -- this is why we have so many COC instructions following here. We see that the 4th highest bit in bytes I0b, I1b and I2b is set, which indicates an immediate constant continuation, while the same bit in byte I3b is 0, meaning the constant stops there. So to obtain the whole immediate constant we chain the immediate constants (lowest 4 bits) stored in I3b, I2b, I1b and I0b to get 0000111100010000; splitting this into two gives us the two immediate constants C2 = 00001111 and C1 = 00010000. C1 holds decimal value 16, which by the described pointer indexing rules signifies the first user defined pointer (lower indices signify special stack top offset pointers). C2 is for the PAC instruction to be interpreted as a 4 bit two's complement signed value, i.e. in this case 1111 = -1. So the whole instruction adds a value -1 to the first user pointer (i.e. shifts it one address left).

When implementing the bytecode, be careful especially about the following:

The CON instruction performs not only push of a constant, but also immediately pops it, unless its non-popping bit is set. This is to allow optimization of e.g. CON' 2; MUX to CON 2; MUC 2 -- the latter performs multiplication by immediate constant instead of a value on the stack (something potentially simpler to optimize), however the value 2 still has to be pushed onto the stack to keep the behavior consistent with the former. The fact that the CON instruction may perform a pop here allows us to just replace two instructions with other two instructions, without shifting the rest of the bytecode.
The CON instruction performs pop and push in different order from other instructions (first pushes, then pops)! Treating CON with code that handles other instructions may therefore lead to a bug.
Pay careful attention to handling immediate constants in type env. 0. While e.g. in type env. 8 pushing -1 onto the stack can just be translated to CON 255 (as 255 and -1 are the same values there), in type env. 0 integer size is unknown, so the same has to be done differently, for example as CON 0; SUC 1.
Similarly note that in type env. 0 immediate constants that are to be treated as signed are considered to be 32bit two's complement values, regardless of the target platform's actual integer size, so e.g. signed division by 4294967295 can NOT be implemented as DSC 4294967295 under assumption that integer size will be e.g. 64 bits, this instruction will still consider 32bit two's complement and so the value here means always -1.
A bytecode translator/interpreter will likely perform a preliminary analysis of the bytecode before processing it, e.g. to assess which type environments are ever used by the bytecode. If a type environment is never used, there is no need to allocate stack and pointer table for it. However be careful about type env. 0, which will likely be always used just due to the presence of INI instruction which will most likely push program arguments on the type env. 0 stack.

bytecode.md 21 KB Пермалинк Историја Датотека