-*-mode:org-*-

M2-Planet, built around the goal of bootstrapping the minimal C compiler
required to support C macros, structs, arrays, inline assembly and self-hosting,
is rather small: around 3Kloc according to sloccount.
* SETUP
The most obvious way to set up for M2-Planet development is to clone with
--recursive and set up mescc-tools first
(https://github.com/oriansj/mescc-tools.git).
Then be sure to install any C compiler and any make implementation of your
choice.
* BUILD
The standard C based approach to building M2-Planet is simply running:
  make M2-Planet

Should you wish to verify that M2-Planet was built correctly, run:
  make test
* ROADMAP
M2-Planet V1.0 is the bedrock of all future M2-Planet versions. Any future
release that depends upon a more advanced version to be compiled will
require the version prior to it to be named. V2.0 and all future releases of
M2-Planet share the same properties. All minor releases are buildable by the
last major release, and all major releases are buildable by the last major
release.
* DEBUG
To get a properly debuggable binary of M2-Planet:
  make M2-Planet

M2-Planet can also create debuggable binaries with the help of blood-elf and
the --debug option. If you are comfortable with gdb and know that function
names are prefixed with FUNCTION_, M2-Planet built binaries are quite
debuggable.
* Bugs
M2-Planet assumes a very heavily restricted subset of the C language, and many
C programs will break hard when passed to M2-Planet.

M2-Planet does not actually implement any primitive functionality; it is
assumed that this will be written in inline assembly by the programmer or
leveraged via M2libc, which is the C library written in the M2-Planet C subset.
* Magic
** argument and local stack
In M2-Planet the stack starts with the previous EDI pointer. It is preserved
because, should an argument be a function call that returns a value, EDI may
be overwritten and cause issues. This is followed by the previous frame's base
pointer (EBP), as it will need to be restored upon return from the function
call. Then come the arguments, which are pushed onto the stack from left to
right, followed by the RETURN Pointer generated by the function call. After
that the locals are placed upon the stack first to last, followed by any
temporary values:

         +----------------------+
 EDI ->  | Previous EDI pointer |
         +----------------------+
 EBP ->  | Previous EBP pointer |
         +----------------------+
 1st ->  |      Argument 1      |
         +----------------------+
 2nd ->  |      Argument 2      |
         +----------------------+
 ... ->  ........................
         +----------------------+
 Nth ->  |      Argument N      |
         +----------------------+
 RET ->  |    RETURN Pointer    |
         +----------------------+
 1st ->  |       Local 1        |
         +----------------------+
 2nd ->  |       Local 2        |
         +----------------------+
 ... ->  ........................
         +----------------------+
 Nth ->  |       Local N        |
         +----------------------+
 temps-> .......................

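The layout above can be read as a simple slot numbering. The following C
sketch is only an illustration of the diagram, not code taken from M2-Planet:
the function names and the idea of counting slots in machine words from the
top of the diagram are assumptions made for this example.

```c
#include <assert.h>

/* Slots count machine words from the first entry of the diagram:
 * slot 0 holds the previous EDI, slot 1 the previous EBP. */
enum { SLOT_PREVIOUS_EDI = 0, SLOT_PREVIOUS_EBP = 1, FIRST_ARGUMENT_SLOT = 2 };

/* Slot of the n-th argument (1-based); arguments are pushed left to right. */
int argument_slot(int n)
{
	assert(n >= 1);
	return FIRST_ARGUMENT_SLOT + (n - 1);
}

/* Slot of the RETURN Pointer, which follows the last argument. */
int return_pointer_slot(int argument_count)
{
	assert(argument_count >= 0);
	return FIRST_ARGUMENT_SLOT + argument_count;
}

/* Slot of the n-th local (1-based); locals follow the RETURN Pointer. */
int local_slot(int argument_count, int n)
{
	assert(n >= 1);
	return return_pointer_slot(argument_count) + n;
}
```

For a function with two arguments, the RETURN Pointer lands in slot 4 and the
first local in slot 5. Temporaries would occupy whatever slots come after the
last local.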
** AArch64 port notes
Some details about design, implementation and generated code that may be of
interest for new targets, to M1 users, compiler hackers and curious
minds in general.
*** Target ISA related issues
In the ARMv8 AArch64 A64 instruction set that we target, immediate
values inside instructions are not aligned to 4 bits, which is the size
of the convenient single hexadecimal digit (that served well so far,
for other ports). Other groups of bits are affected. For example,
those that encode registers are usually 5 bits long, and horror stories
about non-contiguous chunks (due to endianness interactions with M1, a
big bit endian language) are told, so not even octal or binary
encodings solve our problem.
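To make the alignment problem concrete, this small C sketch counts how many
hexadecimal digits a bit field overlaps within a 32-bit instruction word. The
helper names are made up for this note. The field positions used in the usage
note (register fields at bits 0..4 and 5..9, a 19-bit branch immediate at bits
5..23) are those of the A64 encoding.

```c
#include <assert.h>

/* Number of hexadecimal digits (4-bit groups) that a field overlaps.
 * lsb is the position of the lowest bit of the field. */
int hex_digits_spanned(int lsb, int width)
{
	assert(lsb >= 0 && width >= 1 && lsb + width <= 32);
	int first_digit = lsb / 4;
	int last_digit = (lsb + width - 1) / 4;
	return last_digit - first_digit + 1;
}

/* Returns 1 when the field starts and ends on hex digit boundaries,
 * so it could be written as whole hex digits on its own. */
int is_hex_aligned(int lsb, int width)
{
	assert(lsb >= 0 && width >= 1 && lsb + width <= 32);
	return (lsb % 4 == 0) && (width % 4 == 0);
}
```

A 5-bit register field at bits 5..9 spans two hex digits and shares both of
them with its neighbours, so it cannot be expressed as a standalone piece of
hex. The same holds for the 19-bit immediate of CBZ, which spans five digits.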

Because of that, we have less flexible and reusable definitions than
usual (see aarch64_defs.M1). Also, we resort to unconventional (for
M2-Planet standards) workarounds and generate worse code. Anyway,
neither size nor speed is a high priority and there's room for
improvement.

On the bright side, affected codepaths/definitions and working tactics
are better known now, this being the first target of M2-Planet with
such features. That might be helpful in future ports (RISC-V comes to
mind, which has a weird structure too... designed "so that as many bits
as possible are in the same position in every instruction" but not for
basic tools).

Some notable workarounds are:

- Create one independent definition per _needed_ operation, instead of
  reusing common parts like we do for other archs. The resulting set is
  quite small even when following this simple rule consistently. See how
  the SKIP_INST_* family seems nicely aligned for more fine-grained
  hex but we don't exploit that; or the PUSH/POP ones that also kind
  of do, but watch out for the general case if you plan to create your
  own set of general purpose definitions.

  One interesting example reflects that creating new definitions is
  avoided unless readability suffers: the pair LOAD_W2_AHEAD,
  LSHIFT_X0_X0_X2 exists because our two main registers are in use in
  postfix_expr_array() and the common shift is inconvenient in this
  particular case. It's possible to reuse definitions (preliminary
  patches did this) using multiplication and addition (quite natural by
  the way, even if suboptimal); or dancing with the stack to fit
  everything into place (harder to reason about). Either approach felt
  too alien in the codebase, so a couple of definitions were added.

- Use the register-based instructions instead of those using
  immediates. This forces us to generate more code in order to put the
  data in the register. Data is mixed with the code (not even in a
  fancy pool), to be loaded from and then skipped at run-time. See some
  of the multiple instances of the LOAD_W0_AHEAD then SKIP_32_DATA
  pattern.

- For control flow structures, the problem about immediates bits us
  again (hits, bites, bytes; sorry, can't resist) for conditional
  PC-relative branching. The jump distance is arbitrary, because any
  amount of code can be present in any given block to be skipped. AArch64
  PC-relative conditional branch instructions [that I found, newbie on
  board!] are based on immediate values, and we have to avoid
  arbitrary immediate values as usual.

  There's an *unconditional* absolute branch instruction that accepts
  the target addr from a register (which we can set at will using the
  "load_ahead+skip" pattern). So, we construct an unconditional
  over-the-block jump and skip this jump with the conditional one
  ("inverted", more about this in a moment). The point is that now we
  know exactly the distance to jump: it's the size of that
  construction. We can define a couple of conditional branch
  instructions because the immediate is not arbitrary anymore, nice!

  Maybe this pseudo-code explains it better:

    if(cond) block_foo; else block_bar;
    more;

  ... is compiled to:

    if cond then skip past the unconditional-branch // To get to foo-code.
    // We know the space used by this code...
    set register to addr of else-label
    // ... and this one, that completes the jump to the alternative block.
    unconditional-branch to addr in register
    foo-code
    [Here we jump to the endif-label, omitted for clarity.]
    else-label:
    bar-code
    endif-label:
    more-code

  A similar approach is used for other control flow structures. See
  CBZ_X0_PAST_BR (cbz x0, #20) and CBNZ_X0_PAST_BR (cbnz x0, #20) used
  as part of the generation of 'if', 'for', 'do' and 'while'
  statements. Notice how the test is inverted: when Knight does JUMP.Z
  we do CBNZ (process_if); when JUMP.NZ we CBZ (process_do).

  CSEL was considered but required an additional register, more labels
  and code. A bit too invasive a change to make to the codebase.
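
The fixed distance of 20 bytes in "cbz x0, #20" can be derived from the size
of the skipped construction. The C sketch below does that arithmetic and then
encodes the two conditional branches. It assumes the construction consists of
the conditional branch itself, one load-ahead instruction, one skip
instruction, a 32-bit address stored inline and one branch-via-register
instruction; that breakdown is an assumption made here, not something quoted
from aarch64_defs.M1. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { A64_INSTRUCTION_BYTES = 4, INLINE_ADDRESS_BYTES = 4 };

/* Bytes from the conditional branch to the first instruction after the
 * unconditional branch. A64 branch offsets are relative to the address
 * of the branch instruction itself, so it is counted too. */
int past_br_offset(void)
{
	return A64_INSTRUCTION_BYTES    /* the conditional branch itself */
	     + A64_INSTRUCTION_BYTES    /* load the address stored ahead */
	     + A64_INSTRUCTION_BYTES    /* skip over the inline address */
	     + INLINE_ADDRESS_BYTES     /* the address of the else-label */
	     + A64_INSTRUCTION_BYTES;   /* unconditional branch via register */
}

/* Encode "cbz xT, #byte_offset". The imm19 field holds the offset divided
 * by 4 and sits at bits 5..23, which is why it is not hex-friendly. */
uint32_t encode_cbz_x(unsigned rt, int byte_offset)
{
	assert(rt < 32);
	assert(byte_offset % 4 == 0);
	assert(byte_offset >= -(1 << 20) && byte_offset < (1 << 20));
	uint32_t imm19 = ((uint32_t)(byte_offset / 4)) & 0x7FFFF;
	return 0xB4000000u | (imm19 << 5) | rt;
}

/* Encode "cbnz xT, #byte_offset"; it differs from CBZ only in bit 24. */
uint32_t encode_cbnz_x(unsigned rt, int byte_offset)
{
	return encode_cbz_x(rt, byte_offset) | 0x01000000u;
}
```

Because the offset is always the same, encode_cbz_x(0, past_br_offset()) is a
constant (0xB40000A0), which is what makes a fixed definition possible.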

As you can imagine, the ISA colored the port development from the very
beginning. It's a lot of fun to come up with basic solutions under
those limitations. The port works as expected but there's room for
experimentation.
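
As an illustration of the LOAD_W0_AHEAD then SKIP_32_DATA pattern mentioned in
the workarounds, here is a hedged C sketch of a helper that writes the M1 text
for loading a constant into W0. The helper name, its signature and the exact
line layout are assumptions for this example and are not copied from
M2-Planet. It also assumes that M1 accepts a 32-bit immediate written with a
leading '%'.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes three lines into out: the load-ahead definition, the skip
 * definition and the 32-bit value that is loaded and then jumped over.
 * Returns the number of characters written, or -1 if out is too small. */
int emit_load_constant_w0(char* out, size_t out_size, const char* value)
{
	assert(out != NULL && value != NULL);
	assert(strlen(value) > 0);
	int needed = snprintf(out, out_size,
		"LOAD_W0_AHEAD\nSKIP_32_DATA\n%%%s\n", value);
	if(needed < 0 || (size_t)needed >= out_size) return -1;
	return needed;
}
```

At run-time the first instruction reads the data word that follows the second
one, and the second instruction jumps over that word so it is never executed
as code.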
*** Function call
The Base Pointer and its relation to arguments in function calls and
locals during function execution is a bit different compared to other
supported architectures. This simplifies some calculations. See how
unsurprising the depths are in collect_arguments() and
collect_local().

Note how these calculations are related to the "push/pop size". See
`Wasted stack space`.

Let's follow a couple of M2-Planet functions generating code for
prologue, call and epilogue with the help of some artsy-less ascii-art
stack graphs for clarity. The expected stack is "full" (the stack
pointer register contains the address of the last pushed element) and
descending (grows towards zero).

Most of the work is done by function_call(). First, we save (the
generated code does it at runtime of the compiled program, but please
bear with me about the point of view) three registers on the stack. We
include a scratch one ("tmp" value in the graphs) that we're going to
use for two different purposes. On the one hand, to store the current
stack pointer (which is going to be the reference address --Base
Pointer-- during the execution of the called function). On the other
hand, once the BP is set (which can't be done right away
because we need the current BP to evaluate the arguments in the caller's
context) we use the register to store the addr of the function to be
called. The other two registers are the Link Register (X30) and the Base
Pointer (X17, also known as IP1) itself, to allow for recursion. Both
are prefixed with "o" in the following graphs, as in "old".

This structure gives us a simple reference for both the args and the
locals, without extra elements between those two sets. We rely on the
semantics of BLR (more on this in a bit), which doesn't use the stack
to save the return address, but a register. For other archs this is
not possible (or not exploited, see how for ARM-7 the LR is saved in
the stack just around the call proper; this puts it between the args
and the locals) so it's a difference worth documenting.

  ---> Address 0

  tmp | oLR | oBP |
               ^
               |
               --- SP
               |
               --- BP-to-be

Now we're ready to evaluate and push arguments. Note that M2-Planet
doesn't follow AAPCS64. The evaluation might itself involve function
calls and arbitrary use of the stack, but in the end everything will
look like this.

  tmp | oLR | oBP | arg1 | arg2 | ... | argN |
               ^                         ^
               |                         |
               --- BP-to-be              --- SP (omitted from now on)

At this point we set the BP from the scratch register and execute
branch-and-link (BLR) to the function, reusing the (now free) X16
register (also known as IP0). This instruction saves the address of the
next instruction in X30 (LR, which we saved earlier to allow for
recursion).

  tmp | oLR | oBP | arg1 | arg2 | ... | argN |
               ^
               |
               --- BP

During the called function the locals are pushed on the stack as usual
in M2-Planet.

  tmp | oLR | oBP | arg1 | arg2 | ... | argN | loc1 | loc2 | ... | locN |
               ^
               |
               --- BP

When the function is about to return, we remove the locals from the
stack and execute the return proper, jumping to the address in LR
thanks to RET. This is handled by return_result().

  tmp | oLR | oBP | arg1 | arg2 | ... | argN |
               ^
               |
               --- BP

Back in function_call() we remove the args from the stack.

  tmp | oLR | oBP |
               ^
               |
               --- BP

Finally, we restore the saved registers (so X16, LR and BP contain
tmp, oLR and oBP again), leaving everything as it was before this
journey. Well... one important thing changed: following M2-Planet
conventions, the value returned from the function, if any, is in X0.
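
The whole journey can be replayed with a toy model. The C sketch below is not
M2-Planet code: the struct, the function names and the summing callee are
invented for illustration, and the stack is modelled in whole words rather
than bytes. It only follows the order of pushes and pops described above, with
BP pointing at the saved oBP so that argument n sits n words below it and the
locals follow the arguments.

```c
#include <assert.h>

enum { STACK_WORDS = 32 };

struct cpu {
	long memory[STACK_WORDS];
	long sp;   /* word index of the last pushed element (full stack) */
	long bp;   /* Base Pointer, also a word index */
	long lr;   /* Link Register */
	long tmp;  /* scratch register */
	long x0;   /* return value */
};

static void push(struct cpu* c, long value)
{
	assert(c->sp > 0);
	c->sp = c->sp - 1;               /* descending: grows towards zero */
	c->memory[c->sp] = value;
}

static long pop(struct cpu* c)
{
	assert(c->sp < STACK_WORDS);
	long value = c->memory[c->sp];
	c->sp = c->sp + 1;
	return value;
}

/* Hypothetical callee: sums its arguments into one local, all via BP. */
static void callee_sum(struct cpu* c, int argument_count)
{
	int n;
	push(c, 0);                              /* loc1, the running total */
	long loc1 = c->bp - argument_count - 1;  /* locals follow the args */
	assert(loc1 == c->sp);
	for(n = 1; n <= argument_count; n = n + 1)
	{
		c->memory[loc1] = c->memory[loc1] + c->memory[c->bp - n];
	}
	c->x0 = c->memory[loc1];
	pop(c);                                  /* drop the locals */
}

long simulate_call(struct cpu* c, const long* arguments, int argument_count)
{
	int n;
	push(c, c->tmp);                         /* tmp */
	push(c, c->lr);                          /* oLR */
	push(c, c->bp);                          /* oBP */
	c->tmp = c->sp;                          /* BP-to-be */
	for(n = 0; n < argument_count; n = n + 1) push(c, arguments[n]);
	c->bp = c->tmp;                          /* set BP from the scratch */
	c->lr = -1;                              /* stands in for what BLR writes */
	callee_sum(c, argument_count);
	for(n = 0; n < argument_count; n = n + 1) pop(c);  /* drop the args */
	c->bp = pop(c);                          /* restore in reverse order */
	c->lr = pop(c);
	c->tmp = pop(c);
	return c->x0;
}
```

After simulate_call() returns, sp, bp, lr and tmp hold the values they had
before the call, and only x0 differs.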
*** Stack pointer
Due to the alignment restriction (128 bits) on "push" and "pop" based on
the architectural stack pointer register, we initialize and use X18 as
stack pointer instead.

The M1 definitions referring to SP use X18; stack operations too.
For example:

  DEFINE LDR_X0_[SP] 400240f9    is    ldr x0, [x18]
  DEFINE PUSH_LR 5e8e1ff8        is    str x30, [x18, #-8]!
  DEFINE INIT_SP f2030091        is    mov x18, sp
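
The hex strings in these definitions can be reproduced from the A64 encodings.
The following C sketch is a cross-check written for this note, with
hypothetical function names. It builds each instruction word and then
reorders its bytes the way they appear in memory (least significant byte
first), which is the order in which the definitions list them.

```c
#include <assert.h>
#include <stdint.h>

/* ldr xT, [xN] (unsigned-offset form with a zero offset). */
uint32_t encode_ldr_x(unsigned rt, unsigned rn)
{
	assert(rt < 32 && rn < 32);
	return 0xF9400000u | (rn << 5) | rt;
}

/* str xT, [xN, #offset]! (pre-index form, offset in -256..255). */
uint32_t encode_str_x_preindex(unsigned rt, unsigned rn, int offset)
{
	assert(rt < 32 && rn < 32);
	assert(offset >= -256 && offset <= 255);
	uint32_t imm9 = ((uint32_t)offset) & 0x1FF;
	return 0xF8000C00u | (imm9 << 12) | (rn << 5) | rt;
}

/* mov xD, sp, which is an alias of add xD, sp, #0. */
uint32_t encode_mov_from_sp(unsigned rd)
{
	assert(rd < 32);
	return 0x91000000u | (31u << 5) | rd;
}

/* Reorders the bytes of the instruction word so that printing the result
 * in hex gives the digits in the order they are stored in memory. */
uint32_t memory_order(uint32_t word)
{
	return ((word & 0x000000FFu) << 24)
	     | ((word & 0x0000FF00u) << 8)
	     | ((word & 0x00FF0000u) >> 8)
	     | ((word & 0xFF000000u) >> 24);
}
```

For instance, memory_order(encode_str_x_preindex(30, 18, -8)) yields
0x5E8E1FF8, matching the PUSH_LR definition.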