HACKING 12 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289
  1. -*-mode:org-*-
  2. M2-Planet being based on the goal of bootstrapping the Minimal C compiler
  3. required to support structs, arrays, inline assembly and self hosting;
  4. is rather small, under 1.7Kloc according to sloccount
  5. * SETUP
  6. The most obvious way to setup for M2-Planet development is to clone and setup mescc-tools first (https://github.com/oriansj/mescc-tools.git)
  7. Then be sure to install any C compiler and make clone of your choice.
  8. * BUILD
  9. The standard C based approach to building M2-Planet is simply running:
  10. make M2-Planet
  11. Should you wish to verify that M2-Planet was built correctly run:
  12. make test
  13. * ROADMAP
  14. M2-Planet V1.0 is the bedrock of all future M2-Planet versions. Any future
  15. release that will depend upon a more advanced version to be compiled, will
  16. require the version prior to it to be named. V2.0 and the same properties apply
  17. To all future release of M2-Planet. All minor releases are buildable by the last
  18. major release and All major releases are buildable by the last major release.
  19. * DEBUG
  20. To get a properly debuggable binary: make M2-Planet-gcc
  21. However if you are comfortable with gdb, knowing that function names are
  22. prefixed with FUNCTION_ the M2-Planet binary is quite debuggable.
  23. * Bugs
  24. M2-Planet assumes a very heavily restricted subset of the C language and many C
  25. programs will break hard when passed to M2-Planet.
  26. M2-Planet does not actually implement any primitive functionality, it is assumed
  27. that will be written in inline assembly by the programmer or provided by the
  28. programmer durring the assembly and linking stages
  29. * Magic
  30. ** argument and local stack
  31. In M2-Planet the stack is first the EDI pointer which is preserved as should an
  32. argument be a function which returns a value, it may be overwritten and cause
  33. issues, this is followed by the previous frame's base pointer (EBP) as it will
  34. need to be restored upon return from the function call. This is then followed by
  35. the arguments which are pushed onto the stack from the left to the right,
  36. followed by the RETURN Pointer generated from the function call, after which the
  37. locals are placed upon the stack first to last followed by any Temporary values:
  38. +----------------------+
  39. EDI -> | Previous EDI pointer |
  40. +----------------------+
  41. EBP -> | Previous EBP pointer |
  42. +----------------------+
  43. 1st -> | Argument 1 |
  44. +----------------------+
  45. 2nd -> | Argument 2 |
  46. +----------------------+
  47. ... -> ........................
  48. +----------------------+
  49. Nth -> | Argument N |
  50. +----------------------+
  51. RET -> | RETURN Pointer |
  52. +----------------------+
  53. 1st -> | Local 1 |
  54. +----------------------+
  55. 2nd -> | Local 2 |
  56. +----------------------+
  57. ... -> ........................
  58. +----------------------+
  59. Nth -> | Local N |
  60. +----------------------+
  61. temps-> .......................
  62. ** AArch64 port notes
  63. Some details about design, implementation and generated code; maybe of
  64. interest for new targets, to M1 users, compiler hackers and curious
  65. minds in general.
  66. *** Target ISA related issues
  67. In the ARMv8 AArch64 A64 instruction set that we target, immediate
  68. values into instructions are not aligned to 4 bits, which is the size
  69. of the convenient single hexadecimal digit (that served well so far,
  70. for other ports). Other groups of bits are affected. For example,
  71. those to encode registers are usually 5 bits long and horror stories
  72. about non-contiguous chunks (due to endianess interactions with M1, a
  73. big bit endian language) are told, so not even using octal nor binary
  74. encodings solve our problem.
  75. Because of that, we have less flexible and reusable definitions than
  76. usual (see aarch64_defs.M1). Also, we resort to unconventional (for
  77. M2-Planet standards) workarounds and generate worse code. Anyway,
  78. neither size nor speed are high priorities and there's room for
  79. improvement.
  80. On the bright side, affected codepaths/definitions and working tactics
  81. are better known now, being this the first target of M2-Planet with
  82. such features. That might be helpful in future ports (RISC-V comes to
  83. mind, which has weird structure too... designed "so that as many bits
  84. as possible are in the same position in every instruction" but not for
  85. basic tools).
  86. Some notable workarounds are:
  87. - Create one independent definition per _needed_ operation, instead of
  88. reusing common parts like we do for other archs. The resulting set is
  89. quite small even following this simple rule consistently. See how
  90. the SKIP_INST_* family seems nicely aligned for more fine-grained
  91. hex but we don't exploit that; or the PUSH/POP ones that also kind
  92. of do, but watch out for the general case if you plan to create your
  93. own set of general purpose definitions.
  94. One interesting example reflects that creating new definitions is
  95. avoided unless readability suffers: the pair LOAD_W2_AHEAD,
  96. LSHIFT_X0_X0_X2 exists because our two main registers are in use in
  97. postfix_expr_array() and the common shift is inconvenient in this
  98. particular case. It's possible to reuse definitions (preliminary
  99. patches did this) using multiplication and addition (quite natural by
  100. the way, even if suboptimal); or dancing with the stack to fit
  101. everything into place (harder to reason about). It felt too alien in
  102. the codebase so a couple of definitions were added.
  103. - Use the register-based instructions instead of those using
  104. immediates. This forces us to generate more code in order to put the
  105. data in the register. Data is mixed with the code (not even in a
  106. fancy pool) to be loaded from and then skipped at run-time. See some
  107. of the multiple instances of the LOAD_W0_AHEAD then SKIP_32_DATA
  108. pattern.
  109. - For control flow structures, the problem about immediates bits us
  110. again (hits, bites, bytes; sorry, can't resist) for conditional
  111. PC-relative branching. The jump is arbitrary, because any amount of
  112. code can be present in any given block to be skipped. AArch64
  113. PC-relative conditional branch instructions [that I found, newbie on
  114. board!] are based on immediate values, and we have to avoid
  115. arbitrary immediate values as usual.
  116. There's an *unconditional* absolute branch instruction that accepts
  117. the target addr from a register (which we can set at will using the
  118. "load_ahead+skip" pattern). So, we construct an unconditional
  119. over-the-block jump and skip this jump with the conditional one
  120. ("inverted", more about this in a moment). The point is that now we
  121. know exactly the distance to jump: it's the size of that
  122. construction. We can define a couple of conditional branch
  123. instructions because the immediate is not arbitrary anymore, nice!
  124. Maybe this pseudo-code explains it better:
  125. if(cond) block_foo; else block_bar;
  126. more;
  127. ... is compiled to:
  128. if cond then skip past the unconditional-branch // To get to foo-code.
  129. // We know the space used by this code...
  130. set register to addr of else-label
  131. // ... and this one, that completes the jump to the alternative block.
  132. unconditional-branch to addr in register
  133. foo-code
  134. [Here we jump to the endif-label, omitted for clarity.]
  135. else-label:
  136. bar-code
  137. endif-label:
  138. more-code
  139. Similar approach is used for other control flow structures. See
  140. CBZ_X0_PAST_BR (cbz x0, #20) and CBNZ_X0_PAST_BR (cbnz x0, #20) used
  141. as part of the generation of 'if', 'for', 'do' and 'while'
  142. statements. Notice how the test is inverted: when Knight does JUMP.Z
  143. we do CBNZ (process_if); when JUMP.NZ we CBZ (process_do).
  144. CSEL was considered but required an additional register, more labels
  145. and code. A bit too invasive a change to make to the codebase.
  146. As you can imagine, the ISA colored the port development from the very
  147. beginning. It's a lot of fun to come up with basic solutions under
  148. those limitations. The port works as expected but there's room for
  149. experimentation.
  150. *** Function call
  151. The Base Pointer and its relation to arguments in function calls and
  152. locals during function execution is a bit different compared to other
  153. supported architectures. This simplifies some calculations. See how
  154. unsurprising the depths are in collect_arguments() and
  155. collect_local().
  156. Note how this calculations are related to the "push/pop size". See
  157. `Wasted stack space`.
  158. Let's follow a couple of M2-Planet functions generating code for
  159. prologue, call and epilogue with the help of some artsy-less ascii-art
  160. stack graphs for clarity. The expected stack is "full" (the stack
  161. pointer register contains the address of the last pushed element) and
  162. descending (grows towards zero).
  163. Most of the work is done by function_call(). First, we save (the
  164. generated code does it at runtime of the compiled program, but please
  165. bear with me about the point of view) three registers on the stack. We
  166. include a scratch one ("tmp" value in the graphs) that we're going to
  167. use for two different purposes. On the one hand, to store the actual
  168. stack pointer (which is going to be the reference address --Base
  169. Pointer-- during the execution of the called function). On the other
  170. hand, when the BP is already set (which can't be done right now
  171. because we need the actual BP to evaluate the arguments in caller
  172. context) we use the register to store the addr of the function to be
  173. called. The other two registers are the Link Register (X30) and Base
  174. Pointer (X17 also know as IP1) itself, to allow for recursion. Both
  175. are prefixed with "o" in the following graphs, as in "old".
  176. This structure gives us a simple reference for both the args and the
  177. locals, without extra elements between those two sets. We rely on the
  178. semantics of BLR (more on this in a bit) which doesn't use the stack
  179. to save the return address, but a register. For other archs this is
  180. not possible (or not exploited, see how for ARM-7 the LR is saved in
  181. the stack just around the call proper; this puts it between the args
  182. and the locals) so it's a difference worth documenting.
  183. ---> Address 0
  184. tmp | oLR | oBP |
  185. ^
  186. |
  187. --- SP
  188. |
  189. --- BP-to-be
  190. Now we're ready to evaluate and push arguments. Note that M2-Planet
  191. doesn't follow AAPCS64. The evaluation might involve function calls
  192. itself and arbitrary use of the stack, but everything will be like
  193. this after all.
  194. tmp | oLR | oBP | arg1 | arg2 | ... | argN |
  195. ^ ^
  196. | |
  197. --- BP-to-be --- SP (omitted from now on)
  198. At this point we set the BP from the scratch register and execute
  199. branch-and-link (BLR) to the function reusing the (now free) X16
  200. register (also know as IP0). This instruction saves the address of the
  201. next instruction on X30 (LR, which we saved earlier to allow for
  202. recursion).
  203. tmp | oLR | oBP | arg1 | arg2 | ... | argN |
  204. ^
  205. |
  206. --- BP
  207. During the called function the locals are pushed on the stack as usual
  208. in M2-Planet.
  209. tmp | oLR | oBP | arg1 | arg2 | ... | argN | loc1 | loc2 | ... | locN |
  210. ^
  211. |
  212. --- BP
  213. When the function is about to return, we remove the locals from the
  214. stack and execute the return proper, jumping to the address in LR
  215. thanks to RET. This is handled by return_result().
  216. tmp | oLR | oBP | arg1 | arg2 | ... | argN |
  217. ^
  218. |
  219. --- BP
  220. Back in function_call() we remove the args from the stack.
  221. tmp | oLR | oBP |
  222. ^
  223. |
  224. --- BP
  225. Finally, we restore the saved registers (so X16, LR and BP contain
  226. tmp, oLR and oBP again) leaving everything as it was before this
  227. journey. Well... one important thing changed: following M2-Planet
  228. conventions the value returned from the function, if any, is on X0.
  229. *** Stack pointer
  230. Due to alignment (128 bits) restriction for "push" and "pop" based on
  231. the architectural register, we initialize and use X18 as stack pointer
  232. instead.
  233. The M1 definitions referring to SP use X18; stack operations too.
  234. For example:
  235. DEFINE LDR_X0_[SP] 400240f9 is ldr x0, [x18]
  236. DEFINE PUSH_LR 5e8e1ff8 is str x30, [x18, #-8]!
  237. DEFINE INIT_SP f2030091 is mov x18, sp