Java 7 subset syntax directed Dalvik bytecode translator.

Gabriel Czernikier b7c2e82e2f java7.c 2 years ago
grammar_makeup 3cd4db794d Code rearrangement. 3 years ago
libsym_makeup 3cd4db794d Code rearrangement. 3 years ago
src b7c2e82e2f java7.c 2 years ago
.gitignore 062c898602 Writting README.md and .gitignore 3 years ago
LICENSE 4c79995f3e Migrates program name, license and copyright years. 4 years ago
Makefile 28b41dcb10 Makefile, README.md 2 years ago
README.md 28b41dcb10 Makefile, README.md 2 years ago
build_wtname.sh 6f61a8a521 move files 2 years ago

README.md

Côtehaus

Côtehaus is a house in the coast. In the land of hopes, Côtehaus is to become a Java 7 subset syntax directed Dalvik bytecode translator. Also known as a compiler.

License

It is released dually under The MIT License and The GNU General Public License version 3, with the exception of file src/main/types.h which includes copied code governed by a separate license as pointed out from that particular file. For the general case, see LICENSE file.

Developer notes

Toolchain installation

If you have a Debian system, a Makefile is in Côtehaus to automate the install of tools. So you run make debian_install_zlib and make debian_install_polarssl.

If you are wandering the Cygwin ramblings, you have to install yourself the packages gcc-core, libgcc1, mingw64-x86_64-gcc-core (that will provide the command mingw64-x86_64-gcc), mingw64-i686-gcc-core (that will provide the command mingw64-i686-gcc), libiconv-devel, bison and make. Check the installed packages by running cygcheck iconv, cygcheck gcc, gcc -v and x86_64-w64-mingw32-gcc.exe -v. If your choice is neither Debian nor Cygwin, theese instructions may turn out a bit lacking, you're unwound at luck and creativity.

Whichever way you go, get done the job of installing zlib and polarssl. Read the Makefile if you need to know more about them.

Functional definition

Côtehaus may be seen as divided in phases: input, processing and output.

Being at input phase means to take an input file and while seeing it make records of things useful to recognize about that input file. The input file is a Java program, as in this example:

class A {
	int add(int a, int b) {
		return a+b;
	}
}

The prototype Iadd(II) present in the input file (derived from int add(int a, int b)) is to Côtehaus important to record during the input phase.

After the input phase finishes, Côtehaus has a lot of records derived from things seen in the input files, like Iadd(II) and many others.

Then the processing phase begins, and the goal of this phase is to set the records in a defined order. Let's take this example: add(II) and Isubstract(II). They are 2 prototype records that might be produced but in an unexpected order, let's say Isubstract(II) is produced first and Iadd(II) is later. Côtehaus knows that the expected order is right the invert of the produced order. This time we let ourselves trust Côtehaus to determine the expected order, because from a point of view, such a job is what a compiler is to make for us. During the processing phase Iadd(II) is set first and Isubstract(II) later.

All the records are persisted to an output file at the output phase. Sending messages, respecting conventions, using bitpatterns, informing how much size a prototype's array occupies, are among things that describe the output phase and have the output file done.

Technical concerns

Relating source code to phases and items

The way how the code is written, there are naming customs used for each one of the mentioned phases. For example add_... for the input phase, build_o... for the processing phase and pack_... for the output phase.

The name of the item belongs on the Dalvik's format specification, string_id_item and proto_id_item are examples of items defined in that specification. These names might be further shortened in Côtehaus, as in str_id and proto_id, that should be easy to guess which is which one.

Figure out which item is handled on which phase by which piece of source code is explained next. Take a name stub from the shown above, like add_... for the input phase, then another name stub like proto_id. Put them together, and if you see a function named add_proto_id somewhere arround the source code, you'll know that function handle the input of a proto_id_item, of course of the input phase.

Exercise

From the table below.

Phase item
add_... str_id
build_o... proto_id
pack... class_def

Tell what should be the functions that:

  • Handle a string_id_item during the output phase.
  • Handle a proto_id_item during the processing phase.

Answer

  • Function pack_str_id handles a string_id_item during the output phase.
  • Function build_oproto_id handles a proto_id_item during the processing phase.

Input phase

To follow the example of proto_id used above, a proto_id_item may be seen as defined in theese terms: tok = elements of (ShortyDescriptor_tok, TypeDescriptor_tok, TypeDescriptor_tok, TypeDescriptor_tok). Said so, a proto_id is defined in terms of all tokens (tok). Each token is then recorded as a string with a call to add_str_id. The records placed by add_str_id are chained in a list without any expected order. As many times add_str_id is called, a countet increments to carry the quantity of records placed into that chained list. Exercise: look at the code of add_str_id and try to tell the name of the list and the name of the counter.

At the input phase, the file java7.y encodes information about what to do with a Java program input file. It leverages a subset of the Java language at its 7th version. You need a tool called Bison that puts java7.y into terms that Côtehaus doesn't understand but computer machines you use do, a fact just fulfilling to Côtehaus. So, java7.y encodes how to distinguish the several parts of a Java program. If speaking about a prototype as one of that many parts, java7.y associates a prototype with calls to add_proto_id, which in turn makes calls to add_str_id, putting the records in place, into either chained lists, one for prototypes, the other for strings.

Let's see an add_proto_id function depict:

struct proto_id_st *add_proto_id(/*things relative to a proto_id*/) {
  struct proto_id_st *proto_id = /*mallocs to a proto_id*/;
  
  /*set proto_id relative things, making malloc or using a function parameter directly:
  
  proto_id->a_member = assign a function parameter as is;
  
  proto_id->another_member = malloc based on another function parameter (apply some logic);
  
  etc.
  */
  
  /* Chain the new proto_id object i to a list with all of the rest of  proto_id's.
  "_list" is by custom part of the chain name. This is so to proto_id as to any item type.
  "_head" is by custom part of the name of the chain head for each item type
  */
  list_add(&proto_id->proto_id_list, &proto_id_head);
  
  // count the items in the chain
  proto_id_list_size++;
  
  return proto_id;
}

Processing phase

At the processing phase the ordering (sort) of records happen, taking the records somewhat closer to how Dalvik's format expects, that is, the expected order.

There are upstream functions doing the job, like build_oproto_id. Functions like that lean on downstream functions that make the bottom half of the job, as is the case of oproto_id_st_compar. Arrays gain a place in processing phase, while the records they hold are those that chained lists used to in the input phase. Arrays are to work in compass with ..._compar functions, together they accomplish the goal of sorting the records.

ostr_id_ary is an ordered array of strings. o is for ordered, _ary is for array, just name customs. str_id is the part of the name shortened from Dalvik's format spec. _idx is an atribute where an ascending numerical value is assigned to each record, considering that if more than one str_id has as its string target one with the same string value as another one, only one is taken to represent all of them (an equivalence class representative) and only such one is assigned an _idx. Because this _idx is assigned over the sorted array, then this _idx conforms to what Dalvik's format expects about array indexes.

Let's go ahead from an example based on a same one used before.

class A {
	int add(int a, int b) {
		return a+b;
	}
	
	int substract(int a, int b) {
	    return a-b;
	}
}

The break down of a prototype can be reasoned as a prototype has a shorty description, a return type, and an array of parameter types. For the case of int add(int a, int b)-derived prototype, let's allow the conceeded wisdom that its shorty description is I(II). For int substract(int a, int b) it is the same at all, I(II), because it has the same return type and the same parameter number and types.

The break down of a prototype to the deepest level is just the break down of each of its parts. So, a return type is indeed a type, an array of parameter types is an array of which each element is a type, and a type points at a string_id which in turn points at a string data to hold the string representing that type's descriptor. By the way, this way of reasoning can be refered to as recursive. In the chained list of string data, we have a boilerplate of I's records produced at the input phase.

After records are ordered all I are set contiguous. Then the boilerplate is eliminated by taking just one I from the string data array as the representative for an equivalence class for all the I's. This process happens at the processing phase and transits the whole break down. An equivalence class representative is choosen from the string ids, then from the types, and widening even beyond the extent, I(II) is the same for add and substract so from the array of shorty descriptions, just one is choosen as representative to be emitted later at output phase. Instead of each record pointing at its own copy, all point at the position (or index in Dalvik's speak and in truth in common speak), of the only one copy that represents all of them.

Indices or Offsets

In a spatial referential system as the .dex file is, items have an inherent address, a numerical value of the item's distance to a zero point or the beginin of the referential system (the whole message or the output file's beginin). If an array has a fixed length to all of its items, then each item's address can be encoded in terms of the item position or index, assigned to in an item's attribute which name contains _idx in Côtehaus's source code. If the estipulation (expectation) about an array is to have variable length items, then Côtehaus tracks the address distance each item adds to the next one, called offset or relative_offset in the source code. Probably in accordance with this reasoning, the Dalvik format requests the array of string ids are index based, whilst the array of list type arrays are offset based, just to take 2 random cases.

Items (records) in otype_array (the array of list type arrays) have each of them a fixed length and an uniformly variable length (varies its length for it has more or less items but uniformly for each item is fixed sized). relative_offset is used as for this case as by name custom for all offset based arrays, and each item is assigned a value for its relative_offset at processing phase counting the offset and the overall length (fixed+variable) of the previous item over a sorted, equivalence-class-represented array, so that equivalent list type array items are neither counted nor emitted at output phase.

Output phase

The program at the output phase carries with emitting an external representation in compliance to Dalvik's format. Right here, the the fitting of the datas in a spatial referential system overwhelms the data's meaning itself. after that fact, library calls are used that invert the relation format-meaning, turning the format into the meaning. This is just like you swap forefront and background on how you see a picture. To name one, a library call for such a purpose is on the function htole32, standing for host representation to little endian of 32 bits. Library calls like that are made from the pack_... functions.

Alignment

Alignmemt is primarily performed at output phase, though distance between arrays to be aligned is based on numbers computed as records are being blown, during previous phases. up_bounds_ary has something to do with that along processing and output phases.

Sending messages to the output, whilst an aligment unaware version would direct to emit the bits for the next array start, with alignment on board some bits are emitted without meaning but pushing the next array to a position that satisfies a condition about its distance from the start of the message (the beginin of the output file). That condition is requested by the Dalvik's format spec. Those meaningless bits are called padding.

The padding length has this formula:

padding_length=(next_item_alignment-last_up_bound_address%next_item_alignment)%next_item_alignment
  • next_item_alignment is a constant valued attribute, that changes only if the item changes (items as a prototype and a type list may have different alignment attributes so the value changes).
  • last_up_bound_address is the uppest address of the last item. There is a relation between item arrays over which an array address is pushed up by the arrays that occupy an upper space, and this array pushes up the addresses of down-placed arrays. Côtehaus tracks last_up_bound_address keeping the relation between item arrays.
  • last_up_bound_address%next_item_alignment tells the upper items' highest address excess respecting as how much it was to fall exactly aligned to next item alignment.
  • (next_item_alignment-last_up_bound_address%next_item_alignment)%next_item_alignment, the substraction takes the excess' complement, it gives how much length to make the next padding, except if there was no excess, in which case last reminder operation (that outside the parenthesis) takes on fixing up that.

Eliminating redundancies

When a set of equivalence-class items is equivalence-class-represented by any one of them, not only Côtehaus emitts just that one, but also all the references, either by index or by offset, are fixed at that one, because the choosen representant can be fully taken as the new copy to refer to, from everywhere. After that kind of factorization (boilerplate elimination) the break down of prototype id is: one copy of the many I(II) (int add(int a, int b), int substract(int a, int b) et. al.) exists and points at (refer to) a return type (indeed a type) by index. Also it points at an array of types by offset, each of which points at a type, the same copy that that one return type. The return type points at a copy of string id by index that in turn points at a copy of string data for the type descriptor I, and no other copies exists in the output file as for string data as string id representing the I type.

Still early stage / blurly looked

Vaguely speaking, pairs of source/bytecode that Côtehaus should be able to transform, from source to bytecode, are like the listed next. The bytecode listed next might be .apk wrapped, so the reader is meant to take into account only the .dex wrapped inside. Especially, ignoring the .apk PGP signature which is not in scope (don't get it wrong with the .dex header's MD5 hash, this one yes, is in scope).