
# Côtehaus

Côtehaus is a house on the coast. In the land of hopes, Côtehaus is to become a Java 7 subset, syntax-directed, Dalvik bytecode translator.

## License

It is dual-licensed under The MIT License and the GNU General Public License version 3, with the exception of the file src/types.h, which includes copied code governed by a separate license. For the general case, see the LICENSE file.

## Developer notes

### Toolchain installation

On Debian-like systems, the distribution's package tasks take care of installing the necessary packages. Under Cygwin you need the following packages:

- gcc-core (to get a regular Cygwin program)
- libgcc1 (the support library for gcc-core)
- mingw64-x86_64-gcc-core (to get a Win64 program through the executable mingw64-x86_64-gcc)
- mingw64-i686-gcc-core (to get a Win32 program through the homonymous executable)
- libiconv-devel (the Unicode iconv library)
- bison
- make

These are the banners I got running test commands:

`cygcheck iconv`

```
Found: C:\cygwin64\bin\iconv.exe
C:\cygwin64\bin\iconv.exe
  C:\cygwin64\bin\cygiconv-2.dll
    C:\cygwin64\bin\cygwin1.dll
      C:\Windows\system32\KERNEL32.dll
        C:\Windows\system32\ntdll.dll
        C:\Windows\system32\KERNELBASE.dll
  C:\cygwin64\bin\cygintl-8.dll
```

`cygcheck gcc`

```
Found: C:\cygwin64\bin\gcc.exe
C:\cygwin64\bin\gcc.exe
  C:\cygwin64\bin\cygwin1.dll
    C:\Windows\system32\KERNEL32.dll
      C:\Windows\system32\ntdll.dll
      C:\Windows\system32\KERNELBASE.dll
  C:\cygwin64\bin\cygiconv-2.dll
  C:\cygwin64\bin\cygintl-8.dll
```

`gcc -v`

```
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-cygwin/10/lto-wrapper.exe
Target: x86_64-pc-cygwin
Configured with: /mnt/share/cygpkgs/gcc/gcc.x86_64/src/gcc-10.2.0/configure --srcdir=/mnt/share/cygpkgs/gcc/gcc.x86_64/src/gcc-10.2.0 --prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc --docdir=/usr/share/doc/gcc --htmldir=/usr/share/doc/gcc/html -C --build=x86_64-pc-cygwin --host=x86_64-pc-cygwin --target=x86_64-pc-cygwin --without-libiconv-prefix --without-libintl-prefix --libexecdir=/usr/lib --with-gcc-major-version-only --enable-shared --enable-shared-libgcc --enable-static --enable-version-specific-runtime-libs --enable-bootstrap --enable-__cxa_atexit --with-dwarf2 --with-tune=generic --enable-languages=c,c++,fortran,lto,objc,obj-c++ --enable-graphite --enable-threads=posix --enable-libatomic --enable-libgomp --enable-libquadmath --enable-libquadmath-support --disable-libssp --enable-libada --disable-symvers --with-gnu-ld --with-gnu-as --with-cloog-include=/usr/include/cloog-isl --without-libiconv-prefix --without-libintl-prefix --with-system-zlib --enable-linker-build-id --with-default-libstdcxx-abi=gcc4-compatible --enable-libstdcxx-filesystem-ts
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.2.0 (GCC)
```

`x86_64-w64-mingw32-gcc.exe -v`

```
Using built-in specs.
COLLECT_GCC=x86_64-w64-mingw32-gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-w64-mingw32/10/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with: /mnt/share/cygpkgs/mingw64-x86_64-gcc/mingw64-x86_64-gcc.x86_64/src/gcc-10.2.0/configure --srcdir=/mnt/share/cygpkgs/mingw64-x86_64-gcc/mingw64-x86_64-gcc.x86_64/src/gcc-10.2.0 --prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc --docdir=/usr/share/doc/mingw64-x86_64-gcc --htmldir=/usr/share/doc/mingw64-x86_64-gcc/html -C --build=x86_64-pc-cygwin --host=x86_64-pc-cygwin --target=x86_64-w64-mingw32 --without-libiconv-prefix --without-libintl-prefix --with-sysroot=/usr/x86_64-w64-mingw32/sys-root --with-build-sysroot=/usr/x86_64-w64-mingw32/sys-root --disable-multilib --disable-win32-registry --enable-languages=c,c++,fortran,lto,objc,obj-c++ --enable-fully-dynamic-string --enable-graphite --enable-libgomp --enable-libquadmath --enable-libquadmath-support --enable-libssp --enable-version-specific-runtime-libs --enable-libgomp --enable-libada --with-dwarf2 --with-gcc-major-version-only --with-gnu-ld --with-gnu-as --with-tune=generic --with-cloog-include=/usr/include/cloog-isl --with-system-zlib --enable-threads=posix --libexecdir=/usr/lib
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.2.0 (GCC)
```

### Functional definition

Côtehaus may be seen as divided into three phases: input, processing, and output.

Being at the input phase means taking an input file and, while reading it, making records of the things worth recognizing about that input file. The input file is a Java program, as in this example:

```java
class A {
	int add(int a, int b) {
		return a+b;
	}
}
```

The prototype Iadd(II) present in the input file (derived from int add(int a, int b)) is important for Côtehaus to record during the input phase.

After the input phase finishes, Côtehaus has many records derived from things seen in the input files, like Iadd(II) and many others.

Then the processing phase begins; the goal of this phase is to put the records in a defined order. Let's take an example: Iadd(II) and Isubstract(II) are two prototype records that might be produced in an unexpected order, say Isubstract(II) first and Iadd(II) later. Côtehaus knows that the expected order is exactly the inverse of the produced order. Here we let ourselves trust Côtehaus to determine the expected order, because, from one point of view, such a job is what a compiler is meant to do for us. During the processing phase, Iadd(II) is placed first and Isubstract(II) later.

All the records are persisted to an output file during the output phase. Sending messages, respecting conventions, using bitpatterns, and recording how much space a prototype's array occupies are among the things that characterize the output phase and get the output file done.

### Code structure

As the code is written, there are naming customs for each of the mentioned phases: for example, add_... for the input phase and build_o... for the processing phase.

The source forms around each phase are repeated as many times as there are structures from the Dalvik format that the phase targets.

To follow the proto_id example used above, a proto_id_item may be seen as defined in these terms: tok = elements of (ShortyDescriptor_tok, TypeDescriptor_tok, TypeDescriptor_tok, TypeDescriptor_tok). That said, a proto_id is defined in terms of all its tokens (tok). Each token is then recorded as a string with a call to add_str_id. From the initial tok = ShortyDescriptor_tok through the last, add_str_dt(tok) is executed for every token, all of them making up str_id[1..n]. Let str_id = add_str_dt(tok); step s1: str_dt = {unpacked_data, unpacked_size}, str_id = {str_dt}, proto_id.shorty_idx = str_id.

At the input phase, there is a Bison grammar that sets up structure recognition over the input text. Bison's semantic actions then integrate with the next stage, the processing, by allocating objects that represent the Dalvik format's structures. The grammar is found in the file java7.y, where the many instances of this construction look like this:

```yacc
MethodOrFieldDecl:
  Type Identifier MethodOrFieldRest
  { /* TODO
    switch($3.type) {
      case "method":
      */
      add_method_id(context_peek()->enclosing_class, add_proto_id($1, $3->formal_params->type_list),$2);
      /*
      break;
      case "field":
        add_field_id(...);
      break;
    }
    */
  }
  ;
```

The C-language implementation of these calls allocates memory and chains each new object with its same-typed ones. When an object is expected to reference another one, the call that deals with the input of the referrer accommodates a pointer to the referee. For example, the goal of the following is to make up a proto_id item, then chain it together with all the rest of them:

```c
struct proto_id_st *add_proto_id(/* things relative to a proto_id */) {
  struct proto_id_st *proto_id = /* malloc a proto_id */;

  /* Set the proto_id relative things, either with a malloc or by using a
  function parameter directly:

  proto_id->a_member = assign a function parameter as is;

  proto_id->another_member = malloc based on another function parameter (apply some logic);

  etc.
  */

  /* Chain the new proto_id object to a list with all of the rest of the proto_id's.
  "_list" is by custom part of the chain name; this applies to proto_id as to any item type.
  "_head" is by custom part of the name of the chain head for each item type.
  */
  list_add(&proto_id->proto_id_list, &proto_id_head);

  // Count the items in the chain.
  proto_id_list_size++;

  return proto_id;
}
```

At the processing phase, then, the program takes on permutations. It takes the objects as they came out of the input stage and permutes them as much as necessary to reorder these objects in accordance with the Dalvik format. For example:

```c
int oproto_id_st_compar(struct oproto_id_st *oproto_id_1, struct oproto_id_st *oproto_id_2) {
  if(oproto_id_1->proto_id->return_type_idx!=oproto_id_2->proto_id->return_type_idx) {
    return oproto_id_1->proto_id->return_type_idx-oproto_id_2->proto_id->return_type_idx;
  }
  if(oproto_id_1->proto_id->parameters_list->relative_offset!=oproto_id_2->proto_id->parameters_list->relative_offset) {
    return oproto_id_1->proto_id->parameters_list->relative_offset-oproto_id_2->proto_id->parameters_list->relative_offset;
  }
  return 0;
}

void
build_oproto_id() {
  oproto_id_ary = malloc(proto_id_list_size*sizeof(struct oproto_id_st));
  memset(oproto_id_ary, 0, proto_id_list_size*sizeof(struct oproto_id_st));
  struct list_head *tmp;
  unsigned int i=0;
  list_for_each(tmp, &proto_id_head) {
    (oproto_id_ary+i)->proto_id = (struct proto_id_st *)list_entry(tmp, struct proto_id_st, proto_id_list);
    i++;
  }
  qsort(oproto_id_ary, proto_id_list_size, sizeof(struct oproto_id_st), (int (*)(const void *, const void *))oproto_id_st_compar);
  i=1;
  unsigned int representative_i = 0, representative_idx = 0;
  while(representative_i<proto_id_list_size) {
    (oproto_id_ary+representative_i)->proto_id->idx = representative_idx;
    while(i!=proto_id_list_size && oproto_id_st_compar(oproto_id_ary+representative_i, oproto_id_ary+i)==0) {
      (oproto_id_ary+i)->proto_id->idx = (oproto_id_ary+representative_i)->proto_id->idx;
      i++;
    }
    if(i!=proto_id_list_size)
      (oproto_id_ary+representative_i)->next_representative = oproto_id_ary+i;
    representative_i=i++;
    representative_idx++;
  }
  bounds_move(NH_PROTO_ID_IDX, 3*sizeof(uint32_t)/* Storage designators:
  ((struct str_id_st)(struct proto_id_st).shorty_idx).idx,
  ((struct type_id_st)(struct proto_id_st).return_type_idx).idx,
  ((struct type_list_st)(struct proto_id_st).parameters_list).relative_offset;
  This makes for a total of 3 uint32_t.
  */ *representative_idx);
  majors_size_ary[NH_PROTO_ID_IDX] = representative_idx;
}
```

Finally, at the output stage, the program takes care of emitting an external representation in compliance with the Dalvik format. Right here, the fitting of the data into a spatial reference system overwhelms the data's meaning itself. After that, library calls are used that invert the format-meaning relation, turning the format into the meaning. For example:

```c
void pack_oproto_id() {
  for(struct oproto_id_st *oproto_id_representative = oproto_id_ary; oproto_id_representative!=NULL; oproto_id_representative=oproto_id_representative->next_representative){
    uint32_t packed = htole32(oproto_id_representative->proto_id->shorty_idx->idx);
    BUFFER_WRITE(buffer[NH_PROTO_ID_IDX],buf_len[NH_PROTO_ID_IDX],&packed,buf_offset[NH_PROTO_ID_IDX],sizeof(uint32_t))
    packed = htole32(oproto_id_representative->proto_id->return_type_idx->idx);
    BUFFER_WRITE(buffer[NH_PROTO_ID_IDX],buf_len[NH_PROTO_ID_IDX],&packed,buf_offset[NH_PROTO_ID_IDX],sizeof(uint32_t))
    packed = htole32(oproto_id_representative->proto_id->parameters_list->relative_offset+up_bounds_ary[NH_TYPE_LIST_IDX-1]);
    BUFFER_WRITE(buffer[NH_PROTO_ID_IDX],buf_len[NH_PROTO_ID_IDX],&packed,buf_offset[NH_PROTO_ID_IDX],sizeof(uint32_t))
  }
}
```

TODO: digest the following notes into the Developer notes section.

```
     /* step s2: from ordered without duplicates str_id[1..n] as ostr_id[1..n] by (comparison function, str_id[1..n].str_dt.{unpacked_data, unpacked_size}) ==> { str_id[i=1..n from ostr_id[1..n]].idx=i, str_id[j=1..n where in undisduplicated ostr_id[1..n] ostr_id[j] was a duplicate of the disduplicated ostr_id[i]].idx=i, str_id[1..n from ostr_id[1..n]].str_dt={packed_data, packed_size} }
     * step s3: up_bounds_ary[str_id_up_bound_idx..n]+=K*n in ostr_id[1..n]
     * step s4: ostr_id[i=1..n].relative_offset=str_id[i-1 from ostr_id[1..n]].str_dt.packed_size+ostr_id[i-1].relative_offset
     * step s5: up_bounds_ary[str_dt_up_bound_idx..n]+=str_id[n from ostr_id[1..n]].str_dt.packed_size+ostr_id[n].relative_offset
     */
    // PROCESS P2 minus P1 (i.e. P1 subtracted from P2): from initial tok = TypeDescriptor_tok (1) thru last of every type_id={str_id} is executed, all making up type_id[1..n]
    // Let str_id = add_str_dt(tok);
    /* step s1: type_id={str_id}, proto_id.return_type_idx = type_id
     * step s2: from ordered without duplicates type_id[1..n] as otype_id[1..n] by (comparison function, type_id[i=1..n].str_id.idx) ==> { type_id[i=1..n from otype_id[1..n]].idx=i, type_id[j=1..n where in undisduplicated otype_id[1..n] otype_id[j] was a duplicate of the disduplicated otype_id[i]].idx=i }
     * step s3: up_bounds_ary[type_id_up_bound_idx..n]+=K_2*n as n in otype_id[1..n]
     */
    // PROCESS P3 minus P1 minus P2 (i.e. P1 and P2 subtracted from P3): from initial tok = TypeDescriptor_tok (2), (3), (4) thru last of every type_list[L]=type_id is executed, all making up type_list[1..n][1..L].
    // Let str_id/*(2)*/ = add_str_dt(tok/*(2)*/); str_id/*(3)*/ = add_str_dt(tok/*(3)*/);  str_id/*(4)*/ = add_str_dt(tok/*(4)*/);
    /* step s1: type_id(2)=str_id(2), type_id(3)=str_id(3), type_id(4)=str_id(4), type_list[1]=type_id(2), type_list[2]=type_id(3), type_list[3]=type_id(4), L=3 in type_list[1..L], proto_id.parameters_list=type_list
     * step s2: from ordered without duplicates type_list[1..n (not the L-dimension)] as otype_list[1..n] by (comparison function, type_list[1..n][1..L]) ==> { type_list[i=1..n from otype_list[1..n]].relative_offset=K_3+K_4*L as L in type_list[i-1][1..L]+type_list[i-1].relative_offset, type_list[j=1..n where in undisduplicated otype_list[1..n] otype_list[j] was a duplicate of the disduplicated otype_list[i]].relative_offset=type_list[i].relative_offset }
     * step s3: up_bounds_ary[type_list_up_bound_idx..n]+=K_3+K_4*L as L in type_list[1..n][1..L]
     */
    // PROCESS P4 minus P1 minus P2 minus P3 (i.e. P1, P2 and P3 subtracted from P4): from initial proto_id thru last of every proto_id complete, all making up proto_id[1..n]
    /* step s3: from ordered without duplicates proto_id[1..n] as oproto_id[1..n] by (comparison function, proto_id[i=1..n].{return_type_idx,parameters_list}) ==> { proto_id[i=1..n from oproto_id[1..n]].idx=i, proto_id[j=1..n where in undisduplicated oproto_id[1..n] oproto_id[j] was a duplicate of the disduplicated oproto_id[i]].idx=i }
     * step s1: up_bound_ary[proto_id_up_bound_idx..n] += K_5*n as n in oproto_id[1..n]
     */
    // PROCESS P5 (modeled after assumption type_list[1..n] lower and str_dt[1..n] higher addressed)
    // General formula for next padding to write: next_padding_to_write=(next_item_alignment-last_write_up_bound_address%next_item_alignment)%next_item_alignment
    
    // step s1: pack_write(str_id[1..n from ostr_id[1..n]], /*for address referencing from str_id to str_dt (*/ ostr_id[1..n], up_bounds_ary[str_dt_idx-1],/*) for align (*/up_bounds_ary[str_id_idx-1], alignment_ary[str_id_idx] /*)*/ )
    // step s2: pack_write(type_id[1..n from otype_id[1..n]], str_id[1..n], up_bounds_ary[type_id_idx-1], alignment_ary[type_id_idx])
    // step s3: pack_write(proto_id[1..n from oproto_id[1..n]], /*for address referencing from proto_id to type_list (*/ type_list[1..n], up_bounds_ary[type_list_idx-1],/*) for align (*/ up_bounds_ary[proto_id_idx-1], alignment_ary[proto_id_idx] /*)*/)
    // step s4: pack_write(type_list[1..n] from otype_list[1..n], up_bounds_ary[type_list_idx-1], alignment_ary[type_list_idx])
    // step s5: write_packed(str_dt[0..n], up_bounds_ary[str_dt_idx-1], alignment_ary[str_dt_idx])
```

## Project updates

The current focus is on gathering samples of what could be Côtehaus input/output data, i.e. samples of any Android Java code and its corresponding .dex code, regardless of whether it is wrapped in an .apk. They come mainly from the F-Droid site, although some may come from elsewhere. The gathered material is listed below:

Of course, all the apps listed above are unrelated to Côtehaus.