Universal Disassembler by bzt

I needed a small and fast disassembler, and I was shocked when I saw there was none. All are bloated and full of dependencies, not to mention that they are incomplete and difficult to expand. The same stands for GNU binutils, the LLVM disassembler and Capstone (which, by the way, dares to call itself lightweight despite being over half a megabyte in size in 'diet' mode, WTF?).

So "Good news everyone!" (TM), I've written a decent one! One that is easy to use, easy to integrate, easy to expand, and really truly lightweight. The only dependency it has is libc's sprintf(). Of course I did not write it by hand (no sane programmer would, haha :-) ). I've written a script to generate the disassembler's C source from an instruction table in a plain text file. Providing different instruction tables can create a disassembler for any arbitrary architecture, hence the name, Universal Disassembler.

Here are some disassembler code sizes in bytes, compiled on x86_64 Linux with gcc 7.3.0:

| Architecture | Encoding type | Instructions | Code size (bytes) |
|---|---|---|---|
| AArch64 | fixed length | 1164 | 66239 |
| Zilog 80 | variable length | 194 | 6821 |
| Java bytecode | variable length, arguments | 219 | 2476 |

As you can see, my AArch64 disassembler (which supports the full ARMv8.2 instruction set) is no more than 65k. Now that is what you can call lightweight.

[[TOC]]

Usage

Really, really simple: a single C++ compatible ANSI C header file and one function only. First include the required instruction set (normally only one at a time, but see pseudo-lists below), then call the disasm() function:

#include "aarch64.h"    // include only one of them
#include "java.h"
#include "z80.h"

addr = disasm(uint64_t addr, char *str);

You pass the memory address of the instruction and a pre-allocated buffer, and the function returns the address of the next instruction. If the string buffer is not NULL, the mnemonic and its arguments will be written into it. The buffer must be large enough for the string representation of the instruction (64 bytes surely will do, but see -n below). Because the disassembler does not know in advance whether it needs to read a byte, a word or a dword for an instruction, it is cleaner for the function to receive the virtual address only, without casting.
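
A minimal sketch of a disassembly loop built on that contract (address in, next address out). The names `disasm_fn`, `dump_range` and `stub_decode` are made up for illustration and are not part of the library; the function pointer is only there so the sketch compiles without a generated header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Same shape as the disasm() described above: takes an address, fills str,
   returns the address of the next instruction. The typedef is hypothetical. */
typedef uint64_t (*disasm_fn)(uint64_t addr, char *str);

/* Stand-in decoder so the sketch runs without a generated header:
   it pretends every instruction is 4 bytes long. */
static uint64_t stub_decode(uint64_t addr, char *str)
{
    sprintf(str, "insn@%llx", (unsigned long long)addr);
    return addr + 4;
}

/* Walk [start, end) and print one line per instruction.
   Returns the number of instructions printed. */
static int dump_range(disasm_fn decode, uint64_t start, uint64_t end, FILE *out)
{
    char str[64];   /* 64 bytes, the size suggested for normal mode */
    int count = 0;
    uint64_t addr = start;

    assert(decode != NULL);
    while (addr < end) {
        uint64_t next = decode(addr, str);
        if (next <= addr)   /* defensive: stop instead of looping forever */
            break;
        fprintf(out, "%016llx: %s\n", (unsigned long long)addr, str);
        addr = next;
        count++;
    }
    return count;
}
```

With a generated header included, the real `disasm` could presumably be passed in place of `stub_decode`, provided its prototype matches the typedef.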

That's all folks! :-) Simplicity is the ultimate sophistication!

Integration with OS kernels and debuggers

For a use case and usage example, take a look at the mini debugger library.

The really dirty txt2h.php script converts the instruction tables in text files into ANSI C header files. The good news is you don't have to care about this script, just call it if the pre-generated C headers do not suit your needs.

php txt2h.php <instruction table text file> [-v|-i|-a] [-s|-n]

The optional "-v" second parameter makes the script verbose, reporting unreferenced argument types, lists and bitgroups.

If you pass "-i" as second parameter to the script, it will generate a slightly different (but 100% compatible) C code. The disassembler will then use three external variables for integration:

  • buf_reloc: the address of the buffer used. All addresses in the disassembly will be relative to this address.
  • dbg_label: if the instruction encodes a label, the address (minus buf_reloc) will be stored in this variable (so that the caller can look it up in the symbol table). If no label is associated with the instruction, it will be cleared to zero.
  • sys_fault: this boolean variable should be set to 1 by exception handlers (Page Fault or Data Abort) if reading 'addr' triggers a CPU exception inside the disassembler. For fail-safe only, should not be needed.

Also a "disasm_integration" define will be defined.

Passing "-a" as second argument will generate a disassembler for code analysis. In this mode the string buffer is filled up with a JSON string describing the instruction in detail (see Example outputs section below). The suggested buffer size for this mode is at least BUFSIZ, but see also the "-n" third option. In this case a "disasm_analytics" define exists.

If you pass "-s" as third argument to the conversion script, the C code will expect sprintf() to return a char * pointer to the byte after the string written instead of the number of bytes written. That will save an additional 'add' instruction after every single sprintf() call in the disassembler, thus reducing its size even further.
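
To make the "-s" contract concrete, here is a sketch of a formatter with that return convention. The name `sprintf_end` is hypothetical; a freestanding kernel would supply its own sprintf() with this behaviour, and this version simply wraps libc's vsprintf() to show the difference in the return value.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper: formats like sprintf(), but returns a pointer to the
   terminating NUL it wrote, i.e. the byte just after the formatted text. */
static char *sprintf_end(char *dst, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(dst != NULL && fmt != NULL);
    va_start(ap, fmt);
    n = vsprintf(dst, fmt, ap);
    va_end(ap);
    /* on a formatting error keep the write position unchanged */
    return n < 0 ? dst : dst + n;
}
```

Chained calls can then continue writing at the returned pointer, with no separate `dst += n` step after each call.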

With "-n" as the third argument, the disassembler will use snprintf(). Useful in conjunction with "-a". With this option disasm() will have an additional buffer size argument, and returns 0 if the buffer was not big enough to store the string.

addr = disasm(uint64_t addr, char *str, int size);

This mode is indicated by the existence of "disasm_snprintf" define.

Instruction table text file

An ASCII text file holds the argument definitions and the instruction table for a specific architecture. It has 2, 3 or 4 columns, separated by the tab '\t' character. If you lack an instruction, you only need to add a line to the text file, regenerate the C header and you're good to go. No C or PHP coding required at all for expansion!

Lines starting with a slash '/' are handled as comments. C style multi line /* */ comments are also allowed in the text file.

Lines starting with a plus sign '+' are copied verbatim to the C header. The converter script does not include any architecture specific part, so architecture specific helper functions have to be added this way, for example disasm_sysreg() for AArch64 to decode system register names, or that fucked up DecodeBitMasks() function. These additional functions must always be prefixed by 'disasm'.

Argument definitions

Lines starting with an at sign '@' are list definitions. The first column is the name of the list, the second column is a space ' ' separated list of strings. Examples:

@conds	EQ NE CS CC MI PL VS VC HI LS GE LT GT LE AL NV
@ic_op	IALLUIS IALLU ? IVAU

Two list names are reserved: '@disasm' and '@signext' (see below).

(HINT: if you really really desperately want to include multiple instruction sets within a debugger, you can rename the disasm() function for each architecture with the '@disasm' pseudo-list, like '@disasm disasm_x86'. Then you can include different disassemblers into one single C source without name conflict.)

Lines starting with '<' or '{' are argument type definitions with 2 columns. The '{' denotes an optional argument. Argument types are enclosed in '<' and '>', just like in the DDI0487 documentation for AArch64. I did not introduce a new syntax for the others, so all architectures use the same. Argument type names must be C define name safe, so for example instead of '<Xt|SP>' the text file uses a shorter, pipe-less '<XtS>' form. Similarly, '<Vd>.<T>' becomes '<VdT>'. The second column is an sprintf parameter list, or a list reference if it starts with an at sign '@' and ends in an index enclosed by '[' and ']'. The character '^' means the address of the instruction being disassembled, '$' the next instruction's address. Otherwise it's just an sprintf argument. For example:

<Xt>	t==31?"xzr":"x%d", t
<XtS>	t==31?"sp":"x%d", t
<Rt>	t==31?"%czr":"%c%d", (s?'x':'w'), t
<labelij1>	"0x%x", ^+(i<<2)+j
<ic>	@ic_op[c]
<c>	@conds[c]

That simple. The one letter variables you use in argument definitions define the bitgroups you can use in bitmasks (see instruction table below).
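
As a rough illustration of what these definitions amount to in C, the sketch below spells out `<Xt>`, `<Rt>` and `<c>` by hand. The function names are made up and the generated header may well look different; the point is only that the second column ends up as sprintf parameters and that a list reference is a plain table lookup.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* @conds list from the text file, as a plain C string table */
static const char *conds[] = { "EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
                               "HI", "LS", "GE", "LT", "GT", "LE", "AL", "NV" };

/* <Xt>   t==31?"xzr":"x%d", t */
static int fmt_Xt(char *str, int t)
{
    return t == 31 ? sprintf(str, "xzr") : sprintf(str, "x%d", t);
}

/* <Rt>   t==31?"%czr":"%c%d", (s?'x':'w'), t */
static int fmt_Rt(char *str, int s, int t)
{
    char w = s ? 'x' : 'w';
    return t == 31 ? sprintf(str, "%czr", w) : sprintf(str, "%c%d", w, t);
}

/* <c>    @conds[c]   (the bitgroup value indexes the list) */
static int fmt_c(char *str, unsigned c)
{
    assert(c < sizeof conds / sizeof conds[0]);
    strcpy(str, conds[c]);
    return (int)strlen(str);
}
```

Each helper returns the number of characters written, like sprintf() does in the default (non "-s") mode.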

Lists are not mandatory. They are useful if specific bits can be used as an index, like choosing a register name from a list. You can specify the opcodes for different arguments as well. For example these are identical:

@something A B C
<slist> @something[s]
111111ss INST <slist>

and

<s0> "A"
<s1> "B"
<s2> "C"
11111100 INST <s0>
11111101 INST <s1>
11111110 INST <s2>

The first example uses the 'slist' argument type, which references the list named 'something' and tells the disassembler to index it with the 's' bits of the instruction. The second, list-less version can encode any variation of instructions and arguments, but generates a bigger disassembler.

Instruction table

Now the interesting part. All the other, non-empty lines are instruction definitions, with 3 or 4 columns. The first column is a bitmask for the instruction. The second column is a mnemonic. The third column is a comma ',' separated list of argument types. The optional 4th column may contain extra C commands for setting bitgroup values if needed, separated by a semicolon ';' character.

Opcode descriptions

The bitmask may contain letters, and has three special characters: '0' and '1' for masking, and 'x' as a don't care bit. Bits marked with the same letter are handled together as a bitgroup, where the most significant bit is on the left, and the least significant bit is on the right. Example:

0jj10000iiiiiiiiiiiiiiiiiiittttt	ADR	<Xt>, <labelij1>

The letters ('i', 'j' and 't' in our example) come from the arguments. The argument '<Xt>' defines bit 't', and all bits marked with 't' are grouped together and encode the first argument's value. On the other hand '<labelij1>' defines 'i' and 'j' bits, so those bits belong to the second argument (but form two separate bitgroups). If an immediate value is encoded, it should be marked with 'i'. That letter is special in a way, as its value is sign extended automatically. If you need more than one sign extended bitgroup, you can define the letters for them in the '@signext' pseudo-list, like '@signext i j k'.

The argument names more or less follow the notation in DDI0487 Chapter C4. I had to simplify things and make it straightforward for the disassembler, so there are minor differences. For example I had to differentiate between <FPz4t> and <FPz3t> as the same 'z' bitgroup value encodes different register sizes for them.
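
The ADR line can be decoded by hand to see what the bitgroups and the automatic sign extension of 'i' mean. This is only a sketch with hypothetical names (`sign_extend`, `adr_fields`, `decode_adr`, `is_adr`), not the generated code; the shifts are read straight off the mask, and `i * 4` stands in for `i<<2` because shifting a negative value is not portable C.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low `bits` bits of v. */
static int64_t sign_extend(uint32_t v, unsigned bits)
{
    uint32_t sign;
    assert(bits >= 1 && bits <= 31);
    sign = 1u << (bits - 1);
    v &= (sign << 1) - 1;
    return (int64_t)(v ^ sign) - (int64_t)sign;
}

/* Mask: 0jj10000iiiiiiiiiiiiiiiiiiittttt   ADR <Xt>, <labelij1> */
typedef struct { int64_t i; uint32_t j, t; uint64_t label; } adr_fields;

/* Only the '0' and '1' positions of the mask take part in matching. */
static int is_adr(uint32_t insn)
{
    return (insn & 0x9F000000u) == 0x10000000u;
}

static adr_fields decode_adr(uint32_t insn, uint64_t addr)
{
    adr_fields f;
    f.j = (insn >> 29) & 0x3;                       /* bits 30..29 */
    f.i = sign_extend((insn >> 5) & 0x7FFFF, 19);   /* bits 23..5, signed */
    f.t = insn & 0x1F;                              /* bits 4..0 */
    /* <labelij1>  "0x%x", ^+(i<<2)+j   where ^ is the instruction address */
    f.label = addr + (uint64_t)(f.i * 4) + f.j;
    return f;
}
```

For example, 0x10000041 at address 0x1000 yields t=1, i=2, j=0 and a label of 0x1008, while 0x10FFFFE0 yields i=-1 and a label of 0xFFC.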

Now let's see another example with more arguments:

s0010001jjiiiiiiiiiiiinnnnnttttt	ADD	<RtS>, <RnS>, #<i>{, LSL #<j12>}

The documentation (C4.1.1) calls the 31st bit 'sf', but we have to use single letters only, so I've chosen 's' for size flag. The register names in arguments depend on its value. The last argument must be multiplied by 12, so I used an argument name that reflects that.

It is also worth mentioning that the strings '#' and 'LSL #' in the argument list do not count at all. Their sole purpose is visual aid. If the arguments 'offs' (offset start) and 'offe' (offset end) are defined, then the characters '[' and ']' will be handled as arguments, otherwise they are simply skipped just like the others. The exact output format of the argument (including the prefix string) depends only on the definition of <i> and <j12>, like:

<i>	"#0x%x", i
<j12>	"lsl #%d", j*12

Arguments are not allowed in mnemonic names, with one exception: the conditional argument '<c>', like in 'B.<c>' or in 'J<c>'. Here the bits marked by letter 'c' will choose the ending of the mnemonic from the 'conds' list. One could also define several 'B.EQ', 'B.NE' etc. lines, and omit '<c>' from instruction names altogether, but that would generate a bigger disassembler.

Another example, here I haven't followed the documentation at all. In order to describe the instruction effectively, I've split it up into 4 different bitmasks.

@dc_op0	? IVAC ISW
@dc_op1	CSW CISW
@dc_op2	CVAC CVAU CIVAC
<dc0>	@dc_op0[d]
<dc1>	@dc_op1[d]
<dc2>	@dc_op2[d]
<ZVA>	"ZVA"
1101010100001000011101100ddttttt	DC	<dc0>, <Xt>
110101010000100001111d10010ttttt	DC	<dc1>, <Xt>
110101010000101101110100001ttttt	DC	<ZVA>, <Xt>
110101010000101101111d1d001ttttt	DC	<dc2>, <Xt>

It's a good example of the flexibility of the instruction table: each bitmask defines 'd' bits at different positions. It's perfectly safe as long as the bitmasks are distinguishable from each other. The conversion script has your back covered, and warns you about ambiguous bitmasks. This example also demonstrates that bits do not need to be contiguous; they can be separated, yet they will be handled together, as you can see in the last bitmask.

Upper case letters belong to the same bitgroup as the lower case ones; but they are more significant bits. For example if you have an instruction bitmask where the index is 5 bits long of which the last two bits are more significant than the rest, you can encode it as 'iiiII'. That will be handled as '(II<<3)|iii'. Example:

0111111101jjmmmm1101J0nnnnnttttt	SQRDMLAH	<FPz4t>, <FPz3n>, <VmTs>	z=1
0111111110jmmmmm1101J0nnnnnttttt	SQRDMLAH	<FPz4t>, <FPz3n>, <VmTs>	z=2

Here the 'J' bit on the right encodes the most significant bit for the 'j' bitgroup. I also set the 'z' bitgroup in the 4th column, as the bitmask does not contain any 'z' letters (using 'zz' before the 'j' bitgroup would make the bitmasks ambiguous, and the disassembler wouldn't know which mask applies, the one with 2 'j's and 4 'm's or the one with 1 'j' and 5 'm's). I could have used more argument types not referencing 'z' at all, but the disassembler is much smaller this way.
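
The bitgroup rules (same letter handled together, bits may be scattered, upper case above lower case) can be sketched as a small mask-driven extractor. The function `bitgroup` is hypothetical and works on the mask string at run time, whereas the generated header presumably uses precomputed shifts; it is meant only to make the rules checkable.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Collect the bits of `insn` that the 32-character mask marks with `letter`
   (lower case), left to right, most significant first. Bits marked with the
   upper case form of the letter are placed above the lower case ones. */
static uint32_t bitgroup(const char *mask, uint32_t insn, char letter)
{
    uint32_t lo = 0, hi = 0;
    unsigned nlo = 0, pos;
    char upper = (char)toupper((unsigned char)letter);

    assert(strlen(mask) == 32 && islower((unsigned char)letter));
    for (pos = 0; pos < 32; pos++) {
        uint32_t bit = (insn >> (31 - pos)) & 1u;
        if (mask[pos] == letter) {
            lo = (lo << 1) | bit;
            nlo++;
        } else if (mask[pos] == upper) {
            hi = (hi << 1) | bit;
        }
    }
    /* avoid an undefined 32-bit shift when every position is lower case */
    return nlo < 32 ? (hi << nlo) | lo : lo;
}
```

With the first SQRDMLAH mask, the word 0x7F50D800 has 'jj' = 01 and 'J' = 1, so the 'j' bitgroup comes out as 5. With the last DC mask, 0xD50B7E25 gives d = 2 from two non-adjacent bits, which selects CIVAC from the dc_op2 list.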

Bitmasks are looked up by best match policy (where defined bits take precedence over bitgroup letters), so it is possible to have a more specific mask first, and then a catch-all mask later. For example:

0q001100100111110000zznnnnnttttt	ST4	<VtT>, <Vt2T>, <Vt3T>, <Vt4T>, <XnS>, <Qi>
0q001100100mmmmm0000zznnnnnttttt	ST4	<VtT>, <Vt2T>, <Vt3T>, <Vt4T>, <XnS>, <Xm>

Here we first match a specific value for 'm' and use an immediate encoded by the 'q' bitgroup, then we fall back to the register argument form where the argument is encoded in 'm'.

Instructions sharing the same argument list and similar bitmasks are automatically grouped together for smaller disassembler code size. If that happens, the conversion script replaces the bits indexing the instruction with a tilde '~' character. Tilde is handled as a bitgroup, but since it's not a letter it will not interfere with the other bitgroups specified in the instruction table.
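
One way to picture the best match policy is to count the fixed bits of every matching pattern and keep the most specific one. The sketch below does exactly that for the two ST4 lines; all names are hypothetical, and the generated code most likely resolves this at conversion time rather than by scanning pattern strings at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The two ST4 patterns from the table above. */
static const char *const st4_patterns[2] = {
    "0q001100100111110000zznnnnnttttt",   /* post-index, immediate */
    "0q001100100mmmmm0000zznnnnnttttt"    /* post-index, register  */
};

/* Split a 32-character pattern into a fixed-bit mask and expected value.
   '0' and '1' are fixed; every other character leaves the bit free. */
static void compile_pattern(const char *pat, uint32_t *mask, uint32_t *value)
{
    size_t pos;
    assert(strlen(pat) == 32);
    *mask = *value = 0;
    for (pos = 0; pos < 32; pos++) {
        *mask <<= 1;
        *value <<= 1;
        if (pat[pos] == '0' || pat[pos] == '1') {
            *mask |= 1u;
            *value |= (uint32_t)(pat[pos] - '0');
        }
    }
}

/* Number of set bits, i.e. how many bits a pattern pins down. */
static int popcount32(uint32_t v)
{
    int n = 0;
    for (; v; v &= v - 1)
        n++;
    return n;
}

/* Index of the matching pattern with the most fixed bits, or -1 if none. */
static int best_match(const char *const *pats, size_t count, uint32_t insn)
{
    int best = -1, best_bits = -1;
    size_t k;
    for (k = 0; k < count; k++) {
        uint32_t mask, value;
        compile_pattern(pats[k], &mask, &value);
        if ((insn & mask) == value && popcount32(mask) > best_bits) {
            best = (int)k;
            best_bits = popcount32(mask);
        }
    }
    return best;
}
```

The word 0x0C9F0000 satisfies both patterns and resolves to the first, more specific one, while 0x0C810000 (m = 1) only satisfies the catch-all.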

Prefix definitions

Prefix instructions may add a string to the mnemonic (like 'lock' or 'repe') or define extra bits for bitgroups (like the REX prefix). Bitmasks for prefix instructions end in a plus sign '+', meaning more instruction bytes will follow. Examples:

11110010+	repnz
0100wrxb+

If a prefix does not influence the mnemonic (like REX), then you can also specify longer instruction bitmasks including the prefix bits, but that would result in a bigger disassembler.

Additional argument bytes

An instruction may require additional bytes. The number of those bytes can be described by adding a '+num' suffix to the bitmask (where num is one of 1, 2, 4 or 8). If there are more arguments, you can define more bytes separated by the comma ',' character, like '+num,num...'. Their values can be referenced as 'a0'..'aN' and they are also sign extended just like 'i'. If you need to specify individual bits in those additional bytes (like for the ModRM or SIB byte in x86_64), simply use a longer bitmask. The script will generate appropriate C code for fixed length as well as for variable length encodings. In special cases, the last argument can be a bitgroup or a previous argument (with an optional asterisk '*' size specifier), describing a variable length of data. Like in '+1,1,a0', meaning "two more bytes where the first describes the number of bytes following". Examples:

iiiiiiii0111cccc	J<c>	<labeli1>
11000010+2	RET	<arg16>
mmrrrnnn00000001+4	ADD	<reg>, <arg32>
00000000000000000000000010101011+4,4,a1*8	lookupswitch	<a0>, <a1>
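
To illustrate how a suffix like '+4,4,a1*8' determines the length of an instruction, here is a hand-written sketch for the lookupswitch line. It assumes big-endian operands, as Java class files use, and the names `read_be32` and `suffix_len_4_4_a1x8` are invented; how the generated code actually reads these bytes is not shown in this README.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian value and interpret it as signed
   (additional argument bytes are sign extended, like 'i'). */
static int32_t read_be32(const uint8_t *p)
{
    uint32_t v = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                 ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    /* portable unsigned-to-signed conversion */
    return v >= 0x80000000u ? -(int32_t)(~v) - 1 : (int32_t)v;
}

/* Suffix "+4,4,a1*8": a0 is 4 bytes, a1 is 4 bytes, then a1*8 data bytes.
   `args` points at the first byte after the bytes covered by the bitmask.
   Returns the number of bytes the suffix accounts for, or 0 if a1 < 0. */
static size_t suffix_len_4_4_a1x8(const uint8_t *args, int32_t *a0, int32_t *a1)
{
    assert(args != NULL && a0 != NULL && a1 != NULL);
    *a0 = read_be32(args);
    *a1 = read_be32(args + 4);
    if (*a1 < 0)
        return 0;
    return 4 + 4 + (size_t)*a1 * 8;
}
```

With a0 = 16 and a1 = 2 the suffix covers 24 bytes, so the next instruction would start 24 bytes after the bytes matched by the bitmask.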

Example outputs

In normal mode (note that the first two columns, the address and the hex instruction, are printed by main.c and are not included in the string buffer returned by the library):

55981b2f44ac: d53800a7 mrs       x7, MPIDR_EL1
55981b2f44b0: 924004e7 and       x7, x7, #0x0
55981b2f44b4: b4000067 cbz       x7, 0x55981b2f44c0
55981b2f44b8: d503205f wfe
55981b2f44bc: 17ffffff b         0x55991b2f44b8
55981b2f44c0: 58004c41 ldr       x1, 0x55981b2f4e48
55981b2f44c4: d5384240 mrs       x0, CurrentEL
55981b2f44c8: f100101f subs      xsp, x0, #0x4

In code analysis mode:

558a823174ac: d53800a7 {
    "mask":"11010101001ppkkknnnnmmmmjjjttttt",
    "prefix":[],
    "instruction":[0xa7,0x00,0x38,0xd5],
    "count":1,
    "bitgroups":{"p":3, "k":0, "n":0, "m":0, "j":5, "t":7},
    "arguments":[],
    "prefixstr":"",
    "decoded":"mrs x7, MPIDR_EL1"
}
558a823174b0: 924004e7 {
    "mask":"s~~100100miiiiiijjjjjjnnnnnttttt",
    "prefix":[],
    "instruction":[0xe7,0x04,0x40,0x92],
    "count":1,
    "bitgroups":{"~":0, "s":1, "m":1, "i":0, "j":1, "n":7, "t":7},
    "arguments":[],
    "prefixstr":"",
    "decoded":"and x7, x7, #0x0"
}
558a823174b4: b4000067 {
    "mask":"s011010~iiiiiiiiiiiiiiiiiiittttt",
    "prefix":[],
    "instruction":[0x67,0x00,0x00,0xb4],
    "count":1,
    "bitgroups":{"~":0, "s":1, "i":3, "t":7},
    "arguments":[],
    "prefixstr":"",
    "decoded":"cbz x7, 0x558a823174c0"
}
558a823174b8: d503205f {
    "mask":"1101010100000011001000~0~~~11111",
    "prefix":[],
    "instruction":[0x5f,0x20,0x03,0xd5],
    "count":1,
    "bitgroups":{"~":2},
    "arguments":[],
    "prefixstr":"",
    "decoded":"wfe "
}
558a823174bc: 17ffffff {
    "mask":"~00101iiiiiiiiiiiiiiiiiiiiiiiiii",
    "prefix":[],
    "instruction":[0xff,0xff,0xff,0x17],
    "count":1,
    "bitgroups":{"~":0, "i":-1},
    "arguments":[],
    "prefixstr":"",
    "decoded":"b 0x558b823174b8"
}

Fields of the JSON string:

  • mask: string, bitmask that matched the instruction
  • prefix: byte array, prefix bytes (if any)
  • instruction: byte array, instruction bytes
  • count: int, number of instructions decoded (can be bigger than one for nops)
  • bitgroups: int struct, decoded values for each bitgroup in the instruction
  • arguments: two dimensional byte array (if any)
  • prefixstr: string, decoded prefix string (if any, same as in normal mode)
  • decoded: string, decoded instruction string (same as in normal mode)

Known issues

For AArch64, system registers of EL3 are not decoded, just shown in their canonical form (like "S3_6_4_0_1"). If somebody really needs them, they have to add those names to the "disasm_sysreg()" function in aarch64.txt line 98. Be warned, thanks to the ARM designers, that function is a real switch-case maze!

License

Universal Disassembler is Free and Open Source Software. The generated Universal Disassembler Function Library ANSI C headers are licensed under the terms of the permissive MIT license, meaning you can include the disassembler in all kinds of projects including proprietary commercial software.

    Copyright (C) 2017 bzt (bztsrc@gitlab)

    Permission is hereby granted, free of charge, to any person
    obtaining a copy of this software and associated documentation
    files (the "Software"), to deal in the Software without
    restriction, including without limitation the rights to use, copy,
    modify, merge, publish, distribute, sublicense, and/or sell copies
    of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be
    included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
    EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
    MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
    NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
    HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
    WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
    DEALINGS IN THE SOFTWARE.

The txt2h.php converter script and the instruction table text files on the other hand are licensed under the terms of GPL, meaning they are copyleft. If you make modifications to them, you are obliged to share the modified sources with the Open Software community.

    Copyright (C) 2017 bzt (bztsrc@gitlab)

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

The ANSI C header files are outputs of the script, therefore they are not derivatives of the text files or the script itself. As such copyleft does not apply, which allows more permissive licensing terms on the generated C code.