Featured image by Harrison Broadbent @ Unsplash https://unsplash.com/photos/c3YpscwJb04


A long time ago, I wrote about dumping manually an SLC NAND Flash device. I happened to get a firmware with lots of Lua scripts.

At first glance, I though it would be no big deal as Lua is a scripting language and thus readable with a simple text editor.

Turns out lua can be compiled as bytecode to improve performances (as it is often used on embedded systems, that is a major choice criteria) and maybe to obfuscate the code.

One such bytecode compiler and interpreter is LuaJIT.

Note:

Do not continue here if you're just interested in LuaJIT decompilation for your own usage. This article is for understanding purposes. Some decompilers are available here and here. And yes, everything is quite old. LuaJIT hasn't been updated since 2017.

Understanding the structure

A detailed documentation exists about the instructions constituting the bytecode : http://wiki.luajit.org/Bytecode-2.0

But some information are missing: how the functions are defined and called, how functions are distributed in the binary file and more...

To get an answer to these, I had no other choice to get my hands dirty. I used 2 methods to understand how it works: reading the source code and disassembling sample files.

Unfortunately, the built-in disassembler is minimal and not enough in my opinion. So I wrote mine in Python to get a better understanding. Note that it is compatible with the last "stable" version (2.0.5) and not the beta ones.

You'll find it here: https://gist.github.com/MickaelWalter/4b130d36040844abcb71bf69fe8d6fd4

LuaJIT's bytecode seems at first glance quite similar to assembly. They are move instructions, call instructions, jumps, arithmetic, etc... Also, they are different operand types for each instruction like direct integer value, constant integer table reference (for large numbers), constant string table reference, etc...

But the memory structure is drastically different. There are no memory address, only variable slots. Each slot can hold a variable of any type: string, int, boolean, array, etc...

The first slots are used for function arguments. A header is embedded within each function block (called proto in LuaJIT data structure) and describes the numbers of arguments, the frame size (number of total slots) among other information.

Below you will find the detailed bytecode file structure as described by the documentation. Unfortunately, this is poorly documented and the detailed information for each field are only available in the source code:

dump   = header proto+ 0U
header = ESC 'L' 'J' versionB flagsU [namelenU nameB*]
proto  = lengthU pdata
pdata  = phead bcinsW* uvdataH* kgc* knum* [debugB*]
phead  = flagsB numparamsB framesizeB numuvB numkgcU numknU numbcU
         [debuglenU [firstlineU numlineU]]
kgc    = kgctypeU { ktab | (loU hiU) | (rloU rhiU iloU ihiU) | strB* }
knum   = intU0 | (loU1 hiU)
ktab   = narrayU nhashU karray* khash*
karray = ktabk
khash  = ktabk ktabk
ktabk  = ktabtypeU { intU | (loU hiU) | strB* }

B = 8 bit, H = 16 bit, W = 32 bit, U = ULEB128 of W, U0/U1 = ULEB128 of W+1

As I said, understanding each of these structures requires diving into the source code. A good starting point is in src/lj_bcdump.h. I documented as many objects as I could from the source code in my disassembler. Check the comments.

Some (most important) pieces:

  • The file begins with the magic bytes <ESC>LJ
  • The protos (functions) are laid out one after the other in the file. The files ends with a null byte
  • The first proto in the file is the most nested function. If function A is the top level one that calls functions B and C, and if C calls D, the structure will be the following: DBCA
  • The closures (functions defined in other functions and allowed to access the local values of top function) have upvalues number defined in the proto header. Special bytecode instructions allow to access upvalues,
  • Each proto can have an array of constants, there is a distinction between numeric and other values,
  • As an instruction is 32-bit wide, large numbers (more than 16 bits) must be stored in the constant table,
  • ULEB-128 is the standard format for storing integers
  • ktab is used to store hashmaps (when using my_tab['example'] for example)

Memory layout

Understanding the concept of slots is key to read LuaJIT disassembly. They is a variable number of them and this is defined by the numparams and framesize parameters in the header.

Note:

Some functions can have a variable number of parameters (see vargargs in the lua documentation). These will not be covered here.

Usually, parameters are not overwritten with local variables. Also, a slot can contain a totally different variable throughout the execution so they are not an indicator of the number of local variables. But they can reflect the number of variables used at a same time.

To sum up, the memory layout of a proto is defined as follow (n is the number of parameters, m is the frame size):

  • The first n slots (starting at 0) contain the parameters,
  • The m-n following slots are used to temporarily store data (like registers)
  • The uvdata table (located at the end of the proto) references accessible upvalues
  • The kgc table (located at the end of the proto) contains constants
  • The knum table (located at the end of the proto) contains numeric constants

Let's say we have the following lua code (from the lua book):

-- defines a factorial function
function fact (n)
  if n == 0 then
    return 1
  else
    return n * fact(n-1)
  end
end

print("enter a number:")
a = io.read("*number")        -- read a number
print(fact(a))
A basic recursive factorial function

We have 2 protos (the "main" and the fact functions, there is no proto for C or built-in functions). Proto 1 is fact (the most nested function). The number of slots for fact will depend on the frame size chosen by the interpreter. In this case (with luajit-2.0.5 on x86-64), I have a frame size of 3. The fact function has 2 slots usable for local value handling (3 - 1 parameter).

The remaining slots can be considered as similar to general purpose registers in assembly. Their only particularity is that they are in a variable number depending on what the interpreter decided to set (it tries to minimize that value though).

In fact numerical constants that can not be stored on 16 bits (signed or unsigned) are loaded from a numeric constant table. But the table cannot be referenced by anything else than a move operation (there are several instructions to move data). The constants have to be loaded in a slot first. This fact is also valid for all the data stored in a longer term memory (dynamic or constant).

So, when doing the comparison n == 65536 or returning -33842 we need slots to hold the value the time necessary to execute the statement. The slot can then be referenced by the bytecode instruction.

The most interesting part of the code above is return n * fact(n-1). In this line a lot is going on:

  • First, the expression n - 1 is evaluated and requires a slot to hold the 1 constant and the result of the operation
  • Then a slot is required to hold the string fact (because proto call instructions reference a hashmap, called the meta table, containing the actual reference to the function). This slot then stores the return value of the function
  • At last, one of the two slots will contain the result of the operation that will be passed to the return statement

Note:

To call a method on an object, the behaviour is similar except the hashmap is contained by the object itself and only contains closures to allow references to the member variables.

Local variables are stored in a hashmap created by the interpreter. Their names are kept in the binary file to reference the values in the hashmap.

In the variable_test function listed below, a hashmap containing values with the keys b, c and d is created. Only the a parameter is stored in slot 0 throughout the execution of the function:

function variable_test (a)
  b = 5
  c = 4
  if a == 5 then
    d = 3 * c + a
    return d
  else
    return b * a
  end
end

print(variable_test(10))

Reading bytecode

Now that we know the basics about the memory layout, we can dive into actual bytecode.

If we take the disassembled version of the fact code above, we have the following listing (taken from the output of my own disassembler):

Bytecode version: 1
Flags: F_STRIP 
------------ Proto 1 -------------
---- Proto ----
Flags: 
Number of parameters: 1
Frame size: 3
Number of upvalues: 0
Number of collectable constants: 1
Number of numeric constants: 2
Number of bytecode instructions: 11
Bytecode dump: 
N	OP	A	B/D	C	Comment
001	ISNEN	0	0x0000		; JMP if {A}!=(NUM)D
002	JMP	1	0x8003		; Jump
	JMP => 6
003	KSHORT	1	0x0001		; {A} <= D (16-bit signed int)
004	RET1	1	0x0002		; return {A}
005	JMP	1	0x8005		; Jump
	JMP => 11
006	GGET	1	0x0000		; {A} <= _G[(STR)D] (see getfenv(1))
	(STR)D= "b'fact'"
007	SUBVN	2	0	1	; {A} <= {B} - (NUM)C
008	CALL	1	2	2	; {A},...,{A+B-2} <= {A}({A+1},...,{A+C-1})
009	MULVV	1	0	1	; {A} <= {B} * {C}
010	RET1	1	0x0002		; return {A}
011	RET0	0	0x0001		; return
KGC Vars: 
000	Type: KGC_STR	fact
KNum Vars: 
000	1
001	0


------------ Proto 2 -------------
---- Proto ----
Flags: PROTO_CHILD PROTO_VARARG 
Number of parameters: 0
Frame size: 3
Number of upvalues: 0
Number of collectable constants: 8
Number of numeric constants: 0
Number of bytecode instructions: 16
Bytecode dump: 
N	OP	A	B/D	C	Comment
001	FNEW	0	0x0000		; {A} <= closure(proto(D))
002	GSET	0	0x0001		; _G[(STR)D] <= {A} (see getfenv(1))
	(STR)D= "b'fact'"
003	GGET	0	0x0002		; {A} <= _G[(STR)D] (see getfenv(1))
	(STR)D= "b'print'"
004	KSTR	1	0x0003		; {A} <= (STR)D
	(STR)D= "b'enter a number:'"
005	CALL	0	1	2	; {A},...,{A+B-2} <= {A}({A+1},...,{A+C-1})
006	GGET	0	0x0004		; {A} <= _G[(STR)D] (see getfenv(1))
	(STR)D= "b'io'"
007	TGETS	0	0	5	; {A} <= {B}[(STR)C]
	(STR)C= "b'read'"
008	KSTR	1	0x0006		; {A} <= (STR)D
	(STR)D= "b'*number'"
009	CALL	0	2	2	; {A},...,{A+B-2} <= {A}({A+1},...,{A+C-1})
010	GSET	0	0x0007		; _G[(STR)D] <= {A} (see getfenv(1))
	(STR)D= "b'a'"
011	GGET	0	0x0002		; {A} <= _G[(STR)D] (see getfenv(1))
	(STR)D= "b'print'"
012	GGET	1	0x0001		; {A} <= _G[(STR)D] (see getfenv(1))
	(STR)D= "b'fact'"
013	GGET	2	0x0007		; {A} <= _G[(STR)D] (see getfenv(1))
	(STR)D= "b'a'"
014	CALL	1	0	2	; {A},...,{A+B-2} <= {A}({A+1},...,{A+C-1})
015	CALLM	0	1	0	; {A},...,{A+B-2}<={A}({A+1},...,{A+C+MULTRES})
016	RET0	0	0x0001		; return
KGC Vars: 
000	Type: KGC_CHILD	
001	Type: KGC_STR	fact
002	Type: KGC_STR	print
003	Type: KGC_STR	enter a number:
004	Type: KGC_STR	io
005	Type: KGC_STR	read
006	Type: KGC_STR	*number
007	Type: KGC_STR	a

At first, this looks quite intimidating but there is not so much going on.

Like we saw before, the most nested proto is stored first so we have to go to the last one to identify the entry point (remember that in lua all the code is intended to be called from a binary software).

First of all, the flags from the proto 2 indicate that it has child functions (actually the fact function) and takes a variable number of arguments.

We can also look at the end of the proto listing to look for constants. In particular, strings are always useful in reverse engineering.

Instructions work in 2 modes that depend of the instruction definition (a same instruction works only in 1 mode). So we have AD mode and BC mode instructions:

  • AD mode instructions have 2 arguments: A (8 bits) and D (16 bits)
  • BC mode instructions have 3 arguments: A, B and C (each 8 bits)

If we look the first instructions we can identify the definition of the fact function and its inclusion in the meta table:

N	OP	A	B/D	C	Comment
001	FNEW	0	0x0000		; {A} <= closure(proto(D))
002	GSET	0	0x0001		; _G[(STR)D] <= {A}
	(STR)D= "b'fact'"

In the code above a closure is created based on the proto embedded in the file and is stored in slot 0. Then the content of the slot 0 is placed in the meta table at the index defined by the second row of the constants table (value that happens to be our function name).

Calling the print function works quite similar:

003	GGET	0	0x0002		; {A} <= _G[(STR)D]
	(STR)D= "b'print'"
004	KSTR	1	0x0003		; {A} <= (STR)D
	(STR)D= "b'enter a number:'"
005	CALL	0	1	2	; {A},...,{A+B-2} <= {A}({A+1},...,{A+C-1})

First we load the closure from the environment by using the constant print from the constants table. The closure is stored in slot 0. Then the string enter a number: is loaded from the constants table and stored in slot 1. Finally, the closure is called.

Here the method to pass the arguments to the callee is interesting. We have no stack, so the values must be passed by something similar to a register call in assembly.

The parameters of the call instruction define the frame of slots for the arguments but also for the return values (in lua several values can be returned).

So the frame is defined as follow:

  • A is 0 and B is 1 so the return values will not be stored (because A+B-2 < 0)
  • Before being overriden by the return value, the A slot contains the closure called
  • A is 0 and C is 2 so the only parameter will be stored in slot 1 because A + 1 = 1 and A + C - 1 = 1

We can conclude that in LuaJIT, it is possible to determine the number of arguments easily by reading the function calls even if the function is external to the file.

Another interesting part is this one:

006	GGET	0	0x0004		; {A} <= _G[(STR)D]
	(STR)D= "b'io'"
007	TGETS	0	0	5	; {A} <= {B}[(STR)C]
	(STR)C= "b'read'"
008	KSTR	1	0x0006		; {A} <= (STR)D
	(STR)D= "b'*number'"
009	CALL	0	2	2	; {A},...,{A+B-2} <= {A}({A+1},...,{A+C-1})
010	GSET	0	0x0007		; _G[(STR)D] <= {A}
	(STR)D= "b'a'"

Like before, we are facing a call. But this time the call is made on an object. Let's see what is going on.

First, we get the io instance via the environment and store it in slot 0. Then we treat it as a hashmap to load the value contained at io['read'] in slot 0. Then we store the constant string *number in slot 1. Eventually, we call the closure contained in slot 0 with the string passed as argument.

This is what happens under the hood when io.read("*number") is called.

Now, when we read something from the standard input, we want to store it somewhere. Right?

This happens when GSET is executed: the return value stored in slot 0 is copied in the meta table at the entry "a". This is also interesting because that implies that original local variable names are preserved in the bytecode.

The remaining instructions work quite similarly. Just pay attention to the CALLM instruction which is nothing more than an optimisation for functions called successively, the return value of one being passed as an argument to the other.

There is last one thing we have to discuss about the bytecode: branches.

Like in assembly, loops and conditions are translated into jumps. They are specialized instructions for loops (still for optimisation) but they won't be discussed here.

As we have no clue about instruction addresses, jump instructions only work with an offset relative to the jump instruction itself.

N	OP	A	B/D	C	Comment
001	ISNEN	0	0x0000		; JMP if {A}!=(NUM)D
002	JMP	1	0x8003		; Jump
	JMP => 6
003	KSHORT	1	0x0001		; {A} <= D (16-bit signed int)
004	RET1	1	0x0002		; return {A}
005	JMP	1	0x8005		; Jump
	JMP => 11
006	GGET	1	0x0000		; {A} <= _G[(STR)D]
	(STR)D= "b'fact'"
007	SUBVN	2	0	1	; {A} <= {B} - (NUM)C
008	CALL	1	2	2	; {A},...,{A+B-2} <= {A}({A+1},...,{A+C-1})
009	MULVV	1	0	1	; {A} <= {B} * {C}
010	RET1	1	0x0002		; return {A}
011	RET0	0	0x0001		; return

The first instruction is a equality comparison. All comparisons must be followed by a jump instruction. The comparison instructions will take the decision to jump or not (there are no flags accessible from the bytecode).

So here, we compare the content of slot 0 to the number passed as an argument, which is 0. And we jump to an offset relative to the instruction next to the jump instruction. To get the relative offset from the instruction argument D, it is necessary to substract 0x8000 from it. An offset less than 0x8000 will result in a jump to previous instructions.

So the JMP instruction at 002 jumps 3 instructions starting at 003. We get to instruction 006 if the parameter given to the function does not equal 0.

The RET0 and RET1 instructions are shortcuts for the RET instruction respectively with no return value and one return value. The RET instruction allows to return an arbitrary number of values.

Note that in this context, the RET0 instruction is never executed (no execution flow goes to it). This is a default return statement added by the interpreter.

Conclusion

LuaJIT's bytecode is an intermediate language to optimise the execution of embedded lua code by avoiding the otherwise necessary language decomposition into understandable instructions for the lua interpreter.

The tool works both on textual lua code (thus making a real JIT compilation) and its own bytecode language.

The reverse engineering process of this code is rather straightforward as local variable names and function names are recoverable although parameter names are not.

I only saw it in the field once. But I found it interesting and simple enough to get a second foot in the bytecode world after reversing Python's bytecode. Both of them have completely different approaches.