Wednesday, January 23, 2008

Static Analysis of MSP430 Firmware in Perl

by Travis Goodspeed <travis at utk.edu>
at the Extreme Measurement Communications Center
of the Oak Ridge National Laboratory

That which follows is an adaption of notes which I made during the course of writing msp430static, a sort of poor man's IDA Pro for static analysis of MSP430 firmware without source code.

What Functions Look Like

A call to strcpy, such as the one which follows, is accomplished by populating r15 with the destination address and r14 with the source address, then calling the function at its hex address. In the following example, foo is the target (r15) and babe is the source (r14). See my article on IAR's MSP430 calling conventions for references to calling convention documentation for various compilers, as each compiler seems to do something different on this platform.
strcpy(foo,babe);                                                                                                                                                     
1154: 3e 40 7a 02 mov #634, r14 ;#0x027a
1158: 0f 44 mov r4, r15 ;
115a: b0 12 a4 11 call #4516 ;#0x11a4

In the unstripped binary, we'll find the code for strcpy at the address (0x11a4) called above:
000011a4 <strcpy>:
11a4: 0d 4f mov r15, r13 ;
11a6: 0c 4f mov r15, r12 ;
11a8: 6f 4e mov.b @r14, r15 ;
11aa: cd 4f 00 00 mov.b r15, 0(r13) ;
11ae: 4f 93 cmp.b #0, r15 ;r3 As==00
11b0: 07 24 jz $+16 ;abs 0x11c0
11b2: 1e 53 inc r14 ;
11b4: 1d 53 inc r13 ;
11b6: 6f 4e mov.b @r14, r15 ;
11b8: cd 4f 00 00 mov.b r15, 0(r13) ;
11bc: 4f 93 cmp.b #0, r15 ;r3 As==00
11be: f9 23 jnz $-12 ;abs 0x11b2
11c0: 0f 4c mov r12, r15 ;
11c2: 30 41 ret

The stripped binary has the function at the same address, but has no function label. In fact, there isn't even a note (in msp430-objdump) that the address is the beginning of a function.
   11a4:       0d 4f           mov     r15,    r13     ;                                                                                                               
11a6: 0c 4f mov r15, r12 ;
11a8: 6f 4e mov.b @r14, r15 ;
11aa: cd 4f 00 00 mov.b r15, 0(r13) ;
11ae: 4f 93 cmp.b #0, r15 ;r3 As==00
11b0: 07 24 jz $+16 ;abs 0x11c0
11b2: 1e 53 inc r14 ;
11b4: 1d 53 inc r13 ;
11b6: 6f 4e mov.b @r14, r15 ;
11b8: cd 4f 00 00 mov.b r15, 0(r13) ;
11bc: 4f 93 cmp.b #0, r15 ;r3 As==00
11be: f9 23 jnz $-12 ;abs 0x11b2
11c0: 0f 4c mov r12, r15 ;
11c2: 30 41 ret
It's easy enough to detect the presence of this code in a stripped executable by looking for "mov r15, r12" or "0x0c 0xf4" and comparing the bytes that follow. I can't stress enough the importance of endian-awareness: the second column is composed of bytes, not words. As a word, "mov r15,r12" is 0xf40c. When in doubt, double-check yourself with the Single Line MSP430 Assembler.

Note that calling conventions vary considerably across the many MSP430 compilers and even among versions of the same compiler, depending upon optimization options and inlining. Don't expect all calls to look like this: check for yourself.

Before looking at a decompilation of the above, notice that a reasonably large string of bytes {6f 4e, cd 4f 00 00, 4f 92} appears twice. This duplicity might be removed by another optimizer, but it shows that something in the code is sufficiently intrinsic to the function to appear twice in one function. Perhaps it will remain consistent across compilers? In point of fact, this expanse of code copies a byte from the address contained within r14 to the address contained within r13. The final word compares the byte that was copied to zero. In the first usage, the function jumps to the end in the event that the comparison is zero. In the second usage, which follows the incrementing of both r14 and r15, the jump is backward if the comparison is not zero. A rough approximation in psuedo-C follows
char* strcpy(char* dest, char* src){
a=dest; //mov r15, r13
b=dest; //mov r15, r12

c=*src; //mov.b @r14, r15
*a=c; //mov.b r15, 0(r13)
if(c==0) //cmp.b #0, r15
goto ret; //jz $+16

do{
src++; //inc r14
a++; //inc r13
c=*src; //mov.b @r14, r15
*a=c; //mov.b r15, 0(r13)
}while(c!=0) //cmp.b #0, r15
// jnz #-12

ret:
return b; //mov r12, r15
} //ret
In the decompilation, I refer to r15 as both dest and c, as its purpose changes completely. Variables are passed as dest=r15 and src=r14, as GCC allocates parameters in the order r15, r14, r13, r12. The result, for strcpy() the destination address, is returned in r15.

It is apparent that this could be written a bit more compactly by merging the first and second stanzas. Also, the use of the indirect auto-increment addressing mode (As=11, of the form @Rn+) could eliminate the "inc r14" line. The instructions might also be reordered, and any number of register combinations might be used to hold intermediate values. It's not possible to detect all the ways in which strcpy() might be implemented, but it shouldn't be too difficult to detect the different ways in which it will be implemented. After all, it's far easier to fix an overflow vulnerability than to hide it; is it not?

Testing my theory, I disassembled the same program, this time compiled with IAR's compiler (V4.09A/W32). Grepping for cmp.b yielded a single line, at address 0xF86E, in the codeblock which follows.
   f864:       0f 4c           mov     r12,    r15     ;                                                                                                               
f866: 0e 4c mov r12, r14 ;
f868: 1c 53 inc r12 ;
f86a: fe 4d 00 00 mov.b @r13+, 0(r14) ;
f86e: ce 93 00 00 cmp.b #0, 0(r14) ;r3 As==00
f872: f9 23 jnz $-12 ;abs 0xf866
f874: 0c 4f mov r15, r12 ;
f876: 30 41 ret
Those that have read my article on the register usage of IAR will note that the ABI is different in the code sample above. IAR fixed the register allocation order in October of 2007, and it now allocates registers in the order r12, r13, r14, r15.

This code is heavily--but imperfectly--optimized, so it's a bit difficult to decompile by hand. It all becomes clear when you realize that r12 is post-incremented and the original value is loaded into r14, the destination address for each character. Unlike GCC, the indirection post-increment addressing mode is used, but on the very next line we see that this necessitates another RAM access! Perhaps the cache will take care of it, but this means that IAR makes three memory accesses--one write and two reads--for every two that GCC makes. I'd recommend hand optimization for this function, if my stronger recommendation wasn't to scrap it as a troublemaker.

The decompiled code follows,
char* strcpy(char* dest, char* src){
char *a,
*b=dest; //mov r12, r15
do{
a=dest; //mov r12, r14
dest++; //inc r12
*(a++)=*src; //mov.b @r13+, 0(r14)
}while(0!=*src); //cmp #0, 0(r14)
//jnz $-12
return b; //mov r15, r12
//ret
}
I'd be willing to bet that the original is quite a bit denser in C, but this ought to be easy enough to understand.

So how do we do a generalized search for this, one which will recognize most implementations by most compilers? I propose a pattern that looks for the following:
  1. The use of two registers, source pointer SRC and destination pointer DEST.
  2. A mov.b instruction with SRC as the source. (Call the destination FOO)
  3. A mov.b instruction with DEST as the destination. (Call the source BAR.)
  4. A cmp.b instruction involving the immediate zero and the register SRC, DEST, FOO, or BAR.
I don't demand an inc for SRC as it might be auto-incremented (@Rn+), and I don't demand one for DEST as it might be copied from another variable in an IAR post-increment (cvar++).

It's possible to add more rules which describe the preceding examples. For example, both of these examples move their first parameter to a temporary register and, later, move it back. Both follow the cmp.b with a jnz. I advise against making any matching pattern too strict, as it'll result in false negatives. Keeping things loose might result in false positives, but those false positives will be fertile ground for exploits of their own, even if they aren't strcpy().

It's also worth noting that a ruleset that's complex is easy to sneak by, either intentionally or accidentally. Suppose this pattern were modified to exclude strncpy(). The following strcpy() implementation would skate by, undetected.
char *strcpy(char *dest, char *src){
return strncpy(dest,src,0x1000);
}
By keeping rules loose--but perhaps prioritized--it's easy to catch such actions. After all, what byte-wise copying until reaching zero is not suspicious?

Recognizing Functions from Perl

Now that the hand analysis is complete, it's time to bring perl into the mix. Instructions are recognized as one of two types: code and IVT entries. I ignore the .data section for now, but a little tweaking of the regular expressions would make it match. I make the assumption that every function begins after a 'ret' and ends with a 'ret'. This isn't strictly true, but it suffices for this article and ought only to miss the first function in memory, assuming everything is built with C.

The first step is to recognize individual lines. I used the following regular expressions in an early revision:
Match an instruction:
# 11b6: 6f 4e mov.b @r14, r15 ;
# 11b8: cd 4f 00 00 mov.b r15, 0(r13) ;
# 1111: 22222222222 33333 44444444444444 555555
/\s(....):\s(.. .. .. ..)\s\t([^ \t]{2,5})\s(.*);?(.*)/
Match an IVT entry:
# fffe: 00 11 interrupt service routine at 0x1100
/[\s\t]*(....):[\s\t]*(.. ..)[\s\t]*interrupt service routine at 0x(....)/
Although I don't strictly need to parse so much detail to recognize strcpy(), it will be helpful when I add features.

Once lines are recognized, they are loaded into a list of strings, indexed by the integer (not hex-string) value of the first field. I make a list of strings, rather than objects, because most comparisons can be performed by regular expressions. This is fine for a 16-bit microcontroller, but might be prohibitively expensive for something larger.

Routines are recognized--as I've previously stated--by assuming that they reside between ret statements. This assumption makes things quite easy to implement, but results in the loss of the first function as well as the concatenation of functions--such as main()--which do not return. In the following example main [118E to 11A0] and strcpy [11A4 to 11C2] are combined into a single listing:
   118e:       31 40 00 0a     mov     #2560,  r1      ;#0x0a00
1192: 04 41 mov r1, r4 ;
1194: 92 43 00 02 mov #1, &0x0200 ;r3 As==01
1198: b0 12 40 11 call #4416 ;#0x1140
119c: b0 12 68 11 call #4456 ;#0x1168
11a0: 30 40 c4 11 br #0x11c4 ;
11a4: 0d 4f mov r15, r13 ;
11a6: 0c 4f mov r15, r12 ;
11a8: 6f 4e mov.b @r14, r15 ;
11aa: cd 4f 00 00 mov.b r15, 0(r13) ;
11ae: 4f 93 cmp.b #0, r15 ;r3 As==00
11b0: 07 24 jz $+16 ;abs 0x11c0
11b2: 1e 53 inc r14 ;
11b4: 1d 53 inc r13 ;
11b6: 6f 4e mov.b @r14, r15 ;
11b8: cd 4f 00 00 mov.b r15, 0(r13) ;
11bc: 4f 93 cmp.b #0, r15 ;r3 As==00
11be: f9 23 jnz $-12 ;abs 0x11b2
11c0: 0f 4c mov r12, r15 ;
11c2: 30 41 ret
This happens because main() returns not by "ret" but by branching to 0x11C4, which is __stop_progExec__ in the firmware being analyzed. An alternate method would be to look for call targets, assuming that 0x11A4 is the beginning of a function because some other instruction calls it.

By searching by call targets, my script correctly recognizes the second function of the preceding example, but it no longer recognizes main(), which in GCC is called by "BR #addr" and not "CALL #addr". A quick check on a small GCC program shows that absolute jumps are only used for main() and non-user functions. Thus, by looking for "CALL #addr" and "BR #addr", it is possible to find the entry points of most if not all functions.

Once functions can be been identified, it isn't very difficult to add an output mode for Graphviz. The following image is a call tree in which main() calls two functions which call strcpy(). Dangerous functions and calls are labeled in red. The two islands on the right--which prevent this from being a Tree in the graph theory sense--exist in assembly as infinite loops.

Further, it's also useful to produce memory maps which detail memory usage. These can be produces from the database by dumping to a graphics programming language. My first revision published to LaTeX/PSTricks. This looks beautiful, but rendering everything as vector art quickly makes a complex memory map unmanageable. My solution was a rewrite that prints raw postscript. Both are shown below.

Conclusion

I've named the tool msp430static, and I intend to publish a revision as soon as I clean up the code. It's a decent hack at this point, but a hack isn't maintainable and I shudder to think at how I'll comprehend these few hundred lines of perl in three months' time without mush revision.

My redesign will feature an SQL backend, such that the input file needn't be reparsed for each minor revision. This will also allow for scripting in languages other than perl. A single command will stock a database of a defined schema, a second will analyze the database, and others will produce output or analysis. I intend to do most analysis in a self-contained perl script, but clients may be written in a variety of languages as appropriate. I'm undecided as to whether I'll make the tool architecture-agnostic in this revision. It's possible, but perhaps that's more appropriate for a later revision. Potential clients include a modified msp430simu and a GTK# GUI.

I don't intend to make a public release of the present version, but I'll send individual copies by email upon request.

Sunday, January 13, 2008

MSP430simu and LaTeX, part 2

by Travis Goodspeed <travis at utk.edu>
at the Extreme Measurement Communications Center
of the Oak Ridge National Laboratory

After writing my previous article, Tracing with MSP430simu, LaTeX, and PowerPoint, I found that I had left a few things unsaid. I'll cover them here, refraining from reiterating the basics.

addr2line


An essential part of any debugger that will be used by mere mortals--which includes a reverse engineer when working on anything other than his pet project--is the ability to print function names instead of hexadecimal addresses. In unix, this is accomplished by addr2line, which may be called thusly:
karen% msp430-addr2line -e overflow.elf 01182 -f -s
myfn
mysource.c:15
karen%
This tells me that 0x1181 is a machine-language instruction from line 15 of mysource.c, which is somewhere within the function myfn(). In my franken-python, I grab function name and file/line number with
    def addr2line(self,addr):
import os
app = os.popen("msp430-addr2line -e overflow.elf 0x%04X -f -s " % addr, "r");
text = app.read();
app.close();
return text;
Note that I hardwired the executable filename, application, and platform. This is very bad practice, but as I warned in my first article, this is a quick hack to generate conference slides. Use something better in your own implementation. (Even if you just need to generate conference slides of your own.)

stack traces


Printing a stack trace is just as easy, if we make the--perhaps incorrect--assumption that a stack trace is just a list of pointers that begins at the top of RAM and grows downward until the address contained within the SP (Stack Pointer) register. By use of addr2line(), it isn't difficult to get a human-readable stack trace.

def stacktrace(self,sp):
s=int(sp);
trace='';
if int(sp)>0x200: #make sure it's in ram
for l in range(0xA00, s-2, -2): #scaled from sp to top of ram
fn=self.addr2line(self.getint(l));
trace+=("0x%04X %s" % (l, fn));
trace+="SP %s" % self.addr2line(int(self.PC));
return trace;


This code is supposed to count each even line in the range [0xA00,s], printing each address whose contents is a function name. I'm unfamiliar with Python.

_reset_vector__


The reset-vector loads RAM with values from ROM. This is essential in a real system, as you'll want an application to run again after RAM has cleared, but it's terribly inconvenient when trying to view an execution trace. The first several frames will be nothing but globals being initialized. For this reason, I added a simple if() statement that refuses to print a frame in batch mode if -1==str.find(fn,"reset_vector").

By searching for _vector(), it's possible to drop all vector handlers, though I've only encountered _reset_vector__ in this project.

Stack Variables


The above is all well and good if the stack contains only points of execution; however, my presentation required that stack variables be shown. Good heavens, how can that be done? Trying addr2line on 0x200 gives:
karen% msp430-addr2line -e simu/overflow.elf 0x202 -f -s
__data_start
??:0
karen%
Thus, any entry that's a pointer to a global variable will give __data_start as its function name, even though it can't supply a line number. I get more luck with
karen% nice msp430-objdump -g simu/overflow.elf | grep 0x202
char foo[12]:uint16 /* 0x202 */;
karen%


As for the question of what data-type is on the stack, I have yet to come up with an adequate solution. I could run objdump into a database, but I'm still left the problem of determining whether the 0x0202 on the stack is a pointer to foo[], an integer, two characters, etc. Expect a third installment of my msp430simu series detailing a solution, but with my slides due in twenty-two hours and my coffee-tin empty, I'll have to overlook it for this draft.

(I'll likely solve this by watching for PUSH and POP instructions. This would let me see the difference between a local variable and a function call.)

As mentioned in my prior article, the presentation has to create a new frame whenever a watched variable is changed. I'm trying to demonstrate a stack overflow, so it's essential that a presentation frame be generated when the stack changes. As such, my actual demonstration prints more than that which is presented here. I print the function name as "RAM" for anything less than the SP, "STACK" for anything greater than that but less than 0x0A00.

A sample stack dump slide block--prior to formatting--follows:
stackdump screenshot

Hacking the Debugger!


At some point, I made a modification that conflicts with the debugging framework that's included with msp430simu. Rather than try to reconcile the changes, I just discontinued use of it. Note that TEST() and END_TEST must still be called in main() to keep the function from dying.

What could be cooler, in a hacking demo, than to have the hack mess with the debugger from inside of a simulation? Looking at test_puts(), you'll find a while-loop that copies a string to TEST_TEXTOUT. To call it from assembly, just load the first character's address in R15 and jump to the address of test_puts, which is 0x1140 in my present revision but likely won't remain that for long. In machine language, this can be accomplished in no more than eight bytes: four to load the string's address in R15, and four to jump to the function. (Assuming, of course, that the fixed addresses are known.) Other hacks are certainly possible. Try disassembling some functions with msp430-objdump foo.elf -d | less to see what you can come up with.

Commentary


Example slide
As depicted above, my slides now properly render, showing most of what's needed from the machine. Only two items remain: commentary and section titles. These two features might have been implemented by a better stack analyzer, but my solution was to specify the slide section through a watched variable. A 16-bit integer is set at the entrance to a function which I'd like to watch, which is the index of a string in my perl script. Commentary is then loaded by calling \input{} in LaTeX on the appropriate include file, which contains anything I would like to be in the left box.

Conclusion


A snazzy draft which simulates a stack overflow attack is available as msp430simu_tidc08.pdf. As with all of my presentations, it makes little sense without spoken commentary, but it's a good example of what can be done with a little bit of work and a CPU simulator.

Wednesday, January 2, 2008

Tracing with MSP430simu, LaTeX, and PowerPoint

by Travis Goodspeed <travis at utk.edu>
at the Extreme Measurement Communications Center
of the Oak Ridge National Laboratory

I have need of a decent MSP430 simulator, and I can't seem to find any documentation for msp430simu, which is an auxiliary part of the MSPGCC project. What follows are some unorganized observations that I've made regarding the code and my use of it to render LaTeX for my upcoming presentation at TIDC '08. Forgive me if it isn't terribly coherent or well organized: It's better than nothing and it ought to save you a lot of time if you find yourself in the same position as I find myself.

The simulator runs code targeted toward a msp430x135, built with mspgcc. IAR could likely be used as well, see TI EZ430 in Linux with IAR Kickstart for details.

Caveat lector--I had no prior experience with Python, and I wrote this code as a quick hack to generate my slides, not as something to release or maintain.

Be sure to download and review the msp430simu code. This article will make little sense without it. I expect my readers to get their hands dirty!
cvs -d:pserver:anonymous@mspgcc.cvs.sourceforge.net:/cvsroot/mspgcc login
cvs -z3 -d:pserver:anonymous@mspgcc.cvs.sourceforge.net:/cvsroot/mspgcc co msp430simu


core.disassemble


After grabbing the source, the Makefile was sufficiently self-explanatory to get an example project up and running. I wished to demonstrate a string copy by dumping the result to LaTeX, so I threw in a dump to disassemble the code and spit it to a file. This worked well, except that my code would behave unpredictably. Adding a single branch would change the execution time from 23 ticks to a timeout after a few thousand ticks!

My mistake was in forgetting to copy PC before calling core.disassemble, which--God only know why--advances the PC to the next address. Thus, whenever I disassembled an address, I'd accidentally advance the PC twice after executing a single instruction!

The repaired code, cited below, copies PC and decodes the copy. This may be called without damaging the execution, and does not alter the simulation's results.
    def printme(self,step,FILE):
#Decode a copy of the pc, so the PC itself isn't advanced.
pc=core.PC(self,int(self.PC));
name, args, execfu, cycles=self.disassemble(pc);

#FILE I/O goes here


RAM


Variables within RAM can be read between instruction executions by calling core.memory.get(). By calling this between instructions, it's possible to watch variables. In my case, rather than stepping through hundreds of slides to get to the fun stuff, I can instead only print a slide when the watched variables change. How cool is that?

The only difficult part here is that you have to know the address of the object you wish to view. Luckily, with optimizations disabled, GCC begins them at the start of ram--0x0200 for the msp430x135. Just as in the heap of an architecture with memory to spare for malloc(), global variables begin at the bottom and grow upward while the stack begins at the top of RAM and grows downward. My globals follow:
int r=0xbeef;
char *foo="Hello world.";
const char *bar="Hey.";


Of course, I'm liable to screw this up if I predict the compiler's actions, so I double-check variable addresses with gdb. In the case of int r=0xBEEF being the first global, I find that:
(gdb) x/h 0x200
0x200 <_r>: 0xbeef
(gdb)


Note that the default value--0xBEEF in this instance--does not exist during the beginning of actual execution of the program. (An early revision of this article erroneously stated that this was never set. This mistake was a result of a bug in my code.) Value--as all of RAM--is initialized to 0x0000. It is only loaded to its specified value during the resetvector function, which is generated by the compiler.

Following _r are two strings. Rather, two pointers to strings, which belong to the two global strings that I instantiated:
(gdb) x/xh 0x200
0x200 <_r>: 0xbeef
(gdb)
0x202 <foo>: 0x1170
(gdb)
0x204 <bar>: 0x117d
(gdb) x/s 0x1170
0x1170 <test_puts+48>: "Hello world."
(gdb) x/s 0x117d
0x117d <test_puts+61>: "Hey."
(gdb)


Note the common C fallacy that I accidentally committed. Not only my const char* but also my char* are RAM pointers to ROM strings. The values will be loaded by the resetvector, but it will make tracing more difficult when I add string value dumping later.

Thus, I change my C code to
#define bar "Hey."
int r=0xbeef;
char foo[]="Hello world.";


And I now get in GDB:
(gdb) x/xh 0x200
0x200 <_r>: 0xbeef
(gdb) x/s 0x202
0x202 <foo>: "Hello world."
(gdb)


Now the string foo exists in RAM at 0x202. That is to say that &(foo[0])==202; earlier, &foo==202. This is much easier to find by address when watching variables.

core.memory.get(addr, bytemode=0)



Grabbing an integer is easy, just make a function like the following:
    def getint(self, addr, bytemode=0):
return self.memory.get(addr, bytemode);


To grab a string--which in my examples is a character array rather than a pointer to a character--, read and concatenate a series of integers.
    def getstr(self, addr):
str='';
c=1;
while(c!=0):
c=self.getint(addr,1);
addr+=1;
str+=chr(c);
return str;

Note that bytemode is set to 1 so as to receive single bytes rather than full (16-bit) words. Printing self.getstr(0x202) while running the simulation runs strcpy(foo,bar) gives me the following:
...
getstr()=Hello w
getstr()=Hello w
getstr()=Hello w
getstr()=Hello w
getstr()=Hello wo
getstr()=Hello wo
getstr()=Hello wo
getstr()=Hello wo
getstr()=Hello wor
getstr()=Hello wor
getstr()=Hello wor
getstr()=Hello wor
getstr()=Hello worl
...


This works, but it's instruction-accurate. Few people have the patience to sit through a 90-minute lecture on machine-language. No one has the patience to sit through such a lecture when it takes four slides to copy a byte. To keep my audience awake, my code only prints a slide when something interesting happens, so the result is not a frame-by-frame record of execution but just slices of time at which watched variables change.

For use in LaTeX, it's necessary to sanitize the string output, particularly if it is to later be corrupted. As this is for a conference presentation--rather than a paper--I use Beamer to generate a PDF slideshow, pdf2oo to generate an OpenOffice Impress presentation, and OpenOffice to export to PowerPoint.

Continued



This article is a continued as MSP430simu and LaTeX, part 2.