Wednesday, May 28, 2014

Mach-O filetype identification

I wanted to write a quick and dirty file-type identifier for Mach-O, but it turns out to be trickier than I expected. From /usr/share/file/magic/mach:
# $File: mach,v 1.9 2009/09/19 16:28:10 christos Exp $
# Mach has two magic numbers, 0xcafebabe and 0xfeedface.
# Unfortunately the first, cafebabe, is shared with
# Java ByteCode, so they are both handled in the file "cafebabe".
# The "feedface" ones are handled herein.
and from /usr/share/file/magic/cafebabe:
# Since Java bytecode and Mach-O universal binaries have the same magic number, the test
# must be performed in the same "magic" sequence to get both right.  The long
# at offset 4 in a mach-O universal binary tells the number of architectures; the short at
# offset 4 in a Java bytecode file is the JVM minor version and the
# short at offset 6 is the JVM major version.  Since there are only 
# only 18 labeled Mach-O architectures at current, and the first released 
# Java class format was version 43.0, we can safely choose any number
# between 18 and 39 to test the number of architectures against
# (and use as a hack). Let's not use 18, because the Mach-O people
# might add another one or two as time goes by...
GAAAH! Unsurprisingly, more than one engineer wanted to use the cutesy "cafebabe" for their magic number. I ended up using this regex on the hex-encoded start of the file. It also matches Java bytecode, but it was good enough for my purposes:
^(cffaedfe|cefaedfe|feedface|feedfacf|cafebabe)
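For illustration, here is a minimal Python sketch of how that regex could be applied. It assumes the pattern is matched against the hex encoding of the first four bytes of the file; the function name looks_like_macho is made up for this example.

```python
import re

# Same pattern as above, matched against the hex encoding of the leading bytes.
MACHO_MAGIC_RE = re.compile(r"^(cffaedfe|cefaedfe|feedface|feedfacf|cafebabe)")


def looks_like_macho(data: bytes) -> bool:
    """Return True if the first four bytes match one of the magic values.

    Caveat: Java class files also start with cafebabe, so they match too.
    """
    # bytes.hex() yields lowercase hex, e.g. b"\xcf\xfa\xed\xfe" -> "cffaedfe".
    return MACHO_MAGIC_RE.match(data[:4].hex()) is not None
```

To check a file on disk, read its first four bytes in binary mode, e.g. open(path, "rb").read(4), and pass them in.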
The full Mach-O filetype doco is here. The various magic byte strings are as follows:
  • cefaedfe: Mach-O Little Endian (32-bit)
  • cffaedfe: Mach-O Little Endian (64-bit)
  • feedface: Mach-O Big Endian (32-bit)
  • feedfacf: Mach-O Big Endian (64-bit)
  • cafebabe: Universal Binary (the fat header is always big endian). These fat binaries are archives that can include binaries for multiple architectures; they typically contain PowerPC and Intel x86.
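If the Java false positive matters, the trick from the cafebabe magic file quoted above can be sketched in Python as follows. This is an untested-in-the-wild sketch, not what file(1) literally does: the identify function and the MAX_FAT_ARCHS cutoff of 30 are my own assumptions, the cutoff being an arbitrary pick from the 18 to 39 range that the comment suggests.

```python
import struct

# Magic byte strings from the list above, as they appear on disk.
MAGIC_NAMES = {
    "cefaedfe": "Mach-O little endian (32-bit)",
    "cffaedfe": "Mach-O little endian (64-bit)",
    "feedface": "Mach-O big endian (32-bit)",
    "feedfacf": "Mach-O big endian (64-bit)",
}

# Assumption: arbitrary cutoff inside the 18..39 range from the magic file comment.
MAX_FAT_ARCHS = 30


def identify(data: bytes) -> str:
    """Classify a file from its first eight bytes (hypothetical helper)."""
    magic = data[:4].hex()
    if magic in MAGIC_NAMES:
        return MAGIC_NAMES[magic]
    if magic == "cafebabe" and len(data) >= 8:
        # Big-endian 32-bit value at offset 4: the architecture count in a
        # universal binary, or the minor/major version pair in a Java class.
        (count,) = struct.unpack(">I", data[4:8])
        if count < MAX_FAT_ARCHS:
            return "Mach-O universal binary (%d architectures)" % count
        return "Java bytecode"
    return "unknown"
```

A Java class file puts its major version in the low 16 bits of that value, so it lands at or above the cutoff, while a real universal binary only lists a handful of architectures.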
