Conan, a detective which aggregates some clues to recognize MIME types
conan is a re-implementation of the famous command file:
$ file --mime image.png
image/png
$ conan.file --mime image.png
image/png
This program/library (see libmagic) is widely used by protocols that transmit
the MIME type of a file. The MIME type then lets the receiver call the right
program to manipulate the given file.
For instance, the HTTP protocol transmits the MIME type of the body via the
Content-Type field:
HTTP/1.1 200 OK
Content-Type: image/png
Content-Length: 4096
<your image>
You can find file used in many places, such as your web browser (to be able to
run the right application to interpret the given file) or your Mail User Agent.
However, file is pretty old (1987), it is implemented in C, it was never
formalized and it has no standard. Using file as an engine for file
recognition can be a risk (segmentation fault, undefined behavior, unstable
release process, etc.).
But file has evolved over several years and it contains a large, extensible
database which can be considered reliable due to its seniority. So, some
well-known software projects decided to re-implement a subset of file/libmagic
which is less expressive/powerful but does the job within certain limits.
The DSL - libmagic
The file database uses a language described by man magic:
- a line describes an operation
- an operation is:
  - a test of a certain value at a certain position in the given file
  - an anchor
  - a jump instruction to an anchor
  - a MIME value
  - a strength value
These operations are organized as a tree. An operation is prefixed by a
level (>) and, from it, we are able to construct the decision tree which
describes multiple paths to recognize the MIME type of the given file.
For instance:
[0]
> [1]
>> [2]
> [3]
>> [4]
>>> [5]
produces this decision tree:
[0]
 | \
[1] [3]
 |   |
[2] [4]
     |
    [5]
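As a rough sketch of how the levels give the tree, the following OCaml code rebuilds the structure above from a flat list of (level, label) pairs. The `tree` type and the `build` function are hypothetical names used for illustration only; they are not conan's actual representation.

```ocaml
(* A node of the decision tree: a label and its children, in source order.
   This type is an assumption made for the example, not conan's own type. *)
type tree = Node of string * tree list

(* [build lines] turns a flat list of (level, label) pairs, where the level
   is the number of leading '>' characters, into a list of trees. Lines whose
   level skips ahead (for instance 0 followed by 2) are silently ignored. *)
let build (lines : (int * string) list) : tree list =
  (* [go level lines] reads consecutive nodes at [level] and returns them
     together with the lines that were not consumed. *)
  let rec go level lines =
    match lines with
    | (l, label) :: rest when l = level ->
        let children, rest = go (level + 1) rest in
        let siblings, rest = go level rest in
        (Node (label, children) :: siblings, rest)
    | _ -> ([], lines)
  in
  fst (go 0 lines)

(* The example from the text: [0] has two children, [1] and [3]. *)
let example =
  build
    [ (0, "[0]"); (1, "[1]"); (2, "[2]"); (1, "[3]"); (2, "[4]"); (3, "[5]") ]
```

Here `example` holds a single root `[0]` whose children are `[1]` (with child `[2]`) and `[3]` (with child `[4]`, itself the parent of `[5]`).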
The test operation
An operation is usually a test which compares the data starting at a particular
offset in the file with a byte value, a string or a numeric value. If the test
succeeds, we continue along the path according to the decision tree. For
instance, if [0] succeeds, we will try [1] and [3].
Along the way, we aggregate multiple solutions, each with a priority (see the strength value), and we choose the one with the highest priority.
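The selection step can be pictured with a minimal sketch, assuming each solution is simply a pair of a strength and a MIME type. The `best` function is a hypothetical helper, and keeping the first solution on a tie is an arbitrary choice made for the example.

```ocaml
(* [best candidates] returns the MIME type with the highest strength, or
   [None] if there is no candidate. On a tie, the first candidate wins
   (an assumption of this sketch). *)
let best (candidates : (int * string) list) : string option =
  let pick acc (strength, mime) =
    match acc with
    | Some (s, _) when s >= strength -> acc
    | _ -> Some (strength, mime)
  in
  Option.map snd (List.fold_left pick None candidates)
```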
A test operation has the following fields:
- offset: a number specifying the offset (in bytes) into the file of the data
  which is to be tested. This offset can be relative to the previous
  operation's offset if it begins with &.
- type: the type of the data to be tested. We implemented many types such as
  byte, short, long, string or date.
- test: the value to be compared with the value from the file.
- message: the message to be printed if the comparison succeeds.
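To make these four fields concrete, here is a small sketch of how such a line could be modelled. All the names below (`offset`, `ty`, `operation`, `offset_of_string`) are assumptions made for illustration and do not reflect conan's internal types.

```ocaml
(* An offset is either absolute or, when written with a leading '&',
   relative to the previous operation. *)
type offset = Absolute of int | Relative of int

(* A few of the types listed above. *)
type ty = TByte | TShort | TLong | TString | TDate

(* One test operation; [test] is kept as raw text in this sketch. *)
type operation = { offset : offset; ty : ty; test : string; message : string }

(* [offset_of_string s] reads "&N" as a relative offset and "N" as an
   absolute one. [int_of_string] accepts both decimal and 0x notation and
   raises [Failure] on anything else. *)
let offset_of_string (s : string) : offset =
  if String.length s > 0 && s.[0] = '&'
  then Relative (int_of_string (String.sub s 1 (String.length s - 1)))
  else Absolute (int_of_string s)

(* A line such as "&0 leshort 0x14c for Intel 80386", endianness left aside. *)
let intel =
  { offset = offset_of_string "&0"; ty = TShort; test = "0x14c";
    message = "for Intel 80386" }
```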
Let's play!
An example is more interesting than the theory. Let's try to recognize a
zlib archive. According to [RFC1950][] (where CMF and FLG are the first two
bytes, in MSB order, of a zlib archive):
The FCHECK value must be such that CMF and FLG, when viewed as a 16-bit unsigned integer stored in MSB order (CMF*256 + FLG), is a multiple of 31.
Then, the first byte should have a CM = 8:
CM (Compression method) This identifies the compression method used in the file. CM = 8 denotes the "deflate" compression method with a window size up to 32K.
The RFC also states that CM = 15 is reserved and that CINFO, the 4 most
significant bits of the first byte, must not be above 7, so we can consider
that byte & 0x80 (the most significant bit of the first byte) must be equal
to 0.
Finally, we have 3 tests to do:
- the 16-bit number (big-endian order) must be a multiple of 31
- CM, which is the 4 least significant bits of the first byte, must be equal to 8
- the most significant bit of the first byte must be equal to 0
In our syntax, and following the idea of a decision tree, we must test these
assertions step by step. At the end, we can say that the file is probably an
application/zlib:
0 beshort%31 =0
>0 byte&0xf =8
>>0 byte&0x80 =0
!:mime application/zlib
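For comparison, the same three checks can be written by hand in plain OCaml. This is only a sketch of what the description expresses, not how conan evaluates it, and `looks_like_zlib` is a hypothetical name.

```ocaml
(* [looks_like_zlib s] applies the three tests to the first two bytes of [s].
   Each comment shows the corresponding line of the description. *)
let looks_like_zlib (s : string) : bool =
  String.length s >= 2
  &&
  let cmf = Char.code s.[0] and flg = Char.code s.[1] in
  ((cmf * 256) + flg) mod 31 = 0 (* 0   beshort%31 =0 *)
  && cmf land 0x0f = 8           (* >0  byte&0xf   =8 *)
  && cmf land 0x80 = 0           (* >>0 byte&0x80  =0 *)
```

For instance, the common header bytes 0x78 0x9c pass the three tests (0x789c = 30876 = 31 * 996), while a gzip header (0x1f 0x8b) fails the second one.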
Now, let's play with conan:
open Rresult

(* The zlib description above, as a quoted string. *)
let zlib =
  {file|0 beshort%31 =0
>0 byte&0xf =8
>>0 byte&0x80 =0
!:mime application/zlib
|file}

(* Parse the description into a decision tree and fail on a parse error. *)
let tree = R.failwith_error_msg @@ Conan_unix.tree_of_string zlib

let () =
  if Array.length Sys.argv >= 2
  then
    let m = R.failwith_error_msg @@
      Conan_unix.run_with_tree tree Sys.argv.(1) in
    match Conan.Metadata.mime m with
    | Some v -> Fmt.pr "%s\n%!" v
    | None -> Fmt.epr "MIME type not found.\n%!"
  else Fmt.epr "%s <filename>\n%!" Sys.argv.(0)
This little program will only recognize "application/zlib" according to our description above. Of course, the DSL can be more complex than that!
Complex recognition
Indirect offset
Offsets do not need to be constant, but can also be read from the file being
examined. If the first character following the last > is a parenthesis, then
the string inside the parentheses is interpreted as an indirect offset: the
value at that offset is read, and is used again as an offset in the file.
For instance, this tree does an indirection through the unsigned long
(little-endian) value available at the offset 0x3c:
0 string MZ
>0x18 leshort >0x3f
>>(0x3c.l) string PE\0\0 PE executable (MS-Windows)
>>(0x3c.l) string LE\0\0 LX executable (OS/2)
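The indirection used for the PE case can be spelled out by hand as follows. This is a sketch under the assumption that the whole file is available as a string; `le16`, `le32` and `is_pe` are hypothetical helpers, not part of conan.

```ocaml
(* Read unsigned little-endian values of 16 and 32 bits at offset [off]. *)
let le16 s off = Char.code s.[off] lor (Char.code s.[off + 1] lsl 8)
let le32 s off = le16 s off lor (le16 s (off + 2) lsl 16)

(* [is_pe s] follows the three levels of the tree:
   - "MZ" at offset 0,
   - a little-endian short at 0x18 greater than 0x3f,
   - "PE\0\0" at the offset stored (little-endian long) at 0x3c. *)
let is_pe (s : string) : bool =
  let len = String.length s in
  len >= 0x40
  && String.sub s 0 2 = "MZ"
  && le16 s 0x18 > 0x3f
  &&
  let target = le32 s 0x3c in
  (* Guard against an indirect offset which points outside the file. *)
  target >= 0 && target + 4 <= len && String.sub s target 4 = "PE\000\000"
```

The bounds check on `target` matters: the offset comes from the file itself, so it can not be trusted.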
You should check man magic to see the syntax and the available types. You can
also apply a calculation if the indirect offset can not be used directly, as
in this example where we multiply the indirect offset by 512:
>0x18 leshort <0x40
>>(4.s*512) leshort 0x014c COFF executable (MS-DOS, DJGPP)
>>(4.s*512) leshort !0x014c MZ executable (MS-DOS)
Relative offset
Moreover, you can specify an offset relative to the end of the last up-level
field using & as a prefix to the offset:
0 string MZ
>0x18 leshort >0x3f
>>(0x3c.l) string PE\0\0 PE executable (MS-Windows)
>>>&0 leshort 0x14c for Intel 80386
>>>&0 leshort 0x184 for DEC Alpha
And, of course, indirect and relative offsets can be combined.
Jump and recursion
It is possible to define a "named" magic instance that can be called from another use magic entry, like a subroutine call. The offset of the subroutine is relative to the caller.
To call a subroutine, we apply the use operation with the name of the
subroutine. You don't need to define the subroutine before the caller.
Indeed, file and conan collect all subroutines first and then process the
decision tree.
This is a simple example to determine if the length of the given file is odd or even:
0 name even
>0 byte x even
>>1 use odd
0 name odd
>0 byte x odd
>>1 use even
0 byte x
>0 use odd
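One way to read this description is as two mutually recursive functions that walk the file byte by byte. The sketch below is an interpretation written for illustration (the names `odd`, `even` and `parity` are hypothetical), and it assumes that the last message emitted gives the answer.

```ocaml
(* [odd s off] emits "odd" if a byte exists at [off], then calls [even] on
   the next byte, and vice versa. Each call mirrors one named subroutine. *)
let rec odd (s : string) (off : int) : string list =
  if off < String.length s then "odd" :: even s (off + 1) else []

and even (s : string) (off : int) : string list =
  if off < String.length s then "even" :: odd s (off + 1) else []

(* The entry point starts with [odd] at offset 0; the last message emitted
   tells the parity of the length. An empty file emits nothing. *)
let parity (s : string) : string option =
  match List.rev (odd s 0) with
  | last :: _ -> Some last
  | [] -> None
```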
Other operations
The libmagic DSL implements many things but, as we said, no standard exists
for it. We mostly reverse-engineered it to implement operations. Some of them
are not implemented, either due to the lack of definitions or just because we
did not find them in the file database. Some others are explicitly not
implemented because we consider them a hack rather than a coherent feature.
Also, we mostly focus on delivering the MIME type instead of a full
description of the given file. file shows you many things such as the size
of the image, the bitrate of the sound, etc. We tried to implement them but
we are more focused on MIME recognition.
Experimental
According to what we said above, conan is experimental and, from the usage
point of view, it can leak exceptions such as Unimplemented feature.
Then, even if much work was done on types, where we try to unify the type of the expected value and the type of the test, the type expected by the message is still weak (for many reasons). In other words, even if we can parse and process the decision tree, we can still fail when we print out messages (because we can not unify the type of the value and the type expected by the given message).
Finally, file does not describe any standard for the database and the man
pages are a bit outdated compared to what the file command does. For these
reasons, it's hard to prove or even say that we have the same behavior as file.
We try to be close to what it does but, in some edge cases, we can not ensure
that we will produce the same result as file.
Also, we have not explored everything in the given database. Even if we can parse and generate a decision tree from the database, some specific execution paths can lead to an unexpected failure. We intend to fix them step by step, of course. Feel free to test and open an issue!
MirageOS support
The other goal of conan is to be able to integrate the database into a
unikernel and to give an application (such as a web server) the opportunity
to recognize the MIME types of files.
syscalls
Like any MirageOS project, conan abstracts the syscalls required to introspect
a file. In this way, conan.string exists and is able to recognize the MIME
type of a given string (instead of a file). lwt support exists too, which
manipulates a stream.
Database
conan is able to parse a database and serialize it as a full OCaml value. The
distribution provides 2 databases:
- the file database
- the previous database without the extra paths which do not tag the MIME type
The second is lighter than the first and should be used only to get the MIME type. Indeed, any information such as the size of the image or the bitrate of the sound is deleted.
For instance, a unikernel for Solo5 with the lighter database is around 6 MB.
You can also build your own special database. If you know that you want to
recognize only a few kinds of objects, you can merge the tree values for these
objects and make a smaller database:
#require "conan-unix" ;;
#require "conan-database" ;;

let tree0 = Conan_compress.tree
let tree1 = Conan_ocaml.tree
let tree2 = Conan_audio.tree

(* Merge the selected trees into a single, smaller database. *)
let tree = List.fold_left Conan.Tree.merge Conan.Tree.empty
  [ tree0; tree1; tree2 ]

let recognize_ocaml_or_archive_or_audio filename =
  Conan_unix.run_with_tree tree filename