This week I went crazy about file formats. I tried to understand the specifications of several popular formats such as MP3, FLV and PDF. It's amazing to see that no matter how complex these technologies are, or how clever the algorithms they use to store media efficiently, at the lowest level a file is just a clever arrangement of bits that makes sense. With a bit of experimentation and hacking around the MP3 format (a hex editor is an invaluable tool for this), I was able to read MP3 files in PHP without using any extension. The source has been put on GitHub.
Binary File Reader
The native way to read binary data in PHP is unpack(). My problem with it was that it can't handle variable-length chunks, and I found the format of its packing codes tough to understand, so I wrote my own reader. Unluckily, I realized quite late (damn!) that I could have built the reader more efficiently on top of unpack(). (Gist)
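To make the packing codes a little less mysterious, here is a minimal sketch of reading a variable-length chunk with two unpack() calls. The record layout (a 2-byte length followed by that many bytes) is made up for illustration and is not part of any MP3 structure.

```php
<?php
// Hypothetical record: a 2-byte big-endian length, then that many payload bytes.
$data = "\x00\x04ABCD";

// Step 1: the fixed part. 'n' is an unsigned 16-bit big-endian integer.
$head = unpack('nlength', $data);

// Step 2: the variable part, sized by the value just read. 'a' is a raw byte string.
$body = unpack('a' . $head['length'] . 'payload', substr($data, 2));

echo $head['length'], ' ', $body['payload'], "\n";
```

The format string is built at run time, which is one way around unpack() having no notion of "read as many bytes as the previous field says".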
A Background on ID3 Tags
Like I said, tags are nothing but an arrangement of bytes that makes sense. As the official spec describes, the first three bytes are fixed and always read “ID3”. The next two bytes declare the version, one byte holds the flags, and the next four bytes give the total length of the tags that follow. I did not find much use for the first 10 bytes; the purpose of the flag byte in particular was obscure to me.
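The 10-byte layout above can be sketched with a single unpack() call. This is only an illustration: the function name and array keys are my own, it assumes PHP 7.1 or later, and it treats the four size bytes as a "synchsafe" integer (7 usable bits per byte), which is how the ID3v2 spec defines the header size.

```php
<?php
/**
 * Parse the 10-byte ID3v2 header. Function name and keys are illustrative.
 * Returns null when the bytes do not start with "ID3".
 */
function parseId3Header(string $bytes): ?array
{
    if (strlen($bytes) < 10 || substr($bytes, 0, 3) !== 'ID3') {
        return null;
    }

    // a3 = 3 raw bytes, C = unsigned 8-bit integer, C4 = four of them (size1..size4).
    $h = unpack('a3id/Cmajor/Crevision/Cflags/C4size', $bytes);

    // Synchsafe size: only the low 7 bits of each byte carry data.
    $size = (($h['size1'] & 0x7F) << 21)
          | (($h['size2'] & 0x7F) << 14)
          | (($h['size3'] & 0x7F) << 7)
          |  ($h['size4'] & 0x7F);

    return [
        'major'    => $h['major'],
        'revision' => $h['revision'],
        'flags'    => $h['flags'],
        'size'     => $size,
    ];
}

// Hand-built sample header: version 2.3.0, no flags, size bytes 00 00 02 01.
$header = parseId3Header("ID3\x03\x00\x00\x00\x00\x02\x01");
print_r($header);
```

Because of the 7-bit packing, the size bytes 00 00 02 01 decode to 257 rather than 513.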
What follows is a series of frames, each with a header and a body, which hold the actual content. The header has four characters for the Frame ID, followed by four bytes for the size of the body and two bytes for flags; the body of the tag comes next. The picture below makes this clearer.
00 00 00 0C is the size of the tag body (12 bytes), 48 65 represent the flag bits, which are described in the spec, and the next 12 bytes (“Heavy Metal”) form the body of the tag. Many such frames make up the information about the MP3 file. Some frames, like APIC, which represents the album art, have further formatting inside their “body”.
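As a rough sketch of the same idea in code, the snippet below decodes one hand-built frame. The bytes are my own sample, not the ones from the picture: I assume an ID3v2.3 frame (plain 32-bit big-endian size), zeroed flag bytes, and a text frame whose body starts with a one-byte text encoding marker, as the spec describes for text information frames.

```php
<?php
// Hand-built sample frame (an assumption for illustration, not taken from a real file):
// frame ID, 4-byte size, 2 flag bytes, then a 12-byte body.
$frame = "TCON" . "\x00\x00\x00\x0C" . "\x00\x00" . "\x00Heavy Metal";

// a4 = four-character frame ID, N = unsigned 32-bit big-endian, n = unsigned 16-bit big-endian.
$h = unpack('a4id/Nsize/nflags', $frame);

// The body is the next "size" bytes after the 10-byte frame header.
$body = substr($frame, 10, $h['size']);

// Text frames begin with one encoding byte (0x00 means ISO-8859-1).
$encoding = ord($body[0]);
$text = substr($body, 1);

printf("%s (%d bytes): %s\n", $h['id'], $h['size'], $text);
```

Here the encoding byte plus the 11 characters of "Heavy Metal" add up to the 12 bytes announced in the size field.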
Constructing an ID3 Reader
Once you understand the spec, creating a reader is very simple. The first step is to read the header bytes.
The constructor in ID3Tags_Reader.php initializes a BinaryFileReader object with a map of the first 10 bytes. As explained, “ID3” is a fixed 3-byte string, followed by the version, the flags and the total size of the tag body (which is cast to an integer). Once the header is read, we can start reading tags.
The ReadAllTags() method defines a similar map for reading frames.
“Body” uses an option to define a variable-length string whose length depends on “Size” (keep in mind to cast “Size” to an integer). A while loop follows to read all tags defined in
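The frame-reading loop can be sketched without BinaryFileReader, using unpack() directly. This is not the code from the repository: the function name and array keys are hypothetical, and it assumes ID3v2.3 frames (plain 32-bit size) and that everything after the last frame is zero padding.

```php
<?php
/**
 * Walk the frames in a tag body (the bytes after the 10-byte ID3 header).
 * Hypothetical helper; assumes ID3v2.3 frame headers.
 */
function readFrames(string $tagBody): array
{
    $frames = [];
    $offset = 0;
    $length = strlen($tagBody);

    while ($offset + 10 <= $length) {
        // Padding after the last frame is zero bytes, so a NUL here ends the loop.
        if ($tagBody[$offset] === "\x00") {
            break;
        }

        $h = unpack('a4id/Nsize/nflags', substr($tagBody, $offset, 10));

        // Stop on an empty frame or one that claims more bytes than are left.
        if ($h['size'] === 0 || $offset + 10 + $h['size'] > $length) {
            break;
        }

        $frames[] = [
            'id'    => $h['id'],
            'flags' => $h['flags'],
            'body'  => substr($tagBody, $offset + 10, $h['size']),
        ];
        $offset += 10 + $h['size'];
    }

    return $frames;
}

// Two hand-built text frames followed by 16 bytes of padding.
$tagBody = "TIT2" . pack('N', 5) . "\x00\x00" . "\x00Song"
         . "TCON" . pack('N', 12) . "\x00\x00" . "\x00Heavy Metal"
         . str_repeat("\x00", 16);

$frames = readFrames($tagBody);
echo count($frames), " frames read\n";
```

Frames are collected in a list rather than keyed by ID, since the same frame ID can appear more than once in a tag.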
Reading Album Art
Album art, or Attached Picture in official terms, refers to the pictures of albums and songs we see in our music players. The body of APIC has a special formatting described in the spec. The problem in reading it was how to create a file handle from a string for BinaryFileReader. While this could easily have been achieved with unpack(), I would not let my work go unnoticed :).
PHP provides a way to create artificial streams without using files. They are so flexible that you can create them from strings, HTTP resources, standard input and so on. To create a stream from a string here, we can simply use the “data://” wrapper.
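A minimal sketch of turning a string into a readable handle with “data://” might look like this. The sample bytes and variable names are made up; the payload is base64-encoded so that arbitrary binary data survives inside the URI, and the media type is only a label here.

```php
<?php
// Any binary string, for example the body of a frame (sample bytes for illustration).
$bytes = "\x00image/png\x00";

// Wrap the string in a data: URI and open it like a file.
$handle = fopen('data://text/plain;base64,' . base64_encode($bytes), 'rb');

$first = fread($handle, 1);            // first byte of the string
$rest  = stream_get_contents($handle); // everything after it
fclose($handle);

echo strlen($first) + strlen($rest), " bytes read back\n";
```

From here on, the handle can be passed to anything that expects a file resource.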
Reading the image data then needs one more map.
MimeType, Content Description and FileName have no fixed size; they are just null-terminated strings. BinaryData, which contains the main image content, is the rest of the file.
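As a sketch of reading null-terminated fields from such a stream, the snippet below follows the APIC layout as I read it in the ID3v2.3 spec: a text encoding byte, a null-terminated MIME type, a one-byte picture type, a null-terminated description, then the picture data. The function names and array keys are my own and may not match the map in the repository, and the single NUL after the description assumes encoding 0 (ISO-8859-1).

```php
<?php
// Hypothetical helper: read bytes up to (not including) the next NUL.
function readNullTerminated($handle): string
{
    $out = '';
    while (($c = fgetc($handle)) !== false && $c !== "\x00") {
        $out .= $c;
    }
    return $out;
}

// Hypothetical helper: split an APIC frame body into its fields.
function parseApic(string $body): array
{
    $h = fopen('data://text/plain;base64,' . base64_encode($body), 'rb');

    // Fields are read strictly in order, so the stream position advances field by field.
    $picture = [
        'encoding'    => ord(fread($h, 1)),
        'mimeType'    => readNullTerminated($h),
        'pictureType' => ord(fread($h, 1)),
        'description' => readNullTerminated($h), // single NUL: assumes encoding 0
        'data'        => stream_get_contents($h), // the rest is the image itself
    ];

    fclose($h);
    return $picture;
}

// Hand-built sample body; the "image" is just the first four bytes of a PNG signature.
$sample = "\x00image/png\x00\x03Cover\x00\x89PNG";
$pic = parseApic($sample);

echo $pic['mimeType'], ', ', strlen($pic['data']), " bytes of image data\n";
```

With a real frame, the 'data' field could be written to disk with file_put_contents() to get the cover image.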