Why are the Microsoft Office file formats so complicated?
An excellent post from Joel Spolsky about the file format specifications that Microsoft recently released (either to comply with, or to avoid more of, the antitrust / anti-competition mess).
Last week, Microsoft published the binary file formats for Office. These formats appear to be almost completely insane. The Excel 97-2003 file format is a 349 page PDF file. But wait, that’s not all there is to it!
This is a perfect example of how complex parsing and dealing with file formats (binary ones in particular) can be. As Joel describes it:
A normal programmer would conclude that Office’s binary file formats:
- are deliberately obfuscated
- are the product of a demented Borg mind
- were created by insanely bad programmers
- and are impossible to read or create correctly.
You’d be wrong on all four counts.
Read the article for more insight into parsing and the kind of data that you’re likely to find in the wild. While you’re at it, his Back to Basics post covers similar ground on fundamental programming skills, and also gets into file formats (binary versus XML, and how you build the code that reads them).
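To make the binary-versus-XML point a little more concrete, here is a minimal sketch of what the reading code looks like on each side. Everything in it is an assumption for illustration: the record layout (a 16-bit type, a 16-bit length, then the payload), the `read_records` and `read_xml_cells` helpers, and the `<sheet>`/`<cell>` markup are made up and are not taken from the Office specifications or from Joel's posts.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical layout, for illustration only:
# little-endian uint16 record type, uint16 payload length, then the payload.
HEADER = struct.Struct("<HH")


def read_records(data: bytes):
    """Yield (record_type, payload) pairs from a buffer of length-prefixed records."""
    offset = 0
    while offset < len(data):
        if offset + HEADER.size > len(data):
            raise ValueError("truncated record header")
        rec_type, length = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        payload = data[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated record payload")
        offset += length
        yield rec_type, payload


def read_xml_cells(text: str):
    """Return (ref, text) pairs for every <cell> element in a made-up XML sheet."""
    root = ET.fromstring(text)
    return [(cell.get("ref"), cell.text) for cell in root.iter("cell")]


# Two binary records: a 5-byte string and a 2-byte little-endian integer.
binary = HEADER.pack(1, 5) + b"hello" + HEADER.pack(2, 2) + struct.pack("<H", 42)
records = list(read_records(binary))

xml_cells = read_xml_cells(
    '<sheet><cell ref="A1">hello</cell><cell ref="B1">42</cell></sheet>'
)
```

The binary reader has to know every offset, width, and byte order up front and has to guard against truncated input itself. The XML reader hands that work to a general-purpose parser, at the cost of a much larger and slower representation.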
Joel is another (technical) person whose writing I really enjoy. While digging through his site a bit, I was also reminded of the Best Software Writing I compilation that he assembled, a much-needed collection given the lack of well-chosen words on the topic.