Why are the Microsoft Office file formats so complicated?

An excellent post from Joel Spolsky about the file format specifications that were recently released by Microsoft (to comply with or avoid more anti-trust / anti-competition mess).

Last week, Microsoft published the binary file formats for Office. These formats appear to be almost completely insane. The Excel 97-2003 file format is a 349 page PDF file. But wait, that’s not all there is to it!

This is a perfect example of the complexity of parsing and of dealing with file formats (particularly binary file formats) in general. As Joel describes it:

A normal programmer would conclude that Office’s binary file formats:

  • are deliberately obfuscated
  • are the product of a demented Borg mind
  • were created by insanely bad programmers
  • and are impossible to read or create correctly.

You’d be wrong on all four counts.

Read the article for more insight about parsing and the kind of data that you’re likely to find in the wild. While you’re at it, his Back to Basics post covers similar ground with regard to proper programming skills, and also gets into the issues of file formats (binary versus XML, and how you build code that reads it).

Joel is another (technical) person whose writing I really enjoy. In the course of digging through his page a bit, I was also reminded of the Best Software Writing I compilation that he assembled, a much-needed collection because of the lack of well-chosen words on the topic.

Saturday, March 15, 2008 | parse  

Visualizing Data is my 2007 book about computational information design. It covers the path from raw data to how we understand it, detailing how to begin with a set of numbers and produce images or software that lets you view and interact with information. When first published, it was the only book(s) for people who wanted to learn how to actually build a data visualization in code.

The text was published by O’Reilly in December 2007 and can be found at Amazon and elsewhere. Amazon also has an edition for the Kindle, for people who aren’t into the dead tree thing. (Proceeds from Amazon links found on this page are used to pay my web hosting bill.)

Examples for the book can be found here.

The book covers ideas found in my Ph.D. dissertation, which is the basis for Chapter 1. The next chapter is an extremely brief introduction to Processing, which is used for the examples. Next (Chapter 3) is a simple mapping project to place data points on a map of the United States. Of course, the idea is not that lots of people want to visualize data for each of 50 states. Instead, it’s a jumping-off point for learning how to lay out data spatially.

The chapters that follow cover six more projects, such as salary vs. performance (Chapter 5), zipdecode (Chapter 6), followed by more advanced topics dealing with trees, treemaps, hierarchies, and recursion (Chapter 7), plus graphs and networks (Chapter 8).

This site is used for follow-up code and writing about related topics.