An excellent post from Joel Spolsky about the file format specifications that were recently released by Microsoft (to comply with or avoid more anti-trust / anti-competition mess).
Last week, Microsoft published the binary file formats for Office. These formats appear to be almost completely insane. The Excel 97-2003 file format is a 349 page PDF file. But wait, that’s not all there is to it!
This is a perfect example of the complexity of parsing and of dealing with file formats (particularly binary file formats) in general. As Joel describes it:
A normal programmer would conclude that Office’s binary file formats:
- are deliberately obfuscated
- are the product of a demented Borg mind
- were created by insanely bad programmers
- and are impossible to read or create correctly.
You’d be wrong on all four counts.
Read the article for more insight about parsing and the kind of data that you’re likely to find in the wild. While you’re at it, his Back to Basics post covers similar ground with regard to proper programming skills, and also gets into the issues of file formats (binary versus XML, and how you build code that reads it).
Joel is another (technical) person whose writing I really enjoy. In the course of digging through his page a bit, I also was reminded of the Best Software Writing I compilation that he assembled, a much needed collection because of the lack of well chosen words on the topic.
A New York Times article from February about the difficulty of removing your personal information from Facebook. I believe that in the days that followed Facebook responded by making it ever-so-slightly possible to actually remove your account (though still not very easy).
Further, there is the network effect of information that’s not “just” your own. Deleting a Facebook profile does not appear to delete posts you’ve made to “the wall” of any friends, for instance. Do you own those comments? Does your friend? It’s a somewhat similar situation in other areas—even if I chose not to have a Gmail account, because I don’t like their data retention policy, all my email sent to friends with Gmail accounts is subject to those terms I’m unhappy with.
Regardless, this is an enormous issue as we put more of our data online. What does it mean to have this information public? What happens when you change your mind?
Facebook stands out because it’s a scenario of starting college (at age 17 or 18 or now even earlier), having a very different view of what’s public and private, and that evolving over time. You may not mind having things public at the time, but one of the best things about college (or high school, for that matter) is that you move on. Having a log of your outlook, attitude, and photos to prove it that is stored on a company’s servers means that there are more permanent memories of the time which are out of your control. (And you don’t know who else besides Facebook is storing it: search engine caches, companies doing data mining, etc. all take a role here.) Your own memories might be lost to alcohol or willful forgetfulness, but digital copies don’t behave the same way.
The bottom line is an issue of ownership of one’s own personal information. At this point, we’re putting more information online—whether it’s Facebook or having all your email stored by Gmail—but we haven’t figured out what that really means.
One of the chapters that I had to cut from Visualizing Data was about scenarios—building interactive “what if” tools that help you quickly try out several possibilities. This is one of the most useful aspects of dynamic visualization—being able to try out different ideas in a quick way (and safe, as in non-destructive, since Undo is always nearby). Hopefully I’ll be able to cover this sometime soon.
At any rate, one such scenario-building tool is Slate’s Delegate Calculator, where you can drag primitive sliders back and forth and see the possibilities for delegate outcomes for Hillary and Obama.
I’ve seen complaints about its math, but it seems to do an OK job for a big-picture look at the likelihood of different outcomes. Getting the math 100% is impossible (unless you have a far more complicated interface) because the delegate selection process is different in each state. It appears that none of the states wanted to be seen using the same approach as another, and with fifty states going their own way, things got pretty random (Texas: we’ll have a caucus and a primary).
I think that’s enough posting about politics for a bit.
Found this on Slashdot, but their headline, “Microsoft Developing News Sorting Based On Political Bias,” made it sound a lot more interesting than it may be. The idea of mining text data to tease out mythical media biases and leanings sounds fascinating. What sort of axes could be determined? Could we see how different kinds of language are used, or ways that particular code words or phrases infect news coverage?
Unfortunately, the research project from Microsoft looks like it’s just procuring link counts from “liberal” and “conservative” blogs, and gauging the vigor of commentary on either side. Does this make you uneasy yet?
- We are politically binary: the world has devolved into conservative and liberal! (Or not, yet why do people insist on it?) The representation seems almost entirely U.S.-centric, right down to the red and blue coloring on either side. Red states! Blue states! Red blogs! Blue blogs! A maleficent Dr. Seuss has infected our political outlook.
- What about those other axes, where are they? Of all the things to cull from political discourse, liberal vs. conservative must be one of the least interesting. Did you need a team of six from Microsoft, plus all the computing power at their disposal, to tell you that one article or another ruffled more feathers on either side of this simplified spectrum?
- There’s so much to be learned from propagation of phrases and ideas in the news; why hasn’t there been a more sophisticated representation of it? (Because it’s hard?) The Daily Show has shown this successfully (cueing up several people in a row, each repeating something like “axis of evil” or something about “momentum” for a candidate).
- Blogs are not real. When you turn off the computer, they go away. The internet is not a place, and is too divorced from actual reality to be a useful gauge on most social phenomena. Using blogs as input for a kind of meta-study seems like a poor way to acquire data.
The problems I cite are a bit unfair since they haven’t posted much on their site (looks like they’re presenting a paper…soon?) so the reaction is just based on what they’ve provided. I knew Sumit Basu back at the Media Lab and I think it’s safe to assume there’s more going on…
But what about these bigger issues?
Halfway through The Fog of War by Errol Morris (of The Thin Blue Line, or the Apple “Switch” ad campaign depending on your persuasion), Robert S. McNamara (Secretary of Defense for the Kennedy and Johnson administrations) describes proportionality in war:
Why was it necessary to drop the nuclear bomb if [General Curtis] LeMay was burning up Japan? And he went on from Tokyo to firebomb other cities. 58% of Yokohama. Yokohama is roughly the size of Cleveland. 58% of Cleveland destroyed. Tokyo is roughly the size of New York. 51% of New York destroyed. 99% of the equivalent of Chattanooga, which was Toyama. 40% of the equivalent of Los Angeles, which was Nagoya. This was all done before the dropping of the nuclear bomb, which by the way was dropped by LeMay’s command.
The gruesome description is abetted by a different kind of proportionality—that when placed in the context of size with regard to U.S. cities, these numbers become more “real.” I found this set particularly striking for how ordinary the cities were—Cleveland and Chattanooga, in addition to the usual New York and Los Angeles. The huge metropolitan areas may be too abstract for many, but Cleveland!?—those are actual people!
The entire transcript is also on Errol Morris’ site—amazing. Why don’t more studios do this? It’s great to be able to study it more closely, and was enough to convince me to purchase (rather than just rent) the movie.
Article from the Chronicle of Higher Education about course selection (competition, class lotteries, etc).
Every college has a hot-ticket class. Maybe it’s the subject matter (serial killers! sailing!) or maybe it’s a celebrity professor (George Tenet! Toni Morrison!). Whatever it is, everybody wants to get in.
And, of course, not everybody can. So how do you decide who gets a seat and who’s disappointed?
If you’re Patricia de Castries, you make everybody sleep outside your door. Ms. de Castries, assistant director of the Stanford Language Center, teaches a wildly popular wine-tasting course at the university. Often more than 100 would-be connoisseurs compete for the 60 spots, so on the eve of registration students show up with pillows and sleeping bags, hoping to get their names on the list. “It’s tough,” says Ms. de Castries, “but if you want to be in the class, you do it.”
The article covers the range from MIT’s technical approach to Wharton’s free market approach, where students bid on courses using a point system. Sadly, it now seems to be blocked except for those academic types who have access to a subscription.
(Thanks Eugene)
I’ve always been uncomfortable with the idea of David Brin’s The Transparent Society, because it provides an over-simplified version of a very complex problem. While it appeals to our general obsession with finding simple solutions, it fails to actually address a very real problem. Rather than a revolutionary or provocative idea, it’s in fact an argument for maintaining the status quo.
I’ve never quite been able to parse it out properly, but was pleased to see that Bruce Schneier (Chuck Norris of the security industry) addressed Brin’s argument this week for Wired:
When I write and speak about privacy, I am regularly confronted with the mutual disclosure argument. Explained in books like David Brin’s The Transparent Society, the argument goes something like this: In a world of ubiquitous surveillance, you’ll know all about me, but I will also know all about you. The government will be watching us, but we’ll also be watching the government. This is different than before, but it’s not automatically worse. And because I know your secrets, you can’t use my secrets as a weapon against me.
This might not be everybody’s idea of utopia — and it certainly doesn’t address the inherent value of privacy — but this theory has a glossy appeal, and could easily be mistaken for a way out of the problem of technology’s continuing erosion of privacy. Except it doesn’t work, because it ignores the crucial dissimilarity of power.
Schneier’s most recent book is Beyond Fear (which I’ve not yet had a chance to read), and he also has an excellent monthly mailing list (that I read all the time) that covers topics like privacy and security. He is a gifted writer who can explain both the subtleties of the privacy debate and the complexities of security in terms that are informative for technologists and interesting for anyone else.
Chapters 9 and 10 (acquire and parse) are secretly my favorite parts of Visualizing Data. They’re a grab bag of useful bits based on many years of working with information (previous headaches)… the sort of things that come up all the time.
Page 327 (Chapter 10) has some discussion about little endian versus big endian, the way in which different computer architectures (Intel vs. the rest of the world, respectively) handle multi-byte binary data. I won’t repeat the whole section here, though I have two minor errata for that page.
First, an error in formatting which lists network byte order, rather than network byte order. The other problem is that I mention that little endian versions of Java’s DataInputStream class can be found on the web for little more than a search for DataInputStreamLE. As it turns out, that was a big fat lie, though you can find a handful if you search for LEDataInputStream (even though that’s a goofier name).
To make it up to you, I’m posting proper DataInputStreamLE (and DataOutputStreamLE) which are a minor adaptation of code from the GNU Classpath project. They work just like DataInputStream and DataOutputStream, but just swap the bytes around for the Intel-minded. Have fun!
DataInputStreamLE.java
DataOutputStreamLE.java
I’ve been using these for a project and they seem to be working, but let me know if you find errors. In particular, I’ve not looked closely at the UTF encoding/decoding methods to see if there’s anything endian-oriented in there. I tried to clean it up a bit, but the javadoc may also be a bit hokey.
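For anyone who only needs the occasional little-endian value, the byte swapping itself can be sketched with nothing but the standard library. This is not the posted DataInputStreamLE code; the class name `EndianSketch` and its method names are made up for illustration. It reads the same four bytes as big endian (what DataInputStream does), then as little endian two ways: by reversing the bytes after the read, and by asking a ByteBuffer for little-endian order.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper class, for illustration only.
final class EndianSketch {

    // DataInputStream always reads multi-byte values as big endian
    // (most significant byte first, a.k.a. network byte order).
    static int readIntBE(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readInt();
        } catch (IOException e) {
            // Reading from an in-memory array should only fail if it is too short.
            throw new UncheckedIOException(e);
        }
    }

    // Little endian by swapping: read big endian, then reverse the four bytes.
    static int readIntLE(byte[] bytes) {
        return Integer.reverseBytes(readIntBE(bytes));
    }

    // Little endian via java.nio: set the buffer's byte order before reading.
    static int readIntLEWithBuffer(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        byte[] bytes = { 0x78, 0x56, 0x34, 0x12 };
        System.out.printf("big endian:    0x%08X%n", readIntBE(bytes));           // 0x78563412
        System.out.printf("little endian: 0x%08X%n", readIntLE(bytes));           // 0x12345678
        System.out.printf("via buffer:    0x%08X%n", readIntLEWithBuffer(bytes)); // 0x12345678
    }
}
```

The swap-after-read version is handy when the data is already arriving through a stream; the ByteBuffer version tends to be simpler when the whole file is in memory.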
(Update) Household historian Shannon on the origin of the terms:
The terms “big-endian” and “little-endian” come from Gulliver’s Travels by Jonathan Swift, published in England in 1726. Swift’s hero Gulliver finds himself in the midst of a war between the empire of Lilliput, where people break their eggs on the smaller end per a royal decree (Protestant England) and the empire of Blefuscu, which follows tradition and breaks their eggs on the larger end (Catholic France). Swift was satirizing Henry VIII’s 1534 decision to break with the Roman Catholic Church and create the Church of England, which threw England into centuries of both religious and political turmoil despite the fact that there was little doctrinal difference between the two religions.
The United Nations has just launched a new web site to house all their data for all you kids out there who wanna crush Hans Rosling. The availability of this sort of information has been a huge problem in the past (Hans’ talks are based on World Bank data that costs real money), and while the U.N. has been pretty good about making things available, a site whose sole purpose is to disseminate usable data is enormous.
Dominic Allemann has developed a Swiss version of the zipdecode example from chapter six of Visualizing Data. This is the whole point of the book—to actually try things out and adapt them in different ways and see what you can learn from it.
Switzerland makes an interesting example because it has far fewer postal codes than the U.S., though the dots are quite elegant on their own. With fewer data points, I’d be inclined to 1) change the size of the individual points to make them larger without making them overwhelming, 2) or work with the colors to make the contrast more striking (since changing the point size is likely to be too much), and 3) get the text into mixed case (in this example, Gossau SG instead of GOSSAU SG). Something as minor as avoiding ALL CAPS helps get us away from the representation looking too much like COMPUTERS and DATABASES, and instead into something meant for regular humans. Finally, 4) with the smaller (and far more regular) data set, it’s not clear if the zoom even helps; the piece could even be better off without it.
Thanks to Dominic for passing this along; it’s great to see!
I’m in the midst of rolling out a web site redesign. The former site (un)design was assembled just after finishing my Ph.D. I expected it to be bad enough to force myself to make a proper site. Three and a half years passed, with even friends who weren’t designers (including my future mother-in-law) taking exception. The redesign was done by my friend Eugene Kuo, who couldn’t deal with it any longer.
I’m currently building out the design and hooking up all the pages (including a handful of projects that weren’t linked before). The navigation at the top will slowly begin to work as this process continues. For instance, the “projects” link currently points to my old site, which is missing anything I’ve done in the past four years. The big images on the home page will soon be rotating through projects, while the new projects page will provide a better visual overview of what’s inside.
At any rate, thanks to Eugene and keep an eye out…
I’ve not had a chance to try these out with an actual project yet, but the Google Chart API seems to be a decent way to get Tufte® compliant chart images using simple web requests. Just pack the info for the chart’s appearance and data into a specially crafted URL and you’re set.
It’s a nice idea for a service, and I also appreciate that Google has kept it simple, rather than implementing it through a pile of obfuscated and strangely crafted embedded JavaScript (like, say, Google Maps or their newer search APIs after discontinuing the SOAP protocol).
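To give a sense of what “packing the info into a URL” looks like, here is a minimal sketch. The endpoint and the parameter names (`cht` for chart type, `chs` for size, `chd=t:` for text-encoded data in the 0 to 100 range) are written from memory of the API’s original documentation and should be treated as assumptions; the service has since been retired by Google, so this builds the URL but is not expected to fetch a live image. The class and method names are hypothetical.

```java
import java.util.Locale;
import java.util.StringJoiner;

// Hypothetical helper class, for illustration only.
final class ChartUrlSketch {

    // Assumed endpoint, as recalled from the original Chart API documentation.
    static final String BASE_URL = "http://chart.apis.google.com/chart";

    // Builds a line chart URL; values are assumed to already be scaled to 0..100.
    static String lineChartUrl(int width, int height, double[] values) {
        StringJoiner data = new StringJoiner(",");
        for (double value : values) {
            // Locale.ROOT keeps the decimal separator a period on every machine.
            data.add(String.format(Locale.ROOT, "%.1f", value));
        }
        return BASE_URL
            + "?cht=lc"                        // chart type: line chart (assumed code)
            + "&chs=" + width + "x" + height   // chart size in pixels
            + "&chd=t:" + data;                // data, text encoding
    }

    public static void main(String[] args) {
        System.out.println(lineChartUrl(250, 100, new double[] { 10, 40, 25, 80 }));
    }
}
```

The resulting string would be dropped into the `src` of an image tag, which is the whole appeal: no client library, just a URL.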
Given the number of data points provided, it would be difficult to refute the findings depicted in this chart.
Related work can be found here and here. Later research findings (by latecomers who foolishly claim to have invented the approach) can be found here and here.
Thanks to Raelynn Miles for the original link.
Beautiful info graphic from a September 2007 article about the restoration of the Guggenheim, depicting the cracks in the concrete walls. From the image:
Since the Guggenheim Museum opened in 1959, Frank Lloyd Wright’s massive spiral facade has been showing signs of cracking, mainly from seasonal temperature fluctuations that cause the concrete walls, built without expansion joints, to contract and expand.
The image is partly striking for the contrast between the NYT-style geometric graphic and pale colors mixed with the organic shape of the cracks. Wonderful.

Sent from one of my former students at CMU (you know who you are, drop me a line if it was you…I’ve lost the original message!)