Writing

So much for “wonderfully simple”

In contrast to the clarity and simplicity of the New York Times info graphic mentioned yesterday, the example currently on their home page is an example of the opposite:

This is helpful because it clarifies the point I tried to make about what was nice about the other graphic. Because of space limitations, this graphic is small, and the information is stored across multiple panels. So at the top there are a pair of tabs. Then within the tabs we have a pair of buttons. Two tabs, four buttons, just to get through four possible pieces of data. That’s the sort of combinatoric magic we see in Microsoft Windows preference panels:

snap1.gif

While the organization in the info graphic makes conceptual sense—first you must choose one of two states, then choose one of the candidates—it makes little cognitive sense. We’re choosing between one of four options. Just give them to us! For a pair of items beneath another pair of items, there’s no need to establish a sense of hierarchy. If there were a half dozen states, and a half dozen candidates, then that might make sense. Just because the data is technically hierarchic, or arranged in a tree, that doesn’t mean that it’s the best representation for it.

The solution? Just give us the four options. No sliding panels, trap doors, etc. Better yet, superimpose the Clinton and Obama data on a single map as different colors, and have a pair of buttons (not tabs!) that let the viewer quickly swap between Indiana and North Carolina.

(This only covers the interaction model, without getting into the way the data itself is presented, colors chosen, laid out, etc. The lack of population density information in the image makes the maps themselves nearly worthless.)

Tuesday, May 6, 2008 | infographics, interact, politics  

Average Distance to the Nearest Road in the Conterminous United States

Got an email over the weekend from Tom Vanderbilt, who had seen the All Streets piece, and was kind enough to point me to this map (PDF) from the USGS that depicts the average distance to the nearest road across the continental 48 states. (He’s currently working on a book titled Traffic: Why We Drive the Way We Do (and What It Says About Us) to be released this fall).

And too bad I just learned the word conterminous, but had I used that in the original project description, we would have missed (or been spared) the Metafilter discussion of whether “lower 48” was accurate terminology.

roadproximity2.jpg

A really interesting map, which of course also shows the difference between something thrown together in a few hours and actual research. In digging around for the map’s source, I found that exactly a year go, they also published a paper in Science describing their broader work:

Roads encroaching into undeveloped areas generally degrade ecological and watershed conditions and simultaneously provide access to natural resources, land parcels for development, and recreation. A metric of roadless space is needed for monitoring the balance between these ecological costs and societal benefits. We introduce a metric, roadless volume (RV), which is derived from the calculated distance to the nearest road. RV is useful and integrable over scales ranging from local to national. The 2.1 million cubic kilometers of RV in the conterminous United States are distributed with extreme inhomogeneity among its counties.

The publication even includes a response and a response to the response—high scientific drama! Apparently some lads feel that “roadless volume does not explicitly address ecological processes.” So let that be a warning to all you non-explicit addressers.

For those lucky to have access to the journal online, the supplementary information includes a time lapse video of a section of Colorado, and its roadless volume since 1937. As with all things, it’s much more interesting to see how this changes over time. A map of all streets in the lower 48 isn’t nearly as engaging as a sequence of the same area over several years. The latter story is simply far more compelling.

Tuesday, May 6, 2008 | allstreets, feedbag, mapping  

Unicode, character encodings, and the declining dominance of Western European character sets

Computers know nothing but numbers. As humans we have varying levels of skill in using numbers, but most of the time we’re communicating with words and phrases. So in the early days of computing, the earliest software developers had to find a way to map each character—a letter Q, the character #, or maybe a lowercase b—into a number. A table of characters would be made, usually either 128 or 256 of them, depending on whether data was stored or transmitted using 7 or 8 bits. Often the data would be stored as 7 bits, so that the eighth bit could be used as a parity bit, a simple method of error correction (because data transmission—we’re talking modems and serial ports here—was so error prone).

Early on, such encoding systems were designed in isolation, which meant that they were rarely compatible with one another. The number 34 in one character set might be assigned to “b”, while in another character set, assigned to “%”. You can imagine how that works out over an entire message, but the hilarity was lost on people trying to get their work done.

In the 1960s, the American National Standards Institute (or ANSI) came along and set up a proper standard, called ASCII, that could be shared amongst computers. It was 7 bits (to allow for the parity bit) and looked like:

  0 nul    1 soh    2 stx    3 etx    4 eot    5 enq    6 ack    7 bel
  8 bs     9 ht    10 nl    11 vt    12 np    13 cr    14 so    15 si
 16 dle   17 dc1   18 dc2   19 dc3   20 dc4   21 nak   22 syn   23 etb
 24 can   25 em    26 sub   27 esc   28 fs    29 gs    30 rs    31 us
 32 sp    33  !    34  "    35  #    36  $    37  %    38  &    39  '
 40  (    41  )    42  *    43  +    44  ,    45  -    46  .    47  /
 48  0    49  1    50  2    51  3    52  4    53  5    54  6    55  7
 56  8    57  9    58  :    59  ;    60  <    61  =    62  >    63  ?
 64  @    65  A    66  B    67  C    68  D    69  E    70  F    71  G
 72  H    73  I    74  J    75  K    76  L    77  M    78  N    79  O
 80  P    81  Q    82  R    83  S    84  T    85  U    86  V    87  W
 88  X    89  Y    90  Z    91  [    92  \    93  ]    94  ^    95  _
 96  `    97  a    98  b    99  c   100  d   101  e   102  f   103  g
104  h   105  i   106  j   107  k   108  l   109  m   110  n   111  o
112  p   113  q   114  r   115  s   116  t   117  u   118  v   119  w
120  x   121  y   122  z   123  {   124  |   125  }   126  ~   127 del

The lower numbers are various control codes, and the characters 32 (space) through 126 are actual printed characters. An eagle-eyed or non-Western reader will note that there are no umlauts, cedillas, or Kanji characters in that set. (You’ll note that this is the American National Standards Institute, after all. And to be fair, those were things well outside their charge.) So while the immediate character encoding problem of the 1960s was solved for Westerners, other languages would still have their own encoding systems.

As time rolled on, the parity bit became less of an issue, and people were antsy to add more characters. Getting rid of the parity bit meant 8 bits instead of 7, which would double the number of available characters. Other encoding systems like ISO-8859-1 (also called Latin-1) were developed. These had better coverage for Western European languages, by adding some umlauts we’d all been missing. The encodings kept the first 0–127 characters identical to ASCII, but defined characters numbered 128–255.

However this still remained a problem, even for Western languages, because if you were on a Windows machine, there was a different definition for characters 128–255 than there was on the Mac. Windows used what was called Windows 1252, which was just close enough to Latin-1 (embraced and extended, let’s say) to confuse everyone and make a mess. And because they like to think different, Apple used their own standard, called Mac Roman, which had yet another colorful ordering for characters 128–255.

This is why there are lots of web pages that will have squiggly marks or odd characters where em dashes or quotes should be found. If authors of web pages include a tag in the HTML that defines the character set (saying essentially “I saved this on a Western Mac!” or “I made this on a Norwegian Windows machine!”) then this problem is avoided, because it gives the browser a hint at what to expect in those characters with numbers from 128–255.

Those of you who haven’t fallen asleep yet may realize that even 200ish characters still won’t do—remember our Kanji friends? Such languages usually encode with two bytes (16 bits to the West’s measly 8), providing access to 65,536 characters. Of course, this creates even more issues because software must be designed to no longer think of characters as a single byte.

In the very early 90s, the industry heavies got together to form the Unicode consortium to sort out all this encoding mess once and for all. They describe their charge as:

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

They’ve produced a series of specifications, both for a wider character set (up to 4! bytes) and various methods for encoding these character sets. It’s truly amazing work. It means we can do things like have a font (such as the aptly named Arial Unicode) that defines tens of thousands of character shapes. The first of these (if I recall correctly) was Bitstream Cyberbit, which was about the coolest thing a font geek could get their hands on in 1998.

The most basic version of Unicode defines characters 0–65535, with the first 0–255 characters defined as identical to Latin-1 (for some modicum of compatibility with older systems).

One of the great things about the Unicode spec is the UTF-8 encoding. The idea behind UTF-8 is that the majority of characters will be in that standard ASCII set. So if the eighth bit of a character is a zero, then the other seven bits are just plain ASCII. If the eighth bit is 1, then it’s some sort of extended format. At which point the remaining bits determine how many additional characters (usually two) are required to encode the value for that character. It’s a very clever scheme because it degrades nicely, and provides a great deal of backward compatibility with the large number of systems still requiring only ASCII.

Of course, assuming that ASCII characters will be most predominant is to some repeating the same bias as back in the 1960s. But I think this is an academic complaint, and the benefits of the encoding far outweigh the negatives.

Anyhow, the purpose of this post was to write that Google reported yesterday that Unicode adoption on the web has passed ASCII and Western European. This doesn’t mean that English language characters have been passed up, but rather that the number of pages encoded using Unicode (usually in UTF-8 format), has finally left behind the archaic ASCII and Western European formats. The upshot is that it’s a sign of us leaving the dark ages—almost 20 years since the internet was made publicly available, and since the start of the Unicode consortium, we’re finally starting to take this stuff seriously.

The Processing book also has a bit of background on ASCII and Unicode in an Appendix, which includes more about character sets and how to work with them. And future editions of vida will also cover such matters in the Parse chapter.

Tuesday, May 6, 2008 | parse, unicode, updates, vida  

Another delegate calculator

Wonderfully simple delegate calculator from the New York Times. Addresses a far simpler question than the previously mentioned Slate calculator, but bless the NYT for realizing that something that complicated was no longer necessary.

delegate-5101.jpg

Good example of throwing out extraneous information to tell a story more directly: a quick left and right drag provides a more accurate depiction than the horse race currently in the headlines.

Monday, May 5, 2008 | election, politics, scenarios  

Doin’ stats for the C’s

A New York Times piece by the Freakonomics guys about Mike Zarren, the 32-year-old numbers guy for the Boston Celtics. While statistics has become more-or-less mainstream for baseball, the same isn’t quite true for basketball or football (though that’s changing too). They have better words for it than me:

This probably makes good sense for a sport like baseball, which is full of discrete events that are easily measured… Basketball, meanwhile, might seem too hectic and woolly for such rigorous dissection. It is far more collaborative than baseball and happens much faster, with players shifting from offense one moment to defense the next. (Hockey and football present their own challenges.)

But that’s not to say that something can be gained by looking at the numbers:

What’s the most efficient shot to take besides a layup? Easy, says Zarren: a three-pointer from the corner. What’s one of the most misused, misinterpreted statistics? “Turnovers are way more expensive than people think,” Zarren says. That’s because most teams focus on the points a defense scores from the turnover but don’t correctly value the offense’s opportunity cost — that is, the points it might have scored had the turnover not occurred.

Of course, the interesting thing about sports is that at their most basic, they cannot be defined by statistics or numbers. Take the Celtics, who just won the first round of the playoffs. Given their ability, the Celtics should have dispensed with the Hawks more quickly, rather than needing all seven games of the series to win the necessary four. The coach in the locker room of any Hoosiers ripoff will tell you it doesn’t matter what’s on the stat sheets, it matters who shows up that day. It’s the same reason that owners cannot buy a trophy even in a sport that has no salary cap. Or, if you’re like some of my in-laws-to-be (all Massachusetts natives), you might suspect that the fix is in (“How much money do those guys make per game?”) Regardless, it’s the human side of the sport, not the numbers, that make it worth watching. (And I don’t mean the soft-focus ESPN “Outside the Lines” version of the “human” side of the sport. Yech.)

In the meantime, maybe the Patriots or the Sox are hiring…

(Passed along by Andy Oram, my editor for vida)

Monday, May 5, 2008 | sports  

Flash file formats opened?

Via Slashdot, word that Adobe is opening the SWF and FLV file formats through the Open Screen Project. On first read this seemed great—Adobe essentially re-opening the SWF spec. It was released under a less onerous license by Macromedia ca. 1998, but then closed back up again once it became clear that the other vector graphics for the web proposals from Microsoft and others would not be an actual competitor. At the time, Microsoft had submitted a binary format called VML to the W3C, and the predecessor to SVG (called PGML) had also been proposed by then-rival Adobe and friends.

On second read it looks like they’re trying to kill Android before it has a chance to get rolling. So history rhymes ten years later. (Shannon informs me that this may qualify as a pantoum).

But to their credit (I’m shocked, actually), both specs are online already:

The SWF (Flash file format) specification

The FLV (Flash video file format) specification

….and more important, without any sort of click-through license. (“By clicking this button you pledge your allegiance to Adobe Systems and disavow your right to develop for products and platforms not controlled or approved by Adobe or its partners. The aforementioned transferral of rights also applies to your next of kin as well as your extended network of business partners and/or (at Adobe’s discretion) lunch dates.”)

I’ve never been nuts about using “open” as prefix for projects, especially as it relates to big companies hyping what do-gooders they are. It makes me think of the phrase “compassionate conservatism”. The fact that “compassionate” has to be added is more telling than anything else. They doth protest too much.

Thursday, May 1, 2008 | parse  

Design and the Elastic Mind

Perhaps three months late for an announcement, and at the risk of totally reckless narcissism, I should mention that four of my projects are currently on display in the Design and the Elastic Mind exhibition at the Museum of Modern Art in New York. My work notwithstanding, I hear that the show is generating lots of foot traffic and positive reviews, which is a well-deserved compliment to curator Paola Antonelli.

There’s a New York Times article and slide show (too much linking to the Times lately, weird…) and a writeup in the International Herald Tribune that even mentions my Humans vs. Chimps piece.

The first wall as you enter the show is all of Chromosome 18, done in the style of this piece.

chr18-elastic-510b.jpg

It’s a 3 pixel font at 150 dpi, so there are 37.5 letters per inch in either direction, and the wall is about 20 feet square, making 75 million letters total. Paola and her staff asked whether it was OK to put the text on the piece itself, which I felt was fine, as the nature of the piece is about scale, and the printing would not detract from that. The funny side effect of this was watching people at the opening take one another’s picture in front of the piece, mostly probably not realizing that the wall itself was part of the exhibition. Perhaps my most popular work so far, given the number of family photos in which it will be found.

Former classmate Ron Kurti also took a nice detail shot:

chr18-placard-kurti-510.jpg

Also in the show is the previously mentioned Humans vs. Chimps project as seen below:

chimp-510.jpg

This image is about three feet wide so you can read the letters accurately. It’s found next to an identically sized print of isometricblocks depicting the CFTR region of the human genome (the area implicated in connection to Cystic Fibrosis). The image was first developed for a Nature cover.

isometricblocks-510.jpg

Finally, the Pac-Man print of distellamap is printed floor to ceiling on another wall in the exhibition. Unfortunately there was a glitch in the printing that caused the lines connecting portions of the code to be lost (because they’re too thin to see at a distance), but no matter.

pacman-crop-510.jpg

Much moreso than my own work, however, by far the most exciting for me is the number of projects built with Processing that are in the show. It’s a bit humbling and the sort of thing that makes me excited (and relieved) to have some time this summer to devote to Processing itself.

Wednesday, April 30, 2008 | iloveme  

Google Underwater

So that might not be the awesome name that they’ll be using, but CNET is rumormongering about Google cooking up something oceanographic along the lines of Maps or Earth. Their speculation includes this lovely image from the Lamont-Doherty Earth Observatory (LDEO) of Columbia University.

underwatertiles_510.jpg

Unlike most people with a heartbeat, I didn’t find Google Maps particularly interesting on arrival. I was a fan of the simplicity of Yahoo Maps at the time (but no longer, eek!) and Microsoft’s Terraserver had done satellite imagery for a few years. But the same way that Google Mars shows us something we’re even less familiar with than satellite imagery of Earth, there’s something really exciting about possibility of seeing beneath the oceans.

Wednesday, April 30, 2008 | mapping, rumors, water  

Me blog big linky

Kottke and Freakonomics were kind enough to link over here, which has brought more queries about salaryper. Rather than piling onto the original web page, I’ll add updates to this section of the site.

I didn’t include the project’s back story with the 2008 version of the piece, so here goes:

Some background for people who don’t watch/follow/care about baseball:

When I first created this piece in 2005, the Yankees had a particularly bad year, with a team full of aging all-stars and owner George Steinbrenner hoping that a World Series trophy could be purchased for $208 million. The World Champion Red Sox did an ample job of defending their title, but as the second highest paid team in baseball, they’re not exactly young upstarts. The Chicago White Sox had an excellent year with just one third the salary of the Yankees, while the Cardinals are performing roughly on par with what they’re paid. Interestingly, the White Sox went on to win the World Series. The performance of Oakland, which previous years has far exceeded their overall salary, was a story, largely about their General Manager Billy Beane, told in the book Moneyball.

Some background for people who do watch/follow/care about baseball:

I neglected to include a caveat on the original page that this is a really simplistic view of salary vs. performance. I created this piece because the World Series victory of my beloved Red Sox was somewhat bittersweet in the sense that the second highest paid team in baseball finally managed to win a championship. This fact made me curious about how that works across the league, with raw salaries and the general performance of the individual teams.

There are lots of proportional things that can be done too—the salaries especially exist across a wide range (the Yankees waaaay out in front, followed the another pack of big market teams, then everyone else).

There are far more complex things about how contracts work over multiple years, how the farm system works, and scoring methods for individual players that could be taken into consideration.

This piece was thrown together while watching a game, so it’s perhaps dangerously un-advanced, given the amount of time and energy that’s put into the analysis (and argument) of sports statistics.

That last point is really important… This is fun! I encourage people to try out their own methods of playing with the data. For those who need a guide on building such a beast, the book has all the explanation and all the code (which isn’t much). And if you adapt the code, drop me a line so I can link to your example.

I have a handful of things I’d like to try (such as a proper method for doing proportional spacing at the sides without overdoing it), though the whole point of the project is to strip away as much as possible, and make a straightforward statement about salaries, so I haven’t bothered coming back to it since it succeeds in that original intent.

Wednesday, April 30, 2008 | salaryper, updates, vida  

Updated Salary vs. Performance for 2008

It’s April again, which means that there are messages lurking in my inbox asking about the whereabouts of this year’s Salary vs. Performance project (found in Chapter 5 of the good book). I got around to updating it a few days ago, which means now my inbox has changed to suggestions on how the piece might be improved. (It’s tempting to say, “Hey! Check out the book and the code, you can do anything you’d like with it! It’s more fun that way.” but that’s not really what they’re looking for.)

One of the best messages I’ve received so far is from someone who I strongly suspect is a statistician, who was wishing to see a scatter plot of the data rather than its current representation. Who else would be pining for a scatterplot? There are lots of jokes about the statistically inclined that might cover this situation, but… we’re much too high minded to let things devolve to that (actually, it’s more of a pot-kettle-black situation). If prompted, statisticians usually tell better jokes about themselves anyways.

At any rate, as it’s relevant to the issue of how you choose representations, my response follows:

Sadly, the scatter plot of the same data is actually kinda uninformative, since one of your axes (salary) is more or less fixed all season (might change at the trade deadline, but more or less stays fixed) and it’s just the averages that move about. So in fact if we’re looking for more “accurate”, a time series is gonna be better for our purposes. In an actual analytic piece, for instance, I’d do something very different (which would include multiple years, more detail about the salaries and how they amortize over time, etc).

But even so, making the piece more “correct” misses the intentional simplifications found in it, e.g. it doesn’t matter whether a baseball team was 5% away from winning, it only matters whether they’ve won. At the end of the day, it’s all about the specific rankings, who gets into the playoffs, and who wins those final games. Since the piece isn’t intended as an analytical tool, but something that conveys the idea of salary vs. performance to an audience who by and large cares little about 1) baseball and 2) stats. That’s not to say that it’s about making something zoomy and pretty (and irrelevant), but rather, how do you engage people with the data in a way that teaches them something in the end and gets them thinking about it.

Now to get back to my inbox and the guy who would rather have the data sonified since he thinks this visual thing is just a fad.

Tuesday, April 29, 2008 | examples, represent, salaryper  

All Streets Error Messages

Some favorite error messages while working on the All Streets project (mentioned below). I was initially hoping to use Illustrator to open the generated PDF files (generated from Processing), but Venus informed me that it was not to be:

illustrator-sucks-balls.png

I’m having difficulties as well. Why did I pay for this software?

Generally, Photoshop is far better engineered so I was hoping that it would be able to rasterize the PDF file instead, never mind the vectors and all.

photoshops-own-balls.png

Oh come on… Just admit that you ran out of memory and can’t deal. Meanwhile, Eugene was helping out with the site, from the other end of iChat:

aim-error-none.png

Oh well.

Sunday, April 27, 2008 | allstreets, software  

The Advantages of Closing a Few Doors

From the New York Times, a piece about Predictably Irrational from Dan Ariely. I’m somewhat fascinated by the idea of our general preoccupation with holding on to things, particularly as it relates to retaining data (see previous posts referencing Facebook, Google, etc.)

Our natural tendency is to keep everything, in spite of the consequences. Storage capacity in the digital realm is only getting larger and cheaper (as its size in the physical realm continues to get smaller), which only seeks to feed off this tendency further. Perhaps this is also why more individuals don’t question Google claiming a right to keep messages from their Gmail account after the messages, or even the account, have been deleted.

Ariely’s book describes a set of experiments performed at M.I.T.:

[Students] played a computer game that paid real cash to look for money behind three doors on the screen… After they opened a door by clicking on it, each subsequent click earned a little money, with the sum varying each time.

As each player went through the 100 allotted clicks, he could switch rooms to search for higher payoffs, but each switch used up a click to open the new door. The best strategy was to quickly check out the three rooms and settle in the one with the highest rewards.

Even after students got the hang of the game by practicing it, they were flummoxed when a new visual feature was introduced. If they stayed out of any room, its door would start shrinking and eventually disappear.

They should have ignored those disappearing doors, but the students couldn’t. They wasted so many clicks rushing back to reopen doors that their earnings dropped 15 percent. Even when the penalties for switching grew stiffer — besides losing a click, the players had to pay a cash fee — the students kept losing money by frantically keeping all their doors open.

(Emphasis mine.) I originally came across the article via Mark Hurst, who adds:

I’ve said for a long time that the solution to information overload is to let the bits go: always look for ways to delete, defer, or otherwise avoid bits, so that the few that remain are more relevant and easier to handle. This is the core philosophy of Bit Literacy.

Put another way, do we need to take more personal responsibility for subjecting ourselves to the “information overload” that people so happily buzzword about? Is complaining about the overload really an issue of not doing enough spring cleaning at home?

Sunday, April 27, 2008 | retention  

Restroom information graphics

bacon-510.jpg

I like neither bacon nor these machines, so I wish they would always provide this helpful explanation (or warning).

Friday, April 25, 2008 | infographics  

The Earth at night

Via mailing list, Oswald Berthold passes along images and a short article of the Earth from space as compiled by NASA, highlighting city lights in particular.

Tokyo Bay

The collection is an update to the Earth Lights image developed a few years ago (and which made its way ’round the interwebs at the time).

For the more technical, a presentation from the NOAA titled Low Light Imaging of the Earth at Night provides greater detail about the methods used to produce such images. Also includes a couple interesting historical examples (such as the first image they created) as well as comparisons of city growth over time based on changes in the data.

Of course many conclusions can be drawn from seeing map data such as this. Look at the difference between North and South Korea, for instance (original image from globalsecurity.org).

North and South Korea by night

Apparently this is a favorite of former U.S. Secretary of Defense Donald Rumsfeld:

Mr Rumsfeld showed the picture to illustrate how backward the northern regime really is - and how oppressed its people are. Without electricity there can be none of the appliances that make life easy and that we take for granted, he said.

“Except for my wife and family, that is my favourite photo,” said Mr Rumsfeld.

“It says it all. There’s the south, the same people as the north, the same resources north and south, and the big difference is in the south it’s a free political system and a free economic system.

I’ve vowed to myself not to make this page be about politics so I won’t get into the fatuous arguments of a warmonger (oops), but I think the fascinating thing is that

  1. This image, this “information graphic,” would be of such great importance to a person that he would see fit to even mention it in reference to photos of his wife and children. This is a strong statement for any image, even if he is being dramatic.
  2. The use of images to make or score political points. There’s some great stuff buried in recent Congressional testimony about the Iraq War, for instance, that I want to get to soon.

In regards to #1, I’m trying to think of other images to which people maintain such a personal relationship (particularly those whose job is not info graphics—Tufte’s preoccupation with Napoleon’s March doesn’t count.)

As for #2, hopefully we’ll get to that a bit later.

Friday, April 25, 2008 | mapping, physical, politics  

All Streets

all streetsNew work, now posted. All of the streets in the lower 48 United States: an image of 26 million individual road segments. This began as an example I created for one of my students in the fall of 2006, and I just recently got a chance to document it properly.

Nothing particularly genius about this piece—it’s mostly just a matter of collecting the data and creating the image. But it’s one of those cases where even in a (relatively) raw format, the data itself is quite striking.

The data in this piece comes from the U.S. Census Bureau’s TIGER/Line data files. The data is first parsed and filtered (to remove non-street features) using Perl. Next, using Processing, the latitude and longitude coordinates are transformed using an Albers equal-area conic projection (which gives it that curvy surface-of-the-Earth look that we’re used to), and then plotted to an enormous image that’s saved to the disk. The steps are similar to the preprocessing stages described in Chapter 6 of Visualizing Data.

I had originally hoped to use this piece to show patterns in street naming, but I didn’t manage to find as much as I had hoped. For instance, names of local trees and flowers being tied to the local geographic regions where they’re found. However, cookie cutter suburban neighborhood developments seem to have obliterated any causation. “Magnolia” is such a nice sounding, outdoorsy word; who wouldn’t want it adorning their street corner? Local flora be damned.

There are, however, a few other interesting tidbits in the data that I hope to cover in a future project. Real work be damned.

Friday, April 25, 2008 | allstreets  
Older Posts »
Book

Visualizing Data Book CoverVisualizing Data is my book about computational information design. It covers the path from raw data to how we understand it, detailing how to begin with a set of numbers and produce images or software that lets you view and interact with information. Unlike nearly all books in this field, it is a hands-on guide intended for people who want to learn how to actually build a data visualization.

The text was published by O’Reilly in December 2007 and can be found at Amazon and elsewhere. People who have purchased the book can find the examples here.

The book covers ideas found in my Ph.D. dissertation, which is basis for Chapter 1. The next chapter is an extremely brief introduction to Processing, which is used for the examples. but applies them to a series of examples, first starting with a simple mapping project (Chapter 3) to place data points on a map of the United States. Of course, the idea is not that lots of people want to visualize data for each of 50 states. Instead, it’s a jumping off point for learning how to lay out data spatially.

The chapters that follow cover six more projects, such as salary vs. performance (Chapter 5), zipdecode (Chapter 6), followed by more advanced topics dealing with trees, treemaps, hierarchies, and recursion (Chapter 7), plus graphs and networks (Chapter 8).

This site will be used for follow-up code and writing about related topics.