Writing

Melting Ants for Science (or, Solenopsis invicta as Dross)

Another visualization from the see-through fish category, a segment from Sunday Morning about Dr. Walter Tschinkel who studies the structure of ant colonies using aluminum casts. Three easy steps: Heat aluminum to 1200 degrees, pour it down an ant hole, and dig away carefully to reveal the intricate structure of the interior:

What amazing structures! Whenever you think you’ve made something that looks “good,” you can count on nature to dole out humility. Maybe killing the ants in the process is a little way to get the control back. Um, or something.

(Pardon the crappy video quality and annoying ad… Tried to tape the real version from my cable box, but @#$%*! Comcast has CBS marked as a 5c protected “premium” channel. Riiiight.)

Thursday, May 29, 2008 | physical, science  

Summerschool in Wiesbaden

sv-summerschool.jpg

Scholz & Volkmer is running a Summerschool program this July and is looking for eight students from the USA and Europe. (Since “summer school” is one word, you may have already guessed that it’s based in Germany.) This is the group behind the SEE Conference that I spoke at in April. (Great conference, and the lectures are online, check ‘em out.)

The program is run by their Technical Director (Peter), who is a great guy. They’re looking for topics like data visualization, mobile applications, interaction concepts, etc. and are covering flight and accommodations plus a small stipend during your four-week stay. Should be a great time.

Tuesday, May 27, 2008 | opportunities  

Schneier, Terrorists and Accuracy

Some thoughtful comments passed along by Alex Hutton regarding the last post:

Part of the problem with point technology solutions is in the policies of implementation.  IMHO, we undervalue the subject matter expert, or operate as a denigrated bureaucracy which does not allow the subject matter expert the flexibility to make decisions.  When that happens, the decision is left to technology (and as you point out, no technology is a perfect decision maker).

I thought it was apropos that you brought in the Schneier example.  I’ve been very much involved in a parallel thought process in the same industry as he, and we (my partner and I) are coming to a solution that attempts to balance technology, point human decision, and the bureaucracy within which they operate.

If you believe the Bayesians, then the right Bayesian network mimics the way the brain processes qualitative information to create a belief (or in the terms of Bayesians, a probability statement used to make a decision).  As such, the current way we use the technology (that policy of implementation, above) is faulty because it minimizes that “Human Computational Engine” for a relatively unsophisticated, unthinking technology.  That’s not to say that technologies like facial recognition are worthless - computational engines, even less magic ones that aren’t 99.99% accurate, are valid pieces of prior information (data).

Now in the same way, Human Computational Engines are also less than perfectly accurate.  In fact, they are not at all guaranteed to work the same way twice - even by the same person unless that person is using a framework to provide rigor, rationality, and consistency in analysis.

So ideally, in physical security (or information security where Schneier and I come from) the imperfect computer detection engine is combined with a good Bayesian network and well trained/educated/experienced subject matter experts to create a more accurate probability statement around terrorist/non-terrorist - one that at least is better at identifying cases where more information is needed before a person is prevented from flying, searched and detained.  While this method, too, would not be 100% infallible (no solution will ever be), it would create a more accurate means of detection by utilizing the best of the human computational engine.

I believe the Bayesians, just 99.99% of the time.

Thursday, May 15, 2008 | bayesian, feedbag, mine, security  

Human Computation (or “Mechanical Turk” meets “Family Feud”)

Computers are really good at repetitive work. You can ask a computer to multiply two numbers together seven billion times and not only will it not complain, it’ll probably have seven billion answers for you a few seconds later. Ask a person to do the same thing and they’ll either walk away at the outset, realizing the ridiculousness of the task, or they’ll get through the first few tries and lose interest. But even the fact that a human can recognize the ridiculousness of the task is important. Humans are good at lots of things—like identifying a face in a crowd—that cannot be addressed by computation with the same level of accuracy.

Visualization is about the interface between what humans are good at, and what computers are good at. First, the computer can crunch all seven billion numbers, then present the results in a way that we can use our own perceptual skills to identify what’s important or interesting. (This is also why the design of a visualization is a fundamentally human task, and not something to be left to automation.)

This is also the subject of Luis von Ahn’s work at Carnegie Mellon. You’re probably familiar with CAPTCHA images—usually wavy numbers and letters that you have to discern when signing up for a webmail account or buying tickets from Ticketmaster. The acronym stands for “Completely Automated Public Turing Test to Tell Computers and Humans Apart,” a clever mouthful referring to Alan Turing’s work in discerning man or machine. (I encourage you to read about them, but this is already getting long so I won’t get into it here.)

More interesting than CAPTCHA, however, is the whole notion that’s behind it: that it’s an example of relying on humans to do what they’re best at, though it’s a task that’s difficult for computers. (Sure, in recent weeks, people have actually found ways to “break” CAPTCHAs in specific cases, but that’s not important here.) For instance, the work was extended to the Google Image Labeler, described as follows:

You’ll be randomly paired with a partner who’s online and using the feature. Over a two-minute period, you and your partner will:

  • View the same set of images.
  • Provide as many labels as possible to describe each image you see.
  • Receive points when your label matches your partner’s label. The number of points will depend on how specific your label is.
  • See more images until time runs out.

Prior to this, most image labeling systems had to do with getting volunteers to name or tag images individually. As you can imagine, the quality of tags suffers considerably because of everything from differences in how people perceive or describe what they see, to individuals who try to be a little too clever in choosing tags. With the Image Labeler game, that’s turned around backwards, where there is a motivation to use tags that match the other person, thus minimizing the previous problems. (It’s “Mechanical Turk” meets “Family Feud”.) They’ve also applied the same ideas to scanning books—where fragments of text that cannot be recognized by software are instead checked by multiple people.

More recently, von Ahn’s group has expanded these ideas in Games With A Purpose, a site that addresses these “casual games” more directly. The new site is covered in this New Scientist article, which offers additional tidbits (perspective? background? couldn’t think of the right word).

You can also watch Luis’ Google Tech Talk about Human Computation, which if I’m not mistaken, led to the Image Labeler project.

(We met Luis a couple times while at CMU and watched the Super Bowl with his awesome fiancée Laura, cheering on her hometown Chicago Bears against those villainous Colts. We were happy when he received a MacArthur Fellowship for his work—he’s just the sort of person you’d like to see get such an award, which highlights people who often don’t quite fit in their field.)

Returning to the earlier argument, algorithms to identify a face in a crowd are certainly improving. But without a significant breakthrough, their usefulness will be significantly limited. One commonly hyped use for such systems is airport security. Bruce Schneier explains the problem:

Suppose this magically effective face-recognition software is 99.99 percent accurate. That is, if someone is a terrorist, there is a 99.99 percent chance that the software indicates “terrorist,” and if someone is not a terrorist, there is a 99.99 percent chance that the software indicates “non-terrorist.” Assume that one in ten million flyers, on average, is a terrorist. Is the software any good?

No. The software will generate 1000 false alarms for every one real terrorist. And every false alarm still means that all the security people go through all of their security procedures. Because the population of non-terrorists is so much larger than the number of terrorists, the test is useless. This result is counterintuitive and surprising, but it is correct. The false alarms in this kind of system render it mostly useless. It’s “The Boy Who Cried Wolf” increased 1000-fold.

Given the number of travelers at Boston Logan in 2006, that would be two “terrorists” identified per day. (And with Schneier’s one in ten million is a terrorist figure, that would be two or three terrorists per year…clearly too generous, which makes the face detection accuracy even worse than how he describes it.) I find myself thinking about the 99.99% accuracy number as I stare at the back of heads lined up at the airport security checkpoint—itself a human problem, not a computational problem.
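The arithmetic behind Schneier's "1000 false alarms" is easy to check. Here is a minimal sketch in Python that uses only the two numbers from the quoted passage (99.99 percent accuracy, one terrorist per ten million flyers). The variable names and the choice of a ten-million-flyer population are mine, picked so that exactly one real terrorist is expected.

```python
# Base-rate check for the quoted face-recognition example.
# The two inputs come from the quoted passage; everything else is illustrative.
accuracy = 0.9999                  # chance the software labels any flyer correctly
terrorist_rate = 1 / 10_000_000    # one in ten million flyers

flyers = 10_000_000                # assumed population: one terrorist expected
terrorists = flyers * terrorist_rate
non_terrorists = flyers - terrorists

true_alarms = terrorists * accuracy               # real terrorists flagged
false_alarms = non_terrorists * (1 - accuracy)    # innocent flyers flagged

false_alarms_per_real = false_alarms / true_alarms
p_terrorist_given_alarm = true_alarms / (true_alarms + false_alarms)

print(round(false_alarms_per_real))        # about 1000
print(f"{p_terrorist_given_alarm:.4%}")    # roughly a tenth of one percent
```

In other words, even when the alarm goes off, the odds that the flagged person is actually a terrorist are about one in a thousand.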

Thursday, May 15, 2008 | cs, games, human, perception, security  

Gender and Information Graphics

Just received this in a message from a journalism grad student studying information graphics:

I have looked at 2 years worth of Glamour (and Harper’s Bazaar too) magazines for my project and it shows that Glamour and other women’s magazines have less amount of information graphics in the magazines compared to men’s magazines, such as GQ and Esquire. Why do you think that is? Do you think that is gender-related at all?

I hadn’t really thought about it much. For the record, my reply:

My fiancée (who knows a lot more about being female than I do) pointed out that such magazines have much less practical content in general, so it may have more to do with that than a specific gender thing. Though she also pointed out that, for instance, in today’s news about the earthquake in China, she felt that women might be more inclined to read a story with the faces of those affected than one with information graphics tallying or describing the same.

I think you’d need to find something closer to a male equivalent of Glamour so that you can cover your question and remove the significant bias you’re getting for the content. Though, uh, a male equivalent of Glamour may not really exist… But perhaps there are better options.

And as I was writing this, she responded:

Finding a male equivalent of Glamour is hard but they actually do have some hard-hitting stories near the back in every issue that sometimes might be overshadowed by all the fashion and beauty stuff. Actually, finding a female equivalent of GQ or Esquire is also hard because they sort of have a niche of their own too. I have to agree with your fiancée too, because, I studied Oprah’s magazines a little in my previous study and sometimes it is really about what appeals to their audience.

Well, my study does not imply causality and it sometimes might be hard to differentiate if the result was due to gender differences or content. So, it’s interesting to find all these out, and actually men’s magazines have about 5 times more information graphics than women’s magazines which is amazing.

Wow—five times more. (At least amongst the magazines that she mentioned.)

My hope in posting this (rather than just sharing the contents of my inbox…can you tell that I’m answering mail today?) is that someone else out there knows more about the subject. Please drop me a line if you do; I’d like to know more and to post a follow-up.

Monday, May 12, 2008 | gender, inbox, infographics  

Glagolitic Capital Letter Spidery Ha

A great Unicode in 5 Minutes presentation from Mark Lentczner at Linden Lab. He passed it along after reading this dense post, clearly concerned about the welfare of my readers.

(Searching out the image for the title of this post also led me to a collection of Favourite Unicode Codepoints. This seems ripe for someone to waste more time really tracking down such things and documenting them.)

Mark’s also behind Context Free, one of the “related initiatives” that we have listed on Processing.org.

Context Free is a program that generates images from written instructions called a grammar. The program follows the instructions in a few seconds to create images that can contain millions of shapes.

Grammars are covered briefly in the Parse chapter of vida, with the name of the language coming from a specific variety called Context Free Grammars. The magical (and manic) part of grammars is that their rules tend to be recursive and layered, which leads to a certain kind of insanity as you try to tease out how the rules work. With Context Free, Mark has instead turned this dizziness into the basis for creating visual form.
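To give a feel for how recursive rules unfold, here is a rough sketch in Python. It is not Context Free's own syntax or anything from the book. The toy grammar (balanced parentheses), the `expand` function, and the depth cap are all made up for illustration.

```python
import random

# A toy grammar: each symbol maps to a list of alternative expansions.
# "S" refers to itself, and the empty list is the rule that lets it stop.
GRAMMAR = {
    "S": [["(", "S", ")", "S"], []],
}

def expand(symbol, rng, depth=0, max_depth=6):
    """Recursively expand a symbol into a string of terminals."""
    if symbol not in GRAMMAR:
        return symbol                      # terminal: emit as-is
    options = GRAMMAR[symbol]
    if depth >= max_depth:
        choice = options[-1]               # force the non-recursive rule
    else:
        choice = rng.choice(options)       # otherwise pick a rule at random
    return "".join(expand(s, rng, depth + 1, max_depth) for s in choice)

rng = random.Random(2008)                  # fixed seed so runs are repeatable
sample = expand("S", rng)
print(repr(sample))
```

One rule that mentions itself twice is enough to produce nested, layered output, which is the dizziness described above. Swap the parentheses for shapes and transforms and you have the general idea behind drawing with a grammar.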

Updated 14 May 08 to fix the glyph. Thanks to Paul Oppenheim, Spidery Ha Devotee, for the correction.

Monday, May 12, 2008 | feedbag, languages, parse, unicode  

So much for “wonderfully simple”

In contrast to the clarity and simplicity of the New York Times info graphic mentioned yesterday, the example currently on their home page is an example of the opposite:

This is helpful because it clarifies the point I tried to make about what was nice about the other graphic. Because of space limitations, this graphic is small, and the information is stored across multiple panels. So at the top there are a pair of tabs. Then within the tabs we have a pair of buttons. Two tabs, four buttons, just to get through four possible pieces of data. That’s the sort of combinatoric magic we see in Microsoft Windows preference panels:

snap1.gif

While the organization in the info graphic makes conceptual sense—first you must choose one of two states, then choose one of the candidates—it makes little cognitive sense. We’re choosing between one of four options. Just give them to us! For a pair of items beneath another pair of items, there’s no need to establish a sense of hierarchy. If there were a half dozen states, and a half dozen candidates, then that might make sense. Just because the data is technically hierarchic, or arranged in a tree, that doesn’t mean that it’s the best representation for it.

The solution? Just give us the four options. No sliding panels, trap doors, etc. Better yet, superimpose the Clinton and Obama data on a single map as different colors, and have a pair of buttons (not tabs!) that let the viewer quickly swap between Indiana and North Carolina.

(This only covers the interaction model, without getting into the way the data itself is presented, colors chosen, laid out, etc. The lack of population density information in the image makes the maps themselves nearly worthless.)

Tuesday, May 6, 2008 | infographics, interact, politics  

Average Distance to the Nearest Road in the Conterminous United States

Got an email over the weekend from Tom Vanderbilt, who had seen the All Streets piece, and was kind enough to point me to this map (PDF) from the USGS that depicts the average distance to the nearest road across the continental 48 states. (He’s currently working on a book titled Traffic: Why We Drive the Way We Do (and What It Says About Us) to be released this fall.)

And too bad I just learned the word conterminous, but had I used that in the original project description, we would have missed (or been spared) the Metafilter discussion of whether “lower 48” was accurate terminology.

roadproximity2.jpg

A really interesting map, which of course also shows the difference between something thrown together in a few hours and actual research. In digging around for the map’s source, I found that exactly a year ago, they also published a paper in Science describing their broader work:

Roads encroaching into undeveloped areas generally degrade ecological and watershed conditions and simultaneously provide access to natural resources, land parcels for development, and recreation. A metric of roadless space is needed for monitoring the balance between these ecological costs and societal benefits. We introduce a metric, roadless volume (RV), which is derived from the calculated distance to the nearest road. RV is useful and integrable over scales ranging from local to national. The 2.1 million cubic kilometers of RV in the conterminous United States are distributed with extreme inhomogeneity among its counties.

The publication even includes a response and a response to the response—high scientific drama! Apparently some lads feel that “roadless volume does not explicitly address ecological processes.” So let that be a warning to all you non-explicit addressers.

For those lucky to have access to the journal online, the supplementary information includes a time lapse video of a section of Colorado, and its roadless volume since 1937. As with all things, it’s much more interesting to see how this changes over time. A map of all streets in the lower 48 isn’t nearly as engaging as a sequence of the same area over several years. The latter story is simply far more compelling.

Tuesday, May 6, 2008 | allstreets, feedbag, mapping  

Unicode, character encodings, and the declining dominance of Western European character sets

Computers know nothing but numbers. As humans we have varying levels of skill in using numbers, but most of the time we’re communicating with words and phrases. So in the early days of computing, the earliest software developers had to find a way to map each character—a letter Q, the character #, or maybe a lowercase b—into a number. A table of characters would be made, usually either 128 or 256 of them, depending on whether data was stored or transmitted using 7 or 8 bits. Often the data would be stored as 7 bits, so that the eighth bit could be used as a parity bit, a simple method of error detection (because data transmission—we’re talking modems and serial ports here—was so error prone).
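For the curious, here is a small sketch in Python of how a parity bit rides along with 7 bits of data. It assumes even parity (the eighth bit is set so that the total number of 1 bits is even). Real hardware could be configured either way, and the function names are mine.

```python
def add_even_parity(seven_bits: int) -> int:
    """Return an 8-bit value: 7 data bits plus an even-parity bit on top."""
    if not 0 <= seven_bits < 128:
        raise ValueError("expected a 7-bit value")
    ones = bin(seven_bits).count("1")
    parity = ones % 2                  # 1 when the data has an odd number of 1s
    return (parity << 7) | seven_bits

def parity_ok(byte: int) -> bool:
    """True when the received byte still has an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0

sent = add_even_parity(ord("Q"))       # Q is 81, binary 1010001 (three 1 bits)
one_flip = sent ^ 0b00000100           # line noise flips a single bit
two_flips = sent ^ 0b00000110          # line noise flips two bits

print(parity_ok(sent), parity_ok(one_flip), parity_ok(two_flips))
```

A single flipped bit gets caught, but the check cannot say which bit was wrong, and a pair of flips slips through unnoticed. That is about all one spare bit can buy you.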

Early on, such encoding systems were designed in isolation, which meant that they were rarely compatible with one another. The number 34 in one character set might be assigned to “b”, while in another character set, assigned to “%”. You can imagine how that works out over an entire message, but the hilarity was lost on people trying to get their work done.

In the 1960s, the American National Standards Institute (or ANSI) came along and set up a proper standard, called ASCII, that could be shared amongst computers. It was 7 bits (to allow for the parity bit) and looked like:

  0 nul    1 soh    2 stx    3 etx    4 eot    5 enq    6 ack    7 bel
  8 bs     9 ht    10 nl    11 vt    12 np    13 cr    14 so    15 si
 16 dle   17 dc1   18 dc2   19 dc3   20 dc4   21 nak   22 syn   23 etb
 24 can   25 em    26 sub   27 esc   28 fs    29 gs    30 rs    31 us
 32 sp    33  !    34  "    35  #    36  $    37  %    38  &    39  '
 40  (    41  )    42  *    43  +    44  ,    45  -    46  .    47  /
 48  0    49  1    50  2    51  3    52  4    53  5    54  6    55  7
 56  8    57  9    58  :    59  ;    60  <    61  =    62  >    63  ?
 64  @    65  A    66  B    67  C    68  D    69  E    70  F    71  G
 72  H    73  I    74  J    75  K    76  L    77  M    78  N    79  O
 80  P    81  Q    82  R    83  S    84  T    85  U    86  V    87  W
 88  X    89  Y    90  Z    91  [    92  \    93  ]    94  ^    95  _
 96  `    97  a    98  b    99  c   100  d   101  e   102  f   103  g
104  h   105  i   106  j   107  k   108  l   109  m   110  n   111  o
112  p   113  q   114  r   115  s   116  t   117  u   118  v   119  w
120  x   121  y   122  z   123  {   124  |   125  }   126  ~   127 del

The lower numbers are various control codes, and the characters 32 (space) through 126 are actual printed characters. An eagle-eyed or non-Western reader will note that there are no umlauts, cedillas, or Kanji characters in that set. (You’ll note that this is the American National Standards Institute, after all. And to be fair, those were things well outside their charge.) So while the immediate character encoding problem of the 1960s was solved for Westerners, other languages would still have their own encoding systems.
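The table is easy to poke at from most languages. A quick sketch in Python, using only the built-in `ord`, `chr`, and `str.encode` (the helper name `fits_in_ascii` is mine):

```python
# Look up the numbers for the three characters mentioned earlier.
codes = {ch: ord(ch) for ch in "Q#b"}
print(codes)

# Characters 32 (space) through 126 (~) are the printable ones.
printable = "".join(chr(n) for n in range(32, 127))
print(len(printable), repr(printable[:16]))

def fits_in_ascii(text: str) -> bool:
    """True when every character in text has a slot in the 7-bit table."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

print(fits_in_ascii("umlaut"), fits_in_ascii("\u00fcmlaut"))
```

The last line is the whole problem in miniature: the plain word fits, and the version that starts with u-umlaut does not.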

As time rolled on, the parity bit became less of an issue, and people were antsy to add more characters. Getting rid of the parity bit meant 8 bits instead of 7, which would double the number of available characters. Other encoding systems like ISO-8859-1 (also called Latin-1) were developed. These had better coverage for Western European languages, by adding some umlauts we’d all been missing. The encodings kept the first 0–127 characters identical to ASCII, but defined characters numbered 128–255.

However, this still remained a problem, even for Western languages, because if you were on a Windows machine, there was a different definition for characters 128–255 than there was on the Mac. Windows used what was called Windows 1252, which was just close enough to Latin-1 (embraced and extended, let’s say) to confuse everyone and make a mess. And because they like to think different, Apple used their own standard, called Mac Roman, which had yet another colorful ordering for characters 128–255.

This is why there are lots of web pages that will have squiggly marks or odd characters where em dashes or quotes should be found. If authors of web pages include a tag in the HTML that defines the character set (saying essentially “I saved this on a Western Mac!” or “I made this on a Norwegian Windows machine!”) then this problem is avoided, because it gives the browser a hint at what to expect in those characters with numbers from 128–255.
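To see the mess first hand, take a single byte above 127 and ask three encodings what it means. This sketch uses Python's built-in codecs (`cp1252`, `mac_roman`, and `latin-1` are the names Python uses for them). The byte value 0x97 is just one I picked because it is an em dash on Windows.

```python
raw = bytes([0x97])    # one byte, value 151

# Decode the same byte three ways and collect the results.
decoded = {name: raw.decode(name) for name in ("cp1252", "mac_roman", "latin-1")}

for name, text in decoded.items():
    print(f"{name:10} U+{ord(text):04X} {text!r}")
```

Windows 1252 reads the byte as an em dash (U+2014), Mac Roman reads it as an accented o (U+00F3), and Latin-1 reads it as an invisible control character (U+0097). Same byte, three answers, which is why a page without a declared character set turns into squiggles.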

Those of you who haven’t fallen asleep yet may realize that even 200ish characters still won’t do—remember our Kanji friends? Such languages usually encode with two bytes (16 bits to the West’s measly 8), providing access to 65,536 characters. Of course, this creates even more issues because software must be designed to no longer think of characters as a single byte.

In the very early 90s, the industry heavies got together to form the Unicode Consortium to sort out this encoding mess once and for all. They describe their charge as:

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

They’ve produced a series of specifications, both for a wider character set (up to 4! bytes) and various methods for encoding these character sets. It’s truly amazing work. It means we can do things like have a font (such as the aptly named Arial Unicode) that defines tens of thousands of character shapes. The first of these (if I recall correctly) was Bitstream Cyberbit, which was about the coolest thing a font geek could get their hands on in 1998.

The most basic version of Unicode defines characters 0–65535, with the first 0–255 characters defined as identical to Latin-1 (for some modicum of compatibility with older systems).
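The Latin-1 overlap is simple to confirm. A tiny sketch in Python, where `chr(n)` gives the Unicode character numbered `n` and the `latin-1` codec decodes a single byte (the variable names are mine):

```python
# Every one of the 256 Latin-1 byte values should decode to the
# Unicode character with the same number.
same_as_latin1 = all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))

# Two bytes (16 bits) give the 0-65535 range mentioned above.
two_byte_range = 2 ** 16

print(same_as_latin1, two_byte_range)
```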

One of the great things about the Unicode spec is the UTF-8 encoding. The idea behind UTF-8 is that the majority of characters will be in that standard ASCII set. So if the eighth bit of a byte is a zero, then the other seven bits are just plain ASCII. If the eighth bit is 1, then it’s some sort of extended format, at which point the remaining high bits of that first byte determine how many additional bytes (between one and three) are required to encode the value for that character. It’s a very clever scheme because it degrades nicely, and provides a great deal of backward compatibility with the large number of systems still requiring only ASCII.
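Here is a hand-rolled sketch of the scheme in Python, checked against the language's own encoder. It follows the current definition of UTF-8 (at most four bytes per character) and skips details like rejecting surrogate values, so treat it as an illustration rather than a replacement for `str.encode`. The function name `utf8_encode` is mine.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative, no surrogate check)."""
    if code_point < 0x80:                       # 0xxxxxxx: plain ASCII, one byte
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                  # 11110xxx plus three 10xxxxxx bytes
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("beyond the Unicode range")

# A plain letter, an accented letter, an em dash, and a character past 65535.
samples = [0x41, 0xE9, 0x2014, 0x1F600]
lengths = [len(utf8_encode(cp)) for cp in samples]
print(lengths)
print(utf8_encode(0xE9).hex())
```

The number of leading 1 bits in the first byte tells a reader how many bytes belong to the character, and every follow-on byte starts with the bits 10, so software can always find character boundaries even when it lands in the middle of a stream.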

Of course, assuming that ASCII characters will be most predominant is to some repeating the same bias as back in the 1960s. But I think this is an academic complaint, and the benefits of the encoding far outweigh the negatives.

Anyhow, the purpose of this post was to write that Google reported yesterday that Unicode adoption on the web has passed ASCII and Western European. This doesn’t mean that English language characters have been passed up, but rather that the number of pages encoded using Unicode (usually in UTF-8 format), has finally left behind the archaic ASCII and Western European formats. The upshot is that it’s a sign of us leaving the dark ages—almost 20 years since the internet was made publicly available, and since the start of the Unicode consortium, we’re finally starting to take this stuff seriously.

The Processing book also has a bit of background on ASCII and Unicode in an Appendix, which includes more about character sets and how to work with them. And future editions of vida will also cover such matters in the Parse chapter.

Tuesday, May 6, 2008 | parse, unicode, updates, vida  

Another delegate calculator

Wonderfully simple delegate calculator from the New York Times. Addresses a far simpler question than the previously mentioned Slate calculator, but bless the NYT for realizing that something that complicated was no longer necessary.

delegate-5101.jpg

Good example of throwing out extraneous information to tell a story more directly: a quick left and right drag provides a more accurate depiction than the horse race currently in the headlines.

Monday, May 5, 2008 | election, politics, scenarios  

Doin’ stats for the C’s

A New York Times piece by the Freakonomics guys about Mike Zarren, the 32-year-old numbers guy for the Boston Celtics. While statistics has become more-or-less mainstream for baseball, the same isn’t quite true for basketball or football (though that’s changing too). They have better words for it than me:

This probably makes good sense for a sport like baseball, which is full of discrete events that are easily measured… Basketball, meanwhile, might seem too hectic and woolly for such rigorous dissection. It is far more collaborative than baseball and happens much faster, with players shifting from offense one moment to defense the next. (Hockey and football present their own challenges.)

But that’s not to say that nothing can be gained by looking at the numbers:

What’s the most efficient shot to take besides a layup? Easy, says Zarren: a three-pointer from the corner. What’s one of the most misused, misinterpreted statistics? “Turnovers are way more expensive than people think,” Zarren says. That’s because most teams focus on the points a defense scores from the turnover but don’t correctly value the offense’s opportunity cost — that is, the points it might have scored had the turnover not occurred.

Of course, the interesting thing about sports is that at their most basic, they cannot be defined by statistics or numbers. Take the Celtics, who just won the first round of the playoffs. Given their ability, the Celtics should have dispensed with the Hawks more quickly, rather than needing all seven games of the series to win the necessary four. The coach in the locker room of any Hoosiers ripoff will tell you it doesn’t matter what’s on the stat sheets, it matters who shows up that day. It’s the same reason that owners cannot buy a trophy even in a sport that has no salary cap. Or, if you’re like some of my in-laws-to-be (all Massachusetts natives), you might suspect that the fix is in (“How much money do those guys make per game?”). Regardless, it’s the human side of the sport, not the numbers, that makes it worth watching. (And I don’t mean the soft-focus ESPN “Outside the Lines” version of the “human” side of the sport. Yech.)

In the meantime, maybe the Patriots or the Sox are hiring…

(Passed along by Andy Oram, my editor for vida)

Monday, May 5, 2008 | sports  

Flash file formats opened?

Via Slashdot, word that Adobe is opening the SWF and FLV file formats through the Open Screen Project. On first read this seemed great—Adobe essentially re-opening the SWF spec. It was released under a less onerous license by Macromedia ca. 1998, but then closed back up again once it became clear that the other proposals for vector graphics on the web, from Microsoft and others, would not be actual competitors. At the time, Microsoft had submitted an XML-based format called VML to the W3C, and the predecessor to SVG (called PGML) had also been proposed by then-rival Adobe and friends.

On second read it looks like they’re trying to kill Android before it has a chance to get rolling. So history rhymes ten years later. (Shannon informs me that this may qualify as a pantoum.)

But to their credit (I’m shocked, actually), both specs are online already:

The SWF (Flash file format) specification

The FLV (Flash video file format) specification

…and more important, without any sort of click-through license. (“By clicking this button you pledge your allegiance to Adobe Systems and disavow your right to develop for products and platforms not controlled or approved by Adobe or its partners. The aforementioned transferral of rights also applies to your next of kin as well as your extended network of business partners and/or (at Adobe’s discretion) lunch dates.”)

I’ve never been nuts about using “open” as a prefix for projects, especially as it relates to big companies hyping what do-gooders they are. It makes me think of the phrase “compassionate conservatism”. The fact that “compassionate” has to be added is more telling than anything else. They doth protest too much.

Thursday, May 1, 2008 | parse  

Book

Visualizing Data is my book about computational information design. It covers the path from raw data to how we understand it, detailing how to begin with a set of numbers and produce images or software that lets you view and interact with information. Unlike nearly all books in this field, it is a hands-on guide intended for people who want to learn how to actually build a data visualization.

The text was published by O’Reilly in December 2007 and can be found at Amazon and elsewhere. People who have purchased the book can find the examples here.

The book covers ideas found in my Ph.D. dissertation, which is the basis for Chapter 1, but applies them to a series of examples. The next chapter is an extremely brief introduction to Processing, which is used for the examples. The projects start with a simple mapping project (Chapter 3) to place data points on a map of the United States. Of course, the idea is not that lots of people want to visualize data for each of 50 states. Instead, it’s a jumping off point for learning how to lay out data spatially.

The chapters that follow cover six more projects, such as salary vs. performance (Chapter 5), zipdecode (Chapter 6), followed by more advanced topics dealing with trees, treemaps, hierarchies, and recursion (Chapter 7), plus graphs and networks (Chapter 8).

This site will be used for follow-up code and writing about related topics.