Writing

More NASA Observations Acquire Interest

Some additional followup from Robert Simmon regarding the previous post. I asked more about the “amateur Earth observers” and the intermediate data access. He writes:

The original idea was sparked from the success of amateur astronomers discovering comets. Of course amateur astronomy is mostly about making observations, but we (NASA) already have the observations: the question is what to do with them–which we really haven’t figured out. One approach is to make in-situ observations like aerosol optical thickness (haziness, essentially), weather measurements, cloud type, etc. and then correlate them with satellite data. Unfortunately, calibration issues make this data difficult to use scientifically. It is a good outreach tool, so we’re partnering with science museums, and the GLOBE program does this with schools.

We don’t really have a good sense yet of how to allow amateurs to make meaningful analyses: there’s a lot of background knowledge required to make sense of the data, and it’s important to understand the limitations of satellite data, even if the tools to extract and display it are available. There’s also the risk that quacks with and axe to grind will willfully abuse data to make a point, which is more significant for an issue like climate change than it is for the face on Mars, for example. That’s just a long way of saying that we don’t know yet, and we’d appreciate suggestions.

I’m more of a “face on Mars” guy myself. It’s unfortunate that the quacks even have to be considered, though not surprising from what I’ve seen online. Also worth checking out:

Are you familiar with Web Map Service (WMS)?
http://www.opengeospatial.org/standards/wms
It’s one of the ways we distribute & display our data, in addition to KML.

And one last followup:

Here’s another data source for NASA satellite data that’s a bit easier than the data gateway:
http://daac.gsfc.nasa.gov/techlab/giovanni/

and examples of classroom exercises using data, with some additional data sources folded in to each one:
http://serc.carleton.edu/eet/

The EET holds an “access data workshop” each year in late spring, you may be interested in attending next year.

And with regards to guidelines, Mark Baltzegar (of The Cyc Foundation) sent along this note:

Are you familiar with the ongoing work within the W3C’s Linking Open Data project? There is a vibrant community actively exposing and linking open data.
http://richard.cyganiak.de/2007/10/lod/
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

More to read and eat up your evening, at any rate.

Thursday, July 31, 2008 | acquire, data, feedbag, parse  

NASA Observes Earth Blogs

Robert Simmon of NASA caught this post about the NASA Earth Observatory and was kind enough to pass along some additional information.

Regarding the carbon emissions video:

The U.S. carbon emissions data were taken from the Vulcan Project:
http://www.purdue.edu/eas/carbon/vulcan/index.php

They distribute the data here:
http://www.purdue.edu/eas/carbon/vulcan/research.html

In addition to the animation (which was intended to show the daily cycle and the progress of elevated emissions from east to west each morning), we published a short feature about the project and the dataset, including some graphs that remove the diurnal cycle.
http://earthobservatory.nasa.gov/Study/AmericanCarbon/

American Carbon is an example of one of our feature articles, which are published every month or so. We try to cover current research, focusing on individual scientists, using narrative techniques. The visualizations tie in closely to the text of the story. I’m the primary visualizer, and I focus on presenting the data as clearly as possible, rather than allowing free-form investigation of data. We also publish daily images (with links to images at the original resolution), imagery of natural hazards emphasizing current events (fires, hurricanes, and dust storms, for example), nasa press releases, a handful of interactive lessons, and the monthly global maps of various parameters. We’re in the finishing stages of a redesign, which will hopefully improve the navigation and site usability.

Also some details about the difficulties of distributing and handling the data:

These sections draw on data from wide and varied sources. The raw data is extremely heterogeneous, formats include: text files, HDF, matlab, camera raw files, GRADS, NetCDF, etc. All in different projections, at different spatial scales, and covering different time periods. Some of them are updated every five minutes, and others are reprocessed periodically. Trying to make the data available—and current—through our site would be overly ambitious. Instead, we focus on a non-expert audience interested in space, technology, and the environment, and link to the original science groups and the relevant data archives. Look in the credit lines of images for links.

Unfortunately the data formats can be very difficult to read. Here’s the main portal for access to NASA Earth Observing System data:
http://esdis.eosdis.nasa.gov/index.html

and the direct link to several of the data access interfaces:
http://esdis.eosdis.nasa.gov/dataaccess/search.html

And finally, something closer to what was discussed in the earlier post:

With the complexity of the science data, there is a place for an intermediate level of data: processed to a consistent format and readable by common commercial or free software (intervention by a data fairy?). NASA Earth Observations (NEO) is one attempt at solving that problem: global images at 0.1 by 0.1 degrees distributed as lossless-compressed indexed color images and csv files. Obviously there’s work to be done to improve NEO, but we’re getting there. We’re having a workshop this month to develop material for “amateur Earth observers” which will hopefully help us in this area, as well.

This speaks to the audience I tried to address with Visualizing Data in particular (or with Processing in general). There is a group of people who want access to data that’s more low-level than what’s found in a newspaper article, but not as complicated as raw piles of data from measuring instruments that are only decipherable by the scientists who use them.

This is a general theme, not specific to NASA’s data. And I think it’s a little more low-level than requiring that everything be in mashup-friendly XML or JSON feeds, but it seems worthwhile to start thinking about what the guidelines would be for open data distribution. And with such guidelines in place, we can browbeat organizations to play along! Since that would be, uh, a nice way to thank them for making their data available in the first place.

Thursday, July 31, 2008 | acquire, data, feedbag  

Processing 0143 and a status report

Just posted Processing 0143 to the download page. This is not yet the stable release, so please read revisions.txt, which describes the signficant changes in the releases since 0135 (the last “stable” release, and the current default download).

I’ve also posted a status report:

Some updates from the Processing Corporation’s east coast tower high rise offices in Cambridge, MA.

We’re working to finish Processing 1.0. The target date is this Fall, meaning August or September. We’d like to have it done as early as possible so that Fall classes can make use of it. In addition to the usual channels, we have a dozen or so people who are helping out with getting the release out the door. We’ll unmask these heros at some point in the future.

I’m also pleased to announce that I’m able to focus on Processing full time this Summer with the help of a stipend provided by Oblong Industries. They’re the folks behind the gesture-controlled interface you see in Minority Report. (You can find more about them with a little Google digging.) They’re funding us because of their love of open source and they feel that Processing is an important project. As in, there are no strings attached to the funding, and Processing is not being re-tooled for gesture interfaces. We owe them our enormous gratitude.

The big things for 1.0 include the Tools menu, better compile/run setup (what you see in 0136+), bringing back P2D, perhaps bringing back P3D with anti-aliasing, better OpenGL support, better library support, some major bug fixes (outstanding threading problems and more).

If you have a feature or bug that you want fixed in time for 1.0, now is the time to vote by making sure that it’s listed at http://dev.processing.org/bugs.

I’ll try to post updates more frequently over the next few weeks.

Monday, July 28, 2008 | processing  

Wordle me this, Batman

I’ve never really been fond of tag clouds, but Wordle, by MacGyver of software (and former drummer for They Might Be Giants) Jonathan Feinberg gives the representation an aesthetic nudge lacking in most representations. The application creates word clouds from input data submitted by users. I was reminded of it yesterday by Eugene, who submitted Lorem Ipsum:

lorem-500.png

I had first heard about it from emailer Bill Robertson, who had uploaded Organic Information Design, my master’s thesis. (Which was initially flattering but quickly became terrifying when I remembered that it still badly needs a cleanup edit.)

organic-500.jpg

A wonderful tree shape! Can’t decide which I like better: “information” as the stem or “data” as a cancerous growth in the upper-right.

Mr. Feinberg is also the reason that Processing development has been moving to Eclipse (replacing emacs, some shell scripts, two packages of bazooka bubble gum and the command line) because of his donation of a long afternoon helping set up the software in the IDE back when I lived in East Cambridge, just a few blocks from where he works at IBM Research.

Wednesday, July 23, 2008 | inbox, refine, represent  

Blood, guts, gore and the data fairy

The O’Reilly press folks passed along this review (PDF) of Visualizing Data from USENIX magazine. I really appreciated this part:

My favorite thing about Visualizing Data is that it tackles the whole process in all its blood, guts, and gore. It starts with finding the data and cleaning it up. Many books assume that the data fairy is going to come bring you data, and that it will either be clean, lovely data or you will parse it carefully into clean, lovely data. This book assumes that a significant portion of the data you care about comes from some scuzzy Web page you don’t control and that you are going to use exactly the minimum required finesse to tear out the parts you care about. It talks about how to do this, and how to decide what the minimum required finesse would be. (Do you do it by hand? Use a regular expression? Actually bother to parse XML?)

Indeed, writing this book was therapy for that traumatized inner child who learned at such a tender young age that the data fairy did not exist.

Wednesday, July 23, 2008 | iloveme, parse, reviews, vida  

NASA Earth Observatory

carbon.jpgSome potentially interesting data from NASA passed along by Chris Lonnen. The first is the Earth Observatory, which includes images of things like Carbon Monoxide, Snow Cover, Surface Temperature, UV Exposure, and so on. Chris writes:

I’m not sure how useful they would be to novices in terms of usable data (raw numbers are not provided in any easy to harvest manner), but the information is
still useful and they provide for a basic, if clunky, presentation that follows the basic steps you laid out in your book. They data can be found here, and they occasionally compile it all into interesting visualizations. My favorite being the carbon map here.

The carbon map movie is really cool, though I wish the raw data were available since the strong cyclical effect seen in the animation needs to be separated out. The cycles dominates the animation to such an extent that it’s nearly the only takeaway from the movie. For instance, each cycle is a 24 hour period. Instead of showing them one after another, show several days adjacent one another, so that we can compare 3am with one day to 3am the next.

For overseas readers, I’ll note that the images and data are not all U.S.-centric—most cover the surface of the Earth.

I asked Chris about availability for more raw data, and he did a little more digging:

The raw data availability is slim. From what I’ve gathered you need to contact NASA and have them give you clearance as a researcher. If you were looking for higher quality photography for a tutorial NASA Earth Observations has a newer website that I’ve just found which offers similar data in the format of your choice at up to 3600 x 1800. For some sets it will also offer you data in CSV or CSV for Excel.

If you needed higher resolutions that that NASA’s Visible Earth offers some TIFF’s at larger sizes. A quick search for .tiff gave me an 16384 x 8192 map of the earth with city lights shining, which would be relatively easy to filter out from the dark blue background. These two websites are probably a bit more helpful.

Interesting tidbits for someone interested in a little planetary digging. I’ve had a few of these links sitting in a pile waiting for me to finish the “data” section of my web site; in the meantime I’ll just mention things here.

Update 31 July 2008: Robert Simmon from NASA chimes in.

Saturday, July 19, 2008 | acquire, data, inbox, science  

Brains on the Line

I was reminded this morning that Mario Manningham, a wide receiver who played for Michigan was rumored to have scored a 6 (out of 50) on the Wonderlic, an intelligence test administered in some occupations (and now pro football) to check the mental capability of job candidates. Intelligence tests are strange beasts, but after watching my niece working on similar problems—for fun—during her summer vacation last week, the tests caught my eye more than when I first heard about it.

Manningham was once a promising undergrad receiver for U of M, but has in recent years proven himself to be a knucklehead, loafing through plays and most recently making headlines for marijuana use and an interview on Sirius radio described as “… arrogant and defensive. When asked about the balls he dropped in big spots, he responded, ‘What about the ball I caught?’” So while an exceptionally score on a standardized test might suggest dyslexia, the guy’s an egotistical bonehead even without mitigating factors.

Most people don’t associate brains with football, but in recent years teams have begun to use a Wonderlic test while scouting, which consists of 50 questions to be completed in 12 minutes. Many of the questions are multiple choice, but the time is certainly a factor when completing the tests. A score of 10 is considered “literate”, while 20 is said to coincide with average intelligence (an IQ of 100, though now we’re comparing one somewhat arbitrary numerically scored intelligence test with another).

In another interesting twist, the test is also administered to players the day of the NFL combine—which means they first spend the day running, jumping, benching, interviewing, and lots of other -ings, before they sit down and take an intelligence test. It’s a bit like a medical student running a half marathon before taking the boards.

Wonderlic himself says that basically, the scores decrease as you move further away from the ball, which is interesting but unsurprising. It’s sort of obvious that a quarterback needs to be on the smarter side, but I was curious to see what this actually looked like. Using this table as a guide, I then grabbed this diagram from Wikipedia showing a typical formation in a football game. I cleaned up the design of the diagram a bit and replaced the positions with their scores:

positions1.png

Offense is shown in blue, defense in red. You can see the quarterback with a 24, the center (over 6 feet and around 300 lbs.) averaging higher at 25, and the outside linemen even a little higher. Presumably this is because the outside linemen need to mentally quick (as well as tough) to read the defense and respond to it. Those are the wide receivers (idiot loud mouths) with the 17s on the outside.

To make the diagram a bit clearer, I scaled each position based on its score:

positions2.png

That’s a little better since you can see the huddle around the ball and where the brains need to be for the system of protection around it. With the proportion, I no longer need the numbers, so I’ve switched back to using the initials for each position’s title:

positions3.png

(Don’t tell Tufte that I’ve used the radius, not the proportional area, of the circle as the value for each ellipse! A cardinal sin that I’m using in this case to improve proportion and clarify a point.)

I’ll also happily point out that the linemen for the Patriots all score above average for their position:

Player Position Year Score
Matt Light left tackle 2001 29
Logan Mankins left guard 2005 25
Dan Koppen center 2003 28
Stephen Neal right guard 2001 31
Nick Kaczur right tackle 2005 29

A position-by-position image for a team would be interesting, but I’ve already spent too much time thinking about this. The Patriots are rumored to be heavy on brains, with Green Bay at the other end of the spectrum.

An ESPN writeup about the test (and testing in general) can be found here, along with a sample test here.

One odd press release from Wonderlic even compares scores per NFL position with private sector job titles. For instance, a middle linebacker scores like a hospital orderly, while an offensive tackle is closer to a marketing executive. Fullbacks and halfbacks share the lower end with dock hands and material handlers.

During the run-up to Super Bowl XXXII in 1998, one reporter even dug up the Wonderlic scores for the Broncos and Packers, showing Denver with an average score of 20.4 compared to Green Bay’s 19.6. As defending champions, the Packers were favored but wound up losing 31-24.

Nobody cited test scores in the post-game coverage.

Wednesday, July 16, 2008 | football, sports  

Eric Idle on “Scale”

Scale is one of the most important themes in data visualization. In Monty Python’s The Meaning of Life, Eric Idle shares his perspective:

The lyrics:

Just remember that you’re standing on a planet that’s evolving
And revolving at nine hundred miles an hour,
That’s orbiting at nineteen miles a second, so it’s reckoned,
A sun that is the source of all our power.
The sun and you and me and all the stars that we can see
Are moving at a million miles a day
In an outer spiral arm, at forty thousand miles an hour,
Of the galaxy we call the ‘Milky Way’.

Our galaxy itself contains a hundred billion stars.
It’s a hundred thousand light years side to side.
It bulges in the middle, sixteen thousand light years thick,
But out by us, it’s just three thousand light years wide.
We’re thirty thousand light years from galactic central point.
We go ’round every two hundred million years,
And our galaxy is only one of millions of billions
In this amazing and expanding universe.

The universe itself keeps on expanding and expanding
In all of the directions it can whizz
As fast as it can go, at the speed of light, you know,
Twelve million miles a minute, and that’s the fastest speed there is.
So remember, when you’re feeling very small and insecure,
How amazingly unlikely is your birth,
And pray that there’s intelligent life somewhere up in space,
‘Cause there’s bugger all down here on Earth.

Wednesday, July 16, 2008 | music, scale  

Postleitzahlen in Deutschland

germany-contrast-small.pngMaximillian Dornseif has adapted Zipdecode from Chapter 6 of Visualizing Data to handle German postal codes. I’ve wanted to do this myself since hearing about the OpenGeoDB data set which includes the data, but thankfully he’s taken care of it first and is sharing it with the rest of us along with his modified code.

(The site is in German…I’ll trust any of you German readers to let me know if the site actually says that Visualizing Data is the dumbest book he’s ever read.)

Also helpful to note that he used Python for preprocessing the data. He doesn’t bother implementing a map projection, as done in the book, but the Python code is a useful example of using another language when appropriate, and how the syntax differs from Processing:

# Convert opengeodb data for zipdecode
fd = open('PLZ.tab')
out = []
minlat = minlon = 180
maxlat = maxlon = 0

for line in fd:
    line = line.strip()
    if not line or line.startswith('#'):
        continue
    parts = line.split('\t')
    dummy, plz, lat, lon, name = parts
    out.append([plz, lat, lon, name])
    minlat = min([float(lat), minlat])
    minlon = min([float(lon), minlon])
    maxlat = max([float(lat), maxlat])
    maxlon = max([float(lon), maxlon])

print "# %d,%f,%f,%f,%f" % (len(out), minlat, maxlat, minlon, maxlon)
for data in out:
    plz, lat, lon, name = data
    print '\t'.join([plz, str(float(lat)), str(float(lon)), name])

In the book, I used Processing for most of the examples (with a little bit of Perl) for sake of simplicity. (The book is already introducing a lot of new material, why hurt people and introduce multiple languages while I’m at it?) However that’s one place where the book diverges from my own process a bit, since I tend to use a lot of Perl when dealing with large volumes of text data. Python is also a good choice (or Ruby if that’s your thing), but I’m tainted since I learned Perl first, while a wee intern at Sun.

Tuesday, July 15, 2008 | adaptation, vida, zipdecode  

Parsing Numbers by the Bushel

While taking a look at the code mentioned in the previous post, I noticed two things. First, the PointCloud.pde file drops directly into OpenGL-specific code (rather than Processing API) for sake of speed to draw thousands and thousands of points. It’s further proof that I need to finish the PShape class for Processing 1.0, which will automatically handle this sort of thing automatically.

Second is a more general point about parsing. This isn’t intended as a nitpick on Aaron’s code (it’s commendable that he put his code out there for everyone to see—and uh, nitpick about). But seeing how it was written reminded me that most people don’t know about the casts in Processing, particularly when applied to whole arrays, and this can be really useful when parsing data.

To convert a String to a float (or int) in Processing, you can use a cast, for instance:

String s = "667.12";
float f = float(s);

This also in fact works with String[] arrays, like the kind returned by the split() method while parsing data. For instance, in SceneViewer.pde, the code currently reads:

String[] thisLine = split(raw[i], ",");
points[i * 3] = new Float(thisLine[0]).floatValue() / 1000;
points[i * 3 + 1] = new Float(thisLine[1]).floatValue() / 1000;
points[i * 3 + 2] = new Float(thisLine[2]).floatValue() / 1000;

Which could be written more cleanly as:

String[] thisLine = split(raw[i], ",");
float[] f = float(thisLine);
points[i * 3 + 0] = f[0] / 1000;
points[i * 3 + 1] = f[1] / 1000;
points[i * 3 + 2] = f[2] / 1000;

However, to his credit, Aaron may have have intentionally skipped it in this case since he don’t need the whole line of numbers.

Or if you’re using the Processing API with Eclipse or some other IDE, that means that the float() cast won’t work for you. You can substitute float() with the parseFloat() method:

String[] thisLine = split(raw[i], ",");
float[] f = parseFloat(thisLine);
points[i * 3 + 0] = f[0] / 1000;
points[i * 3 + 1] = f[1] / 1000;
points[i * 3 + 2] = f[2] / 1000;

The same can be done for int, char, byte, and boolean. You can also go the other direction by converting float[] or int[] arrays to String[] arrays using the str() method. (The method is named str() because a String() cast would be awkward, a string() cast would be error prone, and it’s not really parseStr() either.)

When using parseInt() and parseFloat() (versus the int() and float() casts), it’s also possible to include a second parameter that specifies a “default” value for missing data. Normally, the default is Float.NaN for parseFloat(), or 0 with parseInt() and the others. When parsing integers, 0 and “no data” often have a very different meaning, in which case this can be helpful.

Tuesday, July 15, 2008 | parse  

Radiohead - House of Cards

Radiohead’s new video for “House of Cards” built using a laser scanner and software:

Aaron Koblin, one of Casey’s former students was involved in the project and also made use of Processing for the video. He writes:

A couple of hours ago was the release of a project I’ve been working on with Radiohead and Google. Lots of laser scanner fun.

I released some Processing code along with the data we captured to make the video. Also tried to give a basic explanation of how to get started using Processing to play with all this stuff.

The project is hosted at code.google.com/radiohead, where you can also download all the data for the point clouds captured by the scanner, as well as Processing source code to render the points and rotate Thom’s head as much as you’d like. This is the download page for the data and source code.

They’ve also posted a “making of” video:

(Just cover your ears toward the end where the director starts going on about “everything is data…”)

Sort of wonderful and amazing that they’re releasing the data behind the project, opening up the possibility for a kind of software-based remixing of the video. I hope their leap of faith will be rewarded by individuals doing interesting and amazing things with the data. (Nudge, nudge.)

Aaron’s also behind the excellent Flight Patterns as well as The Sheep Market, both highly recommended.

Tuesday, July 15, 2008 | data, motion, music  

Derek Jeter Probably Didn’t Need To Jump To Throw That Guy Out

05jeterderek14.jpgDerek Jeter vs. Objective Reality is an entertaining article from Slate regarding a study by Shane T. Jensen at the Wharton School. Nate DiMeo writes:

The take-away from the study, which was presented at the annual meeting of the American Association for the Advancement of Science, was that Mr. Jeter (despite his three Gold Gloves and balletic leaping throws) is the worst-fielding shortstop in the game.

The New York press was unhappy, but the stats-minded baseball types (Sabermetricians) weren’t that impressed. DiMeo continues:

Mostly, though, the paper didn’t provoke much intrigue because Jeter’s badness is already an axiom of [Sabermetric literature]. In fact, debunking the conventional wisdom about the Yankee captain’s fielding prowess has become a standard method of proving the validity of a new fielding statistic. That places Derek Jeter at the frontier of new baseball research.

Well put. Mr. Jeter defended himself by saying:

“Maybe it was a computer glitch”

What I like about the article, aside from a objective and quantitative reason to dislike Jeter (I already have a quantity of subjective reasons) is how the article frames the issue in the broader sports statistics debate. It nicely covers this new piece of information as a microcosm of the struggle between sabermetricians and traditional baseball types, while essentially poking fun at both: the total refusal of the traditional side to buy into the numbers, and the schadenfreude of the geeks going after Jeter since he’s the one who gets the girls. (The article is thankfully not as trite as that, but you get the idea.)

I’m also biased since the metric in the paper places Pokey Reese, one of my favorite Red Sox players of 2004 as #11 amongst second basemen between 2000-2005.

And of course, The Onion does it better:

Experts: ‘Derek Jeter Probably Didn’t Need To Jump To Throw That Guy Out’

BRISTOL, CT—Baseball experts agreed Sunday that Derek Jeter, who fielded a routine ground ball during a regular-season game in which the Yankees were leading by five runs and then threw it to first base using one of his signature leaps, did not have to do that to record the out. “If it had been a hard-hit grounder in the hole or even a slow dribbler he had to charge, that would’ve been one thing,” analyst John Kruk said during a broadcast of Baseball Tonight. “But when it’s hit right to him by [Devil Rays first-baseman] Greg Norton, a guy who has no stolen bases and is still suffering the effects of a hamstring injury sustained earlier this year… Well, that’s a different story.” Jeter threw out Norton by 15 feet and pumped his fist in celebration at the end of the play.

In other news, I can’t believe I just put a picture of Jeter on my site.

Monday, July 14, 2008 | baseball, mine, sports  
Book

Visualizing Data Book CoverVisualizing Data is my book about computational information design. It covers the path from raw data to how we understand it, detailing how to begin with a set of numbers and produce images or software that lets you view and interact with information. Unlike nearly all books in this field, it is a hands-on guide intended for people who want to learn how to actually build a data visualization.

The text was published by O’Reilly in December 2007 and can be found at Amazon and elsewhere. People who have purchased the book can find the examples here.

The book covers ideas found in my Ph.D. dissertation, which is basis for Chapter 1. The next chapter is an extremely brief introduction to Processing, which is used for the examples. but applies them to a series of examples, first starting with a simple mapping project (Chapter 3) to place data points on a map of the United States. Of course, the idea is not that lots of people want to visualize data for each of 50 states. Instead, it’s a jumping off point for learning how to lay out data spatially.

The chapters that follow cover six more projects, such as salary vs. performance (Chapter 5), zipdecode (Chapter 6), followed by more advanced topics dealing with trees, treemaps, hierarchies, and recursion (Chapter 7), plus graphs and networks (Chapter 8).

This site will be used for follow-up code and writing about related topics.