Writing

Our gattaca future begins with our sports heroes

The New York Times this morning documents Major League Baseball’s use of DNA tests to verify the age of baseball prospects:

Dozens of Latin American prospects in recent years have been caught purporting to be younger than they actually were as a way to make themselves more enticing to major league teams. Last week the Yankees voided the signing of an amateur from the Dominican Republic after a DNA test conducted by Major League Baseball’s department of investigations showed that the player had misrepresented his identity.

Some players have also had bone scans to be used in determining age range.

(Why does a “bone scan” sound so painful? “You won’t provide a DNA sample? Well, maybe you’ll change your mind after the bone scan!”)

Kathy Hudson of Johns Hopkins notes the problem with testing:

“The point of [the Genetic Information Nondiscrimination Act, passed last year] was to remove the temptation and prohibit employers from asking or receiving genetic information.”

The article continues and makes note of the fact that such tests are also used to determine whether a player’s parents are his real parents, which can have an upsetting outcome.

But perhaps the broader concern (outside broken homes) and the scarier motivation for expansion of such testing is noted by a scouting director (not named), who comments:

“Can they test susceptibility to cancer? I don’t know if they’re doing any of that. But I know they’re looking into trying to figure out susceptibility to injuries, things like that. If they come up with a test that shows someone’s connective tissue is at a high risk of not holding up, can that be used? I don’t know. I do think that’s where this is headed.”

Injury is perhaps the most significant, yet most random, factor in scouting. If we’re talking about paying someone $27 million, will the threat of a federal discrimination law (wielded by a young player and agent) really be enough to keep teams away from this?

Wednesday, July 22, 2009 | genetics, sports  

Mediocre metrics, and how did we get here?

In other news, an article from Slate about measuring obesity using BMI (Body Mass Index). Interesting reading as I continue with work in the health care space. The article goes through the obvious flaws of the BMI measure, along with some history. Jeremy Singer-Vine writes:

Belgian polymath Adolphe Quetelet devised the equation in 1832 in his quest to define the “normal man” in terms of everything from his average arm strength to the age at which he marries. This project had nothing to do with obesity-related diseases, nor even with obesity itself. Rather, Quetelet used the equation to describe the standard proportions of the human build—the ratio of weight to height in the average adult. Using data collected from several hundred countrymen, he found that weight varied not in direct proportion to height (such that, say, people 10 percent taller than average were 10 percent heavier, too) but in proportion to the square of height. (People 10 percent taller than average tended to be about 21 percent heavier.)

For some reason, this brings to mind a guy in a top hat guessing peoples’ weight at the county fair. More to the point is the “how did we get here?” part of the story. Starting with a mediocre measure, it evolved into something for which it was never intended, simply because it worked for a large number of individuals:

The new measure caught on among researchers who had previously relied on slower and more expensive measures of body fat or on the broad categories (underweight, ideal weight, and overweight) identified by the insurance companies. The cheap and easy BMI test allowed them to plan and execute ambitious new studies involving hundreds of thousands of participants and to go back through troves of historical height and weight data and estimate levels of obesity in previous decades.

Gradually, though, the popularity of BMI spread from epidemiologists who used it for studies of population health to doctors who wanted a quick way to measure body fat in individual patients. By 1985, the NIH started defining obesity according to body mass index, on the theory that official cutoffs could be used by doctors to warn patients who were at especially high risk for obesity-related illness. At first, the thresholds were established at the 85th percentile of BMI for each sex: 27.8 for men and 27.3 for women. (Those numbers now represent something more like the 50th percentile for Americans.) Then, in 1998, the NIH changed the rules: They consolidated the threshold for men and women, even though the relationship between BMI and body fat is different for each sex, and added another category, “overweight.” The new cutoffs—25 for overweight, 30 for obesity—were nice, round numbers that could be easily remembered by doctors and patients.

I hadn’t realized that it was only 1985 that this came into common use. And I thought the new cutoffs had more to do with the stricter definition from the WHO, rather than the simplicity of rounding. But back to the story:

Keys had never intended for the BMI to be used in this way. His original paper warned against using the body mass index for individual diagnoses, since the equation ignores variables like a patient’s gender or age, which affect how BMI relates to health.

After taking as fact that it was a poor indicator, all this grousing about the inaccuracy of BMI now has me wondering how often it’s actually out of whack. For instance, it does poorly for muscular athletes, but what percentage of the population is that? 10% at the absolute highest? Or at the risk of sounding totally naive, if the metric is correct, say, 85% of the time, does it deserve as much derision as it receives?

Going a little further, another fascinating part of returns to the fact that the BMI numbers had in the past been a sort of guideline used by doctors. Consider the context: a doctor might sit with a patient in their office, and if the person is obviously not obese or underweight, not even consider such a measure. But if there’s any question, BMI provides a general clue as to an appropriate range, which, when delivered by a doctor with experience, can be framed appropriately. However, as we move to using technology to record such measures—it’s easy to put an obesity calculation into an electronic medical record, for instance, that EMR does not (necessarily) include the doctor’s delivery.

Basically, we can make a general rule or goal that numbers that require additional context (delivery by a doctor), shouldn’t be stored in places devoid of context (databases). If we’re taking away context, the accuracy of the metric has to increase in proportion (or proportion squared, even) to the amount of context that has been removed.

I assume this is the case for most fields, and that the statistical field has a term (probably made up by Tukey) for the “remove context, increase accuracy” issue. At any rate, that’s the end of today’s episode of “what’s blindingly obvious to proper statisticians but I like working out for myself.”

Tuesday, July 21, 2009 | human, numberscantdothat  

“…the sexiest numbers I’ve seen in some time.”

A Fonts for Financials mailing from Hoefler & Frere-Jones includes some incredibly beautiful typefaces they’ve developed that play well with numbers. A sampling includes tabular figures (monospaced numbers, meaning “farewell, Courier!”) using Gotham and Sentinel:

courier and andale mono can bite me

Or setting indices (numbers in circles, apparently), using Whitney:

numbers, dots, dots, numbers

As Casey wrote this morning, “these are the sexiest numbers I’ve seen in some time.” I love ‘em.

Tuesday, July 21, 2009 | typography  

Dropping Statistics for Knowledge

My favorite part of this week’s Seminar on Innovative Approaches to Turn Statistics into Knowledge (aside from its comically long name) was the presentation from Amanda Cox of The New York Times. She showed three particular projects which are a little further up the complexity scale as compared to a lot of the work from the Times, and much more like the sort of numerical messes that catch my interest. The three serve are also a great cross-section of Amanda’s work with her collaborators, so I’m posting them here. Check ‘em out:

“How Different Groups Voted in the 2008 Democratic Presidential Primaries” by Shan Carter and Amanda Cox:

oh hillary

“All of Inflation’s Little Parts” by Matthew Bloch, Shan Carter and Amanda Cox

soap bubble opera

And finally, “Turning a Corner?” which is perhaps the most complicated of the bunch, but gets more interesting as you spend a little more time with it.

just give it some time

Sunday, July 19, 2009 | infographics  

“There’s a movie in there, but it’s a very unusual movie.”

how about some handsome with that?On the heels of today’s posting of the updated Salary vs. Performance piece comes word in the New York Times that a film version of Moneyball has been shelved:

Just days before shooting was to begin, Sony Pictures pulled the plug on “Moneyball,” a major film project starring Brad Pitt and being directed by Steven Soderbergh.

Yesterday I found it far more unsettling that such a movie would be made period, but today I’m oddly curious about how they might pull it off:

What baseball saw as accurate, Sony executives saw as being too much a documentary. Mr. Soderbergh, for instance, planned to film interviews with some of the people who were connected to the film’s story.

I guess we’ll never know, since other studios also passed on the project, but that’s probably a good thing.

As an aside, I’m in the midst of reading Liar’s Poker (another by Moneyball author Michael Lewis) and again find myself amused by his ability as a storyteller: he reminds me of a friend who can take the most banal event and turn it into the most peculiar and hilarious story you’ve ever heard.

Thursday, July 2, 2009 | movies, reading, salaryper  

Salary vs. Performance for 2009

go tigersI’ve just posted the updated version of Salary vs. Performance for the 2009 baseball season. I had hoped this year to rewrite the piece to cover multiple years, have a couple more analysis options, and even to rebuild it using the JavaScript version of Processing (no Java! no plug-in!), but a busy spring has upended my carefully crafted but poorly implemented plans.

Meanwhile, my inbox has been filling with plaintive comments like this one:

Will you be updating this site for this year? It’s the first year I think my team, the Giants would have a blue line instead of a red line.

How can I ignore the Giants fans? (Or for that matter, their neighbors to the south, the Dodgers, who perch atop the list as I write this.)

More about the project can be found in the archives. Visualizing Data explains the code and how it works, and the code itself is amongst the book examples.

Thursday, July 2, 2009 | inbox, salaryper  
Book

Visualizing Data Book CoverVisualizing Data is my book about computational information design. It covers the path from raw data to how we understand it, detailing how to begin with a set of numbers and produce images or software that lets you view and interact with information. Unlike nearly all books in this field, it is a hands-on guide intended for people who want to learn how to actually build a data visualization.

The text was published by O’Reilly in December 2007 and can be found at Amazon and elsewhere. Amazon also has an edition for the Kindle, for people who aren’t into the dead tree thing. (Proceeds from Amazon links found on this page are used to pay my web hosting bill.)

Examples for the book can be found here.

The book covers ideas found in my Ph.D. dissertation, which is basis for Chapter 1. The next chapter is an extremely brief introduction to Processing, which is used for the examples. Next is (chapter 3) is a simple mapping project to place data points on a map of the United States. Of course, the idea is not that lots of people want to visualize data for each of 50 states. Instead, it’s a jumping off point for learning how to lay out data spatially.

The chapters that follow cover six more projects, such as salary vs. performance (Chapter 5), zipdecode (Chapter 6), followed by more advanced topics dealing with trees, treemaps, hierarchies, and recursion (Chapter 7), plus graphs and networks (Chapter 8).

This site is used for follow-up code and writing about related topics.

As seen on Twitter