writing | ben fry

Writing

Mediocre metrics, and how did we get here?

In other news, an article from Slate about measuring obesity using BMI (Body Mass Index). Interesting reading as I continue with work in the health care space. The article goes through the obvious flaws of the BMI measure, along with some history. Jeremy Singer-Vine writes:

Belgian polymath Adolphe Quetelet devised the equation in 1832 in his quest to define the “normal man” in terms of everything from his average arm strength to the age at which he marries. This project had nothing to do with obesity-related diseases, nor even with obesity itself. Rather, Quetelet used the equation to describe the standard proportions of the human build—the ratio of weight to height in the average adult. Using data collected from several hundred countrymen, he found that weight varied not in direct proportion to height (such that, say, people 10 percent taller than average were 10 percent heavier, too) but in proportion to the square of height. (People 10 percent taller than average tended to be about 21 percent heavier.)

For some reason, this brings to mind a guy in a top hat guessing peoples’ weight at the county fair. More to the point is the “how did we get here?” part of the story. Starting with a mediocre measure, it evolved into something for which it was never intended, simply because it worked for a large number of individuals:

The new measure caught on among researchers who had previously relied on slower and more expensive measures of body fat or on the broad categories (underweight, ideal weight, and overweight) identified by the insurance companies. The cheap and easy BMI test allowed them to plan and execute ambitious new studies involving hundreds of thousands of participants and to go back through troves of historical height and weight data and estimate levels of obesity in previous decades.

Gradually, though, the popularity of BMI spread from epidemiologists who used it for studies of population health to doctors who wanted a quick way to measure body fat in individual patients. By 1985, the NIH started defining obesity according to body mass index, on the theory that official cutoffs could be used by doctors to warn patients who were at especially high risk for obesity-related illness. At first, the thresholds were established at the 85^th percentile of BMI for each sex: 27.8 for men and 27.3 for women. (Those numbers now represent something more like the 50^th percentile for Americans.) Then, in 1998, the NIH changed the rules: They consolidated the threshold for men and women, even though the relationship between BMI and body fat is different for each sex, and added another category, “overweight.” The new cutoffs—25 for overweight, 30 for obesity—were nice, round numbers that could be easily remembered by doctors and patients.

I hadn’t realized that it was only 1985 that this came into common use. And I thought the new cutoffs had more to do with the stricter definition from the WHO, rather than the simplicity of rounding. But back to the story:

Keys had never intended for the BMI to be used in this way. His original paper warned against using the body mass index for individual diagnoses, since the equation ignores variables like a patient’s gender or age, which affect how BMI relates to health.

After taking as fact that it was a poor indicator, all this grousing about the inaccuracy of BMI now has me wondering how often it’s actually out of whack. For instance, it does poorly for muscular athletes, but what percentage of the population is that? 10% at the absolute highest? Or at the risk of sounding totally naive, if the metric is correct, say, 85% of the time, does it deserve as much derision as it receives?

Going a little further, another fascinating part of returns to the fact that the BMI numbers had in the past been a sort of guideline used by doctors. Consider the context: a doctor might sit with a patient in their office, and if the person is obviously not obese or underweight, not even consider such a measure. But if there’s any question, BMI provides a general clue as to an appropriate range, which, when delivered by a doctor with experience, can be framed appropriately. However, as we move to using technology to record such measures—it’s easy to put an obesity calculation into an electronic medical record, for instance, that EMR does not (necessarily) include the doctor’s delivery.

Basically, we can make a general rule or goal that numbers that require additional context (delivery by a doctor), shouldn’t be stored in places devoid of context (databases). If we’re taking away context, the accuracy of the metric has to increase in proportion (or proportion squared, even) to the amount of context that has been removed.

I assume this is the case for most fields, and that the statistical field has a term (probably made up by Tukey) for the “remove context, increase accuracy” issue. At any rate, that’s the end of today’s episode of “what’s blindingly obvious to proper statisticians but I like working out for myself.”

Tuesday, July 21, 2009 | human, numberscantdothat

Skills as Numbers

BusinessWeek has an excerpt of Numerati, a book about the fabled monks of data mining (publishers weekly calls them “entrepreneurial mathematicians”) who are sifting through the personal data we create every day.

Picture an IBM manager who gets an assignment to send a team of five to set up a call center in Manila. She sits down at the computer and fills out a form. It’s almost like booking a vacation online. She puts in the dates and clicks on menus to describe the job and the skills needed. Perhaps she stipulates the ideal budget range. The results come back, recommending a particular team. All the skills are represented. Maybe three of the five people have a history of working together smoothly. They all have passports and live near airports with direct flights to Manila. One of them even speaks Tagalog.

Everything looks fine, except for one line that’s highlighted in red. The budget. It’s $40,000 over! The manager sees that the computer architect on the team is a veritable luminary, a guy who gets written up in the trade press. Sure, he’s a 98.7% fit for the job, but he costs $1,000 an hour. It’s as if she shopped for a weekend getaway in Paris and wound up with a penthouse suite at the Ritz.

Hmmm. The manager asks the system for a cheaper architect. New options come back. One is a new 29-year-old consultant based in India who costs only $85 per hour. That would certainly patch the hole in the budget. Unfortunately, he’s only a 69% fit for the job. Still, he can handle it, according to the computer, if he gets two weeks of training. Can the job be delayed?

This is management in a world run by Numerati.

I’m highly skeptical of management (a fundamentally human activity) being distilled to numbers in this manner. Unless, of course, the managers are that poor at doing their job. And further, what’s the point of the manager if they’re spending most of their time filling out the vacation form-style work order? (Filling out tedious year-end reviews, no doubt.) Perhaps it should be an indication that the company is simply too large:

As IBM sees it, the company has little choice. The workforce is too big, the world too vast and complicated for managers to get a grip on their workers the old-fashioned way—by talking to people who know people who know people.

Then we descend (ascend?) into the rah-rah of today’s global economy:

Word of mouth is too foggy and slow for the global economy. Personal connections are too constricted. Managers need the zip of automation to unearth a consultant in New Delhi, just the way a generation ago they located a shipment of condensers in Chicago. For this to work, the consultant—just like the condensers—must be represented as a series of numbers.

I say rah-rah because how else can you put refrigeration equipment parts in the same sentence as a living, breathing person with a mind, free will and a life.

And while I don’t think I agree with this particular thesis, the book as a whole looks like an interesting survey of efforts in this area. Time to finish my backlog of Summer reading so I can order more books…

Monday, September 1, 2008 | human, mine, notafuturist, numberscantdothat, privacy, social

Book

Visualizing Data is my 2007 book about computational information design. It covers the path from raw data to how we understand it, detailing how to begin with a set of numbers and produce images or software that lets you view and interact with information. When first published, it was the only book(s) for people who wanted to learn how to actually build a data visualization in code.

The text was published by O’Reilly in December 2007 and can be found at Amazon and elsewhere. Amazon also has an edition for the Kindle, for people who aren’t into the dead tree thing. (Proceeds from Amazon links found on this page are used to pay my web hosting bill.)

Examples for the book can be found here.

The book covers ideas found in my Ph.D. dissertation, which is the basis for Chapter 1. The next chapter is an extremely brief introduction to Processing, which is used for the examples. Next is (chapter 3) is a simple mapping project to place data points on a map of the United States. Of course, the idea is not that lots of people want to visualize data for each of 50 states. Instead, it’s a jumping off point for learning how to lay out data spatially.

The chapters that follow cover six more projects, such as salary vs. performance (Chapter 5), zipdecode (Chapter 6), followed by more advanced topics dealing with trees, treemaps, hierarchies, and recursion (Chapter 7), plus graphs and networks (Chapter 8).

This site is used for follow-up code and writing about related topics.

Much Clicked

Full Archives