Writing

Hide the bipolar data, here comes bioinformatics!

I was fascinated a few weeks ago to receive this email from the Genome-announce list at UCSC:

Last week the National Institutes of Health (NIH) modified their policy for posting and accessing genome-wide association studies (GWAS) data contained in NIH databases. They have removed public access to aggregate genotype GWAS data in response to the publication of new statistical techniques for analyzing dense genomic information that make it possible to infer the group assignment (case vs. control) of an individual DNA sample under certain circumstances. The Wellcome Trust Case Control Consortium in the UK and the Broad Institute of MIT and Harvard in Boston have also removed aggregate data from public availability. Consequently, UCSC has removed the “NIMH Bipolar” and “Wellcome Trust Case Control Consortium” data sets from our Genome Browser site.

The ingredients for a genome-wide association study are a few hundred people and a list of which genetic letter (A, C, G, or T) is found at each of a few hundred specific locations in the DNA of each of those people. That data is then correlated with whether the individuals have a particular disease, and using the correlation, it’s sometimes possible to localize which part of the genome is responsible for the disease.
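To make that concrete, here’s a toy version of the arithmetic in Processing, using made-up counts and just one location (a real study repeats this across the whole genome): tally how often each letter appears in the disease group versus the control group, and ask whether the split is too lopsided to be chance. A chi-squared test is one standard way to do that; the numbers here are invented for illustration.

    // Made-up allele counts at a single DNA location, split by disease status.
    int caseA = 180, caseG = 120;  // counts among people with the disease
    int ctrlA = 130, ctrlG = 170;  // counts among people without it

    void setup() {
      float total = caseA + caseG + ctrlA + ctrlG;
      // Expected counts if the letter had nothing to do with the disease
      float expCaseA = (caseA + caseG) * (caseA + ctrlA) / total;
      float expCaseG = (caseA + caseG) * (caseG + ctrlG) / total;
      float expCtrlA = (ctrlA + ctrlG) * (caseA + ctrlA) / total;
      float expCtrlG = (ctrlA + ctrlG) * (caseG + ctrlG) / total;
      // Standard 2x2 chi-squared statistic: big values mean the letter
      // and the disease are correlated at this location.
      float chi2 = sq(caseA - expCaseA) / expCaseA
                 + sq(caseG - expCaseG) / expCaseG
                 + sq(ctrlA - expCtrlA) / expCtrlA
                 + sq(ctrlG - expCtrlG) / expCtrlG;
      println("chi-squared = " + chi2);  // about 16.7 here, far beyond chance
    }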

Of course, the diseases might be of a sensitive nature (e.g. bipolar disorder), so when such data is made publicly available, it’s done in a manner that protects the privacy of the individuals in the data set. What this message means is that a bioinformatics method has been developed that undermines those privacy protections. An amazing bit of statistics!

This made me curious about what led to such a result, so with a little digging, I found this press release, which describes the work:

A team of investigators led by scientists at the Translational Genomics Research Institute (TGen) have found a way to identify possible suspects at crime scenes using only a small amount of DNA, even if it is mixed with hundreds of other genetic fingerprints.

Using genotyping microarrays, the scientists were able to identify an individual’s DNA from within a mix of DNA samples, even if that individual represented less than 0.1 percent of the total mix, or less than one part per thousand. They were able to do this even when the mix of DNA included more than 200 individual DNA samples.

The discovery could help police investigators better identify possible suspects, even when dozens of people over time have been at a crime scene. It also could help reassess previous crime scene evidence, and it could have other uses in various genetic studies and in statistical analysis.

So the CSI folks have screwed it up for the bipolar folks. The titillatingly titled “Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays” can be found at PLoS Genetics, and a PDF describing the policy changes is on the NIH’s site for Genome-Wide Association Studies. The PDF provides a much more thorough explanation of what association studies are, in case you’re looking for something better than my cartoon version described above.
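For the curious, my rough reading of the paper’s statistic goes like this: take a person’s genotype, and at each location compare it against two sets of aggregate allele frequencies, the mixture’s and a reference population’s. If the person consistently looks more like the mixture than like the general population, their DNA is probably part of the mixture. A toy version in Processing, with made-up numbers:

    // My rough reading of the paper's idea, with made-up numbers (a real
    // analysis uses hundreds of thousands of locations, not five).
    // Each value is an allele frequency: a person is 0, 0.5, or 1 at a
    // location, while the mixture and reference population are averages.
    float[] person = { 0.5, 1.0, 0.0, 0.5, 1.0 };
    float[] mix    = { 0.42, 0.71, 0.18, 0.55, 0.63 };  // aggregate data
    float[] pop    = { 0.40, 0.65, 0.20, 0.50, 0.60 };  // reference population

    void setup() {
      float d = 0;
      for (int i = 0; i < person.length; i++) {
        // Positive when the person sits closer to the mixture than to the
        // reference population at this location.
        d += abs(person[i] - pop[i]) - abs(person[i] - mix[i]);
      }
      println("mean distance measure = " + d / person.length);
      // A mean that stays positive across enough locations is strong
      // evidence that this person contributed to the mixture.
    }

The unsettling part is that the mixture frequencies are exactly the kind of aggregate numbers the GWAS databases were publishing.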

Links to much more coverage can be found here, including major journals (Nature) and mainstream media outlets (LA Times, Financial Times) weighing in on the research. (It’s always funny to see how news outlets respond to this sort of thing—the Financial Times talks about the positive side, the LA Times focuses exclusively on the negative.) A discussion about the implications of the study can also be found on the PLoS site, with further background from the study’s primary author.

Science presents such fascinating contradictions. A potentially helpful advance that undermines another area of research. The breakthrough that opens a Pandora’s Box. It’s probably rare to see such a direct contradiction (one that’s not heavily politicized like, say, stem cell research), but this sort of social and societal impact is undoubtedly one of the things I love most about genetics in particular.

Tuesday, September 16, 2008 | genetics, mine, privacy, science  
Book

Visualizing Data is my 2007 book about computational information design. It covers the path from raw data to how we understand it, detailing how to begin with a set of numbers and produce images or software that lets you view and interact with information. When first published, it was the only book for people who wanted to learn how to actually build a data visualization in code.

The text was published by O’Reilly in December 2007 and can be found at Amazon and elsewhere. Amazon also has an edition for the Kindle, for people who aren’t into the dead tree thing. (Proceeds from Amazon links found on this page are used to pay my web hosting bill.)

Examples for the book can be found here.

The book covers ideas found in my Ph.D. dissertation, which is the basis for Chapter 1. The next chapter is an extremely brief introduction to Processing, which is used for the examples. Next (Chapter 3) is a simple mapping project to place data points on a map of the United States. Of course, the idea is not that lots of people want to visualize data for each of the 50 states. Instead, it’s a jumping-off point for learning how to lay out data spatially.
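To give a flavor of where that chapter heads, here’s a minimal Processing sketch (made-up coordinates, not the book’s actual code) that converts latitude and longitude into screen positions and draws a dot for each point:

    // Not the book's code, just the gist: a handful of made-up points,
    // placed on screen by linearly mapping longitude to x and latitude to y.
    float[] lon = { -122.3, -87.6, -74.0 };  // roughly Seattle, Chicago, New York
    float[] lat = {   47.6,  41.9,  40.7 };

    void setup() {
      size(640, 400);
      background(255);
      fill(0);
      noStroke();
      for (int i = 0; i < lon.length; i++) {
        float x = map(lon[i], -125, -66, 0, width);  // longitude bounds of the lower 48
        float y = map(lat[i], 24, 50, height, 0);    // flipped, so north is up
        ellipse(x, y, 6, 6);
      }
    }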

The chapters that follow cover six more projects, including salary vs. performance (Chapter 5) and zipdecode (Chapter 6), followed by more advanced topics dealing with trees, treemaps, hierarchies, and recursion (Chapter 7), plus graphs and networks (Chapter 8).

This site is used for follow-up code and writing about related topics.