introduction to computational molecular biology
CLUSTERING WITH CAST | final project
This project is based on an implementation of the CAST algorithm for data clustering, first presented in Clustering Gene Expression Patterns by Ben-Dor et al.
In my implementation, I was particularly interested in making the algorithm demonstrate how it worked by iteratively updating a visual image of its progress in clustering the data set.
The applet below is a demonstration version that runs on test data in
a similarity matrix. Click the applet to restart it, or reload the
page in your browser to do the same.
Once the similarity matrix portion was working with the CAST algorithm, I worked on applying this to microarray data. I implemented a reader for Eisen's .cdt format files, and then used the data from Figure 3 in Eisen, et al. as input to the system. I also added a visual representation of the current sorting of the MicroArray data at the righthand side of the similarity matrix image. Click the image below for a quicktime movie the Eisen data being sorted using the same process. The file is large (24M) but will progressively download so your machine doesn't die.
Because microarray data produces a great deal of information, the program tends to be very memory/speed intensive. I got fairly good performance from a 933 MHz machine, even while running in Java. I had to implement some simple caching methods so that similarity matrix data could be cached to the disk and not rebuilt from scratch each time (this eventually saved lots of debugging time).
Data size had implications on the visual side as well, it was necessary to downsample the data so that it would fit on-screen. In this example, for instance, ~2600 elements in the matrix are scaled to fit in a few hundred pixels using a simple averaging of pixels.
For the curious, the source code for this project is available here. It's formatted using emacs-style tabs, so it may be difficult to read in a web browser (or pretty much anything besides emacs). The code has some goodies in it (the CAST implementation itself, some fun caching things, and a very compact piece of code that writes uncompressed tiff files). Most of the files should be named appropriately (put your bio thinking cap on and try to figure out what "MicroArray.java" is for). Although I tend to despise this sort of silliness, the architecture is quite pluggable that if someone were so inclined (which supposedly would have been me when I was writing this code...) one might add other methods of clustering, or other file formats for input, or visuals for output.
ben fry | 12 dec 00
Eisen, Michael B. Spellman, Paul T. Brown, Patrick O. Botstein, David. Cluster analysis and display of genome-wide expression patterns . Proceedings of the National Academy of Sciences. Vol.95, pp.14863-14868, December 1998.