18.417
introduction to computational molecular biology
CLUSTERING WITH CAST   |   final project




This project is based on an implementation of the CAST algorithm for data clustering, first presented in Clustering Gene Expression Patterns by Ben-Dor et al.

In my implementation, I was particularly interested in making the algorithm demonstrate how it worked by iteratively updating a visual image of its progress in clustering the data set.

The applet below is a demonstration version that runs on test data in a similarity matrix. Click the applet to restart it, or reload the page in your browser to do the same.



This browser does not support Java, the applet cannot be viewed.



Once the similarity matrix portion was working with the CAST algorithm, I worked on applying this to microarray data. I implemented a reader for Eisen's .cdt format files, and then used the data from Figure 3 in Eisen, et al. as input to the system. I also added a visual representation of the current sorting of the MicroArray data at the righthand side of the similarity matrix image. Click the image below for a movie the Eisen data being sorted using the same process. The video encoding process darkens the image badly, though you can still just barely see the red and green of the microarray data on the right-hand side.






Because microarray data produces a great deal of information, the program tends to be very memory/speed intensive. I got fairly good performance from a 933 MHz machine, even while running in Java. I had to implement some simple caching methods so that similarity matrix data could be cached to the disk and not rebuilt from scratch each time (this eventually saved lots of debugging time).

Data size had implications on the visual side as well, it was necessary to downsample the data so that it would fit on-screen. In this example, for instance, ~2600 elements in the matrix are scaled to fit in a few hundred pixels using a simple averaging of pixels.

ben fry   |   12 dec 00






Ben-Dor, Amir. Shamir, Ron. Yakhini, Zohar. Clustering Gene Expression Patterns. Journal of Computational Biology (Mary Ann Liebert, Inc.) Volume 6, Numbers 3/4, 1999. pp. 281-297.

Eisen, Michael B. Spellman, Paul T. Brown, Patrick O. Botstein, David. Cluster analysis and display of genome-wide expression patterns . Proceedings of the National Academy of Sciences. Vol.95, pp.14863-14868, December 1998.