Writing

Renting Big Data

logo_aws.gifBack in December (or maybe even November… sorry, digging out my inbox this morning) Amazon announced the availability of public data sets for their Elastic Compute Cloud platform:

Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.

The current lists includes ENSEMBL (550 GB), GenBank (250 GB), various collections from the US Census (about 500 GB), and a handful of others (with more promised). I’m excited about the items under the “Economy” heading, since lots of that information has to date been difficult to track down in one place and in a single format.

While it may be possible to download these as raw files from FTP servers from their original sources, it’s already set up for you, rather than running rsync or ncftp for twenty-four hours, then spending an afternoon setting up a Linux server with MySQL and lots of big disk space, and dealing with various issues regarding versions of Apache, MySQL, PHP, different Perl modules to be installed, permissions to be fixed, etc. etc. (Can you tell the pain is real?)

As I understand it, you start with a frozen version of the database, then import that into your own workspace on AWS, and pay only for the CPU time, storage, and bandwidth that you actually use. Pricing details are here, but wear boots — there’s a lotta cloud marketingspeak to wade through.

(Thanks to Andrew Otwell for the nudge.)

Sunday, January 25, 2009 | acquire, data, goinuptotheserverinthesky  

The cloud over the rainforest brings a thunderstorm

And now, the opposite of the Amazon plot posted yesterday. No sooner had I finished writing about their online aptitude that they have a major site outage, greeting visitors with a Http/1.1 Service Unavailable message.

amazon-down-500.jpg

Plot from this article on News.com.

Friday, June 6, 2008 | goinuptotheserverinthesky, notafuturist  

What’s that big cloud over the rainforest?

As the .com shakeout loomed in the late 90s, I always assumed that:

  1. Most internet-born companies would disappear.
  2. Traditional (brick & mortar) stores would eventually get their act together and have (or outsource) a proper online presence. For instance Barnes & Noble hobbling toward a usable site, and Borders just giving up and turning over their online presence to Amazon. The former comical, the latter brilliant, though Borders has just returned with their own non-Amazonian presence. (Though I think the humor is now gone from watching old-school companies trying to move online.)
  3. Finally, a few new names—namely the biggest ones, like Amazon—would be left that didn’t disappear with the others from point #1.

Basically, that not much would change. A couple new brands would emerge, but that there wasn’t really room in people’s heads for that many new retailers or services. (It probably didn’t help that all their logos were blue and orange, and had names like Flooz, Boo and Kibu that feel natural on the tongue and inspire buyer loyalty and confidence.)

aws_bandwidth.gifBut not only did more companies stick around, some seem to be successfully pivoting into other areas. From Amazon:

In January of 2008 we announced that the Amazon Web Services now consume more bandwidth than do the entire global network of Amazon.com retail sites.

This from a blog post with this plot of the bandwidth use for both sides of the business.

Did you imagine that the site where you could buy books cheaper than anywhere else in 1998 would ten years later exceed the bandwidth from that with services for data storage and cloud computing? Of course, this announcement doesn’t say anything about their profits at this point, but I don’t think anyone expected Steve Jobs to turn Apple into a toy factory and start turning out music players and cell phones to have it become half their business within just a few years. (That’s half as in, “beastly silver PCs and shiny black and white laptops seem important and all, but those take real work…why bother?”)

But the point (aside from subjecting you to a long-winded description of .com history and my shortcomings as a futurist) has more to do with Amazon becoming a business that’s dealing purely in information. The information economy is all about people moving bits and ideas around (abstractions of things), instead of silk, furs, and spices (actual physical things). And while books are information, the growth of Amazon’s data services business—as evidenced by that graph—is one of the strongest indicators I’ve seen of just how real the non-real information economy has become. Not that the information economy is something new; but that the groundwork has been laid in the preceding decades where something like Amazon Web Services can be successful.

And since we’re on the subject of Amazon, I’ll close with more from Jeff Bezos from “How the Web Was Won” in this month’s Vanity Fair:

When we launched, we launched with over a million titles. There were countless snags. One of my friends figured out that you could order a negative quantity of books. And we would credit your credit card and then, I guess, wait for you to deliver the books to us. We fixed that one very quickly.

Or showing his genius early on:

When we started out, we were packing on our hands and knees on these cement floors. One of the software engineers that I was packing next to was saying, You know, this is really killing my knees and my back. And I said to this person, I just had a great idea. We should get kneepads. And he looked at me like I was from Mars. And he said, Jeff, we should get packing tables.

Thanks to Eugene for passing along the links.

Thursday, June 5, 2008 | goinuptotheserverinthesky, infographics, notaneconomist  
Book

Visualizing Data Book CoverVisualizing Data is my 2007 book about computational information design. It covers the path from raw data to how we understand it, detailing how to begin with a set of numbers and produce images or software that lets you view and interact with information. When first published, it was the only book(s) for people who wanted to learn how to actually build a data visualization in code.

The text was published by O’Reilly in December 2007 and can be found at Amazon and elsewhere. Amazon also has an edition for the Kindle, for people who aren’t into the dead tree thing. (Proceeds from Amazon links found on this page are used to pay my web hosting bill.)

Examples for the book can be found here.

The book covers ideas found in my Ph.D. dissertation, which is the basis for Chapter 1. The next chapter is an extremely brief introduction to Processing, which is used for the examples. Next is (chapter 3) is a simple mapping project to place data points on a map of the United States. Of course, the idea is not that lots of people want to visualize data for each of 50 states. Instead, it’s a jumping off point for learning how to lay out data spatially.

The chapters that follow cover six more projects, such as salary vs. performance (Chapter 5), zipdecode (Chapter 6), followed by more advanced topics dealing with trees, treemaps, hierarchies, and recursion (Chapter 7), plus graphs and networks (Chapter 8).

This site is used for follow-up code and writing about related topics.