Fascinating stuff: Google's custom filesystem architectures, that is.
Was just reading about their new indexing system, codenamed "Caffeine", which also delves into the backend workings of what I'd like to call the "Googleplex". In the early days of Google Inc., they created the Google File System, or GFS for short, which powers the services behind their public infrastructure, like BigTable (a distributed real-time database) and MapReduce (a number-crunching platform).
While GFS works fine for those purposes, with the introduction of AJAX-driven and multimedia-heavy services like GMail and YouTube, it just does not cut it anymore.
Ten years ago, GFS was a state-of-the-art scalable distributed filesystem with a single-master design (one master node coordinating several slave nodes), resulting in a rather monolithic infrastructure with a single point of failure. And while indexing and searching work fine this way, the new dynamic web applications demand low-latency storage for faster I/O, which GFS was not designed for.
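To make the single point of failure concrete, here is a toy sketch of a GFS-style lookup. All class and server names are hypothetical, not Google's actual API: the one master holds all file-to-chunk metadata, clients ask it where chunks live and then fetch data from the chunkservers, so when the master goes down, every lookup fails.

```python
# Toy single-master filesystem sketch (all names hypothetical).
class Master:
    def __init__(self):
        self.alive = True
        # filename -> list of (chunk_id, chunkserver) pairs
        self.chunk_table = {
            "/webindex/part-0001": [("c17", "chunkserver-a"),
                                    ("c18", "chunkserver-b")],
        }

    def locate(self, filename):
        if not self.alive:
            raise ConnectionError("master down: whole filesystem unavailable")
        return self.chunk_table[filename]

class Client:
    def __init__(self, master):
        self.master = master

    def read(self, filename):
        # every read starts with a metadata round-trip to the ONE master
        locations = self.master.locate(filename)
        return [f"data({chunk}) from {server}" for chunk, server in locations]

master = Master()
client = Client(master)
print(client.read("/webindex/part-0001"))

master.alive = False          # the single master fails...
try:
    client.read("/webindex/part-0001")
except ConnectionError as e:
    print("read failed:", e)  # ...and every client is stuck
```

Note that the extra metadata round-trip on every read is also where the latency cost hides, which is fine for batch indexing but painful for interactive services.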
In the 21st century, technological achievements and faster bandwidth allow for a faster infrastructure, so for the last two years Google has been busy revamping GFS into what is now known as GFS2. The main difference between the first revision and the new one is dynamic distribution.
"GFS2 not only utilizes distributed slaves, but distributed masters as well".
For now, Caffeine runs in a single Google data center, which suggests they've only deployed GFS2 at that one location. As for migrating data later on, über-Googler Matt Cutts downplays the risk and hassle, saying it's a matter of taking down one data center at a time. As demand for bigger and faster storage grows, and people want more interactive and dynamic content and services, this is a natural progression for information technology in general. Nor does it seem surprising, given how fast our personal technology has advanced over the last few years.
"Today Google - tomorrow, the rest of the world..."
Google's main philosophy when it comes to storage and multi-processing is to unify their entire infrastructure, treating its vast farm of thousands of computers housed in data centers as one single big virtual machine. Or, as defined in a Google Research paper, a "Warehouse-Scale Machine", or "WSM" for short.
So, "Today Caffeine - tomorrow, everything else..."
Update, Thursday November 12th:
I just found a schematic image of Google's "Project 02" data center. It can be found here.