Clustera: A Data-Centric Approach to Scalable Cluster Management

David J. DeWitt
Microsoft Jim Gray Systems Lab & Computer Sciences Department, University of Wisconsin at Madison

Twenty-five years ago, when we built our first cluster management system using a collection of twenty VAX 11/750 computers, the idea of a compute cluster was an exotic concept. Today, clusters of 1,000 nodes are common, and some of the biggest have in excess of 10,000 nodes. Such clusters are simply awash in data about machines, users, jobs, and files. Many of the tasks that such systems are asked to perform are very similar to database transactions. For example, the system must accept jobs from users and send them off to be executed. The system should not "drop" jobs or lose files due to hardware or software failures. The software must also allow users to stop failed computations or "change their mind" and retract thousands of submitted but not yet completed jobs. Amazingly, no cluster management system that we are aware of uses a database system for managing its data.

In this talk I will describe Clustera, a new cluster management system we have been working on for the last three years. As one would expect from some database types, Clustera uses a relational DBMS to store all its operational data, including information about jobs, users, machines, and files (executable, input, and output). One unique aspect of the Clustera design is its use of an application server (currently JBoss) in front of the relational DBMS. Application servers have a number of appealing capabilities. First, they can handle tens of thousands of clients. Second, they provide fault tolerance and scalability by running on multiple server nodes. Third, they multiplex connections to the database system down to a level that the database system can comfortably support.
Compute nodes in a Clustera cluster appear as web clients to the application server and make SOAP calls to submit requests for jobs to execute and to update status information that is stored in the relational database.

Extensibility is a second key goal of the Clustera project. Traditional cluster management systems such as Condor were targeted toward long-running, computationally intensive jobs. Newer systems such as Map-Reduce are targeted toward a specific type of data-intensive parallel computation. Parallel SQL database systems represent a third type of cluster management system. The Clustera framework was designed to handle each of these classes of jobs in a common execution and data framework.
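
The data-centric idea described above (job state kept in relational tables, compute nodes asking for work, and users retracting jobs that have not yet completed) can be illustrated with a minimal sketch. This is not Clustera's schema or API: the table layout, the state names, and the functions `open_store`, `submit_jobs`, `request_job`, and `retract_jobs` are all hypothetical, and an in-memory SQLite database stands in for the relational DBMS behind the application server.

```python
import sqlite3


def open_store():
    """Create an in-memory store with a hypothetical jobs table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        " id INTEGER PRIMARY KEY,"
        " owner TEXT NOT NULL,"
        " executable TEXT NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'submitted',"
        " node TEXT)"
    )
    return conn


def submit_jobs(conn, owner, executable, count):
    """Insert `count` jobs in one transaction: all are recorded or none."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO jobs (owner, executable) VALUES (?, ?)",
            [(owner, executable)] * count,
        )


def request_job(conn, node):
    """On behalf of a compute node, claim the oldest submitted job (or None)."""
    with conn:
        row = conn.execute(
            "SELECT id FROM jobs WHERE state = 'submitted' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE jobs SET state = 'running', node = ? WHERE id = ?",
            (node, row[0]),
        )
        return row[0]


def retract_jobs(conn, owner):
    """Mark every not-yet-completed job of `owner` as retracted.

    Returns the number of jobs affected. A real system would also have to
    stop the jobs that are already running; that step is omitted here.
    """
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET state = 'retracted'"
            " WHERE owner = ? AND state IN ('submitted', 'running')",
            (owner,),
        )
        return cur.rowcount
```

In this sketch, a user who submits a thousand jobs and then changes their mind issues a single `retract_jobs` call, and the database's transaction machinery guarantees that the retraction applies to all of the pending jobs or to none of them. That is the kind of bookkeeping the abstract argues a cluster manager should delegate to a DBMS.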