GPS: A Graph Processing System

GPS: A Graph Processing System

Overview

GPS is an open-source system for scalable, fault-tolerant, and easy-to-program execution of algorithms on extremely large graphs. GPS is similar to Google’s proprietary Pregel system, and Apache Giraph. GPS is a distributed system designed to run on a cluster of machines, such as Amazon's EC2.

In systems such as GPS and Pregel, the input graph (directed, possibly with values on edges) is distributed across machines and vertices send each other messages to perform a computation. Computation is divided into iterations called supersteps. Analogous to the map() and reduce() functions of the MapReduce framework, in each superstep a user-defined function called vertex.compute() is applied to each vertex in parallel. The user expresses the logic of the computation by implementing vertex.compute(). This design is based on Valiant's Bulk Synchronous Parallel model of computation. A detailed description can be found in the original Pregel paper.

There are five main differences between Pregel and GPS:

GPS is open-source.
GPS extends Pregel's API with a master.compute() function, which enables easy and efficient implementation of algorithms that are composed of multiple vertex-centric computations, combined with global computations.
GPS has an optional dynamic repartitioning scheme, which reassigns vertices to different machines during graph computation to improve performance, based on observing communication patterns.
GPS has an optimization called LALP that reduces the network I/O in when running certain algorithms on real-world graphs that have skewed degree distributions.
GPS programs can be implemented using a higher-level domain specific language called Green-Marl, and automatically compiled into native GPS code. Green-Marl is a traditional imperative language with several graph-specific language constructs that enable intuitive and simple expression of complicated algorithms.

We have completed an initial version of GPS, which is available to download. We have run GPS on up to 100 Amazon EC2 large instances and on graphs of up to 250 million vertices and 10 billion edges.

Using GPS we have studied the effects on system performance of different ways of partioning the graph, both before the graph computation starts and during the computation. A detailed description of GPS and our work on partitioning can be found in our paper published in SSDBM in July 2013.

Using our Green-Marl compiler, we were able to implement several algorithms such as Approximate Betweenness Centrality and Conductance, whose native GPS implmentations are very challenging. A detailed description of our work on the Green-Marl compiler can be found in our paper from CGO 2014.. Details about Green-Marl can be found in the original Green-Marl paper.

We have been exploring the problem of implementing graph algorithms efficiently on Pregel-like systems. We have done extensive implementations and evaluations of several fundamental graph algorithms such as computing strongly and weakly connected components (SCC), minimum spanning trees, a coloring, and approximate maximum matchings of large-scale graphs. We have observed that standard implementations of graph algorithms can incur unnecessary inefficiencies such as slow convergence or high communication or computation cost in Pregel-like systems, typically due to structural properties of the input graphs such as large diameters or skew in component sizes. We have developed several optimization techniques to address these inefficiencies. Our most general technique is based on the idea of performing some serial computation on a tiny fraction of the input graph, complementing Pregel’s vertex-centric parallelism. A detailed description of our work on optimizations and algorithms can be found in our paper that appeared in VLDB 2014.

In another work related GPS, though not directly built on top of GPS, we identify a set of high-level primitives for distributed processing of large-scale graphs. The motivation for our work is the observation that the APIs of current distributed graph systems, that contain a set of functions, such as the vertex.compute() of Pregel-like systems, or the gather(), apply(), and scatter() of Powergraph, are too low-level and difficult to program for certain computations. Programming these APIs yields long and complex programs for certain computations (for examples, see the implementations of the minimum spanning tree or the strongly connected components algorithms on GPS codebase). Similar issues with the MapReduce framework have led to widely-used languages such as Pig Latin and Hive, which offer high-level primitives for large-scale data processing. We took a similar approach for graph processing and introduced HelP, a set of high-level primitives that abstract some of the commonly used operations in distributed graph processing. HelP is implemented as a library ontop of the open-source GraphX distributed system. We currently do not have an implementation of the HelP primitives that compile to GPS. Details of the HelP primitives can be found in our paper that appeared at the GRADES 2014 workshop.

In our latest work, we address the problem of debugging programs written for Pregel-like systems, which currently can be a tedious and long manual process. After interviewing Giraph and GPS users, we developed Graft. Graft supports the debugging cycle that users typically go through: (1) Users describe programmatically the set of vertices they are interested in inspecting. During execution, Graft captures the context information of these vertices across supersteps. (2) Using Graft's GUI, users visualize how the values and messages of the captured vertices change from superstep to superstep,narrowing in suspicious vertices and supersteps. (3) Users replay the exact lines of the vertex.compute() function that executed for the suspicious vertices and supersteps, by copying code that Graft generates into their development environments' line-by-line debuggers. Graft also has features to construct end-to-end tests for Giraph programs. Graft is open-source and fully integrated into Apache Giraph's main code base. The details of the system can be found in our technical report, and the code and instructions on how to use Graft can be found on github.

The GPS project is supported by the National Science Foundation (grant IIS-0904497), KAUST, and an Amazon Web Services Research Grant.

GPS Source Code and Documentation

The GPS source code is available as open-source code under the BSD license. The source code can be found here. Here is the GPS documentation site containing information about how to download, set-up and run GPS as well as the GPS extensions to Pregel and the GPS API. Please join the user group stanfordgpsusers @ googlegroups.com for any questions/problems you might have in setting GPS up (the email group is public to everyone, you can join the group from Google Groups website).

Publications

S. Salihoglu and J. Widom. GPS: A Graph Processing System. SSDBM, July 2013

S. Hong, S. Salihoglu, J. Widom and K. Olukotun. Simplifying Scalable Graph Processing with a Domain-Specific Language. CGO, February 2014

S. Salihoglu and J. Widom. Optimizing Graph Algorithms on Pregel-like Systems. To appear at VLDB, September 2014.

S. Salihoglu and J. Widom. HelP: High-level Primitives For Large-Scale Graph Processing. GRADES Workshop on Graph Data management Experiences and Systems, June 2014.

S. Salihoglu, J. Shin, V. Khanna, B. Truong and J. Widom. Graft: A Debugging Tool For Apache Giraph. SIGMOD Demonstration, June 2015.

People

Faculty

Jennifer Widom
Students
- Semih Salihoglu
Users
- Arash Fard
- Amirreza Abdolrashidi
- Sai Wu
- Jawe Wang
- Sungpack Hong
- Jaeho Shin

Last edited by Semih Salihoglu, November 2014