Sedge is a software framework that supports large scale graph processing. Essentially, Sedge is inspired by Google’s Pregel. In Sedge, the input graph is usually split into several non-overlapping parts which are distributed to workers and then processed in a distributed manner. In general, the applications running in Sedge consists of several supersteps, in each of which a vertex can receive messages sent in the previous iteration, execute user-defined functions and send messages to other vertices. This vertex centric approach has been shown to be comprehensive and efficient enough to express a broad set of graph algorithms.
However, the Pregel like model is simply blind of intensive inter-machine communication and unbalanced workload. To cope with these challenges, Sedge provides an effecient graph partition management module, which is the key to improve query execution. The module adds important functions to support overlapping partitions, with the goal to process local graph queries faster. Sedge is able to minimize inter-machine communication by dynamically adapting graph partitions to query workload change, as well as data change.
Sedge is designed to eliminate the inter-machine communication as much as possible. As shown in the figure, the offine part first partitions the input graph in a distributed manner and distributes them to multiple workers. It creates multiple partition sets (i.e., complementary partitions) so that each set runs independently. Pregel is a scalable distributed graph processing framework that works in a bulk synchronous mode. Pregel is used as a computing platform that is able to execute local graph queries. There are various kinds of local graph queries including breadth-first search, random walk, and SPARQL queries. Unlike many graph algorithms, a local query usually starts at one vertex and only involves a limited number of vertices (termed active vertice). In each iteration, a Pregel instance only accesses active vertices, thus eliminating many synchronous steps.
The online part collects statistical information from workers and actively generates and removes partitions to accommodate the changing workload. We have a set of online techniques built in Sedge that are very effcient to minimize overhead.
For more details about the architecture and the performance issues, please refer to the documents