In this post, we'll look at Snailtrail, a tool to diagnose latency performance issues for distributed dataflows which has been developed in the Systems Group at ETH Zurich. It allows to answer the question of where are potential latency bottlenecks in a distributed streaming dataflow computation. Snailtrail can be applied to many distributed streaming applications. Only a lightweight stream of trace data is required, we'll go into details about it later. Snailtrail does the hard work of constructing an activity graph for time-based windows and ranking activities according to the critical participation, a novel metric we introduce. In this post, we'll walk through the concepts of activity graphs, time-based windows, and critical participation. Snailtrail currently supports Flink, Spark, Heron, TensorFlow and Timely dataflow.
Saniltrail provides a different approach to existing system profiling techniques, which are mostly based on aggregate performance counters. Performance counters can be very useful for diagnosing many different problems, but most importantly they do not capture the order and dependencies of tasks that are executed. The lack of detailed information makes it hard to interpret aggregate performance metrics to troubleshoot latency problems, or even worse, can be misleading. A representative example for misleading metrics is from Apache Spark. Spark is a distributed computing framework where a centralized scheduler assigns tasks to workers and then waits for all workers to terminate. Only after all workers have written their tasks' results to persistent storage, the controller will schedule another round of work. This has a clear benefit that fault-tolerance is easy to provide. It also means that the scheduler synchronizes the workers after every computation step with a global barrier. Such an execution trace is drawn below: TODO. Traditional profiling simply adds up the time spent in various components of the system. Most time is spent by the workers performing their tasks and a little bit of time is spent in the scheduler deciding what to compute next. It does not reveal the fact that the workers have to wait for everyone to finish before the scheduler will assign new work. Snailtrail solves this problem by taking dependencies between activities into account when ranking the activities with the critical participation.
To illustrate the problem, we ran a Yahoo Streaming benchmark on Spark and analyzed the trace both with Snailtrail and traditional profiling. The result from Snailtrail's Critical Participation metric clearly highlights the driver's scheduling activity as a potential latency bottleneck whereas traditional profiling does not. TODO
TODO: Mention other related work: Pivot tracing, Coz, Stitch, Vscope.
Snailtrail was presented at NSDI'18.