Snailtrail: Generalizing critical paths for online analysis of distributed dataflows
In this post, we'll look at Snailtrail, a tool for diagnosing latency issues in distributed dataflows, developed in the Systems Group at ETH Zurich. It helps answer the question of where the potential latency bottlenecks are in a distributed streaming dataflow computation. Snailtrail can be applied to many distributed streaming applications, and it requires only a lightweight stream of trace data, which we'll describe in detail later. Snailtrail does the hard work of constructing an activity graph for time-based windows and ranking activities by their critical participation, a novel metric we introduce. We'll walk through each of these concepts in turn: activity graphs, time-based windows, and critical participation. Snailtrail currently supports Flink, Spark, Heron, TensorFlow, and Timely dataflow.
Snailtrail was presented at NSDI'18.