===== Traditional profiling =====
Snailtrail takes a different approach from existing system profiling techniques, which are mostly based on aggregate performance counters. Performance counters can be very useful for diagnosing many problems, but crucially they do not capture the order of, and dependencies between, the tasks being executed. This lack of detail makes aggregate performance metrics hard to interpret when troubleshooting latency problems; worse, they can be outright misleading. A representative example of misleading metrics comes from Apache Spark. Spark is a distributed computing framework in which a centralized scheduler assigns tasks to workers and then waits for all workers to terminate. Only after all workers have written their tasks' results to persistent storage does the controller schedule another round of work. The clear benefit of this design is that fault tolerance is easy to provide. It also means, however, that the scheduler synchronizes the workers after every computation step with a global barrier.
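To see why a global barrier makes aggregate counters misleading, consider this small illustrative calculation (not Snailtrail code; the worker timings are invented): a whole-run utilization figure can look unremarkable while a single straggler in one stage dominates the end-to-end latency.

```python
# Hypothetical example: aggregate utilization vs. per-stage straggler effect
# under a Spark-style global barrier. All durations are made up.

# Per-stage task durations (seconds) for three workers; a stage only
# finishes when its slowest worker does, because of the global barrier.
stages = [
    {"w1": 2.0, "w2": 2.1, "w3": 8.0},  # stage 0: w3 is a straggler
    {"w1": 3.0, "w2": 2.9, "w3": 3.1},  # stage 1: balanced
]
n_workers = 3

total_busy = sum(t for stage in stages for t in stage.values())
wall_clock = sum(max(stage.values()) for stage in stages)  # wait for slowest

# Whole-run aggregate utilization: a single number that hides the problem.
utilization = total_busy / (wall_clock * n_workers)
print(f"aggregate utilization: {utilization:.0%}")

# The per-stage view reveals that stage 0 is dominated by one straggler.
for i, stage in enumerate(stages):
    slack = max(stage.values()) - min(stage.values())
    print(f"stage {i}: wall clock {max(stage.values()):.1f}s, "
          f"straggler slack {slack:.1f}s")
```

The aggregate figure (around 63% here) suggests a moderately loaded cluster, while the per-stage breakdown shows that nearly all of stage 0's latency is attributable to a single slow worker.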
To illustrate the problem, we ran a Yahoo Streaming benchmark on Spark and analyzed the trace both with Snailtrail and traditional profiling. The result from Snailtrail'
TODO: Mention other related work: Pivot tracing, Coz, Stitch, Vscope.
===== Critical path analysis =====
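The core idea of critical path analysis can be sketched as follows: given activities with durations and "must finish before" dependencies, the critical path is the longest chain of dependent work, which lower-bounds end-to-end latency. A minimal illustrative sketch (the activity names, durations, and graph are invented, not taken from Snailtrail):

```python
# Illustrative critical-path computation over a small activity DAG.
# Activities, durations, and dependencies are invented for illustration.
from functools import lru_cache

durations = {"read": 2, "map1": 4, "map2": 7, "join": 3, "write": 1}
deps = {  # activity -> activities that must finish before it starts
    "read": [],
    "map1": ["read"],
    "map2": ["read"],
    "join": ["map1", "map2"],
    "write": ["join"],
}

@lru_cache(maxsize=None)
def longest_finish(node):
    """Earliest possible finish time of `node`: the longest chain into it."""
    start = max((longest_finish(d) for d in deps[node]), default=0)
    return start + durations[node]

def critical_path():
    """Walk back from the latest-finishing activity along the longest chain."""
    node = max(durations, key=longest_finish)
    path = [node]
    while deps[path[-1]]:
        path.append(max(deps[path[-1]], key=longest_finish))
    return list(reversed(path))

print(critical_path())          # ['read', 'map2', 'join', 'write']
print(longest_finish("write"))  # 13: total critical-path length
```

Speeding up any activity off this path (here, `map1`) cannot reduce end-to-end latency, which is exactly why a critical-path view is more actionable than aggregate counters.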
Snailtrail was presented at [[https://
articles/snailtrail.1537273355.txt.gz · Last modified: 2018/09/18 12:22 by moritz