===== Traditional profiling =====
Snailtrail takes a different approach from existing system profiling techniques, which are mostly based on aggregate performance counters. Performance counters can be very useful for diagnosing many problems, but crucially they do not capture the order of, and dependencies between, the tasks being executed. This lack of detail makes aggregate performance metrics hard to interpret when troubleshooting latency problems; worse, they can be outright misleading. A representative example of misleading metrics comes from Apache Spark. Spark is a distributed computing framework in which a centralized scheduler assigns tasks to workers and then waits for all workers to terminate. Only after all workers have written their tasks' results to persistent storage does the controller schedule another round of work. The clear benefit of this design is that fault tolerance is easy to provide. It also means, however, that the scheduler synchronizes the workers after every computation step with a global barrier.
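To see why a global barrier makes aggregate counters misleading, consider this small illustrative calculation (not Snailtrail code; the worker timings are invented): a whole-run utilization figure can look unremarkable while a single straggler in one stage dominates the end-to-end latency.

```python
# Hypothetical example: aggregate utilization vs. per-stage straggler effect
# under a Spark-style global barrier. All durations are made up.

# Per-stage task durations (seconds) for three workers; a stage only
# finishes when its slowest worker does, because of the global barrier.
stages = [
    {"w1": 2.0, "w2": 2.1, "w3": 8.0},  # stage 0: w3 is a straggler
    {"w1": 3.0, "w2": 2.9, "w3": 3.1},  # stage 1: balanced
]
n_workers = 3

total_busy = sum(t for stage in stages for t in stage.values())
wall_clock = sum(max(stage.values()) for stage in stages)  # wait for slowest

# Whole-run aggregate utilization: a single number that hides the problem.
utilization = total_busy / (wall_clock * n_workers)
print(f"aggregate utilization: {utilization:.0%}")

# The per-stage view reveals that stage 0 is dominated by one straggler.
for i, stage in enumerate(stages):
    slack = max(stage.values()) - min(stage.values())
    print(f"stage {i}: wall clock {max(stage.values()):.1f}s, "
          f"straggler slack {slack:.1f}s")
```

The aggregate figure (around 63% here) suggests a moderately loaded cluster, while the per-stage breakdown shows that nearly all of stage 0's latency is attributable to a single slow worker.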
To illustrate the problem, we ran a Yahoo Streaming benchmark on Spark and analyzed the trace both with Snailtrail and traditional profiling. The result from Snailtrail'
TODO: Mention other related work: Pivot tracing, Coz, Stitch, Vscope.
===== Critical path analysis =====
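The core idea of critical path analysis can be sketched as follows: given activities with durations and "must finish before" dependencies, the critical path is the longest chain of dependent work, which lower-bounds end-to-end latency. A minimal illustrative sketch (the activity names, durations, and graph are invented, not taken from Snailtrail):

```python
# Illustrative critical-path computation over a small activity DAG.
# Activities, durations, and dependencies are invented for illustration.
from functools import lru_cache

durations = {"read": 2, "map1": 4, "map2": 7, "join": 3, "write": 1}
deps = {  # activity -> activities that must finish before it starts
    "read": [],
    "map1": ["read"],
    "map2": ["read"],
    "join": ["map1", "map2"],
    "write": ["join"],
}

@lru_cache(maxsize=None)
def longest_finish(node):
    """Earliest possible finish time of `node`: the longest chain into it."""
    start = max((longest_finish(d) for d in deps[node]), default=0)
    return start + durations[node]

def critical_path():
    """Walk back from the latest-finishing activity along the longest chain."""
    node = max(durations, key=longest_finish)
    path = [node]
    while deps[path[-1]]:
        path.append(max(deps[path[-1]], key=longest_finish))
    return list(reversed(path))

print(critical_path())          # ['read', 'map2', 'join', 'write']
print(longest_finish("write"))  # 13: total critical-path length
```

Speeding up any activity off this path (here, `map1`) cannot reduce end-to-end latency, which is exactly why a critical-path view is more actionable than aggregate counters.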
Snailtrail was presented at [[https://
articles/snailtrail.1537273355.txt.gz · Last modified: 2018/09/18 12:22 by moritz