diff --git a/doc/systems.md b/doc/systems.md
index d57aa914c86553000b7d846f017e5b11079de95a..bc5be38e209ae4fc76b4a4222e2c69f6e6600be9 100644
--- a/doc/systems.md
+++ b/doc/systems.md
@@ -4,6 +4,77 @@
 
 ## Profiling
 
+### Profiling with PyTorch
+
+CPU and GPU profiling under PyTorch is done through the PyTorch profiler, which uses [Kineto](https://github.com/pytorch/kineto) under the hood.
+For more information about profiling, see the [PyTorch profiler docs](https://pytorch.org/tutorials/beginner/profiler.html).
+
+Let us now consider the scenario where we want to profile inference under Llama. This is the call to
+`chat_completion` in [dialog.py](https://gitlab.cs.pub.ro/netsys/llama/-/blob/main/dialog.py#L71) from
+our Llama repository. We wrap the call with the profiler as follows:
+
+```Python
+with torch.profiler.profile(
+    on_trace_ready=torch.profiler.tensorboard_trace_handler('./path/to/a/log_folder'),
+    activities=[torch.profiler.ProfilerActivity.CUDA, torch.profiler.ProfilerActivity.CPU],
+    profile_memory=True, with_stack=True,
+) as p:
+    results, ctx = generator.chat_completion(
+        [dialog],
+        max_gen_len=max_gen_len,
+        temperature=temperature,
+        top_p=top_p,
+    )
+```
+
+### Visualizing Traces
+
+The recommended approach for visualizing the traces is to use
+[Holistic Trace Analysis (HTA)](https://github.com/facebookresearch/HolisticTraceAnalysis) in a Jupyter notebook.
+
+First, we load the trace:
+
+```Python
+from hta.trace_analysis import TraceAnalysis
+analyzer = TraceAnalysis(trace_dir="path/to/trace/folder")
+```
+
+Next, we can use the [HTA API](https://hta.readthedocs.io/en/latest/) from the notebook. For example:
+
+```Python
+from hta.trace_analysis import TraceAnalysis
+
+# Load the trace
+analyzer = TraceAnalysis(trace_dir="log/llama_13B")
+
+# Get the GPU kernel breakdown
+kernel_type_metrics_df, kernel_metrics_df = analyzer.get_gpu_kernel_breakdown(
+    visualize=False, duration_ratio=0.8, num_kernels=5, include_memory_kernels=True)
+```
+
+```Python
+kernel_type_metrics_df
+
+                              kernel_type      sum  percentage
+0                             COMPUTATION  3887727        64.1
+1                           COMMUNICATION  2178302        35.9
+2                                  MEMORY     2256         0.0
+3        COMMUNICATION overlapping MEMORY        0         0.0
+4   COMPUTATION overlapping COMMUNICATION        0         0.0
+5          COMPUTATION overlapping MEMORY        0         0.0
+```
+
+> Most of the functions have a `visualize` argument which can be used to enable plots.
+
+Furthermore, there are several [experimental features](https://hta.readthedocs.io/en/latest/source/features/lightweight_critical_path_analysis.html) which will make the analysis easier:
+
+* CUPTI Counter Analysis: an experimental API to interpret GPU performance
+counters. It attributes performance measurements from kernels to PyTorch
+operators, and can help with kernel optimization and roofline analysis.
+
+* Lightweight Critical Path Analysis: an experimental API to compute the critical
+path in the trace. The critical path can help one understand whether an application is
+CPU bound, GPU compute bound, or communication bound. The path can be visualized on
+the original trace as well as manipulated as a directed acyclic graph object; a usage
+sketch follows below.
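+
+As a rough sketch of how the lightweight critical path analysis might be invoked: the
+`critical_path_analysis` and `overlay_critical_path_analysis` calls below follow the HTA
+experimental docs and may change, and the rank, `ProfilerStep` annotation, and paths are
+placeholders to adapt to the actual trace.
+
+```Python
+from hta.trace_analysis import TraceAnalysis
+
+analyzer = TraceAnalysis(trace_dir="log/llama_13B")
+
+# Build the critical path graph for one rank, scoped to a single annotated
+# region of the trace (placeholder values below).
+cp_graph, success = analyzer.critical_path_analysis(
+    rank=0, annotation="ProfilerStep", instance_id=0)
+
+# Overlay the computed critical path on the original trace so it can be
+# inspected in a trace viewer next to the other events.
+if success:
+    analyzer.overlay_critical_path_analysis(
+        0, cp_graph, output_dir="log/overlaid_trace")
+```
+
+The overlaid trace written to `output_dir` can then be opened like any other trace to see
+which events lie on the critical path.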
+
 ## Network communication
 
 ## Courses / books