Commit 12f0b56d authored by Alexandru-Mihai GHERGHESCU's avatar Alexandru-Mihai GHERGHESCU
Merge branch 'main-patch-3fa8' into 'main'

Introduce the Profiling section

See merge request !2
## Profiling
### Profiling with PyTorch
CPU and GPU profiling under PyTorch is done using [kineto](https://github.com/pytorch/kineto).
For more information about profiling, see the [PyTorch profiling docs](https://pytorch.org/tutorials/beginner/profiler.html).

Let us now consider the scenario where we want to profile inference under Llama, namely the call to
`chat_completion` from [dialog.py](https://gitlab.cs.pub.ro/netsys/llama/-/blob/main/dialog.py#L71) in
our Llama repository. We wrap the call in the profiler as follows:
```Python
with torch.profiler.profile(
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./path/to/a/log_folder'),
    activities=[torch.profiler.ProfilerActivity.CUDA,
                torch.profiler.ProfilerActivity.CPU],
    profile_memory=True, with_stack=True) as p:
    results, ctx = generator.chat_completion(
        [dialog],
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )
```
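Before profiling a full Llama run, the same context-manager pattern can be tried on a toy model. Below is a minimal CPU-only sketch (the `nn.Linear` model and tensor sizes are invented for illustration; for GPU runs, add `ProfilerActivity.CUDA` and a `tensorboard_trace_handler` as shown above):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 64)
x = torch.randn(32, 128)

# Profile a single forward pass on the CPU, recording memory usage as well.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    profile_memory=True,
) as p:
    y = model(x)

# Aggregate the recorded events and print the most expensive operators.
print(p.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```

When no trace handler is attached, `key_averages()` is the quickest way to inspect results directly, without going through TensorBoard.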
### Visualizing Traces
The recommended approach for visualizing the traces is to use [Holistic Trace Analysis (HTA)](https://github.com/facebookresearch/HolisticTraceAnalysis) in a Jupyter notebook.
First we load the trace:
```Python
from hta.trace_analysis import TraceAnalysis
analyzer = TraceAnalysis(trace_dir="path/to/trace/folder")
```
Next, we can use the [HTA API](https://hta.readthedocs.io/en/latest/) in a Jupyter notebook. For example:
```Python
from hta.trace_analysis import TraceAnalysis

# Load the trace
analyzer = TraceAnalysis(trace_dir="log/llama_13B")

# Get the GPU kernel breakdown
kernel_type_metrics_df, kernel_metrics_df = analyzer.get_gpu_kernel_breakdown(
    visualize=False, duration_ratio=0.8, num_kernels=5, include_memory_kernels=True
)
```
```Python
>>> kernel_type_metrics_df
                             kernel_type      sum  percentage
0                            COMPUTATION  3887727        64.1
1                          COMMUNICATION  2178302        35.9
2                                 MEMORY     2256         0.0
3       COMMUNICATION overlapping MEMORY        0         0.0
4  COMPUTATION overlapping COMMUNICATION        0         0.0
5         COMPUTATION overlapping MEMORY        0         0.0
```
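The `percentage` column is simply each kernel type's share of the summed durations. A quick stdlib sanity check against the numbers above:

```python
# Per-kernel-type sums (in the trace's time units) from the breakdown above
sums = {"COMPUTATION": 3887727, "COMMUNICATION": 2178302, "MEMORY": 2256}
total = sum(sums.values())

# Each type's share of total time, rounded to one decimal place
percentages = {k: round(100 * v / total, 1) for k, v in sums.items()}
print(percentages)  # -> {'COMPUTATION': 64.1, 'COMMUNICATION': 35.9, 'MEMORY': 0.0}
```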
> Most of the functions have a `visualize` argument, which can be used to enable plots.
Further, there are several [experimental features](https://hta.readthedocs.io/en/latest/source/features/lightweight_critical_path_analysis.html) that can make the analysis easier:
* CUPTI Counter Analysis: An experimental API to interpret GPU performance
counters. It attributes performance measurements from kernels to PyTorch
operators, and can help with kernel optimization and roofline analysis.
* Lightweight Critical Path Analysis: An experimental API to compute the critical
path in the trace. Critical path can help one understand if an application is CPU
bound, GPU compute bound or communication bound. The path can be visualized on
the original trace as well as manipulated as a directed acyclic graph object.
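To make the critical-path idea concrete, here is a toy longest-path computation over a hypothetical four-node trace DAG (plain Python, not the HTA API; the node names and durations are invented):

```python
from functools import lru_cache

# Hypothetical trace DAG: node -> (duration, successor nodes)
trace = {
    "cpu_launch": (2, ["gpu_kernel", "cpu_other"]),
    "gpu_kernel": (10, ["sync"]),
    "cpu_other": (3, ["sync"]),
    "sync": (1, []),
}

@lru_cache(maxsize=None)
def longest(node):
    """Length of the longest (critical) path starting at `node`."""
    duration, successors = trace[node]
    return duration + max((longest(s) for s in successors), default=0)

# The critical path cpu_launch -> gpu_kernel -> sync dominates: 2 + 10 + 1
print(longest("cpu_launch"))  # -> 13
```

Here the path through `gpu_kernel` bounds the end-to-end time, so this (toy) workload would be GPU compute bound; the HTA API performs the analogous computation on real trace events.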
## Network communication
## Courses / books