Commit 12f0b56d authored by Alexandru-Mihai GHERGHESCU's avatar Alexandru-Mihai GHERGHESCU
Merge branch 'main-patch-3fa8' into 'main'

Introduce the Profiling section

See merge request !2
## Profiling
### Profiling with PyTorch
CPU and GPU profiling under PyTorch is done using [kineto](https://github.com/pytorch/kineto).
For more information about profiling, see the [PyTorch profiling docs](https://pytorch.org/tutorials/beginner/profiler.html).

Let us now consider the scenario where we want to profile inference under Llama, namely the call to
`chat_completion` from [dialog.py](https://gitlab.cs.pub.ro/netsys/llama/-/blob/main/dialog.py#L71) in
our Llama repository. We wrap the call in the profiler as follows:
```Python
with torch.profiler.profile(
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./path/to/a/log_folder'),
    activities=[torch.profiler.ProfilerActivity.CUDA,
                torch.profiler.ProfilerActivity.CPU],
    profile_memory=True, with_stack=True) as p:
    results, ctx = generator.chat_completion(
        [dialog],
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )
```
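Before profiling a full Llama run, the same context-manager pattern can be tried on a toy model. Below is a minimal CPU-only sketch (the `nn.Linear` model and tensor sizes are invented for illustration; for GPU runs, add `ProfilerActivity.CUDA` and a `tensorboard_trace_handler` as shown above):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 64)
x = torch.randn(32, 128)

# Profile a single forward pass on the CPU, recording memory usage as well.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    profile_memory=True,
) as p:
    y = model(x)

# Aggregate the recorded events and print the most expensive operators.
print(p.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```

When no trace handler is attached, `key_averages()` is the quickest way to inspect results directly, without going through TensorBoard.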
### Visualizing Traces
The recommended approach for visualizing the traces is to use [Holistic Trace Analysis (HTA)](https://github.com/facebookresearch/HolisticTraceAnalysis) in a Jupyter notebook.
First we load the trace:
```Python
from hta.trace_analysis import TraceAnalysis
analyzer = TraceAnalysis(trace_dir="path/to/trace/folder")
```
Next, we can use the [HTA API](https://hta.readthedocs.io/en/latest/) in a Jupyter notebook. For example:
```Python
from hta.trace_analysis import TraceAnalysis

# Load the trace
analyzer = TraceAnalysis(trace_dir="log/llama_13B")

# Get the GPU kernel breakdown
kernel_type_metrics_df, kernel_metrics_df = analyzer.get_gpu_kernel_breakdown(
    visualize=False, duration_ratio=0.8, num_kernels=5, include_memory_kernels=True
)
```
```Python
>>> kernel_type_metrics_df
                             kernel_type      sum  percentage
0                            COMPUTATION  3887727        64.1
1                          COMMUNICATION  2178302        35.9
2                                 MEMORY     2256         0.0
3       COMMUNICATION overlapping MEMORY        0         0.0
4  COMPUTATION overlapping COMMUNICATION        0         0.0
5         COMPUTATION overlapping MEMORY        0         0.0
```
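The `percentage` column is simply each kernel type's share of the summed durations. A quick stdlib sanity check against the numbers above:

```python
# Per-kernel-type sums (in the trace's time units) from the breakdown above
sums = {"COMPUTATION": 3887727, "COMMUNICATION": 2178302, "MEMORY": 2256}
total = sum(sums.values())

# Each type's share of total time, rounded to one decimal place
percentages = {k: round(100 * v / total, 1) for k, v in sums.items()}
print(percentages)  # -> {'COMPUTATION': 64.1, 'COMMUNICATION': 35.9, 'MEMORY': 0.0}
```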
> Most of the functions have a `visualize` argument, which can be used to enable plots.
Further, there are several [experimental features](https://hta.readthedocs.io/en/latest/source/features/lightweight_critical_path_analysis.html) that can make the analysis easier:
* CUPTI Counter Analysis: An experimental API to interpret GPU performance
counters. It attributes performance measurements from kernels to PyTorch
operators, and can help with kernel optimization and roofline analysis.
* Lightweight Critical Path Analysis: An experimental API to compute the critical
path in the trace. Critical path can help one understand if an application is CPU
bound, GPU compute bound or communication bound. The path can be visualized on
the original trace as well as manipulated as a directed acyclic graph object.
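To make the critical-path idea concrete, here is a toy longest-path computation over a hypothetical four-node trace DAG (plain Python, not the HTA API; the node names and durations are invented):

```python
from functools import lru_cache

# Hypothetical trace DAG: node -> (duration, successor nodes)
trace = {
    "cpu_launch": (2, ["gpu_kernel", "cpu_other"]),
    "gpu_kernel": (10, ["sync"]),
    "cpu_other": (3, ["sync"]),
    "sync": (1, []),
}

@lru_cache(maxsize=None)
def longest(node):
    """Length of the longest (critical) path starting at `node`."""
    duration, successors = trace[node]
    return duration + max((longest(s) for s in successors), default=0)

# The critical path cpu_launch -> gpu_kernel -> sync dominates: 2 + 10 + 1
print(longest("cpu_launch"))  # -> 13
```

Here the path through `gpu_kernel` bounds the end-to-end time, so this (toy) workload would be GPU compute bound; the HTA API performs the analogous computation on real trace events.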
## Network communication
## Courses / books