diff --git a/doc/systems.md b/doc/systems.md
index d57aa914c86553000b7d846f017e5b11079de95a..bc5be38e209ae4fc76b4a4222e2c69f6e6600be9 100644
--- a/doc/systems.md
+++ b/doc/systems.md
@@ -4,6 +4,77 @@
 
 ## Profiling
 
+### Profiling with PyTorch
+
+CPU and GPU profiling under PyTorch is done using [kineto](https://github.com/pytorch/kineto).
+For more information, see the [PyTorch profiling docs](https://pytorch.org/tutorials/beginner/profiler.html).
+
+Let us now consider the scenario where we want to profile Llama inference, namely the call to
+`chat_completion` in [dialog.py](https://gitlab.cs.pub.ro/netsys/llama/-/blob/main/dialog.py#L71) in
+our Llama repository. We wrap the call in the profiler as follows:
+
+```Python
+with torch.profiler.profile(
+    # write traces in a format that TensorBoard and HTA can load
+    on_trace_ready=torch.profiler.tensorboard_trace_handler('./path/to/a/log_folder'),
+    # record both CPU-side operators and CUDA kernels
+    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
+    profile_memory=True,
+    with_stack=True,
+) as p:
+    results, ctx = generator.chat_completion(
+        [dialog],
+        max_gen_len=max_gen_len,
+        temperature=temperature,
+        top_p=top_p,
+    )
+```
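+
+For a quick look without leaving the terminal, the profiler object can also print an
+aggregated per-operator summary (the sort key and row limit below are arbitrary choices).
+If the TensorBoard handler is not needed, `p.export_chrome_trace(...)` can be used instead
+of `on_trace_ready` to write a standalone Chrome trace.
+
+```Python
+# After the `with` block exits, `p` holds the collected profile.
+# Print an aggregated per-operator table, sorted by total CUDA time.
+print(p.key_averages().table(sort_by="cuda_time_total", row_limit=10))
+```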
+
+### Visualizing Traces
+
+The recommended approach for visualizing the traces is to use
+[Holistic Trace Analysis (HTA)](https://github.com/facebookresearch/HolisticTraceAnalysis) in a Jupyter notebook.
+
+First, we load the trace:
+```Python
+from hta.trace_analysis import TraceAnalysis
+analyzer = TraceAnalysis(trace_dir="path/to/trace/folder")
+```
+
+Next, we can use the [HTA API](https://hta.readthedocs.io/en/latest/) in the notebook. For example:
+
+```Python
+from hta.trace_analysis import TraceAnalysis
+
+# Load the trace
+analyzer = TraceAnalysis(trace_dir="log/llama_13B")
+
+# Get the GPU kernel breakdown
+kernel_type_metrics_df, kernel_metrics_df = analyzer.get_gpu_kernel_breakdown(
+    visualize=False, duration_ratio=0.8, num_kernels=5, include_memory_kernels=True
+)
+```
+
+Displaying `kernel_type_metrics_df` in the notebook gives, for example:
+
+|   | kernel_type                           | sum     | percentage |
+|---|---------------------------------------|---------|------------|
+| 0 | COMPUTATION                           | 3887727 | 64.1       |
+| 1 | COMMUNICATION                         | 2178302 | 35.9       |
+| 2 | MEMORY                                | 2256    | 0.0        |
+| 3 | COMMUNICATION overlapping MEMORY      | 0       | 0.0        |
+| 4 | COMPUTATION overlapping COMMUNICATION | 0       | 0.0        |
+| 5 | COMPUTATION overlapping MEMORY        | 0       | 0.0        |
+
+> Most of the HTA functions take a `visualize` argument, which can be used to enable plots.
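+
+Beyond the kernel breakdown, the same `analyzer` object exposes several other breakdowns.
+A short sketch reusing the `analyzer` from above (see the HTA API docs linked earlier for the
+exact signatures and return values):
+
+```Python
+# Time each GPU spends computing, doing non-compute work, and idling.
+time_spent_df = analyzer.get_temporal_breakdown(visualize=True)
+
+# Overlap between computation and communication kernels, per rank.
+overlap_df = analyzer.get_comm_comp_overlap(visualize=True)
+```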
+
+Further, there are several [experimental features](https://hta.readthedocs.io/en/latest/source/features/lightweight_critical_path_analysis.html) that can make the analysis easier:
+
+* CUPTI Counter Analysis: an experimental API to interpret GPU performance
+counters. It attributes performance measurements from kernels to PyTorch
+operators, and can help with kernel optimization and roofline analysis.
+
+* Lightweight Critical Path Analysis: an experimental API to compute the critical
+path in the trace. The critical path can help one understand whether an application is
+CPU bound, GPU compute bound, or communication bound. The path can be visualized on
+the original trace as well as manipulated as a directed acyclic graph object; see the
+sketch below.
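+
+As an illustration of the latter, here is a rough sketch following the example in the linked
+documentation; the `ProfilerStep` annotation and the `instance_id` value are assumptions that
+depend on how the trace was recorded, so double-check the exact signatures against the docs:
+
+```Python
+# Compute the critical path for rank 0, restricted to a single annotated region
+# ("ProfilerStep" and instance_id=0 are assumptions; adjust them for your trace).
+cp_graph, success = analyzer.critical_path_analysis(
+    rank=0, annotation="ProfilerStep", instance_id=0
+)
+
+if success:
+    # Summarize how the critical-path time is split up.
+    cp_graph.summary()
+
+    # Overlay the critical path on the original trace for inspection in a trace viewer.
+    analyzer.overlay_critical_path_analysis(
+        0, cp_graph, output_dir="log/llama_13B/overlaid"
+    )
+```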
+
 ## Network communication
 
 ## Courses / books