Everything else is noise.
GPU → CUDA • cuBLAS • cuDNN • TensorRT • Triton • NCCL
ML → PyTorch • JAX • TensorFlow
Infra → Kubernetes • OpenShift • Docker • AWS
Distributed → DDP • Model Parallelism • Run:AI • Slurm
Code → Python • C++ • Bash
- LLM inference latency, throughput & cost
- Multi-GPU training efficiency
- Kernel performance & memory behavior
- GPU utilization at scale
- Cloud-native AI deployments
- TensorRT-LLM pipelines
- Triton multi-model serving
- Advanced CUDA optimization
- Multi-node distributed workloads
- Mapping workloads → NVIDIA hardware
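The first focus area above, inference latency, throughput, and cost, often starts as back-of-envelope arithmetic before any profiling. A minimal sketch (all numbers are hypothetical, not measured results):

```python
# Back-of-envelope LLM serving math. All inputs are hypothetical examples.

def throughput_tokens_per_s(batch_size: int, tokens_per_request: int, latency_s: float) -> float:
    """Aggregate generated tokens per second for one batched request."""
    return batch_size * tokens_per_request / latency_s

def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_s: float) -> float:
    """Amortize the GPU-hour price over the tokens it generates."""
    tokens_per_hour = tokens_per_s * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

tps = throughput_tokens_per_s(batch_size=32, tokens_per_request=256, latency_s=4.0)
print(round(tps))                                                     # 2048 tokens/s
print(round(cost_per_million_tokens(gpu_hourly_usd=2.5, tokens_per_s=tps), 2))  # 0.34 USD
```

The same two formulas make trade-offs concrete: doubling batch size doubles throughput and halves cost per token, as long as latency stays within budget.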
LinkedIn → linkedin.com/in/atharva21
Email → [email protected]
Speed. Parallelism. Precision.
Accelerating neural networks, one GPU cycle at a time.