About
I'm a Software Engineer on Microsoft's Azure Kubernetes Service (AKS) team, specializing in the GPU infrastructure that powers large-scale AI and machine learning workloads.
At Microsoft, I lead GPU workload support on AKS and have shipped multiple high-impact features. I built the Fully Managed GPU Experience, which enables one-step GPU node pool creation with all relevant dependencies installed (driver, device plugin, and GPU metrics exporter), and implemented GPU Health Monitoring to proactively detect hardware failures and surface them at the Kubernetes layer. I also developed Artifact Streaming, which reduces pod start times by up to 20x through on-demand image loading. My work spans GPU infrastructure, Kubernetes optimization, and building resilient systems for AI training and inference at scale.
I graduated from UC Berkeley with dual degrees in Electrical Engineering and Computer Science (EECS) and in Business Administration, as part of the founding class of the Management, Entrepreneurship and Technology (MET) program. I focused heavily on machine learning coursework, served as President of Robotics @ Berkeley, and was selected as a Kleiner Perkins Fellow and Accel Scholar.
I'm passionate about advancing cloud-native and AI infrastructure. Through speaking at KubeCon and other conferences, I share insights on building resilient, scalable systems for the next generation of AI workloads.
Selected Work
Fully Managed GPU Experience
Delivered a fully managed GPU experience for AKS that installs the NVIDIA GPU driver, device plugin, and DCGM metrics exporter by default. This feature enables one-step GPU node pool creation, making GPU resources in AKS as simple to use as general-purpose CPU nodes and eliminating complex manual configuration and operational overhead.
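For illustration, here is a minimal sketch of what one-step creation can look like from the Azure SDK for Python. The subscription, resource group, cluster name, and VM size are placeholders, and the exact agent pool parameter shape varies by SDK version (an AgentPool model can be passed instead of a plain mapping).

```python
# Sketch: create a GPU node pool in one step with the Azure SDK for Python.
# With the managed GPU experience, the NVIDIA driver, device plugin, and
# DCGM exporter come up on the nodes without extra configuration.
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerservice import ContainerServiceClient

client = ContainerServiceClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",  # placeholder
)

poller = client.agent_pools.begin_create_or_update(
    resource_group_name="my-rg",          # placeholder
    resource_name="my-aks-cluster",       # placeholder AKS cluster name
    agent_pool_name="gpunp",
    parameters={
        "count": 2,
        "vm_size": "Standard_NC24ads_A100_v4",  # example A100 GPU SKU
        "mode": "User",
        "os_type": "Linux",
    },
)
print(poller.result().provisioning_state)
```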
GPU Health Monitoring
Integrated Node Problem Detector (NPD) with AKS to automatically detect and report GPU hardware failures and driver issues. This proactive monitoring helps customers identify GPU problems early, reducing downtime and improving reliability for AI/ML training workloads.
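Problems detected this way surface as extra conditions on the node object. As a rough illustration, this sketch uses the official Kubernetes Python client to scan nodes for non-standard conditions; the GPU-specific condition name in the comment is hypothetical, since real names depend on the monitor configuration.

```python
# Sketch: list problem conditions that NPD-style monitors attach to nodes.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Standard kubelet conditions to skip over.
STANDARD = {"Ready", "MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable"}

for node in v1.list_node().items:
    for cond in node.status.conditions or []:
        if cond.type not in STANDARD and cond.status == "True":
            # e.g. a hypothetical "GPUXidError" condition set by a GPU monitor
            print(f"{node.metadata.name}: {cond.type} - {cond.reason}: {cond.message}")
```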
GPU SKU Onboarding & Quality
H100 & A100 GPU Support
Onboarded the new H100 and A100 GPU SKUs to AKS with Multi-Instance GPU (MIG) support and all relevant dependencies. Implemented automated driver updates to ensure customers always have the latest GPU capabilities, and collaborated closely with NVIDIA, AMD, and Canonical to deliver a high-quality GPU experience.
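Once a MIG-enabled pool is up, its GPU slices appear as extended resources on each node. Here is a small sketch that prints them, assuming the nvidia.com/* resource naming used by the NVIDIA device plugin; the exact MIG resource names depend on the chosen GPU instance profile.

```python
# Sketch: print GPU and MIG extended resources advertised on each node
# (e.g. nvidia.com/gpu, nvidia.com/mig-1g.10gb under the assumed naming).
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpu_resources = {
        name: qty
        for name, qty in (node.status.allocatable or {}).items()
        if name.startswith("nvidia.com/")
    }
    if gpu_resources:
        print(node.metadata.name, gpu_resources)
```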
Artifact Streaming
Reduced Kubernetes pod start time by 30% on average, and by up to 20x for large container images, by developing an artifact streaming feature with on-demand image loading. This improves scale-up time and reduces costs for customers running large-scale workloads.
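Numbers like these come from measuring pod start latency. As a rough sketch, the time from pod creation to the Ready condition can be read straight from pod status; the namespace and label selector below are placeholders.

```python
# Sketch: measure pod start latency from creation to the Ready condition.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("default", label_selector="app=demo").items:
    ready = next(
        (c for c in (pod.status.conditions or []) if c.type == "Ready" and c.status == "True"),
        None,
    )
    if ready and pod.metadata.creation_timestamp:
        latency = ready.last_transition_time - pod.metadata.creation_timestamp
        print(f"{pod.metadata.name}: ready in {latency.total_seconds():.1f}s")
```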
Cost Optimization at Scale
Storage Cost Reduction
Saved $3 million per year in storage costs for AKS through automated replica scale-down of VM images in the CDN. Implemented intelligent caching strategies that maintain performance while significantly reducing infrastructure expenses.
Kubernetes Version Management
Accelerating Release Cycles
Drove Kubernetes version support on AKS from 1.21 to 1.24. Refactored and automated parts of the version-addition process, reducing minor-version onboarding time from approximately 6 weeks to 3 weeks and enabling faster delivery of new features to customers.
Conference Talks
Learn how to build an end-to-end AI-PaaS on Kubernetes by combining cloud-native tools, Model Context Protocol (MCP) servers, and intelligent agents. Demonstrates how an agent can interpret simple text commands, call external MCP metadata services, calculate optimal GPU topology, and provision nodes via the Kubernetes AI Toolchain Operator, all without hand-editing manifests.
Transparent, Infra-Level Checkpoint and Restore for Resilient AI/ML Workloads
KubeCon EU, London, 2025
Demonstrates a Kubernetes operator that checkpoints and hot-restarts distributed ML workloads using CRIU, CRI-O, and cuda-checkpoint. Covers the synchronization mechanisms that allow JobSets running stateful workloads to be checkpointed during node maintenance, along with use cases, limitations, and productionization steps.
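For context, the kubelet exposes an alpha checkpoint endpoint (behind the ContainerCheckpoint feature gate, requiring a CRIU-capable runtime such as CRI-O) that a setup like this can build on. A bare-bones sketch of calling it follows; the node address, token, and pod identifiers are placeholders, and production use would verify the kubelet's certificate.

```python
# Sketch: trigger a checkpoint via the kubelet's alpha checkpoint API
# (feature gate ContainerCheckpoint; needs a CRIU-capable CRI like CRI-O).
import requests

NODE = "https://10.0.0.4:10250"                 # kubelet address (placeholder)
TOKEN = "<bearer-token-with-node-proxy-rights>"  # placeholder credentials

resp = requests.post(
    f"{NODE}/checkpoint/default/training-worker-0/trainer",
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify=False,  # demo only; pin the kubelet CA in practice
)
resp.raise_for_status()
# On success the kubelet writes a checkpoint archive on the node and
# returns its location in the JSON response.
print(resp.json())
```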
Explores failure and orchestration challenges in large-scale ML training across thousands of GPUs. Covers the spectrum of GPU issues, how observability tools like NVIDIA DCGM enable proactive problem detection, and principles of fault-tolerant distributed training to mitigate GPU failure impact.
Demonstrates how Kubernetes Operators can automate the installation, configuration, and lifecycle management of AI-ready infrastructure end to end, from cluster provisioning and node configuration to deep learning model deployments. Includes a live demo of fine-tuning an LLM workload using the GPU Operator and the Kubernetes AI Toolchain Operator.
Dives into GPU failure challenges in distributed ML training. Explores the spectrum of GPU issues and why even minor performance drops can cripple large jobs. Shares best practices for efficiently identifying, remediating, and preventing GPU failures, with insights drawn from experience at a cloud provider and an autonomous vehicle company.
Examines approaches to reduce cold start times of Kubernetes pods with large container images. Compares on-demand image loading, peer-to-peer systems, pre-warming nodes, and checkpoint/restore techniques. Discusses how optimal approaches vary by workload type and the latency tradeoffs throughout the pod lifecycle.
Podcast Appearances
Appeared as a guest alongside Brendan Burns, co-founder of Kubernetes and Corporate Vice President at Microsoft, on an episode hosted by Gerhard Lazu. 9,000+ streams.