About
I'm a Software Engineer on Microsoft's Azure Kubernetes Service (AKS) team, specializing in the GPU infrastructure that powers large-scale AI and machine learning workloads.
At Microsoft, I lead GPU workload support on AKS and have shipped multiple high-impact features. I built the Fully Managed GPU Experience, which enables one-step GPU node pool creation with all relevant dependencies installed (driver, device plugin, and GPU metrics exporter), and implemented GPU Health Monitoring to proactively detect hardware failures and surface them at the Kubernetes layer. I also developed Artifact Streaming, which reduces pod start times by up to 20x through on-demand image loading. My work spans GPU infrastructure, Kubernetes optimization, and building resilient systems for AI training and inference at scale.
I graduated from UC Berkeley with dual degrees in Electrical Engineering and Computer Science (EECS) and in Business Administration, as part of the founding class of the Management, Entrepreneurship and Technology (MET) program. I focused heavily on machine learning coursework, served as President of Robotics @ Berkeley, and was selected as a Kleiner Perkins Fellow and Accel Scholar.
I'm passionate about advancing cloud-native and AI infrastructure. Through speaking at KubeCon and other conferences, I share insights on building resilient, scalable systems for the next generation of AI workloads.
Selected Work
Fully Managed GPU Experience
Delivered a fully managed GPU experience for AKS that installs the NVIDIA GPU driver, device plugin, and DCGM metrics exporter by default. This feature enables one-step GPU node pool creation, making GPU resources in AKS as simple to use as general-purpose CPU nodes and eliminating complex manual configuration and operational overhead.
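For illustration, here is a minimal sketch of what one-step creation can look like from the Azure SDK for Python. The subscription, resource group, cluster name, and VM size are placeholders, and the exact agent pool parameter shape varies by SDK version (an AgentPool model can be passed instead of a plain mapping).

```python
# Sketch: create a GPU node pool in one step with the Azure SDK for Python.
# With the managed GPU experience, the NVIDIA driver, device plugin, and
# DCGM exporter come up on the nodes without extra configuration.
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerservice import ContainerServiceClient

client = ContainerServiceClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",  # placeholder
)

poller = client.agent_pools.begin_create_or_update(
    resource_group_name="my-rg",          # placeholder
    resource_name="my-aks-cluster",       # placeholder AKS cluster name
    agent_pool_name="gpunp",
    parameters={
        "count": 2,
        "vm_size": "Standard_NC24ads_A100_v4",  # example A100 GPU SKU
        "mode": "User",
        "os_type": "Linux",
    },
)
print(poller.result().provisioning_state)
```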
GPU Health Monitoring
Integrated Node Problem Detector (NPD) with AKS to automatically detect and report GPU hardware failures and driver issues. This proactive monitoring helps customers identify GPU problems early, reducing downtime and improving reliability for AI/ML training workloads.
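Problems detected this way surface as extra conditions on the node object. As a rough illustration, this sketch uses the official Kubernetes Python client to scan nodes for non-standard conditions; the GPU-specific condition name in the comment is hypothetical, since real names depend on the monitor configuration.

```python
# Sketch: list problem conditions that NPD-style monitors attach to nodes.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Standard kubelet conditions to skip over.
STANDARD = {"Ready", "MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable"}

for node in v1.list_node().items:
    for cond in node.status.conditions or []:
        if cond.type not in STANDARD and cond.status == "True":
            # e.g. a hypothetical "GPUXidError" condition set by a GPU monitor
            print(f"{node.metadata.name}: {cond.type} - {cond.reason}: {cond.message}")
```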
GPU SKU Onboarding & Quality
H100 & A100 GPU Support
Onboarded the new H100 and A100 GPU SKUs to AKS with Multi-Instance GPU (MIG) support and all relevant dependencies. Implemented automated driver updates to ensure customers always have the latest GPU capabilities, and collaborated closely with NVIDIA, AMD, and Canonical to deliver a high-quality GPU experience.
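Once a MIG-enabled pool is up, its GPU slices appear as extended resources on each node. Here is a small sketch that prints them, assuming the nvidia.com/* resource naming used by the NVIDIA device plugin; the exact MIG resource names depend on the chosen GPU instance profile.

```python
# Sketch: print GPU and MIG extended resources advertised on each node
# (e.g. nvidia.com/gpu, nvidia.com/mig-1g.10gb under the assumed naming).
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpu_resources = {
        name: qty
        for name, qty in (node.status.allocatable or {}).items()
        if name.startswith("nvidia.com/")
    }
    if gpu_resources:
        print(node.metadata.name, gpu_resources)
```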
Artifact Streaming
Reduced Kubernetes pod start time by 30% on average, and by up to 20x for large container images, by developing an artifact streaming feature with on-demand image loading. This improves scale-up time and reduces costs for customers running large-scale workloads.
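Numbers like these come from measuring pod start latency. As a rough sketch, the time from pod creation to the Ready condition can be read straight from pod status; the namespace and label selector below are placeholders.

```python
# Sketch: measure pod start latency from creation to the Ready condition.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("default", label_selector="app=demo").items:
    ready = next(
        (c for c in (pod.status.conditions or []) if c.type == "Ready" and c.status == "True"),
        None,
    )
    if ready and pod.metadata.creation_timestamp:
        latency = ready.last_transition_time - pod.metadata.creation_timestamp
        print(f"{pod.metadata.name}: ready in {latency.total_seconds():.1f}s")
```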
Cost Optimization at Scale
Storage Cost Reduction
Saved $3 million per year in storage costs for AKS through automated replica scale-down of VM images in the CDN. Implemented intelligent caching strategies that maintain performance while significantly reducing infrastructure expenses.
Kubernetes Version Management
Accelerating Release Cycles
Drove Kubernetes version support on AKS from 1.21 to 1.24. Refactored and automated parts of the version-addition process, reducing minor-version onboarding time from approximately 6 weeks to 3 weeks and enabling faster delivery of new features to customers.
Conference Talks
Learn how to build an end-to-end AI-PaaS on Kubernetes by combining cloud-native tools, Model Context Protocol (MCP) servers, and intelligent agents. Demonstrates how an agent can interpret simple text commands, call external MCP metadata services, calculate optimal GPU topology, and provision nodes via the Kubernetes AI Toolchain Operator, all without hand-editing manifests.
Transparent, Infra-Level Checkpoint and Restore for Resilient AI/ML Workloads
KubeCon EU, London, 2025
Demonstrates a Kubernetes operator that checkpoints and hot-restarts distributed ML workloads using CRIU, CRI-O, and cuda-checkpoint. Covers the synchronization mechanisms that allow JobSets running stateful workloads to be checkpointed during node maintenance, along with use cases, limitations, and productionization steps.
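For context, the kubelet exposes an alpha checkpoint endpoint (behind the ContainerCheckpoint feature gate, requiring a CRIU-capable runtime such as CRI-O) that a setup like this can build on. A bare-bones sketch of calling it follows; the node address, token, and pod identifiers are placeholders, and production use would verify the kubelet's certificate.

```python
# Sketch: trigger a checkpoint via the kubelet's alpha checkpoint API
# (feature gate ContainerCheckpoint; needs a CRIU-capable CRI like CRI-O).
import requests

NODE = "https://10.0.0.4:10250"                 # kubelet address (placeholder)
TOKEN = "<bearer-token-with-node-proxy-rights>"  # placeholder credentials

resp = requests.post(
    f"{NODE}/checkpoint/default/training-worker-0/trainer",
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify=False,  # demo only; pin the kubelet CA in practice
)
resp.raise_for_status()
# On success the kubelet writes a checkpoint archive on the node and
# returns its location in the JSON response.
print(resp.json())
```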
Explores failure and orchestration challenges in large-scale ML training across thousands of GPUs. Covers the spectrum of GPU issues, how observability tools like NVIDIA DCGM enable proactive problem detection, and principles of fault-tolerant distributed training to mitigate GPU failure impact.
Demonstrates how Kubernetes Operators can automate the installation, configuration, and lifecycle management of AI-ready infrastructure end to end, from cluster provisioning and node configuration to deep learning model deployments. Includes a live demo of fine-tuning an LLM workload using the GPU Operator and the Kubernetes AI Toolchain Operator.
Dives into GPU failure challenges in distributed ML training. Explores the spectrum of GPU issues and why even minor performance drops can cripple large jobs. Shares best practices for efficiently identifying, remediating, and preventing GPU failures, with insights drawn from experience at a cloud provider and an autonomous vehicle company.
Examines approaches to reduce cold start times of Kubernetes pods with large container images. Compares on-demand image loading, peer-to-peer systems, pre-warming nodes, and checkpoint/restore techniques. Discusses how optimal approaches vary by workload type and the latency tradeoffs throughout the pod lifecycle.
Podcast Appearances
Appeared as a guest alongside Brendan Burns, co-founder of Kubernetes and Corporate Vice President at Microsoft, on an episode hosted by Gerhard Lazu. 9,000+ streams.