I'm an AI Infrastructure Engineer specializing in building and scaling the backend systems that power large-scale machine learning. I bridge the gap between AI development and production by orchestrating distributed GPU clusters, tuning low-latency inference engines, and hardening core data infrastructure.
What I do:
- Architecting high-performance GPU and AI clusters
- Optimizing serving engines for low-latency production inference
- Scaling distributed vector databases and data pipelines
- Automating MLOps and infrastructure workflows for reliability
Services
GPU Cluster Orchestration
I provision and optimize multi-node GPU clusters using Kubernetes, managing resource allocation, VRAM utilization, and multi-tenant isolation.
High-Performance Inference Serving
I deploy and tune serving engines such as vLLM, TensorRT-LLM, and Triton Inference Server to minimize time to first token (TTFT) and maximize token throughput for production workloads.
Vector DB & RAG Storage Systems
I deploy and scale distributed vector databases such as Pinecone, Milvus, and Qdrant, tuning indexing strategies and retrieval latency.
AI Infrastructure Pipelines (MLOps)
I build robust CI/CD and data engineering pipelines that automate model weight distribution, checkpointing, and dynamic cluster autoscaling.
Distributed Training Infra
I architect infrastructure for model training and fine-tuning, configuring data-parallel and model-parallel jobs with Ray and DeepSpeed.
Compute & Cost Monitoring
I implement full-stack observability to track GPU metrics, prompt cache hit rates, latency, and cloud compute spend.