Architect, AI Cloud Platform

Oxmiq Labs - Campbell, CA

Job Description

About OXMIQ

OXMIQ designs GPU and AI silicon for large-scale model inference and training, and is developing an infrastructure and AI service orchestration platform that runs on heterogeneous accelerator hardware.

The Role

The Architect, AI Cloud Platform, owns the inference-serving architecture of OXMIQ's infrastructure and AI service orchestration platform: the layer through which customer workloads are served from accelerator fleets at scale. The role is responsible for the end-to-end serving path: how a model is loaded, scheduled, batched, cached, dispatched, and routed across heterogeneous hardware to deliver competitive latency, throughput, and token-per-dollar economics.

The Architect must also have a working understanding of the broader platform layers on which inference serving depends (Kubernetes-based orchestration, multi-tenant isolation, observability, billing, and DC-scale provisioning) and will collaborate with the engineering teams that own those layers to deliver a performant, integrated solution. Background and hands-on experience with these layers is expected; ownership of their delivery is not.

The role is hands-on. The Architect produces design documents, prototypes critical components, leads technical reviews, and works directly with engineering leads across each layer of the stack. The Architect also serves as a technical point of contact in selected customer and partner engagements.

Key Responsibilities

• Own the inference-serving architecture end to end, including model loading, continuous batching, KV-cache management, prefix caching, request routing, and SLA-aware scheduling across heterogeneous accelerators (illustrative sketches of two of these mechanisms follow this list).
• Lead the design of disaggregated prefill/decode deployments, including KV-cache transfer (e.g., NIXL over RDMA / InfiniBand / RoCE), KV-cache-aware request routing, and the orchestration patterns required to operate them at scale.
• Define the integration model between OXMIQ's Capsule runtime and the open-source inference-serving stack (vLLM, SGLang, TensorRT-LLM, llm-d, NVIDIA Dynamo, Triton Inference Server) so that serving workloads dispatch across heterogeneous silicon as a first-class capability.
• Partner with the orchestration team on the design of Kubernetes-based scheduling for accelerator fleets, including multi-tenant isolation, GPU and accelerator scheduling, and capacity management, ensuring it meets the needs of the inference-serving layer.
• Partner with the data-center infrastructure team on DC-scale provisioning, OS imaging, firmware, and burn-in validation flows for AI pods running on OXMIQ and third-party hardware, ensuring inference SLAs are achievable on the resulting fleet.
• Conduct architecture and code reviews and provide technical guidance to engineering leads across inference, orchestration, runtime, security, monitoring, and platform UI.
• Produce design documents, prototypes, and reference implementations for new platform components.
• Serve as the technical representative of the platform architecture in selected customer and partner engagements.
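
For context on the serving-path responsibilities above, here is a minimal, illustrative sketch of continuous batching: requests join and leave the running batch between decode steps rather than waiting for a full batch to drain. This is a toy Python model, not OXMIQ's implementation; Request, step_batch, and MAX_BATCH are hypothetical names.

```python
# Toy continuous-batching loop. Illustrative only; all names are hypothetical.
from collections import deque
from dataclasses import dataclass, field

MAX_BATCH = 8  # hypothetical per-step slot budget

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

def step_batch(batch):
    # Stand-in for one model decode step: emit one token per running request.
    for req in batch:
        req.generated.append(f"tok{len(req.generated)}")

def serve(waiting: deque):
    running = []
    while waiting or running:
        # Admit waiting requests whenever slots free up, rather than
        # draining the whole batch first -- the core of continuous batching.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        step_batch(running)
        # Retire finished requests so their slots are reusable next step.
        running = [r for r in running if len(r.generated) < r.max_new_tokens]

waiting = deque(Request(f"p{i}", max_new_tokens=2 + i % 5) for i in range(20))
serve(waiting)
```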
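
Similarly, a rough sketch of KV-cache-aware request routing: the router prefers the worker holding the longest cached prefix of the incoming request and falls back to load. Again illustrative only; the worker state and tokenization are heavily simplified and all names are hypothetical.

```python
# Toy prefix-cache-aware router. Illustrative only; all names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    load: int = 0
    cached: set = field(default_factory=set)  # cached token-tuple prefixes

def cached_prefix_len(tokens, cached):
    # Length of the longest cached prefix of this request's tokens.
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in cached:
            return n
    return 0

def route(tokens, workers):
    # Prefer maximum KV-cache reuse; break ties toward the lighter load.
    best = max(workers, key=lambda w: (cached_prefix_len(tokens, w.cached), -w.load))
    best.load += 1
    # Record the request's prefixes so later requests can hit this worker.
    best.cached.update(tuple(tokens[:n]) for n in range(1, len(tokens) + 1))
    return best

workers = [Worker("w0"), Worker("w1")]
route([1, 2, 3], workers)                 # empty caches: decided by load
print(route([1, 2, 3, 4], workers).name)  # reuses the 3-token prefix -> same worker
```
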
Required Qualifications

• 10+ years of platform, infrastructure, or cloud software engineering experience, with at least several years at a Principal Engineer or Architect level owning a multi-component platform.
• Deep, hands-on experience with modern inference-serving systems (vLLM, SGLang, TensorRT-LLM, Triton Inference Server, or comparable) at the level of operating them in production, modifying them, and understanding their internals (a minimal vLLM example appears at the end of this description).
• Working knowledge of LLM serving optimizations: continuous batching, PagedAttention, prefix caching, KV-cache management, speculative decoding, and quantization (FP8, INT8, or comparable) for inference.
• Hands-on experience with disaggregated prefill/decode architectures, including KV-cache transfer mechanisms (NIXL, RDMA over InfiniBand or RoCE), KV-cache-aware request routing, and the operational considerations of running disaggregated serving at scale.
• Deep experience with Kubernetes at production scale: operators, CRDs, scheduling, multi-tenancy, GPU and accelerator scheduling, and the operational realities of running it. Familiarity with Kubernetes-native inference frameworks (llm-d, NVIDIA Dynamo, or comparable) is expected.
• Familiarity with the rest of the AI cloud platform surface (monitoring, billing, multi-tenant security, the user-facing console) and the integration points between them.
• Working knowledge of the rest of the data-center stack: virtualization, storage, networking, OS provisioning, firmware, and DC-scale lifecycle.
• Experience integrating ML frameworks (PyTorch, JAX) and inference runtimes into a managed platform.
• Track record of technically leading distributed engineering teams through design documents, architecture reviews, and build-vs-buy decisions.
• Strong written and verbal communication skills.
• Working exposure to Claude Code or equivalent AI-assisted development workflows, with hands-on use in day-to-day engineering practice.

Preferred Qualifications

• Open-source contributions to vLLM, SGLang, TensorRT-LLM, llm-d, NVIDIA Dynamo, Triton Inference Server, or comparable inference-serving projects.
• Experience operating an LLM-serving platform at customer-facing scale at a NeoCloud, hyperscaler, or model provider.
• Familiarity with advanced KV-cache architectures: DeepSeek-style multi-head latent attention (MLA), RadixAttention (SGLang), and other cross-request KV-reuse mechanisms.
• Hands-on experience with high-throughput serving on heterogeneous accelerator hardware (NVIDIA, AMD, Intel, or custom AI silicon).
• Experience with multi-tenant isolation, secure enclaves, and confidential compute on accelerators.
• Familiarity with sovereign-cloud or regulated-deployment requirements and applicable compliance frameworks.
• Background in observability, latency tracing (TTFT, TPOT), and billing infrastructure for inference workloads.

Education

BS/MS/PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent practical experience.

Working Environment

The Architect reports into the AI Infrastructure / Orchestration organization and works directly with the OXMIQ silicon, runtime, and compiler teams, as well as with engineering leads across each layer of the platform. AI-assisted development tools (Claude Code or equivalent) are a standard part of engineering practice at OXMIQ and are expected to be used in daily work.

Compensation & Benefits

OXMIQ offers a competitive compensation package, including base salary, equity participation, comprehensive medical, dental, and vision coverage, and the opportunity to contribute to foundational silicon and software technology.

OXMIQ is an equal opportunity employer. We evaluate qualified applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other legally protected characteristic.
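
For reference, the qualifications above center on inference-serving systems such as vLLM. The snippet below is a minimal example of vLLM's offline generation API, which applies continuous batching and PagedAttention internally; it assumes vLLM is installed with a supported accelerator, and the model name is a placeholder.

```python
# Minimal vLLM offline-generation example (model name is a placeholder).
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

llm = LLM(model="facebook/opt-125m")  # placeholder model for illustration
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```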

Created: 2026-05-09
