Machine Learning / AI Operations Engineer
Arcade
Location
HQ - San Francisco (SOMA) - In Person
Employment Type
Full time
Location Type
On-site
Department
Engineering
Everyone's talking about AI. But here's the truth: ChatGPT can't send your emails. It can't book your flights. It can't even order you lunch.
Why? Because AI is trapped in a chat box. It can't take real actions in the real world.
We are changing that forever. We're not just building another AI company - we're creating the infrastructure that will power every AI application you'll use in the future.
The Revolution Needs You
Every AI app needs agentic "tools" - special functions that let AI models take real actions. Without tools, AI can only chat. With tools, AI can actually do things. We're building the definitive tools catalog and tool-calling platform that will unlock AI's true potential. Think Zapier for AI Actions. Think Auth0 for AI. Think really big.
Why This Is The Opportunity of a Lifetime
Founder-Market Fit: Our CEO previously founded Stormpath (acquired by Okta), where he created the first Authentication API for developers. He's done this before - and this time the market is 10x bigger. Our CTO led the vector database team at Redis, shipped 100+ LLM applications, and is a contributor to LangChain and LlamaIndex. He knows this space better than anyone.
Dream Team: We've assembled authentication, integrations, distributed systems, and AI experts from Okta, Redis, Microsoft, Splunk, Ngrok, Google, Airbyte, Disney, Snowflake, and HPE who've built and founded multiple successful developer platforms.
Perfect Timing: We're at the inflection point of AI adoption. The biggest problem isn't better models - it's connecting AI to real-world actions. That's us.
Massive Market: We're building critical infrastructure for the biggest technological shift of our generation. Every AI app will need what we're building.
Backed By The Best: Our investors have backed Databricks, ClickHouse, MongoDB, Perplexity, Cohere, Scale AI, Confluent, Elastic, and Firebase. They see what we see - this is going to be huge.
The Challenge
As our first Machine Learning/AI Operations Engineer, you will be responsible for building and maintaining the models and infrastructure that power Arcade's agentic features. We are building state-of-the-art features to make everyone's agents more powerful, including tool selection, memory & context management, and related products. For certain tasks, building custom models is the way to deliver the performance and accuracy that customers are asking for, something that hasn't been possible until now. We need your help ensuring that these features work reliably and quickly, both in our cloud and on-premises for our enterprise customers.
What You'll Do
Build: Create bleeding-edge models fine-tuned for Arcade's agentic products.
Deploy: Test and deploy our models and related application software, both on-prem and in our cloud.
Monitor: Prevent model drift. Make the models and APIs better. Collect the data you need to do it.
Build our stack. Use your experience to choose the right tools for the job, balancing speed, maintainability, and cost.
Shape the roadmap for the team.
Build leverage (via AI) - projects that take a week today should take a day next time.
Share your work with our customers and community, building our (and your) brand.
Required Skills
An insatiable desire to ship.
Strong understanding of the state of the art in machine learning, especially LLMs and tool-calling (e.g. MCP).
Comfortable with fine-tuning libraries and techniques (Hugging Face Trainer, DeepSpeed, FSDP, QLoRA, etc.)
Familiarity with model lifecycle management tools (MLflow, Weights & Biases, DVC, etc.)
Experience with model optimization, quantization, and deployment formats (ONNX, OpenVINO, TensorRT, etc.)
Experience with modern monitoring tools (Prometheus, Grafana, Datadog, ELK, Arize AI, etc.)
Production experience with at least one major agent framework (LangChain, LlamaIndex, OpenAI Agents SDK, Mastra, etc.)
5+ years of software engineering experience, including:
3+ years of experience working on a production-level ML training or inference system
2+ years of experience with ML frameworks (TensorFlow, PyTorch, scikit-learn, etc.)
2+ years of backend development experience with either Python or Go
2+ years of experience with infrastructure and deployment (AWS/GCP, Terraform, Helm, etc.)
Strong experience building and maintaining production APIs.
Track record of writing clean, well-documented, well-tested code.
User-centered approach to designing developer-centric products and tools.
Bonus Points
Experience building developer tools or SDKs
Open-source contributions
Experience with high-scale distributed systems
You’ve been a startup founder or early-stage startup employee before and love it.
Join The Movement
We're not just building a product - we're leading a movement to transform AI from just chatbots to agents that can take actions against real systems. This is your chance to be at the forefront of that revolution.
If you want to look back in 5 years and say, "I helped build that", then we want to talk to you.
Ready to make AI actually useful? Apply Now