AI Engineering

Production AI agents that ship, not slide demos.

We build Claude-based agents with real tools, real observability, and real evals. Your team owns the codebase. Your security team can audit it. Your CFO can see the cost dashboard.

Projects are scope-dependent. Free discovery call.
```ts
// agent.ts
import { ClaudeAgent } from '@anthropic-ai/claude-agent-sdk';
import { tools } from './tools';
import { skills } from './skills';
// Local helpers: file loading plus the usage/alert hooks wired below.
import { loadFile } from './lib/files';
import { logUsage, alertOpsChannel } from './lib/observability';

export const supportAgent = new ClaudeAgent({
  model: 'claude-sonnet-4-5',
  // Top-level await is fine here: this module is ESM.
  systemPrompt: await loadFile('./prompts/support.md'),
  tools,
  skills,
  hooks: {
    onStop: logUsage,
    onError: alertOpsChannel,
  },
});
```

Why this matters

Most AI agents do not survive contact with real users.

The demo runs on a curated input. Production runs on whatever your customers type at 2am. Tool calls fail silently, prompts drift, costs spiral, and the only person who understands the prompt has left the company. We build agents that survive that reality, with the observability, evals, and runbooks to prove it.

What we build

An agent your team can actually own.

No black-box prompt files. No undocumented tool wiring. Every agent we ship comes with the eval suite, the runbook, and the cost dashboard your team needs to operate it after we leave.

01

Tool design before model selection

Most AI agent failures are tool failures. We design the tool surface first, write JSON schemas the model can actually call reliably, and only then pick the model that fits.

Tool call success rate above 95 percent on day one.
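As a sketch of what "a schema the model can actually call" means in practice: one narrow purpose, enums instead of free text, and a description on every field. The `lookup_order` tool and its fields below are hypothetical, but the shape matches the Anthropic tool format.

```ts
// A tool schema designed for reliable calling: one narrow purpose,
// enums over free text, every field described.
export const lookupOrderTool = {
  name: 'lookup_order',
  description:
    'Fetch one order by its ID. Use when the user references a specific order.',
  input_schema: {
    type: 'object' as const,
    properties: {
      order_id: {
        type: 'string',
        description: 'Order ID, e.g. "ORD-12345". Ask the user if missing; never guess.',
      },
      include: {
        type: 'string',
        enum: ['summary', 'full_history'],
        description: 'How much detail to return. Default to "summary".',
      },
    },
    required: ['order_id'],
  },
};
```

Tight enums and explicit "ask, don't guess" instructions are what push call success rates up; vague free-text schemas are where most failures start.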

02

Skills, not megaprompts

Domain knowledge lives in versioned, testable skills the agent loads on demand. Your prompt stays under 200 lines. Updating a workflow does not require redeploying the agent.

Iteration cycle drops from days to minutes.
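A sketch of the idea (the skill shape here is ours for illustration, not a fixed SDK contract): the knowledge lives in a reviewable markdown file, the trigger is a plain function you can unit test, and the version bumps without touching the agent.

```ts
// skills/refund-policy.ts — one versioned, testable unit of domain knowledge.
export const refundPolicySkill = {
  name: 'refund-policy',
  version: '2.3.0',
  // Loaded into context only when relevant, so the base prompt stays small.
  trigger: (message: string) => /refund|chargeback|money back/i.test(message),
  // The knowledge itself stays in a markdown file your team can review and diff.
  source: './skills/refund-policy.md',
};
```

Because `trigger` is just a function, a one-line Vitest case pins its behavior: `expect(refundPolicySkill.trigger('I want my money back')).toBe(true)`.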

03

Observability from commit one

Every tool call, every token, every retry is traced through OpenTelemetry. Cost and latency dashboards are live before the agent talks to its first user. Debugging is grep, not vibes.

Mean time to debug a failed run under 10 minutes.
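Concretely, every tool executor goes through a wrapper like this one. A sketch using the real `@opentelemetry/api` surface; the span and attribute names are our convention, not a standard.

```ts
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('support-agent');

// Wrap any tool executor so every call emits a span with its args and outcome.
export async function tracedToolCall<T>(
  name: string,
  args: unknown,
  run: () => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(`tool.${name}`, async (span) => {
    span.setAttribute('tool.args', JSON.stringify(args));
    try {
      const result = await run();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err; // still surfaces to the agent's onError hook
    } finally {
      span.end();
    }
  });
}
```

With the arguments and outcome on the span, debugging a failed run is a trace query, not a log spelunk.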

04

Guardrails the legal team accepts

Input validation, output filtering, prompt injection defenses, scope limits enforced in code. Your security team reviews the agent the same way they review any service.

Passes SOC 2 and enterprise procurement review.
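For instance, tool inputs can be validated with zod before the tool ever runs. The schema below is hypothetical, as is the local `auditLog` helper, but it shows scope limits enforced in code rather than in the prompt.

```ts
import { z } from 'zod';
import { auditLog } from './lib/observability'; // local audit logger (assumed)

// Allowlist-style validation: anything not matching the schema is rejected
// before the tool runs, and the rejection is logged for audit.
const RefundInput = z.object({
  order_id: z.string().regex(/^ORD-\d{5,}$/),
  amount_cents: z.number().int().positive().max(50_000), // scope limit in code
  reason: z.enum(['damaged', 'late', 'wrong_item']),
});

export function validateRefundInput(raw: unknown) {
  const parsed = RefundInput.safeParse(raw);
  if (!parsed.success) {
    auditLog.warn('refund_input_rejected', { issues: parsed.error.issues });
    throw new Error('Refund request rejected by input validation.');
  }
  return parsed.data;
}
```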

05

Prompt caching wired by default

System prompt and skills cached. Per-conversation cache hit rate above 90 percent. Token bills fall by 60 to 80 percent versus naive implementations.

Production cost per session typically under 5 cents.
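On the Anthropic Messages API this comes down to marking the stable prefix with `cache_control`. A minimal sketch, assuming `systemPrompt` and `conversation` are already in scope:

```ts
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// The large, stable prefix (system prompt + skills) is marked cacheable;
// only the short per-turn messages pay full input rates after the first call.
const response = await client.messages.create({
  model: 'claude-sonnet-4-5',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: systemPrompt, // long and stable: prompt plus loaded skills
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: conversation, // short and changing: the part that misses cache
});

// usage.cache_read_input_tokens confirms the hit; we assert on it in evals.
console.log(response.usage.cache_read_input_tokens);
```

The first call writes the cache; every later turn reads it, which is where the 60 to 80 percent drop in token bills comes from.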

06

Handoff documented for your team

Every agent ships with a runbook, a prompt change checklist, an evals suite, and onboarding docs. Your team owns it after week 12, not just operates it.

No vendor lock-in, no consultant dependency.

<5¢

typical production cost per agent session across our shipped agents

Measured with Anthropic prompt caching and skill loading. Methodology available on request.

The eval layer

Tests that catch prompt regressions before users do.

Every agent ships with a Vitest eval suite mapped to real user cases. CI runs it on every prompt change. Cache utilization is asserted, not assumed. Regressions block merge.

```ts
// evals/support-agent.eval.ts
import { describe, it, expect } from 'vitest';
import { supportAgent } from '../agent';
import cases from './fixtures/support-cases.json';

describe('support agent', () => {
  for (const c of cases) {
    it(c.name, async () => {
      const result = await supportAgent.run(c.input);
      // The agent picked the tool the case expects...
      expect(result.tool_calls).toContainEqual(
        expect.objectContaining({ name: c.expected_tool }),
      );
      // ...the reply matches the expected shape...
      expect(result.message).toMatch(c.expected_pattern);
      // ...and the cached prompt prefix was read, not re-billed.
      expect(result.usage.cache_read_input_tokens).toBeGreaterThan(0);
    });
  }
});
```

Process

How an agent project runs.

01

Discovery

Two weeks. We map the workflow, identify the tools the agent needs, define the success metric, and lock the eval set. You see a paper prototype before any code.

Fixed scope, fixed price, no surprises.

02

Build

Four to six weeks. Tools first, then prompt, then skills, then guardrails. Staging deploy by week three. Eval suite runs on every commit from day one.

You can talk to the agent in week three.

03

Launch + monitor

Two weeks. Canary rollout, observability dashboards live, on-call coverage during the first 30 days. Handoff docs and team training before we step back.

Your team owns the agent at week 12.

Common questions

Frequently asked

  1. How long does an agent project take?

Eight to twelve weeks for a production agent with a real tool surface. Discovery and tool design take two weeks, the agent itself ships in four to six, evals and hardening run in parallel, and a soft-launch monitoring window closes it out. Faster if you already have the tool APIs.

  2. Why not just use ChatGPT or a no-code agent platform?

No-code platforms work for demos. They fall over when you hit production traffic, custom tools, observability requirements, or enterprise security review. We build agents your security team can audit and your engineering team can own.

  3. Which models do you use?

    We default to Claude Sonnet 4.5 with the Claude Agent SDK because the tool calling and skill system fit production workloads. We also ship on OpenAI and open-weight models when latency, cost, or data residency requires it.

  4. How do you handle prompt injection and abuse?

    Input validation against a schema, allowlist for tool inputs, output filtering for sensitive data, scope limits on what the agent can read or write, and a rate-limited, logged audit trail. We include a red-team pass before launch.

  5. Will the agent be expensive to run?

Not if it is built right. Prompt caching, skill loading, and tool design typically keep production cost per session under 5 cents. You get a cost dashboard and spend alerts before launch, so there are no surprise bills.

  6. What does it cost?

Pricing is scope-dependent. A single-purpose agent with three to five tools sits at the lower end of the range; multi-agent systems with custom evals and observability are scoped after discovery. The discovery call is free, and we send a fixed-price quote within 48 hours.

Ready to ship a real agent?

Tell us what you want to build.

Discovery call is free. Fixed-price quote within 48 hours. NDA on request.