
The 12-Item Checklist Before Shipping Your AI Prototype to Production

Most AI prototypes fail in production for the same reasons. Here's the checklist we run on every client LLM system before it touches real users.

April 15, 2026 · 8 min read · Akforges Studio

Most AI prototypes fail in production for the same reasons: no evals, no guardrails, no cost controls, and a model call that assumes perfect JSON output. Here's the checklist we use before every client system goes live.


Why prototypes fail in production

A Cursor-generated app or LangChain demo works perfectly in your local environment because you control the inputs. In production, users will send you edge cases, adversarial prompts, empty strings, queries in 40 different languages, and inputs three times longer than your context window.

The model will eventually hallucinate. The provider will update a model without notice. Your costs will spike 10x on a Tuesday because someone ran a batch job. None of these are edge cases — they're certainties.

The checklist below is what we implement before any AI system we build goes live.


The checklist

1. Structured output enforcement

Never trust raw LLM output in production code. Use Zod, Pydantic, or your framework's equivalent to enforce a schema on every model response. If the model returns malformed JSON, your retry logic should catch it — not a runtime exception that surfaces to users.

import { z } from "zod";

const schema = z.object({
  summary: z.string().max(500),
  confidence: z.number().min(0).max(1),
  sources: z.array(z.string().url()),
});

// schema.safeParse(response) never throws; a failure here should
// trigger your retry logic, not a user-facing exception.

2. Retry with backoff and fallback

Transient failures happen on every provider. Your inference layer should retry at least 3 times with exponential backoff before failing. For non-real-time tasks, consider a fallback to a cheaper model (GPT-4o mini, Claude Haiku) before returning an error.
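A minimal sketch of that layer (callModel, the model names, and the delays are placeholders for whatever your stack uses):

declare function callModel(model: string, prompt: string): Promise<string>;

// Try the primary model, then a cheaper fallback, with exponential
// backoff between attempts. Model names and delays are illustrative.
async function completeWithRetry(prompt: string): Promise<string> {
  const models = ["primary-model", "cheaper-fallback-model"];
  for (const model of models) {
    for (let attempt = 0; attempt < 3; attempt++) {
      try {
        return await callModel(model, prompt);
      } catch {
        // Backoff: 1s, 2s, 4s before the next attempt.
        await new Promise((r) => setTimeout(r, 1000 * 2 ** attempt));
      }
    }
  }
  throw new Error("All models and retries exhausted");
}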

3. An eval suite with ground-truth data

Before you ship, you need to know what "working correctly" means. Assemble at least 50–100 hand-labelled examples covering your most common use cases and your known edge cases. Run them on every PR. If a model update drops your pass rate below threshold, you want to catch it in CI — not from a support ticket.
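In its simplest form, the CI gate looks like this (runTask and the two toy cases stand in for your own task entry point and labelled suite):

declare function runTask(input: string): Promise<string>;

// Two toy cases shown; a real suite holds 50-100 labelled examples.
const cases = [
  { input: "Refund request, order #123", expected: "refund" },
  { input: "Where is my parcel?", expected: "tracking" },
];

async function runEvals(threshold = 0.9) {
  let passed = 0;
  for (const c of cases) {
    if ((await runTask(c.input)).trim() === c.expected) passed++;
  }
  const rate = passed / cases.length;
  console.log(`Eval pass rate: ${(rate * 100).toFixed(0)}%`);
  if (rate < threshold) process.exit(1); // fail the CI job on regression
}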

4. Input validation and prompt injection defence

Validate and sanitize user input before it touches your prompt. At minimum: length limits, encoding normalisation, and a check for common prompt injection patterns ("ignore previous instructions", "pretend you are", "DAN mode"). This isn't paranoia — it's a common attack vector on any user-facing AI feature.
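A first-pass validator can be as simple as the sketch below; the pattern list and length limit are illustrative, not a complete defence:

// Illustrative patterns; treat this as a first line of defence.
const INJECTION_PATTERNS = [
  /ignore (all )?previous instructions/i,
  /pretend you are/i,
  /\bDAN mode\b/i,
];

function validateInput(raw: string): string {
  const input = raw.normalize("NFC").trim(); // encoding normalisation
  if (input.length === 0 || input.length > 4_000) {
    throw new Error("Input length out of bounds");
  }
  if (INJECTION_PATTERNS.some((p) => p.test(input))) {
    throw new Error("Input rejected by injection filter");
  }
  return input;
}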

5. Output filtering

Even with a well-written system prompt, foundation models can produce content you don't want associated with your product. Implement an output filter for your use case: PII redaction, harmful content detection, or simply a regex check for patterns that should never appear in your output.
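A minimal redaction pass might look like this; the patterns are illustrative and should be tuned to your domain:

// Illustrative PII patterns; add a proper harmful-content
// classifier where the stakes warrant it.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.-]+/g;
const PHONE = /\+?\d[\d\s().-]{7,}\d/g;

function redactOutput(text: string): string {
  return text.replace(EMAIL, "[email]").replace(PHONE, "[phone]");
}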

6. Per-user cost limits

Without spend controls, one user running a batch job can spike your monthly bill by an order of magnitude. Implement per-user and per-org token budgets with hard limits. Redis works well here — a simple counter with a TTL is enough for most use cases.
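A sketch with ioredis, assuming a daily per-user token budget (the numbers are illustrative):

import Redis from "ioredis";

const redis = new Redis();
const DAILY_TOKEN_BUDGET = 200_000; // illustrative hard limit

// Returns false once the user has exhausted today's budget.
async function chargeTokens(userId: string, tokens: number): Promise<boolean> {
  const key = `budget:${userId}`;
  const used = await redis.incrby(key, tokens);
  if (used === tokens) await redis.expire(key, 86_400); // first charge today
  return used <= DAILY_TOKEN_BUDGET;
}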

7. Latency SLOs and timeouts

Set a p95 latency target and instrument it from day one. Add request-level timeouts (not just the provider SDK's default) so a slow model response doesn't hold a connection open indefinitely. If your feature is user-facing, p95 > 4 seconds will hurt conversion.
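One way to enforce that with AbortController, which fetch and many provider SDKs accept:

// A hard per-request timeout; 15s matches the quick check at the
// end of this post.
async function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  ms = 15_000,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await work(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}

// e.g. await withTimeout((signal) => fetch(url, { signal }));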

8. Distributed tracing on every call

Every LLM call should log: model version, prompt hash, token count (input + output), latency, cost, and the user/org ID. LangSmith and Langfuse both do this well. Without tracing, debugging production failures is archaeological work.
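As a concrete shape for that record (the field names here are ours, not any particular SDK's):

interface LlmTrace {
  modelVersion: string; // the pinned version from item 9
  promptHash: string;   // hash rather than raw prompt if prompts are sensitive
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  costUsd: number;
  userId: string;
  orgId: string;
}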

9. Model version pinning

Provider model aliases (gpt-4o, claude-3-5-sonnet-latest) update without notice. An alias update that changes model behaviour can silently degrade your eval pass rate. Pin to a specific model version (gpt-4o-2024-08-06) in production and test alias updates explicitly before promoting.
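One simple pattern is to keep the pinned ID in a single config object, so promoting a new version is a single reviewed change validated against your eval suite:

const MODELS = {
  production: "gpt-4o-2024-08-06", // pinned: behaviour is frozen
  staging: "gpt-4o",               // alias: picks up provider updates
} as const;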

10. Context window management

Naive RAG implementations stuff as much context as possible into the prompt. Production systems need explicit context window management: chunking strategy, relevance threshold cutoffs, and a fallback for when retrieved context is too long. Test your system with maximum-length inputs before launch.
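A greedy, token-budgeted context builder, sketched below (countTokens stands in for your tokenizer, and the relevance scores come from your retriever):

declare function countTokens(text: string): number;

interface Chunk {
  text: string;
  relevance: number;
}

// Highest-relevance chunks first, with hard cutoffs on both
// relevance and token count.
function buildContext(chunks: Chunk[], tokenBudget: number, minRelevance = 0.7): string {
  const kept: string[] = [];
  let used = 0;
  for (const c of [...chunks].sort((a, b) => b.relevance - a.relevance)) {
    if (c.relevance < minRelevance) break;   // relevance threshold cutoff
    const cost = countTokens(c.text);
    if (used + cost > tokenBudget) continue; // skip chunks that overflow
    kept.push(c.text);
    used += cost;
  }
  return kept.join("\n\n");
}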

11. Graceful degradation

What happens when the LLM provider is down? Your system should have a defined fallback: cached response, rule-based fallback, or a clear error message — not a 500 that confuses users. Implement a circuit breaker around your model calls.
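A minimal circuit breaker, with illustrative thresholds:

// After maxFailures consecutive errors, short-circuit to the
// fallback for cooldownMs before trying the provider again.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private maxFailures = 5, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    const open = this.failures >= this.maxFailures &&
      Date.now() - this.openedAt < this.cooldownMs;
    if (open) return fallback(); // don't hit a provider that's down

    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch {
      this.failures++;
      this.openedAt = Date.now();
      return fallback();
    }
  }
}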

12. A rollback plan

Model deployments can go wrong. Before launch, document: how to revert to a previous model version, how to disable the AI feature with a feature flag, and who has access to do it at 2am. The runbook should be written before it's needed.
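The feature-flag half of that runbook can be as small as this sketch (isFlagEnabled and generateAnswer stand in for your flag provider and model call):

declare function isFlagEnabled(flag: string): Promise<boolean>;
declare function generateAnswer(query: string): Promise<string>;

// The flag is the 2am lever: flipping it off routes every request
// to the defined fallback, no deploy required.
async function answer(query: string): Promise<string> {
  if (!(await isFlagEnabled("ai-assistant"))) {
    return "The assistant is temporarily unavailable. Please try again later.";
  }
  return generateAnswer(query);
}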


What we see most often

The three items that clients most commonly miss:

  1. Structured output enforcement — prototypes assume the model always returns valid JSON. It doesn't.
  2. Eval harness — shipping without evals means the first regression you catch is a user complaint.
  3. Cost controls — every month, somewhere, an AI startup gets a surprise $30k invoice because one endpoint had no spend limit.

The 30-minute quick check

If you're launching tomorrow and can only do three things:

  1. Add Zod/Pydantic schema validation to every model response
  2. Add a hard timeout (15s) and retry (3x with backoff) on every LLM call
  3. Add a per-user rate limit of 100 requests/hour

These three changes prevent the most common production failures. Everything else on the list is important — but these are the ones that will wake you up at 3am if you skip them.


If you're working through this list and want a second pair of eyes on your implementation, we offer standalone AI audits — 5 days, written report, no commitment. Get in touch.

Work with us

Need help applying this to your stack?

Free 30-min strategy call. We'll scope your problem and tell you honestly what the fix looks like.

Book a strategy call