What is the biggest risk when integrating AI into existing software?

Non-determinism and latency. AI calls are slow (typically 100ms–10s) and can return different results for the same input. Most existing code assumes fast, deterministic responses, so you need explicit fallbacks, timeouts, and result validation.

How do you handle AI errors gracefully in production?

Treat AI calls like any external API: wrap them in try/catch, set explicit timeouts, return sensible defaults when they fail, and log every failure. Never let an AI error crash your application.

How much does it cost to run LLM calls in production?

GPT-4o costs approximately $0.005 per 1K input tokens and $0.015 per 1K output tokens. A typical AI feature with 2K input + 500 output tokens costs about $0.0175 per call. At 10,000 calls/day, that's roughly $175/day, so plan your cost model carefully.
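The arithmetic in this answer can be reproduced with a small helper. This is a minimal sketch: the prices are the per-1K-token figures quoted above and should be treated as assumptions to replace with your provider's current price list, and the function names are illustrative. Before rounding, it gives about $0.0175 per call and $175 per day.

```typescript
// Prices are the per-1K-token figures quoted in the text (assumptions,
// not a live price list).
interface Pricing {
  inputPer1K: number;  // USD per 1,000 input tokens
  outputPer1K: number; // USD per 1,000 output tokens
}

const GPT_4O_PRICING: Pricing = { inputPer1K: 0.005, outputPer1K: 0.015 };

// Cost of a single call in USD.
function costPerCall(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  return (inputTokens / 1000) * pricing.inputPer1K + (outputTokens / 1000) * pricing.outputPer1K;
}

// Projected daily spend in USD for a given call volume.
function dailyCost(callsPerDay: number, perCallUsd: number): number {
  return callsPerDay * perCallUsd;
}

const perCall = costPerCall(2000, 500, GPT_4O_PRICING);
const perDay = dailyCost(10000, perCall);
console.log(perCall.toFixed(4), perDay.toFixed(2)); // prints: 0.0175 175.00
```

Running the same helper with your own measured token counts per feature is usually more informative than a single "typical" call.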

Should you use RAG or fine-tuning for domain-specific knowledge?

RAG for most cases. Fine-tuning is expensive, requires labelled data, and needs retraining when knowledge changes. RAG lets you update your knowledge base without retraining and provides citation capabilities.

Best Practices for Integrating AI into Existing Software

The Hidden Complexity of AI Integration

Integrating AI into existing software looks simple on the surface: call an API, get a response, display it. But production AI integration has failure modes that most engineering teams don't anticipate until they hit them.

This guide covers the patterns we use at SpeedMVPs when integrating LLMs and ML models into existing systems — patterns refined across 100+ production AI deployments.

Design Principles for AI-Augmented Systems

Principle 1: AI calls are I/O operations. Treat every LLM or ML API call like a database query or HTTP request: async, fallible, and with bounded latency expectations. Never block a user-facing request on an AI call without a timeout and fallback.

Principle 2: Validate AI outputs. LLMs return natural language that may not match your expected schema. Always validate structured outputs (JSON, lists, classifications) before using them in downstream logic. Use output parsers (LangChain, Instructor) to enforce schemas.

Principle 3: Design for non-determinism. The same prompt can return different outputs on different calls. Your system should be correct for any valid output from the model, not just the ideal output you tested against.
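Principles 2 and 3 can be illustrated with a hand-written validator. This is a minimal sketch that uses no parsing library: it assumes the model was asked to reply with a JSON object containing `sentiment` and `confidence`, and the names `SentimentResult` and `parseSentiment` are hypothetical. Anything that does not match the schema is rejected rather than passed downstream.

```typescript
type Sentiment = "positive" | "negative" | "neutral";

interface SentimentResult {
  sentiment: Sentiment;
  confidence: number; // 0.0 to 1.0
}

const SENTIMENTS: string[] = ["positive", "negative", "neutral"];

// Returns a validated result, or null if the raw model output
// does not match the expected schema.
function parseSentiment(raw: string): SentimentResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all, e.g. the model replied in prose
  }
  if (typeof data !== "object" || data === null) return null;

  const { sentiment, confidence } = data as Record<string, unknown>;
  if (typeof sentiment !== "string" || SENTIMENTS.indexOf(sentiment) === -1) return null;
  if (typeof confidence !== "number" || !isFinite(confidence)) return null;
  if (confidence < 0 || confidence > 1) return null;

  return { sentiment: sentiment as Sentiment, confidence };
}
```

A caller would treat `null` as an invalid-output error: log the raw string, return the fallback, and move on.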

API Design for AI Features

The cleanest approach is to treat AI as a service layer with a well-defined interface:

Define input schema: what data does the AI call need?
Define output schema: what structured data should it return?
Define error contract: what happens when the call fails or returns invalid output?
Define cost contract: how many tokens does this call consume, and is there a cheaper fallback?
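The four definitions above can be written down as types before any model is called. The sketch below is one possible shape, not a prescribed one: every name in it is illustrative, and the failure reasons and budget fields are assumptions chosen for the sentiment example that follows.

```typescript
// All names below are illustrative and not part of any SDK.
type Sentiment = "positive" | "negative" | "neutral";
type FailureReason = "timeout" | "provider_error" | "invalid_output";

// Input schema: what the AI call needs.
interface SentimentInput {
  text: string;
  timeout: number;     // milliseconds
  fallback: Sentiment; // returned when the call fails
}

// Output schema: what callers can rely on.
interface SentimentOutput {
  sentiment: Sentiment;
  confidence: number; // 0.0 to 1.0
}

// Error contract: callers never see an exception, only a tagged result
// that always carries a usable output.
type SentimentResult =
  | { ok: true; output: SentimentOutput }
  | { ok: false; output: SentimentOutput; reason: FailureReason };

// Cost contract: a token budget and an optional cheaper model.
interface CostContract {
  maxInputTokens: number;
  maxOutputTokens: number;
  cheaperModel?: string;
}

function withinBudget(contract: CostContract, inputTokens: number, outputTokens: number): boolean {
  return inputTokens <= contract.maxInputTokens && outputTokens <= contract.maxOutputTokens;
}

// Result returned on failure: the caller-supplied fallback with zero confidence.
function fallbackResult(input: SentimentInput, reason: FailureReason): SentimentResult {
  return { ok: false, output: { sentiment: input.fallback, confidence: 0 }, reason };
}
```

Keeping the fallback inside the result type means downstream code has one path to handle, whether or not the model call succeeded.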

Concrete example for a sentiment analysis integration:

// Bad: tightly coupled, no error handling
const sentiment = await openai.chat.completions.create({...});
return sentiment.choices[0].message.content;

// Good: service layer with schema validation
const result = await aiService.analyzeSentiment({
  text: userReview,
  timeout: 5000,
  fallback: 'neutral'
});
// result is always { sentiment: 'positive'|'negative'|'neutral', confidence: 0.0-1.0 }

Error Handling Patterns

AI errors come in three categories:

Network errors: API timeout, rate limit, service outage. Handle with exponential backoff and fallbacks.
Content policy errors: Your input triggered a safety filter. Log these for review — they may reveal edge cases in your input validation.
Invalid output errors: The model returned something that doesn't match your expected schema. Log the raw output, return the fallback, and alert your team.
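The three categories can be encoded as a small policy table so that each failure is handled the same way everywhere. This is a sketch under stated assumptions: `AiFailure` is a hypothetical normalized shape that you would populate from your SDK's real errors, none of its field names come from a specific library, and the choices to not retry content policy errors and to treat unrecognized failures as network errors are assumptions, not rules from the text.

```typescript
type AiErrorCategory = "network" | "content_policy" | "invalid_output";

// Hypothetical normalized failure. Map your SDK's real errors into this
// shape at the service boundary.
interface AiFailure {
  httpStatus?: number;            // e.g. 429 (rate limited) or 5xx (outage)
  timedOut?: boolean;
  blockedBySafetyFilter?: boolean;
  schemaMismatch?: boolean;
}

interface Handling {
  retryWithBackoff: boolean;
  returnFallback: boolean;
  logForReview: boolean;
  logRawOutput: boolean;
  alertTeam: boolean;
}

function categorize(failure: AiFailure): AiErrorCategory {
  if (failure.schemaMismatch) return "invalid_output";
  if (failure.blockedBySafetyFilter) return "content_policy";
  return "network"; // timeouts, 429s, 5xx responses, and anything unrecognized (assumption)
}

// One row per category, following the descriptions above.
const HANDLING: Record<AiErrorCategory, Handling> = {
  network: { retryWithBackoff: true, returnFallback: true, logForReview: false, logRawOutput: false, alertTeam: false },
  content_policy: { retryWithBackoff: false, returnFallback: true, logForReview: true, logRawOutput: false, alertTeam: false },
  invalid_output: { retryWithBackoff: false, returnFallback: true, logForReview: false, logRawOutput: true, alertTeam: true },
};

function handlingFor(failure: AiFailure): Handling {
  return HANDLING[categorize(failure)];
}
```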

Production error handling pattern:

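// Resolves after ms milliseconds; used for the backoff between retries.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Rejects after ms milliseconds so Promise.race can enforce a deadline.
const timeout = (ms: number) =>
  new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`AI call timed out after ${ms}ms`)), ms)
  );

async function callAIWithFallback<T>(
  aiCall: () => Promise<T>,
  fallback: T,
  options: { timeout: number; retries: number }
): Promise<T> {
  for (let attempt = 0; attempt <= options.retries; attempt++) {
    try {
      // Whichever settles first wins: the AI response or the timeout rejection.
      return await Promise.race([aiCall(), timeout(options.timeout)]);
    } catch (error) {
      console.error({ attempt, error }); // swap in your structured logger
      if (attempt === options.retries) break; // out of retries
      await sleep(Math.pow(2, attempt) * 1000); // exponential backoff: 1s, 2s, 4s, ...
    }
  }
  return fallback;
}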
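It helps to work out how long the retry loop above can hold a request before the fallback is returned. The sketch below mirrors the `Math.pow(2, attempt) * 1000` schedule; the 30-second cap (`maxMs`) and both function names are assumptions added for illustration.

```typescript
// Delay before the retry that follows a failed attempt (attempt is 0-based).
// The cap is an assumption; the loop above has none.
function backoffDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Worst case before the fallback is returned: every attempt runs until its
// timeout, and every backoff between attempts runs in full.
function worstCaseLatencyMs(timeoutMs: number, retries: number): number {
  let total = 0;
  for (let attempt = 0; attempt <= retries; attempt++) {
    total += timeoutMs;
    if (attempt < retries) total += backoffDelayMs(attempt);
  }
  return total;
}

console.log(worstCaseLatencyMs(5000, 2)); // prints: 18000
```

With a 5-second timeout and 2 retries, a user-facing request can wait 18 seconds before it sees the fallback. For interactive paths, that usually argues for fewer retries or a shorter timeout than you would use in a background job.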

Cost Management

LLM costs compound quickly at scale. Key techniques for production cost management:

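Prompt caching: OpenAI and Anthropic offer prompt caching for repeated prompt prefixes such as system prompts. At a GPT-4o input price of $0.005/1K tokens, a 2,000-token system prompt costs $0.01 per call uncached versus $0.0005 at a cached rate of $0.00025/1K tokens, which saves about $9.50 per 1,000 calls.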
Model selection: Use GPT-4o-mini or Claude Haiku for simple classification, extraction, and routing tasks. Reserve GPT-4o/Claude Sonnet for complex reasoning. Cost difference: 10–20×.
Token budgeting: Measure average token usage per call type. Set alerts when usage exceeds 2× the baseline (this often indicates prompt injection or unexpected input).
Response caching: Cache identical inputs for a reasonable TTL. Product descriptions and static content analysis can often be cached for 24h+.
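Response caching is the easiest of these techniques to sketch. The example below is a minimal in-memory cache with a TTL, assuming a single process; `TtlCache` and `cacheKey` are hypothetical names, and a shared store would be needed once you run more than one instance.

```typescript
// Minimal in-memory TTL cache. The clock is injectable so expiry can be tested.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Key on everything that affects the output: model, prompt version, and input.
function cacheKey(model: string, promptVersion: string, input: string): string {
  return JSON.stringify([model, promptVersion, input]);
}
```

Including the prompt version in the key means a prompt change invalidates old entries without a manual flush.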

Observability

You can't debug what you can't observe. Minimum viable LLM observability:

Log every AI call: input tokens, output tokens, latency, model used, error (if any)
Track cost per feature, per user, per day
Alert on p95 latency spikes (these often indicate prompt injection or unusually long inputs)
Capture and store outputs for a random 5% sample for quality review

Tools: LangSmith, Helicone, Braintrust, or a simple Postgres + Grafana stack if you prefer self-hosted.
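If you take the self-hosted route, the minimum above amounts to one log record per call plus two small helpers. This is a sketch with illustrative names (`AiCallLog`, `percentile`, `shouldCapture`); the nearest-rank percentile method is one common choice and an assumption here, since the text does not specify how p95 is computed.

```typescript
// One record per AI call, with the fields listed above.
interface AiCallLog {
  feature: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  error?: string;
}

// Nearest-rank percentile: the smallest sample with at least p% of
// samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// p95 latency across a batch of call logs.
function p95LatencyMs(logs: AiCallLog[]): number {
  return percentile(logs.map((log) => log.latencyMs), 95);
}

// Decides whether to store a call's output for quality review.
// The random source is injectable so the decision can be tested.
function shouldCapture(sampleRate: number = 0.05, random: () => number = Math.random): boolean {
  return random() < sampleRate;
}
```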

Security Considerations

AI integration introduces new attack surfaces:

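Prompt injection: Users can try to override your system prompt via user-controlled inputs. Always pass system prompts and user inputs as separate entries in the messages array; never concatenate them into one string. This reduces the risk but does not eliminate it, so keep validating outputs as well.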
Data exfiltration: Be careful what context you include in prompts. Never include other users' data in prompts without explicit access controls.
Output injection: If you render LLM outputs as HTML or execute them as code, sanitise them like any untrusted input.
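Two of these mitigations are small enough to show directly. The sketch below assumes a chat-style API that accepts role-tagged messages. `buildMessages` and `escapeHtml` are illustrative helpers, and the escaping covers only the HTML text context, so attribute, URL, and script contexts need their own handling.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Keeps instructions and user-controlled text in separate messages instead
// of concatenating them into one prompt string.
function buildMessages(systemPrompt: string, userInput: string): ChatMessage[] {
  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: userInput },
  ];
}

// Escapes model output before it is inserted into HTML text content.
// The ampersand is replaced first so later entities are not double-escaped.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```

If your UI framework already escapes interpolated text by default, prefer that over a hand-written helper, and reserve this kind of function for places where raw HTML strings are assembled.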

Integration Checklist

Before deploying an AI integration to production, verify:

✅ Async with timeout and fallback on every AI call
✅ Output schema validation with error logging
✅ Cost tracking per call type
✅ Prompt injection mitigation (messages array, not string concatenation)
✅ Rate limit handling with exponential backoff
✅ Sensitive data excluded from prompts
✅ Model outputs sanitised before HTML rendering
✅ Observability: latency, tokens, errors tracked
