What are the stages of AI product development?

There are six stages: discovery (define the problem and prove the model can do it), data (decide what context the model needs and source it), build (a thin product around one AI capability), evaluation (a graded test set defining acceptable output), launch (ship to a small cohort with monitoring), and iteration (improve from real usage). The stage that distinguishes AI from ordinary software is evaluation — it is a phase in its own right, not a final QA pass.

How long does the AI product development process take?

An AI MVP scoped to one core capability runs through the full process in 2-3 weeks, from around $8,000. The timeline stretches only when teams skip discovery and discover scope mid-build, or when they postpone the evaluation set and ship a feature nobody can prove works. The single biggest timeline variable is scope discipline, not engineering speed.

What does the AI product development process include?

It includes problem definition and a feasibility test, a data and context plan, a thin build around one AI feature, an evaluation set that scores model output against examples you wrote in advance, a controlled launch with error and cost monitoring, and a weekly iteration loop. It deliberately excludes building multiple AI features, training a custom model, or polishing design before a single real user has touched the product.

How is the AI product process different from normal software development?

The core difference is non-determinism. Traditional software either works or throws an error; an AI feature can return a plausible, confident, wrong answer. That forces two extra commitments into the process — a data phase that decides what context the model sees, and an evaluation phase that scores output against graded examples before launch. Without those, you are shipping a feature you cannot measure or defend.

When should I build an evaluation set?

Write your evaluation set during discovery, before any code. It is 15-30 example inputs paired with the output you would consider good, plus a few deliberately hard or out-of-scope cases. Writing it early forces you to define 'good enough' while it is still cheap to change, and it becomes the test suite you run against the real system before launch and after every prompt change.

Do I need a data scientist for the AI product development process?

Almost never at the MVP stage. Most MVP-stage AI products are built on hosted models like GPT-4 or Claude, where the work is prompt design, retrieval, and evaluation rather than model training. A product-minded full-stack engineer who understands prompting and RAG covers it. Add a dedicated ML engineer only once you have proprietary data and a use case off-the-shelf models genuinely cannot handle.

AI Product Development Process Explained | SpeedMVPs

The AI product development process, explained

The AI product development process is the end-to-end path from "we think AI could solve this" to a live product that real users rely on. It runs through six phases: discovery, data, build, evaluation, launch, and iteration.

The temptation is to treat this like normal software with an LLM bolted on. It isn't. Traditional software is deterministic — it works or it throws an error you can read. An AI feature can hand back a fluent, confident, completely wrong answer and look perfectly fine in a demo. That single property changes the process in two concrete ways: data becomes a planning phase, and evaluation becomes a phase of its own rather than a QA step at the end.

This guide walks the full lifecycle the way we actually run it on AI MVPs, including the gate that has to be cleared before each phase ends. To make it concrete, we'll carry one example — a tool that ranks and summarises inbound CVs for a solo recruiter — through all six phases, so you can see how a single decision in discovery ripples all the way to launch. If you want the version written specifically for non-technical founders rather than the mechanics, read AI product development for non-technical founders. This page focuses on the process itself.

The six phases at a glance

Phase	Goal	The gate to clear
1. Discovery	Define the problem and prove the model can do it	A one-sentence problem statement + a passing feasibility test
2. Data	Decide what context the model needs and source it	A clear map of inputs, context, and where each comes from
3. Build	Wrap one AI capability in a thin, usable product	The core loop works end to end on happy-path inputs
4. Evaluation	Define and measure "good enough" output	The model passes your graded test set at the agreed bar
5. Launch	Ship to a small cohort with monitoring	Real users complete the core workflow; errors and cost are tracked
6. Iteration	Improve from real usage	A repeatable weekly loop driven by data

Each phase has a gate because, in our experience, the cost of fixing a mistake roughly multiplies every time you cross one. A bad problem definition costs an hour to fix in discovery and weeks to unpick after launch.

Phase 1: Discovery — define the problem and prove feasibility

Discovery has two jobs that founders routinely collapse into one and get wrong: confirming the problem is worth solving, and confirming the AI can actually solve it.

Problem definition. Sharpen the idea until it survives as a single sentence with a who, a current pain, and a measurable outcome. "Use AI to help recruiters" is not a scope. "Cut the time a solo recruiter spends screening 50 inbound CVs from 3 hours to 20 minutes by ranking and summarising them against a job spec" is. That sentence is our worked example, and the specificity isn't pedantry — it defines everything downstream.

The feasibility test. This is the AI-specific part. Before committing, paste 5-10 real CVs and a job spec into GPT-4 or Claude by hand and judge the raw ranking and summaries. In our experience, if a competent manual prompt gets you roughly 70% of the way to acceptable output, prompt engineering and retrieval can usually carry it the rest of the way; if it's nearer 20%, engineering is unlikely to close the gap and you should change the scope now, while it's free.

Write the evaluation set here. Counterintuitively, the test that governs Phase 4 is authored in Phase 1: 15-30 real CV-and-spec pairs matched with the ranking and summary you'd call good, plus a handful of deliberately hard cases (a CV in a second language, an over-qualified candidate, a spec with contradictory requirements). Writing it now forces you to define "good enough" while it costs nothing.
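As a rough illustration of what that evaluation set can look like on disk, here is a minimal Python sketch. The class name, field names, and sample values are all hypothetical; the only thing taken from the text above is the idea of pairing real inputs with the output you'd call good and tagging the deliberately hard cases.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One graded example: the inputs plus the output you would call good."""
    case_id: str
    job_spec: str
    cv_ids: list[str]        # CVs included in this screening run
    expected_top: list[str]  # candidates you would shortlist by hand
    tags: list[str] = field(default_factory=list)  # e.g. "hard", "out-of-scope"


# Hypothetical cases; a real set would hold 15-30 of these.
eval_set = [
    EvalCase("backend-01", "Senior Python engineer, 5+ years",
             ["cv_a", "cv_b", "cv_c", "cv_d"], ["cv_b", "cv_d"]),
    EvalCase("backend-02-second-language", "Senior Python engineer, 5+ years",
             ["cv_e", "cv_f"], ["cv_e"], tags=["hard"]),
]

hard_cases = [case for case in eval_set if "hard" in case.tags]
print(f"{len(eval_set)} cases, {len(hard_cases)} marked hard")
```

Keeping the set in a plain, versioned file like this is one way to make sure it stays fixed while prompts change around it.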

Failure mode: skipping feasibility and discovering in week two that the model can't reliably do the one thing the whole product depends on.

Gate to clear: a one-sentence problem statement and a feasibility test that clears your bar. For how to scope tightly before this, see the step-by-step AI MVP guide.

Phase 2: Data — decide what the model needs to see

This phase barely exists in traditional software and is the one most teams underinvest in. An LLM is only as good as the context you put in front of it. For the recruiter tool, decide three things before building:

Inputs — what the user provides: the uploaded CVs and the job spec.
Context — what extra information the model needs that the user won't type: the recruiter's own definition of a strong hire, past placements, must-have versus nice-to-have criteria.
Source and freshness — where each piece lives and how current it must be (the job spec changes per role; the recruiter's hiring preferences are stable).

That decision sets your architecture. If the recruiter wants the model to reason over a large library of past placements and policies, you need retrieval (RAG), meaning embeddings plus a vector store, not a longer prompt. Here is where the choice earns its keep: if the corpus is small and lives alongside everything else, pgvector inside the Supabase Postgres you already run keeps the stack to one database and is the right default. Reach for a dedicated vector store like Pinecone only once retrieval volume or latency outgrows Postgres, when you're querying millions of chunks and want managed scaling rather than tuning your own indexes.

For the recruiter MVP, screening 50 CVs against one spec, you often need neither: the CVs and spec fit in a single well-structured prompt. Fine-tuning or training a custom model is almost never the right Phase 2 answer for an MVP; it's a later-stage move once you have proprietary data and a proven use case.
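The single-prompt versus retrieval call can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only: the four-characters-per-token figure is a crude heuristic for English text, and the context window, output reservation, and safety margin are assumed placeholder values, not the limits of any particular model.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def choose_context_strategy(documents: list[str],
                            context_window: int = 100_000,    # assumed, check your model
                            reserved_for_output: int = 4_000,  # assumed
                            safety_margin: float = 0.5) -> str:
    """Return "single_prompt" if everything fits comfortably, else "retrieval"."""
    budget = int((context_window - reserved_for_output) * safety_margin)
    needed = sum(estimate_tokens(doc) for doc in documents)
    return "single_prompt" if needed <= budget else "retrieval"


# 50 CVs of roughly 3,000 characters each, plus one job spec.
recruiter_mvp = ["x" * 3_000] * 50 + ["y" * 2_000]
print(choose_context_strategy(recruiter_mvp))

# A library of 2,000 past placements is a different story.
placements_library = ["x" * 3_000] * 2_000
print(choose_context_strategy(placements_library))
```

The safety margin is deliberately conservative here; it is a guess at headroom, and the right value depends on how quality holds up on your own evaluation set as the prompt grows.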

Output schema is part of data planning. Decide the exact shape of the model's output now — for the recruiter tool, a ranked JSON list with a score, a two-line summary, and a flagged-concerns field per candidate — because downstream UI and evaluation both depend on it.
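To make that concrete, here is a hedged sketch of validating the model's output against such a schema before anything downstream touches it. The field names (`candidate_id`, `score`, `summary`, `concerns`) are hypothetical stand-ins for the ranked list, score, summary, and flagged-concerns fields described above.

```python
import json

REQUIRED_FIELDS = {"candidate_id", "score", "summary", "concerns"}  # hypothetical names


def parse_ranking(raw: str, known_ids: set[str]) -> list[dict]:
    """Parse model output into a ranked list; raise ValueError on a bad shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of candidates")

    seen: set[str] = set()
    for item in data:
        if not isinstance(item, dict) or not REQUIRED_FIELDS <= set(item):
            raise ValueError(f"candidate is missing required fields: {item!r}")
        cid = item["candidate_id"]
        # Reject IDs the model invented and IDs it repeated.
        if not isinstance(cid, str) or cid not in known_ids or cid in seen:
            raise ValueError(f"unknown or duplicate candidate: {cid!r}")
        score = item["score"]
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"score must be a number for {cid!r}")
        seen.add(cid)
    # Highest score first, so the UI and the evaluation set read the same order.
    return sorted(data, key=lambda c: c["score"], reverse=True)


raw_output = json.dumps([
    {"candidate_id": "cv_a", "score": 62, "summary": "Solid generalist.", "concerns": []},
    {"candidate_id": "cv_b", "score": 91, "summary": "Strong match.", "concerns": ["notice period"]},
])
for candidate in parse_ranking(raw_output, known_ids={"cv_a", "cv_b"}):
    print(candidate["candidate_id"], candidate["score"])
```

Failing loudly on a malformed response is a design choice: it turns a silent quality problem into an error you can count.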

Failure mode: treating "we'll just put more in the prompt" as a data strategy, then watching quality collapse the moment real-world context exceeds the context window.

Gate to clear: a one-page map of inputs, required context, and where each comes from. For the architecture choices behind this, see how to develop an AI app.
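Once that map exists, it translates fairly directly into a prompt-assembly step. The following is a minimal sketch under stated assumptions: the function name, the section labels, and the `<cv>` wrapper are illustrative choices, not a required format, and the sample text is made up.

```python
def build_prompt(job_spec: str, cvs: dict[str, str], hiring_criteria: str) -> str:
    """Assemble one prompt from user inputs and stored context."""
    cv_blocks = "\n\n".join(
        f'<cv id="{cv_id}">\n{text.strip()}\n</cv>' for cv_id, text in cvs.items()
    )
    return (
        "You are screening CVs for a recruiter.\n\n"
        f"Hiring criteria (stable context, loaded from storage):\n{hiring_criteria.strip()}\n\n"
        f"Job spec (input, changes per role):\n{job_spec.strip()}\n\n"
        f"CVs (input, uploaded by the user):\n{cv_blocks}\n\n"
        "Rank the CVs against the job spec and hiring criteria."
    )


prompt = build_prompt(
    job_spec="Senior Python engineer, remote, 5+ years.",
    cvs={"cv_a": "Ana: 7 years Python, fintech.", "cv_b": "Ben: 2 years Java."},
    hiring_criteria="Must have: production Python. Nice to have: startup experience.",
)
print(prompt)
```

Labelling each section by where it comes from keeps the prompt aligned with the data map, so a missing source shows up as a missing section rather than a vague drop in quality.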

Phase 3: Build — a thin product around one capability

Build the smallest possible product that exercises the core AI loop end to end — for the recruiter, that's "upload CVs and a spec, get back a ranked, summarised list." Each piece of the stack should map to a decision you already made, not a default you copied:

Next.js for the app, because the same framework renders the upload UI and hosts the API route that calls the model — one codebase, no separate backend to stand up for an MVP.
Supabase for auth and database, because the recruiter's uploads, results, and (if needed) pgvector embeddings all live in one Postgres instance.
Vercel for hosting, because it deploys the Next.js app and its API routes from the same repo with preview URLs for every change.
A hosted model (GPT-4 or Claude) behind your own API endpoint, so you can swap models or tune prompts without touching the client — the feasibility test in Phase 1 already told you the capability is there.

The build splits into three tracks that run in parallel for speed:

The AI track — the endpoint that takes the CVs and spec, assembles context, calls the model, parses output to your ranked-JSON schema, stores it, and handles failure (retries, timeouts, the model being down).
The product track — the thin interface: the upload screen, the ranked-results display, auth, and the loading and streaming states that keep a recruiter from abandoning a 10-second response.
The infrastructure track — database schema, deployment, environment variables, secrets.
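The failure handling in the AI track is the part most often left for later. A minimal retry-with-backoff sketch is shown here in Python for brevity, even though the app itself would be Next.js. The `call` argument stands in for whatever function actually hits your model provider, and the exception types to retry on are assumptions you would replace with the ones your client library raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(call: Callable[[], T],
                      max_attempts: int = 3,
                      base_delay: float = 0.5,
                      retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    raise RuntimeError("unreachable")


# Simulated model call that times out twice, then succeeds.
attempts = {"count": 0}


def flaky_model_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("model took too long")
    return "ranked list"


print(call_with_retries(flaky_model_call, sleep=lambda seconds: None))
```

Passing `sleep` in as a parameter is only there to keep the example fast to run and easy to test; in production the default `time.sleep` (or an async equivalent) would apply.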

Resist every feature that isn't the core loop. An applicant-tracking dashboard, team seats, billing, and a second AI feature all belong after launch. For why ruthless scope is what actually hits the timeline, see why startups fail to ship fast.

Failure mode: building three AI features at 60% quality instead of one at 95%. Users forgive a small product; they don't forgive an unreliable one.

Gate to clear: the core loop works end to end on happy-path inputs.

Phase 4: Evaluation — define and measure "good enough"

This is the phase that separates an AI process from a software process, and the one most likely to be skipped under deadline pressure. You shouldn't ship a non-deterministic feature you can't measure.

Take the CV evaluation set you wrote in Phase 1 and run it against the assembled system — not the model in isolation, because prompts that shine alone often break once full context is wired in. For each case, score the output:

Pass / fail against your defined "good" ranking and summary, or
A 1-5 rubric where you decide the minimum average and the maximum tolerable fail rate before launch — for the recruiter, perhaps "the top three candidates the tool surfaces must include the ones I'd have shortlisted by hand."

Then iterate the prompt, retrieval, and output schema against that fixed set until you clear the bar. Critically, the set doesn't change as you tune — that's what makes it a measurement rather than wishful thinking. The acceptable bar is domain-dependent: a CV-screening assistant that a human still reviews can ship at a higher fail rate than anything legal, medical, or financial, where the same loop runs against a much stricter threshold and explicit guardrails.
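A scoring pass over the fixed set can be very small. The sketch below uses the "top three must include my shortlist" criterion as a pass/fail check and compares the resulting fail rate to a bar agreed in advance. The function names, the sample rankings, and the 0.2 bar are all made up for illustration.

```python
def top_k_hit(predicted: list[str], shortlist: list[str], k: int = 3) -> bool:
    """Pass if every hand-picked shortlist candidate appears in the top k."""
    return set(shortlist) <= set(predicted[:k])


def summarise_run(passed: list[bool], max_fail_rate: float) -> dict:
    """Compare the observed fail rate with the bar agreed before launch."""
    if not passed:
        raise ValueError("no evaluation cases were scored")
    fails = passed.count(False)
    fail_rate = fails / len(passed)
    return {
        "cases": len(passed),
        "fails": fails,
        "fail_rate": fail_rate,
        "clears_bar": fail_rate <= max_fail_rate,
    }


# (system ranking, hand-made shortlist) per evaluation case; values are made up.
run = [
    (["cv_b", "cv_d", "cv_a", "cv_c"], ["cv_b", "cv_d"]),  # pass
    (["cv_e", "cv_f"], ["cv_e"]),                          # pass
    (["cv_g", "cv_h", "cv_i", "cv_j"], ["cv_j"]),          # fail: cv_j is ranked 4th
    (["cv_k", "cv_l", "cv_m"], ["cv_m", "cv_k"]),          # pass
]
passed = [top_k_hit(predicted, shortlist) for predicted, shortlist in run]
# 1 of 4 cases fails, so a 0.2 fail-rate bar is not cleared.
print(summarise_run(passed, max_fail_rate=0.2))
```

Note that a shortlist longer than `k` can never pass under this check, so the two numbers need to be chosen together.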

Failure mode: "it looked great in the demo." A demo is one cherry-picked CV. An evaluation set is the difference between a feature you can defend and one you're hoping works.

Gate to clear: the model passes your graded test set at the agreed bar.

Phase 5: Launch — ship to a small cohort with monitoring

Launch is controlled, not a press release. Deploy to production, test the full journey on the live URL (staging is never identical to production), and stand up three things before a single recruiter arrives:

Error monitoring (Sentry) on both the app and the AI endpoint, because a failed CV parse should page you, not silently return an empty list.
Product analytics (PostHog) on the core funnel — upload, results returned, candidate opened — so you can see exactly where recruiters stall.
Cost and latency tracking on model calls — screening 50 CVs at once is a real token line item, and a runaway prompt is a real risk.

Then invite 10-20 target recruiters directly, hand them a two-minute walkthrough, open a direct feedback channel, and watch the funnel live for 48 hours. For the full pre-flight list, see the MVP launch checklist.
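Watching the funnel comes down to simple arithmetic: how many people reached each step, and what share carried on from the step before. PostHog's own funnel view does this for you; the sketch below only shows the calculation, with hypothetical event names and made-up counts for an 18-recruiter cohort.

```python
def funnel_report(counts: dict[str, int], steps: list[str]) -> list[tuple[str, int, float]]:
    """For each step: users who reached it and conversion from the previous step."""
    report = []
    previous = None
    for step in steps:
        reached = counts.get(step, 0)
        if previous is None:
            rate = 1.0  # the first step is the baseline
        else:
            rate = reached / previous if previous else 0.0
        report.append((step, reached, rate))
        previous = reached
    return report


def biggest_drop(report: list[tuple[str, int, float]]) -> str:
    """Name of the step with the lowest conversion (needs at least two steps)."""
    return min(report[1:], key=lambda row: row[2])[0]


steps = ["upload", "results_returned", "candidate_opened"]  # hypothetical event names
counts = {"upload": 18, "results_returned": 15, "candidate_opened": 6}

report = funnel_report(counts, steps)
for step, reached, rate in report:
    print(f"{step:18} {reached:3d}  {rate:.0%}")
print("Biggest drop at:", biggest_drop(report))
```

With a cohort this small the percentages are noisy, so treat the biggest drop as a prompt for a conversation with those recruiters rather than a statistic.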

Gate to clear: real users complete the core workflow, and errors, cost, and latency are all visible.
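Making cost and latency visible can start as a few lines of arithmetic over logged token counts. In the sketch below the per-million-token prices and the daily budget are placeholder numbers, not any provider's real pricing, and the record fields are hypothetical; substitute the values from your own provider and logs.

```python
from dataclasses import dataclass


@dataclass
class ModelCall:
    input_tokens: int
    output_tokens: int
    latency_s: float


def cost_usd(call: ModelCall, input_per_mtok: float, output_per_mtok: float) -> float:
    """Cost of one call, given prices in USD per million tokens."""
    return (call.input_tokens * input_per_mtok
            + call.output_tokens * output_per_mtok) / 1_000_000


def daily_summary(calls: list[ModelCall], input_per_mtok: float,
                  output_per_mtok: float, budget_usd: float) -> dict:
    """Total spend, slowest call, and whether the day's budget was exceeded."""
    total = sum(cost_usd(c, input_per_mtok, output_per_mtok) for c in calls)
    return {
        "calls": len(calls),
        "total_usd": round(total, 4),
        "slowest_s": max((c.latency_s for c in calls), default=0.0),
        "over_budget": total > budget_usd,
    }


# Two 50-CV screening runs; token counts and prices are placeholders.
calls = [ModelCall(40_000, 3_000, 11.2), ModelCall(38_000, 2_500, 9.8)]
print(daily_summary(calls, input_per_mtok=3.0, output_per_mtok=15.0, budget_usd=0.25))
```

Wiring the `over_budget` flag to an alert is what turns a runaway prompt from a surprise invoice into a same-day fix.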

Phase 6: Iteration — improve from real usage

Launch isn't the finish line; it's where the real product starts. Iteration is a tight weekly loop: collect data (analytics plus actual recruiter conversations) → identify the single biggest problem → ship a fix → measure → repeat. In an AI product the highest-leverage fixes are usually prompt and retrieval changes; for the recruiter tool, maybe the summaries bury the candidate's most relevant experience. Next comes activation (recruiters sign up but stall before their first upload), then latency, which in our experience is the most common complaint in early AI products and is fixable with streaming, caching, and right-sizing the model.
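Of the latency fixes mentioned, caching is the easiest to sketch. The example below is a minimal in-memory version that keys on the model name, a prompt version, and the request payload, so a prompt change naturally invalidates old entries. The class, the key scheme, and the model name are illustrative assumptions; a real deployment would more likely back this with the database or a shared cache.

```python
import hashlib
import json
from typing import Any, Callable


class ResponseCache:
    """In-memory cache keyed by model, prompt version, and request payload."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}

    @staticmethod
    def make_key(model: str, prompt_version: str, payload: dict) -> str:
        blob = json.dumps(
            {"model": model, "prompt_version": prompt_version, "payload": payload},
            sort_keys=True,  # same payload, same key, regardless of dict order
        )
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get_or_compute(self, model: str, prompt_version: str, payload: dict,
                       compute: Callable[[], Any]) -> Any:
        key = self.make_key(model, prompt_version, payload)
        if key not in self._store:
            self._store[key] = compute()  # only pay for the model call on a miss
        return self._store[key]


cache = ResponseCache()
model_calls = {"count": 0}


def fake_model_call() -> str:
    model_calls["count"] += 1
    return "ranked list"


payload = {"job_spec": "Senior Python engineer", "cv_ids": ["cv_a", "cv_b"]}
cache.get_or_compute("example-model", "v1", payload, fake_model_call)
cache.get_or_compute("example-model", "v1", payload, fake_model_call)  # served from cache
cache.get_or_compute("example-model", "v2", payload, fake_model_call)  # new prompt, new call
print("model calls made:", model_calls["count"])
```

Caching only helps when recruiters genuinely repeat a request, such as reopening the same results, so it complements streaming rather than replacing it.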

Every prompt change reruns the Phase 4 evaluation set before it ships — that's how you improve the CV rankings without silently regressing the cases you already got right. For what comes after the first iterations, see post-MVP iteration.
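One way to enforce that rerun is a small regression gate that compares per-case results from the current prompt with results from the candidate prompt. This is a sketch under assumptions: the case IDs and results are made up, and "block on any regression" is one possible policy, not the only reasonable one.

```python
def regression_check(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare per-case pass/fail between the live prompt and a proposed change."""
    regressions = sorted(
        case for case, ok in baseline.items() if ok and not candidate.get(case, False)
    )
    fixes = sorted(
        case for case, ok in candidate.items() if ok and not baseline.get(case, False)
    )
    return {
        "regressions": regressions,  # passed before, fail now
        "fixes": fixes,              # failed before, pass now
        "safe_to_ship": not regressions,
    }


# Hypothetical results from running the same fixed set against two prompts.
baseline = {"backend-01": True, "backend-02": False, "sales-01": True}
candidate = {"backend-01": True, "backend-02": True, "sales-01": False}

print(regression_check(baseline, candidate))
```

In this made-up run the new prompt fixes one case and breaks another, which is exactly the trade an aggregate pass rate would hide.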

How the phases map to time and cost

For an AI MVP scoped to one capability — like the recruiter tool — the whole process runs in 2-3 weeks from around $8,000: discovery and data in the first few days, build across the middle, evaluation and launch at the end, and iteration as smaller follow-on sprints. Expanding past that single validated capability into a full product is a separate, multi-sprint effort scoped on its own. The decisive variable is never engineering speed; it's scope discipline and whether you actually ran discovery and evaluation instead of skipping them. To pressure-test a budget against your own scope, use the AI MVP cost calculator or read the breakdown in AI MVP cost.

If you want the wider context for where this process sits in 2026 — the trends, models, and shifts shaping AI products this year — see AI product development in 2026.

Key takeaways

The AI product development process has six phases: discovery, data, build, evaluation, launch, iteration.
Two phases set AI apart from ordinary software: data (what context the model sees) and evaluation (a graded test set that defines "good enough").
Write your evaluation set in discovery, before code — it's the cheapest place to define success.
Each phase has a gate; crossing one with an unfixed mistake multiplies the cost of fixing it.
Scope discipline, not engineering speed, is what keeps an AI MVP at 2-3 weeks.

Ready to run this process on your idea instead of reading about it? Talk to SpeedMVPs — fixed-price, production-grade AI MVPs delivered in 2-3 weeks.