How to implement AI (Without burning a hole in the budget)
Every department I talk to is rushing to “do something with AI.” Very few of them can say what problem they’re actually trying to solve, how they’ll know if it worked, or what they’re willing to spend to find out. The pattern is always the same: the C-suite reads something about AI being able to replace jobs and provide 40% efficiency, a prototype appears within a fortnight, access quietly expands across the organisation without any guardrails, and three months later, finance flags that spend has gone from zero to £X,XXXX a month with nobody able to say which team owns what.
The result is predictable. Disappointment, and a cloud bill that won’t stop climbing, because nobody put the guardrails in before turning the taps on.
This post is about how to avoid that, from a product and delivery perspective. It’s written for anyone standing up AI capability in central government and wanting to do it properly.
For context, my experience here is specifically with Claude on AWS Bedrock in a UK government context. Most of what follows applies just as well if you’re deploying OpenAI’s models via Azure OpenAI, Gemini via Vertex, or any other managed LLM offering on your cloud provider of choice. The principles around use cases, requirements, attribution and cost control don’t care which model you pick. Some of the specific tooling I mention (QuickSight, AWS Budgets, Bedrock application inference profiles) is AWS-flavoured, but every cloud has an equivalent. Substitute as needed.
I’m also assuming assurance is already in hand. Governance board happy, data residency sorted, route-to-live agreed. If that’s not where you are, stop reading and go do that first. Everything below assumes you’ve got permission to build.
Start with use cases, not technology
The fastest way to waste six months is to stand up an AI capability and then go looking for things to do with it. Pick the use cases first, then work backwards to the platform.
And crucially, pick use cases you can actually measure. This is the bit people skip. If you can’t define a baseline (“civil servants currently take 18 minutes per case on average”) and a target (“reduce to 12 minutes”), you cannot tell whether AI helped, hurt, or made no difference. You’ll end up relying on vibes and anecdote when someone senior asks for the ROI, which is a conversation you do not want to be in. Measurability isn’t a nice-to-have, it’s the thing that separates a pilot that gets scaled from a pilot that gets quietly mothballed.
In my experience the use cases that land well share a few characteristics. There’s a clear baseline of someone doing something manual, repetitive and unstructured. The cost of a wrong answer is low enough that a human-in-the-loop check is genuinely sufficient. The inputs are text-heavy. The business owner can tell you what “better” looks like in numbers.
Things that tend to work in a government delivery context include drafting and summarising unstructured text (consultation responses, case notes, correspondence), semantic search across policy corpora nobody can navigate, and internal knowledge retrieval for frontline staff currently relying on SharePoint and tribal knowledge.
And because most departments running Claude on Bedrock are doing so on a cloud developer platform, there’s a whole class of engineer-facing use cases that often deliver faster returns than the user-facing ones: Pull request review and summarisation, auto-generating initial drafts of runbooks from incident data, generating Terraform from architecture diagrams, assisting with log analysis during incidents, writing first-draft RFCs or architectural decision records from a set of bullet points and helping engineers navigate sprawling internal documentation to name a few. These are measurable too: PR cycle time, time-to-resolution on incidents, lines of legacy code successfully migrated, and developer satisfaction scores.
Things that tend to fail: anything where the cost of a wrong answer is high and nobody’s checking the output; anything making real-time decisions on individuals; anything where the underlying process is broken and AI is being asked to paper over it. AI won’t fix your broken workflow, it’ll just make the mess faster.
Pick two or three. Rank them. Be honest about which one is genuinely painful for users versus which one just sounds impressive in a steering committee. Get written agreement with each business owner on what success looks like, in numbers, before you write a line of code. “Improves productivity” is not a success criterion. “Reduces average case handling time by 20% on this specific queue, measured over four weeks” is.
Collate the requirements properly
Once you’ve got your use cases, the temptation is to jump into building. Resist it. Spend a week or two gathering requirements across five dimensions.
Success metrics and KPIs. What’s the leading indicator you’ll use to know if the thing is working (e.g. adoption rate, task completion rate)? What’s the lagging indicator that tells you it’s actually delivering value (e.g. case handling time, PR cycle time, user satisfaction)? What’s the counter-metric that warns you it’s going wrong (e.g. error rate, rework rate, escalations)? How often will you measure, who owns the measurement, and what’s the threshold that triggers a kill/scale decision? Write these down before you start. They become the spine of every steering committee conversation you’ll have.
Cost. What’s the budget ceiling? What’s the unit economics target (cost per user, cost per prompt)? Who’s the finance business partner and how do they want to see spend broken down? Is there appetite for variable spend or does the department need a fixed monthly ceiling?
Performance. How many requests per day at peak? What’s the acceptable latency? Is this real-time interactive or batch? Can the user tolerate a five-second wait or do they need sub-second? This shapes your model choice, your caching strategy, and whether you can take advantage of batch pricing.
Security and data handling. What data classes are in scope (OFFICIAL, OFFICIAL-SENSITIVE)? Is personal data involved? Where does the data sit, where are calls made from, where are logs stored? Are you using Bedrock in a UK region, and is the account inside your existing landing zone? What’s the model invocation logging policy and who has access to prompt and response logs?
Operational. Who runs this in live? What’s the incident response path when the model returns something wrong in a way that matters? How do you roll back a bad prompt? What’s the change control for the system prompt itself, because that system prompt is now production code?
Write all of this down. Share with the tech lead, business owner, finance business partner and security lead. Get sign-off. This document becomes the thing you measure the pilot against and the thing that stops scope from quietly drifting six weeks in.
The cost requirements (the bit everyone skips)
Of those dimensions, cost is the one that gets the least attention and causes the most pain. So I’m going to go deep on it.
LLM spend behaves differently from typical cloud spend. It’s per-token, it scales with user behaviour rather than infrastructure decisions, and a small prompt change can 10x your costs overnight. The controls need to reflect that.
A related point before the controls themselves: roll the capability out to a small group of users first. I can’t stress this enough. Five to ten engaged users on a single use case tells you what a hundred users would cost, lets you fix the runaway prompts before they become a runaway bill, and gives you a credible baseline of “cost per outcome” before anyone commits budget to scaling. Every time I’ve seen a department go straight from prototype to directorate-wide access, they’ve spent the next quarter firefighting spend they could have predicted from a two-week closed pilot.
Here’s what I consider the minimum viable set of cost controls for that first pilot.
Must-have (MVP / pilot)
Attribution from day one, across multiple dimensions. This is the single biggest mistake I see. Everyone shares a single Bedrock invocation path, nobody knows who’s spending what, and six months later you’re trying to reconstruct usage from CloudTrail. You need to be able to slice spend by individual user, by team, by model (Haiku vs Sonnet vs Opus), by use case, and by cost centre so finance can map it back to the P&L. Use Bedrock application inference profiles per use case, tag every invocation with user and cost centre tags, route calls through per-team IAM roles, and log the model ID on every call. If you can’t answer “how much did the casework summariser cost us last month, which team drove it, and on which model” in under thirty seconds, your attribution is broken and every downstream control will be too.
Soft caps with tiered alerts. Before you reach for the emergency brake, you need a warning system. Alerts at 50%, 75% and 90% of monthly budget, routed to the product owner and the finance business partner. The point isn’t surveillance, it’s that nobody is surprised. Surprises in government finance cost careers.
Hard caps that actually stop spend. When the warnings have been ignored, or when something has gone properly wrong, you need a stop. AWS Budgets with a Budget Action attached that revokes the IAM permission to call Bedrock when the monthly cap is hit. Not another alert. A stop. Set it at a level that gives headroom but prevents a runaway retry loop from burning through the annual budget in a weekend.
A QuickSight dashboard a non-technical person can read. Spend by use case, spend by team, spend by user, spend by model, tokens in vs tokens out, cost per successful interaction, trend over the last 30 days, 90 days, six months. The SRO should be able to open it in a steering committee and know in ten seconds whether the programme is on track. If it needs a data analyst to translate it, it’s not a dashboard, it’s a data source.
Model selection discipline. Haiku, Sonnet and Opus sit at very different price points. Haiku is roughly a tenth of the cost of Opus for input tokens. Most use cases do not need Opus. Default to Sonnet, reach for Haiku where accuracy tolerates it, and make people justify Opus in writing. This single policy will noticeably compress the bill.
Audit logging. Bedrock model invocation logging turned on, delivered to CloudWatch or S3, capturing who called what model from which workspace with what token count. This is boring until the day NAO or internal audit asks, at which point it’s the difference between a calm conversation and a career event.
Should-have (stabilise after pilot)
Once the MVP is running and stable, the next layer of maturity is about catching problems earlier and squeezing more value per pound. AWS Cost Anomaly Detection scoped to Bedrock, watching for the spikes that signal a bug (retry loop, prompt-injection attack driving up tokens, a new team ramping without warning) rather than genuine demand. Prompt caching, which can cut input costs by up to 90% on repeated context and almost always applies to long system prompts reused across every call. Batch processing via the Message Batches API for anything non-urgent, which gives a 50% discount and can halve the bill on the right workload. And unit economics tracking that moves the conversation beyond “total spend” to “cost per outcome” (per case triaged, per summary generated, per PR reviewed), which is the number the SRO actually cares about.
None of these block the pilot, but all of them become uncomfortable to be missing once you’re in month three and scaling.
Nice-to-have (maturity plays)
Once you’re running at scale across multiple directorates, the focus shifts to organisational maturity. Chargeback or showback so cost lands in the P&L of the team driving it rather than a central IT budget and teams feel the cost of their own decisions. Application-layer rate limiting so one misbehaving client can’t exhaust the shared budget. Provisioned Throughput for predictable workloads where the committed-use discount beats on-demand. Automated prompt evaluation pipelines that flag when a prompt change has pushed up average token count before it ships. And FinOps-by-design gates, where every new use case goes through a cost review as part of normal governance rather than as an afterthought.
None of these are needed to run a successful pilot. All of them matter once you’re running a platform.
What a pilot actually looks like
If I’m dropping in and being asked to deliver a Claude-on-Bedrock capability properly, the first 90 days look roughly like this.
Weeks 1 to 2 are discovery. What are people already doing, officially and unofficially? Who are the stakeholders? What’s the Bedrock footprint already, if any? Don’t skip this. You’ll find things.
Weeks 3 to 4 are foundations. Attribution set up, soft and hard caps in place, dashboard v1 live, audit logging on, model selection policy written down, success metrics agreed. Nothing new gets built until these are in place. This is the part where product managers earn their keep by saying no, politely but firmly, to the engineer who just wants to start prototyping.
Weeks 5 to 8 are the pilot itself, starting with a small closed group of users, ideally five to ten, on one or two tightly-scoped use cases with the success criteria you agreed up front. Daily standups, weekly cost review, fortnightly demo to the SRO. Only open access beyond that group once the cost-per-outcome number is stable and within the envelope.
Weeks 9 to 12 are evaluation. Against the success metrics you wrote down in week four, not the ones you wish you’d picked in hindsight. Kill what didn’t work, double down on what did, publish what you learned, start the next wave.
Throughout all of it, treat the system prompt as production code. Version it, review it, test it, monitor the cost impact of every change. A one-line addition to a system prompt can move the token count per call by 20% and nobody will notice until the invoice arrives.
What happens after the pilot
A successful pilot is the easy bit. The hard bit is everything that comes next, and most departments under-resource it badly because the political momentum has already moved on to the next shiny thing.
A few things to expect, and to plan for.
Demand will outstrip your ability to onboard. Word gets around quickly when something works. You’ll have other directorates asking for access, other teams wanting their own use cases stood up, and senior stakeholders treating your pilot team as the de facto AI capability for the department. Decide early whether you’re a pilot team that hands things over, or a platform team that owns the capability long-term. These need very different operating models, funding routes, and headcount.
Your cost model will change shape. Pilot spend is dominated by experimentation: lots of prompt iteration, lots of test runs, relatively few real users. Production spend is dominated by volume: the prompts settle down, but actual usage scales 10x or 100x. The cost-per-outcome number you measured in week 12 will not hold once you have a thousand users instead of ten. Re-baseline as you scale, and be honest with finance about the curve.
Measurement has to keep going. It’s tempting to declare victory after the pilot evaluation and stop instrumenting. The KPIs that proved the use case in pilot are the same ones that will tell you when something has drifted, when a model upgrade has changed the output quality, or when users have quietly stopped using the feature because it no longer works for them. Build measurement into business-as-usual, not just into the pilot phase.
Models will change underneath you. Anthropic releases new Claude versions regularly. Each one shifts the price-to-performance curve. You need a deliberate process for evaluating new models against your existing prompts and use cases, with the cost and quality data to make an informed switch. The team that’s still running on the model they picked at pilot eighteen months ago is almost certainly leaving money or quality on the table, often both.
The system prompt becomes legacy faster than you think. Six months in, you’ll have prompts that nobody quite remembers writing, dependencies on specific model behaviours, and edge cases handled with bolted-on instructions. Treat your prompts the way you’d treat any other piece of production code that’s been around for a year: refactor them, document them, test them, and have a clear owner.
Plan the operating model deliberately. Who runs the dashboards in steady state? Who reviews new use case requests? Who owns the relationship with the cloud provider and tracks new features? Who keeps the cost controls maintained as the platform evolves? These are not glamorous roles, but if nobody is named against them, the capability will quietly degrade and you’ll be back to firefighting within a year.
The departments that get long-term value from AI are the ones that treat the pilot as the start of a programme, not the end of a project. Resource it accordingly.
What success looks like
I’m not promising:
- ❌ That AI will transform your department overnight
- ❌ That savings will show up in a ministerial briefing in month one
- ❌ That the first use case you pick will be the winner
I am promising:
- ✅ You’ll know exactly what you’re spending, where, and why, every single day
- ✅ Nobody gets a nasty surprise on the monthly Bedrock invoice
- ✅ You’ll have two or three use cases with real, measurable impact rather than demo-ware
- ✅ Cost per outcome will be a number on a dashboard, not a vibe
- ✅ Teams will actually use the thing
A closing thought
If you got to the end and thought “this is exactly the mess we’re in,” you’re not alone. I’ve yet to walk into a department where the cost controls and product discipline around AI are as mature as they should be, and that gap is going to get more expensive the longer it’s left.
The good news is that none of this is hard. It’s just product management, applied to a category of spend that behaves slightly differently from what most teams are used to. Pick the use cases that matter, gather the requirements properly, put the cost controls in before you turn the taps on, start small, and measure everything.
If any of this resonates and you’d like to chat, get in touch. Otherwise, take what’s useful and ignore the rest.