Write your evals before you write your RAG
Most teams build the retrieval system, then ask whether it works. The teams that ship reliably build the eval harness first.
Tool-calling in production: what the benchmarks don't show
Benchmark scores look clean. Production tool calls look nothing like them. Here's what breaks.
Context window management for long-running agents
Past some token budget, the model forgets. How you handle that boundary defines whether your agent degrades gracefully or fails hard.
pgvector vs. dedicated vector DB: the honest comparison
For most production systems we've shipped, pgvector was the right call. Here's the reasoning and the caveats.
Prompt versioning isn't optional in production
A prompt that worked in March may not work in May — model updates, data drift, or just entropy. Version your prompts like code.
Setting latency budgets for AI features before you build them
What users will tolerate is not what your API timeout allows. Set your p95 latency budget on day one.
One note a fortnight. No spam.
Practical write-ups on agents, evals, retrieval, and product engineering. Straight to your inbox.