Last November, my phone lit up at 2am with a Slack alert. Our content classification pipeline had been quietly running for six hours — and in that time, it had burned through $340 in OpenAI API credits and produced results that were roughly 70% garbage. The cause? A bug in the retry logic that couldn’t distinguish why a request was failing. Every context_length_exceeded error got retried three times, full stop. By morning, the damage was done.
That was the moment I stopped treating AI pipelines as “API calls with a bit of plumbing” and started taking them seriously as production systems. Since then, I’ve processed over 15,000 generations through the same pipeline — document classification, code review automation, an internal Q&A system for a 12-person eng team — and I’ve got very specific opinions about what actually breaks in practice.
Retry Logic Is Not a One-Size-Fits-All Problem
The version I shipped initially was straightforward. I’d wired up tenacity, set exponential backoff, called it done.
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

client = OpenAI()

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60)
)
def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
Nothing technically wrong with this. The problem is it’s dangerously incomplete. OpenAI’s errors are not all retryable, and treating them identically is exactly how you burn $340 overnight.
RateLimitError? Retry it — you hit a quota, you wait, you try again. BadRequestError (InvalidRequestError, in pre-1.0 versions of the SDK)? Retrying that is just throwing money at a wall. If your prompt exceeded the context limit or you passed a malformed parameter, attempt number two will fail for the exact same reason. That 2am incident was precisely this: a prompt triggering context_length_exceeded getting queued for retry after retry, each attempt doomed to fail for the same reason while the rest of the pipeline kept burning tokens on garbage output.
Here is the split that actually matters:

- Retry: RateLimitError, APITimeoutError, APIConnectionError, InternalServerError (5xx responses)
- Don't retry: AuthenticationError, PermissionDeniedError, NotFoundError
- Inspect first: BadRequestError — check the message, then decide
import openai
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

client = OpenAI()

RETRYABLE_EXCEPTIONS = (
    openai.RateLimitError,
    openai.APITimeoutError,
    openai.APIConnectionError,
    openai.InternalServerError,
)

@retry(
    retry=retry_if_exception_type(RETRYABLE_EXCEPTIONS),
    stop=stop_after_attempt(4),
    wait=wait_exponential(multiplier=2, min=5, max=120),
    reraise=True
)
def call_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            timeout=30.0  # do not skip this
        )
        return response.choices[0].message.content
    except openai.BadRequestError as e:
        # context_length_exceeded is not retryable — fail fast
        if "context_length_exceeded" in str(e):
            raise ValueError(f"Prompt too long: {len(prompt)} chars") from e
        raise
That timeout=30.0 is not optional. Without it, the OpenAI API will sometimes just sit there — I’ve seen requests hang for 90+ seconds during off-peak hours. Workers get stuck, batch jobs stall for hours instead of minutes. Set the timeout. Four attempts is also the right ceiling in my experience; beyond that you’re usually hitting something structural, not transient, and you’re just adding latency while solving nothing.
Three Places Cost Will Catch You Off Guard
Honestly, I underestimated cost management early on. “It’s just tokens, how bad can it get” is something I literally said out loud. I stopped saying it around month two.
System prompt duplication. When you batch 100 documents through the same pipeline and your system prompt is 500 tokens, you’re sending that system prompt 100 times. OpenAI’s prompt caching (rolled out in October 2024) discounts cached input tokens by 50% on gpt-4o and gpt-4o-mini — but it only kicks in when the cacheable prefix is at least 1,024 tokens. Shorter system prompts get no benefit. Structure your prompts so the stable content leads and is substantial enough to hit that threshold; the dynamic per-document content can follow.
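A minimal sketch of that prompt structure. The helper names and the ~4-characters-per-token heuristic are my own assumptions for illustration; swap in a real tokenizer like tiktoken for accurate counts.

```python
# Check whether a system prompt is likely long enough to benefit from
# automatic prompt caching (1,024-token minimum cacheable prefix).
CACHE_MIN_TOKENS = 1024

def estimate_tokens(text: str) -> int:
    # crude heuristic: ~4 characters per token for English prose
    return max(1, len(text) // 4)

def build_messages(system_prompt: str, document: str) -> list[dict]:
    """Put the stable system prompt first so the cacheable prefix is
    identical across every document in the batch; per-document content follows."""
    if estimate_tokens(system_prompt) < CACHE_MIN_TOKENS:
        # below the threshold, caching never kicks in: worth knowing up front
        print("warning: system prompt likely below the 1,024-token cache minimum")
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": document},
    ]
```

The ordering is the whole trick: caching matches on an exact prefix, so anything dynamic placed before the stable content defeats it.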
Model selection. My default was gpt-4o for everything, because that’s what I’d been using in prototypes and it felt safer. After running an A/B test on 50 documents — same inputs, both models, manually comparing outputs — I found that for simple classification tasks, gpt-4o-mini was within 3-5% accuracy while costing roughly 15-20x less. I moved all classification and keyword extraction to mini, kept gpt-4o for complex reasoning and long-document summarization. That one change cut my monthly spend by about 60%.
The key word is “simple.” Don’t assume your task qualifies — run the comparison yourself. What counts as simple enough for mini varies considerably by task type.
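To make the cost side of that comparison concrete, here is a back-of-envelope calculator. The per-million-token prices are illustrative assumptions; check the current pricing page before relying on them.

```python
# (input, output) USD per 1M tokens: assumed figures for illustration
PRICE_PER_M = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request from its token counts."""
    in_price, out_price = PRICE_PER_M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```

For a typical classification call (say 800 input tokens, 100 output), the ratio between the two models under these assumed prices lands in the 15-20x range the A/B test suggested.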
Uncontrolled output tokens. Output tokens cost more than input tokens on gpt-4o-mini (roughly 4x). If you’re expecting structured JSON and not setting max_tokens, the model is free to ramble. Setting response_format={"type": "json_object"} helps — output is more consistent and usually shorter — but the model will still sometimes add a verbose reasoning field if your schema allows it. Define your schema as tightly as possible. A max_tokens ceiling is cheap insurance.
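A sketch of what those two controls look like together. The 300-token ceiling is an assumption sized for a small JSON classification payload; tune it to your schema.

```python
def classification_request_kwargs(prompt: str, model: str = "gpt-4o-mini") -> dict:
    """Build kwargs for client.chat.completions.create with output capped."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # forces syntactically valid JSON (the word "JSON" must appear in the prompt)
        "response_format": {"type": "json_object"},
        # hard ceiling so a rambling "reasoning" field can't run up the bill
        "max_tokens": 300,
    }

# usage: client.chat.completions.create(**classification_request_kwargs(prompt))
```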
Stop Trusting Model Output at Face Value
I knew LLMs could hallucinate. What I didn’t act on quickly enough was that hallucinated output can corrupt your database before you notice.
Our classifier was supposed to return one of five specific category names. For maybe 1-2% of requests, the model returned a slight variation — a synonym, a different casing, occasionally something completely invented. Low frequency. Didn’t surface in manual spot checks. Then downstream systems started erroring out, we traced it back, and found months of malformed records in the database. Cleaning it up took half a day.
The fix was Pydantic with a properly defined enum:
import json
import logging
from enum import Enum

from pydantic import BaseModel, field_validator

logger = logging.getLogger(__name__)

class OutputValidationError(Exception):
    pass

class Category(str, Enum):
    TECHNICAL = "technical"
    BUSINESS = "business"
    LEGAL = "legal"
    MARKETING = "marketing"
    OTHER = "other"

class ClassificationResult(BaseModel):
    category: Category
    confidence: float
    reasoning: str

    @field_validator("confidence")
    @classmethod
    def confidence_range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        return v

def classify_document(text: str) -> ClassificationResult:
    # call_llm and metrics are defined elsewhere in the pipeline
    response = call_llm(
        f"Classify the following document. Respond only in JSON.\n\n{text}"
    )
    try:
        data = json.loads(response)
        return ClassificationResult(**data)
    except (json.JSONDecodeError, ValueError) as e:
        # track parse failures separately — spikes indicate prompt or model issues
        metrics.increment("llm.output_parse_failure")
        logger.error(f"Output parse failure: {e}, raw: {response[:200]}")
        raise OutputValidationError("Model output did not match expected schema") from e
One thing I noticed: tracking parse failures as a dedicated metric is more useful than it sounds. When that rate ticks up unexpectedly, it’s a signal that either your prompt changed, or the model behavior shifted quietly. OpenAI updates models without always announcing it — I’ve caught two silent behavioral changes by watching parse failure rates. One was in gpt-4o-mini sometime in late 2024, where structured output became slightly more verbose and started including wrapper text before the JSON.
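The alerting side of that metric can be as simple as a sliding window over recent parse outcomes. This is a hypothetical sketch; the window size and 5% threshold are assumptions you would tune to your own baseline.

```python
from collections import deque

class ParseFailureMonitor:
    """Track recent parse outcomes and flag when the failure rate spikes."""

    def __init__(self, window: int = 500, threshold: float = 0.05):
        self.outcomes: deque[bool] = deque(maxlen=window)
        self.threshold = threshold

    def record(self, parsed_ok: bool) -> None:
        self.outcomes.append(parsed_ok)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self) -> bool:
        # only alert once the window has enough data to be meaningful
        return len(self.outcomes) >= 100 and self.failure_rate() > self.threshold
```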
The confidence field earns its place too. Results below 0.5 go into a manual review queue rather than auto-processing. Whether model-reported confidence maps cleanly to real accuracy is debatable — in my experience it’s a rough guide, not a calibrated probability — but it’s better than nothing for triaging uncertain cases, and it surfaces edge cases you’d miss otherwise.
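The routing rule itself is tiny. The 0.5 cutoff is the one from the text, not a calibrated value:

```python
REVIEW_THRESHOLD = 0.5

def route_result(confidence: float) -> str:
    """Return the queue a classification should land in."""
    return "auto_process" if confidence >= REVIEW_THRESHOLD else "manual_review"
```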
Observability: The Part I Underinvested In
Of everything in this post, this is the part I’d call most underrated. Regular API services are relatively easy to instrument: request logs, error rates, latency histograms. AI pipelines have all of those problems plus new ones — nondeterministic outputs, token-based cost that varies per request, silent model changes, and failure modes that only appear with specific input types.
Per request, I track: input token count, output token count, latency, model name, success/failure, retry count, parse success/failure, and the estimated cost calculated from the usage field in the API response. That last one matters. Don’t rely on the OpenAI dashboard for cost attribution — you want per-request data so you can tell which pipeline stage or input type is expensive. “Our AI costs are up 40% this month” is useless without knowing where.
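One way to capture that per-request record, with the cost derived from the token counts the usage field reports (response.usage.prompt_tokens and completion_tokens). The price table is an assumed example; keep yours in config so pricing changes don't mean code changes.

```python
from dataclasses import dataclass

# (input, output) USD per 1M tokens: assumed figures, not current pricing
PRICE_PER_M = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

@dataclass
class RequestRecord:
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    success: bool
    retry_count: int

    @property
    def estimated_cost(self) -> float:
        """Estimated USD cost computed from this request's own token counts."""
        in_price, out_price = PRICE_PER_M[self.model]
        return (self.input_tokens * in_price
                + self.output_tokens * out_price) / 1_000_000
```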
Beyond metrics, I do random sampling. About 1-2% of completed requests get flagged for a human to look at. That human is me, usually on Friday mornings for 30 minutes. Not automated validation — actual eyeballing of the input, the output, and whether they make sense together. It’s tedious. It’s also how I caught a prompt regression before it became a real incident, twice. Automated metrics catch that something is wrong; sampling helps you understand what is wrong and why.
I tried LangSmith for a few weeks. Genuinely useful for debugging complex chains with branching logic or multi-step agent loops. Our pipeline doesn’t have that — it’s mostly linear — so the overhead wasn’t worth the benefit. I’m on OpenTelemetry now, shipping traces to Grafana. More setup upfront, but the instrumentation is transparent and there’s no vendor lock-in if I want to switch observability stacks later.
Confession: I didn’t build any of this seriously until month three. The “we’ll add observability later” mindset cost more than just the hours to retroactively instrument — it cost the incidents I could have caught earlier. Retrofitting per-request cost tracking and latency attribution into a running pipeline is genuinely painful. Build it first, even when it feels premature.
What I’d Actually Build Today
No hedging here — this is what I’d do if starting from scratch.
Use the SDK directly. If your pipeline is a sequence of LLM calls with pre/post processing, skip LangChain. The abstraction layer makes debugging harder and every major version upgrade breaks something (the langchain 0.2 → 0.3 migration was not a good time). LangChain earns its overhead for complex agent loops or multi-step RAG systems. For a focused production pipeline, it’s complexity you don’t need.
Separate retryable and non-retryable errors from day one. Log retry counts, log final outcomes, log the specific failure reason. You will thank yourself when debugging at 11pm.
Set up per-request cost tracking immediately. Add a daily spend alert in the OpenAI dashboard — mine is at $50/day. It has triggered twice. Both times it was a bug, not legitimate load.
Canary new prompts and model changes. Route 10% of traffic to the new version, watch the metrics for a few hours, then expand. Newer model does not automatically mean better results for your specific task. I’ve rolled back twice when a gpt-4o update produced noisier classification output than whatever was running before.
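One way to implement that split deterministically: hash a stable request key into a bucket, so the same document always hits the same prompt version and comparisons stay clean. The hashing scheme is my own sketch; the 10% is the figure from the text.

```python
import hashlib

def use_canary(request_key: str, canary_percent: int = 10) -> bool:
    """Route roughly canary_percent of traffic to the new version,
    deterministically per request key."""
    bucket = int(hashlib.sha256(request_key.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent
```

Deterministic routing beats random.random() here: a flaky document that lands in the canary once lands there every time, which makes regressions reproducible.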
Validate outputs with Pydantic. Not because models are bad at following instructions — they’re actually quite good — but because “quite good” isn’t good enough when 1% failure rate at 15,000 requests per month means 150 corrupted records. That’s not acceptable.
I’m not 100% sure everything here scales to 10x the volume — some of this, especially around queueing and concurrency management, would need rethinking at that level. But if you’re running a serious internal AI pipeline and starting to feel like the wheels are loosening, this is the checklist I wish someone had handed me six months ago. The 2am alerts are preventable. Most of them, anyway.