{"id":157,"date":"2026-03-08T23:07:47","date_gmt":"2026-03-08T23:07:47","guid":{"rendered":"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/building-production-ready-ai-pipelines-lessons-fro\/"},"modified":"2026-03-18T22:00:08","modified_gmt":"2026-03-18T22:00:08","slug":"building-production-ready-ai-pipelines-lessons-fro","status":"publish","type":"post","link":"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/building-production-ready-ai-pipelines-lessons-fro\/","title":{"rendered":"Building Production-Ready AI Pipelines: Lessons from Running 10K+ Generations"},"content":{"rendered":"<p>It was a Tuesday morning when I opened our Datadog dashboard and saw 847 silent failures from the previous night&#8217;s batch job. No alerts. No exceptions in our logs. Just a queue that had quietly eaten thousands of tokens and returned nothing useful. Our pipeline had been &#8220;succeeding&#8221; <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/rag-deep-dive-chunking-strategies-vector-databases\/\" title=\"in the\">in the<\/a> sense that it wasn&#8217;t throwing errors \u2014 it was just producing garbage and writing it to the database like everything was fine.<\/p>\n<p>That was month two of running LLM-powered features in <a href=\"https:\/\/m.do.co\/c\/06956e5e2802\" title=\"DigitalOcean for Production Workloads\" rel=\"nofollow sponsored\" target=\"_blank\">production<\/a>. I thought I had it figured out by then. I did not.<\/p>\n<p>Over the past eight months, on a three-person team, I&#8217;ve pushed somewhere north of 10,000 generations through <a href=\"https:\/\/m.do.co\/c\/06956e5e2802\" title=\"DigitalOcean for Production Workloads\" rel=\"nofollow sponsored\" target=\"_blank\">production<\/a> pipelines \u2014 across Claude 3.5, GPT-4o, and a brief, regrettable experiment with a self-hosted Mistral instance that I will get to. 
Here&#8217;s <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/05\/rag-vs-fine-tuning-when-to-use-each-technique-for\/\" title=\"What I Actually Learned\">what I actually learned<\/a>, as opposed to <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/09\/setting-up-github-actions-for-python-applications\/\" title=\"What the\">what the<\/a> documentation implied I would need to care about.<\/p>\n<h2>Retry Logic Is a Trap If You Do It Wrong<\/h2>\n<p>Every guide tells you to implement retries. What they <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/09\/setting-up-github-actions-for-python-applications\/\" title=\"Don&#8217;t Tell You\">don&#8217;t tell you<\/a> is that naive exponential backoff will bankrupt you during a rate limit storm, and that retrying on the wrong error codes will just make your problems worse, faster.<\/p>\n<p>My first implementation looked roughly like this:<\/p>\n<pre><code class=\"language-python\">import anthropic\nimport time\nimport random\n\nclient = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set\n\ndef call_llm_with_retry(prompt, max_retries=3):\n    for attempt in range(max_retries):\n        try:\n            response = client.messages.create(\n                model=&quot;claude-3-5-sonnet-20241022&quot;,\n                max_tokens=1024,\n                messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: prompt}]\n            )\n            return response.content[0].text\n        except Exception as e:\n            if attempt == max_retries - 1:\n                raise\n            wait = (2 ** attempt) + random.uniform(0, 1)\n            time.sleep(wait)\n<\/code><\/pre>\n<p>Looks fine, right? The problem is that broad <code>except Exception<\/code>. I was retrying on context length errors (HTTP 400) \u2014 deterministic failures where no amount of waiting fixes a prompt that&#8217;s 2,000 tokens over the limit. I was also retrying on content policy rejections. 
And on malformed JSON responses from my own parsing layer, which weren&#8217;t even API errors.<\/p>\n<p>After a particularly bad Friday afternoon <a href=\"https:\/\/m.do.co\/c\/06956e5e2802\" title=\"Deploy on DigitalOcean Cloud\" rel=\"nofollow sponsored\" target=\"_blank\">deploy<\/a> where this pattern caused a cascade of 400 errors that chewed through our rate limit budget retrying requests that were never going to succeed, I got specific:<\/p>\n<pre><code class=\"language-python\">import anthropic\nimport time\nimport random\nimport logging\n\nclient = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set\n\nRETRYABLE_STATUS_CODES = {429, 500, 502, 503, 529}\n\ndef call_llm_with_retry(prompt: str, max_retries: int = 4) -&gt; str:\n    last_exception = None\n\n    for attempt in range(max_retries):\n        try:\n            response = client.messages.create(\n                model=&quot;claude-3-5-sonnet-20241022&quot;,\n                max_tokens=1024,\n                messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: prompt}]\n            )\n            return response.content[0].text\n\n        except anthropic.RateLimitError as e:\n            # 429 \u2014 back off hard, respect Retry-After header if present\n            retry_after = getattr(e, 'retry_after', None)\n            wait = retry_after if retry_after else (2 ** attempt) * 2 + random.uniform(0, 2)\n            logging.warning(f&quot;Rate limited. 
Waiting {wait:.1f}s (attempt {attempt + 1})&quot;)\n            time.sleep(wait)\n            last_exception = e\n\n        except anthropic.APIStatusError as e:\n            if e.status_code not in RETRYABLE_STATUS_CODES:\n                # 400, 401, 403, 404 \u2014 these won't get better with retries\n                logging.error(f&quot;Non-retryable API error {e.status_code}: {e.message}&quot;)\n                raise\n            wait = (2 ** attempt) + random.uniform(0, 1)\n            time.sleep(wait)\n            last_exception = e\n\n        except anthropic.APIConnectionError as e:\n            # Network issues \u2014 retry with backoff\n            wait = (2 ** attempt) + random.uniform(0, 1)\n            time.sleep(wait)\n            last_exception = e\n\n    raise last_exception\n<\/code><\/pre>\n<p>The separation between retryable and non-retryable errors cut our wasted API spend by about 30% <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/rag-deep-dive-chunking-strategies-vector-databases\/\" title=\"in the\">in the<\/a> first week. Not because we were hitting tons of 400s \u2014 we weren&#8217;t \u2014 but because when we did, they were expensive ones (long prompts) and we were burning budget retrying them six times.<\/p>\n<p>Practical takeaway: type your exceptions. If you&#8217;re using Anthropic&#8217;s SDK, <code>anthropic.RateLimitError<\/code>, <code>anthropic.APIStatusError<\/code>, and <code>anthropic.APIConnectionError<\/code> are distinct and should be handled differently. Same pattern applies to OpenAI&#8217;s SDK.<\/p>\n<h2>The Cost Math Will Surprise You<\/h2>\n<p>I thought I had a handle on costs. Did input\/output token estimates, built a little calculator, felt confident. Then I saw the actual bill.<\/p>\n<p>The issue wasn&#8217;t the per-token cost. 
It was everything I hadn&#8217;t accounted for: tokens burned on retries, on failed generations, on the system prompt I was including on <em>every single request<\/em> even when most of those requests didn&#8217;t need the full context. My system prompt was 847 tokens. Across 10,000 requests, that&#8217;s 8.47 million tokens of input just for boilerplate.<\/p>\n<p>So I started being deliberate about prompt architecture. Short context = shorter system prompt. A simple classification task doesn&#8217;t need the five-paragraph system prompt I wrote for open-ended generation. I built a prompt registry \u2014 nothing fancy, just a dict of prompt templates keyed by task type \u2014 and matched prompt complexity to task complexity.<\/p>\n<p>One thing I noticed that genuinely surprised me: batch processing doesn&#8217;t just save money on some APIs, it changes your failure mode profile entirely. With synchronous requests, latency spikes cause timeouts and downstream failures. With batch, the failure shows up hours later when you check results. Both are annoying; they&#8217;re annoying in different ways, on different schedules. <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/edge-computing-in-2026-why-developers-are-adopting\/\" title=\"for Our\">For our<\/a> async summarization jobs, batch <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/edge-computing-in-2026-why-developers-are-adopting\/\" title=\"Made Sense\">made sense<\/a>. For anything user-facing, obviously not.<\/p>\n<p>Also: watch your output token limits. I was setting <code>max_tokens=4096<\/code> on everything out of habit. The model doesn&#8217;t charge you for tokens it doesn&#8217;t use, but it holds a connection open while generating. 
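<\/p>
<p>The registry I mentioned really is just a dict. A minimal sketch, with task names, prompts, and token budgets that are illustrative rather than my real values:<\/p>

```python
# Prompt registry: match prompt size and output budget to task complexity.
# Task names and budgets here are illustrative.
PROMPT_REGISTRY = {
    'classify': {
        'system': 'Classify the input into exactly one category. Reply with the label only.',
        'max_tokens': 16,
    },
    'summarize': {
        'system': 'Summarize the input in 2-3 sentences, preserving key facts.',
        'max_tokens': 256,
    },
    'generate': {
        'system': 'You are a helpful assistant...',  # the long system prompt lives here, and only here
        'max_tokens': 1024,
    },
}

def build_request(task_type: str, user_content: str) -> dict:
    # Keyword arguments for a messages.create() call, sized to the task.
    cfg = PROMPT_REGISTRY[task_type]
    return {
        'model': 'claude-3-5-sonnet-20241022',
        'system': cfg['system'],
        'max_tokens': cfg['max_tokens'],
        'messages': [{'role': 'user', 'content': user_content}],
    }
```

<p>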
For tasks that reliably produce short outputs, tighter limits improve throughput and catch runaway generations early.<\/p>\n<h2>Observability: <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/github-copilot-alternatives-in-2026-cursor-codeium\/\" title=\"What I Actually\">What I Actually<\/a> Watch<\/h2>\n<p>Before shipping, I imagined needing detailed traces of every reasoning step. <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/github-copilot-alternatives-in-2026-cursor-codeium\/\" title=\"What I Actually\">What I actually<\/a> monitor day-to-day is much simpler and more boring.<\/p>\n<p>The signals that matter in my setup:<\/p>\n<ul>\n<li><strong>Generation latency (p50, p95, p99)<\/strong> \u2014 p95 being more than 3x p50 usually means something weird is happening upstream, or my prompts have gotten inconsistent<\/li>\n<li><strong>Token count per request<\/strong> \u2014 sudden spikes here mean prompt injection or a bug in my context-building logic<\/li>\n<li><strong>Stop reason distribution<\/strong> \u2014 if <code>stop_reason: \"max_tokens\"<\/code> climbs above ~5%, something is wrong with my output length assumptions<\/li>\n<li><strong>Error rate by error type<\/strong> \u2014 separated by retryable vs. non-retryable, as above<\/li>\n<\/ul>\n<p>Maybe at scale I&#8217;d need semantic similarity metrics, hallucination detection, all of that. But for 10K generations a month, the operational signals told me more about what was actually broken than anything <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/rag-deep-dive-chunking-strategies-vector-databases\/\" title=\"in the\">in the<\/a> &#8220;LLM observability&#8221; category.<\/p>\n<p>The one exception: I log a random 1% sample of full prompt+response pairs to a separate store for offline review. 
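<\/p>
<p>The sampling hook itself is only a few lines. A sketch of the pattern, with the storage reduced to an in-memory list for illustration (in practice it writes to a separate store):<\/p>

```python
import json
import random

SAMPLE_RATE = 0.01  # 1% of requests

def maybe_log_sample(prompt: str, response: str, store: list) -> bool:
    # Roll once per request; on a hit, persist the full pair for offline review.
    if random.random() < SAMPLE_RATE:
        store.append(json.dumps({'prompt': prompt, 'response': response}))
        return True
    return False
```

<p>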
This has caught more real bugs than any automated metric \u2014 things like my context truncation cutting off mid-sentence; a template interpolation bug that put <code>{customer_name}<\/code> literally <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/rag-deep-dive-chunking-strategies-vector-databases\/\" title=\"in the\">in the<\/a> prompt for about 200 requests before I noticed; and a system prompt that was accidentally instructing the <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/05\/claude-vs-gpt-4o-vs-gemini-20-which-ai-model-to-us\/\" title=\"Model to\">model to<\/a> respond in Spanish for reasons I still haven&#8217;t fully traced. (I pushed a fix before finding the root cause. Bad habit. But it&#8217;s fixed.)<\/p>\n<p>Right, so \u2014 the self-hosted Mistral experiment. I spent <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/ai-coding-assistant-benchmarks-real-world-performa\/\" title=\"Three Weeks\">three weeks<\/a> running a quantized Mistral 7B instance on a rented A100, convinced I&#8217;d save money and gain latency control. The latency was fine. The output quality on my specific tasks (structured extraction from messy text) was noticeably worse, and the operational overhead of managing the inference server ate most of the cost savings. Your mileage may vary if you have a team with MLOps experience; I don&#8217;t, really. We&#8217;re primarily a web backend shop and it showed. Took it down in week four.<\/p>\n<h2>Structured Output Is More Fragile Than the Demos Suggest<\/h2>\n<p>Getting JSON out of a language model reliably was the part I underestimated most. The demos always work. 
<a href=\"https:\/\/m.do.co\/c\/06956e5e2802\" title=\"DigitalOcean for Production Workloads\" rel=\"nofollow sponsored\" target=\"_blank\">Production<\/a> does not always work.<\/p>\n<p>Even with JSON mode or tool use, you need to handle partial outputs, schema mismatches, and the question of what to do when validation fails. I went through three iterations:<\/p>\n<ol>\n<li>Prompt-only JSON extraction \u2014 about 92% success rate, which sounds okay until you realize 8% silent failures is catastrophic at scale<\/li>\n<li>JSON mode with <code>response_format: {type: \"json_object\"}<\/code> \u2014 better, but this only enforces valid JSON, not your schema<\/li>\n<li>Tool use \/ function calling with strict schema \u2014 this is where I landed, and it&#8217;s genuinely better, though you pay for <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/05\/copilot-vs-cursor-vs-codeium\/\" title=\"It in\">it in<\/a> prompt complexity<\/li>\n<\/ol>\n<p>Even with strict tool use, I see maybe 1-2% of responses where the model technically calls the tool but fills optional fields with placeholder values (&#8220;N\/A&#8221;, &#8220;unknown&#8221;, empty strings) instead of omitting them. I validate against a Pydantic model post-extraction and route those to a dead letter queue for human review rather than silently accepting them.<\/p>\n<p>The dead letter queue was one of the better decisions I made. It gave me a place for &#8220;I&#8217;m not sure what to do with this&#8221; responses that wasn&#8217;t &#8220;crash&#8221; or &#8220;silently corrupt the database.&#8221; About 200 of those 847 initial failures would have been catchable with this pattern.<\/p>\n<h2>What I&#8217;d Actually Recommend<\/h2>\n<p>If you&#8217;re starting from scratch, my honest suggestion is to just use the managed APIs. 
I know the &#8220;self-host for control&#8221; argument is appealing \u2014 I made it to myself <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/ai-coding-assistant-benchmarks-real-world-performa\/\" title=\"for Three Weeks\">for three weeks<\/a> before the Mistral experiment cured me of it. Unless you have a hard data residency requirement or you&#8217;re processing volumes where the math definitively works out, the operational cost is real and it doesn&#8217;t show up <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/rag-deep-dive-chunking-strategies-vector-databases\/\" title=\"in the\">in the<\/a> <a href=\"https:\/\/www.amazon.com\/s?k=GPU+for+deep+learning&#038;tag=synsun0f-20\" title=\"Best GPUs for AI and Deep Learning on Amazon\" rel=\"nofollow sponsored\" target=\"_blank\">GPU<\/a> rental price.<\/p>\n<p>Error handling before observability. Seriously, in that order. One well-typed exception handler is worth more than a week of dashboards. Know which errors are retryable, retry those, raise the rest immediately. You cannot dashboard your way out of code that silently fails.<\/p>\n<p>A dead letter queue \u2014 build it on day one, not when you&#8217;re scrambling to understand your failure modes at day fifty. Every generation pipeline has some percentage of responses that don&#8217;t fit the happy path. 
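<\/p>
<p>In code, the routing is a small branch after validation. A sketch with the Pydantic check simplified to a plain function and the queues reduced to lists (all names illustrative):<\/p>

```python
# Placeholder values the model sometimes fills in instead of omitting a field.
PLACEHOLDERS = {'n/a', 'unknown', ''}

def find_problems(payload: dict, required: list) -> list:
    # In my pipeline this is a Pydantic model; reduced here to a plain check.
    problems = []
    for field in required:
        value = payload.get(field)
        if value is None:
            problems.append(f'missing field: {field}')
        elif isinstance(value, str) and value.strip().lower() in PLACEHOLDERS:
            problems.append(f'placeholder in {field}: {value!r}')
    return problems

def route_result(payload: dict, required: list, good: list, dead_letter: list) -> None:
    # Anything that fails validation goes to the dead letter queue for review,
    # never silently into the database.
    problems = find_problems(payload, required)
    if problems:
        dead_letter.append({'payload': payload, 'problems': problems})
    else:
        good.append(payload)
```

<p>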
&#8220;Fail loudly and queue for human review&#8221; is so much better than &#8220;silently accept garbage and discover the problem during a <a href=\"https:\/\/m.do.co\/c\/06956e5e2802\" title=\"DigitalOcean for Production Workloads\" rel=\"nofollow sponsored\" target=\"_blank\">production<\/a> incident.&#8221;<\/p>\n<p>Log a random sample of prompts and responses \u2014 1% to a separate store, or 0.1% at higher volumes. Not all of them; storage costs add up fast. But that sample will surface things no metric catches. The <code>{customer_name}<\/code> bug I mentioned? Found via sampling, not alerting.<\/p>\n<p>And keep your prompts in version control. Treat prompt changes like code changes. I have a <code>prompts\/<\/code> directory, everything is versioned, and significant changes go through the same review process as code. I still see teams treating prompts as configuration rather than code, and they discover why that&#8217;s a mistake when something breaks and they can&#8217;t tell what changed.<\/p>\n<p>The thing I keep coming back to: the hard part of AI pipelines isn&#8217;t the AI. It&#8217;s the same distributed systems problems \u2014 queueing, retries, schema validation, observability \u2014 just with a new failure mode where the output looks plausible even when it&#8217;s completely wrong. That last part is what makes it genuinely harder than it sounds. A network timeout is obvious. A response that passes JSON validation but returns the wrong answer is not. 
Once I started treating it as a distributed systems problem with an extra validation layer, things got clearer. Not easy. Just clearer.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It was a Tuesday morning when I opened our Datadog dashboard and saw 847 silent failures from the previous night\u2019s batch job. 
No alerts.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[1],"tags":[],"class_list":["post-157","post","type-post","status-publish","format-standard","hentry","category-general"],"_links":{"self":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts\/157","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/comments?post=157"}],"version-history":[{"count":14,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts\/157\/revisions"}],"predecessor-version":[{"id":484,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts\/157\/revisions\/484"}],"wp:attachment":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/media?parent=157"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/categories?post=157"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/tags?post=157"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}