{"id":31,"date":"2026-03-05T15:14:39","date_gmt":"2026-03-05T15:14:39","guid":{"rendered":"https:\/\/blog.rebalai.com\/en\/2026\/03\/05\/ai-pipeline-lessons\/"},"modified":"2026-03-18T22:00:09","modified_gmt":"2026-03-18T22:00:09","slug":"ai-pipeline-lessons","status":"publish","type":"post","link":"https:\/\/blog.rebalai.com\/en\/2026\/03\/05\/ai-pipeline-lessons\/","title":{"rendered":"Building Production-Ready AI Pipelines: Lessons from Running 10K+ Generations"},"content":{"rendered":"<p><script type=\"application\/ld+json\">\n{\n  \"@context\": \"https:\/\/schema.org\",\n  \"@type\": \"BlogPosting\",\n  \"headline\": \"Building Production-Ready AI Pipelines: Lessons from Running 10K+ Generations\",\n  \"description\": \"Last November, my phone lit up at 2am with a Slack alert.\",\n  \"url\": \"https:\/\/blog.rebalai.com\/en\/2026\/03\/05\/ai-pipeline-lessons\/\",\n  \"datePublished\": \"2026-03-05T15:14:39\",\n  \"dateModified\": \"2026-03-05T17:39:33\",\n  \"inLanguage\": \"en-US\",\n  \"author\": {\n    \"@type\": \"Organization\",\n    \"name\": \"RebalAI\",\n    \"url\": \"https:\/\/blog.rebalai.com\/en\/\"\n  },\n  \"publisher\": {\n    \"@type\": \"Organization\",\n    \"name\": \"RebalAI\",\n    \"logo\": {\n      \"@type\": \"ImageObject\",\n      \"url\": \"https:\/\/blog.rebalai.com\/wp-content\/uploads\/logo.png\"\n    }\n  },\n  
\"mainEntityOfPage\": {\n    \"@type\": \"WebPage\",\n    \"@id\": \"https:\/\/blog.rebalai.com\/en\/2026\/03\/05\/ai-pipeline-lessons\/\"\n  }\n}\n<\/script><\/p>\n<p>Last November, my phone lit up at 2am with a Slack alert. Our content classification pipeline had been quietly running for six hours \u2014 and in that time, it had burned through $340 in OpenAI API credits and produced results that were roughly 70% garbage. The cause? A bug in the retry logic that couldn&#8217;t distinguish <em>why<\/em> a request was failing. Every <code>context_length_exceeded<\/code> error got retried three times, full stop. By morning, the damage was done.<\/p>\n<p>That was the moment I stopped treating AI pipelines as &#8220;API calls with a bit of plumbing&#8221; and started taking them seriously as production systems. Since then, I&#8217;ve processed over 15,000 generations through the same pipeline \u2014 document classification, code review automation, an internal Q&amp;A system for a 12-person eng team \u2014 and I&#8217;ve got very specific opinions about what actually breaks in practice.<\/p>\n<h2>Retry Logic Is Not a One-Size-Fits-All Problem<\/h2>\n<p>The version I shipped initially was straightforward. 
I&#8217;d wired up <code>tenacity<\/code>, set exponential backoff, called it done.<\/p>\n<pre><code class=\"language-python\">from openai import OpenAI\nfrom tenacity import retry, stop_after_attempt, wait_exponential\n\nclient = OpenAI()\n\n@retry(\n    stop=stop_after_attempt(3),\n    wait=wait_exponential(multiplier=1, min=4, max=60)\n)\ndef call_llm(prompt: str) -&gt; str:\n    response = client.chat.completions.create(\n        model=&quot;gpt-4o&quot;,\n        messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: prompt}]\n    )\n    return response.choices[0].message.content\n<\/code><\/pre>\n<p>Nothing technically wrong with this. The problem is it&#8217;s dangerously incomplete. OpenAI&#8217;s errors are not all retryable, and treating them identically is exactly how you burn $340 overnight.<\/p>\n<p><code>RateLimitError<\/code>? Retry it \u2014 you hit a quota, you wait, you try again. <code>BadRequestError<\/code>? Retrying that is just throwing money at a wall. If your prompt exceeded the context limit or you passed a malformed parameter, attempt number two will fail for the exact same reason. 
That 2am incident was precisely this: a prompt with a <code>context_length_exceeded<\/code> error getting queued for retries over and over, each one billing tokens for partial input before failing.<\/p>\n<p>Here is the split that actually matters:<\/p>\n<ul>\n<li><strong>Retry<\/strong>: <code>RateLimitError<\/code>, <code>APITimeoutError<\/code>, <code>APIConnectionError<\/code>, <code>InternalServerError<\/code> (5xx)<\/li>\n<li><strong>Don&#8217;t retry<\/strong>: <code>AuthenticationError<\/code>, <code>PermissionDeniedError<\/code>, <code>NotFoundError<\/code><\/li>\n<li><strong>Inspect first<\/strong>: <code>BadRequestError<\/code> \u2014 check the message, then decide<\/li>\n<\/ul>\n<pre><code class=\"language-python\">import openai\nfrom tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type\n\nRETRYABLE_EXCEPTIONS = (\n    openai.RateLimitError,\n    openai.APITimeoutError,\n    openai.APIConnectionError,\n    openai.InternalServerError,\n)\n\n@retry(\n    retry=retry_if_exception_type(RETRYABLE_EXCEPTIONS),\n    stop=stop_after_attempt(4),\n    wait=wait_exponential(multiplier=2, min=5, max=120),\n    reraise=True\n)\ndef call_llm(prompt: str, model: str = &quot;gpt-4o-mini&quot;) -&gt; str:\n    try:\n        response = client.chat.completions.create(\n            model=model,\n            messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: prompt}],\n            timeout=30.0  # do not skip this\n        )\n        return response.choices[0].message.content\n    except openai.BadRequestError as e:\n        # context_length_exceeded is not retryable \u2014 fail fast\n        if &quot;context_length_exceeded&quot; in str(e):\n            raise ValueError(f&quot;Prompt too long: {len(prompt)} chars&quot;) from e\n        raise\n<\/code><\/pre>\n<p>That 
<code>timeout=30.0<\/code> is not optional. Without it, the OpenAI API will sometimes just sit there \u2014 I&#8217;ve seen it hang for 90+ seconds during off-peak hours. Workers get stuck, batch jobs stall for hours instead of minutes. Set the timeout. Four attempts is also the right ceiling in my experience; beyond that you&#8217;re usually hitting something structural, not transient, and you&#8217;re just adding latency while solving nothing.<\/p>\n<h2>Three Places Cost Will Catch You Off Guard<\/h2>\n<p>Honestly, I underestimated cost management early on. &#8220;It&#8217;s just tokens, how bad can it get&#8221; is something I literally said out loud. I stopped saying it around month two.<\/p>\n<p><strong>System prompt duplication.<\/strong> When you batch 100 documents through the same pipeline, and your system prompt is 500 tokens, you&#8217;re sending that system prompt 100 times. OpenAI&#8217;s prompt caching (available since October 2024) gives a 50% discount on cached input tokens for gpt-4o and gpt-4o-mini \u2014 but it only kicks in when the cacheable prefix is at least 1024 tokens. Shorter system prompts get no benefit. Structure your prompts so the stable content leads and is substantial enough to hit that threshold; the dynamic per-document content can follow.<\/p>\n<p><strong>Model selection.<\/strong> My default was gpt-4o for everything, because that&#8217;s what I&#8217;d been using in prototypes and it felt safer. After running an A\/B test on 50 documents \u2014 same inputs, both models, manually comparing outputs \u2014 I found that for simple classification tasks, gpt-4o-mini was within 3-5% accuracy while costing roughly 15-20x less. 
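<\/p>\n<p>To make that gap concrete, here is the arithmetic as a small sketch. The per-million-token prices are assumptions for illustration; they drift, so check the current pricing page before trusting the numbers:<\/p>\n<pre><code class=\"language-python\"># Rough batch-cost comparison. Prices in USD per 1M tokens are assumed\n# for illustration and will go out of date; verify before relying on them.\nPRICES = {\n    &quot;gpt-4o&quot;: {&quot;input&quot;: 2.50, &quot;output&quot;: 10.00},\n    &quot;gpt-4o-mini&quot;: {&quot;input&quot;: 0.15, &quot;output&quot;: 0.60},\n}\n\ndef batch_cost(model: str, docs: int, in_tok: int, out_tok: int) -&gt; float:\n    # Estimated USD cost for a batch of similarly sized documents.\n    p = PRICES[model]\n    return docs * (in_tok * p[&quot;input&quot;] + out_tok * p[&quot;output&quot;]) \/ 1_000_000\n\n# 1,000 docs at ~1,500 input and ~200 output tokens each:\nbatch_cost(&quot;gpt-4o&quot;, 1000, 1500, 200)       # 5.75\nbatch_cost(&quot;gpt-4o-mini&quot;, 1000, 1500, 200)  # 0.345, about 17x cheaper\n<\/code><\/pre>\n<p>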
I moved all classification and keyword extraction to mini, kept gpt-4o for complex reasoning and long-document summarization. That one change cut my monthly spend by about 60%.<\/p>\n<p>The key word is &#8220;simple.&#8221; Don&#8217;t assume your task qualifies \u2014 run the comparison yourself. What counts as simple enough for mini varies considerably by task type.<\/p>\n<p><strong>Uncontrolled output tokens.<\/strong> Output tokens cost more than input tokens on gpt-4o-mini (roughly 4x). If you&#8217;re expecting structured JSON and not setting <code>max_tokens<\/code>, the model is free to ramble. Setting <code>response_format={\"type\": \"json_object\"}<\/code> helps \u2014 output is more consistent and usually shorter \u2014 but the model will still sometimes add a verbose <code>reasoning<\/code> field if your schema allows it. Define your schema as tightly as possible. A <code>max_tokens<\/code> ceiling is cheap insurance.<\/p>\n<h2>Stop Trusting Model Output at Face Value<\/h2>\n<p>I knew LLMs could hallucinate. What I didn&#8217;t act on quickly enough was that hallucinated output can corrupt your database before you notice.<\/p>\n<p>Our classifier was supposed to return one of five specific category names. For maybe 1-2% of requests, the model returned a slight variation \u2014 a synonym, a different casing, occasionally something completely invented. Low frequency. Didn&#8217;t surface in manual spot checks. Then downstream systems started erroring out, we traced it back, and found months of malformed records in the database. 
Cleaning it up took half a day.<\/p>\n<p>The fix was Pydantic with a properly defined enum:<\/p>\n<pre><code class=\"language-python\">import json\nimport logging\nfrom enum import Enum\n\nfrom pydantic import BaseModel, field_validator\n\nlogger = logging.getLogger(__name__)\n\nclass OutputValidationError(ValueError):\n    # Raised when model output fails schema validation.\n    pass\n\nclass Category(str, Enum):\n    TECHNICAL = &quot;technical&quot;\n    BUSINESS = &quot;business&quot;\n    LEGAL = &quot;legal&quot;\n    MARKETING = &quot;marketing&quot;\n    OTHER = &quot;other&quot;\n\nclass ClassificationResult(BaseModel):\n    category: Category\n    confidence: float\n    reasoning: str\n\n    @field_validator(&quot;confidence&quot;)\n    @classmethod\n    def confidence_range(cls, v: float) -&gt; float:\n        if not 0.0 &lt;= v &lt;= 1.0:\n            raise ValueError(&quot;confidence must be between 0 and 1&quot;)\n        return v\n\ndef classify_document(text: str) -&gt; ClassificationResult:\n    response = call_llm(\n        f&quot;Classify the following document. Respond only in JSON.\\n\\n{text}&quot;\n    )\n    try:\n        data = json.loads(response)\n        return ClassificationResult(**data)\n    except (json.JSONDecodeError, ValueError) as e:\n        # track parse failures separately \u2014 spikes indicate prompt or model issues\n        # (metrics is whatever stats client you use: StatsD, Prometheus, etc.)\n        metrics.increment(&quot;llm.output_parse_failure&quot;)\n        logger.error(f&quot;Output parse failure: {e}, raw: {response[:200]}&quot;)\n        raise OutputValidationError(&quot;Model output did not match expected schema&quot;) from e\n<\/code><\/pre>\n<p>One thing I noticed: tracking parse failures as a dedicated metric is more useful than it sounds. When that rate ticks up unexpectedly, it&#8217;s a signal that either your prompt changed, or the model behavior shifted quietly. OpenAI updates models without always announcing it \u2014 I&#8217;ve caught two silent behavioral changes by watching parse failure rates. 
One was in gpt-4o-mini sometime in late 2024, where structured output became slightly more verbose and started including wrapper text before the JSON.<\/p>\n<p>The <code>confidence<\/code> field earns its place too. Results below 0.5 go into a manual review queue rather than auto-processing. Whether model-reported confidence maps cleanly to real accuracy is debatable \u2014 in my experience it&#8217;s a rough guide, not a calibrated probability \u2014 but it&#8217;s better than nothing for triaging uncertain cases, and it surfaces edge cases you&#8217;d miss otherwise.<\/p>\n<h2>Observability: The Part I Underinvested In<\/h2>\n<p>This is the most underrated part of the whole discipline. Regular API services are relatively easy to instrument: request logs, error rates, latency histograms. AI pipelines have all of those problems plus new ones \u2014 nondeterministic outputs, token-based cost that varies per request, silent model changes, and failure modes that only appear with specific input types.<\/p>\n<p>Per request, I track: input token count, output token count, latency, model name, success\/failure, retry count, parse success\/failure, and the estimated cost calculated from the <code>usage<\/code> field in the API response. That last one matters. Don&#8217;t rely on the OpenAI dashboard for cost attribution \u2014 you want per-request data so you can tell which pipeline stage or input type is expensive. &#8220;Our AI costs are up 40% this month&#8221; is useless without knowing where.<\/p>\n<p>Beyond metrics, I do random sampling. About 1-2% of completed requests get flagged for a human to look at. That human is me, usually on Friday mornings for 30 minutes. Not automated validation \u2014 actual eyeballing of the input, the output, and whether they make sense together. It&#8217;s tedious. 
It&#8217;s also how I caught a prompt regression before it became a real incident, twice. Automated metrics catch <em>that<\/em> something is wrong; sampling helps you understand <em>what<\/em> is wrong and <em>why<\/em>.<\/p>\n<p>I tried LangSmith for a few weeks. Genuinely useful for debugging complex chains with branching logic or multi-step agent loops. Our pipeline doesn&#8217;t have that \u2014 it&#8217;s mostly linear \u2014 so the overhead wasn&#8217;t worth the benefit. I&#8217;m on OpenTelemetry now, shipping traces to Grafana. More setup upfront, but the instrumentation is transparent and there&#8217;s no vendor lock-in if I want to switch observability stacks later.<\/p>\n<p>Confession: I didn&#8217;t build any of this seriously until month three. The &#8220;we&#8217;ll add observability later&#8221; mindset cost more than just the hours to retroactively instrument \u2014 it cost the incidents I could have caught earlier. Retrofitting per-request cost tracking and latency attribution into a running pipeline is genuinely painful. Build it first, even when it feels premature.<\/p>\n<h2>What I&#8217;d Actually Build Today<\/h2>\n<p>No hedging here \u2014 this is what I&#8217;d do if starting from scratch.<\/p>\n<p><strong>Use the SDK directly.<\/strong> If your pipeline is a sequence of LLM calls with pre\/post processing, skip LangChain. The abstraction layer makes debugging harder and every major version upgrade breaks something (the langchain 0.2 \u2192 0.3 migration was not a good time). LangChain earns its overhead for complex agent loops or multi-step RAG systems. 
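<\/p>\n<p>For comparison, the direct-SDK version of a linear stage stays small. This is a sketch rather than a drop-in implementation; the stage names and prompts are illustrative:<\/p>\n<pre><code class=\"language-python\">from openai import OpenAI\n\nclient = OpenAI()\n\ndef run_stage(system: str, user: str, model: str = &quot;gpt-4o-mini&quot;) -&gt; str:\n    # One pipeline step: no chains, no callbacks, just a request.\n    response = client.chat.completions.create(\n        model=model,\n        messages=[\n            {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: system},\n            {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: user},\n        ],\n        timeout=30.0,\n    )\n    return response.choices[0].message.content\n\ndef process_document(doc: str) -&gt; dict:\n    # Two sequential stages; in real use, wrap run_stage with the retry\n    # and validation logic from the earlier sections.\n    summary = run_stage(&quot;Summarize the document in two sentences.&quot;, doc)\n    label = run_stage(&quot;Classify the document. Reply with one word.&quot;, doc)\n    return {&quot;summary&quot;: summary, &quot;label&quot;: label}\n<\/code><\/pre>\n<p>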
For a focused production pipeline, it&#8217;s complexity you don&#8217;t need.<\/p>\n<p><strong>Separate retryable and non-retryable errors from day one.<\/strong> Log retry counts, log final outcomes, log the specific failure reason. You will thank yourself when debugging at 11pm.<\/p>\n<p><strong>Set up per-request cost tracking immediately.<\/strong> Add a daily spend alert in the OpenAI dashboard \u2014 mine is at $50\/day. It has triggered twice. Both times it was a bug, not legitimate load.<\/p>\n<p><strong>Canary new prompts and model changes.<\/strong> Route 10% of traffic to the new version, watch the metrics for a few hours, then expand. Newer model does not automatically mean better results for your specific task. I&#8217;ve rolled back twice when a gpt-4o update produced noisier classification output than whatever was running before.<\/p>\n<p><strong>Validate outputs with Pydantic.<\/strong> Not because models are bad at following instructions \u2014 they&#8217;re actually quite good \u2014 but because &#8220;quite good&#8221; isn&#8217;t good enough when a 1% failure rate at 15,000 requests per month means 150 corrupted records. That&#8217;s not acceptable.<\/p>\n<p>I&#8217;m not 100% sure everything here scales to 10x the volume \u2014 some of this, especially around queueing and concurrency management, would need rethinking at that level. But if you&#8217;re running a serious internal AI pipeline and starting to feel like the wheels are loosening, this is the checklist I wish someone had handed me six months ago. The 2am alerts are preventable. 
Most of them, anyway.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last November, my phone lit up at 2am with a Slack alert. Lessons from running 10K+ generations through a production AI pipeline: retry design, cost control, output validation, and observability.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[1],"tags":[],"class_list":["post-31","post","type-post","status-publish","format-standard","hentry","category-general"],"_links":{"self":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts\/31","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/comments?post=31"}],"version-history":[{"count":17,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts\/31\/revisions"}],"predecessor-version":[{"id":491,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts\/31\/revisions\/491"}],"wp:attachment":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/media?parent=31"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/categories?post=31"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/tags?post=31"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}