{"id":161,"date":"2026-03-09T00:43:21","date_gmt":"2026-03-09T00:43:21","guid":{"rendered":"https:\/\/blog.rebalai.com\/en\/2026\/03\/09\/event-driven-architecture-in-2026-why-microservice\/"},"modified":"2026-03-18T22:00:07","modified_gmt":"2026-03-18T22:00:07","slug":"event-driven-architecture-in-2026-why-microservice","status":"publish","type":"post","link":"https:\/\/blog.rebalai.com\/en\/2026\/03\/09\/event-driven-architecture-in-2026-why-microservice\/","title":{"rendered":"Event-Driven Architecture in 2026: Why My Microservices Finally Stopped Talking Back"},"content":{"rendered":"<p>At 2:14am on a Tuesday, my phone buzzed. The order service was timing out. Which was timing out the inventory service. Which was failing to respond to the notification service. Which caused the analytics pipeline to back up and start throwing 503s. Six <a href=\"https:\/\/m.do.co\/c\/06956e5e2802\" title=\"Deploy Microservices on DigitalOcean\" rel=\"nofollow sponsored\" target=\"_blank\">microservices<\/a>, six separate REST calls in a chain, and the entire thing had collapsed like a card tower because one database query ran long.<\/p>\n<p>I fixed it by 3am, pushed a hotfix, and then sat there thinking: there has to be a better way to structure this.<\/p>\n<p>That was about 18 months ago. I&#8217;ve since migrated two projects \u2014 a small SaaS I co-own and a client engagement for a 12-person startup \u2014 to event-driven architectures built on top of Redpanda and Kafka respectively. Here&#8217;s <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/05\/rag-vs-fine-tuning-when-to-use-each-technique-for\/\" title=\"What I Actually Learned\">what I actually learned<\/a>, including the embarrassing parts.<\/p>\n<h2>The Core Problem With REST-Coupled Microservices<\/h2>\n<p>Look, REST makes <a href=\"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/edge-computing-in-2026-why-developers-are-adopting\/\" title=\"Sense for\">sense for<\/a> request-response. 
Client needs data, server provides it. Clean. Simple. But when you&#8217;re building <a href=\"https:\/\/m.do.co\/c\/06956e5e2802\" title=\"Deploy Microservices on DigitalOcean\" rel=\"nofollow sponsored\" target=\"_blank\">microservices<\/a> that need to <em>react to things that happened<\/em>, REST starts to fight you.<\/p>\n<p>The pattern I kept falling into was something like: Service A does a thing, then synchronously calls Service B, which calls Service C, which maybe calls Service D. You end up with temporal coupling \u2014 every service in the chain needs to be healthy <em>right now<\/em> for any individual operation to succeed. And that 2am incident wasn&#8217;t a fluke. I counted seven similar cascades in <a href=\"https:\/\/m.do.co\/c\/06956e5e2802\" title=\"DigitalOcean for Production Workloads\" rel=\"nofollow sponsored\" target=\"_blank\">production<\/a> logs from the three months before I finally migrated.<\/p>\n<p>Here&#8217;s the thing: the services weren&#8217;t doing the wrong things. The <em>architecture<\/em> was wrong for what we needed. Order placed? That&#8217;s an event. Something happened. The inventory service doesn&#8217;t need to be called synchronously \u2014 it needs to <em>know<\/em> that the order happened and update its state accordingly. Same with notifications, same with analytics.<\/p>\n<p>Event streaming flips the dependency direction. The order service doesn&#8217;t know (or care) who&#8217;s listening. It publishes an <code>order.created<\/code> event to a topic and moves on. 
Inventory, notifications, analytics \u2014 all consumers, all independent, all replayable if something goes wrong.<\/p>\n<p>The other thing that took me embarrassingly long to appreciate: replay. With REST, if your analytics service was down for two hours, those events are gone. With a stream, they&#8217;re retained. You bring the service back up, it consumes from its last committed offset, and catches up. I pushed this on a Friday afternoon the first time and spent about 20 minutes terrified watching the consumer lag spike \u2014 but it caught up in under four minutes on roughly 40k missed events.<\/p>\n<h2>Choosing a Broker: Kafka, Redpanda, and NATS (My Honest Take)<\/h2>\n<p>I&#8217;ve used all three in <a href=\"https:\/\/m.do.co\/c\/06956e5e2802\" title=\"DigitalOcean for Production Workloads\" rel=\"nofollow sponsored\" target=\"_blank\">production<\/a> now. They&#8217;re not interchangeable despite what some blog posts suggest.<\/p>\n<p><strong>Kafka<\/strong> is the industry standard for a reason. If you&#8217;re running on a managed service like Confluent Cloud or <a href=\"https:\/\/aws.amazon.com\/?tag=synsun0f-20\" title=\"Amazon Web Services (AWS) Cloud Platform\" rel=\"nofollow sponsored\" target=\"_blank\">AWS<\/a> MSK, operationally it&#8217;s fine \u2014 you&#8217;re not managing ZooKeeper (or KRaft, which became production-ready in Kafka 3.3) yourself. The ecosystem is enormous. Kafka Connect alone saved me probably a week of work on one client project where we needed to sync events into Postgres. 
That said, for a team of two or three, even managed Kafka felt overweight for our simpler SaaS needs.<\/p>\n<p><strong>Redpanda<\/strong> is where I landed for the SaaS project, and honestly I&#8217;m glad I did. It&#8217;s Kafka-compatible \u2014 you use the same client libraries, same topic\/partition\/consumer group model \u2014 but it&#8217;s a single binary with no JVM, no ZooKeeper, and meaningfully lower p99 tail latencies. The development experience is better too: <code>docker run -p 9092:9092 redpandadata\/redpanda<\/code> and you have a real broker running locally in about four seconds. Not a mock, not an in-memory fake \u2014 the actual thing.<\/p>\n<p><strong>NATS JetStream<\/strong> is the outlier. Lower operational overhead, simpler mental model, genuinely impressive throughput for its resource footprint. But the consumer semantics are different enough from Kafka that your engineers need to relearn some things, and the ecosystem is smaller. I used it for a side project and liked it. I probably wouldn&#8217;t reach for it on a team that already knows Kafka concepts.<\/p>\n<p>One thing I noticed: if you&#8217;re already deep in <a href=\"https:\/\/m.do.co\/c\/06956e5e2802\" title=\"Run Kubernetes on DigitalOcean\" rel=\"nofollow sponsored\" target=\"_blank\">Kubernetes<\/a> and have a platform team, Kafka (managed) is the obvious choice. If you&#8217;re a small team self-hosting, Redpanda is worth serious consideration. 
I&#8217;m not 100% sure Redpanda holds up at Confluent-scale workloads \u2014 I haven&#8217;t stress-tested it there \u2014 but for teams under 20 engineers, it&#8217;s been solid.<\/p>\n<h2>Schema Registry and the Mistake That Bit Me<\/h2>\n<p>Right, so \u2014 this is the part I wish someone had told me before I started.<\/p>\n<p>When you move to event streaming, your event schema becomes a contract between services. With REST, you control the API and its versioning. With events, consumers are decoupled and may be running different versions at the same time. If you just publish raw JSON and one day add a required field, you will break older consumers silently. Ask me how I know.<\/p>\n<p>I pushed a change to the <code>order.created<\/code> event on a Wednesday, adding a <code>shipping_priority<\/code> field that the fulfillment service expected to always be present. The notification service \u2014 which I had <em>forgotten<\/em> was also consuming that topic \u2014 started throwing null pointer exceptions about six hours later when someone placed an order. It wasn&#8217;t obvious, it wasn&#8217;t loud, it just quietly started failing deserialization.<\/p>\n<p>The fix is a schema registry. Confluent ships one, Redpanda has one built in, and Apicurio is a good open source option if you want to run your own. You register your schemas (Avro and Protobuf are both well supported; I prefer Protobuf for the generated types), and producers\/consumers validate against them at runtime. 
More importantly, you can enforce compatibility rules \u2014 <code>BACKWARD<\/code> compatibility means new schemas can read old messages, <code>FULL<\/code> compatibility means both directions \u2014 and the registry will reject breaking changes before they reach <a href=\"https:\/\/m.do.co\/c\/06956e5e2802\" title=\"DigitalOcean for Production Workloads\" rel=\"nofollow sponsored\" target=\"_blank\">production<\/a>.<\/p>\n<p>Here&#8217;s a simplified producer setup showing schema validation in <a href=\"https:\/\/www.amazon.com\/s?k=python+programming+book&#038;tag=synsun0f-20\" title=\"Best Python Books on Amazon\" rel=\"nofollow sponsored\" target=\"_blank\">Python<\/a> using Confluent&#8217;s client:<\/p>\n<pre><code class=\"language-python\">from confluent_kafka import Producer\nfrom confluent_kafka.schema_registry import SchemaRegistryClient\nfrom confluent_kafka.schema_registry.protobuf import ProtobufSerializer\nfrom confluent_kafka.serialization import SerializationContext, MessageField\n\nfrom events_pb2 import OrderCreatedEvent  # generated from .proto\n\nschema_registry = SchemaRegistryClient({&quot;url&quot;: &quot;http:\/\/localhost:8081&quot;})\nserializer = ProtobufSerializer(\n    OrderCreatedEvent,\n    schema_registry,\n    {&quot;use.deprecated.format&quot;: False}\n)\n\nproducer = Producer({&quot;bootstrap.servers&quot;: &quot;localhost:9092&quot;})\n\ndef publish_order_created(order_id: str, customer_id: str, total_cents: int):\n    event = OrderCreatedEvent(\n        order_id=order_id,\n        customer_id=customer_id,\n        total_cents=total_cents,\n        # shipping_priority added later \u2014 default value handles old consumers\n    )\n    producer.produce(\n        topic=&quot;order.created&quot;,\n        value=serializer(event, SerializationContext(&quot;order.created&quot;, MessageField.VALUE)),\n        key=order_id,\n    )\n    producer.flush()\n<\/code><\/pre>\n<p>The consumer side mirrors this with a 
<code>ProtobufDeserializer<\/code>. The key point: if you try to register a schema that breaks <code>BACKWARD<\/code> compatibility, the registry rejects it. You catch the problem at <a href=\"https:\/\/m.do.co\/c\/06956e5e2802\" title=\"Deploy on DigitalOcean Cloud\" rel=\"nofollow sponsored\" target=\"_blank\">deploy<\/a> time, not six hours after the fact in <a href=\"https:\/\/m.do.co\/c\/06956e5e2802\" title=\"DigitalOcean for Production Workloads\" rel=\"nofollow sponsored\" target=\"_blank\">production<\/a>.<\/p>\n<h2>Choreography vs Orchestration \u2014 And When I Got It Completely Wrong<\/h2>\n<p>I thought choreography was always better. Services react to events independently, no central coordinator, maximum decoupling \u2014 what&#8217;s not to like?<\/p>\n<p>Choreography works beautifully for simple flows. <code>order.created<\/code> fires, inventory decrements, notification sends, analytics records. Each service does its thing. No one&#8217;s in charge. Very elegant.<\/p>\n<p>But I tried to apply it to a checkout flow that had conditional branches: if payment failed, roll back inventory reservation; if the customer was on a trial plan, skip certain fulfillment steps; if the address flagged a fraud check, pause the whole thing pending review. Four services, three conditional branches, two possible rollback paths.<\/p>\n<p>Six weeks in, I had an event topology I could no longer reason about. Which service emitted which events, under which conditions, with what side effects \u2014 it was all implicit, scattered across codebases, and nearly impossible to trace when something went wrong. 
One of my teammates described it as &#8220;the distributed state machine that hides in the dark.&#8221;<\/p>\n<p>For complex, stateful workflows with branching logic and rollback requirements, orchestration is better. Tools like Temporal.io (which I&#8217;ve been using for about eight months now) or Apache Flink for stream processing give you a place where the business logic <em>lives<\/em> \u2014 one service that explicitly manages the workflow state. The other services are still event-driven at the edges; they still emit and consume events. But there&#8217;s a coordinator that knows where in the process you are.<\/p>\n<p>I&#8217;m not saying choreography is wrong. I applied it past the point where it was the right tool \u2014 that&#8217;s on me. My heuristic now: if you need to visualize the workflow on a whiteboard with more than two decision branches, you want orchestration for that flow.<\/p>\n<h2>What Actually Changed After the Migration<\/h2>\n<p>The 2am pages stopped. That&#8217;s the simplest way to put it. 
Not entirely \u2014 distributed systems are still distributed systems \u2014 but the cascading timeouts that characterized our worst incidents went away because the dependency chains are gone. The order service can run and publish events even if the notification service is deploying a new version.<\/p>\n<p>Consumer lag monitoring became the new on-call skill. Where before I watched error rates and latency on individual endpoints, now I also watch whether consumers are keeping up with their topics. Grafana + Prometheus scraping Kafka\/Redpanda metrics for <code>consumer_lag_sum<\/code> is how I set that up, and the dashboard has been more predictive than anything I had before.<\/p>\n<p>Debugging changed character. The <em>good<\/em> news: every event is persisted on the topic, so you can replay and inspect exactly what happened. I&#8217;ve used this multiple times to reconstruct bugs that would have been untraceable with ephemeral REST calls. The <em>challenging<\/em> news: distributed tracing across async boundaries is harder to wire up properly. I&#8217;m using OpenTelemetry with trace context propagated in event headers, which works, but it took a weekend to get right and the documentation is still a bit fragmented.<\/p>\n<p>Idempotency is not optional. Consumers can receive the same event more than once (at-least-once delivery is the default in Kafka and Redpanda). Every consumer I wrote now checks for duplicate processing using the event ID. Early on I missed this in the inventory service and briefly had orders double-decrementing stock. 
Thankfully this was caught in staging.<\/p>\n<h2>What I Would Actually Recommend<\/h2>\n<p>If you&#8217;re running <a href=\"https:\/\/m.do.co\/c\/06956e5e2802\" title=\"Deploy Microservices on DigitalOcean\" rel=\"nofollow sponsored\" target=\"_blank\">microservices<\/a> that communicate primarily through synchronous REST chains \u2014 especially if you&#8217;ve seen cascade failures or you&#8217;re bolting on an increasing number of webhooks and polling jobs \u2014 event streaming is worth the migration cost. The learning curve is real and so is the operational overhead, but so is the alternative: 2am pages that don&#8217;t have to happen.<\/p>\n<p>For most small-to-mid-sized teams, I&#8217;d start with Redpanda (self-hosted) or MSK (if you&#8217;re <a href=\"https:\/\/aws.amazon.com\/?tag=synsun0f-20\" title=\"Amazon Web Services (AWS) Cloud Platform\" rel=\"nofollow sponsored\" target=\"_blank\">AWS<\/a>-native and want managed). Set up a schema registry from day one \u2014 not later, not &#8220;when we need it.&#8221; Use Protobuf or Avro for your schemas, not JSON. Pick <code>BACKWARD<\/code> compatibility as your default and be explicit about breaking changes.<\/p>\n<p>Start by identifying one flow in your system that has multiple downstream dependents and convert just that flow to events. Don&#8217;t try to migrate everything at once. The first event topic will teach you more about your system&#8217;s actual shape than any architecture diagram ever did.<\/p>\n<p>For simple reactive flows, choreography is fine. Once you&#8217;ve got conditional branches, rollbacks, or anything that needs to be visualized like a flowchart \u2014 reach for Temporal or an equivalent. Getting that distinction right early saves the six weeks of untangling I had to do.<\/p>\n<p>And monitor consumer lag like you monitor error rates. 
That number will tell you when something&#8217;s wrong before your users do.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>At 2:14am on a Tuesday, my phone buzzed. The order service was timing out. Which was timing out the inventory service.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[1],"tags":[],"class_list":["post-161","post","type-post","status-publish","format-standard","hentry","category-general"],"_links":{"self":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts\/161","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/comments?post=161"}],"version-history":[{"count":20,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts\/161\/revisions"}],"predecessor-version":[{"id":510,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts\/161\/revisions\/510"}],"wp:attachment":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/media?parent=161"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/categories?post=161"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/tags?post=161"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}