At 2:14am on a Tuesday, my phone buzzed. The order service was timing out. Which was timing out the inventory service. Which was failing to respond to the notification service. Which caused the analytics pipeline to back up and start throwing 503s. Six microservices, six separate REST calls in a chain, and the entire thing had collapsed like a card tower because one database query ran long.
I fixed it by 3am, pushed a hotfix, and then sat there thinking: there has to be a better way to structure this.
That was about 18 months ago. I’ve since migrated two projects — a small SaaS I co-own and a client engagement for a 12-person startup — to event-driven architectures built on top of Redpanda and Kafka respectively. Here’s what I actually learned, including the embarrassing parts.
The Core Problem With REST-Coupled Microservices
Look, REST makes sense for request-response. Client needs data, server provides it. Clean. Simple. But when you’re building microservices that need to react to things that happened, REST starts to fight you.
The pattern I kept falling into was something like: Service A does a thing, then synchronously calls Service B, which calls Service C, which maybe calls Service D. You end up with temporal coupling — every service in the chain needs to be healthy right now for any individual operation to succeed. And that 2am incident wasn’t a fluke. I counted seven similar cascades in production logs from the three months before I finally migrated.
Here is the thing: the services weren’t doing wrong things. The architecture was wrong for what we needed. Order placed? That’s an event. Something happened. The inventory service doesn’t need to be called synchronously — it needs to know that the order happened and update its state accordingly. Same with notifications, same with analytics.
Event streaming flips the dependency direction. The order service doesn’t know (or care) who’s listening. It publishes an order.created event to a topic and moves on. Inventory, notifications, analytics — all consumers, all independent, all replayable if something goes wrong.
The other thing that took me embarrassingly long to appreciate: replay. With REST, if your analytics service was down for two hours, those events are gone. With a stream, they’re retained. You bring the service back up, it consumes from its last committed offset, and catches up. I pushed this on a Friday afternoon the first time and spent about 20 minutes terrified watching the consumer lag spike — but it caught up in under four minutes on roughly 40k missed events.
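To make the replay idea concrete: a recovered consumer picks up from its last committed offset on its own, but you can also rewind a group manually. A hedged sketch using the stock Kafka CLI, which also works against Redpanda since it speaks the Kafka protocol (the group and topic names here are made up for illustration):

```shell
# Rewind the analytics consumer group to the beginning of the topic
# to replay history. --execute actually applies the change;
# omit it to dry-run first.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group analytics-service --topic order.created \
  --reset-offsets --to-earliest --execute

# Watch the group catch up: the LAG column shows how far behind it is.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group analytics-service --describe
```

The dry-run-by-default behavior of --reset-offsets is worth knowing: it prints the offsets it would set without committing them, which is exactly what you want before replaying 40k events into a live consumer.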
Choosing a Broker: Kafka, Redpanda, and NATS (My Honest Take)
I’ve used all three in production now. They’re not interchangeable despite what some blog posts suggest.
Kafka is the industry standard for a reason. If you’re running on a managed service like Confluent Cloud or AWS MSK, operationally it’s fine — you’re not managing ZooKeeper (or KRaft, which became the default with Kafka 3.3) yourself. The ecosystem is enormous. The Kafka Connect ecosystem alone saved me probably a week of work on one client project where we needed to sync events into Postgres. That said, for a team of two or three, even managed Kafka felt overweight for our simpler SaaS needs.
Redpanda is where I landed for the SaaS project, and honestly I’m glad I did. It’s Kafka-compatible — you use the same client libraries, same topic/partition/consumer group model — but it’s a single binary with no JVM, no ZooKeeper, and meaningfully lower p99 tail latencies. The development experience is better too: docker run -p 9092:9092 redpandadata/redpanda and you have a real broker running locally in about four seconds. Not a mock, not an in-memory fake — the actual thing.
NATS JetStream is the outlier. Lower operational overhead, simpler mental model, genuinely impressive throughput for its resource footprint. But the consumer semantics are different enough from Kafka that your engineers need to relearn some things, and the ecosystem is smaller. I used it for a side project and liked it. I probably wouldn’t reach for it on a team that already knows Kafka concepts.
One thing I noticed: if you’re already deep in Kubernetes and have a platform team, Kafka (managed) is the obvious choice. If you’re a small team self-hosting, Redpanda is worth serious consideration. I am not 100% sure Redpanda holds up at Confluent-scale workloads — I haven’t stress-tested it there — but for teams under 20 engineers, it’s been solid.
Schema Registry and the Mistake That Bit Me
Right, so — this is the part I wish someone had told me before I started.
When you move to event streaming, your event schema becomes a contract between services. With REST, you control the API and its versioning. With events, consumers are decoupled and may be running different versions at the same time. If you just publish raw JSON and one day add a required field, you will break older consumers silently. Ask me how I know.
I pushed a change to the order.created event on a Wednesday, adding a shipping_priority field that the fulfillment service expected to always be present. The notification service — which I had forgotten was also consuming that topic — started throwing null pointer exceptions about six hours later when someone placed an order. It wasn’t obvious, it wasn’t loud; it just quietly started failing deserialization.
The fix is a schema registry. Confluent ships one, Redpanda has one built in, and Apicurio is a good open source option if you want to run your own. You register your schemas (Avro and Protobuf are both well supported; I prefer Protobuf for the generated types), and producers/consumers validate against them at runtime. More importantly, you can enforce compatibility rules — BACKWARD compatibility means new schemas can read old messages, FULL compatibility means both directions — and the registry will reject breaking changes before they reach production.
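For reference, here is roughly what that event schema might look like in proto3. This is my reconstruction from the field names used in the producer code later in this post, not the actual schema file:

```protobuf
syntax = "proto3";

message OrderCreatedEvent {
  string order_id     = 1;
  string customer_id  = 2;
  int64  total_cents  = 3;

  // Added after the initial release. proto3 scalar fields are implicitly
  // optional and default to zero, so old messages still deserialize and
  // old consumers simply ignore the new field number. BACKWARD compatible.
  int32  shipping_priority = 4;
}
```

The key property: adding a field with a new field number is a backward-compatible change in Protobuf, while making a field effectively required in application code (as in the shipping_priority incident above) is a contract change the registry alone cannot catch.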
Here’s a simplified producer setup showing schema validation in Python using Confluent’s client:
```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.protobuf import ProtobufSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

from events_pb2 import OrderCreatedEvent  # generated from .proto

schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})
serializer = ProtobufSerializer(
    OrderCreatedEvent,
    schema_registry,
    {"use.deprecated.format": False},
)

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_order_created(order_id: str, customer_id: str, total_cents: int):
    event = OrderCreatedEvent(
        order_id=order_id,
        customer_id=customer_id,
        total_cents=total_cents,
        # shipping_priority added later — default value handles old consumers
    )
    producer.produce(
        topic="order.created",
        value=serializer(
            event, SerializationContext("order.created", MessageField.VALUE)
        ),
        key=order_id,
    )
    producer.flush()
```
The consumer side mirrors this with a ProtobufDeserializer. The key point: if you try to register a schema that breaks BACKWARD compatibility, the registry rejects it. You catch the problem at deploy time, not six hours after the fact in production.
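A sketch of that consumer side, mirroring the producer above. The group name and commit strategy are my choices for illustration, not prescriptions:

```python
from confluent_kafka import Consumer
from confluent_kafka.schema_registry.protobuf import ProtobufDeserializer
from confluent_kafka.serialization import SerializationContext, MessageField

from events_pb2 import OrderCreatedEvent  # generated from .proto

deserializer = ProtobufDeserializer(
    OrderCreatedEvent, {"use.deprecated.format": False}
)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "inventory-service",   # hypothetical group name
    "auto.offset.reset": "earliest",   # on first run, start from the beginning
    "enable.auto.commit": False,       # commit only after the work succeeds
})
consumer.subscribe(["order.created"])

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        continue  # real code would log and alert here
    event = deserializer(
        msg.value(), SerializationContext(msg.topic(), MessageField.VALUE)
    )
    # ...update inventory state from event...
    consumer.commit(msg)  # advance the offset only once processing is done
```

Disabling auto-commit and committing after processing is what makes the replay story work: if the service crashes mid-message, the offset was never advanced, so the message is redelivered on restart — which is also why idempotency matters, as discussed below.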
Choreography vs Orchestration — And When I Got It Completely Wrong
I thought choreography was always better. Services react to events independently, no central coordinator, maximum decoupling — what’s not to like?
Choreography works beautifully for simple flows. order.created fires, inventory decrements, notification sends, analytics records. Each service does its thing. No one’s in charge. Very elegant.
But I tried to apply it to a checkout flow that had conditional branches: if payment failed, roll back inventory reservation; if the customer was on a trial plan, skip certain fulfillment steps; if the address flagged a fraud check, pause the whole thing pending review. Four services, three conditional branches, two possible rollback paths.
Six weeks in, I had an event topology I could no longer reason about. Which service emitted which events, under which conditions, with what side effects — it was all implicit, scattered across codebases, and nearly impossible to trace when something went wrong. One of my teammates described it as “the distributed state machine that hides in the dark.”
For complex, stateful workflows with branching logic and rollback requirements, orchestration is better. Tools like Temporal.io (which I’ve been using for about eight months now) or Apache Flink for stream processing give you a place where the business logic lives — one service that explicitly manages the workflow state. The other services are still event-driven at the edges; they still emit and consume events. But there’s a coordinator that knows where in the process you are.
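To show what "a place where the business logic lives" means in practice, here is a heavily simplified sketch of that checkout flow as a Temporal workflow using the temporalio Python SDK. The activity names (reserve_inventory, charge_payment, release_inventory) are hypothetical; each would be defined elsewhere with @activity.defn:

```python
from datetime import timedelta

from temporalio import workflow
from temporalio.exceptions import ActivityError

@workflow.defn
class CheckoutWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        # Step 1: reserve stock before taking payment.
        reservation = await workflow.execute_activity(
            "reserve_inventory", order_id,
            start_to_close_timeout=timedelta(seconds=30),
        )
        try:
            # Step 2: charge the customer.
            await workflow.execute_activity(
                "charge_payment", order_id,
                start_to_close_timeout=timedelta(seconds=30),
            )
        except ActivityError:
            # Explicit rollback path. With choreography, this compensation
            # logic lived implicitly across three services.
            await workflow.execute_activity(
                "release_inventory", reservation,
                start_to_close_timeout=timedelta(seconds=30),
            )
            return "payment_failed"
        return "completed"
```

The conditional branches and rollback paths that were scattered across the event topology become one readable function, and Temporal persists the workflow state so a worker crash mid-checkout resumes where it left off.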
I’m not saying choreography is wrong. I applied it past the point where it was the right tool — that’s on me. My heuristic now: if you need to visualize the workflow on a whiteboard with more than two decision branches, you want orchestration for that flow.
What Actually Changed After the Migration
The 2am pages stopped. That’s the simplest way to put it. Not entirely — distributed systems are still distributed systems — but the cascading timeouts that characterized our worst incidents went away because the dependency chains are gone. The order service can run and publish events even if the notification service is deploying a new version.
Consumer lag monitoring became the new on-call skill. Where before I watched error rates and latency on individual endpoints, now I also watch whether consumers are keeping up with their topics. I set that up with Grafana and Prometheus scraping Kafka/Redpanda metrics for consumer_lag_sum, and that dashboard has been more predictive than anything I had before.
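For reference, a hypothetical Prometheus alerting rule along these lines. The exact metric name depends on your exporter (kafka_consumergroup_lag is what kafka-exporter emits; Redpanda exposes its own metric names), and the threshold is something you tune to your traffic:

```yaml
groups:
  - name: streaming
    rules:
      - alert: ConsumerLagGrowing
        # Total lag per consumer group across all partitions.
        expr: sum by (consumergroup) (kafka_consumergroup_lag) > 10000
        for: 10m   # sustained lag, not a momentary spike during a deploy
        labels:
          severity: page
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} is falling behind"
```

The for: 10m clause matters: lag spikes briefly every time a consumer restarts or replays, and you only want to page when it keeps growing.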
Debugging changed character. The good news: every event is persisted on the topic, so you can replay and inspect exactly what happened. I’ve used this multiple times to reconstruct bugs that would have been untraceable with ephemeral REST calls. The challenging news: distributed tracing across async boundaries is harder to wire up properly. I’m using OpenTelemetry with trace context propagated in event headers, which works, but it took a weekend to get right and the documentation is still a bit fragmented.
Idempotency is not optional. Consumers can receive the same event more than once (at-least-once delivery is the default in Kafka and Redpanda). Every consumer I wrote now checks for duplicate processing using the event ID. Early on I missed this in the inventory service and briefly had orders double-decrementing stock. Thankfully this was caught in staging.
What I Would Actually Recommend
If you’re running microservices that communicate primarily through synchronous REST chains — especially if you’ve seen cascade failures or you’re bolting on an increasing number of webhooks and polling jobs — event streaming is worth the migration cost. The learning curve is real and so is the operational overhead, but so is the alternative: 2am pages that don’t have to happen.
For most small-to-mid-sized teams, I’d start with Redpanda (self-hosted) or MSK (if you’re AWS-native and want managed). Set up a schema registry from day one — not later, not “when we need it.” Use Protobuf or Avro for your schemas, not JSON. Pick BACKWARD compatibility as your default and be explicit about breaking changes.
Start by identifying one flow in your system that has multiple downstream dependents and convert just that flow to events. Don’t try to migrate everything at once. The first event topic will teach you more about your system’s actual shape than any architecture diagram ever did.
For simple reactive flows, choreography is fine. Once you’ve got conditional branches, rollbacks, or anything that needs to be visualized like a flowchart — reach for Temporal or an equivalent. Getting that distinction right early saves the six weeks of untangling I had to do.
And monitor consumer lag like you monitor error rates. That number will tell you when something’s wrong before your users do.