This post is part of a series on NServiceBus. In the previous articles, we looked at fundamental concepts for messaging and how to get started with NServiceBus. All the code for this series can be found in the NServiceBusTutorial repository on GitHub.
What is recoverability? 🔄
Let’s be honest, in the world of distributed systems, stuff breaks. Networks hiccup, databases time out, and that AI agent you’ve been using seems to hallucinate more each day 🤖. Recoverability is a concept in NServiceBus that provides a built-in strategy for keeping your system resilient when message processing fails.
At a high level, when a message handler throws an exception, NServiceBus will:
- Try again immediately a few times (Immediate retries ⚡)
- If it still fails, try again later with a delay (Delayed retries ⏱️)
- If it still can’t succeed, move the message to a dedicated error queue (so the issue can be fixed and retried using ServicePulse)
These are pretty standard features for messaging-based systems. However, configuring these policies with low-level messaging APIs can be a pain. NServiceBus simplifies this with a fluent configuration API that makes it easy to set up and customize your recoverability strategies.
Pro tip: Your handlers should be idempotent so retries don’t accidentally perform duplicate operations like over-charging a customer or sending an email twice.
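To make the escalation order above concrete, here is a tiny stand-alone model of it. This is a simplified sketch for intuition only, not NServiceBus's actual implementation; `RetryModel`, `Outcome`, and `CountAttempts` are names made up for this example, and it assumes each delayed retry gets a fresh round of immediate retries.

```csharp
using System;

// A handler that always fails, with 2 immediate retries and 1 delayed retry.
Console.WriteLine(RetryModel.CountAttempts(maxImmediate: 2, maxDelayed: 1)); // prints 6

// Hypothetical types, for illustration only.
enum Outcome { ImmediateRetry, DelayedRetry, MoveToError }

static class RetryModel
{
    // Decide what happens after a failure, given what has been tried so far.
    public static Outcome Decide(int immediateFailures, int delayedDeliveries, int maxImmediate, int maxDelayed)
    {
        if (immediateFailures <= maxImmediate) return Outcome.ImmediateRetry;
        if (delayedDeliveries < maxDelayed) return Outcome.DelayedRetry;
        return Outcome.MoveToError;
    }

    // Count how many times a permanently failing handler runs before the error queue.
    public static int CountAttempts(int maxImmediate, int maxDelayed)
    {
        int attempts = 0, immediateFailures = 0, delayedDeliveries = 0;
        while (true)
        {
            attempts++;          // the handler runs and throws
            immediateFailures++;
            switch (Decide(immediateFailures, delayedDeliveries, maxImmediate, maxDelayed))
            {
                case Outcome.ImmediateRetry:
                    continue;
                case Outcome.DelayedRetry:
                    delayedDeliveries++;
                    immediateFailures = 0; // a delayed retry starts a new immediate round
                    continue;
                default:
                    return attempts;
            }
        }
    }
}
```

Under these assumptions the number of attempts works out to (immediate retries + 1) × (delayed retries + 1), which is worth keeping in mind when a handler has side effects.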
Configuring recoverability ⚙️
Each `EndpointConfiguration` exposes a `Recoverability()` method that you can use to tune retries to your system's needs. In the code below, we configure immediate and delayed retries through `Recoverability()` and route failed messages to a dedicated error queue.
var endpointConfiguration = new EndpointConfiguration(endpointName);
endpointConfiguration.Recoverability()
    .Immediate(i => i.NumberOfRetries(2))
    .Delayed(d => d.NumberOfRetries(1));
endpointConfiguration.SendFailedMessagesTo("error");
// Optional but highly recommended for audits/traceability
endpointConfiguration.AuditProcessedMessagesTo("audit");
The defaults for these values are sensible for most systems, but it’s worth tuning them based on the failure modes and SLAs of your dependencies. For example, some messages should not be retried at all and should move to the error queue immediately. In other situations, delayed retries should use a backoff strategy. You can read more about configuring recoverability.
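Both of the cases just mentioned can be expressed with the same fluent API. The sketch below is one way it might look; `InvalidOrderException` and `RecoverabilitySetup` are hypothetical names invented for this example, and the retry counts and 30-second increment are arbitrary values, not recommendations.

```csharp
using System;
using NServiceBus;

// Hypothetical exception type, defined here only for illustration.
public class InvalidOrderException : Exception
{
    public InvalidOrderException(string message) : base(message) { }
}

// Hypothetical helper that applies recoverability settings to an endpoint.
public static class RecoverabilitySetup
{
    public static void Apply(EndpointConfiguration endpointConfiguration)
    {
        var recoverability = endpointConfiguration.Recoverability();

        // Messages failing with this exception skip all retries
        // and go straight to the error queue.
        recoverability.AddUnrecoverableException<InvalidOrderException>();

        recoverability.Immediate(immediate => immediate.NumberOfRetries(2));

        // Each delayed retry waits one increment longer than the last
        // (roughly 30s, then 60s, then 90s with these assumed values).
        recoverability.Delayed(delayed =>
        {
            delayed.NumberOfRetries(3);
            delayed.TimeIncrease(TimeSpan.FromSeconds(30));
        });
    }
}
```

Calling `RecoverabilitySetup.Apply(endpointConfiguration)` before starting the endpoint would apply these settings.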
Considerations for when to throw exceptions 🤔
- Throw exceptions for transient failures you want retried (timeouts, deadlocks, 5xxs). 🔄
- Don’t throw for expected business rejections (e.g., “Payment declined”). Consider deliberately rerouting failure scenarios via custom domain exceptions and policies. 📬
- If a message is truly poisonous (corrupt payload), consider quarantining it to a special error queue and alerting. 🧪🚫
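For the poison-message case, a custom recoverability policy is one option. Here is a minimal sketch, assuming a hypothetical `CorruptPayloadException` thrown by your own code and a hypothetical `error.poison` queue that already exists on your transport; everything else falls back to the default policy.

```csharp
using System;
using NServiceBus;
using NServiceBus.Transport;

// Hypothetical exception your handler might throw for a corrupt payload.
public class CorruptPayloadException : Exception
{
    public CorruptPayloadException(string message) : base(message) { }
}

public static class PoisonMessagePolicy
{
    // Assumed queue name; it must exist on your transport.
    private const string PoisonQueue = "error.poison";

    public static RecoverabilityAction Invoke(RecoverabilityConfig config, ErrorContext context)
    {
        // Retrying a corrupt payload will never succeed, so quarantine it right away.
        if (context.Exception is CorruptPayloadException)
        {
            return RecoverabilityAction.MoveToError(PoisonQueue);
        }

        // Everything else keeps the configured immediate/delayed retry behavior.
        return DefaultRecoverabilityPolicy.Invoke(config, context);
    }
}
```

The policy would be registered with `endpointConfiguration.Recoverability().CustomPolicy(PoisonMessagePolicy.Invoke);`. Alerting on the quarantine queue is left to whatever monitoring you already use.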
It’s important to remember: throwing triggers recoverability; not throwing counts as successful processing of a message.
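Here is a rough sketch of what that distinction can look like inside a handler. The message types, `IPaymentGateway`, and `PaymentResult` are all hypothetical and exist only for this example.

```csharp
using System.Threading.Tasks;
using NServiceBus;

// Hypothetical message and gateway types, for illustration only.
public class ChargeCustomer : ICommand
{
    public string OrderId { get; set; } = "";
    public decimal Amount { get; set; }
}

public class PaymentDeclined : IEvent
{
    public string OrderId { get; set; } = "";
    public string Reason { get; set; } = "";
}

public class PaymentResult
{
    public bool Approved { get; set; }
    public string DeclineReason { get; set; } = "";
}

public interface IPaymentGateway
{
    Task<PaymentResult> ChargeAsync(string orderId, decimal amount);
}

public class ChargeCustomerHandler : IHandleMessages<ChargeCustomer>
{
    private readonly IPaymentGateway gateway;

    public ChargeCustomerHandler(IPaymentGateway gateway) => this.gateway = gateway;

    public async Task Handle(ChargeCustomer message, IMessageHandlerContext context)
    {
        // Transient failures (timeouts, 5xx responses) are deliberately NOT caught here:
        // the exception leaves the handler and triggers recoverability.
        var result = await gateway.ChargeAsync(message.OrderId, message.Amount);

        if (!result.Approved)
        {
            // An expected business outcome: publish an event and return normally,
            // so the message counts as successfully processed and is not retried.
            await context.Publish(new PaymentDeclined
            {
                OrderId = message.OrderId,
                Reason = result.DeclineReason
            });
            return;
        }

        // Continue with the approved payment...
    }
}
```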
Common pitfalls and tips 🧠
- Avoid long immediate retries for downstream outages; prefer delayed retries to back off. ⏳
- Keep handler logic small and focused; shorter units are easier to retry safely. 🔍
- Test failure paths! Simulate exceptions while running locally as well as in your NServiceBus tests to validate your recoverability behavior. 🧪
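As a starting point for testing failure paths, here is a small sketch using the `NServiceBus.Testing` package with NUnit. The handler, message types, and the `TimingOutInventoryApi` fake are hypothetical and defined only for this example. It checks that a simulated timeout propagates out of the handler (so recoverability can take over) and that nothing was published along the way.

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;
using NServiceBus.Testing;
using NUnit.Framework;

// Hypothetical message, dependency, and handler under test.
public class SyncInventory : ICommand
{
    public string Sku { get; set; } = "";
}

public class InventorySynced : IEvent
{
    public string Sku { get; set; } = "";
}

public interface IInventoryApi
{
    Task PushAsync(string sku);
}

public class SyncInventoryHandler : IHandleMessages<SyncInventory>
{
    private readonly IInventoryApi api;

    public SyncInventoryHandler(IInventoryApi api) => this.api = api;

    public async Task Handle(SyncInventory message, IMessageHandlerContext context)
    {
        await api.PushAsync(message.Sku);
        await context.Publish(new InventorySynced { Sku = message.Sku });
    }
}

// Fake dependency that simulates a downstream outage.
public class TimingOutInventoryApi : IInventoryApi
{
    public Task PushAsync(string sku) =>
        Task.FromException(new TimeoutException("simulated outage"));
}

[TestFixture]
public class SyncInventoryHandlerTests
{
    [Test]
    public void Transient_failure_propagates_so_recoverability_can_retry()
    {
        var handler = new SyncInventoryHandler(new TimingOutInventoryApi());
        var context = new TestableMessageHandlerContext();

        // The exception must escape the handler; swallowing it would count as success.
        Assert.ThrowsAsync<TimeoutException>(
            () => handler.Handle(new SyncInventory { Sku = "ABC-1" }, context));

        // No event should have been published for the failed attempt.
        Assert.That(context.PublishedMessages, Is.Empty);
    }
}
```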
Next steps for error handling 👟
Alongside recoverability, we want to ensure that our message processing is robust and reliable. In upcoming blog posts, we will take a look at these patterns in more detail, but for now, here are some key design considerations and best practices:
- Unit of work: Ensure that all operations within a message handler are part of a single unit of work. This means either all operations succeed, or none do. Use database transactions or similar mechanisms to achieve this.
- Idempotency: Design your message handlers to be idempotent, meaning that processing the same message multiple times has no additional effect. This is crucial for safely handling retries.
- Outbox: Implement the Outbox pattern to ensure that messages are only sent after the main business operation has succeeded. This prevents message loss and ensures consistent state.
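To give a feel for idempotency ahead of those posts, here is a deliberately naive sketch. `ISentEmailLog`, `IEmailSender`, and the message type are hypothetical; a real implementation would back the log with durable storage.

```csharp
using System.Threading.Tasks;
using NServiceBus;

// Hypothetical message and dependencies, for illustration only.
public class SendWelcomeEmail : ICommand
{
    public string CustomerId { get; set; } = "";
}

public interface ISentEmailLog
{
    Task<bool> WasSentAsync(string customerId);
    Task MarkSentAsync(string customerId);
}

public interface IEmailSender
{
    Task SendWelcomeAsync(string customerId);
}

public class SendWelcomeEmailHandler : IHandleMessages<SendWelcomeEmail>
{
    private readonly ISentEmailLog log;
    private readonly IEmailSender sender;

    public SendWelcomeEmailHandler(ISentEmailLog log, IEmailSender sender)
    {
        this.log = log;
        this.sender = sender;
    }

    public async Task Handle(SendWelcomeEmail message, IMessageHandlerContext context)
    {
        // A retry or duplicate delivery of the same command becomes a no-op.
        if (await log.WasSentAsync(message.CustomerId))
        {
            return;
        }

        await sender.SendWelcomeAsync(message.CustomerId);

        // If this call fails, the retry will send the email again. Closing that gap
        // is what the unit of work and Outbox patterns are for.
        await log.MarkSentAsync(message.CustomerId);
    }
}
```

The check-then-act approach covers the common retry case but still leaves a small window between sending and recording, which is why idempotency is usually paired with the other two patterns in this list.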
NServiceBus intentionally promotes messaging best practices and the Particular Service Platform makes monitoring and operation of distributed systems easier for teams. If you’re interested in seeing the Particular Service Platform in action, consider checking out NimblePro’s webinar on the Particular Service Platform in the context of the eShopOnWeb reference application.