Kafka Consumer Error Handling and Retries - Part 1

One of the aspects I truly love about my job as a Software Developer is the constant opportunity to learn new things. Even after years of working with Kafka, I continue to encounter new concepts that deepen my understanding of its inner workings. Through this blog series, I aim to share my insights on Kafka error handling and retries. Join me on this exciting journey as we unravel the intricacies of Kafka's error handling mechanisms. This is a three-part series: in this first installment you will learn the fundamentals of error handling in Kafka along with basic error handling scenarios; the second and third parts will walk through these concepts with working examples.

In this article I will work with a fictitious online store; let’s call it Gir (fun fact: Gir is a forest, just like Amazon, known for its majestic lions). Gir is the most sustainable shop on the planet, selling only ethically sourced products handled in the most sustainable way possible.

Gir Architecture

The following describes the different components in Gir's architecture and how they interact with each other.


Components

  • eCommerce Engine is responsible for the order checkout experience for customers.
  • Order Management System (OMS) is responsible for managing the life cycle of the order after order submission.
  • Warehouse Management System (WMS) is responsible for shipping/fulfilling the ordered items to customers.
  • Payment Gateway is responsible for processing order payments and refunds.

Create Order Flow

The following describes what happens when an order is submitted.

  1. Customer places an order
  2. The eCommerce application (which exposes the checkout functionality) publishes a Create Order event to a Kafka topic named "digital_orders"
  3. Order Management System (OMS), one of the consumers of the "digital_orders" topic, stores the order information in its database (a minimal consumer sketch follows)
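To make the flow concrete, here is a minimal sketch of what the OMS consumer loop might look like using the plain Java Kafka client. The topic name "digital_orders" comes from the flow above; the OrderService class and its saveOrder method are hypothetical stand-ins for the OMS database logic, and the connection settings are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OmsOrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "oms");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // commit manually, only after successful processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("digital_orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Hypothetical OMS persistence step; any of the failures listed
                    // below (DB down, network error, bad payload) can surface here.
                    OrderService.saveOrder(record.value());
                }
                consumer.commitSync(); // offsets are committed only after processing succeeds
            }
        }
    }
}
```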

This might seem like a straightforward use case—what could possibly go wrong? Well, as it turns out, a lot more than you'd expect!

  • Database down/failure
  • Kafka broker down/failure
  • Network failures preventing the application from communicating with downstream systems
  • Authentication/authorization errors
  • Kafka offset commit failures
  • Kafka rebalancing
  • Serialization/deserialization errors because of a message format mismatch between producer and consumer
  • Duplicate messages
  • Messages processed in the wrong order

Categorization of Consumer Errors

Now let’s discuss these errors and how we can handle them. Some of these errors are recoverable: a DB outage or a network failure could be an intermittent issue, and we just have to retry and hope the dependent systems self-heal. Other errors are not recoverable, like serialization/deserialization errors or duplicate messages (unless your application is idempotent); no matter how many times you retry, they will continue to fail indefinitely.

Non-Recoverable Errors

Non-recoverable errors are errors that cannot be handled programmatically; this includes unknown errors or failures that require manual intervention. Serialization/deserialization errors are non-recoverable: if a message is invalid (e.g. a schema mismatch between producer and consumer), no matter how many times you retry, it will always fail. Authentication and authorization errors due to an expired password or revoked permissions are also non-recoverable. The only way to fix them is to reconfigure your application with a new password or grant the required permission to the service account.

Handling Non-Recoverable Errors

The following describes the handling of non-recoverable errors.

  1. Kafka Consumer message processing fails with a non-recoverable error (i.e. an error that cannot be resolved by retrying)
  2. If the message failure is tolerable (e.g. a duplicate order message causes OMS processing to fail because the order already exists in the OMS database), send the message to a DLQ topic.
  3. If the failure is not tolerable (a poison pill), stop processing messages and generate alerts (see the sketch below).
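Here is a sketch of steps 2 and 3, continuing the consumer loop from earlier. The DLQ topic name "digital_orders_dlq", the NonRecoverableException type with its isTolerable method, the dlqProducer (a plain KafkaProducer), and the Alerting helper are all hypothetical names, not part of the Kafka API.

```java
// Fragment of the poll loop: route non-recoverable failures.
try {
    OrderService.saveOrder(record.value());
} catch (NonRecoverableException e) {
    if (e.isTolerable()) {
        // e.g. duplicate order: park the message on the DLQ for later analysis,
        // then let the normal offset commit skip past it
        dlqProducer.send(new ProducerRecord<>("digital_orders_dlq",
                record.key(), record.value()));
    } else {
        // poison pill: alert and stop consuming so no further messages are lost
        Alerting.page("Poison pill on digital_orders at offset " + record.offset());
        consumer.close();
        throw e;
    }
}
```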

Recoverable Errors

Recoverable errors are intermittent errors that are temporary and usually resolve themselves. For example, a call to the Payment Gateway might fail due to maintenance activity; the Kafka consumer has no control over this, and all it can do is keep retrying and hope the Payment Gateway recovers. This can be considered a recoverable error.

Handling Recoverable Errors

The following describes the handling of recoverable errors.


  1. Kafka Consumer message processing fails with a recoverable error (i.e. an error that may be resolved by retrying)
  2. Wait for some time before retrying the operation that resulted in the failure
  3. Keep retrying after a delay, potentially using exponential backoff (keep incrementing the delay on subsequent failures)
  4. After a certain number of retries, consider the message a failure and stop retrying (see the sketch below).
  5. Decide what to do with the failed message: if the failure can be tolerated, publish the message to a Dead Letter Topic (DLT/DLQ) for further investigation.
  6. If the failure cannot be tolerated, for example when messages are ordered and the failure of one message could cause subsequent messages to fail, stop processing and generate alerts.
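Steps 2 through 4 can be implemented as a simple retry loop with exponential backoff. Below is a minimal sketch; the maxRetries and initial delay values are purely illustrative, and processOrder and RecoverableException are hypothetical stand-ins for whatever operation failed (a DB write, a Payment Gateway call) and its error type.

```java
// Retry with exponential backoff: wait, retry, double the delay,
// and give up after a fixed number of attempts (values are illustrative).
void processWithRetry(ConsumerRecord<String, String> record)
        throws InterruptedException, RecoverableException {
    int maxRetries = 5;
    long delayMs = 1000; // initial backoff of 1 second
    for (int attempt = 1; ; attempt++) {
        try {
            processOrder(record.value());
            return; // success: stop retrying
        } catch (RecoverableException e) {
            if (attempt >= maxRetries) {
                throw e; // step 4: give up; the caller decides DLQ vs. stop
            }
            Thread.sleep(delayMs); // step 2: wait before retrying
            delayMs *= 2;          // step 3: exponential backoff
        }
    }
}
```

One caveat worth noting: sleeping like this inside the poll loop delays subsequent poll() calls. If the total retry time exceeds the consumer's max.poll.interval.ms setting, Kafka will consider the consumer dead and trigger a rebalance, so the retry budget should stay well below that limit.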

In some cases, determining whether a failure can be tolerated may not be straightforward and may require implementing a circuit breaker pattern. For example, if one call to write to the DB fails, it may be tolerable, but if 10 back-to-back calls fail, it may be a signal of some system-wide failure, and hence consumption of messages should be stopped.
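A minimal version of that idea is a consecutive-failure counter; the threshold of 10 mirrors the example above and is purely illustrative. Production systems would more likely reach for a library such as Resilience4j, but the core mechanism looks like this:

```java
// Naive circuit breaker: tolerate isolated failures, but open the circuit
// after too many consecutive ones (threshold is illustrative).
class ConsecutiveFailureBreaker {
    private final int threshold;
    private int consecutiveFailures = 0;

    ConsecutiveFailureBreaker(int threshold) { this.threshold = threshold; }

    void recordSuccess() { consecutiveFailures = 0; } // any success resets the count

    void recordFailure() { consecutiveFailures++; }

    // true when failures look systemic and consumption should pause
    boolean isOpen() { return consecutiveFailures >= threshold; }
}
```

In the consumer loop you would call recordFailure() on each error and, once isOpen() returns true, stop fetching, for example via consumer.pause(consumer.assignment()), until the downstream system is healthy again.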

Determining whether an error is recoverable or not can sometimes be tricky. A failing authorization request could mean the authorization server is temporarily down, in which case it could recover in time (recoverable), or the permissions may have been revoked or misconfigured, which may require the consumer to be reconfigured and redeployed (non-recoverable).
In this particular scenario, if you are making a REST API call, you may be able to differentiate the two by the HTTP response status code: treat 4XX status codes (e.g. 401 Unauthorized, 403 Forbidden) as non-recoverable errors and 5XX (e.g. 503 Service Unavailable) as recoverable.
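In code, that distinction might look like the small sketch below, classifying the status code of a downstream HTTP call into recoverable vs. non-recoverable:

```java
// Classify an HTTP status code from a downstream call (sketch).
// 4XX means the request itself is bad, so retrying won't help;
// 5XX suggests a server-side problem that may clear up on its own.
boolean isRecoverable(int statusCode) {
    if (statusCode >= 400 && statusCode < 500) {
        return false; // e.g. 401 Unauthorized, 403 Forbidden: non-recoverable
        // (real systems often special-case 429 Too Many Requests as recoverable)
    }
    if (statusCode >= 500) {
        return true;  // e.g. 503 Service Unavailable: retry with backoff
    }
    return false; // 2XX/3XX responses should not reach error handling
}
```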

Having covered the theoretical aspects, we will now move on to practical implementation. In the upcoming two parts of this series, we will explore various error handling scenarios with hands-on examples, addressing both recoverable and non-recoverable errors.